Projects
Reliable Critics: Monotonic Improvement and Convergence Guarantees for Reinforcement Learning
May 2025 – Present | Advisors: Prof. Gugan Thoppe and Prof. Aditya Gopalan
This project builds upon Reliable Critics, a reinforcement learning framework that ensures stable, monotonic policy improvement by addressing the unreliability of critic estimates in actor-critic methods. By using lower confidence bounds on the critic and restricting policy updates to regions where the critic is accurate, the method guarantees safe, theoretically sound improvements in the spirit of Conservative Policy Iteration. Because it remains compatible with deep function approximation, Reliable Critics outperforms PPO and SAC on continuous control tasks, especially under high noise and approximation error.
Building on this foundation, this project aims to extend the Reliable Policy Iteration (RPI) framework to multiple state-of-the-art algorithms such as PPO, TD3, and DDPG, and to benchmark them on diverse environments including Atari, MuJoCo, and MiniGrid. We designed a novel plug-and-play RPI-based loss function that integrates with existing deep RL algorithms, and we conducted extensive experiments and ablation studies on the sparse-reward MiniGrid environment to establish new baselines and advance the state of the art in reliable deep reinforcement learning.
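As a rough illustration of the plug-and-play idea, the sketch below shows how a lower-confidence-bound critic could gate an actor update so the policy is only pushed toward actions the critics agree are good. The ensemble size, penalty coefficient, and class names are illustrative assumptions, not the exact RPI loss.

```python
# Illustrative sketch (not the exact RPI formulation): an actor loss driven by
# a lower confidence bound (LCB) over an ensemble of critics.
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

class LCBActorLoss(nn.Module):
    """Plug-and-play actor loss: maximize mean_Q - kappa * std_Q."""
    def __init__(self, obs_dim, act_dim, n_critics=5, kappa=1.0):
        super().__init__()
        self.critics = nn.ModuleList(
            [mlp(obs_dim + act_dim, 1) for _ in range(n_critics)])
        self.kappa = kappa

    def forward(self, obs, actions):
        q_in = torch.cat([obs, actions], dim=-1)
        qs = torch.stack([c(q_in) for c in self.critics], dim=0)  # (n, B, 1)
        lcb = qs.mean(dim=0) - self.kappa * qs.std(dim=0)         # pessimistic value
        return -lcb.mean()  # minimized by the actor optimizer

# Usage sketch: loss = LCBActorLoss(obs_dim=17, act_dim=6)(obs, actor(obs)); loss.backward()
```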
Feed-Forward Deblurring in 3DGS
May 2025 – Present | Advisor: Prof. Venkatesh Babu
This project proposes a generalizable, scene-agnostic deblurring framework for 3D Gaussian Splatting (3DGS) pipelines, addressing a key limitation in current 3DGS-based rendering systems. While 3DGS has emerged as a powerful representation for real-time, photorealistic 3D scene rendering, its visual fidelity degrades significantly when input images are blurry. Existing deblurring techniques for 3DGS are predominantly scene-specific, requiring per-scene optimization or fine-tuning, which severely limits scalability and generalization.
To overcome this, we aim to develop a lightweight feed-forward deblurring module that can be plugged directly into foundation 3DGS models such as NoPoSplat and Dust3R, without any scene-specific retraining or per-scene supervision. The goal is to enable fast and scalable deblurring of point cloud representations generated from blurry image sets, without sacrificing real-time performance or rendering quality.
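A minimal sketch of the intended plug-in design: a lightweight feed-forward deblurring network applied to input views before they reach a frozen 3DGS foundation model. The architecture, module name, and `gs_model` interface are placeholder assumptions, not the project's final design.

```python
# Residual CNN that predicts a sharpening correction for each blurry view,
# prepended to a frozen feed-forward 3DGS backbone (no per-scene optimization).
import torch
import torch.nn as nn

class DeblurModule(nn.Module):
    def __init__(self, channels=3, width=32, depth=4):
        super().__init__()
        layers = [nn.Conv2d(channels, width, 3, padding=1), nn.ReLU()]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(width, width, 3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(width, channels, 3, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, blurry_views):                  # (B, V, 3, H, W)
        b, v, c, h, w = blurry_views.shape
        x = blurry_views.view(b * v, c, h, w)
        return (x + self.net(x)).view(b, v, c, h, w)  # residual deblurring

# Usage sketch (gs_model is any frozen feed-forward 3DGS backbone):
#   sharp_views = deblur(blurry_views)
#   gaussians = gs_model(sharp_views)
```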
Towards Uncertainty-aware Alignment
Jan 2025 – April 2025 | Advisor: Prof. Aditya Gopalan
This project investigates and addresses a key limitation in preference-based alignment of LLMs: the instability of reward models trained on human feedback. In standard RLHF pipelines, a policy is optimized against a reward model trained on preference data. However, we empirically show that reward models trained on the same dataset can produce inconsistent outputs, leading to overfitting and degraded policy performance.
To understand this phenomenon, we develop a theoretical model showing that variance in reward model estimates can lead to unsafe policy updates, increasing the likelihood of performance regressions. In response, we propose a variance-aware policy optimization framework, which incorporates uncertainty estimates from the reward model into the policy training objective. The framework introduces a regularization term that penalizes updates in regions of high reward variance, making the policy learning process more robust.
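A minimal sketch of the variance-aware idea, assuming an ensemble of reward models as the uncertainty estimator; the penalty form and the coefficient `beta` are illustrative, not the exact objective from the paper.

```python
# Variance-penalized reward signal for RLHF-style policy optimization: an
# ensemble of reward models supplies both a score and a disagreement estimate,
# and the policy is trained on the pessimistic (penalized) reward.
import torch

def variance_penalized_reward(reward_models, prompts, responses, beta=0.5):
    """Mean ensemble reward minus a penalty on reward-model disagreement."""
    with torch.no_grad():
        scores = torch.stack([rm(prompts, responses) for rm in reward_models])  # (M, B)
    mean_r = scores.mean(dim=0)
    var_r = scores.var(dim=0)
    return mean_r - beta * var_r  # discourages updates where reward models disagree

# Usage sketch: plug the penalized reward into any policy-gradient objective,
#   r = variance_penalized_reward(rm_ensemble, prompts, responses)
#   loss = -(logprobs * (r - r.mean())).mean()
```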
We provide theoretical guarantees showing that this approach reduces the risk of producing worse policies than the baseline. Experiments across various LLMs and reward model setups confirm that our method significantly improves alignment stability and generalization compared to standard variance-unaware pipelines. This work is currently under review at NeurIPS 2025.
HinglishEval: Evaluating the Effectiveness of Code-generation Models on Hinglish Prompts
Jan 2024 – June 2024 | Advisor: Prof. Viraj Kumar
Code-generation models are large language models (LLMs) fine-tuned to generate code from natural-language prompts. Prior work shows that such models can democratize programming by allowing novice programmers to generate accurate code for simple coding tasks from clear English-language prompts. In this project, we explored whether this democratization can extend to novice programmers who lack proficiency in English but can craft clear prompts in another language. Specifically, we considered prompts in Hinglish, a mixture of Hindi and English that many students in India are comfortable with. We made two contributions. First, we proposed a semi-automated technique to translate English prompts into Hinglish and used it to create HinglishEval, a Hinglish translation of the widely used code-generation benchmark HumanEval. Second, we compared the performance of several popular open- and closed-source code-generation models on Hinglish and English prompts. Our findings suggest that although code-generation models are generally more effective at generating accurate code for English prompts, their efficacy with Hinglish prompts is also promising.
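A simplified sketch of the kind of comparison involved, assuming a HumanEval-style task format with `prompt`, `test`, and `entry_point` fields and a generic `generate_code` callable; this is an illustration of pass@1 scoring, not the actual HinglishEval harness.

```python
# Single-sample (pass@1-style) scoring over HumanEval-format tasks, used here
# to sketch how the same model can be compared on English vs. Hinglish prompts.
import json

def run_tests(candidate_code: str, test_code: str, entry_point: str) -> bool:
    """Execute generated code against the task's unit tests (trusted tasks only)."""
    env = {}
    try:
        exec(candidate_code, env)        # define the candidate solution
        exec(test_code, env)             # defines check(candidate)
        env["check"](env[entry_point])
        return True
    except Exception:
        return False

def pass_at_1(tasks_path: str, generate_code) -> float:
    """Fraction of tasks solved with one generated sample per prompt."""
    with open(tasks_path) as f:
        tasks = [json.loads(line) for line in f]
    solved = sum(
        run_tests(generate_code(t["prompt"]), t["test"], t["entry_point"])
        for t in tasks)
    return solved / len(tasks)

# Usage sketch: pass_at_1("HumanEval.jsonl", model_fn) vs pass_at_1("HinglishEval.jsonl", model_fn)
```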
This work resulted in the publication of the paper "Evaluating the Effectiveness of Code-Generation Models on Hinglish Prompts".