ResearchπŸ’‘

Learning Robust Optimal Policy from Strongly Correlated and Corrupted Observations
(ICML 2026, NeurIPS 2025, IEEE CDC 2024)

Recently, there has been a surge of interest in analyzing the non-asymptotic behavior of model-free reinforcement learning algorithms. However, the performance of such algorithms in non-ideal environments, such as in the presence of corrupted rewards, is poorly understood. Motivated by this gap, we investigate the robustness of the celebrated Q-learning algorithm to a strong-contamination attack model, where an adversary can arbitrarily perturb a small fraction of the observed rewards. We start by proving that such an attack can cause the vanilla Q-learning algorithm to incur arbitrarily large errors. We then develop a novel, robust Q-learning algorithm that uses historical reward data to construct a robust empirical Bellman operator at each time step, without requiring knowledge of the true reward statistics. Finally, we prove finite-time convergence rates that match known state-of-the-art bounds up to a small inevitable error term that scales with the adversarial corruption fraction. We further prove an information-theoretic lower bound, implying that dependence on the corruption fraction is inherent.
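The abstract above does not specify which robust estimator the algorithm uses, so the following is only an illustrative sketch: it pairs a standard tabular Q-learning update with a trimmed-mean estimate of the reward built from historical reward data for each state-action pair, which is one plausible way to make the empirical Bellman operator resistant to a small fraction of arbitrarily corrupted rewards. The function names (`trimmed_mean`, `robust_q_update`) and the trimming rule are my own choices, not the paper's.

```python
import numpy as np

def trimmed_mean(samples, eps):
    """Robust mean estimate: sort the samples and discard the eps-fraction
    largest and smallest values before averaging. A bounded fraction of
    arbitrary outliers then has limited influence on the estimate."""
    samples = np.sort(np.asarray(samples, dtype=float))
    k = int(np.ceil(eps * len(samples)))
    if len(samples) > 2 * k:
        samples = samples[k:len(samples) - k]
    return float(samples.mean())

def robust_q_update(Q, s, a, s_next, reward_history, alpha, gamma, eps):
    """One robustified Q-learning step: instead of the (possibly corrupted)
    instantaneous reward, use a trimmed-mean estimate built from the
    historical rewards observed for (s, a)."""
    r_hat = trimmed_mean(reward_history[(s, a)], eps)
    target = r_hat + gamma * Q[s_next].max()   # robust empirical Bellman target
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Toy demo: true reward is 1.0, but one of ten observations was
# adversarially replaced by 100.0; the trimmed mean discards it.
history = {(0, 0): [1.0] * 9 + [100.0]}
Q = np.zeros((2, 2))
Q = robust_q_update(Q, s=0, a=0, s_next=1,
                    reward_history=history, alpha=0.5, gamma=0.9, eps=0.1)
```

With a plain empirical mean, the single corrupted sample would inflate the reward estimate to roughly 10.9; the trimmed mean keeps it at 1.0, which is the kind of per-step robustness the corrupted-reward setting requires.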




Mathematical Guarantees and Fundamental Limits of Policy Evaluation under Adversarial Influences and Markovian Data (AISTATS 2025)

Provable Adversarial Robustness in Federated Multi-Agent Reinforcement Learning under Strong Communication Constraints (ACC 2026)

We consider a federated reinforcement learning setting involving M agents, all of whom interact with a common Markov Decision Process (MDP). The agents exchange information via a central server to learn the optimal value function. Our goal is to understand to what extent one can hope for collaborative sample-complexity speedups in such a setting, when a small fraction of the agents are adversarial and can act arbitrarily. To that end, we propose πšπš˜πš‹πšžπšœπš π™΅πšŽπš-πš€, a federated Q-learning algorithm that blends ideas from both model-based and model-free RL, along with the median-of-means device from robust statistics. We prove that, with high probability, πšπš˜πš‹πšžπšœπš π™΅πšŽπš-πš€ (i) guarantees exact convergence to the optimal value function in the limit of infinite samples, despite corruption, and (ii) enjoys near-optimal finite-time rates that benefit from collaboration. Perhaps surprisingly, our approach requires only a constant number of communication rounds to achieve each of the above guarantees, a feature of independent interest in FL, where communication is the major bottleneck.
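The median-of-means device mentioned above can be sketched in a few lines. This is not the paper's implementation, only a minimal illustration of the principle: the server partitions the per-agent statistics into groups, averages within each group, and takes the median of the group means, so a small fraction of adversarial agents can corrupt at most a few groups without moving the median. The aggregation of per-agent TD targets shown in the demo is a hypothetical use, chosen for concreteness.

```python
import numpy as np

def median_of_means(values, num_groups):
    """Median-of-means estimator: split the values into num_groups groups,
    average within each group, and return the median of the group means.
    Robust to a small fraction of arbitrarily corrupted entries, since
    each corrupted value can spoil at most one group mean."""
    values = np.asarray(values, dtype=float)
    groups = np.array_split(values, num_groups)
    return float(np.median([g.mean() for g in groups]))

# Toy demo: 9 agents report a local TD target for some (s, a); 8 honest
# agents report 2.0, one adversarial agent reports 1e6. With 5 groups,
# the corrupted report poisons only one group mean, and the median of
# the group means still recovers the honest value.
reports = [2.0] * 8 + [1e6]
aggregate = median_of_means(reports, num_groups=5)
```

A plain average of the same reports would be pulled to over 100,000 by the single adversarial agent; the median-of-means aggregate stays at the honest value 2.0, which is what lets a federated server retain collaborative speedups despite a corrupted minority.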


β–  πšπš˜πš‹πšžπšœπš π™΅πšŽπš-πš€: Robust Federated Q-Learning with Almost No Communication. [Accepted at the 2026 American Control Conference ACC 2026] [Project]

β–  πš…πšπ™³πš€: Variance Reduced Q-Learning over time-varying Networks. [Accepted at the 2026 American Control Conference ACC 2026] [Project]