Our aim is to learn about policy gradient methods for solving reinforcement learning (RL) problems modelled using the Markov decision process (MDP) framework with general (possibly continuous, possibly infinite-dimensional) state and action spaces. We will focus mainly on the theoretical convergence of mirror descent with direct parametrisation and of natural-gradient descent with log-linear parametrisation. For our purposes, solving an RL problem means finding a (nearly) optimal policy in a situation where the transition dynamics and costs are unknown but we can repeatedly interact with some system (or environment simulator).
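To fix ideas, the mirror descent update referred to above can be written in the following standard form (the notation here is generic, not taken from the talk): given the current policy \pi_k and its state-action value function Q^{\pi_k}, the KL-regularised update with step size \eta > 0 is

\pi_{k+1}(\cdot \mid s) = \operatorname*{argmin}_{\pi} \Big\{ \eta \, \langle Q^{\pi_k}(s,\cdot), \pi \rangle + \mathrm{KL}\big(\pi \,\|\, \pi_k(\cdot \mid s)\big) \Big\},

whose closed-form solution is the multiplicative update \pi_{k+1}(a \mid s) \propto \pi_k(a \mid s) \, e^{-\eta Q^{\pi_k}(s,a)} (minimisation, since we work with costs rather than rewards). It is a standard observation that, for softmax-type parametrisations, natural-gradient descent produces updates of the same multiplicative form, which is one reason the two methods are naturally analysed together.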
There are two main approaches to solving RL problems. Action-value methods learn the state-action value function (the Q-function) and then select actions based on it; their convergence is well understood, see Watkins and Dayan [1992] and [Sutton and Barto, 2018, Ch. 6], and will not be discussed here. Policy gradient methods directly update the policy by stepping in the direction of the gradient of the value function; they have a long history, for which the reader is referred to [Sutton and Barto, 2018, Ch. 13]. Their convergence is only understood in specific settings, as we will see below. The focus here is to cover generic (Polish) state and action spaces. We will touch upon the popular PPO algorithm of Schulman et al. [2017] and explain the difficulties that arise when trying to prove its convergence.
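For reference, the clipped surrogate objective that PPO maximises in Schulman et al. [2017] is

L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t \Big[ \min\big( r_t(\theta) \hat{A}_t, \; \operatorname{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \, \hat{A}_t \big) \Big], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},

where \hat{A}_t is an estimated advantage and \epsilon > 0 a clipping parameter. The clipping makes the objective non-smooth in \theta, which is already one evident obstacle for a convergence analysis.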
Many related and interesting questions will not be touched upon: convergence of actor-critic methods, convergence in the presence of Monte Carlo errors, regret, off-policy gradient methods, and near-continuous-time RL.
Large parts of what is presented here, in particular on mirror descent and natural-gradient descent, are from Kerimkulov et al. [2025]. That work was itself inspired by the recent results of Mei et al. [2021], Lan [2023] and Cayci et al. [2021].
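As a purely illustrative complement (not from Kerimkulov et al. [2025]; the toy MDP, step size and all names below are invented), here is a minimal tabular sketch of the exact mirror descent policy update with known dynamics, so that the closed-form multiplicative update above can be seen converging:

import numpy as np

# Tabular sketch of policy mirror descent with costs:
#   pi_{k+1}(a|s) proportional to pi_k(a|s) * exp(-eta * Q^{pi_k}(s, a)).
# The 2-state, 2-action MDP below is made up for illustration only.
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[s, a, s'] transition kernel
              [[0.5, 0.5], [0.1, 0.9]]])
c = np.array([[1.0, 0.5],                  # c[s, a] one-step costs
              [0.2, 2.0]])

def q_values(pi):
    # Exact Q^pi: solve (I - gamma P_pi) V = c_pi, then Q = c + gamma P V.
    P_pi = np.einsum('sa,sat->st', pi, P)
    c_pi = np.einsum('sa,sa->s', pi, c)
    V = np.linalg.solve(np.eye(len(c)) - gamma * P_pi, c_pi)
    return c + gamma * np.einsum('sat,t->sa', P, V)

pi = np.full((2, 2), 0.5)                  # start from the uniform policy
eta = 1.0                                  # mirror descent step size
for _ in range(50):
    pi = pi * np.exp(-eta * q_values(pi))  # multiplicative weights step
    pi /= pi.sum(axis=1, keepdims=True)    # renormalise each pi(.|s)

print(np.round(pi, 3))                     # near-deterministic cost-minimising policy

In practice the dynamics and costs are unknown and Q^{pi_k} must be estimated from interaction; the sketch uses exact quantities only to isolate the behaviour of the update itself.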

Further information

Time: Nov 7th 2025, 09:30 to 12:30

Venue: Enigma Room, The Alan Turing Institute

Speaker: David Siska (University of Edinburgh)

Series: Isaac Newton Institute Seminar Series