Duality in RL and the [*]-DICE Family (part 2) | TransferLab

The goal of reinforcement learning is to find an optimal policy. Two commonly used classes of methods are i) policy gradient methods and ii) dynamic-programming-based methods. In this talk I will describe an alternative solution strategy based on linear programming (LP) duality, specifically the duality between Q-values and state-action visitations.
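To make the duality between Q-values and state-action visitations concrete, here is a minimal sketch for policy evaluation on a tiny tabular MDP. All numbers, names and the MDP itself are hypothetical and chosen only for illustration. Instead of calling an LP solver, the sketch iterates the two fixed-point equations that the primal (Q-value) and dual (visitation) solutions satisfy, then checks that both give the same normalised policy value.

```python
# Hypothetical 2-state, 2-action MDP; every number below is made up for illustration.
GAMMA = 0.9
S, A = 2, 2
P = [[[0.8, 0.2], [0.1, 0.9]],   # P[s][a][s2]: transition probabilities
     [[0.5, 0.5], [0.0, 1.0]]]
R = [[1.0, 0.0], [0.0, 2.0]]     # R[s][a]: rewards
MU0 = [1.0, 0.0]                 # initial state distribution
PI = [[0.6, 0.4], [0.3, 0.7]]    # PI[s][a]: policy being evaluated


def q_values(n_iter=2000):
    """Primal variables: iterate Q = R + gamma * P^pi Q to its fixed point."""
    q = [[0.0] * A for _ in range(S)]
    for _ in range(n_iter):
        v = [sum(PI[s][a] * q[s][a] for a in range(A)) for s in range(S)]
        q = [[R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in range(S))
              for a in range(A)] for s in range(S)]
    return q


def visitations(n_iter=2000):
    """Dual variables: iterate the discounted flow equation for d(s, a)."""
    d = [[0.0] * A for _ in range(S)]
    for _ in range(n_iter):
        inflow = [sum(P[s][a][s2] * d[s][a] for s in range(S) for a in range(A))
                  for s2 in range(S)]
        d = [[PI[s2][a2] * ((1 - GAMMA) * MU0[s2] + GAMMA * inflow[s2])
              for a2 in range(A)] for s2 in range(S)]
    return d


def primal_value(q):
    """(1 - gamma) * E_{s0 ~ mu0, a0 ~ pi}[Q(s0, a0)]."""
    return (1 - GAMMA) * sum(MU0[s] * PI[s][a] * q[s][a]
                             for s in range(S) for a in range(A))


def dual_value(d):
    """E_{(s, a) ~ d}[r(s, a)]."""
    return sum(d[s][a] * R[s][a] for s in range(S) for a in range(A))


if __name__ == "__main__":
    print("primal (Q-value) estimate:   ", primal_value(q_values()))
    print("dual (visitation) estimate:  ", dual_value(visitations()))
```

The two printed numbers should agree up to floating-point error, which is the policy-evaluation instance of the duality: the value can be read off either from Q at the initial distribution or from the reward averaged under the visitation distribution.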

This talk addresses the offline setting, in which the agent learns from a fixed dataset without interacting with the environment. After recapitulating the LP duality in RL, I will introduce the Fenchel-Rockafellar duality, which is the backbone of a variety of recent methods for offline policy evaluation and offline optimisation. Two core difficulties of offline RL are the distribution shift in the data and the generalisation/extrapolation problem. We will see how regularising the primal variables enforces better generalisation, whereas regularising the dual variables alleviates distribution shift.
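For orientation, the Fenchel-Rockafellar duality mentioned above can be stated in its standard textbook form as follows. The symbols $f$, $g$, $A$, $x$ and $y$ are generic notation assumed here, not taken from the talk: $f$ and $g$ are convex, lower semicontinuous functions, $A$ is a linear operator with adjoint $A^{\top}$, and the equality holds under suitable regularity conditions.

```latex
% Convex (Fenchel) conjugate of f
f^{*}(y) \;=\; \sup_{x} \; \langle x, y \rangle - f(x)

% Fenchel-Rockafellar duality: primal problem on the left, dual on the right
\min_{x} \; f(x) + g(Ax)
\;=\;
\max_{y} \; -f^{*}\!\left(-A^{\top} y\right) - g^{*}(y)
```

Roughly speaking, in the RL setting one side of this identity is expressed in terms of Q-values and the other in terms of state-action visitations, and a regulariser added on one side appears on the other side through its conjugate.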

We review basic concepts of convex duality, focusing on the very general and supremely useful Fenchel-Rockafellar duality. We summarize how this duality may be applied to a variety of reinforcement learning (RL) settings, including policy evaluation or optimization, online or offline learning, and discounted or undiscounted rewards. The derivations yield a number of intriguing results, including …

References