Normalizing Flows for Policy Representation in Reinforcement Learning

In part 3 of this series on Normalizing Flows, we discuss how reinforcement learning can benefit from this class of models for policy representation:

Using Normalizing Flows to represent stochastic policies in reinforcement learning:

  • I assume that everyone has a basic understanding of RL. Therefore, I will only briefly review the fundamentals (state, action, environment, …)

  • The objective of RL is to maximize the return, usually defined as the expected (discounted) sum of per-time-step rewards,

    $$J(\pi) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$

  • Actions are sampled according to the current policy

    $$\pi(a_t | s_t)$$

  • We focus on stochastic policies. In this case, the policy is a (sometimes complicated) probability distribution over actions. Our RL algorithm searches for a parameterized policy that comes as close as possible to an optimal policy in terms of return. There are many ways to optimize that policy, e.g. with policy gradients.

  • Most RL papers nowadays represent the policy with a neural network that outputs the parameters of a fixed distribution family, e.g. the mean $\mu$ and standard deviation $\sigma$ of a normal distribution.

  • I present the idea of how normalizing flows can be used to represent RL policies.
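To make the standard approach concrete, here is a minimal sketch of a Gaussian policy combined with a REINFORCE-style gradient. All names (`W_mu`, `log_sigma`, `gaussian_policy`) are hypothetical illustrations, and the "network" is reduced to a single linear map; the point is only that the network outputs $\mu$ and $\sigma$, an action is sampled, and its log-probability drives the policy-gradient update.

```python
# Sketch of the standard approach: a tiny "network" (here a single
# linear map; all names hypothetical) outputs the mean and log-std of
# a Gaussian policy pi(a|s). Actions are sampled, and grad log pi(a|s)
# scaled by the return gives a REINFORCE-style gradient estimate.
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim = 4, 2

W_mu = rng.normal(scale=0.1, size=(action_dim, state_dim))
log_sigma = np.zeros(action_dim)  # state-independent log-std

def gaussian_policy(state):
    """Sample an action and return it with its log-probability."""
    mu = W_mu @ state
    sigma = np.exp(log_sigma)
    action = mu + sigma * rng.standard_normal(action_dim)
    # Diagonal-Gaussian log-density.
    log_prob = -0.5 * np.sum(((action - mu) / sigma) ** 2
                             + 2 * log_sigma + np.log(2 * np.pi))
    return action, log_prob

state = rng.standard_normal(state_dim)
action, log_prob = gaussian_policy(state)

# REINFORCE: gradient of log pi(a|s) w.r.t. W_mu, scaled by the
# (placeholder) return G of the trajectory.
G = 1.0
grad_W_mu = G * np.outer((action - W_mu @ state) / np.exp(2 * log_sigma),
                         state)
```

The limitation this illustrates: whatever the network computes, the resulting policy is always a diagonal Gaussian, i.e. unimodal and axis-aligned.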


In this series