
## AI Insight

An AI-extracted summary of this paper


# Meta-Gradient Reinforcement Learning with an Objective Discovered Online

NeurIPS 2020



Abstract

Deep reinforcement learning includes a broad family of algorithms that parameterise an internal representation, such as a value function or policy, by a deep neural network. Each algorithm optimises its parameters with respect to an objective, such as Q-learning or policy gradient, that defines its semantics. In this work, we propose an…


Introduction

- Recent advances in supervised and unsupervised learning have been driven by a transition from handcrafted expert features to deep representations [14]; these are typically learned by gradient descent on a suitable objective function to adjust a rich parametric function approximator.
- The authors applied the algorithm for online discovery of an off-policy learning objective to independent training runs on each of 57 classic Atari games.
- The authors describe the proposed algorithm for online learning of reinforcement learning objectives using meta-gradients.
- The authors train the meta-network using an end-to-end meta-gradient algorithm, so as to learn an update target that leads to good subsequent performance.
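
In outline, this end-to-end scheme nests two gradient descents: the agent parameters θ are updated on an inner loss defined by the meta-learned target, and the meta-parameters η are updated by differentiating a subsequent outer loss through those inner updates. As a hedged sketch (the step sizes α and β and the exact form of the losses are illustrative assumptions, not the paper's settings):

$$\theta_{k+1} = \theta_k - \alpha \,\nabla_{\theta_k} L^{\text{inner}}_{\eta}(\tau_k, \theta_k), \qquad k = i, \dots, i+M-1$$

$$\eta \leftarrow \eta - \beta \,\nabla_{\eta} L^{\text{outer}}\!\left(\tau', \theta_{i+M}(\eta)\right)$$

Writing θ_{i+M}(η) makes explicit that the updated agent parameters depend on η through the inner losses, so the outer gradient flows back into the meta-network.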

Highlights

- Recent advances in supervised and unsupervised learning have been driven by a transition from handcrafted expert features to deep representations [14]; these are typically learned by gradient descent on a suitable objective function to adjust a rich parametric function approximator
- Reinforcement learning (RL) has largely embraced the transition from handcrafting features to handcrafting objectives: deep function approximation has been successfully combined with ideas such as temporal difference (TD)-learning [29, 33], Q-learning [41, 22], double Q-learning [35, 36], n-step updates [31, 13], general value functions [32, 17], distributional value functions [7, 3], policy gradients [42, 20] and a variety of off-policy actor-critics [8, 10, 28]
- We proposed an algorithm that allows reinforcement learning (RL) agents to learn their own objective during online interactions with their environment
- The nature of the meta-network, and the objective of the RL algorithm, is discovered by meta-gradient descent over the sequence of updates based upon the discovered target
- Our results in toy domains demonstrate that FRODO can successfully discover how to address key issues in RL, such as bootstrapping and non-stationarity, through online adaptation of its objective
- Our results in Atari demonstrate that FRODO can successfully discover and adapt off-policy learning objectives that are distinct from, and perform better than, strong benchmark RL algorithms

Results

- A different way to learn the RL objective is to directly parameterise a loss by a meta-network [2, 19], rather than the target of a loss.
- On a sequence of trajectories {τ_i, ..., τ_{i+M}, τ_{i+M+1}}, the authors apply multiple steps of gradient-descent updates to the agent parameters θ according to the inner losses L^inner_η(τ_i, θ_i).
- The meta-gradient algorithm above can be applied to any differentiable component of the update rule, for example to learn the discount factor γ and bootstrapping factor λ [43], intrinsic rewards [46, 45], and auxiliary tasks [38].
- The authors apply meta-gradients to learn the meta-parameters of the update target gη online, where η are the parameters of a neural network.
- After M updates, the authors compute the outer loss L^outer on a validation trajectory τ as the squared difference between the predicted value and a canonical multi-step bootstrapped return G(τ), as used in classic RL (a toy sketch of this inner/outer loop follows this list).
- The meta-network receives the rewards R_t, the discounts γ_t, and, as in the motivating examples from Section 4, the values at future time-steps v(S_{t+1}), allowing it to bootstrap from the learned predictions.
- This allows the inner loss to potentially discover off-policy algorithms, by constructing suitable off-policy update targets for the policy and value function.
- The authors applied the FRODO algorithm to learn a target online, using an outer loss based on the actor-critic algorithm IMPALA [10] together with a consistency loss weighted by c = 0.1.
- In Figure 3a, the meta-gradient algorithm at first learned slowly, gradually discovering an effective objective.
- Over time, the meta-gradient algorithm learned to learn more rapidly, overtaking the actor-critic baseline and achieving significantly stronger final results.
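
The bullets above outline the inner/outer loop; below is a minimal, hedged sketch in JAX of how such a loop could be wired up. The linear value function, the toy meta-network g_eta, the synthetic trajectories, and the step sizes are illustrative stand-ins only, not the paper's architecture (which uses deep and recurrent networks, an IMPALA-based outer loss, and RMSProp).

```python
# Toy sketch of an online meta-gradient (FRODO-style) update; all components are stand-ins.
import jax
import jax.numpy as jnp

def value(theta, s):
    # Tiny linear value function v_theta(s); the paper uses deep networks.
    return jnp.dot(s, theta)

def meta_target(eta, r, gamma, v_next):
    # Toy meta-network g_eta: a learned blend of reward and bootstrapped value.
    w = jax.nn.sigmoid(eta)
    return r + gamma * w * v_next

def inner_loss(theta, eta, traj):
    # Squared error between predictions and the meta-learned target g_eta.
    s, s_next, r, gamma = traj
    v_next = jax.lax.stop_gradient(value(theta, s_next))
    return jnp.mean((value(theta, s) - meta_target(eta, r, gamma, v_next)) ** 2)

def outer_loss(eta, theta, trajs, val_traj, alpha=0.1):
    # M inner gradient steps on theta, differentiated through with respect to eta...
    for traj in trajs:
        theta = theta - alpha * jax.grad(inner_loss)(theta, eta, traj)
    # ...then score the updated theta against a canonical multi-step return G on a validation trajectory.
    s, G = val_traj
    return jnp.mean((value(theta, s) - G) ** 2)

# Synthetic data: 2-dimensional states, M = 3 inner trajectories, one validation trajectory.
theta = jnp.zeros(2)
eta = jnp.zeros(())
trajs = [(jnp.ones((4, 2)), jnp.ones((4, 2)), jnp.ones(4), 0.9 * jnp.ones(4)) for _ in range(3)]
val_traj = (jnp.ones((4, 2)), jnp.ones(4))

meta_grad = jax.grad(outer_loss)(eta, theta, trajs, val_traj)  # dL_outer / d_eta
eta = eta - 1e-3 * meta_grad                                   # one outer (meta) update
```

In the paper, the analogous loop runs online within a single lifetime of interaction, with the meta-network conditioning on rewards, discounts, and future values as described above.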

Conclusion

- The objective, the target used to update the policy and value function, is parameterised by a deep neural meta-network.
- The nature of the meta-network, and the objective of the RL algorithm, is discovered by meta-gradient descent over the sequence of updates based upon the discovered target.
- The authors' results in Atari demonstrate that FRODO can successfully discover and adapt off-policy learning objectives that are distinct from, and perform better than, strong benchmark RL algorithms.
- These examples illustrate the generality of the proposed method, and suggest its potential both to recover existing concepts and to discover new concepts for RL algorithms.

- Table 1: Detailed hyper-parameters for Atari experiments

Related work

- The idea of learning to learn by gradient descent has a long history. In supervised learning, IDBD and SMD [30, 27] used a meta-gradient approach to adapt the learning rate online so as to optimise future performance. “Learning to learn by gradient descent by gradient descent” [1] used meta-gradients, offline and over multiple lifetimes, to learn a gradient-based optimiser, parameterised by a “black-box” neural network. MAML [11] and REPTILE [23] also use meta-gradients, offline and over multiple lifetimes, to learn initial parameters that can be optimised more efficiently.

In reinforcement learning, methods such as meta reinforcement learning [39] and RL2 [9] allow a recurrent network to jointly represent, in its activations, both the agent’s representation of state and also its internal parameters. Xu et al. [43] introduced meta-gradients as a general but efficient approach for optimising the meta-parameters of gradient-based RL agents. This approach has since been applied to many different meta-parameters of RL algorithms, such as the discount γ and bootstrapping parameter λ [43], intrinsic rewards [46, 45], auxiliary tasks [38], off-policy corrections [44], and to parameterise returns as a linear combination of rewards [40] (without any bootstrapping). The meta-gradient approach has also been applied, offline and over multiple lifetimes, to black-box parameterisations, via deep neural networks, of the entire RL algorithm [2, 19, 24] (the latter contemporaneous work); evolutionary approaches have also been applied [16].

References

- M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in neural information processing systems, pages 3981–3989, 2016.
- S. Bechtle, A. Molchanov, Y. Chebotar, E. Grefenstette, L. Righetti, G. Sukhatme, and F. Meier. Meta-learning via learned loss. arXiv preprint arXiv:1906.05374, 2019.
- M. G. Bellemare, W. Dabney, and R. Munos. A distributional perspective on reinforcement learning. In ICML, 2017.
- M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
- J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, and S. Wanderman-Milne. JAX: composable transformations of Python+NumPy programs, 2018.
- D. Budden, M. Hessel, J. Quan, and S. Kapturowski. RLax: Reinforcement Learning in JAX, 2020.
- K.-J. Chung and M. J. Sobel. Discounted mdp’s: distribution functions and exponential utility maximization. SIAM Journal on Control and Optimization, 1987.
- T. Degris, M. White, and R. S. Sutton. Off-policy actor-critic. In ICML, 2012.
- Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel. Rl2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
- L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu. IMPALA: Scalable distributed deep-rl with importance weighted actor-learner architectures. In ICML, 2018.
- C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135. JMLR. org, 2017.
- T. Hennigan, T. Cai, T. Norman, and I. Babuschkin. Haiku: Sonnet for JAX, 2020.
- M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Comput., 18(7):1527–1554, July 2006.
- S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- R. Houthooft, Y. Chen, P. Isola, B. Stadie, F. Wolski, O. J. Ho, and P. Abbeel. Evolved policy gradients. In Advances in Neural Information Processing Systems, pages 5400–5409, 2018.
- M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. ICLR, 2016.
- N. P. Jouppi, C. Young, N. Patil, D. A. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, et al. In-Datacenter performance analysis of a Tensor Processing Unit. ISCA, 2017.
- L. Kirsch, S. van Steenkiste, and J. Schmidhuber. Improving generalization in meta reinforcement learning using learned objectives. ICLR, 2020.
- V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
- V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In Advances in neural information processing systems, pages 2204–2212, 2014.
- V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, Feb. 2015.
- A. Nichol, J. Achiam, and J. Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
- J. Oh, M. Hessel, W. Czarnecki, Z. Xu, H. van Hasselt, S. Singh, and D. Silver. Discovering reinforcement learning algorithms. arXiv preprint, 2020.
- M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., USA, 1st edition, 1994.
- G. A. Rummery and M. Niranjan. On-line Q-learning using connectionist systems, volume 37. University of Cambridge, Department of Engineering Cambridge, UK, 1994.
- N. N. Schraudolph. Local gain adaptation in stochastic gradient descent. ICANN, 1999.
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- R. S. Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
- R. S. Sutton. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In AAAI, pages 171–176, 1992.
- R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
- R. S. Sutton, J. Modayil, M. Delp, T. Degris, P. M. Pilarski, A. White, and D. Precup. Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In L. Sonenberg, P. Stone, K. Tumer, and P. Yolum, editors, AAMAS, pages 761–768. IFAAMAS, 2011.
- G. Tesauro. Temporal difference learning and TD-Gammon. Commun. ACM, 38(3):58–68, Mar. 1995.
- T. Tieleman and G. Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
- H. van Hasselt. Double Q-learning. In Advances in neural information processing systems, pages 2613–2621, 2010.
- H. van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. In AAAI, 2016.
- H. van Seijen, H. van Hasselt, S. Whiteson, and M. Wiering. A theoretical and empirical analysis of Expected Sarsa. In IEEE symposium on adaptive dynamic programming and reinforcement learning, pages 177–184. IEEE, 2009.
- V. Veeriah, M. Hessel, Z. Xu, J. Rajendran, R. L. Lewis, J. Oh, H. P. van Hasselt, D. Silver, and S. Singh. Discovery of useful questions as auxiliary tasks. In Advances in Neural Information Processing Systems, pages 9306–9317, 2019.
- J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
- Y. Wang, Q. Ye, and T.-Y. Liu. Beyond exponentially discounted sum: Automatic learning of return function. arXiv preprint arXiv:1905.11591, 2019.
- C. J. Watkins and P. Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.
- R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
- Z. Xu, H. P. van Hasselt, and D. Silver. Meta-gradient reinforcement learning. In Advances in neural information processing systems, pages 2396–2407, 2018.
- T. Zahavy, Z. Xu, V. Veeriah, M. Hessel, J. Oh, H. van Hasselt, D. Silver, and S. Singh. Self-tuning deep reinforcement learning. arXiv preprint arXiv:2002.12928, 2020.
- Z. Zheng, J. Oh, M. Hessel, Z. Xu, M. Kroiss, H. van Hasselt, D. Silver, and S. Singh. What can learned intrinsic rewards capture? ICML, 2020.
- Z. Zheng, J. Oh, and S. Singh. On learning intrinsic rewards for policy gradient methods. In Advances in Neural Information Processing Systems, pages 4644–4654, 2018.
