LiFE: Deep Exploration via Linear-Feature Bonus in Continuous Control

Authors: Jiantao Qiu and Yu Wang
Department of Electronic Engineering and Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing 100084, China

Abstract

Reinforcement Learning (RL) algorithms work well when rewards are well defined, but they fail under sparse or deceptive rewards and require additional exploration strategies. This work introduces a deep exploration method based on an Upper Confidence Bound (UCB) bonus. The proposed method can be plugged into actor-critic algorithms that use a deep neural network as the critic. Building on the regret bound derived under a linear Markov decision process approximation, we use the feature matrix to compute a UCB bonus for deep exploration. The proposed method reduces to count-based exploration in special cases and applies to general settings. Our method uses the last d-dimensional feature vector in the critic network and is easy to deploy. We design a simple task, "swim", to demonstrate how the proposed method achieves exploration in sparse/deceptive reward environments. We then perform an empirical evaluation on sparse/deceptive-reward versions of OpenAI Gym environments and on Ackermann robot control tasks. The evaluation results verify that the proposed algorithm performs effective deep exploration in sparse/deceptive reward tasks.

Keywords: Neural Network (NN), Reinforcement Learning (RL), Upper Confidence Bound (UCB)
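
As an illustration of the kind of bonus the abstract describes, the snippet below is a minimal sketch, not the authors' implementation: the d-dimensional feature vector taken from the critic is accumulated into a regularized Gram matrix Lambda, and the exploration bonus for a state-action pair is the elliptical term beta * sqrt(phi^T Lambda^{-1} phi). The class name LinearFeatureBonus, the scale beta, the regularizer lam, and the random feature vectors in the usage loop are illustrative assumptions rather than details from the paper.

import numpy as np

class LinearFeatureBonus:
    def __init__(self, feature_dim, beta=1.0, lam=1.0):
        self.beta = beta
        # Lambda = lam * I + sum_i phi_i phi_i^T; only its inverse is stored.
        self.cov_inv = np.eye(feature_dim) / lam

    def update(self, phi):
        # Rank-one Sherman-Morrison update of Lambda^{-1} with a new feature vector.
        v = self.cov_inv @ phi
        self.cov_inv -= np.outer(v, v) / (1.0 + phi @ v)

    def bonus(self, phi):
        # UCB-style exploration bonus: beta * sqrt(phi^T Lambda^{-1} phi).
        return self.beta * float(np.sqrt(phi @ self.cov_inv @ phi))

# If phi(s, a) is a one-hot indicator of a discrete state-action pair, the bonus
# reduces to beta / sqrt(lam + N(s, a)), i.e., a count-based bonus, which matches
# the special-case equivalence stated in the abstract.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ucb = LinearFeatureBonus(feature_dim=8, beta=0.5)
    for _ in range(100):
        phi = rng.normal(size=8)          # stand-in for the critic's last-layer feature vector
        shaped_reward = ucb.bonus(phi)    # added to the environment reward before the critic update
        ucb.update(phi)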


Publication history

Received: 04 July 2021
Accepted: 23 August 2021
Published: 21 July 2022
Issue date: February 2023

Copyright

© The author(s) 2023.

Acknowledgements

This research was supported by Tsinghua University–Meituan Joint Institute for Digital Life, Tsinghua EE Xilinx AI Research Fund, Tsinghua EE Independent Research Project, Beijing National Research Center for Information Science and Technology (BNRist), and Beijing Innovation Center for Future Chips.

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
