LiFE: Deep Exploration via Linear-Feature Bonus in Continuous Control

Authors: Jiantao Qiu and Yu Wang
Department of Electronic Engineering and Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing 100084, China

Abstract

Reinforcement Learning (RL) algorithms work well when rewards are well defined, but they fail under sparse or deceptive rewards and require additional exploration strategies. This work introduces a deep exploration method based on an Upper Confidence Bound (UCB) bonus. The proposed method can be plugged into actor-critic algorithms that use a deep neural network as the critic. Building on the regret bound derived under a linear Markov decision process approximation, we use the feature matrix to compute a UCB bonus for deep exploration. The proposed method reduces to count-based exploration in special cases and applies to general settings. Our method uses the last d-dimensional feature vector in the critic network and is easy to deploy. We design a simple task, "swim", to demonstrate how the proposed method achieves exploration in sparse/deceptive reward environments. We then perform an empirical evaluation on sparse/deceptive-reward versions of OpenAI Gym environments and on Ackermann robot control tasks. The evaluation results verify that the proposed algorithm performs effective deep exploration in sparse/deceptive reward tasks.

Keywords: Neural Network (NN), Reinforcement Learning (RL), Upper Confidence Bound (UCB)
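
As an illustration of the kind of bonus the abstract describes, the snippet below is a minimal sketch, not the authors' implementation: the d-dimensional feature vector taken from the critic is accumulated into a regularized Gram matrix Lambda, and the exploration bonus for a state-action pair is the elliptical term beta * sqrt(phi^T Lambda^{-1} phi). The class name LinearFeatureBonus, the scale beta, the regularizer lam, and the random feature vectors in the usage loop are illustrative assumptions rather than details from the paper.

import numpy as np

class LinearFeatureBonus:
    def __init__(self, feature_dim, beta=1.0, lam=1.0):
        self.beta = beta
        # Lambda = lam * I + sum_i phi_i phi_i^T; only its inverse is stored.
        self.cov_inv = np.eye(feature_dim) / lam

    def update(self, phi):
        # Rank-one Sherman-Morrison update of Lambda^{-1} with a new feature vector.
        v = self.cov_inv @ phi
        self.cov_inv -= np.outer(v, v) / (1.0 + phi @ v)

    def bonus(self, phi):
        # UCB-style exploration bonus: beta * sqrt(phi^T Lambda^{-1} phi).
        return self.beta * float(np.sqrt(phi @ self.cov_inv @ phi))

# If phi(s, a) is a one-hot indicator of a discrete state-action pair, the bonus
# reduces to beta / sqrt(lam + N(s, a)), i.e., a count-based bonus, which matches
# the special-case equivalence stated in the abstract.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ucb = LinearFeatureBonus(feature_dim=8, beta=0.5)
    for _ in range(100):
        phi = rng.normal(size=8)          # stand-in for the critic's last-layer feature vector
        shaped_reward = ucb.bonus(phi)    # added to the environment reward before the critic update
        ucb.update(phi)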


Publication history

Received: 04 July 2021
Accepted: 23 August 2021
Published: 21 July 2022
Issue date: February 2023

Copyright

© The author(s) 2023.

Acknowledgements

This research was supported by Tsinghua University–Meituan Joint Institute for Digital Life, Tsinghua EE Xilinx AI Research Fund, Tsinghua EE Independent Research Project, Beijing National Research Center for Information Science and Technology (BNRist), and Beijing Innovation Center for Future Chips.

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
