OSCAR: OOD State-Conservative Offline Reinforcement Learning for Sequential Decision Making

Yi Ma1, Chao Wang2, Chen Chen3, Jinyi Liu1, Zhaopeng Meng1, Yan Zheng1, and Jianye Hao1,4 (corresponding author)
1 College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
2 Lab for High Technology, Tsinghua University, Beijing 100084, China
3 Department of Automation, Tsinghua University, Beijing 100084, China
4 Noah’s Ark Lab, Huawei Technologies Co., Ltd., Beijing 100084, China

Abstract

Offline reinforcement learning (RL) is a data-driven learning paradigm for sequential decision making. At the core of offline RL lies mitigating the overestimation of values at out-of-distribution (OOD) states, which is induced by the distribution shift between the learning policy and the previously collected offline dataset. To tackle this problem, some methods underestimate the values of states produced by learned dynamics models, or of state-action pairs whose actions are sampled from policies other than the behavior policy. However, since these generated states or state-action pairs are not guaranteed to be OOD, staying conservative on them may adversely affect the in-distribution ones. In this paper, we propose an OOD state-conservative offline RL method (OSCAR) that addresses this limitation by explicitly generating reliable OOD states located near the manifold of the offline dataset, and by designing a conservative policy evaluation approach that combines the vanilla Bellman error with a regularization term that underestimates the values of only these generated OOD states. In this way, we prevent the value errors of OOD states from propagating to in-distribution states through value bootstrapping and policy improvement. We also theoretically prove that the proposed conservative policy evaluation approach is guaranteed to underestimate the values of OOD states. OSCAR and several strong baselines are evaluated on the offline decision-making benchmark D4RL and the autonomous driving benchmark SMARTS. Experimental results show that OSCAR outperforms the baselines on a large portion of the benchmarks and attains the highest average return, substantially outperforming existing offline RL methods.
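
To make the evaluation step concrete, the sketch below illustrates an OSCAR-style conservative critic update in PyTorch: the vanilla Bellman error on dataset transitions plus a regularization term that pushes down the Q-values of generated OOD states only. This is a minimal sketch, not the authors' implementation; the OOD-state generator (a Gaussian perturbation of dataset states standing in for the paper's explicit generator), the network sizes, and the penalty weight alpha are illustrative assumptions.

import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

def conservative_critic_loss(critic, target_critic, policy, batch,
                             alpha=1.0, gamma=0.99, ood_noise=0.1):
    s, a, r, s_next, done = batch  # tensors sampled from the offline dataset

    # Vanilla Bellman error on in-distribution transitions.
    with torch.no_grad():
        a_next = policy(s_next)
        target_q = r + gamma * (1.0 - done) * target_critic(torch.cat([s_next, a_next], -1))
    q = critic(torch.cat([s, a], -1))
    bellman_loss = ((q - target_q) ** 2).mean()

    # Hypothetical OOD-state generator: perturb dataset states so they land
    # near, but off, the data manifold.  OSCAR uses an explicit generator;
    # Gaussian noise is only a placeholder here.
    s_ood = s + ood_noise * torch.randn_like(s)
    a_ood = policy(s_ood).detach()  # critic update only; no gradient to the policy
    q_ood = critic(torch.cat([s_ood, a_ood], -1))

    # Regularizer that underestimates values at the generated OOD states only;
    # the Bellman term on in-distribution data is left untouched.
    return bellman_loss + alpha * q_ood.mean()

# Minimal usage example with random stand-in data (state_dim=3, action_dim=2).
state_dim, action_dim, batch_size = 3, 2, 32
critic = MLP(state_dim + action_dim, 1)
target_critic = MLP(state_dim + action_dim, 1)
target_critic.load_state_dict(critic.state_dict())
policy = nn.Sequential(MLP(state_dim, action_dim), nn.Tanh())

batch = (torch.randn(batch_size, state_dim),
         torch.randn(batch_size, action_dim),
         torch.randn(batch_size, 1),
         torch.randn(batch_size, state_dim),
         torch.zeros(batch_size, 1))

loss = conservative_critic_loss(critic, target_critic, policy, batch)
loss.backward()

Because the penalty touches only the generated OOD states, their values are suppressed without distorting the Bellman updates on in-distribution transitions, which is what keeps OOD value errors from propagating back through bootstrapping and policy improvement.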

Keywords: decision making, offline reinforcement learning, out-of-distribution

Publication history

Received: 11 July 2023
Accepted: 31 August 2023
Published: 09 November 2023
Issue date: December 2023

Copyright

© The author(s) 2023.

Acknowledgements

This work was supported by the National Key R&D Program of China (No. 2022ZD0116402) and the National Natural Science Foundation of China (No. 62106172).

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
