OSCAR: OOD State-Conservative Offline Reinforcement Learning for Sequential Decision Making

Yi Ma1, Chao Wang2, Chen Chen3, Jinyi Liu1, Zhaopeng Meng1, Yan Zheng1, and Jianye Hao1,4 (corresponding author)
1 College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
2 Lab for High Technology, Tsinghua University, Beijing 100084, China
3 Department of Automation, Tsinghua University, Beijing 100084, China
4 Noah’s Ark Lab, Huawei Technologies Co., Ltd., Beijing 100084, China

Abstract

Offline reinforcement learning (RL) is a data-driven learning paradigm for sequential decision making. At the core of offline RL lies mitigating the overestimation of values at out-of-distribution (OOD) states, which is induced by the distribution shift between the learning policy and the previously collected offline dataset. To tackle this problem, some methods underestimate the values of states produced by learned dynamics models, or of state-action pairs whose actions are sampled from policies other than the behavior policy. However, since these generated states or state-action pairs are not guaranteed to be OOD, staying conservative on them may adversely affect the in-distribution ones. In this paper, we propose an OOD state-conservative offline RL method (OSCAR) that addresses this limitation by explicitly generating reliable OOD states located near the manifold of the offline dataset, and by designing a conservative policy evaluation approach that combines the vanilla Bellman error with a regularization term that underestimates the values of only these generated OOD states. In this way, we prevent the value errors of OOD states from propagating to in-distribution states through value bootstrapping and policy improvement. We also theoretically prove that the proposed conservative policy evaluation approach is guaranteed to underestimate the values of OOD states. OSCAR and several strong baselines are evaluated on the offline decision-making benchmark D4RL and the autonomous driving benchmark SMARTS. Experimental results show that OSCAR outperforms the baselines on a large portion of the benchmarks and attains the highest average return, substantially outperforming existing offline RL methods.
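
To make the evaluation step concrete, the sketch below illustrates an OSCAR-style conservative critic update in PyTorch: the vanilla Bellman error on dataset transitions plus a regularization term that pushes down the Q-values of generated OOD states only. This is a minimal sketch, not the authors' implementation; the OOD-state generator (a Gaussian perturbation of dataset states standing in for the paper's explicit generator), the network sizes, and the penalty weight alpha are illustrative assumptions.

import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

def conservative_critic_loss(critic, target_critic, policy, batch,
                             alpha=1.0, gamma=0.99, ood_noise=0.1):
    s, a, r, s_next, done = batch  # tensors sampled from the offline dataset

    # Vanilla Bellman error on in-distribution transitions.
    with torch.no_grad():
        a_next = policy(s_next)
        target_q = r + gamma * (1.0 - done) * target_critic(torch.cat([s_next, a_next], -1))
    q = critic(torch.cat([s, a], -1))
    bellman_loss = ((q - target_q) ** 2).mean()

    # Hypothetical OOD-state generator: perturb dataset states so they land
    # near, but off, the data manifold.  OSCAR uses an explicit generator;
    # Gaussian noise is only a placeholder here.
    s_ood = s + ood_noise * torch.randn_like(s)
    a_ood = policy(s_ood).detach()  # critic update only; no gradient to the policy
    q_ood = critic(torch.cat([s_ood, a_ood], -1))

    # Regularizer that underestimates values at the generated OOD states only;
    # the Bellman term on in-distribution data is left untouched.
    return bellman_loss + alpha * q_ood.mean()

# Minimal usage example with random stand-in data (state_dim=3, action_dim=2).
state_dim, action_dim, batch_size = 3, 2, 32
critic = MLP(state_dim + action_dim, 1)
target_critic = MLP(state_dim + action_dim, 1)
target_critic.load_state_dict(critic.state_dict())
policy = nn.Sequential(MLP(state_dim, action_dim), nn.Tanh())

batch = (torch.randn(batch_size, state_dim),
         torch.randn(batch_size, action_dim),
         torch.randn(batch_size, 1),
         torch.randn(batch_size, state_dim),
         torch.zeros(batch_size, 1))

loss = conservative_critic_loss(critic, target_critic, policy, batch)
loss.backward()

Because the penalty touches only the generated OOD states, their values are suppressed without distorting the Bellman updates on in-distribution transitions, which is what keeps OOD value errors from propagating back through bootstrapping and policy improvement.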

Keywords: decision making, offline reinforcement learning, out-of-distribution

Publication history

Received: 11 July 2023
Accepted: 31 August 2023
Published: 09 November 2023
Issue date: December 2023

Copyright

© The author(s) 2023.

Acknowledgements

This work was supported by the National Key R&D Program of China (No. 2022ZD0116402) and the National Natural Science Foundation of China (No. 62106172).

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
