Offline reinforcement learning (RL) is a data-driven learning paradigm for sequential decision making. Mitigating the overestimation of values originating from out-of-distribution (OOD) states, induced by the distribution shift between the learning policy and the previously collected offline dataset, lies at the core of offline RL. To tackle this problem, some methods underestimate the values of states generated by learned dynamics models, or of state-action pairs whose actions are sampled from policies other than the behavior policy. However, since these generated states or state-action pairs are not guaranteed to be OOD, staying conservative on them may adversely affect the in-distribution ones. In this paper, we propose an OOD state-conservative offline RL method (OSCAR), which addresses this limitation by explicitly generating reliable OOD states located near the manifold of the offline dataset and then designing a conservative policy evaluation approach that combines the vanilla Bellman error with a regularization term that underestimates the values of only these generated OOD states. In this way, we prevent the value errors of OOD states from propagating to in-distribution states through value bootstrapping and policy improvement. We also theoretically prove that the proposed conservative policy evaluation approach is guaranteed to underestimate the values of OOD states. OSCAR, along with several strong baselines, is evaluated on the offline decision-making benchmark D4RL and the autonomous driving benchmark SMARTS. Experimental results show that OSCAR outperforms the baselines on a large portion of the benchmark tasks and attains the highest average return.
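The abstract describes a critic loss that augments the standard Bellman error with a value penalty applied only to generated OOD states. Below is a minimal sketch of such a loss, not the authors' implementation; the names `generate_ood_states`, `beta`, and the network interfaces are illustrative assumptions.

```python
# Sketch of a conservative policy-evaluation loss in the spirit of OSCAR:
# Bellman error on dataset transitions plus a penalty that pushes down
# Q-values only at explicitly generated OOD states.
import torch
import torch.nn.functional as F


def conservative_critic_loss(q_net, q_target, policy, batch,
                             generate_ood_states, beta=1.0, gamma=0.99):
    s, a, r, s_next, done = batch  # tensors sampled from the offline dataset

    # Standard Bellman error on in-distribution transitions.
    with torch.no_grad():
        a_next = policy(s_next)
        target = r + gamma * (1.0 - done) * q_target(s_next, a_next)
    bellman_loss = F.mse_loss(q_net(s, a), target)

    # Penalize values only at generated OOD states (assumed to lie near the
    # data manifold), evaluated under the current policy's actions.
    s_ood = generate_ood_states(s)
    a_ood = policy(s_ood)
    ood_penalty = q_net(s_ood, a_ood).mean()

    return bellman_loss + beta * ood_penalty
```

Because the penalty term never touches dataset state-action pairs, the in-distribution Bellman targets are left intact, which is the point the abstract makes about not hurting in-distribution values.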
Multiagent deep reinforcement learning (MA-DRL) has received increasingly wide attention. Most existing MA-DRL algorithms, however, are still inefficient when faced with the non-stationarity caused by agents continually changing their behaviors in stochastic environments. This paper extends the weighted double estimator to multiagent domains and proposes an MA-DRL framework named Weighted Double Deep Q-Network (WDDQN). By leveraging the weighted double estimator and a deep neural network, WDDQN can not only reduce the bias effectively but also handle scenarios with raw visual inputs. To achieve efficient cooperation in multiagent domains, we introduce a lenient reward network and a scheduled replay strategy. Empirical results show that WDDQN outperforms an existing DRL algorithm (Double DQN) and an MA-DRL algorithm (lenient Q-learning) in terms of average reward and convergence speed, and is more likely to converge to the Pareto-optimal Nash equilibrium in stochastic cooperative environments.
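The core mechanism named in the abstract is the weighted double estimator, which blends two value estimates when forming the bootstrap target. The sketch below is an illustrative DQN-style rendering of that idea, not the paper's code; the constant `c` and the exact weighting rule are assumptions.

```python
# Sketch of a weighted-double-estimator target in the spirit of WDDQN:
# the bootstrap value mixes the online and target networks' estimates of the
# greedy action, weighted by how clearly that action dominates the worst one.
import torch


def weighted_double_q_target(q_online, q_target, r, s_next, done,
                             gamma=0.99, c=1.0):
    with torch.no_grad():
        q_next_online = q_online(s_next)   # shape [batch, num_actions]
        q_next_target = q_target(s_next)

        a_star = q_next_online.argmax(dim=1, keepdim=True)  # greedy action
        a_low = q_next_target.argmin(dim=1, keepdim=True)   # worst action

        q_t_star = q_next_target.gather(1, a_star)
        q_t_low = q_next_target.gather(1, a_low)
        q_o_star = q_next_online.gather(1, a_star)

        # Weight in [0, 1): near 1 when the greedy action clearly dominates,
        # near 0 otherwise, interpolating between the single and double
        # estimators to reduce estimation bias.
        gap = (q_t_star - q_t_low).abs()
        beta = gap / (c + gap)

        blended = beta * q_o_star + (1.0 - beta) * q_t_star
        return r + gamma * (1.0 - done) * blended.squeeze(1)
```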