
Reinforcement Learning (RL) algorithms work well when rewards are well defined, but with sparse or deceptive rewards they fail unless paired with additional exploration strategies. This work introduces a deep exploration method based on the Upper Confidence Bound (UCB) bonus. The proposed method can be plugged into actor-critic algorithms that use a deep neural network as the critic. Building on regret-bound results for the linear Markov decision process approximation, we use the feature matrix to calculate the UCB bonus for deep exploration. The proposed method is equivalent to count-based exploration in special cases and also applies to general settings. It uses only the final d-dimensional feature vector of the critic network, so it is easy to deploy. We design a simple task, "swim", to demonstrate how the proposed method achieves exploration in sparse/deceptive reward environments. We then perform an empirical evaluation on sparse/deceptive-reward versions of Gym environments and on Ackermann robot control tasks. The evaluation results verify that the proposed algorithm can perform effective deep exploration in sparse/deceptive reward tasks.
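
To make the idea of a feature-matrix UCB bonus concrete, the following is a minimal sketch, not the authors' implementation. It assumes the bonus form commonly used in the linear-MDP literature, beta * sqrt(phi^T Lambda^{-1} phi) with Lambda = lam * I + sum of phi phi^T over visited features. The abstract does not state the exact formula, and the names `UCBBonus`, `beta`, `lam`, and `demo_one_hot` are hypothetical. Here `phi` stands in for the d-dimensional feature vector taken from the critic network.

```python
import numpy as np


class UCBBonus:
    """Illustrative UCB exploration bonus from a feature matrix (assumed form)."""

    def __init__(self, dim, beta=1.0, lam=1.0):
        self.beta = beta                 # bonus scale (assumed hyperparameter)
        self.cov = lam * np.eye(dim)     # Lambda = lam * I + sum(phi phi^T)

    def update(self, phi):
        # Accumulate the outer product of a visited feature vector.
        phi = np.asarray(phi, dtype=float)
        self.cov += np.outer(phi, phi)

    def bonus(self, phi):
        # beta * sqrt(phi^T Lambda^{-1} phi), via a linear solve
        # rather than an explicit matrix inverse.
        phi = np.asarray(phi, dtype=float)
        x = np.linalg.solve(self.cov, phi)
        return self.beta * float(np.sqrt(phi @ x))


def demo_one_hot():
    # With one-hot features, the bonus reduces to beta / sqrt(lam + n),
    # where n is the visit count: a count-based bonus.
    ucb = UCBBonus(dim=3, beta=1.0, lam=1.0)
    seen = np.array([1.0, 0.0, 0.0])
    unseen = np.array([0.0, 1.0, 0.0])
    for _ in range(3):
        ucb.update(seen)
    return ucb.bonus(seen), ucb.bonus(unseen)
```

Under these assumptions, a one-hot feature visited three times receives a bonus of 1 / sqrt(1 + 3) = 0.5, while an unvisited one keeps a bonus of 1.0. This illustrates the special case in which such a bonus coincides with count-based exploration. In an actor-critic setting the bonus would typically be added to the reward or the value target, which is a design choice the abstract leaves unspecified.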