
Offline Reinforcement Learning with Constrained Hybrid Action Implicit Representation Towards Wargaming Decision-Making

Liwei Dong¹, Ni Li², Guanghong Gong³ (✉), Xin Lin¹ (✉)
1. School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China
2. School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China; State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China; also with Zhongguancun Laboratory, Beijing 100191, China
3. School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China; also with State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China

Abstract

Reinforcement Learning (RL) has emerged as a promising data-driven solution for wargaming decision-making. However, two domain challenges remain: (1) handling discrete-continuous hybrid wargaming control and (2) accelerating RL deployment with abundant offline data. Existing RL methods cannot address both issues simultaneously, so we propose a novel offline RL method targeting hybrid action spaces. A new constrained action representation technique builds a bidirectional, semantically consistent mapping between the original hybrid action space and a continuous latent space. This allows a continuous latent policy to be learned with offline RL, with better exploration feasibility and scalability, and then reconstructed into the required hybrid policy. Critically, a novel offline RL optimization objective with adaptively adjusted constraints is designed to balance mitigating out-of-distribution actions against preserving generalization. Our method demonstrates superior performance and generality across different tasks, particularly in typical realistic wargaming scenarios.

Keywords: decision-making, wargaming, hybrid action space, offline Reinforcement Learning (RL)
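
The abstract outlines a two-stage pipeline: first learn a semantically consistent latent representation of the hybrid (discrete plus continuous) action space, then train a continuous latent policy offline under an adaptively weighted constraint. As a rough illustration of how such a pipeline can be wired together, the sketch below uses a conditional VAE-style encoder/decoder over hybrid actions and a TD3+BC-style adaptive constraint weight; the class and function names, network sizes, and the specific weighting rule are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridActionVAE(nn.Module):
    # Encodes a (discrete, continuous) action, conditioned on the state, into a
    # continuous latent code and decodes it back into both action components.
    def __init__(self, state_dim, n_discrete, cont_dim, latent_dim=8, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(n_discrete, 16)            # discrete action -> dense vector
        self.enc = nn.Sequential(
            nn.Linear(state_dim + 16 + cont_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),                # mean and log-variance
        )
        self.dec = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_discrete + cont_dim),         # discrete logits + continuous params
        )
        self.n_discrete = n_discrete

    def forward(self, state, a_disc, a_cont):
        mu, logvar = self.enc(torch.cat([state, self.embed(a_disc), a_cont], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        out = self.dec(torch.cat([state, z], -1))
        logits, cont_rec = out[:, :self.n_discrete], torch.tanh(out[:, self.n_discrete:])
        return logits, cont_rec, mu, logvar

    def loss(self, state, a_disc, a_cont, beta=0.5):
        # Reconstruction of both action components plus a KL term toward N(0, I).
        logits, cont_rec, mu, logvar = self(state, a_disc, a_cont)
        rec = F.cross_entropy(logits, a_disc) + F.mse_loss(cont_rec, a_cont)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return rec + beta * kl

def latent_policy_loss(critic, latent_policy, state, z_behavior, alpha=2.5):
    # Offline update in the latent space: maximize Q while staying close to the
    # latent codes of dataset actions; the constraint weight is rescaled by the
    # current Q magnitude (a TD3+BC-style heuristic, assumed here for illustration).
    z_pi = latent_policy(state)
    q = critic(state, z_pi)
    lam = alpha / q.abs().mean().detach()                     # adaptive constraint coefficient
    return -(lam * q).mean() + F.mse_loss(z_pi, z_behavior)

In use, z_behavior would come from encoding each dataset action with the trained encoder, and the decoder would reconstruct the latent policy's output into a deployable hybrid action.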


Publication history

Received: 24 July 2023
Revised: 31 August 2023
Accepted: 18 September 2023
Published: 02 May 2024
Issue date: October 2024

Copyright

© The Author(s) 2024.

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
