
Offline Reinforcement Learning with Constrained Hybrid Action Implicit Representation Towards Wargaming Decision-Making

Liwei Dong¹, Ni Li², Guanghong Gong³ (✉), Xin Lin¹ (✉)
1. School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China
2. School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China; State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China; also with Zhongguancun Laboratory, Beijing 100191, China
3. School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China; also with State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China

Abstract

Reinforcement Learning (RL) has emerged as a promising data-driven solution for wargaming decision-making. However, two domain challenges remain: (1) handling discrete-continuous hybrid wargaming control and (2) accelerating RL deployment with abundant offline data. Existing RL methods cannot address both issues simultaneously, so we propose a novel offline RL method targeting hybrid action spaces. A new constrained action representation technique builds a bidirectional, semantically consistent mapping between the original hybrid action space and a continuous latent space. This allows a continuous latent policy to be learned with offline RL, with better exploration feasibility and scalability, and then reconstructed into the required hybrid policy. Critically, a novel offline RL optimization objective with adaptively adjusted constraints is designed to balance mitigating out-of-distribution actions against preserving generalization. Our method demonstrates superior performance and generality across different tasks, particularly in typical realistic wargaming scenarios.

Keywords: decision-making, wargaming, hybrid action space, offline Reinforcement Learning (RL)
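
The abstract outlines a two-stage pipeline: first learn a semantically consistent latent representation of the hybrid (discrete plus continuous) action space, then train a continuous latent policy offline under an adaptively weighted constraint. As a rough illustration of how such a pipeline can be wired together, the sketch below uses a conditional VAE-style encoder/decoder over hybrid actions and a TD3+BC-style adaptive constraint weight; the class and function names, network sizes, and the specific weighting rule are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridActionVAE(nn.Module):
    # Encodes a (discrete, continuous) action, conditioned on the state, into a
    # continuous latent code and decodes it back into both action components.
    def __init__(self, state_dim, n_discrete, cont_dim, latent_dim=8, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(n_discrete, 16)            # discrete action -> dense vector
        self.enc = nn.Sequential(
            nn.Linear(state_dim + 16 + cont_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),                # mean and log-variance
        )
        self.dec = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_discrete + cont_dim),         # discrete logits + continuous params
        )
        self.n_discrete = n_discrete

    def forward(self, state, a_disc, a_cont):
        mu, logvar = self.enc(torch.cat([state, self.embed(a_disc), a_cont], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        out = self.dec(torch.cat([state, z], -1))
        logits, cont_rec = out[:, :self.n_discrete], torch.tanh(out[:, self.n_discrete:])
        return logits, cont_rec, mu, logvar

    def loss(self, state, a_disc, a_cont, beta=0.5):
        # Reconstruction of both action components plus a KL term toward N(0, I).
        logits, cont_rec, mu, logvar = self(state, a_disc, a_cont)
        rec = F.cross_entropy(logits, a_disc) + F.mse_loss(cont_rec, a_cont)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return rec + beta * kl

def latent_policy_loss(critic, latent_policy, state, z_behavior, alpha=2.5):
    # Offline update in the latent space: maximize Q while staying close to the
    # latent codes of dataset actions; the constraint weight is rescaled by the
    # current Q magnitude (a TD3+BC-style heuristic, assumed here for illustration).
    z_pi = latent_policy(state)
    q = critic(state, z_pi)
    lam = alpha / q.abs().mean().detach()                     # adaptive constraint coefficient
    return -(lam * q).mean() + F.mse_loss(z_pi, z_behavior)

In use, z_behavior would come from encoding each dataset action with the trained encoder, and the decoder would reconstruct the latent policy's output into a deployable hybrid action.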


Publication history

Received: 24 July 2023
Revised: 31 August 2023
Accepted: 18 September 2023
Published: 02 May 2024
Issue date: October 2024

Copyright

© The Author(s) 2024.

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
