Open Access

Offline Reinforcement Learning with Constrained Hybrid Action Implicit Representation Towards Wargaming Decision-Making

School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China
School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China; State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China; and also with Zhongguancun Laboratory, Beijing 100191, China
School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China; and also with State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China

Abstract

Reinforcement Learning (RL) has emerged as a promising data-driven solution for wargaming decision-making. However, two domain challenges remain: (1) handling discrete-continuous hybrid wargaming control and (2) accelerating RL deployment with abundant offline data. Existing RL methods cannot address both issues simultaneously, so we propose a novel offline RL method targeting hybrid action spaces. A new constrained action representation technique builds a bidirectional, semantically consistent mapping between the original hybrid action space and a latent space. This allows a continuous latent policy, with better exploration feasibility and scalability, to be learned via offline RL and then reconstructed into the required hybrid policy. Critically, a novel offline RL optimization objective with adaptively adjusted constraints is designed to balance suppressing out-of-distribution actions against generalizing beyond the dataset. Our method demonstrates superior performance and generality across different tasks, particularly in typical realistic wargaming scenarios.
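The bidirectional mapping between the hybrid action space and a continuous latent space can be illustrated with a minimal sketch. All names here are hypothetical, and the actual method uses learned neural encoders and decoders under semantic-consistency constraints; this toy version stands in with a fixed embedding table and a nearest-neighbour lookup purely to show the encode/decode round trip a latent policy would rely on.

```python
import numpy as np

class HybridActionCodec:
    """Toy bidirectional mapping between a hybrid action (k, x) and a
    single continuous latent vector z, where k is a discrete action
    choice and x its continuous parameters.

    Encoding: concatenate an embedding of k with x. Decoding: recover k
    by nearest-neighbour lookup over the embedding table, and split off
    x. In the real method both directions are learned networks.
    """

    def __init__(self, n_discrete: int, embed_dim: int, param_dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # One embedding row per discrete action; fixed here, learned in practice.
        self.table = rng.normal(size=(n_discrete, embed_dim))
        self.param_dim = param_dim

    def encode(self, k: int, x: np.ndarray) -> np.ndarray:
        # Latent vector = [embedding of discrete action | continuous parameters].
        return np.concatenate([self.table[k], x])

    def decode(self, z: np.ndarray):
        e, x = z[: -self.param_dim], z[-self.param_dim :]
        # Nearest embedding row recovers the discrete action index.
        k = int(np.argmin(np.linalg.norm(self.table - e, axis=1)))
        return k, x

codec = HybridActionCodec(n_discrete=4, embed_dim=3, param_dim=2)
k, x = 2, np.array([0.5, -0.3])
z = codec.encode(k, x)    # continuous latent vector a latent policy can output
k2, x2 = codec.decode(z)  # reconstructed hybrid action for the environment
assert k2 == k and np.allclose(x2, x)
```

A continuous latent policy trained offline then only ever emits vectors like `z`, and the decoder turns them back into executable hybrid actions; the paper's constrained objective additionally keeps those decoded actions close to the offline data distribution.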

Tsinghua Science and Technology
Pages 1422-1440
Cite this article:
Dong L, Li N, Gong G, et al. Offline Reinforcement Learning with Constrained Hybrid Action Implicit Representation Towards Wargaming Decision-Making. Tsinghua Science and Technology, 2024, 29(5): 1422-1440. https://doi.org/10.26599/TST.2023.9010100