Open Access

Offline Reinforcement Learning with Constrained Hybrid Action Implicit Representation Towards Wargaming Decision-Making

School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China
School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China; State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China; Zhongguancun Laboratory, Beijing 100191, China
School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China; State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China

Abstract

Reinforcement Learning (RL) has emerged as a promising data-driven solution for wargaming decision-making. However, two domain challenges remain: (1) handling discrete-continuous hybrid wargaming control and (2) accelerating RL deployment with rich offline data. Existing RL methods fail to address both issues simultaneously, so we propose a novel offline RL method targeting hybrid action spaces. A new constrained action representation technique is developed to build a bidirectional mapping between the original hybrid action space and a latent space in a semantically consistent way. This allows a continuous latent policy to be learned via offline RL, with better exploration feasibility and scalability, and then reconstructed into the required hybrid policy. Critically, a novel offline RL optimization objective with adaptively adjusted constraints is designed to balance alleviating out-of-distribution actions against generalizing beyond the offline data. Our method demonstrates superior performance and generality across different tasks, particularly in typical realistic wargaming scenarios.
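To make the abstract's core idea concrete, below is a minimal sketch (not the authors' implementation) of a VAE-style hybrid action representation: a discrete action index and its continuous parameters are encoded, conditioned on the state, into a single continuous latent vector and decoded back, so that an offline RL policy can be trained entirely in the continuous latent space. The framework (PyTorch), module names, layer sizes, and the state-conditioned encoding are illustrative assumptions; the paper's specific constraints and its adaptively adjusted offline RL objective are not reproduced here.

```python
# Illustrative sketch only: one possible hybrid-action encoder/decoder,
# not the paper's constrained implicit representation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridActionVAE(nn.Module):
    def __init__(self, n_discrete, cont_dim, state_dim, latent_dim=8, hidden=64):
        super().__init__()
        self.discrete_emb = nn.Embedding(n_discrete, hidden)   # embed discrete action id
        self.encoder = nn.Sequential(
            nn.Linear(hidden + cont_dim + state_dim, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, latent_dim)
        self.log_std = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + state_dim, hidden), nn.ReLU(),
        )
        self.discrete_head = nn.Linear(hidden, n_discrete)     # reconstruct action-id logits
        self.cont_head = nn.Linear(hidden, cont_dim)           # reconstruct continuous params

    def encode(self, state, disc_a, cont_a):
        # Map (state, discrete id, continuous params) to a latent Gaussian.
        h = torch.cat([self.discrete_emb(disc_a), cont_a, state], dim=-1)
        h = self.encoder(h)
        mu, log_std = self.mu(h), self.log_std(h).clamp(-4, 2)
        z = mu + log_std.exp() * torch.randn_like(mu)           # reparameterisation trick
        return z, mu, log_std

    def decode(self, state, z):
        # Map a latent vector back to a hybrid action; continuous params are
        # assumed normalised to [-1, 1], hence the tanh.
        h = self.decoder(torch.cat([z, state], dim=-1))
        return self.discrete_head(h), torch.tanh(self.cont_head(h))

    def loss(self, state, disc_a, cont_a, beta=0.5):
        # Reconstruction loss for both action components plus a KL term
        # that keeps the latent space smooth.
        z, mu, log_std = self.encode(state, disc_a, cont_a)
        disc_logits, cont_rec = self.decode(state, z)
        rec = F.cross_entropy(disc_logits, disc_a) + F.mse_loss(cont_rec, cont_a)
        kl = -0.5 * (1 + 2 * log_std - mu.pow(2) - (2 * log_std).exp()).sum(-1).mean()
        return rec + beta * kl
```

In a setup like this, an offline RL algorithm would output a bounded latent vector z, and decode(state, z) would recover the executable hybrid action; constraining z to stay near the latent codes of dataset actions is one common way to limit out-of-distribution behavior, which is the role the paper's adaptively adjusted constraints play.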

Tsinghua Science and Technology
Pages 1422-1440
Cite this article:
Dong L, Li N, Gong G, et al. Offline Reinforcement Learning with Constrained Hybrid Action Implicit Representation Towards Wargaming Decision-Making. Tsinghua Science and Technology, 2024, 29(5): 1422-1440. https://doi.org/10.26599/TST.2023.9010100


Received: 24 July 2023
Revised: 31 August 2023
Accepted: 18 September 2023
Published: 02 May 2024
© The Author(s) 2024.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
