Review | Open Access

State of the Art of Adaptive Dynamic Programming and Reinforcement Learning

Derong Liu1,2, Mingming Ha3, and Shan Xue4
1. Department of Mechanical and Energy Engineering, Southern University of Science and Technology, Shenzhen 518055, China
2. Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, IL 60607, USA
3. School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing 100083, China
4. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China

Abstract

This article introduces the state-of-the-art development of adaptive dynamic programming and reinforcement learning (ADPRL). First, algorithms in reinforcement learning (RL) are introduced and their roots in dynamic programming are illustrated. Adaptive dynamic programming (ADP) is then introduced, following a brief discussion of dynamic programming. Research in ADP and RL has developed rapidly over the past decade, from algorithms to convergence and optimality analyses to stability results. Several key steps in the recent theoretical development of ADPRL are highlighted, together with some future perspectives. In particular, convergence and optimality results of value iteration and policy iteration are reviewed, followed by an introduction to the most recent results on the stability analysis of value iteration algorithms.
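As a concrete reminder of the two recursions named in the abstract, the following is a minimal sketch of tabular value iteration and policy iteration, assuming a small randomly generated finite Markov decision process. It is an illustration only, not code or a model from the article; the state and action counts, discount factor, and random transition model are assumptions made purely for demonstration.

```python
# Illustrative sketch only (not from the article): tabular value iteration and
# policy iteration on a small randomly generated MDP. The surveyed ADP results
# treat nonlinear dynamical systems with neural-network approximation rather
# than finite tables; this toy example just shows the two recursions whose
# convergence, optimality, and stability properties the review discusses.
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # R[s, a]

def value_iteration(tol=1e-8):
    """V_{k+1}(s) = max_a [ R(s, a) + gamma * sum_{s'} P(s'|s, a) V_k(s') ]."""
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * (P @ V)                 # one Bellman backup, Q[s, a]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

def policy_iteration():
    """Alternate exact policy evaluation with greedy policy improvement."""
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Evaluation: solve (I - gamma * P_pi) V = R_pi for the current policy.
        P_pi = P[np.arange(n_states), policy]
        R_pi = R[np.arange(n_states), policy]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Improvement: act greedily with respect to the evaluated value.
        new_policy = (R + gamma * (P @ V)).argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return V, policy
        policy = new_policy

V_vi, pi_vi = value_iteration()
V_pi, pi_pi = policy_iteration()
print("optimal values agree:", np.allclose(V_vi, V_pi, atol=1e-6))
print("optimal policy:", pi_pi)
```

In the discrete-time ADP setting surveyed in this article, the analogous recursions operate on a cost function rather than a tabular value: value iteration updates V_{i+1}(x_k) = min_u { U(x_k, u_k) + V_i(x_{k+1}) }, and the reviewed results characterize when these iterates converge to the optimal cost and when the associated control laws are stabilizing.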

CAAI Artificial Intelligence Research
Pages 93-110
Cite this article:
Liu D, Ha M, Xue S. State of the Art of Adaptive Dynamic Programming and Reinforcement Learning. CAAI Artificial Intelligence Research, 2022, 1(2): 93-110. https://doi.org/10.26599/AIR.2022.9150007


Received: 26 April 2022
Revised: 19 August 2022
Accepted: 14 September 2022
Published: 10 March 2023
© The author(s) 2022

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
