Dynamic programming solves the Hamilton-Jacobi-Bellman (HJB) equation exactly and is a powerful tool for dynamic optimization problems [1]. Dynamic programming, however, is a backward-in-time method: the optimal control sequence must be computed backward in time, so the computational and storage burden grows rapidly with the length of the control sequence, i.e., with the time dimension of the solution. This is the so-called "curse of dimensionality" [2]. Adaptive dynamic programming (ADP), proposed by Werbos, is a class of algorithms that iteratively seek an approximate solution of the HJB equation using function approximation structures [3-6]. Through iteration, ADP makes the iterative control policy and the iterative performance index function approach the optimal control policy and the optimal performance index function, thereby avoiding the curse of dimensionality faced by backward-in-time dynamic programming. ADP is also known as adaptive critic designs [7], approximate dynamic programming [4], neuro-dynamic programming [8], and reinforcement learning [9].
ADP iterations fall into two classes: value iteration and policy iteration [10-11]. On the policy iteration side, Murray et al. proposed a policy iteration for continuous-time affine nonlinear systems and were the first to show that starting the iteration from a stable control policy guarantees both convergence and system stability, providing proofs of stability and convergence; this was the first mathematical proof of policy iteration [6]. Reference [12] studied policy iteration for continuous-time nonlinear systems with saturation constraints and proved its convergence. Liu et al. proposed a policy iteration for discrete-time nonlinear systems and proved that initializing the iterative control policy with a stable control policy guarantees convergence to the optimum [13]. Wang et al. studied policy iteration with error bounds to solve the finite-horizon optimization problem for discrete-time nonlinear systems [14].
On the value iteration side, Al-Tamimi et al. [15] proposed a value iteration for discrete-time affine nonlinear systems and were the first to show, with a mathematical proof, that initializing the iterative performance index function to zero guarantees convergence to the optimum. Zhang et al. [16] defined a new type of performance index function to solve the optimal tracking control problem for discrete-time affine nonlinear systems and proved convergence and optimality of the value iteration. Wang et al. [17] subsequently proved convergence and optimality of value iteration for general discrete-time nonaffine nonlinear systems. References [15-17] require the iterative performance index function to be initialized to zero, and their iterative control policies cannot guarantee system stability. Wei et al. [18] proposed an initialization of the iterative performance index function that guarantees every iterative control policy is a stable control policy, together with a convergence proof.
Reference [19] proved convergence of value iteration when the initial iterative performance index function is merely semi-positive definite, greatly relaxing the initial condition of value iteration. This paper establishes the same convergence result for a semi-positive definite initial iterative performance index function, but by a proof method different from that of [19].
1 Problem formulation

Consider the discrete-time nonaffine nonlinear system

$x_{k+1} = F(x_k, u_k),\; k = 0, 1, 2, \cdots,$ (1)

where $x_k \in \mathbb{R}^n$ is the state vector, $u_k \in \mathbb{R}^m$ is the control vector, and $F(\cdot, \cdot)$ is the system function. It is assumed that $F(0, 0) = 0$, i.e., $x = 0$ is an equilibrium of system (1) under $u = 0$, and that system (1) is controllable.

The objective is to find a control sequence $\underline{u}_k = (u_k, u_{k+1}, \cdots)$ that minimizes the performance index function

$J(x_k, \underline{u}_k) = \sum\limits_{t = k}^\infty U(x_t, u_t),$ (2)

where $U(x_t, u_t)$ is the utility function, with $U(0, 0) = 0$ and $U(x_t, u_t) > 0$ for all $(x_t, u_t) \neq (0, 0)$.
According to Bellman's principle of optimality, the optimal performance index function satisfies the HJB equation

$J^*(x_k) = \min\limits_{u(x_k)} \{U(x_k, u(x_k)) + J^*(x_{k+1})\}.$ (3)

Equation (3) can also be written as

$J^*(x_k) = U(x_k, u^*(x_k)) + J^*(x_{k+1}),$ (4)

where the optimal control policy satisfies

$u^*(x_k) = \arg\min\limits_{u(x_k)} \{U(x_k, u(x_k)) + J^*(x_{k+1})\},$ (5)

that is, with the system equation (1) substituted for the successor state,

$u^*(x_k) = \arg\min\limits_{u(x_k)} \{U(x_k, u(x_k)) + J^*(F(x_k, u(x_k)))\}.$ (6)
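As a brief illustration of (3)-(6) (an aside added here, with the linear-quadratic setting as an assumption, not part of this paper's development): for a linear system $x_{k+1} = A x_k + B u_k$ with quadratic utility $U(x, u) = x^{\rm T} Q x + u^{\rm T} R u$, substituting the candidate $J^*(x) = x^{\rm T} P x$ into (3) and carrying out the minimization over $u$ gives

$u^*(x_k) = -(R + B^{\rm T} P B)^{-1} B^{\rm T} P A x_k,$

and (3) reduces to the discrete-time algebraic Riccati equation

$P = Q + A^{\rm T} P A - A^{\rm T} P B (R + B^{\rm T} P B)^{-1} B^{\rm T} P A.$

For general nonlinear $F$, no such closed form exists, which motivates the iterative procedure below.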
For nonlinear systems, solving the HJB equation directly is extremely difficult. Adaptive dynamic programming solves the HJB equation indirectly through iteration. Let $J_0(\cdot)$ denote the initial iterative performance index function. For $i = 0, 1, 2, \cdots$, value iteration computes the iterative control policy

$u_i(x_k) = \arg\min\limits_{u(x_k)} \{U(x_k, u(x_k)) + J_i(x_{k+1})\},$ (7)

and updates the iterative performance index function

$J_{i+1}(x_k) = \min\limits_{u(x_k)} \{U(x_k, u(x_k)) + J_i(x_{k+1})\}.$ (8)
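As a concrete illustration of the update (7)-(8), the following minimal Python sketch runs value iteration on a dense state/control grid; the scalar dynamics, quadratic utility, grids, and tolerance are illustrative assumptions of ours, not the paper's example.

```python
import numpy as np

# Minimal numerical sketch of value iteration (7)-(8) on a state/control grid.
# The scalar dynamics, utility, grids, and tolerance below are illustrative
# assumptions for this sketch, not the paper's benchmark.

def F(x, u):                       # system function (1)
    return 0.9 * np.sin(x) + u

def U(x, u):                       # positive-definite utility in (2)
    return x**2 + u**2

xs = np.linspace(-2.0, 2.0, 201)   # state grid
us = np.linspace(-2.0, 2.0, 201)   # control grid
X, Uu = np.meshgrid(xs, us, indexing="ij")

J = np.zeros_like(xs)              # J_0 = 0; Theorem 2 allows any J_0 in Gamma
for i in range(500):
    # Q[j, m] = U(x_j, u_m) + J_i(F(x_j, u_m)); successor values are obtained
    # by linear interpolation (states leaving the grid are clamped).
    Q = U(X, Uu) + np.interp(F(X, Uu), xs, J)
    J_next = Q.min(axis=1)         # update (8)
    u_star = us[Q.argmin(axis=1)]  # greedy policy, update (7)
    if np.max(np.abs(J_next - J)) < 1e-10:
        break
    J = J_next
```

On the grid, the update (8) becomes a finite minimization per state, and the greedy policy of (7) falls out of the same minimization; interpolating $J_i$ at successor states stands in for the function approximation structures mentioned in the introduction.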
Define the set of continuous functions

$\varGamma = \left\{\varphi(x) : \varphi(x) \geqslant 0,\; \varphi(0) = 0\right\}.$ (9)

$\varGamma$ is the set of semi-positive definite functions; for example, $\varphi(x) = \|x\|^2$ and $\varphi(x) \equiv 0$ both belong to $\varGamma$. This paper proves, by a new method different from that of [19], that if $J_0 \in \varGamma$, then the value iteration (7)-(8) converges to the optimal performance index function $J^*$.
For the optimal control problem, the control policy is required not only to minimize the performance index function but also to stabilize the system.

Definition 1  If a control policy $v(x_k)$ is continuous, satisfies $v(0) = 0$, and renders system (1) asymptotically stable, then $v(x_k)$ is called a stable control policy.

Define the set of continuous functions

$\varPsi = \left\{\psi(x_k) : \psi(x_k) = \sum\limits_{t = k}^\infty U(x_t, v(x_t))\right\},$ (10)

where $v(\cdot)$ is a stable control policy of system (1). Thus, if $J_i \in \varPsi$, there exists a stable control policy $v_i(x_k)$ such that

$J_i(x_k) = \sum\limits_{t = k}^\infty U(x_t, v_i(x_t)) = U(x_k, v_i(x_k)) + J_i(x_{k+1}).$ (11)

Note that every $\psi \in \varPsi$ is nonnegative and vanishes at the origin, so $\varPsi \subset \varGamma$: initializing in $\varGamma$ is a weaker requirement than initializing in $\varPsi$. In the following, we first prove that if $J_0 \in \varPsi$, the iterative performance index function converges to the optimum, and then extend the conclusion to any $J_0 \in \varGamma$.
Lemma 1  If $J_i \in \varPsi$, then $J_i(x_k) \geqslant J^*(x_k)$.

Proof  By (10), when $J_i \in \varPsi$ there exists a stable control policy $v_i(x_k)$ satisfying (11); hence

$J_i(x_k) = \sum\limits_{t = k}^\infty U(x_t, v_i(x_t)) \geqslant \min\limits_{u(x_k)} \sum\limits_{t = k}^\infty U(x_t, u(x_t)) = J^*(x_k).$ (12)

This proves Lemma 1.
Lemma 2  If $J_i \in \varPsi$, then $J_{i+1}(x_k) \leqslant J_i(x_k)$.

Proof  According to (8) and (11),

$J_i(x_k) = U(x_k, v_i(x_k)) + J_i(x_{k+1}) \geqslant \min\limits_{u(x_k)} \{U(x_k, u(x_k)) + J_i(x_{k+1})\} = J_{i+1}(x_k).$ (13)

This proves Lemma 2.
Lemma 3  If $J_i \in \varPsi$, then $J_{i+1} \in \varPsi$.

Proof  Since $J_i \in \varPsi$, there exists a stable control policy $v_i(x_k)$ satisfying (11). The proof proceeds in two steps.

1) According to (8) and (11),

$J_{i+1}(x_k) = U(x_k, u_i(x_k)) + J_i(x_{k+1}) = U(x_k, u_i(x_k)) + \sum\limits_{t = k+1}^\infty U(x_t, v_i(x_t)).$ (14)

Let

$\upsilon_{i+1}(x_t) = \begin{cases} u_i(x_t), & t = k; \\ v_i(x_t), & t > k. \end{cases}$ (15)

Then (14) can be rewritten as

$J_{i+1}(x_k) = \sum\limits_{t = k}^\infty U(x_t, \upsilon_{i+1}(x_t)).$ (16)

2) We now show that $\upsilon_{i+1}(x_t)$ is a stable control policy. By Lemma 2 and (8),

$J_i(x_{k+1}) - J_i(x_k) \leqslant J_i(x_{k+1}) - J_{i+1}(x_k) = -U(x_k, u_i(x_k)).$ (17)

By (10), $J_i(x_k) > 0$ for $x_k \neq 0$ and $J_i(0) = 0$; together with (17), $J_i$ is a Lyapunov function of the closed-loop system under $u_i$, so $u_i(x_k)$ is a stable control policy [20], and hence so is $\upsilon_{i+1}(x_t)$. Combining 1) and 2) gives $J_{i+1} \in \varPsi$.

This proves Lemma 3.
Lemmas 1, 2 and 3 show that if $J_0 \in \varPsi$, then every $J_i \in \varPsi$ and $J^*(x_k) \leqslant J_{i+1}(x_k) \leqslant J_i(x_k)$: the sequence $\{J_i\}$ is monotonically nonincreasing and bounded below by $J^*$, and therefore converges.
Theorem 1  If $J_0 \in \varPsi$, then the iterative performance index function converges to the optimal performance index function, i.e., $\lim\limits_{i \to \infty} J_i(x_k) = J^*(x_k)$.

Proof  We follow the method of [17].

Let $J_\infty(x_k) = \lim\limits_{i \to \infty} J_i(x_k)$. By Lemma 2, for any $i$,

$J_\infty(x_k) \leqslant J_{i+1}(x_k) = U(x_k, u_i(x_k)) + J_i(x_{k+1}).$ (18)

Letting $i \to \infty$ and writing $u_\infty(x_k) = \lim\limits_{i \to \infty} u_i(x_k)$, we obtain

$J_\infty(x_k) \leqslant U(x_k, u_\infty(x_k)) + J_\infty(x_{k+1}).$ (19)
On the other hand, by Lemma 2,

$J_\infty(x_k) \leqslant J_i(x_k),$ (20)

and thus

$J_\infty(x_{k+1}) \leqslant J_i(x_{k+1})$ (21)

as well as

$J_{i+1}(x_k) = U(x_k, u_i(x_k)) + J_i(x_{k+1}) \geqslant U(x_k, u_i(x_k)) + J_\infty(x_{k+1}).$ (22)
Letting $i \to \infty$ in (22) yields

$J_\infty(x_k) \geqslant U(x_k, u_\infty(x_k)) + J_\infty(x_{k+1}).$ (23)

Combining inequalities (19) and (23) gives

$J_\infty(x_k) = U(x_k, u_\infty(x_k)) + J_\infty(x_{k+1}).$ (24)

In addition, by (7), as $i \to \infty$,

$u_\infty(x_k) = \arg\min\limits_{u(x_k)} \{U(x_k, u(x_k)) + J_\infty(x_{k+1})\}.$ (25)

Equations (24) and (25) are identical to (3) and (5); hence $J_\infty(x_k) = J^*(x_k)$ and $u_\infty(x_k) = u^*(x_k)$.

This proves Theorem 1.
We now prove that if $J_0 \in \varGamma$, the value iteration (7)-(8) still converges to the optimal performance index function.
Lemma 4  Suppose system (1) is controllable. Then for any iterative performance index function $J_i(x_k)$ obtained by (7) and (8), there exists $S(x_k) \in \varPsi$ such that $J_i(x_k) \leqslant S(x_k)$.

Proof  1) The case $x_k \neq 0$.

Since system (1) is controllable, there exists a control policy $\mu(x_k)$ that drives the state to the origin in finite time; let $\eta$ be such that $x_{k+\eta} = 0$. Let $\varLambda(\cdot)$ be a control law that, applied at the origin, moves the state to some $X \neq 0$, from which $\mu$ returns it to the origin in $\hat\eta$ steps. For $p = 0, 1, 2, \cdots$, let $H(x_k, p)$ denote the cost of first driving $x_k$ to the origin under $\mu$ and then repeating this excursion, i.e., $H(x_k, p)$ equals $\sum\limits_{t = k}^{k+\eta-1} U(x_t, \mu(x_t))$ plus

$\sum\limits_{t = 0}^{p} \left\{ U\big(x_{k+\eta+t(\hat\eta+1)}, \varLambda(x_{k+\eta+t(\hat\eta+1)})\big) + \sum\limits_{\tau = k+\eta+t(\hat\eta+1)+1}^{k+\eta+t(\hat\eta+1)+\hat\eta} U(x_\tau, \mu(x_\tau)) \right\}.$ (26)

In (26), the states along the trajectory satisfy

$\begin{cases} x_{k+\eta} = x_{k+\eta+(t+1)(\hat\eta+1)} = 0, \\ x_{k+\eta+t(\hat\eta+1)+1} = X, \end{cases} \quad t = 0, 1, \cdots, p.$ (27)
Hence (26) can be rewritten as

$H(x_k, p) = \sum\limits_{t = k}^{k+\eta-1} U(x_t, \mu(x_t)) + p\left\{ U(x_{k+\eta}, \varLambda(x_{k+\eta})) + \sum\limits_{\tau = k+\eta+1}^{k+\eta+\hat\eta} U(x_\tau, \mu(x_\tau)) \right\}.$ (28)

Each excursion contributes a positive cost, so $H(x_k, p)$ increases without bound as $p \to \infty$; since $J_i(x_k)$ is finite, there exists a sufficiently large $P$ such that

$J_i(x_k) \leqslant H(x_k, P).$ (29)

Let $\theta(\cdot)$ be a control policy that keeps the state at the origin for $t \geqslant k+\eta+(P+1)(\hat\eta+1)$, and define

$S(x_k) = H(x_k, P) + \sum\limits_{t = k+\eta+(P+1)(\hat\eta+1)}^\infty U(x_t, \theta(x_t)).$ (30)

Since $x_t = 0$ and $U(x_t, \theta(x_t)) = 0$ for $t \geqslant k+\eta+(P+1)(\hat\eta+1)$, the tail sum in (30) vanishes, and by (29)

$J_i(x_k) \leqslant H(x_k, P) = S(x_k).$ (31)
Let $T \in \{k+\eta+t(\hat\eta+1) : t = 0, 1, \cdots, P\}$ denote the instants at which $\varLambda$ is applied, and define

$\omega(x_t) = \begin{cases} \mu(x_t), & t < k+\eta+(P+1)(\hat\eta+1) \text{ and } t \neq T; \\ \varLambda(x_t), & t = T; \\ \theta(x_t), & t \geqslant k+\eta+(P+1)(\hat\eta+1). \end{cases}$ (32)

Then (30) can be rewritten as

$S(x_k) = \sum\limits_{t = k}^\infty U(x_t, \omega(x_t)).$ (33)

Since $\omega(x_t)$ drives the state to the origin in finite time and keeps it there, $\omega$ is a stable control policy, and hence $S(x_k) \in \varPsi$.
2) The case $x_k = 0$.

Apply the zero control policy and let

$S(x_k) = \sum\limits_{t = k}^\infty U(x_t, 0).$ (34)

Since $x = 0$ is an equilibrium of system (1) under zero control, $x_t = 0$ and $U(x_t, 0) = 0$ for all $t \geqslant k$; together with $J_i(0) = 0$, which follows by induction from $J_0(0) = 0$, this gives

$J_i(x_k) = S(x_k) = 0,$ (35)

so $J_i(x_k) \leqslant S(x_k)$ with $S \in \varPsi$ also holds when $x_k = 0$.

Combining (31)-(35), there exists

$S(x_k) = \sum\limits_{t = k}^\infty U(x_t, \bar\omega(x_t)), \quad \bar\omega(x_t) = \begin{cases} \omega(x_t), & x_k \neq 0; \\ 0, & x_k = 0, \end{cases}$ (36)

such that $S(x_k) \in \varPsi$ and $J_i(x_k) \leqslant S(x_k)$.

This proves Lemma 4.
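To make the construction in Lemma 4 concrete, here is a small sketch under the same illustrative scalar system used earlier (all specific choices, including the deadbeat policy $\mu$ and the excursion control $\varLambda$, are our assumptions): with $\mu(x) = -0.9\sin x$ the state reaches the origin in $\eta = 1$ step, an excursion $\varLambda(0) = c$ kicks it to $X = c$, and $\mu$ brings it back in $\hat\eta = 1$ step, so $H(x, p)$ in (28) has a closed form.

```python
import numpy as np

# Concrete instance of the bound H(x_k, p) in (28) for the illustrative
# system x_{k+1} = 0.9*sin(x_k) + u_k with U(x, u) = x^2 + u^2.
# mu is a deadbeat stable policy (eta = 1); Lambda kicks the state from the
# origin to X = c, and mu brings it back in one step (eta_hat = 1).

def mu(x):
    return -0.9 * np.sin(x)       # F(x, mu(x)) = 0: origin reached in one step

def H(x, p, c=0.5):
    go_home = x**2 + mu(x)**2                       # cost of driving x to 0
    excursion = (0**2 + c**2) + (c**2 + mu(c)**2)   # U(0, Lambda(0)) + U(c, mu(c))
    return go_home + p * excursion

# H(x, p) grows without bound in p, so for any finite J_i(x) there is a P
# with J_i(x) <= H(x, P), which is inequality (29).
print(H(1.0, 0), H(1.0, 10), H(1.0, 100))
```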
Theorem 2  If $J_0 \in \varGamma$, then the iterative performance index function $J_i(x_k)$ generated by (7) and (8) converges to the optimal performance index function $J^*(x_k)$ as $i \to \infty$.

Proof  Let $\{\tilde J_i\}$ denote the value iteration initialized with $\tilde J_0(x_k) \equiv 0$. Since $\tilde J_0(x_k) = 0 \leqslant J_0(x_k)$ and the update (8) preserves pointwise ordering, for any $i$,

$\tilde J_i(x_k) \leqslant J_i(x_k).$ (37)

On the other hand, by Lemma 4, for any $i$ there exists $\bar J_i(x_k) \in \varPsi$ such that

$J_i(x_k) \leqslant \bar J_i(x_k).$ (38)
Based on the value iteration (7) and (8), define two iteration processes starting from these two different iterative performance index functions:

$\begin{cases} \tilde u_i(x_k) = \arg\min\limits_{u(x_k)} \{U(x_k, u(x_k)) + \tilde J_i(x_{k+1})\}; \\ \tilde J_{i+1}(x_k) = U(x_k, \tilde u_i(x_k)) + \tilde J_i(x_{k+1}). \end{cases}$ (39)

$\begin{cases} \bar u_i(x_k) = \arg\min\limits_{u(x_k)} \{U(x_k, u(x_k)) + \bar J_i(x_{k+1})\}; \\ \bar J_{i+1}(x_k) = U(x_k, \bar u_i(x_k)) + \bar J_i(x_{k+1}). \end{cases}$ (40)
Setting $q = 0$ and merging inequalities (37) and (38) gives

$\tilde J_{i+q}(x_k) \leqslant J_{i+q}(x_k) \leqslant \bar J_{i+q}(x_k),$ (41)

which can also be written at the successor state as

$\tilde J_{i+q}(x_{k+1}) \leqslant J_{i+q}(x_{k+1}) \leqslant \bar J_{i+q}(x_{k+1}).$ (42)
It then follows that

$\bar J_{i+q+1}(x_k) = U(x_k, \bar u_{i+q}(x_k)) + \bar J_{i+q}(x_{k+1}) \geqslant U(x_k, \bar u_{i+q}(x_k)) + J_{i+q}(x_{k+1}) \geqslant \min\limits_{u(x_k)} \{U(x_k, u(x_k)) + J_{i+q}(x_{k+1})\} = J_{i+q+1}(x_k),$ (43)

and

$J_{i+q+1}(x_k) = U(x_k, u_{i+q}(x_k)) + J_{i+q}(x_{k+1}) \geqslant U(x_k, u_{i+q}(x_k)) + \tilde J_{i+q}(x_{k+1}) \geqslant \min\limits_{u(x_k)} \{U(x_k, u(x_k)) + \tilde J_{i+q}(x_{k+1})\} = \tilde J_{i+q+1}(x_k).$ (44)

Merging inequalities (43) and (44) gives

$\tilde J_{i+q+1}(x_k) \leqslant J_{i+q+1}(x_k) \leqslant \bar J_{i+q+1}(x_k).$ (45)
Proceeding by induction on $q$, we obtain

$\tilde J_{i+q}(x_k) \leqslant J_{i+q}(x_k) \leqslant \bar J_{i+q}(x_k), \quad q = 0, 1, 2, \cdots.$ (46)

Since $\bar J_i \in \varPsi$, Theorem 1 gives $\bar J_{i+q}(x_k) \to J^*(x_k)$ as $q \to \infty$; the zero-initialized iteration likewise satisfies $\tilde J_{i+q}(x_k) \to J^*(x_k)$ [15, 17]. Letting $q \to \infty$ in (46) therefore yields $\lim\limits_{i \to \infty} J_i(x_k) = J^*(x_k)$.

This proves Theorem 2.
Theorem 2 shows that the iterative performance index function converges to the optimal performance index function for any initial $J_0 \in \varGamma$; neither zero initialization nor initialization with the cost of a stable control policy is required.
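A quick numerical illustration of the sandwich argument (41)-(46) (our own toy experiment, not from the paper): running value iteration from $\tilde J_0 \equiv 0$, from a semi-positive definite $J_0 \in \varGamma$, and from a much larger initialization, the three value functions coalesce as the iteration proceeds.

```python
import numpy as np

# Toy check of Theorem 2's squeeze (41)-(46): value iteration from three
# different initial functions converges to the same limit. The system,
# utility, and grids are illustrative assumptions.

def F(x, u):
    return 0.9 * np.sin(x) + u

def U(x, u):
    return x**2 + u**2

xs = np.linspace(-2.0, 2.0, 201)
us = np.linspace(-2.0, 2.0, 201)
X, Uu = np.meshgrid(xs, us, indexing="ij")

def vi_step(J):
    """One value-iteration update (8) on the grid."""
    Q = U(X, Uu) + np.interp(F(X, Uu), xs, J)
    return Q.min(axis=1)

inits = {
    "zero (tilde J_0)": np.zeros_like(xs),   # lower initialization
    "J_0 in Gamma":     np.abs(xs),          # semi-positive definite
    "large (bar J_0)":  10.0 * xs**2,        # upper initialization
}
J = {name: J0.copy() for name, J0 in inits.items()}
for i in range(300):
    J = {name: vi_step(Ji) for name, Ji in J.items()}

vals = np.array(list(J.values()))
print("max spread between the three limits:",
      np.abs(vals.max(0) - vals.min(0)).max())
```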
This paper has presented a proof method that, by constructing upper and lower bounds on the iterative performance index function, shows that value iteration initialized with an arbitrary semi-positive definite function is guaranteed to converge to the optimum.
[1] ZHANG H J, CHENG S Y, LUO S M, et al. B-spline active contour based on dynamic programming[J]. Journal of Guangdong University of Technology, 2005, 22(4): 26-30.
[2] BELLMAN R E. Dynamic programming[M]. Princeton: Princeton University Press, 1957.
[3] WERBOS P J. Advanced forecasting methods for global crisis warning and models of intelligence[J]. General Systems Yearbook, 1977, 22(6): 25-38.
[4] MILLER W T, SUTTON R S, WERBOS P J. A menu of designs for reinforcement learning over time[M]//Neural networks for control. Cambridge: MIT Press, 1991.
[5] WERBOS P J. Approximate dynamic programming for real-time control and neural modeling[M]//Handbook of intelligent control. New York: Van Nostrand Reinhold, 1992.
[6] MURRAY J J, COX C J, LENDARIS G G, et al. Adaptive dynamic programming[J]. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 2002, 32(2): 140-153. DOI: 10.1109/TSMCC.2002.801727.
[7] PROKHOROV D V, WUNSCH D C. Adaptive critic designs[J]. IEEE Transactions on Neural Networks, 1997, 8(5): 997-1007. DOI: 10.1109/72.623201.
[8] BERTSEKAS D P, TSITSIKLIS J N. Neuro-dynamic programming[M]. Belmont: Athena Scientific, 1996.
[9] SUTTON R S, BARTO A G. Reinforcement learning: an introduction[M]. Cambridge: The MIT Press, 1998.
[10] LEWIS F L, LIU D. Reinforcement learning and adaptive dynamic programming for feedback control[J]. IEEE Circuits and Systems Magazine, 2009, 9(3): 32-50.
[11] LEWIS F L, VRABIE D, VAMVOUDAKIS K G. Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers[J]. IEEE Control Systems, 2012, 32(6): 76-105. DOI: 10.1109/MCS.2012.2214134.
[12] ABU-KHALAF M, LEWIS F L. Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach[J]. Automatica, 2005, 41(5): 779-791. DOI: 10.1016/j.automatica.2004.11.034.
[13] LIU D, WEI Q. Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems[J]. IEEE Transactions on Neural Networks and Learning Systems, 2014, 25(3): 621-634. DOI: 10.1109/TNNLS.2013.2281663.
[14] WANG F, JIN N, LIU D, et al. Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with ε-error bound[J]. IEEE Transactions on Neural Networks, 2011, 22(1): 24-36. DOI: 10.1109/TNN.2010.2076370.
[15] AL-TAMIMI A, LEWIS F L, ABU-KHALAF M. Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof[J]. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 2008, 38(4): 943-949. DOI: 10.1109/TSMCB.2008.926614.
[16] ZHANG H, WEI Q, LUO Y. A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm[J]. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 2008, 38(4): 937-942. DOI: 10.1109/TSMCB.2008.920269.
[17] WANG D, LIU D, WEI Q, et al. Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming[J]. Automatica, 2012, 48(8): 1825-1832. DOI: 10.1016/j.automatica.2012.05.049.
[18] WEI Q, LIU D. A novel iterative θ-adaptive dynamic programming for discrete-time nonlinear systems[J]. IEEE Transactions on Automation Science and Engineering, 2014, 11(4): 1176-1190. DOI: 10.1109/TASE.2013.2280974.
[19] WEI Q, LIU D, LIN H. Value iteration adaptive dynamic programming for optimal control of discrete-time nonlinear systems[J]. IEEE Transactions on Cybernetics, 2016, 46(3): 840-853. DOI: 10.1109/TCYB.2015.2492242.
[20] LIAO X, WANG L, YU P. Stability of dynamical systems[M]. Amsterdam: Elsevier Press, 2007.