2. University of Chinese Academy of Sciences, Beijing 100049, China;
3. School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing 100083, China;
4. Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
NOWADAYS, the need for the smart grid is continuously increasing [1]-[3]. The smart home energy management system is an important component of the smart grid, and in such systems the intelligent optimal control of the battery is a key technology for reducing power consumption. In [4], a battery management method based on battery dynamics modeling was proposed. In [5], the development of battery management systems applied in the smart grid and electric vehicles was summarized. In [6], the operating schedule of a battery energy storage system was solved by a particle swarm optimization approach. However, previous research on battery management did not establish properties such as convergence and optimality, which limited the applications of battery control. Adaptive dynamic programming (ADP), proposed by Werbos [7], [8], has been widely used in optimal energy management [6], [9]-[15]. There are several synonyms of ADP, including ''adaptive critic designs'' [16], ''approximate dynamic programming'' [11], ''neuro-dynamic programming'' [17], and ''relaxing dynamic programming'' [18].
Iterative methods are widely used in ADP to obtain the solution of the Hamilton-Jacobi-Bellman (HJB) equation indirectly [14], [19]-[34]. Policy and value iterations are the two primary iterative ADP algorithms [35]. Policy iteration algorithms for the optimal control of continuous-time (CT) systems with continuous state and action spaces were first given in [36], [37]. In [38], a complex-valued ADP algorithm was discussed, where the optimal control problem of complex-valued nonlinear systems was solved by ADP for the first time. In [39], based on neurocognitive psychology, a novel controller based on multiple actor-critic structures was developed for unknown systems; the proposed controller traded off fast actions based on stored behavior patterns against real-time exploration using current input-output data. In [40], an effective off-policy integral reinforcement learning (IRL) algorithm was presented, which solved the optimal control problem for completely unknown continuous-time systems with unknown disturbances. In [41], a policy iteration algorithm for discrete-time nonlinear systems was developed. Value iteration algorithms for the optimal control of discrete-time nonlinear systems were given in [17]. In [18] and [42], the convergence properties of value iteration were established. Value iteration algorithms are generally initialized by a ''zero'' performance index function [18], [42], [43], which guarantees the convergence of the iterative value functions. In [44], a self-learning scheme for residential energy system control and management was proposed.
In this paper, inspired by [45], a new iterative ADP algorithm is developed to solve the optimal battery control problem for the smart home energy management system, where the charging/discharging constraints of the battery are considered. First, the models of the smart home energy system and the battery are established, where the efficiency of the battery is considered. Second, inspired by [37], a new non-quadratic performance index function is constructed, which embeds the charging/discharging power constraint of the battery. Then, the iterative ADP algorithm is derived to obtain the optimal control law of the battery. Via the system transformation and the definition of the performance index function, the expression of the iterative sequential control law for the battery can be obtained. The convergence and optimality of the algorithm are presented, which guarantees that the iterative value function converges to the optimal performance index function as the iteration index increases to infinity.
The rest of this paper is organized as follows. In Section Ⅱ, the problem formulation is presented: the model of the smart home energy system is constructed, the operation principle of the battery is introduced, and the optimization objectives of the control problem are declared. In Section Ⅲ, the iterative ADP algorithm for the battery management system is established. According to the system transformation and the optimality principle, the iterative ADP algorithm is derived, and its convergence properties are proven. In Section Ⅳ, numerical results are presented to demonstrate the effectiveness of the developed algorithm. Finally, in Section Ⅴ, the conclusion is drawn.
Ⅱ. PROBLEM FORMULATION
In this section, smart home energy systems with a battery will be described. The optimization objectives of our research will be defined and the corresponding principle of optimality will be introduced.
A. Smart Home Energy Systems
In this paper, the optimal battery control problem is treated as a discrete-time problem with a time step of one hour.
The battery model used in this work is based on [6], [44], [46], where battery efficiency is considered to extend the battery's lifetime as far as possible. Let $E_{bt}$ denote the energy stored in the battery and let $P_{bt}$ denote the charging/discharging power of the battery at time $t$. The battery dynamics are
$ \begin{align} \label{equation1} {E_{b(t + 1)}} = {E_{bt}} - {P_{bt}} \times \eta ({P_{bt}}) \end{align} $  (1) 
where $\eta(P_{bt})$ is the efficiency function of the battery, which is expressed as
$ \begin{align} \label{equation1a} \eta ({P_{bt}}) = 0.898 - 0.173{P_{bt}}/{P_{\mathrm{rate}}} \end{align} $  (2) 
where $P_{\mathrm{rate}}$ is the rated power of the battery. Two constraints of the battery are considered in this paper.
1) The storage limit is considered:
$ \begin{align} \label{equation2} E_b^{\min } \le {E_{bt}} \le E_b^{\max } \end{align} $  (3) 
where $E_b^{\min}$ and $E_b^{\max}$ are the minimum and maximum storage limits of the battery, respectively.
2) The charging and discharging power limits are considered:
$ \begin{align} \label{equation3} P_b^{\min } \le {P_{bt}} \le P_b^{\max } \end{align} $  (4) 
where $P_b^{\min}$ and $P_b^{\max}$ are the minimum and maximum charging/discharging power limits of the battery, respectively.
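The battery model (1)-(4) can be sketched numerically. The parameter values below (rated power, storage and power limits) are illustrative assumptions, not values taken from this paper:

```python
def efficiency(p_b, p_rate):
    # Charging/discharging efficiency model (2):
    # eta decreases linearly with the power over the rated power.
    return 0.898 - 0.173 * p_b / p_rate

def battery_step(e_b, p_b, p_rate, e_min, e_max, p_min, p_max):
    # One step of the battery dynamics (1), enforcing the
    # power limit (4) and the storage limit (3).
    p_b = min(max(p_b, p_min), p_max)        # enforce (4)
    e_next = e_b - p_b * efficiency(p_b, p_rate)
    return min(max(e_next, e_min), e_max)    # enforce (3)

# Illustrative parameters (assumed, not from the paper).
P_RATE, E_MIN, E_MAX = 5.0, 1.0, 9.0
e = battery_step(5.0, 2.0, P_RATE, E_MIN, E_MAX, -4.0, 4.0)
```

Clipping the power before the energy update mirrors the order in which the constraints are stated: (4) restricts the control, while (3) restricts the resulting state.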
Given the home load and the real-time electricity rate, the objective of the optimal control is to find the optimal battery charging/discharging/idle control law at each time step that minimizes the total expense of the power from the grid while respecting the battery limitations. To find the optimal control law, the load balance should be considered. Let $P_{Lt}$ denote the home load power and let $P_{gt}$ denote the power drawn from the grid at time $t$.
To establish the equation of the home energy system, we introduce a one-step delay in the load and battery power with respect to the grid power, which gives the load balance
$ \begin{align} \label{equation4} P_{L(t-1)}=P_{b(t-1)} \eta(P_{b(t-1)})+ P_{gt}. \end{align} $  (5) 
In this paper, the power flow from the battery to the grid is not permitted, i.e., we define $P_{gt} \ge 0$. To extend the battery's lifetime, the stored energy of the battery is expected to stay close to the middle of the storage limits, which is defined as
$ \begin{align} \label{equation5} E_b^o = \frac{1}{2}(E_b^{\min } + E_b^{\max }). \end{align} $  (6) 
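The load balance (5) and the expected energy level (6) can be sketched directly; the rated power value below is an assumption for illustration:

```python
def grid_power(p_load_prev, p_b_prev, p_rate=5.0):
    # Load balance (5): the grid supplies whatever part of the
    # (delayed) load the battery does not cover.
    eta = 0.898 - 0.173 * p_b_prev / p_rate      # efficiency (2)
    return p_load_prev - p_b_prev * eta

def desired_level(e_min, e_max):
    # Expected battery energy level (6): middle of the storage limits.
    return 0.5 * (e_min + e_max)
```

When the battery is idle (`p_b_prev = 0`), the grid carries the whole load; when it discharges, only the efficiency-weighted power offsets the load.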
In [45], a quadratic form performance index function was proposed, which was expected to be minimized
$ \begin{align} \label{equation6} \sum\limits_{t = 0}^\infty \bigg( {m_1}{{({C_t}{P_{gt}})}^2} + {m_2}{{({E_{bt}} - E_b^o)}^2} + m_3 (P_{bt})^2\bigg) \end{align} $  (7) 
where $m_1$, $m_2$, and $m_3$ are positive weight factors, and $C_t$ is the real-time electricity rate.
In [45], a dual iterative Q-learning method was developed for optimal battery management in smart residential environments.
First, we define the system states. Let $x_{1t} = P_{gt}$ be the power from the grid and let the control be defined as $u_t = P_{bt}$. According to (5), we have
$ \begin{align} \label{equation6x1} x_{1, t+1}=P_{Lt}-u_t\eta(u_t). \end{align} $  (8) 
Let $x_{2t} = E_{bt}$ be the energy of the battery. According to (1), we have
$ \begin{align} \label{equation6x2} x_{2, t+1}=x_{2, t}-u_t\eta(u_t). \end{align} $  (9) 
Letting $x_t = (x_{1t}, x_{2t})^T$, the smart home energy system can be written as
$ \begin{align} \label{equation6a} {x_{t + 1}} = F({x_t}, {u_t}, t) = \left( {\begin{array}{*{20}{c}} {{P_{Lt}} - {u_t}\eta ({u_t})}\\ {{x_{2t}} - {u_t}\eta ({u_t})} \end{array}} \right). \end{align} $  (10) 
In [45], the performance index function was defined as (7). However, in (7), the power constraint of the battery was not considered. In this paper, inspired by [37], a non-quadratic performance index function will be defined for the battery management system, which is expressed as
$ \begin{align} \label{equation6nn1} \sum\limits_{t = 0}^\infty \bigg(& {m_1}{{({C_t}{P_{gt}})}^2}+ {m_2}{{({E_{bt}} - E_b^o)}^2} \nonumber \\ &+ m_3 \int_0^{{P_{bt}}} {{{({\Phi ^{-1}}(s))}^T}ds}\bigg) \end{align} $  (11) 
where $\Phi(\cdot)$ is a bounded, monotonic, odd function with $\Phi(0) = 0$, which is used to confine the control within the power limits [37]. Considering the efficiency of the battery, the performance index function in this paper is modified as
$ \begin{align} \label{equation6nn2} \sum\limits_{t = 0}^\infty \bigg(& {m_1}{{({C_t}{P_{gt}})}^2} + {m_2}{{({E_{bt}} - E_b^o)}^2} \nonumber \\ &+ m_3 \int_0^{{P_{bt}\eta(P_{bt})}} {{{({\Phi ^{-1}}(s))}^T}ds}\bigg). \end{align} $  (12) 
Remark 1: The smart home energy system (10) is different from the one in [45]. First, in equation (3) of [45], the efficiency of the battery was not considered in the power balance of the load, while in (10) of this paper, the battery efficiency is considered in both the load power balance and the battery energy balance. Second, the performance index function defined in [45] did not consider the efficiency of the battery, which makes it possible for the optimal control law to exceed the maximum/minimum power of the battery and may render the optimal control invalid. In this paper, both the maximum/minimum power and the efficiency of the battery are considered in the performance index function (12). Thus, the system model (10) and the performance index function (12) are more reasonable.
B. Optimality Principle
Let $\underline{u}_t = (u_t, u_{t+1}, \ldots)$ denote a control sequence from time $t$ to $\infty$. The performance index function is defined as
$ \begin{align} \label{equation7} J({x_t}, \underline{u}_t, t) = \sum\limits_{i = t}^\infty {\gamma^i U(x_i, u_i, i)} \end{align} $  (13) 
where $0 < \gamma \le 1$ is the discount factor and the utility function is expressed as
$ \begin{align} \label{equation7xn1} U(x_t, u_t, t) = {x^T_t}M_t{x_t} + m_3 \int_0^{{u_t}\eta ({u_t})} {{{({\Phi ^{-1}}(s))}^T}ds}. \end{align} $  (14) 
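The non-quadratic integral in (14) has a closed form once a concrete $\Phi$ is fixed. The choice $\Phi(s) = \bar{P}\tanh(s)$ below is an assumption for illustration (the paper only requires a bounded monotonic function), and the closed form follows by integrating $\Phi^{-1}(s) = \mathrm{artanh}(s/\bar{P})$:

```python
import math

def phi_inv_integral(v, p_bar):
    # Closed form of  ∫_0^v Φ^{-1}(s) ds  for the assumed Φ(s) = p_bar*tanh(s),
    # i.e. Φ^{-1}(s) = artanh(s / p_bar).  Valid for |v| < p_bar.
    r = v / p_bar
    return v * math.atanh(r) + (p_bar / 2.0) * math.log(1.0 - r * r)

def utility(x, u_eta, M, m3, p_bar):
    # Utility function (14): quadratic state cost x^T M x (2-dim state)
    # plus the non-quadratic control term evaluated at v = u*eta(u).
    quad = sum(x[i] * M[i][j] * x[j] for i in range(2) for j in range(2))
    return quad + m3 * phi_inv_integral(u_eta, p_bar)
```

The integral term grows without bound as $|v| \to \bar{P}$, which is what keeps the efficiency-weighted control strictly inside the power limit.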
Define the admissible control sequence set as $\underline{\mathfrak{U}}_{t}$. Then the optimal performance index function is defined as
$ \begin{equation} J^*(x_t, t)=\inf_{\underline{u}_t} \left\{ J(x_t, \underline{u}_t, t)\colon \underline{u}_t\in \underline{\mathfrak{U}}_{t}\right\}. \label{equation8} \end{equation} $  (15) 
According to Bellman's principle of optimality [47], we can obtain the following discrete-time HJB equation
$ \begin{align}\label{equation9} {J^*}({x_t}, t) = \mathop {\inf }\limits_{{u_{t}}} \big\{ U({x_t}, {u_t}, t) + {J^*}({x_{t + 1}}, t + 1)\big\}. \end{align} $  (16) 
The optimal sequential control law can be expressed as
$ \begin{equation} u ^*(x_t, t)=\arg\mathop {\inf }\limits_{{u_{t}}} \big\{U(x_t, u_t, t)+J^*(x_{t+1}, t+1)\big\}. \label{equation10} \end{equation} $  (17) 
Remark 2: We can see that the home energy system (10) is a nonlinear dynamic system. The optimal battery control is actually an infinite-horizon optimal control problem for a nonlinear system with a non-quadratic performance index function. In this situation, many static mathematical programming methods, such as linear programming, are not effective. Dynamic programming is a powerful method for such problems. However, if we adopt the traditional dynamic programming method to obtain the optimal performance index function one step at a time, then we face the ''curse of dimensionality''. Thus, a new iterative ADP algorithm will be developed in this paper.
C. Derivations of the Iterative ADP Algorithm
From (10) we know that the battery management system is time-varying, which means that the control law is also time-varying. This makes the controller design difficult. To overcome this difficulty, in [45], [48], by defining a new sequence of controls for a period, the time-varying optimal control problem was transformed into a time-invariant one, which significantly reduced the computational burden. In this paper, inspired by [45], [48], we will define the sequence of controls for a period, where the constraints of the battery are considered. For any state $x_k$, define the control sequence for one period as $\mathcal{U}_k = (u_k, u_{k+1}, \ldots, u_{k+\lambda-1})$, where $\lambda$ is the length of the period. The utility of the period is defined as
$ \begin{align} \label{equation14} \Lambda \, (x_k, \mathcal {U}_k) = \sum\limits_{\theta = 0}^{\lambda  1} {\, U(x_{k + \theta}, u_{k + \theta}, \theta).} \end{align} $  (18) 
Then, for any $x_k$, the optimal performance index function satisfies
$ \begin{align} \label{equation16} {J^*}({x_k}) = \mathop {\min }\limits_{{{\cal U}_k}} \big\{ \Lambda ({x_k}, {{\cal U}_k}) + \bar \gamma {J^*}({x_{k + \lambda }})\big\} \end{align} $  (19) 
where $\bar\gamma = \gamma^{\lambda}$ is the discount factor over one period. The optimal control sequence can be expressed as
$ \begin{align} \label{equation16a}{{\cal U}^*}({x_k}) = \arg \mathop {\min }\limits_{{\mathcal {U}_k}} \{ \Lambda {\mkern 1mu} ({x_k}, {{\cal U}_k}) + \bar \gamma {J^*}({x_{k + \lambda }})\}. \end{align} $  (20) 
Define an iteration index $i = 0, 1, \ldots$, and let $V_0(\cdot)$ be an initial value function. For $i = 0, 1, \ldots$, the iterative value function is updated by
$ \begin{align} \label{equation17a} {V_{i+1}}({x_k}) = \mathop {\min }\limits_{{{\cal U}_k}} \{ \Lambda {\mkern 1mu} ({x_k}, {{\cal U}_k}) + \bar \gamma {V_i}({x_{k + \lambda }})\} \end{align} $  (21) 
where the iterative control law sequence is computed as
$ \begin{align} \label{equation17} \mathcal {U}_i(x_k) = \arg\mathop {\min }\limits_{{{\cal U}_k}} \{ \Lambda {\mkern 1mu} ({x_k}, {{\cal U}_k}) + \bar \gamma {V_i}({x_{k + \lambda }})\}. \end{align} $  (22) 
For any $j = 0, 1, \ldots, \lambda - 1$, the iterative value function within one period is updated by
$ \begin{align} \label{equation23} V_i^{j + 1}({x_k}) &= \mathop {\min }\limits_{{u_k}} \{ U({x_k}, {u_k}, j) + \gamma V_i^j({x_{k + 1}})\} \nonumber \\ &= U({x_k}, u_i^j({x_k}), j) + \gamma V_i^j({x_{k + 1}}) \end{align} $  (23) 
and
$ \begin{align} \label{equation231}u^j_i(x_k) = \arg \mathop {\min }\limits_{{u_k}} \{ U({x_k}, {u_k}, j) + \gamma V_i^j({x_{k + 1}})\} \end{align} $  (24) 
where we let $V_i^0(x_k) = V_i(x_k)$.
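The nested structure of (21)-(24) can be sketched on a toy problem: the outer loop updates the period value function $V_i$, and the inner loop sweeps through the $\lambda$ steps of one period with $V_i^0 = V_i$. The states, dynamics, costs, and all numeric values below are toy assumptions, not the smart-home model itself:

```python
# Toy value iteration over periods, in the shape of (21)-(24).
LAM, GAMMA = 3, 0.9            # period length and step discount (assumed)
STATES = range(5)              # small discretized state set
CONTROLS = (-1, 0, 1)

def step(x, u):
    # toy dynamics on a bounded integer state set
    return min(max(x + u, 0), 4)

def cost(x, u, j):
    # toy time-varying utility U(x, u, j)
    return x * x + u * u + 0.1 * j

def period_update(V):
    # One outer iteration of (21), computed by the inner recursion (23):
    # V_i^{j+1}(x) = min_u { U(x, u, j) + gamma * V_i^j(x') },  V_i^0 = V_i.
    Vj = dict(V)
    for j in range(LAM):
        Vj = {x: min(cost(x, u, j) + GAMMA * Vj[step(x, u)]
                     for u in CONTROLS) for x in STATES}
    return Vj

V = {x: 0.0 for x in STATES}   # ''zero'' initial value function
for _ in range(200):           # outer iterations until convergence
    V = period_update(V)
```

Each outer iteration contracts toward a fixed point with modulus $\gamma^{\lambda}$, which is the mechanism the convergence analysis later makes precise.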
The system function is defined as
$ \begin{align} \label{equation23an1}{x_{k + 1}} = f({x_k}, j) + g{v_k} \end{align} $  (25) 
where $v_k = u_k\eta(u_k)$ denotes the transformed control input. According to the first-order necessary condition of optimality, the iterative control satisfies
$ \begin{align} \label{equation23nn1}\frac{{\partial V_i^{j + 1}({x_k})}}{{\partial v_i^j({x_k})}} = 0. \end{align} $  (26) 
Then, we can obtain that
$ \begin{align} \label{equation23nn2}v_i^j(x_k) = -\Phi \bigg(\frac{1}{2}m_3^{-1}{g^{{T}}}\frac{{\mathrm{d}{V^j_i}({x_{k + 1}})}}{{\mathrm{d}{x_{k + 1}}}}\bigg). \end{align} $  (27) 
We can see that for any $i$ and $j$, the transformed iterative control law $v_i^j(x_k)$ can be obtained directly from (27).
Remark 3: From (27), we can see that the expression of the iterative sequential control law in this paper is different from the one in [45]. In [45], the iterative sequential control law was obtained by minimizing an iterative $Q$ function, i.e.,
$ \begin{align} \label{equation231TIE2015}u^j_i(x_k) = \arg \mathop {\min }\limits_{u_k} Q^j_i(x_k, u_k). \end{align} $  (28) 
Generally, the minimization in (28) cannot be solved analytically and must be carried out numerically, which increases the computational burden. In this paper, via the system transformation, the expression (27) of the iterative control law is obtained directly.
Now, we let $\mathcal{G}(u_k) = u_k\eta(u_k)$. Assuming $\mathcal{G}$ is invertible on the feasible power range, the original control can be recovered as
$ \begin{align} \label{equation7xn2} u_k=\mathcal {G}^{1}(v_k). \end{align} $  (29) 
Hence, for any $i$ and $j$, the iterative control law $u_i^j(x_k)$ can be obtained from (27) and (29).
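The two-stage recovery of the control, (27) followed by (29), can be sketched as follows. The choice $\Phi(\cdot) = \bar{P}\tanh(\cdot)$ and all numeric parameters are assumptions for illustration, and the inversion of $\mathcal{G}(u) = u\,\eta(u)$ is done by bisection, assuming $\mathcal{G}$ is monotone on the search interval:

```python
import math

P_BAR, M3 = 4.0, 1.0            # power bound and weight m3 (assumed values)

def eta(u, p_rate=5.0):
    # battery efficiency (2)
    return 0.898 - 0.173 * u / p_rate

def v_from_gradient(g, dV_dx, m3=M3, p_bar=P_BAR):
    # Transformed control (27) with the assumed Φ(·) = p_bar * tanh(·):
    #   v = -Φ( (1/2) * m3^{-1} * g^T dV/dx )
    s = 0.5 / m3 * sum(gi * di for gi, di in zip(g, dV_dx))
    return -p_bar * math.tanh(s)

def invert_G(v, lo=-10.0, hi=10.0):
    # Recover u from v = G(u) = u * eta(u), as in (29), by bisection.
    # G is increasing on [-10, 10] for this efficiency model.
    G = lambda u: u * eta(u)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if (G(mid) - v) * (G(lo) - v) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

Because $\tanh$ is bounded, $|v| \le \bar{P}$ holds by construction, so the recovered control never demands more efficiency-weighted power than the bound allows.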
In this subsection, the convergence and optimality properties of the proposed iterative ADP algorithm will be developed. It will be shown that the iterative value function and iterative control law will converge to their optimums as the iteration index
Theorem 1: For $i = 0, 1, \ldots$, let $V_i(x_k)$ and $\mathcal{U}_i(x_k)$ be obtained by (21) and (22), and let the iterative control laws $u_i^j(x_k)$, $j = 0, 1, \ldots, \lambda - 1$, be obtained by (24) with $\lambda = 24$. Then the iterative control sequence can be expressed as
$ \begin{align} \label{equation24} {{\cal U}_i}(x_k) = \{ {u^{23}_i}(x_k), {u^{22}_i}(x_{k+1}), \ldots , {u^{0}_{i}}(x_{k+23})\}. \end{align} $  (30) 
Proof: The statement can be proven by mathematical induction. For any $j = 0, 1, \ldots, \lambda - 1$, according to (23), we have
$ \begin{align} \label{equation30b} V_i^{j + 1}&(x_k) \nonumber \\ = &\mathop {\min }\limits_{u_k} \big(U(x_k, u_k, j) + \gamma V_i^j({x_{k+1} })\big) \nonumber\\=& \mathop {\min }\limits_{u_k} \bigg(U(x_k, u_k, j) + \gamma \mathop {\min }\limits_{{u_{k+1} }} \Big(U({x_{k+1}}, {u_{k+1}}, j - 1) \nonumber \\& + \cdots + \gamma \mathop {\min }\limits_{{u_{k+j}}} \big(U({x_{k+j}}, {u_{k+j}}, 0) + \gamma V_i^0 ({x_{k+j+1}})\big)\Big)\bigg) \nonumber\\ = &\mathop {\min }\limits_{(u_k, {u_{k+1}}, \ldots, {u_{k+j}})} \bigg(\sum\limits_{l = 0}^j {\gamma^l U({x_{k+l}}, {u_{k+l}}, j - l)} \nonumber \\ &+ \gamma^{j+1} V_i \big({x_{k+j + 1}}\big)\bigg). \end{align} $  (31) 
First, let $j = 0$. According to (24), we have
$ \begin{align} \label{equation30bn1} u^0_i(x_k) = \arg \mathop {\min }\limits_{{u_k}} \{ U({x_k}, {u_k}, 0) +\gamma V_i^0({x_{k + 1}})\} \end{align} $  (32) 
where $V_i^0(x_k) = V_i(x_k)$. Assume that the statement holds for $j - 1$. Then, for $j$, according to (31), we have
$ \begin{align} \label{equation30c} ({u_k}, &\, {u_{k + 1}}, \ldots, {u_{k + j}})\nonumber \\ & = (u_i^{23}({x_k}), u_i^{22}({x_{k + 1}}), \ldots, u_i^{\lambda - 1 - j}({x_{k + j}})). \end{align} $  (33) 
Let $j = \lambda - 1 = 23$. Then we can obtain (30). The proof is complete.
From Theorem 1, for any iteration index $i$, the control sequence $\mathcal{U}_i(x_k)$ of the whole period can be constructed from the iterative control laws $u_i^j$ obtained by (24).
Theorem 2: For $i = 0, 1, \ldots$, let $V_i(x_k)$ and $\mathcal{U}_i(x_k)$ be obtained by (21) and (22). Then the iterative value function converges to the optimal performance index function as $i \to \infty$, i.e.,
$ \begin{align} \label{equation31} \mathop {\lim }\limits_{i \to \infty } V_{i}(x_k)=J^*(x_k). \end{align} $  (34) 
Proof: According to the control law sequence $\mathcal{U}_k$, the state at the end of the period can be written as
$ \begin{align} \label{eq0303aa1} x_{k+\lambda}=\mathcal {F}(x_k, \mathcal {U}_k). \end{align} $  (35) 
Inspired by [18], [20], assume that there exist constants $0 < \chi < \infty$ and $0 \le \psi_1 \le 1 \le \psi_2 < \infty$ such that $\bar\gamma J^*(x_{k+\lambda}) \le \chi\, \Lambda(x_k, \mathcal{U}_k)$ and $\psi_1 J^*(x_k) \le V_0(x_k) \le \psi_2 J^*(x_k)$ hold uniformly.
For $i = 0, 1, \ldots$, the iterative value function satisfies
$ \begin{align} \label{eq0305add1}\left(1 + \frac{{\psi_1 - 1}}{{{{(1 + {\chi ^{-1}})}^i}}}\right)&J^*(x_k) \leq {V}_i(x_k) \nonumber \\ &\leq \left(1 + \frac{{\psi_2 - 1}}{{{{(1 + {{\chi} ^{-1}})}^i}}} \right)J^*(x_k). \end{align} $  (36) 
Inequality (36) can be proven by mathematical induction. Let $i = 1$. According to (21), we have
$ \begin{align} \label{eq0306} {V_1}({x_k}) =& \mathop {\min }\limits_{{\mathcal {U}_k}} \left\{ {\Lambda({x_k}, {\mathcal {U}_k}) + \bar \gamma {V_0}({x_{k + \lambda}})} \right\} \nonumber \\ \ge&\mathop {\min }\limits_{{\mathcal {U}_k}} \left\{ \Big(1 + \chi\frac{{ \psi_1 - 1 }}{{1 + \chi }}\Big)\Lambda({x_k}, {\mathcal {U}_k}) \right.\nonumber \\ &+ \bar \gamma \bigg(\psi_1 - \frac{{ \psi_1 - 1}}{{1 + \chi }}\bigg){J^*}({x_{k + \lambda}})\left. \right\} \nonumber \\ \ge&\left( {1 + \frac{{\psi_1 - 1}}{{1 + {\chi^{-1}}}}} \right)\mathop {\min }\limits_{{\mathcal {U}_k}} \left\{ {\Lambda({x_k}, {\mathcal {U}_k}) + \bar \gamma {J^*}({x_{k + \lambda}})} \right\} \nonumber\\ =&\left( {1 + \frac{{\psi_1 - 1}}{{1 + {\chi ^{-1}}}}} \right){J^*}({x_k}). \end{align} $  (37) 
Following a similar procedure, inequality (36) can be proven by mathematical induction for $i = 2, 3, \ldots$. Since $0 < \chi < \infty$, we have
$ \begin{align} \label{eq0312aa1} \mathop {\lim }\limits_{i \to \infty }& \left\{ {\left( {1 + \frac{{\psi_1 - 1}}{{{{(1 + {\chi^{-1}})}^i}}}} \right){J^*}({x_k})} \right\}\nonumber \\ &= \mathop {\lim }\limits_{i \to \infty } \left\{ {\left( {1 + \frac{{\psi_2 - 1}}{{{{(1 + {\chi^{-1}})}^i}}}} \right){J^*}({x_k})} \right\} \nonumber \\ &= \, {J^*}({x_k}). \end{align} $  (38) 
Corollary 1: For $i = 0, 1, \ldots$, let $V_i(x_k)$ and $\mathcal{U}_i(x_k)$ be obtained by (21) and (22). Then the iterative control sequence $\mathcal{U}_i(x_k)$ converges to the optimal control sequence as $i \to \infty$, i.e., $\lim_{i \to \infty} \mathcal{U}_i(x_k) = \mathcal{U}^*(x_k)$.
Remark 4: In [45], for a dual iterative Q-learning algorithm, the convergence property was established without considering the power constraints of the battery. In this paper, Theorem 2 and Corollary 1 guarantee the convergence of the iterative value function and the iterative control law under the charging/discharging constraints of the battery.
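The convergence bounds (36) can be illustrated numerically: for any admissible constants, both bound factors approach 1 geometrically as the iteration index grows, which is exactly the limit (38). The constant values below are arbitrary assumptions:

```python
# Numerical illustration of the squeeze in (36): the factors
# 1 + (psi - 1) / (1 + chi^{-1})^i approach 1 as i grows.
chi, psi1, psi2 = 2.0, 0.5, 3.0   # assumed constants with psi1 <= 1 <= psi2

def bounds(i):
    shrink = (1.0 + 1.0 / chi) ** i      # (1 + chi^{-1})^i
    lower = 1.0 + (psi1 - 1.0) / shrink
    upper = 1.0 + (psi2 - 1.0) / shrink
    return lower, upper

lo0, up0 = bounds(0)      # initial bounds: psi1 and psi2 themselves
lo50, up50 = bounds(50)   # after 50 iterations: both factors close to 1
```

Since the lower factor is always at most 1 and the upper factor at least 1, the iterative value function is pinched to $J^*$ from both sides.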
Ⅳ. NUMERICAL RESULTS
In this section, the performance of the iterative ADP algorithm with constraints will be examined by numerical experiments. The simulation results of the developed iterative ADP algorithm with constraints will be compared with those of the dual iterative Q-learning algorithm in [45]. The home load demand and the real-time electricity rate are shown in Fig. 1.
Fig. 1 Home load demand and electricity rate. (a) Home load demand for 672 hours. (b) Average home load demand in a day. (c) Real-time electricity rate for 672 hours. (d) Average electricity rate in a day. 
Let the capacity of the battery be chosen as in [45]. Let the initial value function be expressed by $V_0(x_k) = x_k^T \Pi x_k$, where the positive definite matrix $\Pi$ is given by
$ \begin{align} \Pi = \left[{\begin{array}{*{20}{c}} {0.2894}&{0.6852}\\ {0.6852}&{2.3505} \end{array}} \right].\nonumber \end{align} $ 
Neural networks are used to implement the iterative ADP algorithm. There are two neural networks: the critic network and the action network. Both are chosen as three-layer back-propagation (BP) networks, with the structures of the critic and action networks chosen as 2-8-1 and 2-8-1, respectively. The training methods of the neural networks can be found in [45], [49] and are omitted here. Now we let the real maximum charging/discharging power be
Fig. 2 The trajectories of the iterative value function. 
From Fig. 2, we can see that under the power constraints of the battery, the iterative value function converges to the optimum in
Fig. 3 Optimal management of battery with constraints in four weeks. 
Fig. 4 Optimal battery energy with constraints in four weeks. 
In [45], a dual iterative Q-learning algorithm was developed, where the power constraints of the battery were not considered. The converged weight matrix of the Q function is given by
$ \begin{align} \bar \Pi = \left[{\begin{array}{*{20}{c}} {0.3479}&{0.8594}&{0.4914}\\ {0.8594}&{2.8694}&{1.4634}\\ {0.4914}&{1.4634}&{4.1274} \end{array}} \right]. \nonumber \end{align} $ 
We implement the dual iterative Q-learning algorithm in [45] under the same load and electricity rate data. The optimal battery management and the corresponding battery energy without constraints are shown in Figs. 5 and 6, respectively. The numerical comparisons between the two algorithms in a day are shown in Fig. 7.
Fig. 5 Optimal management of batteries with no constraints in four weeks. 
Fig. 6 Optimal battery energy with no constraints. 
Fig. 7 Numerical comparisons in a day. 
Ⅴ. CONCLUSION
In this paper, a new iterative ADP algorithm is developed to solve the optimal battery control problem for smart home energy systems. Considering the efficiency and the charging/discharging constraints of the battery, the model of the smart home energy system is constructed. A new non-quadratic performance index function is established, which guarantees that the iterative control amplitude does not exceed the power bounds of the battery. The iterative ADP algorithm is developed such that, in each iteration, the expression of the iterative control law can be obtained. The convergence properties of the iterative ADP algorithm are given, which guarantee that the iterative value function and the iterative control law both reach their optima. Finally, simulation and comparison results are given to illustrate the performance of the presented method.
In this paper, to extend the lifetime of the battery, we aim to keep the stored energy of the battery close to the middle of the storage limits and to avoid large charging/discharging power. On the other hand, the charging/discharging frequency of the battery is not considered in this paper. Since frequent quick charging/discharging may also damage the battery, how to avoid frequent charging/discharging of the battery will be our main future research topic.
[1]  H. P. Li, C. Z. Zang, P. Zeng, H. B. Yu, and Z. W. Li, "A stochastic programming strategy in microgrid cyber physical energy system for energy optimal operation," IEEE/CAA J. Automat. Sin., vol. 2, no. 3, pp. 296-303, Jul. 2015. 
[2]  G. Chen and E. N. Feng, "Distributed secondary control and optimal power sharing in microgrids," IEEE/CAA J. Automat. Sin., vol. 2, no. 3, pp. 304-312, Jul. 2015. 
[3]  W. Wei, F. Liu, and S. W. Mei, "Energy pricing and dispatch for smart grid retailers under demand response and market price uncertainty," IEEE Trans. Smart Grid, vol. 6, no. 3, pp. 1364-1374, 2015, DOI: 10.1109/TSG.2014.2376522. 
[4]  A. Szumanowski and Y. H. Chang, "Battery management system based on battery nonlinear dynamics modeling," IEEE Trans. Veh. Technol., vol. 57, no. 3, pp. 1425-1432, 2008, DOI: 10.1109/TVT.2007.912176. 
[5]  H. Rahimi-Eichi, U. Ojha, F. Baronti, and M. Y. Chow, "Battery management system: an overview of its application in the smart grid and electric vehicles," IEEE Ind. Electron. Mag., vol. 7, no. 2, pp. 4-16, Jun. 2013. 
[6]  T. Y. Lee, "Operating schedule of battery energy storage system in a time-of-use rate industrial user with wind turbine generators: a multipass iteration particle swarm optimization approach," IEEE Trans. Energy Convers., vol. 22, no. 3, pp. 774-782, Sep. 2007. 
[7]  P. J. Werbos, "Advanced forecasting methods for global crisis warning and models of intelligence," Gen. Syst. Yearbook, vol. 22, pp. 25-38, Jan. 1977. 
[8]  P. J. Werbos, W. T. Miller, and R. S. Sutton, Neural Networks for Control. Cambridge, MA: MIT Press, 1991. 
[9]  M. Boaro, D. Fuselli, F. De Angelis, D. R. Liu, Q. L. Wei, and F. Piazza, "Adaptive dynamic programming algorithm for renewable energy scheduling and battery management," Cognit. Comput., vol. 5, no. 2, pp. 264-277, Jun. 2013. 
[10]  D. Fuselli, F. De Angelis, M. Boaro, S. Squartini, Q. L. Wei, D. R. Liu, and F. Piazza, "Action dependent heuristic dynamic programming for home energy resource scheduling," Int. J. Electr. Power Energy Syst., vol. 48, pp. 148-160, Jun. 2013. 
[11]  D. Molina, G. K. Venayagamoorthy, J. Q. Liang, and R. G. Harley, "Intelligent local area signals based damping of power system oscillations using virtual generators and approximate dynamic programming," IEEE Trans. Smart Grid, vol. 4, no. 1, pp. 498-508, Mar. 2013. 
[12]  J. Q. Liang, D. D. Molina, G. K. Venayagamoorthy, and R. G. Harley, "Two-level dynamic stochastic optimal power flow control for power systems with intermittent renewable generation," IEEE Trans. Power Syst., vol. 28, no. 3, pp. 2670-2678, Aug. 2013. 
[13]  S. Mohagheghi, G. K. Venayagamoorthy, and R. G. Harley, "Fully evolvable optimal neurofuzzy controller using adaptive critic designs," IEEE Trans. Fuzzy Syst., vol. 16, no. 6, pp. 1450-1461, Dec. 2008. 
[14]  Y. F. Tang, H. B. He, J. Y. Wen, and J. Liu, "Power system stability control for a wind farm based on adaptive dynamic programming," IEEE Trans. Smart Grid, vol. 6, no. 1, pp. 166-177, Jan. 2015. 
[15]  Z. Zhang and D. B. Zhao, "Clique-based cooperative multiagent reinforcement learning using factor graphs," IEEE/CAA J. Automat. Sin., vol. 1, no. 3, pp. 248-256, Jul. 2014. 
[16]  D. V. Prokhorov and D. C. Wunsch, "Adaptive critic designs," IEEE Trans. Neur. Net., vol. 8, no. 5, pp. 997-1007, Sep. 1997. 
[17]  D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996. 
[18]  B. Lincoln and A. Rantzer, "Relaxing dynamic programming," IEEE Trans. Automat. Control, vol. 51, no. 8, pp. 1249-1260, Aug. 2006. 
[19]  Q. L. Wei, F. Y. Wang, D. R. Liu, and X. Yang, "Finite-approximation-error-based discrete-time iterative adaptive dynamic programming," IEEE Trans. Cybernet., vol. 44, no. 12, pp. 2820-2833, Dec. 2014. 
[20]  A. Rantzer, "Relaxed dynamic programming in switching systems," IET Proc. Contr. Theor. Appl., vol. 153, no. 5, pp. 567-574, Oct. 2006. 
[21]  H. Modares and F. L. Lewis, "Linear quadratic tracking control of partially-unknown continuous-time systems using reinforcement learning," IEEE Trans. Automat. Control, vol. 59, no. 11, pp. 3051-3056, Nov. 2014. 
[22]  H. Modares and F. L. Lewis, "Optimal tracking control of nonlinear partially-unknown constrained-input systems using integral reinforcement learning," Automatica, vol. 50, no. 7, pp. 1780-1792, Jul. 2014. 
[23]  H. G. Zhang, T. Feng, G. H. Yang, and H. J. Liang, "Distributed cooperative optimal control for multiagent systems on directed graphs: an inverse optimal approach," IEEE Trans. Cybernet., vol. 45, no. 7, pp. 1315-1326, Jul. 2015. 
[24]  M. Kumar, K. Rajagopal, S. N. Balakrishnan, et al., "Reinforcement learning based controller synthesis for flexible aircraft wings," IEEE/CAA J. Automat. Sin., vol. 1, no. 4, pp. 435-448, Oct. 2014. 
[25]  R. Kamalapurkar, J. R. Klotz, and W. E. Dixon, "Concurrent learning-based approximate feedback-Nash equilibrium solution of N-player nonzero-sum differential games," IEEE/CAA J. Automat. Sin., vol. 1, no. 3, pp. 239-247, Jul. 2014. 
[26]  Q. M. Zhao, H. Xu, and S. Jagannathan, "Near optimal output feedback control of nonlinear discrete-time systems based on reinforcement neural network learning," IEEE/CAA J. Automat. Sin., vol. 1, no. 4, pp. 372-384, Oct. 2014. 
[27]  Q. L. Wei, R. Z. Song, and P. F. Yan, "Data-driven zero-sum neuro-optimal control for a class of continuous-time unknown nonlinear systems with disturbance using ADP," IEEE Trans. Neur. Net. Lear. Syst., vol. 27, no. 2, pp. 444-458, Feb. 2016. 
[28]  Q. L. Wei and D. R. Liu, "Adaptive dynamic programming for optimal tracking control of unknown nonlinear systems with application to coal gasification," IEEE Trans. Automat. Sci. Eng., vol. 11, no. 4, pp. 1020-1036, Oct. 2014. 
[29]  Q. L. Wei and D. R. Liu, "A novel iterative θ-adaptive dynamic programming for discrete-time nonlinear systems," IEEE Trans. Automat. Sci. Eng., vol. 11, no. 4, pp. 1176-1190, Oct. 2014. 
[30]  Q. L. Wei and D. R. Liu, "Data-driven neuro-optimal temperature control of water-gas shift reaction using stable iterative adaptive dynamic programming," IEEE Trans. Ind. Electron., vol. 61, no. 11, pp. 6399-6408, Nov. 2014. 
[31]  Q. L. Wei, D. R. Liu, and H. Q. Lin, "Value iteration adaptive dynamic programming for optimal control of discrete-time nonlinear systems," IEEE Trans. Cybernet., vol. 46, no. 3, pp. 840-853, Mar. 2016. 
[32]  Q. L. Wei, D. R. Liu, and X. Yang, "Infinite horizon self-learning optimal control of nonaffine discrete-time nonlinear systems," IEEE Trans. Neur. Net. Lear. Syst., vol. 26, no. 4, pp. 866-879, Apr. 2015. 
[33]  Q. L. Wei, D. R. Liu, and F. L. Lewis, "Optimal distributed synchronization control for continuous-time heterogeneous multi-agent differential graphical games," Inform. Sci., vol. 317, pp. 96-113, Oct. 2015. 
[34]  Q. L. Wei and D. R. Liu, "A novel policy iteration based deterministic Q-learning for discrete-time nonlinear systems," Sci. China Inform. Sci., vol. 58, no. 12, pp. 1-15, Dec. 2015. 
[35]  F. L. Lewis, D. Vrabie, and K. G. Vamvoudakis, "Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers," IEEE Control Syst., vol. 32, no. 6, pp. 76-105, Dec. 2012. 
[36]  J. J. Murray, C. J. Cox, G. G. Lendaris, and R. Saeks, "Adaptive dynamic programming," IEEE Trans. Syst. Man Cybern. C Appl. Rev., vol. 32, no. 2, pp. 140-153, 2002, DOI: 10.1109/TSMCC.2002.801727. 
[37]  M. Abu-Khalaf and F. L. Lewis, "Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach," Automatica, vol. 41, no. 5, pp. 779-791, 2005, DOI: 10.1016/j.automatica.2004.11.034. 
[38]  R. Z. Song, W. D. Xiao, H. G. Zhang, and C. Y. Sun, "Adaptive dynamic programming for a class of complex-valued nonlinear systems," IEEE Trans. Neur. Net. Lear. Syst., vol. 25, no. 9, pp. 1733-1739, Sep. 2014. 
[39]  R. Z. Song, F. Lewis, Q. L. Wei, H. G. Zhang, Z. P. Jiang, and D. Levine, "Multiple actor-critic structures for continuous-time optimal control using input-output data," IEEE Trans. Neur. Net. Lear. Syst., vol. 26, no. 4, pp. 851-865, Apr. 2015. 
[40]  R. Z. Song, F. L. Lewis, Q. L. Wei, and H. G. Zhang, "Off-policy actor-critic structure for optimal control of unknown systems with disturbances," IEEE Trans. Cybernet., vol. 46, no. 5, pp. 1041-1050, 2016, DOI: 10.1109/TCYB.2015.2421338. 
[41]  D. R. Liu and Q. L. Wei, "Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems," IEEE Trans. Neur. Net. Lear. Syst., vol. 25, no. 3, pp. 621-634, Mar. 2014. 
[42]  A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof," IEEE Trans. Syst. Man Cybern. B Cybern., vol. 38, no. 4, pp. 943-949, Aug. 2008. 
[43]  H. G. Zhang, Q. L. Wei, and Y. H. Luo, "A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm," IEEE Trans. Syst. Man Cybern. B Cybern., vol. 38, no. 4, pp. 937-942, Aug. 2008. 
[44]  T. Huang and D. R. Liu, "A self-learning scheme for residential energy system control and management," Neur. Comput. Appl., vol. 22, no. 2, pp. 259-269, Feb. 2013. 
[45]  Q. L. Wei, D. R. Liu, and G. Shi, "A novel dual iterative Q-learning method for optimal battery management in smart residential environments," IEEE Trans. Ind. Electron., vol. 62, no. 4, pp. 2509-2518, Apr. 2015. 
[46]  T. Yau, L. N. Walker, H. L. Graham, A. Gupta, and R. Raithel, "Effects of battery storage devices on power system dispatch," IEEE Trans. Power Apparatus Syst., vol. PAS-100, no. 1, pp. 375-383, Jan. 1981. 
[47]  R. E. Bellman, Dynamic Programming. Princeton, NJ: Princeton University Press, 1957. 
[48]  Q. L. Wei, D. R. Liu, G. Shi, and Y. Liu, "Multi-battery optimal coordination control for home energy management systems via distributed iterative adaptive dynamic programming," IEEE Trans. Ind. Electron., vol. 62, no. 7, pp. 4203-4214, Jul. 2015. 
[49]  J. Si and Y. T. Wang, "Online learning control by association and reinforcement," IEEE Trans. Neur. Net., vol. 12, no. 2, pp. 264-276, Mar. 2001. 