IEEE/CAA Journal of Automatica Sinica, 2017, Vol. 4, Issue 2: 168-176
Optimal Constrained Self-learning Battery Sequential Management in Microgrid Via Adaptive Dynamic Programming
Qinglai Wei1,2, Derong Liu3, Yu Liu4, Ruizhuo Song3
1. Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China;
2. University of Chinese Academy of Sciences, Beijing 100049, China;
3. School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing 100083, China;
4. Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Abstract: This paper concerns a novel optimal self-learning battery sequential control scheme for smart home energy systems. The main idea is to use the adaptive dynamic programming (ADP) technique to obtain the optimal battery sequential control iteratively. First, the battery energy management system model is established, where the power efficiency of the battery is considered. Next, considering the power constraints of the battery, a new non-quadratic form performance index function is established, which guarantees that the value of the iterative control law cannot exceed the maximum charging/discharging power of the battery to extend the service life of the battery. Then, the convergence properties of the iterative ADP algorithm are analyzed, which guarantees that the iterative value function and the iterative control law both reach the optimums. Finally, simulation and comparison results are given to illustrate the performance of the presented method.
Key words: Adaptive critic designs, adaptive dynamic programming (ADP), approximate dynamic programming, battery management, energy management system, neuro-dynamic programming, optimal control, smart home
Ⅰ. INTRODUCTION

NOWADAYS, the need for the smart grid is continuously increasing [1]-[3]. The smart home energy management system is an important component of the smart grid. In smart home energy management systems, the intelligent optimal control of the battery is a key technology for reducing power consumption. In [4], a battery management method was proposed based on battery dynamics modeling. In [5], the development of battery management systems applied in the smart grid and electric vehicles was summarized. In [6], the operating schedule of a battery energy storage system was solved by the particle swarm optimization approach. However, previous research on battery management did not establish properties such as convergence and optimality, which limited the applications of battery control. Adaptive dynamic programming (ADP), proposed by Werbos [7], [8], has been widely used in optimal energy management [6], [9]-[15]. There are several synonyms of ADP, including ''adaptive critic designs'' [16], ''approximate dynamic programming'' [11], ''neuro-dynamic programming'' [17], and ''relaxing dynamic programming'' [18].

Iterative methods are widely used in ADP to obtain the solution of the Hamilton-Jacobi-Bellman (HJB) equation indirectly [14], [19]-[34]. Policy and value iterations are two primary iterative ADP algorithms [35]. Policy iteration algorithms for optimal control of continuous-time (CT) systems with continuous state and action spaces were first given in [36], [37]. In [38], a complex-valued ADP algorithm was discussed, where for the first time the optimal control problem of complex-valued nonlinear systems was successfully solved by ADP. In [39], based on neurocognitive psychology, a novel controller based on multiple actor-critic structures was developed for unknown systems and the proposed controller traded off fast actions based on stored behavior patterns with real-time exploration using current input-output data. In [40], an effective off-policy learning based integral reinforcement learning (IRL) algorithm was presented, which successfully solved the optimal control problem for completely unknown continuous-time systems with unknown disturbances. In [41], a policy iteration algorithm for discrete-time nonlinear systems was developed. Value iteration algorithms for optimal control of discrete-time nonlinear systems were given in [17]. In [18] and [42], the convergence properties of the value iteration were established. Value iteration algorithms are generally initialized by a ''zero'' performance index function [18], [42], [43], which guarantees the convergence properties of the iterative value functions. In [44], a $Q$-learning based ADP algorithm was proposed to obtain the optimal control law for the battery, which solved the optimal energy management for the microgrid of smart homes. In [9], considering renewable electricity, such as electricity generated from wind and solar energies, the optimal control for the battery was solved by the ADP method.
In [10], a particle swarm optimization (PSO) method was proposed to pre-train the weights of the action and critic neural networks, which facilitated the implementation of the ADP method for the optimal control of the battery. In [45], an effective dual $Q$-learning based iterative ADP algorithm was developed to obtain the optimal battery management for the microgrid of smart homes, where the convergence and optimality of the dual $Q$-learning based ADP algorithm were proven to guarantee the optimal battery control. However, in [45], the charging/discharging constraints of the battery were not considered in the performance index function. Actually, for all the energy management systems of the microgrid of smart homes, the charging and discharging power of the battery cannot reach infinity. Hence, the optimal control of the battery with power constraints of the battery is a key technique for real-world smart home energy management systems.

In this paper, inspired by [45], a new iterative ADP algorithm is developed to solve the optimal battery control for the smart home energy management system, where the charging/discharging constraints of the battery are considered. First, the models of the smart home energy systems and the battery are established, where the efficiency of the battery is considered. Second, inspired by [37], a new non-quadratic performance index function is constructed, where the charging/discharging power of the battery is defined in the performance index function. Then, the iterative ADP algorithm is derived for the optimal control law of the battery. Via the system transformation and the definition of the performance index function, the expression of the iterative sequential control law for the battery can be obtained. The convergence and optimality of the algorithm are presented, which guarantees that the iterative value function will converge to the optimal performance index function, as the iteration index increases to infinity.

The rest of this paper is organized as follows. In Section Ⅱ, the problem formulation is presented. The model of the smart home energy system is constructed. The operation principle of the battery is introduced. The optimization objectives of the control problem are also declared. In Section Ⅲ, the iterative ADP algorithm for battery management system is established. According to the system transformation and the optimality principle, the iterative ADP algorithm is derived. The convergence properties will also be proven in this section. In Section Ⅳ, numerical results are presented to demonstrate the effectiveness of the developed algorithm. Finally, in Section Ⅴ, the conclusion is drawn.

Ⅱ. PROBLEM FORMULATION

In this section, smart home energy systems with a battery will be described. The optimization objectives of our research will be defined and the corresponding principle of optimality will be introduced.

A. Smart Home Energy Systems

In this paper, the optimal battery control problem is treated as a discrete time problem with the time step of $1$ hour and it is assumed that the home load varies hourly. The home load $P_{Lt}$ and the electricity rate $C_t$ are periodic functions with the period $\lambda=24$ hours. The battery will make decisions to meet the demand of the home load, according to the real-time electricity rate. There are three operational modes for the battery of the home energy system, which are charging mode, idle mode, and discharging mode, respectively.

B. Battery Model

The battery model used in this work is based on [6], [44], [46] where battery efficiency is considered to extend the battery's lifetime as far as possible. Let ${E_{bt}}$ be the battery energy at time $t$ and let $\eta(\cdot)$ be the charging/discharging efficiency of the battery. Then, the battery model can be expressed as

 \begin{align} \label{equation1} {E_{b(t + 1)}} = {E_{bt}} - {P_{bt}} \times \eta ({P_{bt}}) \end{align} (1)

where ${P_{bt}}$ is the battery power output at time $t$. Let ${P_{bt}}>0$ denote battery discharging, ${P_{bt}}<0$ denote battery charging, and ${P_{bt}}=0$ denote battery idle. The charging/discharging efficiency of the battery is derived in [6], [44], [46] and is expressed as

 \begin{align} \label{equation1a} \eta ({P_{bt}}) = 0.898 - 0.173|{P_{bt}}|/{P_{\mathrm{rate}}} \end{align} (2)

where $P_{\mathrm{rate}}>0$ is the rated power output of the battery. To extend the battery's lifetime, two constraints need to be considered:

1) The storage limit is considered:

 \begin{align} \label{equation2} E_b^{\min } \le {E_{bt}} \le E_b^{\max } \end{align} (3)

where $E_b^{\min }$ and $E_b^{\max }$ are the minimum and maximum storage energy of the battery, respectively.

2) The charging and discharging power limits are considered:

 \begin{align} \label{equation3} P_b^{\min } \le {P_{bt}} \le P_b^{\max } \end{align} (4)

where $P_b^{\min }$ and $P_b^{\max }$ are the minimum and maximum charging/discharging powers of the battery, respectively.
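As an illustration of the battery model (1)-(4), the following Python sketch evaluates the efficiency (2), advances the stored energy by one hour according to (1), and checks the limits (3) and (4). The function names are hypothetical, and the symmetric power limit of $\pm 4.5$ kW is an assumption made only for this example. The values $P_{\mathrm{rate}}=10$ kW, $E_b^{\min}=20$ kWh, $E_b^{\max}=80$ kWh, and the initial energy of $65$ kWh are those used in Section Ⅳ.

```python
def efficiency(p_b: float, p_rate: float) -> float:
    """Charging/discharging efficiency eta(P_bt) as in (2)."""
    return 0.898 - 0.173 * abs(p_b) / p_rate


def battery_step(e_b: float, p_b: float, p_rate: float) -> float:
    """Battery energy one hour later, as in (1); p_b > 0 means discharging."""
    return e_b - p_b * efficiency(p_b, p_rate)


def is_feasible(e_b, p_b, e_min, e_max, p_min, p_max) -> bool:
    """Check the storage limit (3) and the power limit (4)."""
    return e_min <= e_b <= e_max and p_min <= p_b <= p_max


# Example values: 65 kWh stored, discharging at 4.5 kW, rated power 10 kW.
# The +/- 4.5 kW power limit is an assumption for this example.
e_next = battery_step(e_b=65.0, p_b=4.5, p_rate=10.0)
ok = is_feasible(e_next, 4.5, e_min=20.0, e_max=80.0, p_min=-4.5, p_max=4.5)
print(round(e_next, 4), ok)
```

Under these assumptions, discharging at $4.5$ kW for one hour lowers the stored energy from $65$ kWh to about $61.31$ kWh, because $\eta(4.5)\approx 0.820$.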

C. Optimization Objectives

Given the home load and real-time electricity rate, the objective of the optimal control is to find the optimal battery charging/discharging/idle control law at each time step which minimizes the total expense of the power from the grid while considering the battery limitations. To find the optimal control law, the load balance should be considered. Let $P_{Tt}$ be the power of the home load at time $t$ and let $P_{gt}$ be the power from the power grid.

To establish the equation of the home energy system, we introduce a delay in $P_{bt}$ and then we have $P_{Tt}=P_{b(t-1)} \eta(P_{b(t-1)}) + P_{gt}$. We denote $P_{L(t-1)}=P_{Tt}$ and then we can define the load balance as

 \begin{align} \label{equation4} P_{L(t-1)}=P_{b(t-1)} \eta(P_{b(t-1)})+ P_{gt}. \end{align} (5)

In this paper, the power flow from the battery to the grid is not permitted, i.e., we define $P_{gt}\geq 0$, to guarantee the power quality of the grid. To extend the lifetime of the battery, according to (3), we desire that the stored energy of the battery stays close to the middle of the storage limits, $E_b^o$, where

 \begin{align} \label{equation5} E_b^o = \frac{1}{2}(E_b^{\min } + E_b^{\max }). \end{align} (6)

In [45], a quadratic form performance index function was proposed, which was expected to be minimized

 \begin{align} \label{equation6} \sum\limits_{t = 0}^\infty \bigg( {m_1}{{({C_t}{P_{gt}})}^2} + {m_2}{{({E_{bt}} - E_b^o)}^2} + m_3 (P_{bt})^2\bigg) \end{align} (7)

where $m_1$, $m_2$, and $m_3$ are given positive constants. In the performance index function (7), the first term aims to minimize the total cost from the grid. The second term avoids fully charging/discharging of the battery and the third term aims to minimize the charging/discharging power of the battery.
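The load balance (5) and the summand of (7) can be evaluated for a single time step as in the following Python sketch. The function names are hypothetical, and the one-step delay in (5) is not modeled: the caller is assumed to pass a load and a battery power that belong to the same balance equation. The weights $m_1=0.9$, $m_2=0.5$, $m_3=0.5$ and the storage limits are those of Section Ⅳ, while the load, rate, and energy values are arbitrary illustrative numbers.

```python
def efficiency(p_b: float, p_rate: float) -> float:
    """Efficiency eta(P_b) as in (2)."""
    return 0.898 - 0.173 * abs(p_b) / p_rate


def grid_power(load: float, p_b: float, p_rate: float) -> float:
    """Grid power obtained by solving the load balance (5) for P_g."""
    return load - p_b * efficiency(p_b, p_rate)


def quadratic_stage_cost(rate, p_grid, e_b, p_b, e_mid, m1, m2, m3):
    """One summand of the quadratic performance index (7)."""
    return m1 * (rate * p_grid) ** 2 + m2 * (e_b - e_mid) ** 2 + m3 * p_b ** 2


e_mid = 0.5 * (20.0 + 80.0)          # middle of the storage limits, as in (6)
p_g = grid_power(load=5.0, p_b=4.5, p_rate=10.0)
cost = quadratic_stage_cost(rate=2.0, p_grid=p_g, e_b=65.0, p_b=4.5,
                            e_mid=e_mid, m1=0.9, m2=0.5, m3=0.5)
print(round(p_g, 4), round(cost, 4))
```

Note that `grid_power` can return a negative value when the battery delivers more than the load. Such a control violates $P_{gt}\geq 0$ and would have to be excluded by the caller.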

Ⅲ. ITERATIVE ADP ALGORITHM FOR BATTERY MANAGEMENT SYSTEM

In [45], a $Q$-learning based iterative ADP algorithm was developed to obtain the optimal control law of the battery, which minimized the performance index function (7). In this section, inspired by [45], considering the power efficiency and the power constraint of the battery, a new system function will be constructed and a new performance index function will be established.

A. System Transformations

First, we define the system states. Let $x_{1t}= {P_{gt}}$ and $P_{bt}=u_t$. According to (5), we have

 \begin{align} \label{equation6-x1} x_{1, t+1}=P_{Lt}-u_t\eta(u_t). \end{align} (8)

Let $x_{2t}={E_{bt}} - E_b^o$. According to (1), we can get

 \begin{align} \label{equation6-x2} x_{2, t+1}=x_{2, t}-u_t\eta(u_t). \end{align} (9)

Letting $x_t = [{x_{1t}}, {x_{2t}}]^{ {T}}$, the equation of the home energy system can be written as

 \begin{align} \label{equation6-a} {x_{t + 1}} = F({x_t}, {u_t}, t) = \left( {\begin{array}{*{20}{c}} {{P_{Lt}} - {u_t}\eta ({u_t})}\\ {{x_{2t}} - {u_t}\eta ({u_t})} \end{array}} \right). \end{align} (10)
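A minimal Python sketch of the transformed system (10) is given next. It maps the state $x_t=[x_{1t}, x_{2t}]^T$ and the control $u_t$ to $x_{t+1}$ for a given load $P_{Lt}$. The function names and the numerical values in the example are assumptions for illustration, and the rated power of $10$ kW is the value used in Section Ⅳ.

```python
def efficiency(u: float, p_rate: float = 10.0) -> float:
    """Efficiency eta(u) as in (2)."""
    return 0.898 - 0.173 * abs(u) / p_rate


def system_step(x, u, load_t, p_rate: float = 10.0):
    """State transition (10) with x = (x1, x2) = (P_g, E_b - E_b^o)."""
    v = u * efficiency(u, p_rate)      # effective battery power u * eta(u)
    return (load_t - v, x[1] - v)      # x1 does not depend on the previous x1


# Example: battery 15 kWh above the midpoint, discharging at 4.5 kW, load 5 kW.
x_next = system_step(x=(0.0, 15.0), u=4.5, load_t=5.0)
print(tuple(round(c, 4) for c in x_next))
```

The first component of the next state depends only on the load and the control, which is why the time-varying load $P_{Lt}$ makes $F$ explicitly dependent on $t$.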

In [45], the performance index function was defined as (7). However, in (7), the power constraint of the battery was not considered. In this paper, inspired by [37], a non-quadratic performance index function will be defined for the battery management system, which is expressed as

 \begin{align} \label{equation6-nn1} \sum\limits_{t = 0}^\infty \bigg(& {m_1}{{({C_t}{P_{gt}})}^2}+ {m_2}{{({E_{bt}} - E_b^o)}^2} \nonumber \\ &+ m_3 \int_0^{{P_{bt}}} {{{({\Phi ^{ - 1}}(s))}^T}ds}\bigg) \end{align} (11)

where $\Phi(\cdot)$ is a monotonic odd function with its first derivative bounded by a constant $\mathcal {M}$. An example is the hyperbolic tangent $\Phi(\cdot)=\mathrm{tanh}(\cdot)$. $R$ is a positive definite matrix. In (11), the upper bound of the parameter $s$ is not larger than $P_{bt}$. However, in (11), the power efficiency of the battery is not described. Thus, we further define a new performance index function as

 \begin{align} \label{equation6-nn2} \sum\limits_{t = 0}^\infty \bigg(& {m_1}{{({C_t}{P_{gt}})}^2} + {m_2}{{({E_{bt}} - E_b^o)}^2} \nonumber \\ &+ m_3 \int_0^{{P_{bt}\eta(P_{bt})}} {{{({\Phi ^{ - 1}}(s))}^T}ds}\bigg). \end{align} (12)
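For the scalar example $\Phi(\cdot)=\tanh(\cdot)$, the integral term in (11) and (12) has the closed form $\int_0^{v}\tanh^{-1}(s)\,ds = v\tanh^{-1}(v)+\frac{1}{2}\ln(1-v^2)$ for $|v|<1$, which can be verified by differentiation. The Python sketch below evaluates this closed form and compares it with a simple midpoint-rule quadrature. The function names are hypothetical, and the unscaled $\tanh$ (which bounds the argument by $1$) is used exactly as in the example above; any rescaling to the physical power limit is not specified here and is not assumed.

```python
import math


def tanh_penalty(v: float) -> float:
    """Closed form of the integral of artanh(s) from 0 to v, for |v| < 1."""
    if abs(v) >= 1.0:
        raise ValueError("the closed form is evaluated here only for |v| < 1")
    return v * math.atanh(v) + 0.5 * math.log(1.0 - v * v)


def midpoint_integral(f, a: float, b: float, n: int = 10000) -> float:
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return sum(f(a + (k + 0.5) * h) for k in range(n)) * h


closed = tanh_penalty(0.5)
numeric = midpoint_integral(math.atanh, 0.0, 0.5)
print(round(closed, 6), round(numeric, 6))
```

The penalty is even in $v$ and vanishes at $v=0$, so charging and discharging at the same effective power are penalized equally in this example.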

Remark 1: The smart home energy system (10) is different from the one in [45]. First, in equation (3) of [45], the efficiency of the battery was not considered in the power balance of the load, while in (10) of this paper, the battery efficiency is considered both in the power balance of the load and in the battery power balance. Second, the performance index function defined in [45] considered neither the power constraints nor the efficiency of the battery. This makes it possible for the optimal control law to exceed the max/min power of the battery, which may make the optimal control invalid. In this paper, both the max/min power of the battery and the efficiency of the battery are considered in the performance index function (12). Thus, the models of the system (10) and performance index function (12) are more reasonable.

B. Optimality Principle

Let $\underline{u}_t=(u_t, u_{t+1}, \ldots)$ denote the control sequence from $t$ to $\infty$. Let $M_t = \left[ {\begin{array}{*{20}{c}} {{m_1}C^2_t}&{0} \\ {0}&{{m_2}} \\ \end{array}} \right]$. The performance index function (12) can be written as

 \begin{align} \label{equation7} J({x_t}, \underline{u}_t, t) = \sum\limits_{i = t}^\infty {\gamma^i U(x_i, u_i, i)} \end{align} (13)

where the utility function is expressed as

 \begin{align} \label{equation7-xn1} U(x_t, u_t, t) = {x^T_t}M_t{x_t} + m_3 \int_0^{{u_t}\eta ({u_t})} {{{({\Phi ^{ - 1}}(s))}^T}ds}. \end{align} (14)

Define the control sequence set as $\underline{\mathfrak{U}}_t=\big\{\underline{u}_t\colon \underline{u}_t=(u_t, u_{t+1}, \ldots), \, u_{t+i}\in { \mathbb R}^m, i=0, 1, \ldots\big\}$. Then, for an arbitrary control sequence $\underline{u}_t \in \underline{\mathfrak{U}}_t$, the optimal performance index function can be defined as

 \begin{align} \label{equation8} J^*(x_t, t)=\inf_{\underline{u}_t} \left\{ J(x_t, \underline{u}_t, t)\colon \underline{u}_t\in \underline{\mathfrak{U}}_{t}\right\}. \end{align} (15)

According to Bellman's principle of optimality [47], we can obtain the following discrete-time HJB equation

 \begin{align}\label{equation9} {J^*}({x_t}, t) = \mathop {\inf }\limits_{{u_{t}}} \big\{ U({x_t}, {u_t}, t) + {J^*}({x_{t + 1}}, t + 1)\big\}. \end{align} (16)

The optimal sequential control law can be expressed as

 \begin{align} \label{equation10} u^*(x_t, t)=\arg \mathop {\inf }\limits_{{u_{t}}} \big\{U(x_t, u_t, t)+J^*(x_{t+1}, t+1)\big\}. \end{align} (17)

Remark 2: We can see that the home energy system (10) is a nonlinear dynamic system. The optimal battery control is actually an infinite horizon optimal control problem for nonlinear system with a non-quadratic performance index function. In this situation, many static mathematical programming methods, such as linear programming, are not effective. Dynamic programming is a powerful method to solve these problems. However, if we adopt the traditional dynamic programming method to obtain the optimal performance index function one step at a time, then we have to face the ''curse of dimensionality''. Thus, a new iterative ADP algorithm will be developed in this paper.

C. Derivations of the Iterative ADP Algorithm

From (10) we know that the battery management system is time-varying. It means that the control law is also time-varying. This makes the controller design difficult. To overcome this difficulty, in [45], [48], by defining a new sequence of control for a period, the time-varying optimal control was transformed into a time-invariant one, which significantly relaxed the computation burden. In this paper, inspired by [45], [48], we will define the sequence of control for a period, where the constraints of the battery are considered. For any $k=0, 1, \ldots$, we define $\mathcal {U}_k$ as the control sequence from $k$ to $k+\lambda-1$, i.e., $\mathcal {U}_k=(u_k, u_{k+1}, \ldots, u_{k+\lambda-1})$. We can define a new utility function as

 \begin{align} \label{equation14} \Lambda \, (x_k, \mathcal {U}_k) = \sum\limits_{\theta = 0}^{\lambda - 1} {\, U(x_{k + \theta}, u_{k + \theta}, \theta).} \end{align} (18)

Then, for any $k=0, 1, \ldots$, the optimal performance index function is obtained as

 \begin{align} \label{equation16} {J^*}({x_k}) = \mathop {\min }\limits_{{{\cal U}_k}} \big\{ \Lambda ({x_k}, {{\cal U}_k}) + \bar \gamma {J^*}({x_{k + \lambda }})\big\} \end{align} (19)

where $\bar{\gamma}=\gamma^{\lambda}$. The optimal sequential control law sequence can be expressed by

 \begin{align} \label{equation16a}{{\cal U}^*}({x_k}) = \arg \mathop {\min }\limits_{{\mathcal {U}_k}} \{ \Lambda {\mkern 1mu} ({x_k}, {{\cal U}_k}) + \bar \gamma {J^*}({x_{k + \lambda }})\}. \end{align} (20)

Define an iteration index $i=0, 1, \ldots$. The iterative value function is defined as

 \begin{align} \label{equation17a} {V_{i+1}}({x_k}) = \mathop {\min }\limits_{{{\cal U}_k}} \{ \Lambda {\mkern 1mu} ({x_k}, {{\cal U}_k}) + \bar \gamma {V_i}({x_{k + \lambda }})\} \end{align} (21)

where $V_0(x_k)=\Psi(x_k)$ and $\Psi(x_k)$ is a positive semi-definite function. The iterative sequential control law sequence $\mathcal {U}_i$ can be computed as follows

 \begin{align} \label{equation17} \mathcal {U}_i(x_k) = \arg\mathop {\min }\limits_{{{\cal U}_k}} \{ \Lambda {\mkern 1mu} ({x_k}, {{\cal U}_k}) + \bar \gamma {V_i}({x_{k + \lambda }})\}. \end{align} (22)

For any $i=0, 1, \ldots$, we define a new iteration index $j=0, 1, \ldots, 23$. We can get

 \begin{align} \label{equation23} V_i^{j + 1}({x_k}) &= \mathop {\min }\limits_{{u_k}} \{ U({x_k}, {u_k}, j) + \gamma V_i^j({x_{k + 1}})\} \nonumber \\ &= U({x_k}, u_i^j({x_k}), j) + \gamma V_i^j({x_{k + 1}}) \end{align} (23)

and

 \begin{align} \label{equation23-1}u^j_i(x_k) = \arg \mathop {\min }\limits_{{u_k}} \{ U({x_k}, {u_k}, j) + \gamma V_i^j({x_{k + 1}})\} \end{align} (24)

where we let $V_i^{0}({x_{k}})= V_{i}(x_k)$ and $V_{i+1}(x_k)=V_i^{24}({x_{k}})$, i.e., the value function obtained after all $\lambda=24$ local steps $j=0, 1, \ldots, 23$.
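The structure of the local iteration (23)-(24) can be sketched on a toy problem in Python. This is only a tabular illustration under strong assumptions: the state is a single integer (a stand-in for the energy offset), the control takes three values, the efficiency is ignored, and the utility is a simple quadratic that does not depend on $j$. None of these simplifications come from the algorithm in this paper, which uses neural networks on a continuous state space, and all names are hypothetical.

```python
def local_iteration(v_init, states, controls, utility, step, gamma, n_steps):
    """Apply n_steps updates of the form (23)-(24) on a finite state set.

    v_init maps each state to the starting value; the function returns the
    final value table and the list of control laws, one per local step j.
    """
    value = dict(v_init)
    laws = []
    for j in range(n_steps):
        new_value, law = {}, {}
        for x in states:
            # Cost of applying u now and then following the previous value table.
            def cost(u, x=x, j=j):
                return utility(x, u, j) + gamma * value[step(x, u)]

            best_u = min(controls, key=cost)   # arg min, as in (24)
            law[x] = best_u
            new_value[x] = cost(best_u)        # minimum, as in (23)
        value = new_value
        laws.append(law)
    return value, laws


# Toy problem (all values are assumptions for illustration).
states = [-2, -1, 0, 1, 2]
controls = [-1, 0, 1]


def toy_utility(x, u, j):
    return x * x + 0.5 * u * u     # j is unused in this toy utility


def toy_step(x, u):
    return max(-2, min(2, x - u))  # x' = x - u, clipped to the state set


v_final, laws = local_iteration({x: 0.0 for x in states}, states, controls,
                                toy_utility, toy_step, gamma=1.0, n_steps=2)
print(v_final[2], laws[1][2])
```

In this toy run, the first local step selects the idle control everywhere because the starting value is zero, while the second step discharges from the highest state to reduce the future quadratic cost.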

The system function is defined as

 \begin{align} \label{equation23-an1}{x_{k + 1}} = f({x_k}, j) + g{v_k} \end{align} (25)

where $f({x_k}, j)=[{P_{L(\lambda-1-j)}}, {x_{2k}}]^T$, $g=[-1,-1]^T$, and $v_k=u_k\eta(u_k)$. According to the principle of optimality, for any $i=0, 1, \ldots$ and $j=0, 1, \ldots, 23$, the iterative control law $v^j_i(x_k)$ satisfies

 \begin{align} \label{equation23-nn1}\frac{{\partial V_i^{j + 1}({x_k})}}{{\partial v_i^j({x_k})}} = 0. \end{align} (26)

Then, we can obtain that

 \begin{align} \label{equation23-nn2}v_i^j(x_k) = - \Phi \bigg(\frac{1}{2}m_3^{ - 1}{g^{{T}}}\frac{{\mathrm{d}{V^j_i}({x_{k + 1}})}}{{\mathrm{d}{x_{k + 1}}}}\bigg). \end{align} (27)

We can see that for any $i=0, 1, \ldots$, we use $j=0, 1, \ldots, \lambda-1$ iterations to obtain the optimal sequential control law sequence for a day.
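The right-hand side of (27) can be evaluated as in the following Python sketch for $\Phi=\tanh$ and $g=[-1,-1]^T$. Two points are assumptions of this example: the gradient of $V_i^j$ is taken from a quadratic stand-in $V(x)=x^T\Pi x$ with the matrix $\Pi$ used for the initial function in Section Ⅳ, and the gradient is evaluated at a supplied point $x_{k+1}$. Since $x_{k+1}$ itself depends on $v_k$ through (25), relation (27) is implicit, and the sketch does not solve that implicit equation. All function names are hypothetical.

```python
import math


def quadratic_gradient(pi, x):
    """Gradient 2 * Pi * x of V(x) = x^T Pi x for a symmetric matrix Pi."""
    return tuple(2.0 * sum(p * xj for p, xj in zip(row, x)) for row in pi)


def control_law(grad_v_next, m3, g=(-1.0, -1.0), phi=math.tanh):
    """Right-hand side of (27): -Phi(0.5 * g^T (dV/dx_{k+1}) / m3)."""
    inner = sum(gi * di for gi, di in zip(g, grad_v_next))
    return -phi(0.5 * inner / m3)


PI = ((0.2894, 0.6852),
      (0.6852, 2.3505))               # matrix from Section IV, used as a stand-in
grad = quadratic_gradient(PI, (0.1, -0.2))
v = control_law(grad, m3=0.5)
print(round(v, 4))
```

Whatever the gradient is, the returned value lies in $[-1, 1]$ because of the saturation of $\tanh$. This is the mechanism by which the non-quadratic term bounds the iterative control law.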

Remark 3: From (27), we can see that the expression of the iterative sequential control law in this paper is different from the one in [45]. In [45], the iterative sequential control law was obtained by minimizing a $Q$ function, such as

 \begin{align} \label{equation23-1-TIE2015}u^j_i(x_k) = \arg \mathop {\min }\limits_{u_k} Q^j_i(x_k, u_k). \end{align} (28)

Generally, the $Q$ function is not an analytical function. In this situation, the iterative control law in [45] does not possess an analytical expression. In this paper, according to the iterative value function $V^j_i(x_k)$, the expressions of the iterative control law can be obtained according to (27), which is simpler than the one in [45]. This is a merit of the algorithm developed in this paper. On the other hand, in [45], the iterative control law could be obtained directly by minimizing the iterative $Q$ function, where the system function was not required. However, in this paper, to obtain the expression of the iterative control law, the system function (25) is required. This is a disadvantage of the method in this paper.

Now, we let $v_k=\mathcal {G}(u_k)=u_k\eta(u_k)$. Then, we can obtain

 \begin{align} \label{equation7-xn2} u_k=\mathcal {G}^{-1}(v_k). \end{align} (29)

Hence, for any $i\, =\, 0, 1, \ldots$, and $j= 0, 1, \ldots, 23$, we can get $u_i^j(x_k)\, =\, \mathcal {G}^{-1}(v_i^j(x_k))$. For a given $v_i^j(x_k)$, if there exist several iterative control laws $u_i^j(x_k)$ that satisfy (29), we use the iterative control law that minimizes the norm $\|u_i^j(x_k)\|$.
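With the efficiency (2), the map $v=\mathcal{G}(u)=u\eta(u)$ is odd and quadratic in $|u|$, so (29) can be inverted in closed form. The Python sketch below returns the root of smallest magnitude, in line with the minimum-norm rule stated above. The function names are hypothetical, and the sketch assumes the specific efficiency function (2).

```python
import math


def g_map(u: float, p_rate: float) -> float:
    """v = G(u) = u * eta(u) with eta as in (2)."""
    return u * (0.898 - 0.173 * abs(u) / p_rate)


def g_inverse(v: float, p_rate: float) -> float:
    """Smallest-magnitude u with G(u) = v, or ValueError if none exists."""
    a, b = 0.898, 0.173 / p_rate
    disc = a * a - 4.0 * b * abs(v)    # discriminant of b*|u|^2 - a*|u| + |v| = 0
    if disc < 0.0:
        raise ValueError("no real u satisfies u * eta(u) = v")
    # Smaller root, written in a form that avoids cancellation for small |v|.
    u_abs = 2.0 * abs(v) / (a + math.sqrt(disc))
    return math.copysign(u_abs, v)


u_back = g_inverse(g_map(4.5, 10.0), 10.0)
print(round(u_back, 6))
```

For $P_{\mathrm{rate}}=10$ kW, the smaller root covers all $|u|$ up to roughly $26$ kW, so within the rated power range the inverse is unique in practice.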

D. Properties of the Iterative ADP Algorithm

In this subsection, the convergence and optimality properties of the proposed iterative ADP algorithm will be developed. It will be shown that the iterative value function and iterative control law will converge to their optimums as the iteration index $i$ increases to infinity.

Theorem 1: For $i\, =\, 0, 1, \ldots$, and $j=0, 1, \ldots, 23$, let the iterative value function ${V^j_{i}}(x_k)$ and the iterative control law $v_i^{j}(x_k)$ be obtained by (23) and (24). Then, the iterative control law sequence can be expressed by

 \begin{align} \label{equation24} {{\cal U}_i}(x_k) = \{ {u^{23}_i}(x_k), {u^{22}_i}(x_{k+1}), \ldots , {u^{0}_{i}}(x_{k+23})\}. \end{align} (30)

Proof: The statement can be proven by mathematical induction. For any $i=0, 1, \ldots$, we have

 \begin{align} \label{equation30b} V_i^{j + 1}&(x_k) \nonumber \\ = &\mathop {\min }\limits_{u_k} \big(U(x_k, u_k, j) + \gamma V_i^j({x_{k+1} })\big) \nonumber\\=& \mathop {\min }\limits_{u_k} \bigg(U(x_k, u_k, j) + \gamma \mathop {\min }\limits_{{u_{k+1} }} \Big(U({x_{k+1}}, {u_{k+1}}, j - 1) \nonumber \\& + \cdots + \gamma \mathop {\min }\limits_{{u_{k+j}}} \big(U({x_{k+j}}, {u_{k+j}}, 0) + \gamma V_i^0 ({x_{k+j+1}})\big)\Big)\bigg) \nonumber\\ = &\mathop {\min }\limits_{(u_k, {u_{k+1}}, \ldots, {u_{k+j}})} \bigg(\sum\limits_{l = 0}^j {\gamma^l U({x_{k+l}}, {u_{k+l}}, j - l)} \nonumber \\ &+ \gamma^{j+1} V_i \big({x_{k+j + 1}}\big)\bigg). \end{align} (31)

First, let $j=0$. According to (24), we can derive that

 \begin{align} \label{equation30b-n1} u^0_i(x_k) = \arg \mathop {\min }\limits_{{u_k}} \{ U({x_k}, {u_k}, 0) +\gamma V_i^0({x_{k + 1}})\} \end{align} (32)

where ${x_{k + 1}} = f({x_k}, 0) + g{v_k}$. The function $f({x_k}, 0)$ can be expressed as $f({x_k}, 0)=[{P_{L(23)}}, {x_{2k}}]^{{T}}$. For $j=0, 1, \ldots, 23$, (24) holds. Hence, the iterative control law sequence in (31) can be expressed as

 \begin{align} \label{equation30c} ({u_k}, &\, {u_{k + 1}}, \ldots, {u_{k + j}})\nonumber \\ & = (u_i^{23}({x_k}), u_i^{22}({x_{k + 1}}), \ldots, u_i^{\lambda - 1 - j}({x_{k + j}})). \end{align} (33)

Let $j=23$. We can obtain that (30) holds.

From Theorem 1, for $i=0, 1, \ldots$, we can say that the total cost in each period can be minimized by the iterative sequential control law sequence $\mathcal {U}_i(x_k)$ according to the local iteration (23)-(27). In [18], [20], an effective ''functional bound'' method was proposed by Rantzer for the iterative ADP algorithm. Next, inspired by [18], [20], the convergence property of the iterative value function $V_i(x_k)$ will be developed.

Theorem 2: For $i=0, 1, \ldots$, let $V_{i+1}(x_k)$ and $\mathcal {U}_i(x_k)$ be obtained by (21) and (22). Then, the iterative value function $V_{i}(x_k)$ converges to the optimal performance index function, i.e.,

 \begin{align} \label{equation31} \mathop {\lim }\limits_{i \to \infty } V_{i}(x_k)=J^*(x_k). \end{align} (34)

Proof: According to the control law sequence $\mathcal {U}_k=(u_k, u_{k+1}, \ldots, u_{k+\lambda-1})$, we can obtain

 \begin{align} \label{eq0303aa-1} x_{k+\lambda}=\mathcal {F}(x_k, \mathcal {U}_k). \end{align} (35)

Inspired by [18], [20], assume that there are constants $\psi_1$, $\psi_2$, and $\chi$, such that $0\leq\psi_1\leq1\leq\psi_2<\infty$ and $0<\chi<\infty$, which satisfy $J^*(\mathcal {F}(x_k, \mathcal {U}_k)) \leq \chi U(x_k, \mathcal {U}_k)$, and $\psi_1J^*(x_k)\leq V_0(x_k) \leq \psi_2 J^*(x_k).$

For $i=0, 1, \ldots$, the iterative value function $V_i(x_k)$ satisfies

 \begin{align} \label{eq0305-add1}\left(1 + \frac{{\psi_1 - 1}}{{{{(1 + {\chi ^{\, - 1}})}^i}}}\right)&J^*(x_k) \leq {V}_i(x_k) \nonumber \\ &\leq \left(1 + \frac{{\psi_2 - 1}}{{{{(1 + {{\chi} ^{\, - 1}})}^i}}} \right)J^*(x_k). \end{align} (36)

Inequality (36) can be proven by mathematical induction. Let $i=0$. From (21), we have

 \begin{align} \label{eq0306} {V_1}({x_k}) =& \mathop {\min }\limits_{{\mathcal {U}_k}} \left\{ {U({x_k}, {\mathcal {U}_k}) + \bar \gamma {V_0}({x_{k + \lambda}})} \right\} \nonumber \\ \ge&\mathop {\min }\limits_{{\mathcal {U}_k}} \left\{ \Big(1 + \chi\frac{{ \psi_1-1 }}{{1 + \chi }}\Big)U({x_k}, {\mathcal {U}_k}) \right.\nonumber \\ &+ \bar \gamma \bigg(\psi_1 - \frac{{ \psi_1-1}}{{1 + \chi }}\bigg){J^*}({x_{k + \lambda}})\left. \right\} \nonumber \\ \ge&\left( {1 + \frac{{\psi_1 - 1}}{{(1 + {\chi^{ \, - 1}})}}} \right)\mathop {\min }\limits_{{\mathcal {U}_k}} \left\{ {U({x_k}, {\mathcal {U}_k}) + \bar \gamma {J^*}({x_{k + \lambda}})} \right\} \nonumber\\ =&\left( {1 + \frac{{\psi_1 - 1}}{{(1 + {\chi ^{\, - 1}})}}} \right){J^*}({x_k}). \end{align} (37)

Following a similar procedure, we can prove inequality (36) using mathematical induction for $i=0, 1, \ldots$. Letting $i\rightarrow\infty$, we can obtain

 \begin{align} \label{eq0312-aa1} \mathop {\lim }\limits_{i \to \infty }& \left\{ {\left( {1 + \frac{{\psi_1 - 1}}{{{{(1 + {\chi^{ - 1}})}^i}}}} \right){J^*}({x_k})} \right\}\nonumber \\ &= \mathop {\lim }\limits_{i \to \infty } \left\{ {\left( {1 + \frac{{\psi_2 - 1}}{{{{(1 + {\chi^{\, - 1}})}^i}}}} \right){J^*}({x_k})} \right\} \nonumber \\ &= \, {J^*}({x_k}). \end{align} (38)
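To see how quickly the two bounds in (36) close in on $J^*(x_k)$, the Python sketch below evaluates the lower and upper multipliers of $J^*(x_k)$ for a given iteration index. The function name is hypothetical, and the values of $\psi_1$, $\psi_2$, and $\chi$ are arbitrary illustrative choices that satisfy the assumptions of the proof. They are not estimated from the battery system.

```python
def value_bounds(i: int, psi1: float, psi2: float, chi: float):
    """Multipliers of J* in the lower and upper bounds of (36)."""
    shrink = (1.0 + 1.0 / chi) ** i
    return 1.0 + (psi1 - 1.0) / shrink, 1.0 + (psi2 - 1.0) / shrink


# Illustrative constants: 0 <= psi1 <= 1 <= psi2 and 0 < chi < infinity.
for i in (0, 1, 5, 15):
    lower, upper = value_bounds(i, psi1=0.0, psi2=3.0, chi=1.0)
    print(i, round(lower, 6), round(upper, 6))
```

Both multipliers approach $1$ geometrically at the rate $(1+\chi^{-1})^{-1}$ per iteration, so a smaller $\chi$ gives faster convergence of the bounds.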

Corollary 1: For $i=0, 1, \ldots$, let $V_{i+1}(x_k)$ and $\mathcal {U}_i(x_k)$ be obtained by (21) and (22). Then the iterative control law sequence $\mathcal {U}_i(x_k)$ converges to the optimal control law sequence, i.e., $\mathop {\lim }\limits_{i \to \infty } {{\cal U}_i}(x_k) = {{\cal U}^*}(x_k).$

Remark 4: In [45], for a dual $Q$-learning based iterative ADP algorithm, inspired by [18], [20], it was proven that the iterative $Q$ function converges to the optimal $Q$ function as the iteration index $i \rightarrow \infty$. In this paper, inspired by [18], [20], we have shown that the iterative value function $V_i(x_k)$ with the constraints of the battery can also converge to the optimal performance index function $J^*(x_k)$. We should point out that the implementations of the above two iterative ADP algorithms are different. First, for the dual iterative $Q$-learning algorithm in [45], the system model is not necessary. This is a remarkable advantage of the dual iterative $Q$-learning algorithm. However, in each iteration of the dual iterative $Q$-learning algorithm, it is required to search both the state and control spaces to update the iterative $Q$ function. Hence in this case, the computation load is high. In the proposed iterative ADP algorithm (21)-(27), we can see that the iterative value function can be updated in the state space only. Thus, the computation burden of the present algorithm is reduced. This is an advantage of the iterative ADP algorithm in this paper. On the other hand, from (27) we can see that the system model is necessary to obtain the iterative control law. This is a disadvantage of the algorithm.

Ⅳ. SIMULATION ANALYSIS

In this section, the performance of the iterative ADP algorithm with constraints will be examined by numerical experiments. The simulation results for the developed iterative ADP algorithm with constraints will be compared with the dual $Q$-learning algorithm in [45]. The profiles of the home load demand (kW) and the real-time electricity rate (in cents) are taken from [44], where the home load demand and the real-time electricity rate for four weeks ($672$ hours) are shown in Figs. 1(a) and (c), respectively. We can see that the home load demand and the real-time electricity rate are quasi-periodic functions and the periods are $\lambda =24$ hours. According to the functions of the home load demand and the real-time electricity rate for $672$ hours, we can obtain the average trajectories of the home load demand and the electricity rate which are shown in Figs. 1(b) and (d).

 Fig. 1. Home load demand and electricity rate. (a) Home load demand for 672 hours. (b) Average home load demand in a day. (c) Real-time electricity rate for 672 hours. (d) Average electricity rate in a day.
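One simple way to obtain a daily average trajectory from an hourly profile of $672$ samples is to average the samples that share the same hour of the day, as in the Python sketch below. The function name is hypothetical and the input series is synthetic; whether the averages in Fig. 1 were computed in exactly this way is an assumption.

```python
def daily_average(series, period: int = 24):
    """Average an hourly series over whole periods, hour by hour."""
    if len(series) % period != 0:
        raise ValueError("series length must be a multiple of the period")
    days = len(series) // period
    return [sum(series[d * period + h] for d in range(days)) / days
            for h in range(period)]


# Synthetic profile: 28 identical days whose hourly value equals the hour index.
profile = [float(h) for _ in range(28) for h in range(24)]
avg = daily_average(profile)
print(len(profile), avg[:3])
```

The resulting $24$-point trajectory can then serve as the periodic load $P_{Lt}$ or rate $C_t$ with period $\lambda=24$ assumed in Section Ⅱ.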

Let the capacity of the battery be chosen as in [45], which is $100$ kWh. Let the lower and upper storage limits of the battery be $E_b^{\min }=20$ kWh and $E_b^{\max }= 80$ kWh, respectively, which are the same as those in [45]. The rated power output of the battery, namely the maximum charging/discharging rate, is $10$ kW. Let the initial battery energy be $65$ kWh. Let the performance index function be expressed as in (7), where we set $m_1=0.9$, $m_2=0.5$ and $m_3=0.5$. Let the initial function $\Psi(x_k)=x_k^{ {T}}\Pi x_k$, where $\Pi$ is arbitrarily chosen as a positive semi-definite matrix with

 \begin{align} \Pi = \left[{\begin{array}{*{20}{c}} {0.2894}&{0.6852}\\ {0.6852}&{2.3505} \end{array}} \right].\nonumber \end{align}

Neural networks are used to implement the iterative ADP algorithm. There are two neural networks, namely the critic network and the action network. Both neural networks are chosen as three-layer back-propagation (BP) networks. The structures of the critic and action networks are chosen as 2-8-1 and 2-8-1, respectively. The training methods of the neural networks are shown in [45], [49] and omitted here. Now we let the real maximum charging/discharging power be $4.5$ kW. According to these data, we implement the iterative ADP algorithm (21)-(27) for $15$ iterations to guarantee the computation precision $\varepsilon=10^{-2}$. The plots of the iterative value function are shown in Fig. 2.

 Fig. 2. The trajectories of the iterative value function.

From Fig. 2, we can see that under the power constraints of the battery, the iterative value function converges to the optimum in $15$ iterations. The optimal battery charging/discharging management is shown in Fig. 3. From Fig. 3, we can see that the charging and discharging power of the battery cannot exceed the bound of $4.5$ kW, which shows the effectiveness of the developed algorithm. The plot of battery energy is shown in Fig. 4. From Fig. 4, we can see that the energy of the battery does not exceed the maximum energy of the battery.

 Fig. 3 Optimal management of battery with constraints in four weeks.
 Fig. 4 Optimal battery energy with constraints in four weeks.
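The two properties read off from Figs. 3 and 4 can also be checked programmatically on logged trajectories. The sketch below assumes the limits stated in the text ($4.5$ kW, $20$ kWh and $80$ kWh); the function name and the sample values are hypothetical and are not data from the simulation.

```python
U_MAX = 4.5                  # kW, maximum charging/discharging power
E_MIN, E_MAX = 20.0, 80.0    # kWh, storage limits of the battery

def violations(power_kw, energy_kwh):
    """Return the time indices at which the power or energy limits are violated."""
    bad_power = [k for k, p in enumerate(power_kw) if abs(p) > U_MAX]
    bad_energy = [k for k, e in enumerate(energy_kwh)
                  if not E_MIN <= e <= E_MAX]
    return bad_power, bad_energy

# Made-up samples for illustration only; the last entry violates both limits.
power = [4.5, -3.0, 0.0, 16.0]
energy = [65.0, 69.5, 66.5, 82.5]
bad_power, bad_energy = violations(power, energy)
```

A trajectory that satisfies the constraints returns two empty lists.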

In [45], a dual $Q$-learning based iterative ADP algorithm was developed to solve the optimal battery management problem, where the constraints on the power of the battery were not considered. The performance index function is defined as in (7), where the parameters are kept unchanged. The initial $Q$ function is chosen as $Q_0=[x_k, u_k]^T \bar \Pi [x_k, u_k]$, where $\bar \Pi$ is given by

 \begin{align} \bar \Pi = \left[{\begin{array}{*{20}{c}} {0.3479}&{0.8594}&{0.4914}\\ {0.8594}&{2.8694}&{1.4634}\\ {0.4914}&{1.4634}&{4.1274} \end{array}} \right]. \nonumber \end{align}
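For a quadratic $Q_0$ of this form, the control that minimizes $Q_0$ for a fixed state has a closed form, which the following sketch illustrates. It assumes that $[x_k, u_k]$ denotes the stacked vector of the two-dimensional state and the scalar control; the function names and the sample state are hypothetical, and the computation is only an illustration of the unconstrained quadratic case, not the dual $Q$-learning iteration of [45].

```python
# Initial Q function Q_0 = z^T PI_BAR z with z = [x_1, x_2, u].
PI_BAR = [[0.3479, 0.8594, 0.4914],
          [0.8594, 2.8694, 1.4634],
          [0.4914, 1.4634, 4.1274]]

def q0(x, u):
    """Evaluate Q_0 at a two-dimensional state x and a scalar control u."""
    z = [x[0], x[1], u]
    return sum(z[i] * PI_BAR[i][j] * z[j] for i in range(3) for j in range(3))

def greedy_control(x):
    """Unconstrained minimizer of q0(x, u) over u for a fixed state x.

    Setting d/du [PI_uu u^2 + 2 u (PI_ux x) + const] = 0 gives
    u = -(PI_ux x) / PI_uu, which is a minimum because PI_uu > 0.
    """
    cross = PI_BAR[2][0] * x[0] + PI_BAR[2][1] * x[1]
    return -cross / PI_BAR[2][2]

x_example = [1.0, -0.5]   # hypothetical state
u_star = greedy_control(x_example)
```

Nothing in this minimizer bounds the magnitude of the control, which is consistent with the absence of power constraints in the comparison method.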

We implement the dual $Q$-learning based iterative ADP algorithm [45] for $15$ iterations, which makes the iterative $Q$-learning algorithm converge to the optimum. The optimal battery management with no constraints is shown in Fig. 5. From Fig. 5, we can see that if the power constraints are not considered, the maximum battery power reaches $16$ kW. With the battery constraints added, however, the power of the battery does not exceed $4.5$ kW (Fig. 3), which prevents large charging/discharging power. The energy of the battery under the unconstrained management is shown in Fig. 6. In Fig. 6, we can see that the battery reaches its maximum and minimum energies under the unconstrained optimal battery management. In Fig. 4, by contrast, the battery hardly reaches its maximum and minimum energies. Hence, the proposed algorithm is preferred for the long-term operation of the battery. Fig. 7 shows the real-time cost comparisons in a day. We can see that the real-time cost using the iterative ADP with constraints is smaller than that using the iterative ADP with no constraints, which shows the effectiveness of the developed method.

 Fig. 5 Optimal management of batteries with no constraints in four weeks.
 Fig. 6 Optimal battery energy with no constraints.
 Fig. 7 Numerical comparisons in a day.
Ⅴ. CONCLUSION

In this paper, a new iterative ADP algorithm is developed to solve the optimal battery control problem for smart home energy systems. Considering the efficiency and the charging/discharging constraints of the battery, the model of the smart home energy system is constructed. A new non-quadratic form performance index function is established, which guarantees that the amplitude of the iterative control does not exceed the maximum charging/discharging power of the battery. An iterative ADP algorithm is developed, where in each iteration the expression of the iterative control law can be obtained. The convergence properties of the iterative ADP algorithm are given, which guarantee that the iterative value function and the iterative control law both reach the optimal ones. Finally, simulation and comparison results are given to illustrate the performance of the presented method.

In this paper, to extend the lifetime of the battery, we aim to keep the stored energy of the battery close to the middle of the storage limits and to avoid large charging/discharging power of the battery. On the other hand, the charging/discharging frequency of the battery is not considered in this paper. Since frequent quick charging/discharging may also damage the battery, how to avoid frequent charging/discharging of the battery will be our main future research topic.

References
 [1] H. P. Li, C. Z. Zang, P. Zeng, H. B. Yu, and Z. W. Li, "A stochastic programming strategy in microgrid cyber physical energy system for energy optimal operation," IEEE/CAA J. Automat. Sin., vol. 2, no. 3, pp. 296-303, Jul. 2015.
 [2] G. Chen and E. N. Feng, "Distributed secondary control and optimal power sharing in microgrids," IEEE/CAA J. Automat. Sin., vol. 2, no. 3, pp. 304-312, Jul. 2015.
 [3] W. Wei, F. Liu, and S. W. Mei, "Energy pricing and dispatch for smart grid retailers under demand response and market price uncertainty," IEEE Trans. Smart Grid, vol. 6, no. 3, pp. 1364-1374, 2015. DOI: 10.1109/TSG.2014.2376522
 [4] A. Szumanowski and Y. H. Chang, "Battery management system based on battery nonlinear dynamics modeling," IEEE Trans. Veh. Technol., vol. 57, no. 3, pp. 1425-1432, 2008. DOI: 10.1109/TVT.2007.912176
 [5] H. Rahimi-Eichi, U. Ojha, F. Baronti, and M. Y. Chow, "Battery management system: an overview of its application in the smart grid and electric vehicles," IEEE Ind. Electron. Maga., vol. 7, no. 2, pp. 4-16, Jun. 2013.
 [6] T. Y. Lee, "Operating schedule of battery energy storage system in a time-of-use rate industrial user with wind turbine generators: a multipass iteration particle swarm optimization approach," IEEE Trans. Energy Convers., vol. 22, no. 3, pp. 774-782, Sep. 2007.
 [7] P. J. Werbos, "Advanced forecasting methods for global crisis warning and models of intelligence," Gen. Syst. Yearbook, vol. 22, pp. 25-38, Jan. 1977.
 [8] P. J. Werbos, W. T. Miller, and R. S. Sutton, Neural Networks for Control. Cambridge: MIT Press, 1991.
 [9] M. Boaro, D. Fuselli, F. De Angelis, D. R. Liu, Q. L. Wei, and F. Piazza, "Adaptive dynamic programming algorithm for renewable energy scheduling and battery management," Cognit. Comput., vol. 5, no. 2, pp. 264-277, Jun. 2013.
 [10] D. Fuselli, F. De Angelis, M. Boaro, S. Squartini, Q. L. Wei, D. R. Liu, and F. Piazza, "Action dependent heuristic dynamic programming for home energy resource scheduling," Int. J. Electr. Power Energy Syst., vol. 48, pp. 148-160, Jun. 2013.
 [11] D. Molina, G. K. Venayagamoorthy, J. Q. Liang, and R. G. Harley, "Intelligent local area signals based damping of power system oscillations using virtual generators and approximate dynamic programming," IEEE Trans. Smart Grid, vol. 4, no. 1, pp. 498-508, Mar. 2013.
 [12] J. Q. Liang, D. D. Molina, G. K. Venayagamoorthy, and R. G. Harley, "Two-level dynamic stochastic optimal power flow control for power systems with intermittent renewable generation," IEEE Trans. Power Syst., vol. 28, no. 3, pp. 2670-2678, Aug. 2013.
 [13] S. Mohagheghi, G. K. Venayagamoorthy, and R. G. Harley, "Fully evolvable optimal neurofuzzy controller using adaptive critic designs," IEEE Trans. Fuzzy Syst., vol. 16, no. 6, pp. 1450-1461, Dec. 2008.
 [14] Y. F. Tang, H. B. He, J. Y. Wen, and J. Liu, "Power system stability control for a wind farm based on adaptive dynamic programming," IEEE Trans. Smart Grid, vol. 6, no. 1, pp. 166-177, Jan. 2015.
 [15] Z. Zhang and D. B. Zhao, "Clique-based cooperative multiagent reinforcement learning using factor graphs," IEEE/CAA J. Automat. Sin., vol. 1, no. 3, pp. 248-256, Jul. 2014.
 [16] D. V. Prokhorov and D. C. Wunsch, "Adaptive critic designs," IEEE Trans. Neur. Net., vol. 8, no. 5, pp. 997-1007, Sep. 1997.
 [17] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996.
 [18] B. Lincoln and A. Rantzer, "Relaxing dynamic programming," IEEE Trans. Automat. Control, vol. 51, no. 8, pp. 1249-1260, Aug. 2006.
 [19] Q. L. Wei, F. Y. Wang, D. R. Liu, and X. Yang, "Finite-approximation-error-based discrete-time iterative adaptive dynamic programming," IEEE Trans. Cybernet., vol. 44, no. 12, pp. 2820-2833, Dec. 2014.
 [20] A. Rantzer, "Relaxed dynamic programming in switching systems," IET Proc. -Contr. Theor. Appl., vol. 153, no. 5, pp. 567-574, Oct. 2006.
 [21] H. Modares and F. L. Lewis, "Linear quadratic tracking control of partially-unknown continuous-time systems using reinforcement learning," IEEE Trans. Automat. Control, vol. 59, no. 11, pp. 3051-3056, Nov. 2014.
 [22] H. Modares and F. L. Lewis, "Optimal tracking control of nonlinear partially-unknown constrained-input systems using integral reinforcement learning," Automatica, vol. 50, no. 7, pp. 1780-1792, Jul. 2014.
 [23] H. G. Zhang, T. Feng, G. H. Yang, and H. J. Liang, "Distributed cooperative optimal control for multiagent systems on directed graphs: an inverse optimal approach," IEEE Trans. Cybernet., vol. 45, no. 7, pp. 1315-1326, Jul. 2015.
 [24] M. Kumar, K. Rajagopal, S. N. Balakrishnan, et al., "Reinforcement learning based controller synthesis for flexible aircraft wings," IEEE/CAA J. Automat. Sin., vol. 1, no. 4, pp. 435-448, Oct. 2014.
 [25] R. Kamalapurkar, J. R. Klotz, and W. E. Dixon, "Concurrent learning-based approximate feedback-Nash equilibrium solution of N-player nonzero-sum differential games," IEEE/CAA J. Automat. Sin., vol. 1, no. 3, pp. 239-247, Jul. 2014.
 [26] Q. M. Zhao, H. Xu, and S. Jagannathan, "Near optimal output feedback control of nonlinear discrete-time systems based on reinforcement neural network learning," IEEE/CAA J. Automat. Sin., vol. 1, no. 4, pp. 372-384, Oct. 2014.
 [27] Q. L. Wei, R. Z. Song, and P. F. Yan, "Data-driven zero-sum neuro-optimal control for a class of continuous-time unknown nonlinear systems with disturbance using ADP," IEEE Trans. Neur. Net. Lear. Syst., vol. 27, no. 2, pp. 444-458, Feb. 2016.
 [28] Q. L. Wei and D. R. Liu, "Adaptive dynamic programming for optimal tracking control of unknown nonlinear systems with application to coal gasification," IEEE Trans. Automat. Sci. Eng., vol. 11, no. 4, pp. 1020-1036, Oct. 2014.
 [29] Q. L. Wei and D. R. Liu, "A novel iterative $\theta$-adaptive dynamic programming for discrete-time nonlinear systems," IEEE Trans. Automat. Sci. Eng., vol. 11, no. 4, pp. 1176-1190, Oct. 2014.
 [30] Q. L. Wei and D. R. Liu, "Data-driven neuro-optimal temperature control of water-gas shift reaction using stable iterative adaptive dynamic programming," IEEE Trans. Ind. Electron., vol. 61, no. 11, pp. 6399-6408, Nov. 2014.
 [31] Q. L. Wei, D. R. Liu, and H. Q. Lin, "Value iteration adaptive dynamic programming for optimal control of discrete-time nonlinear systems," IEEE Trans. Cybernet., vol. 46, no. 3, pp. 840-853, Mar. 2016.
 [32] Q. L. Wei, D. R. Liu, and X. Yang, "Infinite horizon self-learning optimal control of nonaffine discrete-time nonlinear systems," IEEE Trans. Neur. Net. Lear. Syst., vol. 26, no. 4, pp. 866-879, Apr. 2015.
 [33] Q. L. Wei, D. R. Liu, and F. L. Lewis, "Optimal distributed synchronization control for continuous-time heterogeneous multi-agent differential graphical games," Inform. Sci., vol. 317, pp. 96-113, Oct. 2015.
 [34] Q. L. Wei and D. R. Liu, "A novel policy iteration based deterministic Q-learning for discrete-time nonlinear systems," Sci. China Inform. Sci., vol. 58, no. 12, pp. 1-15, Dec. 2015.
 [35] F. L. Lewis, D. Vrabie, and K. G. Vamvoudakis, "Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers," IEEE Control Syst., vol. 32, no. 6, pp. 76-105, Dec. 2012.
 [36] J. J. Murray, C. J. Cox, G. G. Lendaris, and R. Saeks, "Adaptive dynamic programming," IEEE Trans. Syst. Man Cybern. C Appl. Rev., vol. 32, no. 2, pp. 140-153, 2002. DOI: 10.1109/TSMCC.2002.801727
 [37] M. Abu-Khalaf and F. L. Lewis, "Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach," Automatica, vol. 41, no. 5, pp. 779-791, 2005. DOI: 10.1016/j.automatica.2004.11.034
 [38] R. Z. Song, W. D. Xiao, H. G. Zhang, and C. Y. Sun, "Adaptive dynamic programming for a class of complex-valued nonlinear systems," IEEE Trans. Neur. Net. Lear. Syst., vol. 25, no. 9, pp. 1733-1739, Sep. 2014.
 [39] R. Z. Song, F. Lewis, Q. L. Wei, H. G. Zhang, Z. P. Jiang, and D. Levine, "Multiple actor-critic structures for continuous-time optimal control using input-output data," IEEE Trans. Neur. Net. Lear. Syst., vol. 26, no. 4, pp. 851-865, Apr. 2015.
 [40] R. Z. Song, F. L. Lewis, Q. L. Wei, and H. G. Zhang, "Off-policy actor-critic structure for optimal control of unknown systems with disturbances," IEEE Trans. Cybernet., vol. 46, no. 5, pp. 1041-1050, 2016. DOI: 10.1109/TCYB.2015.2421338
 [41] D. R. Liu and Q. L. Wei, "Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems," IEEE Trans. Neur. Net. Lear. Syst., vol. 25, no. 3, pp. 621-634, Mar. 2014.
 [42] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof," IEEE Trans. Syst. Man Cybern. B Cybern., vol. 38, no. 4, pp. 943-949, Aug. 2008.
 [43] H. G. Zhang, Q. L. Wei, and Y. H. Luo, "A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm," IEEE Trans. Syst. Man Cybern. B Cybern., vol. 38, no. 4, pp. 937-942, Aug. 2008.
 [44] T. Huang and D. R. Liu, "A self-learning scheme for residential energy system control and management," Neur. Comput. Appl., vol. 22, no. 2, pp. 259-269, Feb. 2013.
 [45] Q. L. Wei, D. R. Liu, and G. Shi, "A novel dual iterative Q-learning method for optimal battery management in smart residential environments," IEEE Trans. Ind. Electron., vol. 62, no. 4, pp. 2509-2518, Apr. 2015.
 [46] T. Yau, L. N. Walker, H. L. Graham, A. Gupta, and R. Raithel, "Effects of battery storage devices on power system dispatch," IEEE Trans. Power Apparatus Syst., vol. PAS-100, no. 1, pp. 375-383, Jan. 1981.
 [47] R. E. Bellman, Dynamic Programming. Princeton, NJ: Princeton University Press, 1957.
 [48] Q. L. Wei, D. R. Liu, G. Shi, and Y. Liu, "Multibattery optimal coordination control for home energy management systems via distributed iterative adaptive dynamic programming," IEEE Trans. Ind. Electron., vol. 62, no. 7, pp. 4203-4214, Jul. 2015.
 [49] J. Si and Y. T. Wang, "Online learning control by association and reinforcement," IEEE Trans. Neur. Net., vol. 12, no. 2, pp. 264-276, Mar. 2001.