IEEE/CAA Journal of Automatica Sinica 2018, Vol. 5 Issue(4): 827-835
Parallel Reinforcement Learning: A Framework and Case Study
Teng Liu1, Bin Tian2,3, Yunfeng Ai4, Li Li5, Dongpu Cao6, Fei-Yue Wang1     
1. State Key Laboratory of Management and Control for Complex System, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China;
2. State Key Lab. of Management and Control for Complex System, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China;
3. Cloud Computing Center, Chinese Academy of Sciences, Dongguan 523808, China;
4. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100190, China;
5. Department of Automation, Tsinghua University, Beijing 100190, China;
6. Driver Cognition and Automated Driving Laboratory, University of Waterloo, Waterloo N2L 3G1, Canada
Abstract: In this paper, a new machine learning framework called parallel reinforcement learning is developed for complex system control. To overcome the data deficiency of current data-driven algorithms, a parallel system is built to improve the complex learning system by self-guidance. Based on Markov chain (MC) theory, we combine transfer learning, predictive learning, deep learning and reinforcement learning to handle the data and action processes and to express the knowledge. The parallel reinforcement learning framework is then formulated, and several case studies on real-world problems are finally introduced.
Key words: Deep learning, machine learning, parallel reinforcement learning, parallel system, predictive learning, transfer learning

Machine learning, especially deep reinforcement learning (DRL), has experienced rapid development in recent years [1], [2]. In domains as varied as visual detection [3], dexterous manipulation in robotics [4], energy efficiency improvement [5], object localization [6], Atari games [7], [8], Leduc poker [9], the Doom game [10] and text-based games [11], these data-driven learning approaches show great potential to improve performance and accuracy. However, several issues still impede researchers from applying DRL to real-world complex system problems.

One issue is the lack of generalization capability to new goals [3]. DRL agents need to collect new data and learn new model parameters for each new target, and retraining the learning model is computationally expensive. Hence, the limited data must be utilized well so that the learning system can adapt to new environments.

Another issue is data inefficiency [8]. Acquiring large-scale action and interaction data from real complex systems is arduous, and it is very difficult for learning systems to explore control policies entirely by themselves. Thus, it is necessary to create a large number of observations for action and knowledge from the historically available data.

The final issue is data dependency and distribution. In practical systems, the dependency among data samples is often uncertain and their probability distribution usually varies. It is therefore hard for DRL agents to consider the state, action and knowledge of a learning system in an integrated way.

In order to address these difficulties, we develop a new parallel reinforcement learning framework for complex system control in this paper. We construct an artificial system analogous to the real system via modelling, and together they constitute a parallel system. Based on Markov chain (MC) theory, transfer learning, predictive learning, deep learning and reinforcement learning are employed to handle the data and action processes and to express knowledge. Furthermore, several application cases of parallel reinforcement learning are introduced to illustrate its usability. Note that the technique proposed in this paper can be regarded as a specification of the parallel learning framework in [12].

Fei-Yue Wang first proposed the parallel system theory in 2004 [13], [14], where the ACP method was introduced to deal with complex system problems. The ACP approach comprises artificial societies (A) for modelling, computational experiments (C) for analysis, and parallel execution (P) for control. An artificial system is usually built by modelling, to explore data and knowledge as the real system does. By executing independently and complementarily in these two systems, the learning model can become more efficient and less data-hungry. The ACP approach has been applied in several fields to study different problems in complex systems [15]-[17].

Transfer learning focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. Taking the driving cycles of a vehicle as an example, we introduce mean tractive force (MTF) components to achieve an equivalent transformation of them. By transferring the limited data via MTF, the generalization capability problem can be relieved.

Predictive learning tries to use prior knowledge to build a model of the environment by trying out different actions in various circumstances. Taking power demand as an example, we introduce a fuzzy encoding predictor to forecast the future power demand over different time steps. Based on the MC model, historically available data can be used to alleviate the data inefficiency.

Deep learning learns data representations using multiple layers of nonlinear processing units, with supervised or unsupervised learning of the feature representations in each layer. Reinforcement learning is concerned with how agents ought to take actions in an environment so as to maximize some notion of cumulative reward. The main contribution of this paper is combining the parallel system with transfer learning, predictive learning, deep learning and reinforcement learning to formulate the parallel reinforcement learning framework and to address the data dependency and distribution problems in real-world complex systems.

The rest of this paper is organized as follows. Section Ⅱ introduces the parallel reinforcement learning framework and relevant components, then several case studies for real-world complex system problems are described in Section Ⅲ. Finally, we conclude the paper in Section Ⅳ.

Ⅱ. FRAMEWORK AND COMPONENTS

A. The Framework and the Parallel System

The purpose of parallel reinforcement learning is to build a closed loop of data and knowledge in the parallel system to determine the next operation in each system, as shown in Fig. 1. The data represent the inputs and parameters in the artificial and real systems. The knowledge refers to the mappings from state space to action space, which we call experience in the real system and policy in the artificial system. The experience can be used to rectify the artificial model, and the updated policy is utilized to guide the real actor along with feedback from the environment.

Fig. 1 Parallel reinforcement learning framework.

Cyber-physical systems have attracted increasing attention over the past two decades for their potential to fuse computational processes with the physical world. Furthermore, cyber-physical-social systems (CPSS) augment cyber-physical system capacity by integrating human and social characteristics to achieve more effective design and operation [18]. The ACP-driven parallel system framework is depicted in Fig. 2. The integration of the real and artificial systems as a whole is called a parallel system.

Fig. 2 ACP-driven parallel system framework.

In this framework, the physically-defined real system interacts with the software-defined artificial system via three coupled modules within the CPSS: control and management, experiment and evaluation, and learning and training. The first module acts as the decision maker in the two systems, the second as the evaluator, and the third as the learning controller.

ACP = artificial societies + computational experiments + parallel execution. Owing to developments in information and communication technologies, the artificial system is often constructed by descriptive learning based on observation of the real system. It can help the learning controller store more computing results and make more flexible decisions. Thus, the artificial system is parallel to the real system and runs asynchronously to stabilize the learning process and extend the learning capability.

In the computational experiment stage, the specifications of transfer learning, predictive learning and deep learning are formulated using MC theory, as discussed later. For the parallel system, combining these learning processes with reinforcement learning yields parallel reinforcement learning, which derives the experience and policy and clarifies the interaction between them. For a general parallel intelligent system, such knowledge can be applied to different tasks because the learning controller can handle several tasks via rational reasoning [19].

Finally, parallel execution between the artificial and real systems is expected to enable optimal operation of both [20]. Although the artificial system is initialized from prior data of the real system, it is rectified and improved by further observations. The continually updated knowledge in the artificial system is in turn used to instruct the real system's operation in an efficient way. Owing to this exchange of data and knowledge through parallel execution, the two systems can improve by self-guidance.

B. Transfer Learning

In this paper, we choose driving cycles as an example to introduce transfer learning, which can be easily generalized to other data in the MC domain. A general driving cycle transformation methodology based on the mean tractive force (MTF) components is introduced in this section. This transformation converts an existing driving cycle database into an equivalent one matching a real MTF value, relieving the data scarcity problem.

MTF is defined as the tractive energy divided by the distance traveled over a whole driving cycle, integrated over the entire time interval [0, $T$ ] as follows

$ \begin{equation} \label{eq1} \bar {F}=\frac{1}{x_T }\int_0^T {F(t)} v(t)dt \end{equation} $ (1)

where $x_{T}$ is the total distance traveled in a certain driving cycle, calculated as $\int_0^T v(t)dt$; $v$ is the vehicle speed of the driving cycle; and $F$ is the longitudinal force that propels the vehicle, computed as

$ \begin{equation} \label{eq2} \left\{ {\begin{array}{l} F=F_a +F_r +F_m \\ F_a =\frac{1}{2}\rho _a C_d Av^2, F_r =M_v g\cdot f, F_m =M_v a \\ \end{array}} \right. \end{equation} $ (2)

where $F_{a}$ is aerodynamic drag, $F_{r}$ is rolling resistance and $F_{m}$ is inertial force. $\rho_{a}$ is the air density, $C_{d}$ is the aerodynamic coefficient, and $A$ is the frontal area. $M_{v}$ is the curb weight, $g$ is the gravitational acceleration, $f $ is the rolling friction coefficient and $ a$ is the acceleration.
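As an illustration, (1) and (2) can be discretized for a sampled speed trace. The sketch below is a minimal one: the vehicle parameters are illustrative assumptions rather than values from the paper, and the tractive energy counts only samples where the powertrain delivers positive power (the traction mode discussed below).

```python
# Hedged sketch of Eqs. (1)-(2): longitudinal force and mean tractive force
# for a 1-Hz sampled driving cycle. All vehicle parameters are assumed.

def tractive_force(v, a, rho_a=1.2, Cd=0.3, A=2.2, Mv=1500.0, g=9.81, f=0.015):
    """F = F_a + F_r + F_m per Eq. (2)."""
    F_a = 0.5 * rho_a * Cd * A * v ** 2   # aerodynamic drag
    F_r = Mv * g * f                      # rolling resistance
    F_m = Mv * a                          # inertial force
    return F_a + F_r + F_m

def mean_tractive_force(speed, dt=1.0):
    """Discretized Eq. (1): F_bar = (1/x_T) * sum F(t) v(t) dt, where the
    tractive energy sums only samples with F > 0 and v != 0."""
    energy, distance = 0.0, 0.0
    for k in range(len(speed) - 1):
        v = speed[k]
        a = (speed[k + 1] - v) / dt
        F = tractive_force(v, a)
        distance += v * dt                # x_T accumulates over the whole cycle
        if F > 0 and v != 0:              # positive-power (traction) samples only
            energy += F * v * dt
    return energy / distance if distance > 0 else 0.0
```

For a short accelerate-cruise-decelerate trace, the function returns the cycle's mean tractive force in newtons.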

The vehicle operating modes are divided as traction, coasting, braking and idling according to the force imposed on the vehicle powertrain [21]. Hence, the time interval is depicted as

$ \begin{equation} \label{eq3}\small \left\{\! \!{\begin{array}{l} T=T_{tr} \cup T_{co} \cup T_{br} \cup T_{id} \\ T_{tr} =\left\{ {\left. t \right|F(t)>0, v(t)\ne 0} \right\}, T_{co} =\left\{ {\left. t \right|F(t)=0, v(t)\ne 0} \right\} \\ T_{br} =\left\{ {\left. t \right|F(t)<0, v(t)\ne 0} \right\}, T_{id} =\left\{ {\left. t \right|v(t)=0} \right\} \\ \end{array}} \right.\end{equation} $ (3)

where $T_{tr}$ and $T_{co}$ are the traction-mode and coasting-mode regions, respectively, $T_{br}$ is the braking region, and $T_{id}$ is the idling set.

From (3), it is obvious that the powertrain only provides positive power to wheels in the traction regions. MTF in (1) is specialized as follows:

$ \begin{equation} \label{eq4} \bar {F}=\frac{1}{x_T }\int\limits_{t\in T_{tr} } {F(t)} v(t)dt=\bar {F}_a +\bar {F}_r +\bar {F}_m. \end{equation} $ (4)

Then, the MTF components ( $\alpha $ , $\beta $ , $\gamma $ ) are statistical measures characterizing a driving cycle, defined as [22]

$ \begin{equation} \label{eq5} \left\{ {\begin{array}{l} \alpha =\dfrac{\bar {F}_a }{\frac{1}{2}\rho _a C_d A}=\dfrac{1}{x_T }\int\limits_{t\in T_{tr} } {v^3(t)} dt \\[2mm] \beta =\dfrac{\bar {F}_r }{M_v g\cdot f}=\dfrac{1}{x_T } \int\limits_{t\in T_{tr} } {v(t)} dt \\[2mm] \gamma =\dfrac{\bar {F}_m }{M_v }=\dfrac{1}{x_T } \int\limits_{t\in T_{tr} } {a(t)\cdot v(t)} dt. \\ \end{array}} \right. \end{equation} $ (5)

Note that MTF components are related to the speed and acceleration for a specific driving cycle. These measures are employed as the constraints for driving cycle transformation.
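A minimal sketch of how the three components of (5) could be computed from a sampled speed trace follows; the traction-mode test reuses the force model of (2), and all vehicle parameters are assumptions for illustration only.

```python
# Illustrative computation of the MTF components (alpha, beta, gamma) of
# Eq. (5): traction-phase sums of v^3, v and a*v divided by total distance.
# Vehicle parameters are assumed, not the paper's.

def mtf_components(speed, dt=1.0, rho_a=1.2, Cd=0.3, A=2.2,
                   Mv=1500.0, g=9.81, f=0.015):
    """Return (alpha, beta, gamma) per Eq. (5) for a 1-Hz speed trace."""
    s3 = s1 = sa = x = 0.0
    for k in range(len(speed) - 1):
        v = speed[k]
        a = (speed[k + 1] - v) / dt
        F = 0.5 * rho_a * Cd * A * v**2 + Mv * g * f + Mv * a  # Eq. (2)
        x += v * dt                       # total distance x_T
        if F > 0 and v != 0:              # traction region T_tr only
            s3 += v**3 * dt
            s1 += v * dt
            sa += a * v * dt
    return s3 / x, s1 / x, sa / x
```

Because the traction region is a subset of the whole cycle, $\beta$ (distance-normalized traction travel) is at most 1 for any trace.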

By definition, MTF is unique for a specific driving cycle, so inequality and equality constraints are employed to determine the transferred driving cycle. A cost function can be defined by the designer to choose an optimal equivalent cycle from the set of feasible solutions. This transformation is formulated as a nonlinear program (NLP) as

$ \begin{equation} \label{eq6} \begin{array}{l} ~~~~~~~~~~~~~~~~~\mathop {\min }\limits_{\tilde {v}} f(\tilde {v}) \\ {\rm s.t.}~~~~~g_i (\tilde {v}, T_{tr} , {\alpha }', {\beta }', {\gamma }')=0, ~~i=1, 2, 3 \\ ~~~~~~~~~~~~~h_1 (\tilde {v}, T_{tr} , v_{\rm coast} )<0 \\ ~~~~~~~~~~~~~h_2 (\tilde {v}, T_{co} \cup T_{br} , v_{\rm coast} )\ge 0 \\ \end{array} \end{equation} $ (6)

where $\tilde {v}$ is the transferred driving cycle, ( $\alpha'$ , $\beta'$ , $\gamma'$ ) are the target MTF components, $v_{\rm coast}$ is the vehicle coasting speed, and $g_{i}$ and $h_{j}$ are the constraint functions. Through this process, a transferred driving cycle matching the real conditions can be obtained and afterwards used for other operations, such as control and management [21], [22]; see Fig. 3 for an illustration.

Fig. 3 Transfer learning for driving cycles transformation.

The purpose of transfer learning is to convert historically available data into equivalent data to expand the database. The transferred data are strongly associated with the real environment and can thus be used to generate adaptive control and operations in complex systems, thereby addressing the generalization capability and data scarcity problems.

C. Predictive Learning

Taking the power demand of a vehicle as an example, we introduce predictive learning to forecast future power demand based on the observed data and processes in the parallel system. A better understanding of the real system can then be obtained and applied to update the artificial system from these new experiences. A power demand prediction technique based on the fuzzy encoding predictor (FEP) is illustrated in this section. This approach can also be used to draw more future knowledge from experience for other parameters of complex systems.

Power demand is modelled as a finite-state MC [23] and denoted $P_{dem}={\{}p_{j }\vert j=1, {\ldots}, M{\}}\subset X$ , where $X\subset \mathbb{R}$ is bounded. The transition probability of power demand is calculated by the maximum likelihood estimator as

$ \begin{equation} \label{eq7} \left\{ {\begin{array}{l} \pi _{ij} =P(\left. {p^+=p_j } \right|p=p_i )=\dfrac{N_{ij} }{N_i }{\kern 1pt} \\ N_i =\sum\limits_{j=1}^M {N_{ij} } \\ \end{array}} \right. \end{equation} $ (7)

where $\pi _{ij}$ is the transition probability from $p_{i}$ to $p_{j}$ . $p$ and $p^{+}$ are the present and next one-step ahead power demands, respectively. Furthermore, $N_{ij }$ indicates the transition count number from $p_{i}$ to $p_{j}$ , and $N_{i }$ is the total transition count number initiated from $p_{i}$ .
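The counting estimator of (7) can be sketched in a few lines. The demand levels are assumed to be pre-quantized into indices $0, \ldots, M-1$; the handling of unvisited states (a uniform row) is our assumption, since the paper does not specify it.

```python
# Minimal sketch of Eq. (7): maximum-likelihood estimation of the power-demand
# transition probability matrix pi_ij = N_ij / N_i from a quantized sequence.

def estimate_tpm(sequence, M):
    """Estimate an M x M TPM from one-step transitions in `sequence`
    (states indexed 0..M-1)."""
    counts = [[0] * M for _ in range(M)]
    for i, j in zip(sequence, sequence[1:]):
        counts[i][j] += 1                 # transition count N_ij
    tpm = []
    for row in counts:
        N_i = sum(row)                    # total transitions leaving state i
        # unvisited states get a uniform row so the matrix stays stochastic
        tpm.append([n / N_i for n in row] if N_i else [1.0 / M] * M)
    return tpm
```

Each row of the returned matrix sums to one, as a transition probability matrix requires.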

All elements $\pi_{ij}$ constitute the transition probability matrix $\Pi $ . For the fuzzy encoding technique, $X $ is divided into a finite set of fuzzy subsets $\Phi_{j}, j=1, {\ldots}, M$ , where $\Phi_{j}$ is a pair ( $X$ , $\mu_{j}(\cdot))$ and $\mu_{j}(\cdot )$ is a Lebesgue-measurable membership function defined as

$ \begin{equation} \label{eq8} \mu _j :X\to [0, 1]\quad {\rm s.t.}\quad \forall p\in X, \ \exists j, \ 1\le j\le M, \ \mu _j (p)>0 \end{equation} $ (8)

where $\mu_{j}(p)$ reflects the membership degree of $p\in X$ in $\mu_{j}$ . It is noticed that a continuous state $p\in X$ in the fuzzy encoding may be associated with several states $p_{j}$ of the underlying finite-state MC model [24].

Two transformations are involved in the FEP. The first allocates an $M$ -dimensional possibility (not probability) vector to each $p\in X$ as

$ \begin{equation} \label{eq9} \tilde {O}^T(p)=\mu ^T(p) = [\mu _1 (p), \mu _2 (p), \ldots , \mu _M (p)]. \end{equation} $ (9)

This transformation is named fuzzification and maps power demand in the space $X$ to the vector in $M$ -dimensional possibility vector space $\tilde {X}$ . Note that it is not necessary for the sum of the elements in possibility vector $\tilde {O}(p)$ to equal 1.

The second transformation is called proportional possibility-to-probability transformation, in which the possibility vector $\tilde {O}(p)$ is converted into a probability vector $O(p)$ by normalization [23], [24]:

$ \begin{equation} \label{eq10} O(p)=\frac{\tilde {O}(p)}{\sum\limits_{j=1}^M {\tilde {O}_j (p)}} \end{equation} $ (10)

where this transformation maps $\tilde {X}$ to an $M$ -dimensional probability vector space, $\bar {X}$ . The element $\pi_{ij }$ in the transition probability matrix (TPM) $\Pi $ is interpreted as a transition probability between $\Phi_{i}$ and $\Phi _{j}$ . To decode vectors in $\bar {X}$ back to $X$ , the probability distribution $O^{+}(p)$ is utilized to aggregate the membership function $\mu (p)$ to encode the probability vector of the next state in $X$ :

$ \begin{equation} \label{eq11} w^+(p)=(O^+(p))^T\mu (p)=(O(p))^T\Pi \mu (p). \end{equation} $ (11)

The expected value over the possibility vector leads to the next one-step ahead power demand in FEP:

$ \begin{equation} \label{eq12} \left\{ {\begin{array}{l} p^+=\int\limits_X {w^+(y)yd} y/\int\limits_X {w^+(y)d} y \\ \int\limits_X {w^+(y)yd} y = \sum\limits_{i=1}^M {O_i (p)\sum\limits_{j=1}^M {\pi _{ij} } } \int\limits_X {y\mu _j (y)d} y \\ \int\limits_X {w^+(y)d} y=\sum\limits_{i=1}^M {O_i (p)\sum\limits_{j=1}^M {\pi _{ij} } } \int\limits_X {\mu _j (y)d} y. \\ \end{array}} \right. \end{equation} $ (12)

The centroid and volume of the membership function $\mu_{j}(p)$ are expressed as

$ \begin{equation} \label{eq13} \left\{ {\begin{array}{l} \bar {c}_j =\dfrac{1}{V_j }\int\limits_X {y\mu _j (y)d} y \\ V_j =\int\limits_X {\mu _j (y)d} y. \\ \end{array}} \right. \end{equation} $ (13)

Thus, (12) is reformulated as

$ \begin{equation} \label{eq14} p^+=\frac{\sum\limits_{i=1}^M {O_i (p)\sum\limits_{j=1}^M {\pi _{ij} V_j \bar {c}_j } } }{\sum\limits_{i=1}^M {O_i (p)\sum\limits_{j=1}^M {\pi _{ij} V_j } } } \end{equation} $ (14)

Expression (14) gives the predicted one-step-ahead power demand using the FEP. Fig. 4 shows an example of predictive learning used for power demand prediction. In this way, the future power demand of the vehicle at different time steps can be determined, and these data are then used to improve the management and operations in the parallel system by self-guidance.
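The full fuzzification, normalization and defuzzification chain of (8)-(14) can be sketched compactly. The sketch below assumes symmetric triangular membership functions, for which the centroid $\bar{c}_j$ is the triangle center and the volume $V_j$ equals the half-base width; the demand levels and TPM used are placeholders.

```python
# Runnable sketch of the fuzzy encoding predictor, Eqs. (8)-(14), with
# symmetric triangular membership functions (centroid = center, volume =
# half-base width). Levels, width and TPM are illustrative assumptions.

def fep_predict(p, centers, width, tpm):
    """One-step-ahead power demand prediction per Eq. (14)."""
    # fuzzification, Eq. (9): possibility vector mu(p)
    mu = [max(0.0, 1.0 - abs(p - c) / width) for c in centers]
    # proportional possibility-to-probability transform, Eq. (10)
    s = sum(mu)
    O = [m / s for m in mu]
    num = den = 0.0
    for i, Oi in enumerate(O):
        for j, pij in enumerate(tpm[i]):
            num += Oi * pij * width * centers[j]   # O_i * pi_ij * V_j * c_j
            den += Oi * pij * width                # O_i * pi_ij * V_j
    return num / den
```

With an identity TPM (each fuzzy state maps to itself), the predictor simply reproduces the current demand, which is a useful sanity check.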

Fig. 4 Predictive learning for future power demand prediction.

The goal of predictive learning is to generate reasonable data from prior existing data and real-time observations of the real world. We aim to minimize the difference between real samples and generated samples by tuning the parameters of the predictive learning methodology. These generated data are then responsible for deriving various experiences and guiding the complex system through the learning process, so as to settle the data inefficiency and distribution problems.

D. Reinforcement Learning

In the reinforcement learning framework, a learning agent interacts with a stochastic environment. We model the interaction as a quintuple ( $S$ , $A$ , $\Pi $ , $R$ , $\gamma $ ), where $ s\in {\textit{S}}$ and $a\in {\textit{A}}$ are the state variables and control actions, $\Pi $ is the transition probability matrix, $r\in {\textit{R}}$ is the reward function, and $\gamma \in (0, 1)$ denotes a discount factor.

The action value function $ Q(s$ , $a)$ is defined as the expected cumulative reward starting from state $s$ and taking action $a$ :

$ \begin{equation} \label{eq15} Q(s, a)={\rm E}\left\{ {\sum\limits_{l=0}^\infty {\gamma ^lr_{t+1+l} \left| {s_t =s, a_t =a} \right.} } \right\}. \end{equation} $ (15)

The action value function associated with an optimal policy can be found by the Q-learning algorithm as in [25]

$ \begin{equation} \label{eq16} Q(s_t , a_t )\leftarrow Q(s_t , a_t )+\eta (r+\gamma \mathop {\max }\limits_a Q(s_{t+1} , a)-Q(s_t , a_t )). \end{equation} $ (16)
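The update of (16) can be exercised on a toy problem. The two-state, two-action chain below is an assumption made purely to demonstrate the rule; taking action $a$ moves the chain to state $a$, and reaching state 1 yields reward 1.

```python
# Minimal tabular Q-learning loop implementing the update of Eq. (16) on a
# toy 2-state, 2-action chain (an illustrative environment, not the paper's).
import random

def q_learning(episodes=2000, eta=0.1, gamma=0.9, seed=0):
    random.seed(seed)
    Q = [[0.0, 0.0], [0.0, 0.0]]           # Q[s][a]
    for _ in range(episodes):
        s = random.randint(0, 1)
        for _ in range(10):
            a = random.randint(0, 1)       # purely exploratory policy
            s_next = a                     # action a moves the chain to state a
            r = 1.0 if s_next == 1 else 0.0
            # Eq. (16): Q(s,a) <- Q(s,a) + eta * (r + gamma max_a' Q(s',a') - Q(s,a))
            Q[s][a] += eta * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next
    return Q
```

Since moving to state 1 is rewarded, the learned values satisfy $Q(s,1) > Q(s,0)$ for both states, and Q-learning finds this off-policy despite the random exploration.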

When the state and action spaces are large, for example when the action $a_{t}$ consists of several sub-actions, modelling the Q-values $Q(s$ , $a)$ becomes difficult. In this situation, we feed both state and action representations into a deep neural network to approximate the action value function.

A deep neural network is composed of an input layer, one or more hidden layers and an output layer. As shown in Fig. 5(a), the input vector $g=[g_{1}, g_{2}, {\ldots}, g_{R}]$ is weighted by elements $w_{1},w_{2}, {\ldots},w_{R}$ , and then summed with a bias $b$ to form the net input $n$ as

Fig. 5 Deep neural network and bidirectional long short-term memory.
$ \begin{equation} \label{eq17} n=\sum\limits_{i=1}^R {w_i g_i } +b. \end{equation} $ (17)

Then, the net input $n$ is passed through an activation function $h$ to generate the neuron output $d$ :

$ \begin{equation} \label{eq18} d=h(n) \end{equation} $ (18)

where the activation function typically differs between the hidden layer ( $h_{1}$ ) and the output layer ( $h_{2}$ ).
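A single-neuron forward pass implementing (17) and (18) is a few lines; the logistic sigmoid used as $h$ below is one common choice among several, assumed here for illustration.

```python
# Single-neuron forward pass for Eqs. (17)-(18), with a logistic sigmoid as
# the activation function h (an assumed, common choice).
import math

def neuron(g, w, b):
    n = sum(wi * gi for wi, gi in zip(w, g)) + b   # net input, Eq. (17)
    return 1.0 / (1.0 + math.exp(-n))              # output d = h(n), Eq. (18)
```

With zero weights and bias the net input is zero, so the sigmoid output is exactly 0.5.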

In this paper, we propose a bidirectional long short-term memory [26] based deep reinforcement network (BiLSTM-DRN) to approximate the action value function in reinforcement learning; see Fig. 5(b) for an illustration. This structure consists of a pair of deep neural networks, one for the state variable $s_{t}$ embedding and the other for the control sub-action $c^{i}_{t}$ embeddings. As the bidirectional LSTM has a larger capacity due to its nonlinear structure, we expect it to capture more details of how the embeddings of the sub-actions are combined into an action embedding. Finally, a pairwise interaction function (e.g., an inner product) computes the new $Q(s_{t}$ , $a_{t})$ by combining the state and sub-action neuron outputs as

$ \begin{equation} \label{eq19} Q(s_t , a_t )=\sum\limits_{i=1}^K Q (s_t , c_t^i ) \end{equation} $ (19)

where $K$ is the number of the sub-actions, and $Q(s_{t}$ , $c^{i}_{t})$ represents the expected accumulated future rewards by including this sub-action.
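The decomposition of (19) reduces to summing pairwise interaction scores. In the sketch below the inner product serves as the interaction function, and the embedding vectors are placeholders standing in for trained BiLSTM-DRN outputs.

```python
# Sketch of Eq. (19): Q(s_t, a_t) as the sum over sub-actions of a pairwise
# interaction (inner product) between the state embedding and each sub-action
# embedding. Embeddings here are placeholders for trained network outputs.

def q_value(state_emb, subaction_embs):
    """Q(s,a) = sum_i <state_emb, c_i>, with K = len(subaction_embs)."""
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))
    return sum(dot(state_emb, c) for c in subaction_embs)
```

This additive form lets the network score each sub-action independently while the total still reflects the whole composite action.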

Combining the ideas of parallel system, transfer learning, predictive learning and reinforcement learning, we can formulate a closed loop of data and knowledge, named parallel reinforcement learning, as described in Fig. 1. Several case studies for real-world complex system problems are introduced and discussed in the next section.

Ⅲ. CASE STUDIES OF PARALLEL REINFORCEMENT LEARNING

A. Existing Case Studies in Parallel Reinforcement Learning Framework

Parallel reinforcement learning serves as a reasonable and suitable framework for analysing real-world complex systems. It consists of a self-boosting process in the parallel system, a self-adaptive process through transfer learning, a self-guided process through predictive learning, and a big-data screening and generating process through the BiLSTM-DRN. The learning process becomes more efficient and continuous in this framework.

Several complex systems have been researched and analysed from the perspective of parallel reinforcement learning, such as transportation systems [27], [28] and vision systems [29]. A traffic flow prediction system that inherently considers spatial and temporal correlations was designed in [27]. First, an artificial system, a stacked autoencoder model, was built to learn generic traffic flow features. Second, the synthetic data were trained by a layer-wise greedy method in the deep learning architecture. Finally, predictive learning was used to achieve traffic flow prediction and self-guidance for the parallel system. A survey on the development of the data-driven intelligent transportation system (D-DITS) was presented in [28], which addressed the functionality of the key components of D-DITS and some deployment issues associated with its future research.

A parallel reinforcement learning framework has also been applied to problems in visual perception and understanding [29]. By building an artificial vision system from observations of real scenes, the synthetic data can be used for feature analysis, object analysis and scene analysis. This novel research methodology, named parallel vision, was proposed for the perception and understanding of complex scenes.

Furthermore, the autonomous learning system for vehicle energy efficiency improvement in [30] can also be placed in the parallel reinforcement learning framework. First, a plug-in hybrid electric vehicle was simulated to construct the parallel system. Then, historical driving records of the real vehicle were collected to autonomously learn the optimal fuel use via a deep neural network and reinforcement learning. Finally, the trained policy can guide the real vehicle's operations and improve control performance. A better understanding of the real vehicle is thereby obtained and used to adjust the artificial system from these new experiences.

B. New Applications Using Parallel Reinforcement Learning Methods

Recently, we designed a driving-cycle-transformation-based adaptive energy management system for a hybrid electric vehicle (HEV). There are two major difficulties in the HEV energy management problem. First, most energy management strategies or predefined rules cannot adapt to changing driving conditions. Second, the model-based approaches used in energy management require accurate vehicle models, which entail a considerable model parameter calibration cost. Hence, we apply the parallel reinforcement learning framework to the HEV energy management problem, as depicted in Fig. 6. The core idea of this methodology is bi-level.

Fig. 6 Parallel reinforcement learning for energy management of HEV.

The upper level characterizes how to transform driving cycles using transfer learning by considering the induced matrix norm (IMN). Specifically, the TPMs of power demand are computed, and the IMN is employed as a critical criterion to quantify the differences between TPMs and to decide when to alter the control strategy. The lower level determines how to derive the corresponding control strategies for the transferred driving cycle using a model-free reinforcement learning algorithm. In other words, we simulate the HEV as an artificial system to sample possible energy management solutions, use transfer learning to make the computed strategies adaptive to real-world driving conditions, and use reinforcement learning to generate the corresponding controls. Tests demonstrate that the proposed strategy exceeds the conventional reinforcement learning approach in both calculation speed and control performance.
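The IMN criterion described above can be sketched as follows. The choice of the induced infinity-norm (maximum absolute row sum) and the threshold value are our assumptions; the paper does not fix either here.

```python
# Hedged sketch of the IMN criterion: compute an induced norm of the
# difference between two power-demand TPMs and flag a control-strategy
# update when it exceeds a threshold. Norm choice and threshold are assumed.

def tpm_distance(P, Q):
    """Induced infinity-norm of P - Q (maximum absolute row sum)."""
    return max(sum(abs(p - q) for p, q in zip(rp, rq))
               for rp, rq in zip(P, Q))

def needs_update(P, Q, threshold=0.5):
    """True when the TPMs differ enough to warrant a strategy change."""
    return tpm_distance(P, Q) > threshold
```

Identical TPMs give distance zero, so no update is triggered; strongly divergent TPMs exceed the threshold.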

Furthermore, we construct an energy efficiency improvement system within the parallel reinforcement learning framework for a hybrid tracked vehicle (HTV). Specifically, we combine a simulated artificial vehicle with the real vehicle to constitute the parallel system, use predictive learning to predict power demand for further self-guidance, and use reinforcement learning to calculate the control policy. This approach also comprises two layers; see Fig. 7 for a visualization. The first layer addresses how to accurately forecast future power demand using the FEP based on MC theory, with the Kullback-Leibler (KL) divergence rate employed to quantify the differences between TPMs and to decide when to update the control strategy. The second layer computes the relevant control policy based on the predicted power demand using the reinforcement learning technique. Comparisons show that the proposed control policy is superior to the primary reinforcement learning approach in both energy efficiency improvement and computational speed.
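The KL-divergence-rate test mentioned above can be sketched similarly. For Markov chains the divergence rate weights per-row KL divergences by a stationary distribution; the uniform row weighting and the smoothing constant below are simplifying assumptions for illustration.

```python
# Sketch of a KL-divergence-rate criterion between two power-demand TPMs.
# Uniform row weights replace the stationary distribution, and eps smooths
# zero entries; both are assumptions made for this illustration.
import math

def kl_rate(P, Q, eps=1e-12):
    """Approximate KL divergence rate between TPMs P and Q."""
    M = len(P)
    rate = 0.0
    for rp, rq in zip(P, Q):
        rate += sum(p * math.log((p + eps) / (q + eps))
                    for p, q in zip(rp, rq)) / M
    return rate
```

The rate is zero for identical TPMs and strictly positive once the transition structure drifts, which is what triggers a strategy update.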

Fig. 7 Parallel reinforcement learning for energy efficiency of HTV.

In the future, we plan to apply the BiLSTM-DRN to process and train on large volumes of real vehicle data for optimal energy management strategy computation. The objective is to realize real-time control using the parallel reinforcement learning method on our self-made tracked vehicle. More importantly, we will apply the parallel reinforcement learning framework to multiple missions of automated vehicles [30], such as decision making and trajectory planning. By addressing the existing disadvantages of traditional data-driven methods, we expect parallel reinforcement learning to promote the development of machine learning.


Ⅳ. CONCLUSION

The general framework and case studies of parallel reinforcement learning for complex systems are introduced in this paper. The purpose is to build a closed loop of data and knowledge in the parallel system to guide the real system's operation and improve the artificial system's precision. In particular, the ACP approach is used to construct the parallel system, which contains an artificial system and a real system. Transfer learning is utilized to achieve driving cycle transformation by means of the mean tractive force components. Predictive learning is applied to forecast future power demand via the fuzzy encoding predictor. To train on data in large action and state spaces, we introduce the BiLSTM-DRN to approximate the action value function in reinforcement learning.

Data-driven models are usually treated as components independent of the data in the learning process, which leads to large-scale exploration and observation-insufficiency problems. Furthermore, the data in these models tend to be inadequate, and a general principle for organizing them remains absent. By combining the parallel system with transfer learning, predictive learning, deep learning and reinforcement learning, we believe that parallel reinforcement learning can effectively address these problems and promote the development of machine learning.

[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, Feb. 2015.
[2] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484-489, Jan. 2016.
[3] Y. K. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, F.-F. Li, and A. Farhadi, "Target-driven visual navigation in indoor scenes using deep reinforcement learning," in Proc. 2017 IEEE Int. Conf. Robotics and Automation (ICRA), Singapore, 2017, pp. 3357-3364.
[4] I. Popov, N. Heess, T. Lillicrap, R. Hafner, G. Barth-Maron, M. Vecerik, T. Lampe, Y. Tassa, T. Erez, and M. Riedmiller, "Data-efficient deep reinforcement learning for dexterous manipulation," arXiv:1704.03073, 2017.
[5] X. W. Qi, Y. D. Luo, G. Y. Wu, K. Boriboonsomsin, and M. J. Barth, "Deep reinforcement learning-based vehicle energy efficiency autonomous learning system," in Proc. IEEE Intelligent Vehicles Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1228-1233.
[6] J. C. Caicedo and S. Lazebnik, "Active object localization with deep reinforcement learning," in Proc. IEEE Int. Conf. Computer Vision, Santiago, Chile, 2015, pp. 2488-2496.
[7] X. X. Guo, S. Singh, R. Lewis, and H. Lee, "Deep learning for reward design to improve Monte Carlo tree search in Atari games," arXiv:1604.07095, 2016.
[8] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv:1312.5602, 2013.
[9] J. Heinrich and D. Silver, "Deep reinforcement learning from self-play in imperfect-information games," arXiv:1603.01121, 2016.
[10] D. Hafner, "Deep reinforcement learning from raw pixels in Doom," arXiv:1610.02164, 2016.
[11] K. Narasimhan, T. Kulkarni, and R. Barzilay, "Language understanding for text-based games using deep reinforcement learning," arXiv:1506.08941, 2015.
[12] L. Li, Y. L. Lin, N. N. Zheng, and F. Y. Wang, "Parallel learning: a perspective and a framework," IEEE/CAA J. Autom. Sinica, vol. 4, no. 3, pp. 389-395, Jul. 2017.
[13] F. Y. Wang, "Artificial societies, computational experiments, and parallel systems: a discussion on computational theory of complex social-economic systems," Complex Syst. Complex. Sci., vol. 1, no. 4, pp. 25-35, Oct. 2004.
[14] F. Y. Wang, "Toward a paradigm shift in social computing: the ACP approach," IEEE Intell. Syst., vol. 22, no. 5, pp. 65-67, Sep.-Oct. 2007.
[15] F. Y. Wang, "Parallel control and management for intelligent transportation systems: concepts, architectures, and applications," IEEE Trans. Intell. Transp. Syst., vol. 11, no. 3, pp. 630-638, Sep. 2010.
[16] F. Y. Wang and S. N. Tang, "Artificial societies for integrated and sustainable development of metropolitan systems," IEEE Intell. Syst., vol. 19, no. 4, pp. 82-87, Jul.-Aug. 2004.
[17] F. Y. Wang, H. G. Zhang, and D. R. Liu, "Adaptive dynamic programming: an introduction," IEEE Comput. Intell. Mag., vol. 4, no. 2, pp. 39-47, May 2009.
[18] F. Y. Wang, "The emergence of intelligent enterprises: from CPS to CPSS," IEEE Intell. Syst., vol. 25, no. 4, pp. 85-88, Jul.-Aug. 2010.
[19] F. Y. Wang, N. N. Zheng, D. P. Cao, C. M. Martinez, L. Li, and T. Liu, "Parallel driving in CPSS: a unified approach for transport automation and vehicle intelligence," IEEE/CAA J. Autom. Sinica, vol. 4, no. 4, pp. 577-587, Oct. 2017.
[20] K. F. Wang, C. Gou, and F. Y. Wang, "Parallel vision: an ACP-based approach to intelligent vision computing," Acta Automat. Sin., vol. 42, no. 10, pp. 1490-1500, Oct. 2016.
[21] P. Nyberg, E. Frisk, and L. Nielsen, "Driving cycle equivalence and transformation," IEEE Trans. Veh. Technol., vol. 66, no. 3, pp. 1963-1974, Mar. 2017.
[22] P. Nyberg, E. Frisk, and L. Nielsen, "Driving cycle adaption and design based on mean tractive force," in Proc. 7th IFAC Symp. Advanced Automatic Control, Tokyo, Japan, vol. 7, no. 1, pp. 689-694, 2013.
[23] D. P. Filev and I. Kolmanovsky, "Generalized Markov models for real-time modeling of continuous systems," IEEE Trans. Fuzzy Syst., vol. 22, no. 4, pp. 983-998, Aug. 2014.
[24] D. P. Filev and I. Kolmanovsky, "Markov chain modeling approaches for on board applications," in Proc. 2010 American Control Conf., Baltimore, MD, USA, 2010, pp. 4139-4145.
[25] T. Liu, X. S. Hu, S. E. Li, and D. P. Cao, "Reinforcement learning optimized look-ahead energy management of a parallel hybrid electric vehicle," IEEE/ASME Trans. Mechatron., vol. 22, no. 4, pp. 1497-1507, Aug. 2017.
[26] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Netw., vol. 18, no. 5-6, pp. 602-610, Jul.-Aug. 2005.
[27] Y. S. Lv, Y. J. Duan, W. W. Kang, Z. X. Li, and F. Y. Wang, "Traffic flow prediction with big data: a deep learning approach," IEEE Trans. Intell. Transp. Syst., vol. 16, no. 2, pp. 865-873, Apr. 2015.
[28] J. P. Zhang, F. Y. Wang, K. F. Wang, W. H. Lin, X. Xu, and C. Chen, "Data-driven intelligent transportation systems: a survey," IEEE Trans. Intell. Transp. Syst., vol. 12, no. 4, pp. 1624-1639, Dec. 2011.
[29] K. F. Wang, C. Gou, N. N. Zheng, J. M. Rehg, and F. Y. Wang, "Parallel vision for perception and understanding of complex scenes: methods, framework, and perspectives," Artif. Intell. Rev., vol. 48, no. 3, pp. 299-329, Oct. 2017.
[30] W. Liu, Z. H. Li, L. Li, and F. Y. Wang, "Parking like a human: a direct trajectory planning solution," IEEE Trans. Intell. Transp. Syst., vol. 18, no. 12, pp. 3388-3397, Dec. 2017.