2. State Key Lab. of Management and Control for Complex System, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China;
3. Cloud Computing Center, Chinese Academy of Sciences, Dongguan 523808, China;
4. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100190, China;
5. Department of Automation, Tsinghua University, Beijing 100190, China;
6. Driver Cognition and Automated Driving Laboratory, University of Waterloo, Waterloo N2L 3G1, Canada
Machine learning, especially deep reinforcement learning (DRL), has developed extremely rapidly in recent years [1], [2]. In applications ranging from traditional visual detection [3], dexterous manipulation in robotics [4], energy efficiency improvement [5] and object localization [6] to Atari games [7], [8], Leduc poker [9], the Doom game [10] and text-based games [11], these data-driven learning approaches show great potential for improving performance and accuracy. However, several issues still impede researchers from applying DRL to real complex-system problems.
One issue is the lack of generalization capability to new goals [3]. DRL agents need to collect new data and learn new model parameters for each new target, and retraining the learning model is computationally expensive. Hence, the limited data must be used well so that the agent can adapt to new environments through learning.
Another issue is data inefficiency [8]. Acquiring large-scale action and interaction data from real complex systems is arduous, and it is very difficult for learning systems to explore control policies by themselves. Thus, it is necessary to generate a large number of observations for action and knowledge from the available historical data.
The final issue is data dependency and distribution. In practical systems, the dependency among data samples is often uncertain and the probability distribution usually varies. It is therefore hard for DRL agents to consider the state, action and knowledge of a learning system in an integrated way.
To address these difficulties, we develop a new parallel reinforcement learning framework for complex system control in this paper. We construct an artificial system analogous to the real system via modelling, and the two together constitute a parallel system. Based on Markov chain (MC) theory, transfer learning, predictive learning, deep learning and reinforcement learning are used to handle data and action processes and to express knowledge. Furthermore, several application cases of parallel reinforcement learning are introduced to illustrate its usability. Note that the technique proposed in this paper can be regarded as a specification of the parallel learning in [12].
Fei-Yue Wang first proposed the parallel system theory in 2004 [13], [14]. In [13] and [14], the ACP method was proposed to deal with complex system problems. The ACP approach comprises artificial societies (A) for modelling, computational experiments (C) for analysis, and parallel execution (P) for control. An artificial system is usually built by modelling, to explore data and knowledge as the real system does. By executing independently and complementarily in these two systems, the learning model can become more efficient and less data-hungry. The ACP approach has been applied in several fields to address different problems in complex systems [15]-[17].
Transfer learning focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. Taking vehicle driving cycles as an example, we introduce mean tractive force (MTF) components to achieve an equivalent transformation of them. By transferring limited data via MTF, the generalization capability problem can be relieved.
Predictive learning tries to use prior knowledge to build a model of the environment by trying out different actions in various circumstances. Taking power demand as an example, we introduce a fuzzy encoding predictor to forecast the future power demand at different time steps. Based on the MC, available historical data can be used to alleviate data inefficiency.
Deep learning learns data representations through multiple layers of nonlinear processing units, with supervised or unsupervised learning of feature representations in each layer. Reinforcement learning is concerned with how agents ought to take actions in an environment so as to maximize some notion of cumulative reward. The main contribution of this paper is combining the parallel system with transfer learning, predictive learning, deep learning and reinforcement learning to formulate the parallel reinforcement learning framework, which addresses the data dependency and distribution problems in real-world complex systems.
The rest of this paper is organized as follows. Section Ⅱ introduces the parallel reinforcement learning framework and its components. Several case studies for real-world complex system problems are described in Section Ⅲ. Finally, we conclude the paper in Section Ⅳ.
Ⅱ. FRAMEWORK AND COMPONENTS
A. The Framework and the Parallel System
The purpose of parallel reinforcement learning is to build a closed loop of data and knowledge in the parallel system to determine the next operation in each system, as shown in Fig. 1. The data represent the inputs and parameters of the artificial and real systems. The knowledge means the records mapping state space to action space, which we call experience in the real system and policy in the artificial system. The experience can be used to rectify the artificial model, and the updated policy is used to guide the real actor along with feedback from the environment.
Fig. 1 Parallel reinforcement learning framework. 
Cyber-physical systems have attracted increasing attention over the past two decades for their potential to fuse computational processes with the physical world. Furthermore, cyber-physical-social systems (CPSS) augment the capacity of cyber-physical systems by integrating human and social characteristics to achieve more effective design and operation [18]. The ACP-driven parallel system framework is depicted in Fig. 2. The integration of the real and artificial systems as a whole is called a parallel system.
Fig. 2 ACP-driven parallel system framework. 
In this framework, the physically defined real system interacts with the software-defined artificial system via three coupled modules within the CPSS: control and management, experiment and evaluation, and learning and training. The first module serves as the decision maker in the two systems, the second as the evaluator and the third as the learning controller.
ACP = Artificial societies + Computational experiments + Parallel execution. The artificial system is often constructed by descriptive learning based on observations of the real system, which is enabled by developments in information and communication technologies. It can help the learning controller store more computing results and make more flexible decisions. Thus, the artificial system is parallel to the real system and runs asynchronously to stabilize the learning process and extend the learning capability.
In the computational experiment stage, the specifications of transfer learning, predictive learning and deep learning are illustrated using MC theory, as discussed later. For the parallel system, combining these learning processes with reinforcement learning yields parallel reinforcement learning, which derives the experience and policy and clarifies their interaction. For a general parallel intelligent system, such knowledge can be applied to different tasks because the learning controller can handle several tasks via rational reasoning [19].
Finally, parallel execution between the artificial and real systems is expected to enable optimal operation of these systems [20]. Although the artificial system is built from prior data of the real system, it is rectified and improved by further observations. The continually updated knowledge in the artificial system is in turn used to instruct the real system's operation in an efficient way. Owing to the exchange of data and knowledge through parallel execution, the two systems can improve by self-guidance.
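The closed loop described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the "real system" is a hypothetical noisy walk on a line of five states, the artificial system is a transition model estimated from counts, the computational experiment is value iteration on that model with a reward of 1 for reaching the goal (assumed known to the planner), and parallel execution alternates between collecting real experience and refreshing the policy.

```python
import random

N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)  # move left or right along a line of states


def real_step(s, a, rng, slip=0.1):
    """Hypothetical real system: the move is reversed with probability `slip`."""
    if rng.random() < slip:
        a = -a
    return min(max(s + a, 0), N_STATES - 1)


def fit_model(counts):
    """Artificial system: transition probabilities estimated from real experience."""
    return {sa: [c / sum(row) for c in row] for sa, row in counts.items()}


def plan(model, gamma=0.9, sweeps=100):
    """Computational experiment: value iteration on the artificial system."""
    v = [0.0] * N_STATES

    def q(s, a):
        row = model.get((s, a))
        if row is None:  # pair never tried in the real system
            return 0.0
        return sum(p * ((1.0 if s2 == GOAL else 0.0) + gamma * v[s2])
                   for s2, p in enumerate(row))

    for _ in range(sweeps):
        v = [0.0 if s == GOAL else max(q(s, a) for a in ACTIONS)
             for s in range(N_STATES)]
    return [max(ACTIONS, key=lambda a: q(s, a)) for s in range(N_STATES)]


def parallel_rl(rounds=20, steps=50, epsilon=0.3, seed=0):
    """Parallel execution: experience rectifies the model, the policy guides the actor."""
    rng = random.Random(seed)
    counts = {}
    policy = [rng.choice(ACTIONS) for _ in range(N_STATES)]
    for _ in range(rounds):
        s = 0
        for _ in range(steps):
            # Epsilon-greedy real actor guided by the current policy.
            a = rng.choice(ACTIONS) if rng.random() < epsilon else policy[s]
            s_next = real_step(s, a, rng)
            counts.setdefault((s, a), [0] * N_STATES)[s_next] += 1  # experience
            s = 0 if s_next == GOAL else s_next  # restart after reaching the goal
        policy = plan(fit_model(counts))  # updated policy from the artificial system
    return policy, fit_model(counts)
```

Calling `parallel_rl()` returns the final policy (one action per state) and the estimated transition model. The sketch only mirrors the data and knowledge flow of Fig. 1; it does not include the transfer, predictive or deep learning components.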
B. Transfer Learning
In this paper, we choose driving cycles as an example to introduce transfer learning, which can easily be generalized to other data in the MC domain. A general driving cycle transformation methodology based on the mean tractive force (MTF) components is introduced in this section. This transformation converts an existing driving cycle database into an equivalent one with a real MTF value to relieve the data-shortage problem.
MTF is defined as the tractive energy divided by the distance traveled for a whole driving cycle, integrated over the entire time interval [0, $T$]:
$ \begin{equation} \label{eq1} \bar {F}=\frac{1}{x_T }\int_0^T {F(t)} v(t)dt \end{equation} $  (1) 
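where $x_T$ is the distance traveled over the driving cycle, $F$ is the longitudinal tractive force and $v$ is the vehicle speed. The force $F$ is computed as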
$ \begin{equation} \label{eq2} \left\{ {\begin{array}{l} F=F_a +F_r +F_m \\ F_a =\frac{1}{2}\rho _a C_d Av^2, F_r =M_v g\cdot f, F_m =M_v a \\ \end{array}} \right. \end{equation} $  (2) 
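where $F_a$ is the aerodynamic drag, $F_r$ is the rolling resistance and $F_m$ is the inertial force; $\rho_a$ is the air density, $C_d$ is the aerodynamic drag coefficient, $A$ is the frontal area, $M_v$ is the vehicle mass, $g$ is the gravitational acceleration, $f$ is the rolling resistance coefficient and $a$ is the acceleration.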
The vehicle operating modes are divided as traction, coasting, braking and idling according to the force imposed on the vehicle powertrain [21]. Hence, the time interval is depicted as
$ \begin{equation} \label{eq3}\small \left\{ {\begin{array}{l} T=T_{tr} \cup T_{co} \cup T_{br} \cup T_{id} \\ T_{tr} =\left\{ {t \mid F(t)>0, v(t)\ne 0} \right\}, T_{co} =\left\{ {t \mid F(t)=0, v(t)\ne 0} \right\} \\ T_{br} =\left\{ {t \mid F(t)<0, v(t)\ne 0} \right\}, T_{id} =\left\{ {t \mid v(t)=0} \right\} \\ \end{array}} \right.\end{equation} $  (3) 
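where $T_{tr}$ is the traction region, $T_{co}$ is the coasting region, $T_{br}$ is the braking region and $T_{id}$ is the idling region.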
From (3), it is obvious that the powertrain only provides positive power to wheels in the traction regions. MTF in (1) is specialized as follows:
$ \begin{equation} \label{eq4} \bar {F}=\frac{1}{x_L }\int\limits_{t\in T_{tr} } {F(t)} v(t)dt=\bar {F}_a +\bar {F}_r +\bar {F}_m. \end{equation} $  (4) 
Then, the MTF components ($\alpha$, $\beta$, $\gamma$) are defined as follows:
$ \begin{equation} \label{eq5} \left\{ {\begin{array}{l} \alpha =\dfrac{\bar {F}_a }{\dfrac{1}{2}\rho _a C_d A}=\dfrac{1}{x_L }\int\limits_{t\in T_{tr} } {v^3(t)} dt \\[2mm] \beta =\dfrac{\bar {F}_r }{M_v g\cdot f}=\dfrac{1}{x_L } \int\limits_{t\in T_{tr} } {v(t)} dt \\[2mm] \gamma =\dfrac{\bar {F}_m }{M_v }=\dfrac{1}{x_L } \int\limits_{t\in T_{tr} } {a(t)\cdot v(t)} dt. \\ \end{array}} \right. \end{equation} $  (5) 
Note that MTF components are related to the speed and acceleration for a specific driving cycle. These measures are employed as the constraints for driving cycle transformation.
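As a minimal numerical sketch of (1)-(5), the MTF components of a sampled speed trace can be approximated with rectangle-rule sums. The vehicle parameters below are placeholder values chosen for illustration, not values from this paper, and the normalizing distance is assumed to be the distance traveled over the whole cycle.

```python
def mtf_components(v, dt, rho_a=1.2, c_d=0.3, area=2.2, mass=1500.0,
                   g=9.81, f=0.01):
    """Return (alpha, beta, gamma, mtf) for speeds v [m/s] sampled every dt [s].

    All vehicle parameters are hypothetical defaults.
    """
    n = len(v)
    # Forward-difference acceleration; the last sample reuses the previous value.
    a = [(v[k + 1] - v[k]) / dt for k in range(n - 1)]
    a.append(a[-1] if a else 0.0)
    # Longitudinal force F = F_a + F_r + F_m from (2).
    force = [0.5 * rho_a * c_d * area * vk ** 2 + mass * g * f + mass * ak
             for vk, ak in zip(v, a)]
    # Distance traveled over the cycle (rectangle rule).
    dist = sum(vk * dt for vk in v)
    if dist <= 0:
        raise ValueError("the cycle covers no distance")
    alpha = beta = gamma = 0.0
    for vk, ak, fk in zip(v, a, force):
        if fk > 0 and vk != 0:  # traction region T_tr from (3)
            alpha += vk ** 3 * dt
            beta += vk * dt
            gamma += ak * vk * dt
    alpha, beta, gamma = alpha / dist, beta / dist, gamma / dist
    # Recombine the components into the MTF as in (4).
    mtf = 0.5 * rho_a * c_d * area * alpha + mass * g * f * beta + mass * gamma
    return alpha, beta, gamma, mtf
```

For a constant-speed cycle the whole trace lies in the traction region, so $\alpha$ reduces to $v^2$, $\beta$ to 1 and $\gamma$ to 0, which gives a quick sanity check of the implementation.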
By definition, the MTF is unique for a specific driving cycle; thus inequality and equality constraints are employed to determine the transferred driving cycle. A cost function can be defined by the designer to choose an optimal equivalent cycle from the set of feasible solutions. This transformation is formulated as a nonlinear program (NLP):
$ \begin{equation} \label{eq6} \begin{array}{l} ~~~~~~~~~~~~~~~~~\mathop {\min }\limits_{\tilde {v}} f(\tilde {v}) \\ {\rm s.t.}~~~~~g_i (\tilde {v}, T_{tr} , {\alpha }', {\beta }'~{\rm or}~{\gamma }')=0, i=1, 2, 3 \\ ~~~~~~~~~~~~~h_1 (\tilde {v}, T_{tr} , v_{\rm coast} )<0 \\ ~~~~~~~~~~~~~h_2 (\tilde {v}, T_{co} \cup T_{br} , v_{\rm coast} )\ge 0 \\ \end{array} \end{equation} $  (6) 
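where $\tilde{v}$ is the transferred driving cycle, ${\alpha}'$, ${\beta}'$ and ${\gamma}'$ are the target MTF components, and $v_{\rm coast}$ is the coasting velocity.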
Fig. 3 Transfer learning for driving cycles transformation. 
The purpose of transfer learning is to convert available historical data into equivalent data to expand the database. The transferred data are strongly associated with the real environment. Thus, they can be used to generate adaptive control and operations in complex systems, so as to address the generalization capability and data-hungry problems.
C. Predictive Learning
Taking the power demand of a vehicle as an example, we introduce predictive learning to forecast the future power demand based on the observed data and processes in the parallel system. A better understanding of the real system can then be obtained and used to update the artificial system from these new experiences. A power demand prediction technique based on the fuzzy encoding predictor (FEP) is illustrated in this section. This approach can also be used to draw more future knowledge from experience for other parameters in complex systems.
Power demand is modelled as a finite-state MC [23] and depicted as
$ \begin{equation} \label{eq7} \left\{ {\begin{array}{l} \pi _{ij} =P(p^+=p_j \mid p=p_i )=\dfrac{N_{ij} }{N_i } \\ N_i =\sum\limits_{j=1}^M {N_{ij} } \\ \end{array}} \right. \end{equation} $  (7) 
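where $\pi_{ij}$ is the transition probability from $p_i$ to $p_j$, $p$ and $p^+$ are the present and the next power demands, $N_{ij}$ is the number of observed transitions from $p_i$ to $p_j$, $N_i$ is the total number of transitions starting from $p_i$, and $M$ is the number of states.
All elements $\pi_{ij}$ constitute the transition probability matrix (TPM) $\Pi$. In the FEP, each of the $M$ states is associated with a fuzzy subset of the power demand space $X$, described by a membership function $\mu_j$ that satisfies
$ \begin{equation} \label{eq8} \mu _j :X\to [0, 1]\quad {\rm s.t.}\quad \forall p\in X, \ \exists j, 1\le j\le M, \ \mu _j (p)>0 \end{equation} $  (8) 
where $\mu_j(p)$ is the degree of membership of $p$ in the $j$th fuzzy subset, so that every $p\in X$ has a nonzero membership in at least one subset.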
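Two transformations are involved in the FEP. The first transformation allocates an $M$-dimensional possibility vector $\tilde{O}(p)$ to each power demand $p\in X$:
$ \begin{equation} \label{eq9} \tilde {O}^T(p)=\mu ^T(p) = [\mu _1 (p), \mu _2 (p), \ldots , \mu _M (p)]. \end{equation} $  (9) 
This transformation is named fuzzification and maps a power demand in the space $X$ to an $M$-dimensional possibility vector.
The second transformation is called the proportional possibility-to-probability transformation, in which the possibility vector $\tilde{O}(p)$ is converted into a probability vector $O(p)$ by normalization: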
$ \begin{equation} \label{eq10} O(p)=\frac{\tilde {O}(p)}{\sum\limits_{j=1}^M {\tilde {O}_j (p)}} \end{equation} $  (10) 
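where this transformation maps the possibility vector $\tilde{O}(p)$ to a probability vector $O(p)$ whose elements sum to one. Using the TPM $\Pi$, the next-state probability vector is $(O^+(p))^T=(O(p))^T\Pi$, and the aggregated membership function $w^+$ of the next power demand is decoded as
$ \begin{equation} \label{eq11} w^+(y)=(O^+(p))^T\mu (y)=(O(p))^T\Pi \mu (y). \end{equation} $  (11) 
The expected value over the possibility vector gives the next one-step-ahead power demand in the FEP: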
$ \begin{equation} \label{eq12} \left\{ {\begin{array}{l} p^+=\int\limits_X {w^+(y)yd} y/\int\limits_X {w^+(y)d} y \\ \int\limits_X {w^+(y)yd} y = \sum\limits_{i=1}^M {O_i (p)\sum\limits_{j=1}^M {\pi _{ij} } } \int\limits_X {y\mu _j (y)d} y \\ \int\limits_X {w^+(y)d} y=\sum\limits_{i=1}^M {O_i (p)\sum\limits_{j=1}^M {\pi _{ij} } } \int\limits_X {\mu _j (y)d} y. \\ \end{array}} \right. \end{equation} $  (12) 
The centroid and volume of the membership function $\mu_j(y)$ are defined as
$ \begin{equation} \label{eq13} \left\{ {\begin{array}{l} V_j =\int\limits_X {\mu _j (y)d} y \\ \bar {c}_j =\dfrac{1}{V_j }\int\limits_X {y\mu _j (y)d} y. \\ \end{array}} \right. \end{equation} $  (13) 
Thus, (12) is reformulated as
$ \begin{equation} \label{eq14} p^+=\frac{\sum\limits_{i=1}^M {O_i (p)\sum\limits_{j=1}^M {\pi _{ij} } } V_j \bar {c}_j }{\sum\limits_{i=1}^M {O_i (p)\sum\limits_{j=1}^M {\pi _{ij} } } V_j }. \end{equation} $  (14) 
Expression (14) is the predicted one-step-ahead power demand using the FEP. Fig. 4 shows an example of predictive learning used for power demand prediction. In this way, the future power demand of the vehicle at different time steps can be determined, and these data are then used to improve the management and operation of the parallel system by self-guidance.
Fig. 4 Predictive learning for future power demand prediction. 
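The prediction steps in (7)-(14) can be sketched in a few lines. The sketch assumes triangular membership functions on evenly spaced centers, each integrated over the whole real line so that every volume $V_j$ equals the spacing and every centroid equals its center; these choices, the function names and the uniform fallback for unvisited states are illustrative assumptions rather than the design used in this paper.

```python
def estimate_tpm(states, m):
    """Estimate pi_ij = N_ij / N_i from a sequence of state indices, as in (7)."""
    counts = [[0] * m for _ in range(m)]
    for i, j in zip(states[:-1], states[1:]):
        counts[i][j] += 1
    tpm = []
    for row in counts:
        total = sum(row)
        # Assumption: states that were never left get a uniform row.
        tpm.append([c / total for c in row] if total else [1.0 / m] * m)
    return tpm


def triangular_memberships(p, centers):
    """Possibility vector (9) from triangular membership functions."""
    h = centers[1] - centers[0]
    # Clamp p so that at least one membership is positive, cf. (8).
    p = min(max(p, centers[0]), centers[-1])
    return [max(0.0, 1.0 - abs(p - c) / h) for c in centers]


def fep_predict(p, centers, tpm):
    """One-step-ahead power demand prediction following (10) and (14)."""
    h = centers[1] - centers[0]
    poss = triangular_memberships(p, centers)
    total = sum(poss)
    prob = [x / total for x in poss]  # possibility-to-probability step (10)
    volumes = [h] * len(centers)      # V_j: area of each full triangle
    centroids = list(centers)         # centroid of each symmetric triangle
    num = den = 0.0
    for i, o_i in enumerate(prob):
        for j, pi_ij in enumerate(tpm[i]):
            num += o_i * pi_ij * volumes[j] * centroids[j]
            den += o_i * pi_ij * volumes[j]
    return num / den
```

As a usage example, with centers `[0.0, 10.0, 20.0]` and the state sequence `[0, 1, 2, 2, 1, 0, 0, 1]`, the last row of the estimated TPM is `[0, 0.5, 0.5]`, so a present demand of 20 is predicted to be followed by a demand of 15.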
The goal of predictive learning is to generate reasonable data from previously existing data and real-time observations in the real world. We aim to minimize the differences between real samples and generated samples by tuning the parameters of the predictive learning method. These generated data are then used to derive various experiences and to guide the complex system through the learning process, so as to address the data inefficiency and distribution problems.
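D. Reinforcement Learning
In the reinforcement learning framework, a learning agent interacts with a stochastic environment. We model the interaction as a quintuple (S, A, $\Pi$, R, $\gamma$), where S and A are the state and action sets, $\Pi$ is the TPM, R is the reward function and $\gamma$ is the discount factor.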
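The action value function $Q(s, a)$ is defined as the expected discounted cumulative reward obtained when starting from state $s$ and taking action $a$:
$ \begin{equation} \label{eq15} Q(s, a)={\rm E}\left\{ {\sum\limits_{l=0}^\infty {\gamma ^l r_{t+1+l} } \mid s_t =s, a_t =a } \right\}. \end{equation} $  (15) 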
The action value function associated with an optimal policy can be found by the Q-learning algorithm as in [25]
$ \begin{equation} \label{eq16} Q(s_t , a_t )\leftarrow Q(s_t , a_t )+\eta (r+\gamma \mathop {\max }\limits_a Q(s_{t+1} , a)-Q(s_t , a_t )). \end{equation} $  (16) 
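For a small discrete problem, the update in (16) can be written directly on a table of action values. The following is a minimal sketch; the function name, the table layout `q[state][action]` and the default values of $\eta$ and $\gamma$ are assumptions made for illustration.

```python
def q_update(q, s, a, r, s_next, eta=0.1, gamma=0.9):
    """Apply one Q-learning update (16) in place and return the new Q(s, a)."""
    # Temporal-difference target: reward plus discounted best next value.
    td_target = r + gamma * max(q[s_next])
    q[s][a] += eta * (td_target - q[s][a])
    return q[s][a]
```

For example, with `q = [[0.0, 0.0], [1.0, 2.0]]`, a reward of 1 for taking action 1 in state 0 and landing in state 1 moves `q[0][1]` from 0 toward the target 1 + 0.9 × 2 = 2.8 by a fraction $\eta$ = 0.1.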
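When the state and action space is large, for example when the action $a_t$ consists of several sub-actions, it is difficult to store and update $Q(s, a)$ for every state-action pair directly, and a function approximator is used instead.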
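A deep neural network is composed of an input layer, one or more hidden layers and an output layer. As shown in Fig. 5(a), the input vector $g=[g_1, g_2, \ldots, g_R]$ is weighted by $w$ and summed with the bias $b$ to form the net input $n$: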
Fig. 5 Deep neural network and bidirectional long short-term memory. 
$ \begin{equation} \label{eq17} n=\sum\limits_{i=1}^R {w_i g_i } +b. \end{equation} $  (17) 
Then, the net input $n$ passes through an activation function $h$ to generate the neuron output $d$:
$ \begin{equation} \label{eq18} d=h(n) \end{equation} $  (18) 
where the activation function $h$ is usually chosen separately for the hidden layers and for the output layer.
In this paper, we propose a bidirectional long short-term memory [26] based deep reinforcement network (BiLSTM-DRN) to approximate the action value function in reinforcement learning; see Fig. 5(b) for an illustration. This structure consists of a pair of deep neural networks, one for the state variable $s_t$ and the other for the sub-actions $c_t^i$. The action value function is expressed as
$ \begin{equation} \label{eq19} Q(s_t , a_t )=\sum\limits_{i=1}^K Q (s_t , c_t^i ) \end{equation} $  (19) 
where $K$ is the number of sub-actions and $c_t^i$ denotes the $i$th sub-action of $a_t$.
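The neuron computation in (17) and (18) and the sub-action sum in (19) can be sketched as follows. This is not the BiLSTM-DRN itself: the per-sub-action value $Q(s_t, c_t^i)$ is represented by an arbitrary user-supplied `score` function, which is a stand-in assumed here for the trained network, and all names are hypothetical.

```python
import math


def neuron(inputs, weights, bias, activation=math.tanh):
    """Net input n = sum_i w_i g_i + b from (17), passed through h as in (18)."""
    n = sum(w * g for w, g in zip(weights, inputs)) + bias
    return activation(n)


def q_value(state, sub_actions, score):
    """Sum the per-sub-action values Q(s_t, c_t^i) as in (19)."""
    return sum(score(state, c) for c in sub_actions)
```

Any callable that maps a state and one sub-action to a scalar can be passed as `score`, for instance a small network built from `neuron` calls on the concatenated state and sub-action features.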
Combining the ideas of the parallel system, transfer learning, predictive learning and reinforcement learning, we formulate a closed loop of data and knowledge, named parallel reinforcement learning, as described in Fig. 1. Several case studies for real-world complex system problems are introduced and discussed in the next section.
Ⅲ. CASE STUDIES OF PARALLEL REINFORCEMENT LEARNING
A. Existing Case Studies in Parallel Reinforcement Learning Framework
Parallel reinforcement learning serves as a reasonable and suitable framework for analysing real-world complex systems. It consists of a self-boosting process in the parallel system, a self-adaptive process by transfer learning, a self-guided process by predictive learning, and a big data screening and generating process by BiLSTM-DRN. The learning process becomes more efficient and continuous in the parallel reinforcement learning framework.
Several complex systems have been studied from the perspective of parallel reinforcement learning, such as transportation systems [27], [28] and vision systems [29]. A traffic flow prediction system that inherently considers spatial and temporal correlations was designed in [27]. First, an artificial system, a stacked autoencoder model, was built to learn generic traffic flow features. Second, the synthetic data were trained by a layer-wise greedy method in the deep learning architecture. Finally, predictive learning was used to achieve traffic flow prediction and self-guidance for the parallel system. A survey of the development of data-driven intelligent transportation systems (DDITS) was presented in [28], which also addressed the functionality of the key components of DDITS and some deployment issues associated with future research.
A parallel reinforcement learning framework has also been applied to problems in visual perception and understanding [29]. An artificial vision system is built from observations of real scenes, and the resulting synthetic data can be used for feature analysis, object analysis and scene analysis. This novel research methodology, named parallel vision, was proposed for perception and understanding of complex scenes.
Furthermore, the autonomous learning system for vehicle energy efficiency improvement in [5] can also be placed in the parallel reinforcement learning framework. First, a plug-in hybrid electric vehicle was simulated to construct the parallel system. Then, historical driving records of the real vehicle were collected to autonomously learn the optimal fuel use via a deep neural network and reinforcement learning. Finally, the trained policy can guide the real vehicle's operation and improve control performance. A better understanding of the real vehicle can then be obtained and used to adjust the artificial system from these new experiences.
B. New Applications Using Parallel Reinforcement Learning Methods
Recently, we designed an adaptive energy management system based on driving cycle transformation for a hybrid electric vehicle (HEV). There are two major difficulties in the energy management problem of HEVs. First, most energy management strategies or predefined rules cannot adapt to changing driving conditions. Second, model-based approaches to energy management require accurate vehicle models, which incurs a considerable model parameter calibration cost. Hence, we apply the parallel reinforcement learning framework to the energy management problem of the HEV, as depicted in Fig. 6. More precisely, the core idea of this methodology is bi-level.
Fig. 6 Parallel reinforcement learning for energy management of HEV. 
The upper level characterizes how to transform driving cycles using transfer learning by considering the induced matrix norm (IMN). Specifically, the TPMs of power demand are computed, and the IMN is employed as the criterion to quantify the differences between TPMs and to decide whether the control strategy should be altered. The lower level determines how to set the corresponding control strategies for the transferred driving cycle using a model-free reinforcement learning algorithm. In other words, we simulate the HEV as an artificial system to sample the possible energy management solutions, use transfer learning to make the computed strategies adaptive to real-world driving conditions, and use reinforcement learning to generate the corresponding controls. Tests demonstrate that the proposed strategy outperforms the conventional reinforcement learning approach in both calculation speed and control performance.
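A TPM comparison of this kind can be sketched as follows. The text does not state which induced norm is used, so the induced infinity norm (maximum absolute row sum) is assumed here purely for simplicity; the threshold value and function names are likewise hypothetical.

```python
def induced_inf_norm(p, q):
    """Induced infinity norm of P - Q: the largest absolute row sum."""
    return max(sum(abs(a - b) for a, b in zip(row_p, row_q))
               for row_p, row_q in zip(p, q))


def needs_update(p_old, p_new, threshold=0.1):
    """Flag a control strategy update when the TPMs differ by more than threshold."""
    return induced_inf_norm(p_old, p_new) > threshold
```

With `p_old = [[0.5, 0.5], [0.2, 0.8]]` and `p_new = [[0.6, 0.4], [0.2, 0.8]]`, the norm of the difference is 0.2 (up to rounding), so an update would be triggered under the assumed threshold of 0.1.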
Furthermore, we construct an energy efficiency improvement system in the parallel reinforcement learning framework for a hybrid tracked vehicle (HTV). Specifically, we combine the simulated artificial vehicle with the real vehicle to constitute the parallel system, use predictive learning to realize power demand prediction for further self-guidance, and use reinforcement learning to compute the control policy. This approach also includes two layers; see Fig. 7 for a visualization. The first layer addresses how to accurately forecast the future power demand using the FEP based on MC theory. The Kullback-Leibler (KL) divergence rate is employed to quantify the differences between TPMs and to decide when to update the control strategy. The second layer computes the relevant control policy based on the predicted power demand and the reinforcement learning technique. Finally, a comparison shows that the proposed control policy is superior to the primary reinforcement learning approach in energy efficiency improvement and computational speed.
Fig. 7 Parallel reinforcement learning for energy efficiency of HTV. 
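One common definition of the KL divergence rate between two Markov chains with TPMs $P$ and $Q$ weights the row-wise KL divergences by the stationary distribution of $P$. The sketch below uses that definition as an assumption, since the exact formula is not given here. It further assumes that $P$ is ergodic, so that the power iteration converges, and it floors the entries of $Q$ at a small `eps` to avoid division by zero.

```python
import math


def stationary_distribution(p, iters=1000):
    """Approximate the stationary distribution of an ergodic TPM by power iteration."""
    m = len(p)
    mu = [1.0 / m] * m
    for _ in range(iters):
        mu = [sum(mu[i] * p[i][j] for i in range(m)) for j in range(m)]
    return mu


def kl_divergence_rate(p, q, eps=1e-12):
    """Sum over i, j of mu_i * P_ij * log(P_ij / Q_ij), with mu stationary for P."""
    mu = stationary_distribution(p)
    rate = 0.0
    for i, mu_i in enumerate(mu):
        for p_ij, q_ij in zip(p[i], q[i]):
            if p_ij > 0:  # terms with P_ij = 0 contribute nothing
                rate += mu_i * p_ij * math.log(p_ij / max(q_ij, eps))
    return rate
```

The rate is zero when the two TPMs coincide and grows as they diverge, so it can be compared against a designer-chosen threshold to decide when the control strategy should be recomputed.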
In the future, we plan to apply BiLSTM-DRN to process and train on large real-vehicle data sets to compute optimal energy management strategies. The objective is to realize real-time control using the parallel reinforcement learning method in our self-made tracked vehicle. More importantly, we will apply the parallel reinforcement learning framework to multiple missions of automated vehicles [30], such as decision making and trajectory planning. By addressing the existing disadvantages of traditional data-driven methods, we expect that parallel reinforcement learning can promote the development of machine learning.
Ⅳ. CONCLUSION
The general framework and case studies of parallel reinforcement learning for complex systems are introduced in this paper. The purpose is to build a closed loop of data and knowledge in the parallel system to guide the real system's operation or improve the artificial system's precision. In particular, the ACP approach is used to construct the parallel system, which contains an artificial system and a real system. Transfer learning is used to achieve driving cycle transformation by means of mean tractive force components. Predictive learning is applied to forecast the future power demand via the fuzzy encoding predictor. To train on data in large action and state spaces, we introduce BiLSTM-DRN to approximate the action value function in reinforcement learning.
Data-driven models are usually viewed as a component irrelevant to the data in the learning process, which results in large-scale exploration and observation-insufficiency problems. Furthermore, data in these models tend to be inadequate, and a general principle for organizing these models remains absent. By combining the parallel system, transfer learning, predictive learning, deep learning and reinforcement learning, we believe that parallel reinforcement learning can effectively address these problems and promote the development of machine learning.
[1]  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, Feb. 2015. http://europepmc.org/abstract/med/25719670 
[2]  D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484-489, Jan. 2016. http://www.ncbi.nlm.nih.gov/pubmed/26819042 
[3]  Y. K. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, F. F. Li, and A. Farhadi, "Target-driven visual navigation in indoor scenes using deep reinforcement learning," in Proc. 2017 IEEE Int. Conf. Robotics and Automation (ICRA), Singapore, pp. 3357-3364. http://arxiv.org/abs/1609.05143 
[4]  I. Popov, N. Heess, T. Lillicrap, R. Hafner, G. Barth-Maron, M. Vecerik, T. Lampe, Y. Tassa, T. Erez, and M. Riedmiller, "Data-efficient deep reinforcement learning for dexterous manipulation," arXiv: 1704.03073, 2017. http://arxiv.org/abs/1704.03073 
[5]  X. W. Qi, Y. D. Luo, G. Y. Wu, K. Boriboonsomsin, and M. J. Barth, "Deep reinforcement learning-based vehicle energy efficiency autonomous learning system," in Proc. Intelligent Vehicles Symp. (IV), Los Angeles, CA, USA, pp. 1228-1233, 2017. http://www.researchgate.net/publication/318800742_Deep_reinforcement_learningbased_vehicle_energy_efficiency_autonomous_learning_system 
[6]  J. C. Caicedo and S. Lazebnik, "Active object localization with deep reinforcement learning," in Proc. IEEE Int. Conf. Computer Vision, Santiago, Chile, 2015, pp. 2488-2496. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7410643 
[7]  X. X. Guo, S. Singh, R. Lewis, and H. Lee, "Deep learning for reward design to improve Monte Carlo tree search in Atari games," arXiv: 1604.07095, 2016. http://arxiv.org/abs/1604.07095 
[8]  V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv: 1312.5602, 2013. 
[9]  J. Heinrich and D. Silver, "Deep reinforcement learning from self-play in imperfect-information games," arXiv: 1603.01121, 2016. http://arxiv.org/abs/1603.01121 
[10]  D. Hafner, "Deep reinforcement learning from raw pixels in Doom," arXiv: 1610.02164, 2016. http://arxiv.org/abs/1610.02164 
[11]  K. Narasimhan, T. Kulkarni, and R. Barzilay, "Language understanding for text-based games using deep reinforcement learning," arXiv: 1506.08941, 2015. http://arxiv.org/abs/1506.08941 
[12]  L. Li, Y. L. Lin, N. N. Zheng, and F. Y. Wang, "Parallel learning: a perspective and a framework," IEEE/CAA J. of Autom. Sinica, vol. 4, no. 3, pp. 389-395, Jul. 2017. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=7974888 
[13]  F. Y. Wang, "Artificial societies, computational experiments, and parallel systems: a discussion on computational theory of complex social-economic systems," Complex Syst. Complex. Sci., vol. 1, no. 4, pp. 25-35, Oct. 2004. http://en.cnki.com.cn/Article_en/CJFDTOTALFZXT200404001.htm 
[14]  F. Y. Wang, "Toward a paradigm shift in social computing: the ACP approach," IEEE Intell. Syst., vol. 22, no. 5, pp. 65-67, Sep.-Oct. 2007. http://ieeexplore.ieee.org/document/4338496/ 
[15]  F. Y. Wang, "Parallel control and management for intelligent transportation systems: concepts, architectures, and applications," IEEE Trans. Intell. Transp. Syst., vol. 11, no. 3, pp. 630-638, Sep. 2010. http://ieeexplore.ieee.org/document/5549912/ 
[16]  F. Y. Wang and S. N. Tang, "Artificial societies for integrated and sustainable development of metropolitan systems," IEEE Intell. Syst., vol. 19, no. 4, pp. 82-87, Jul.-Aug. 2004. http://ieeexplore.ieee.org/abstract/document/1333039/ 
[17]  F. Y. Wang, H. G. Zhang, and D. R. Liu, "Adaptive dynamic programming: an introduction," IEEE Comput. Intell. Mag., vol. 4, no. 2, pp. 39-47, May 2009. http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=4840325 
[18]  F. Y. Wang, "The emergence of intelligent enterprises: from CPS to CPSS," IEEE Intell. Syst., vol. 25, no. 4, pp. 85-88, Jul.-Aug. 2010. http://ieeexplore.ieee.org/document/5552591/ 
[19]  F. Y. Wang, N. N. Zheng, D. P. Cao, C. M. Martinez, L. Li, and T. Liu, "Parallel driving in CPSS: a unified approach for transport automation and vehicle intelligence," IEEE/CAA J. of Autom. Sinica, vol. 4, no. 4, pp. 577-587, Oct. 2017. http://ieeexplore.ieee.org/document/8039015/ 
[20]  K. F. Wang, C. Gou, and F. Y. Wang, "Parallel vision: an ACP-based approach to intelligent vision computing," Acta Automat. Sin., vol. 42, no. 10, pp. 1490-1500, Oct. 2016. http://www.aas.net.cn/EN/Y2016/V42/I10/1490 
[21]  P. Nyberg, E. Frisk, and L. Nielsen, "Driving cycle equivalence and transformation," IEEE Trans. Veh. Technol., vol. 66, no. 3, pp. 1963-1974, Mar. 2017. http://ieeexplore.ieee.org/document/7493605/ 
[22]  P. Nyberg, E. Frisk, and L. Nielsen, "Driving cycle adaption and design based on mean tractive force," in Proc. 7th IFAC Symp. Advanced Automatic Control, Tokyo, Japan, vol. 7, no. 1, pp. 689-694, 2013. http://www.researchgate.net/publication/271479464_Driving_Cycle_Adaption_and_Design_Based_on_Mean_Tractive_Force?ev=auth_pub 
[23]  D. P. Filev and I. Kolmanovsky, "Generalized Markov models for real-time modeling of continuous systems," IEEE Trans. Fuzzy Syst., vol. 22, no. 4, pp. 983-998, Aug. 2014. http://ieeexplore.ieee.org/document/6588289/ 
[24]  D. P. Filev and I. Kolmanovsky, "Markov chain modeling approaches for on board applications," in Proc. 2010 American Control Conf., Baltimore, MD, USA, pp. 4139-4145. http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=5530610 
[25]  T. Liu, X. S. Hu, S. E. Li, and D. P. Cao, "Reinforcement learning optimized look-ahead energy management of a parallel hybrid electric vehicle," IEEE/ASME Trans. Mechatron., vol. 22, no. 4, pp. 1497-1507, Aug. 2017. http://ieeexplore.ieee.org/document/7932983/ 
[26]  A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Netw., vol. 18, no. 5-6, pp. 602-610, Jul.-Aug. 2005. http://www.ncbi.nlm.nih.gov/pubmed/16112549 
[27]  Y. S. Lv, Y. J. Duan, W. W. Kang, Z. X. Li, and F. Y. Wang, "Traffic flow prediction with big data: a deep learning approach," IEEE Trans. Intell. Transp. Syst., vol. 16, no. 2, pp. 865-873, Apr. 2015. http://ieeexplore.ieee.org/document/6894591/ 
[28]  J. P. Zhang, F. Y. Wang, K. F. Wang, W. H. Lin, X. Xu, and C. Chen, "Data-driven intelligent transportation systems: a survey," IEEE Trans. Intell. Transp. Syst., vol. 12, no. 4, pp. 1624-1639, Dec. 2011. http://ieeexplore.ieee.org/document/5959985/ 
[29]  K. F. Wang, C. Gou, N. N. Zheng, J. M. Rehg, and F. Y. Wang, "Parallel vision for perception and understanding of complex scenes: methods, framework, and perspectives," Artif. Intell. Rev., vol. 48, no. 3, pp. 299-329, Oct. 2017. https://link.springer.com/article/10.1007%2Fs104620179569z 
[30]  W. Liu, Z. H. Li, L. Li, and F. Y. Wang, "Parking like a human: a direct trajectory planning solution," IEEE Trans. Intell. Transp. Syst., vol. 18, no. 12, pp. 3388-3397, Dec. 2017. http://ieeexplore.ieee.org/document/7902173/ 