^{2} School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China;
^{3} School of Computing & Engineering, University of Huddersfield, Queensgate, Huddersfield, HD1 3DH, UK;
^{4} Department of Computing, Sheffield Hallam University, Sheffield, S1 2NT, UK
In the last decade, the powerful parallel computing capabilities of graphics cards and GPUs, originally driven by market demand for real-time, high-definition game displays, have been widely adopted by research communities. Large-scale, data-intensive computational applications, such as filtration for areal surface characterization, visual recognition, and natural language processing (NLP), have benefited from this newly found and cost-effective computational powerhouse. It has also attracted increasing attention from researchers across the globe in devising general hardware-based acceleration models for real-world engineering challenges^{[1–3]}. Leading the trend, in 2007, NVIDIA released the Compute Unified Device Architecture (CUDA), a software framework aimed at unifying the efforts to harness GPU power for general-purpose usage and special applications. It has greatly simplified GPU programming practices while embracing the inherent data parallelism of GPU architectures. The toolkit has significantly enhanced the performance of some of the most common data and signal processing functions, such as the fast Fourier transform (FFT), Gaussian filtering, and the discrete wavelet transform (DWT), which are widely used in applications such as face detection, DNA sequencing, and, more recently, machine learning systems such as convolutional neural networks^{[4–6]}.
CUDA provides a scalable and integrated programming model for allocating and organizing processing threads and mapping them onto the computer hardware infrastructure, with the ability to adapt dynamically to all mainstream GPU architectures^{[7]}. In addition, CUDA provides a series of interfaces and APIs that support direct programming on GPUs instead of relying on graphics APIs (e.g., OpenGL) as in the so-called "GPGPU" era. CUDA treats the GPU as a standalone parallel computational device that can realize data processing algorithms by using C/C++-like programming routines and functions that are familiar to mainstream programmers and researchers.
Previous related works on parallelizing processes and data were mainly achieved by using a single GPU, which yielded moderate performance gains across the board. However, due to the limitations of data storage format and space (memory), as well as the fixed number of data streams available on a single GPU, previous works often struggle to fulfill the real-time requirements of many large-scale computational applications. This is especially problematic for the latest deep learning applications, which often need to process data sets tens of gigabytes (GB) in size (e.g., ImageNet)^{[8–10]}, not to mention online areal surface texture measurement tasks that process surface texture data through complex filtration with a large number of numerical tolerance parameters (e.g., processing a data set with multi-level DWT)^{[11, 12]}. In comparison, multi-GPU based acceleration solutions can flexibly achieve higher performance with relatively low hardware costs. Numerous computation-intensive problems that cannot be resolved with the single GPU model have been making steady progress in the context of multi-GPUs, e.g., multi-GPU based FFT^{[13]} and Gaussian filtering^{[14]}. In the meantime, several multi-GPU based programming libraries (e.g., MGPU)^{[15]} and MapReduce libraries (e.g., GPMR and HadoopCL)^{[16, 17]} have been developed by researchers in the field.
Fully utilizing the parallel computational power of multiple interconnected GPU nodes is a challenging task^{[18]}, especially for heterogeneous multi-GPU systems. An unbalanced load may cause low computational performance. To solve this problem, load balancing models that can intelligently allocate tasks to individual GPU nodes are becoming the key solutions. Chen et al.^{[19]} proposed a task-based dynamic load balancing (DLB) solution for multi-GPU systems that can achieve a near-linear speedup with an increasing number of GPU nodes. Acosta et al.^{[20]} developed a DLB functional library that aims to balance the load on each node. However, these pilot studies rely on the corresponding system runtime performance and on the assumption that all GPU nodes in a multi-GPU platform have equal computational capacity. In addition, the task-based load balancing schedulers that these approaches rely upon often fall short in supporting applications with huge data throughput but limited processing function(s), where there are very few "tasks" to schedule, e.g., DWT. These applications need more attention in refining the task partition in each computational iteration, taking data locality into account^{[18]}. In terms of data parallelism based load balancing schedulers, Acosta et al.^{[21]} presented a DLB model that dynamically balances the workload using information established by the first iteration of the computation, but it fails to respond to information changes during later iterations. In contrast, the strategies developed by Boyer et al. and Kaleem et al. collect system information during runtime^{[18, 22, 23]}, so they can support dynamic load balancing scheduling according to real-time feedback, which consolidates the foundation for this study.
To address the load balancing problem among multi-GPU nodes for large-scale applications with highly repetitive computational procedures or iterations, this paper presents a novel DLB model based on a fuzzy neural network (FNN) and data set division techniques for heterogeneous multi-GPU systems. This study extends our previous publication^{[24]}. In this study, five real-time state feedback parameters closely related to the computational performance of every GPU node are defined. They are capable of predicting the relative computational performance of each GPU node during system runtime. Using the constructed FNN and the devised data distribution method, a large data set can be adaptively divided to enhance the overall utilization of the computing power available in a heterogeneous multi-GPU system.
The rest of this paper is organized as follows. Section 2 presents a brief review of the preliminaries and related works in the field, on which the rationale of this research is based. The proposed FNN DLB model for multi-GPUs is explored and its features are discussed in Section 3. Section 4 constructs a case study that demonstrates how to improve the computational performance of the lifting scheme of DWT by using the devised model. Section 5 provides the test results of the design and evaluations. Finally, Section 6 concludes the research with future works.
2 Related studies
2.1 GPU architectures and process model
Modern GPUs are not only powerful graphics engines, but also highly parallel arithmetic and programmable processors. More significantly, in 2007, NVIDIA introduced the Tesla architecture, which was the first unified graphics and computing architecture. After that, NVIDIA released a series of GPU architectures, i.e., Fermi, Kepler, Maxwell, Pascal, and most recently, the Volta architecture. All GPU cards produced by NVIDIA in the last decade are based on these architectures. From the hardware point of view, all these models are similar but with incremental improvements in memory size and accessibility, overall processing power, and the number of streaming multiprocessors (SM), each of which contains multiple stream processors (SP, also named CUDA cores) and special-function units (SFU). Modern GPU architectures are based on a scalable processor array formed by SPs that provides a high-performance parallel computing platform.
CUDA is a parallel programming framework designed especially for general-purpose computing, and it greatly simplifies GPU programming practices. CUDA adopts an SPMD (single program, multiple data) programming model and provides a sophisticated memory hierarchy (i.e., register, local memory, shared memory, global memory, texture memory, constant memory, etc.). Hence, a GPU can achieve highly data-parallel computation through carefully designed CUDA code that uses the different memories efficiently according to the respective data features, including access mode, size and format.
The computational capacity of a single GPU can satisfy the computational demands of numerous applications, for example conventional image filtering and other transformation processes. However, it still falls short when processing complex tasks involving massive data sets, for example in video indexing and visual recognition, due to its limited memory space, instruction length, and execution loops. One perceived solution is to process large-volume data sets in a distributed mode on multi-GPUs. At present, there are two representative categories of multi-GPU platforms: the standalone computer type (a single CPU node with multiple GPU processors), and the cluster type (multiple CPU nodes, each accompanied by one or more GPU processors). In general, cluster computer systems require more complex communication and data transmission due to their commonly adopted peripheral component interconnect express (PCIe) architecture and network connections. Thus, standalone computers have been chosen in this research.
2.2 Fuzzy neural network
The artificial neural network (ANN) is a branch of artificial intelligence (AI) that was first inspired by the "understanding" of how the human brain processes data and summarizes patterns. In contrast with traditional methods that have to extract features from input data in a rigid and almost mechanical manner, ANN based models can automatically find features from training data, which is called "learning from data". One of the successful applications of ANN is deep learning (DL), based on a process model called the deep neural network. For example, Krizhevsky et al. presented AlexNet to classify images in ILSVRC 2012 (the ImageNet Large-scale Visual Recognition Challenge) and achieved the winning performance with a test error rate of 15.3%^{[8]}. AlexNet is widely considered the first successful DL model. Later, in 2015, He et al.^{[25]} presented a new DL model, ResNet, that won ILSVRC 2015 with an error rate of 3.6%.
Generally speaking, traditional fuzzy systems are built on IF-THEN rules (i.e., fuzzy rules) which are acquired from the experimental knowledge of domain experts. Fuzzy systems can solve complex decision-making problems when equipped with abundant fuzzy rules^{[26]}. Li et al.^{[27]} designed a fuzzy keyword search engine based on a fuzzy system for searching encrypted data over cloud sources, which overcame the difficulty traditional techniques had in matching keywords on the cloud. Krinidis et al.^{[28]} improved the fuzzy C-Means (FCM) algorithm and presented the fuzzy local information C-Means (FLICM) algorithm based on fuzzy set theory for image clustering. Compared with FCM, FLICM is more effective and efficient, and is more robust when clustering noisy images.
Both fuzzy theory and ANN have been widely used in decision-making applications. However, the main problem of traditional fuzzy systems is that it is very difficult to find experts who can extract and summarize knowledge from their experience, and the extracted IF-THEN rules are usually not objective, which means that traditional fuzzy models lack flexibility and robustness. Furthermore, ANN models are still inadequate in representing expert experience. To address these shortcomings, the fuzzy neural network was developed to combine rule-based fuzzy systems and ANNs. ANN models have thus been merged into fuzzy systems to improve their efficiency and accuracy, and FNN has been regarded as a promising model^{[29]}. Kuo et al.^{[30]} proposed an FNN based decision support system for intelligent suppliers that is able to consider both quantitative and qualitative factors. Chen et al.^{[31]} used FNN to approximate unknown functions in stochastic systems, which not only reduced the online computation load, but also achieved a significant performance enhancement for the fuzzy control algorithm.
Fuzzy theory and ANN based load balancing approaches have been widely used in traditional multi-CPU systems, e.g., distribution systems, data centers, and cloud computing applications. Saffar et al.^{[32]} presented a fuzzy optimal reconfiguration approach that combines fuzzy variables and an ant colony search method to balance the workloads on distribution systems. Susila et al.^{[33]} developed a fuzzy based firefly approach for dynamic load balancing in cloud computing systems. Toosi and Buyya^{[34]} proposed a fuzzy logic based DLB model for cost and energy efficiency. These prior works motivated the adoption of FNN for the multi-GPU load balancing applications investigated in this study.
In summary, based on the achievements of previous related works, adopting the FNN model is anticipated to be a feasible way to solve the DLB problem. This study explores and implements a novel data-oriented load balancing model by devising an FNN framework for large data sets with simple iterative tasks on heterogeneous multi-GPU systems.
2.3 Conventional multi-GPU strategies
Fig. 1 demonstrates a traditional load balancing model based on the pure data set division method^{[2]}, where: 1) A large raw data set is divided into n small chunks (subsets), where n is equal to the number of GPU nodes in the targeted multi-GPU system, and each data chunk is distributed to one GPU node; 2) Each GPU node processes the corresponding subset; 3) The final results are generated by merging the outputs of all GPU nodes. This approach is simple and useful. However, it is likely to cause an unbalanced load when the multi-GPU system contains different types of GPU hardware with unequal computational performance, known as a heterogeneous multi-GPU platform. As a result, the overall performance of a multi-GPU platform is restricted by the GPU node with the lowest computational capability, because the merging process is delayed until that node finishes.
Fig. 1. A traditional load balancing model based on the pure data division method 
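The bottleneck of the pure division method can be made concrete with a minimal C++ sketch. It is an illustration only: `equal_split`, `makespan` and the throughput figures are hypothetical and are not taken from the authors' implementation. The sketch divides a data set into n near-equal chunks and reports the time at which merging can start, which is dictated by the slowest node.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Pure data set division: split `total` items into n near-equal chunks.
// The first (total % n) chunks receive one extra item.
std::vector<std::size_t> equal_split(std::size_t total, std::size_t n) {
    assert(n > 0);
    std::vector<std::size_t> chunks(n, total / n);
    for (std::size_t i = 0; i < total % n; ++i) ++chunks[i];
    return chunks;
}

// Time until the merge step can start. throughput[i] is the number of items
// node i processes per second (assumed, illustrative figures).
double makespan(const std::vector<std::size_t>& chunks,
                const std::vector<double>& throughput) {
    assert(chunks.size() == throughput.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < chunks.size(); ++i) {
        assert(throughput[i] > 0.0);
        worst = std::max(worst, static_cast<double>(chunks[i]) / throughput[i]);
    }
    return worst;
}
```

For example, with 1000 items and two nodes that process 100 and 400 items per second, the equal split gives each node 500 items; the fast node is done after 1.25 s but the merge has to wait 5 s for the slow one.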
In a heterogeneous multi-GPU system, there are different types of GPUs with unequal computational performance. For example, the multi-GPU workstation used in this study has two GPU cards: a mid-to-low-end GPU (NVIDIA GTX 750 Ti) and a high-end GPU (NVIDIA GTX 1080). As the traditional data division method struggles to support heterogeneous multi-GPU systems, Acosta et al.^{[21]} developed a DLB library (named ULL_Calibrate_lib) for heterogeneous systems that aims to solve the task allocation problem. ULL_Calibrate_lib can balance tasks dynamically to adapt to system conditions during execution. This approach shows sound results for iterative operations, but performs poorly for applications with large data throughput and limited processing instructions, i.e., the problem of too few "tasks" for the scheduler. For example, in surface metrology, metrologists often apply DWT functions to extract the surface texture characteristics from large volumes of measured data^{[35]}. In these cases, a data-oriented load balancing model is more suitable than a task-focused one.
Boyer et al.^{[18]} explored a data-oriented DLB approach that supports GPU programs having a few kernels to process large volumes of data iteratively. The main idea of Boyer
$ \begin{equation} \label{eqn_1} {C_i} = \frac{D}{{{t_i}}}. \end{equation} $  (1) 
2) The host function divides the remaining data for each GPU node. Let
$ \begin{equation} \label{eqn_2} W{\rm{ = }}\sum\limits_i {{W_i}}. \end{equation} $  (2) 
In the balanced situation, all GPU nodes should finish their computations at the same time satisfying the following equation:
$ \begin{equation} \label{eqn_3} {C_1}{W_1}{\rm{ = }}{C_2}{W_2}. \end{equation} $  (3) 
According to (1) and (3),
$ \begin{equation} \label{eqn_4} {W_1}{\rm{ = }}\frac{{{t_1}{W_2}}}{{{t_2}}}. \end{equation} $  (4) 
One of the drawbacks of this load balancing model is that it is disputable whether the initial execution time can accurately predict the real computational ability of a GPU. More specifically, a modern GPU card can have hundreds or even thousands of CUDA cores, e.g., the NVIDIA GTX 750 Ti contains 640 cores, and the NVIDIA GTX 1080 has 2560 cores. As a result, a small data set may cause a low GPU utilization rate, which leads to an inaccurate performance prediction. For instance, in this study, we measured the execution time for processing a small surface measurement data set with DWT on each of these two GPUs. The experimental results show that the processing times of the two GPUs are almost the same, because neither can fully use its hardware resources when there is not enough data to process. In this case, the data allocated to each GPU node by (4) will be of the same size, which is no different from the pure data set division method (see Fig. 1). In addition, these previous load balancing models fail to respond to the fluctuations of computational performance that frequently occur on multi-GPU systems in the real world.
The DLB model proposed in this paper aims to predict the computational performance according to the real hardware conditions rather than by timing the processing of a small data set. In this way, it improves the accuracy of performance prediction and supports real-time response to fluctuations of computational performance.
3 Load balancing on heterogeneous multi-GPU systems
3.1 DLB idealism
To solve the unbalanced load problem and to respond to the fluctuation of computational performance in a heterogeneous multi-GPU system, this paper presents a novel DLB model, based on the FNN and the data set division method, for optimizing the overall parallel computational performance of large-scale data computations on multi-GPU systems while maintaining a good price-performance ratio. In this model, the original data set is divided into several equal-sized data units, and the scheduler organizes these data units into n groups, where n is equal to the number of GPU nodes in the specific multi-GPU platform (see Fig. 2). The number of data units assigned to each GPU node differs, and is determined by the real-time feedback (e.g., real-time computational performance and states) of that GPU node. Thus, the purpose of the data-oriented DLB model is to minimize the overall processing time by dynamically adjusting, at runtime, the number of data units in the group for each GPU node according to its real-time state feedback.
Fig. 2. The overall framework of the proposed data based DLB model 
3.2 Model and workflow
To describe the relationship between the real-time state feedback parameters and the number of data units assigned in a group to be "pushed" to a node, this model defines the relative computational ability
$ \begin{equation} \label{eqn_5} CP_i^n = f(\frac{{{D_{unit}}}}{{T_i^{unit}}}), \quad CP_i^n \in [0, 1], n = 0, 1, 2, \cdots \end{equation} $  (5) 
where
In the ideal load balancing situation, all GPU nodes in a multi-GPU system would finish their respective work at the same time; this idea is the same as Boyer
$ \begin{align} \label{eqn_6} {T_1}&{\rm{ = }}{T_2}{\rm{ = }}\cdots{\rm{ = }}{T_m} \nonumber \\ &\Rightarrow T_1^{unit} \times {W_1} = T_2^{unit} \times {W_2} = \cdots {\rm{ = }}T_m^{unit} \times {W_m} \end{align} $  (6) 
where
$ \begin{align} \label{eqn_7} T_1^{unit}& \times {W_1} = T_2^{unit} \times {W_2}\nonumber \\ &\Rightarrow {W_1} = \frac{{T_2^{unit} \times {W_2}}}{{T_1^{unit}}} \Rightarrow {W_1} = \frac{{{CP_2}^n \times {W_2}}}{{{CP_1}^n}}. \end{align} $  (7) 
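As a worked illustration of the balance condition (6), the following minimal C++ sketch distributes a fixed number of equal-sized data units so that the product of per-unit time and unit count is (nearly) equal on every node, i.e., each node receives a share proportional to the reciprocal of its per-unit time. It is a sketch under stated assumptions: the function name `balanced_units` is hypothetical, the per-unit times are taken as given inputs rather than being predicted by the FNN, and the rounding to whole units (largest fractional part first) is a choice made here, not one described in the paper.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Assign whole data units so that unit_time[i] * units[i] is nearly equal for
// all nodes (condition (6)); unit_time[i] plays the role of T_i^unit.
std::vector<std::size_t> balanced_units(std::size_t total_units,
                                        const std::vector<double>& unit_time) {
    const std::size_t n = unit_time.size();
    assert(n > 0);
    double speed_sum = 0.0;
    for (double t : unit_time) {
        assert(t > 0.0);
        speed_sum += 1.0 / t;
    }
    std::vector<std::size_t> units(n);
    std::vector<double> frac(n);
    std::size_t assigned = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // Ideal (fractional) share of node i, then rounded down.
        const double exact =
            static_cast<double>(total_units) * (1.0 / unit_time[i]) / speed_sum;
        units[i] = static_cast<std::size_t>(exact);
        frac[i] = exact - static_cast<double>(units[i]);
        assigned += units[i];
    }
    // Hand the leftover units to the nodes with the largest fractional parts.
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&frac](std::size_t a, std::size_t b) { return frac[a] > frac[b]; });
    for (std::size_t k = 0; assigned < total_units; ++k, ++assigned) {
        ++units[order[k % n]];
    }
    return units;
}
```

With per-unit times of 4 and 1 (arbitrary time units) and 1000 data units, the sketch assigns 200 and 800 units, so both nodes need 800 time units and finish together.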
The same calculation method can be extended to multiple GPU nodes by using (7). Based on (5)–(7), the complete procedure for dynamically calculating the number of data units for every GPU node in any multi-GPU platform during runtime can be defined as: 1) This DLB model conducts the initial prediction to get
$ \begin{equation} \label{eqn_8} T_i^r = T_i^{unit} \times \left( {W_i} - {W_i}' \right) \end{equation} $  (8) 
Fig. 3. The structure of FNN in the proposed data-oriented DLB model 
where
According to (7), it is convenient to divide data units and organize data groups for each GPU node when
To predict
After defining the fuzzy subsets, this research has designed a network structure of FNN that combines theories of the fuzzy mathematics and the back propagation mechanism from ANN to predict
Input layer. The input layer collects real-time states of a GPU node and generates values of the five state feedback parameters (see Table 1) as inputs when a prediction of
$ \begin{equation} \label{eqn_9} O_i^1 = I_i^1 = {x_i} \end{equation} $  (9) 
where
Fuzzy layer. The fuzzy layer transforms the exact (crisp) input values into fuzzy truth values by using a membership function. The input and output formulas are illustrated as
$ \begin{align} \label{eqn_10} I_i^2 = O_i^1\qquad\quad\qquad\qquad\;\nonumber \\ O_i^2 = {u_A}\left( {I_i^2} \right), O_i^2 \subset \left[{0, 1} \right] \end{align} $  (10) 
where
$ \begin{equation} \label{eqn_11} f(x) = \frac{1}{{1 + {\textrm{e}^{ - a(x - c)}}}} \end{equation} $  (11) 
where
$ \begin{equation} \label{eqn_12} \left\{ \begin{array}{c} UFL:{u_{UFL}}\left( {UF} \right) = \dfrac{1}{{1 + {\textrm{e}^{15 \times (UF - 0.5)}}}}\\ UFH:{u_{UFH}}(UF) = \dfrac{1}{{1 + {\textrm{e}^{ - 15 \times (UF - 0.5)}}}}. \end{array} \right. \end{equation} $  (12) 
According to (12), for instance, when a GPU
Fig. 4. Membership Functions of UFL and UFH 
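The two membership functions of (12) can be evaluated with a few lines of C++. This is a minimal sketch: the function names are hypothetical, and the precondition that the state parameter UF is normalized to [0, 1] is an assumption made here, suggested by the centre value 0.5 in (12).

```cpp
#include <cassert>
#include <cmath>

// Sigmoid membership function of (11) with slope a and centre c.
double sigmoid_membership(double x, double a, double c) {
    return 1.0 / (1.0 + std::exp(-a * (x - c)));
}

// "High" fuzzy subset UFH of (12): increases with UF.
double u_ufh(double uf) {
    assert(uf >= 0.0 && uf <= 1.0);  // assumed normalization of UF
    return sigmoid_membership(uf, 15.0, 0.5);
}

// "Low" fuzzy subset UFL of (12): the same curve with the slope negated.
double u_ufl(double uf) {
    assert(uf >= 0.0 && uf <= 1.0);  // assumed normalization of UF
    return sigmoid_membership(uf, -15.0, 0.5);
}
```

Because the two curves share the same centre and opposite slopes, the truth values of the "low" and "high" subsets always add up to one, and both equal 0.5 at UF = 0.5.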
Hidden layer. In principle, the more hidden layers a network has, the more complex the functions it can fit. However, more layers also bring the disadvantages of heavy computation and overfitting. Generally speaking, a single hidden layer meets the majority of requirements for prediction purposes^{[36]}. Thus, this load balancing model has only one hidden layer. The input and output formulas are defined as
$ \begin{align} \label{eqn_13} I_i^3 = \sum\limits_{j = 1}^n {{w_i}O_j^2 - {\theta _i}} \nonumber \\ O_i^3 = \varphi (I_i^3)\qquad\quad\; \end{align} $  (13) 
where
$ \begin{equation} \label{eqn_14} \varphi (x) = \frac{1}{{1 + \textrm{e}^{ - ax}}}. \end{equation} $  (14) 
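The hidden layer computation of (13) and (14) amounts to one weighted sum per neuron followed by a sigmoid activation. The C++ sketch below is illustrative only and rests on two assumptions that go beyond the text: each connection has its own weight, stored as `weights[i][j]` (from fuzzy-layer output j to hidden neuron i), and the threshold of each neuron is subtracted once from the weighted sum. All names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sigmoid activation of (14) with slope a.
double phi(double x, double a = 1.0) {
    return 1.0 / (1.0 + std::exp(-a * x));
}

// Forward pass of one layer: out[i] = phi(sum_j weights[i][j] * in[j] - theta[i]).
std::vector<double> hidden_forward(const std::vector<std::vector<double>>& weights,
                                   const std::vector<double>& theta,
                                   const std::vector<double>& in,
                                   double a = 1.0) {
    assert(weights.size() == theta.size());
    std::vector<double> out(weights.size());
    for (std::size_t i = 0; i < weights.size(); ++i) {
        assert(weights[i].size() == in.size());
        double net = -theta[i];  // threshold subtracted once per neuron (assumption)
        for (std::size_t j = 0; j < in.size(); ++j) {
            net += weights[i][j] * in[j];
        }
        out[i] = phi(net, a);
    }
    return out;
}
```

The same routine also covers the output layer, since it has the same form with its own weights and thresholds.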
Output layer. The output layer generates fuzzy truth values of the "high" and "low" fuzzy subsets of
$ \begin{align} \label{eqn_15} I_i^4 = \sum\limits_{j = 1}^m {{w_j}'O_j^3 - \theta _i'}\nonumber \\ O_i^4 = \varphi \left( {I_i^4} \right)\qquad\quad\; \end{align} $  (15) 
where
Decode layer. The decode layer is added in this network to transform the fuzzy truth values of the
$ \begin{align} \label{eqn_16} {I^5} = \sum\limits_{i = 1}^2 {w_i''O_i^4}\quad\;\nonumber \\ {CP_i} = {O^5} = \frac{I^5} {\sum\limits_{i = 1}^2 {O_i^4} }. \end{align} $  (16) 
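The decode step of (16) is a weighted average of the output-layer truth values. A minimal C++ sketch follows; the function name `decode_cp` and the weight values used in the example are hypothetical and serve only to show the arithmetic.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Decode layer of (16): weighted sum of the fuzzy truth values divided by
// their plain sum. truth[i] is an output-layer value, weight[i] its decode weight.
double decode_cp(const std::vector<double>& truth, const std::vector<double>& weight) {
    assert(!truth.empty() && truth.size() == weight.size());
    double weighted = 0.0;
    double total = 0.0;
    for (std::size_t i = 0; i < truth.size(); ++i) {
        weighted += weight[i] * truth[i];
        total += truth[i];
    }
    assert(total > 0.0);  // holds for sigmoid outputs, which are strictly positive
    return weighted / total;
}
```

For instance, truth values of 0.8 and 0.2 with illustrative decode weights of 1.0 and 0.0 give a decoded value of 0.8. When every weight lies in [0, 1], the result also stays in [0, 1], which matches the range required of the relative computational ability in (5).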
Based on the FNN structure illustrated in Fig. 3, the proposed load balancing model can be trained with the back propagation algorithm on training data collected from historical real-time state feedback (e.g., data processing time and the states of a GPU at a given point). After the model is trained, it can be used to predict
This data-oriented DLB model supports a wide variety of large-scale data computations. This research explores the LWT (lifting wavelet transform) computation for huge metrological data sets of surface textures as a case study to evaluate the validity and efficiency of the proposed model. Rooted in DWT, which is one of the fundamental filtration algorithms widely used in surface metrology, signal and image processing, biomedical visualization, and machine vision, LWT aims to improve computational efficiency through a lifting scheme, also referred to as the second generation wavelet^{[37]}.
The 1D forward LWT contains four operation steps: split, predict, update and scale^{[37]}.
Split. This step splits the original signal into two subsets of coefficients, i.e.,
$ \begin{equation} \label{eqn_17} \left\{ \begin{array}{c} even[i] = X[2i]\\ odd[i] = X[2i{\rm{ + }}1]. \end{array} \right. \end{equation} $  (17) 
Predict. The
$ \begin{equation} \label{eqn_18} odd = odd - P(even). \end{equation} $  (18) 
Update. Likewise,
$ \begin{equation} \label{eqn_19} even = even + U(odd). \end{equation} $  (19) 
Scale. Normalize
$ \begin{equation} \left\{ \begin{array}{c} evenApp = even \times ({1/ K})\\ oddDet = odd \times K \end{array} \right. \end{equation} $  (20) 
The inverse LWT with a lifting scheme is achieved by inverting the complete sequence of operation steps of forward LWT and switching the corresponding addition and subtraction operators. With the lifting scheme, the computational results of both forward and inverse LWT for arbitrary wavelet can be obtained through applying several steps of prediction and update operations and the final normalization with factor
Fig. 5. Main computational procedure of single-level 1D forward LWT 
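The four steps of (17)–(20) and their inversion can be sketched on the CPU in a few lines of C++. This is a deliberately simplified illustration, not the CDF 9/7 routine used later: the predict and update operators are assumed to be one-tap filters, P(even)[i] = p·even[i] and U(odd)[i] = u·odd[i], and the names `lift_forward`, `lift_inverse`, `p`, `u` and `k` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct LiftResult {
    std::vector<double> approx;  // evenApp in (20)
    std::vector<double> detail;  // oddDet in (20)
};

// Single-level 1D forward lifting with one-tap predict/update operators.
LiftResult lift_forward(const std::vector<double>& x, double p, double u, double k) {
    assert(x.size() % 2 == 0 && k != 0.0);
    const std::size_t half = x.size() / 2;
    LiftResult r{std::vector<double>(half), std::vector<double>(half)};
    for (std::size_t i = 0; i < half; ++i) {  // split, (17)
        r.approx[i] = x[2 * i];
        r.detail[i] = x[2 * i + 1];
    }
    for (std::size_t i = 0; i < half; ++i) r.detail[i] -= p * r.approx[i];  // predict, (18)
    for (std::size_t i = 0; i < half; ++i) r.approx[i] += u * r.detail[i];  // update, (19)
    for (std::size_t i = 0; i < half; ++i) {  // scale, (20)
        r.approx[i] *= 1.0 / k;
        r.detail[i] *= k;
    }
    return r;
}

// Inverse: run the steps in reverse order with additions and subtractions swapped.
std::vector<double> lift_inverse(const LiftResult& r, double p, double u, double k) {
    assert(r.approx.size() == r.detail.size() && k != 0.0);
    const std::size_t half = r.approx.size();
    std::vector<double> even(r.approx), odd(r.detail), x(2 * half);
    for (std::size_t i = 0; i < half; ++i) {  // undo scale
        even[i] *= k;
        odd[i] /= k;
    }
    for (std::size_t i = 0; i < half; ++i) even[i] -= u * odd[i];  // undo update
    for (std::size_t i = 0; i < half; ++i) odd[i] += p * even[i];  // undo predict
    for (std::size_t i = 0; i < half; ++i) {  // merge
        x[2 * i] = even[i];
        x[2 * i + 1] = odd[i];
    }
    return x;
}
```

With p = 1, u = 0.5 and k = 1, the signal {1, 3, 5, 7} yields the pairwise means {2, 6} as approximation and the pairwise differences {2, 2} as detail, and the inverse restores the original samples. A real wavelet such as CDF 9/7 would use neighbouring samples in P and U and several predict/update rounds, but the structure is the same.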
In the case of a 2D DWT, it simply needs to perform the horizontal 1D LWT for each row of a 2D input data set and the vertical 1D LWT for each corresponding column in sequence separately because a 2D LWT can be realized through the 1D wavelet transform along its
Fig. 6. Main computational procedure of multi-level 2D forward LWT 
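The separable structure of the 2D transform, a horizontal pass over every row followed by a vertical pass over every column, can be sketched generically in C++. The sketch is an assumption-laden illustration: `separable_2d` and `Transform1D` are hypothetical names, the data are assumed to be stored row-major, and any length-preserving 1D transform (for example a 1D forward LWT that returns the approximation half followed by the detail half) can be plugged in.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Any 1D transform that returns a vector of the same length as its input.
using Transform1D = std::function<std::vector<double>(const std::vector<double>&)>;

// Apply f to every row, then to every column, of a row-major rows x cols array.
void separable_2d(std::vector<double>& data, std::size_t rows, std::size_t cols,
                  const Transform1D& f) {
    assert(data.size() == rows * cols);
    std::vector<double> line(cols);
    for (std::size_t r = 0; r < rows; ++r) {  // horizontal pass
        for (std::size_t c = 0; c < cols; ++c) line[c] = data[r * cols + c];
        const std::vector<double> out = f(line);
        assert(out.size() == cols);
        for (std::size_t c = 0; c < cols; ++c) data[r * cols + c] = out[c];
    }
    line.assign(rows, 0.0);
    for (std::size_t c = 0; c < cols; ++c) {  // vertical pass
        for (std::size_t r = 0; r < rows; ++r) line[r] = data[r * cols + c];
        const std::vector<double> out = f(line);
        assert(out.size() == rows);
        for (std::size_t r = 0; r < rows; ++r) data[r * cols + c] = out[r];
    }
}
```

A multi-level transform would repeat the call on the top-left (approximation) quadrant of the previous level's result.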
The lifting scheme supports a variety of wavelet types, and in this case, the research has adopted the CDF 9/7 (Cohen-Daubechies-Feauveau 9/7) wavelet as an example. Table 2 lists the equations for a single-level forward LWT based on the CDF 9/7 wavelet, and its scheduling software routine on a GPU is illustrated in Algorithm 1. The basic idea is that every step of the lifting scheme is performed by a different function, and the CPU program schedules and launches these functions with respect to all data dependencies.
In the context of CUDA and multi-GPU architectures, the overall workflow of the LWT computation using the devised DLB model conforms to Fig. 2: the scheduler allocates initial data groups of a raw data set to each GPU node, and the GPU nodes then process the corresponding data groups with the LWT functions listed in Algorithm 1.
Algorithm 1. The scheduling software routine on a GPU node
void
//
//
//
gpu_split(d_even, d_odd, d_raw);
gpu_lwt_predict(d_even, d_odd, [alpha, alpha]);
gpu_lwt_update(d_even, d_odd, [beta, beta]);
gpu_lwt_predict(d_even, d_odd, [gamma, gamma]);
gpu_lwt_update(d_even, d_odd, [deta, deta]);
gpu_scale(d_even, d_odd, phi);
//
//
cudaMemcpy(evenApp, d_even, size, cudaMemcpyDeviceToHost);
cudaMemcpy(oddDet, d_odd, size, cudaMemcpyDeviceToHost);
}
5 Test and performance evaluation
5.1 Hardware and test environment
This section analyses the tests and evaluation results of the developed data-oriented DLB model. Table 3 specifies the computer system constructed for the tests, which contains two different types of GPU nodes: a mid-to-low-range GPU (NVIDIA GTX 750 Ti) and a high-end GPU (NVIDIA GTX 1080). The proposed model and LWT are realized by using CUDA C/C++ and CUDA Toolkit 8.
5.2 FNN training
The FNN can be trained end-to-end by back propagation and stochastic gradient descent (SGD). Since there are few open benchmarks or datasets for multi-GPU based load balancing models, this study has devised a customizable dataset containing the five state feedback parameters (see Table 1), the processing data size
$ \begin{equation} \label{eqn_21} CP = f(\frac{D}{T}). \end{equation} $  (21) 
We randomly initialized the weights of all four layers from a zero-mean Gaussian distribution, and trained the FNN in two steps: supervised pre-training followed by fine-tuning. The first step trained the FNN on 300 data items using SGD with a learning rate of 0.01. The fine-tuning step then continued SGD with a learning rate of 0.001 on 200 data items. With this two-step training strategy, the FNN based DLB model achieved a reliable prediction performance. The comprehensive evaluation of the devised DLB model is further discussed in the following subsections.
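The two-step schedule (300 items at a learning rate of 0.01, then 200 items at 0.001) can be illustrated with a toy C++ sketch. It is not the FNN training code: to stay short it replaces the network by a single-weight model y = w·x with a squared-error loss, and the names `Sample`, `sgd_pass` and `train_two_phase` are hypothetical. Only the learning-rate schedule and the item counts are taken from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Sample {
    double x;  // input
    double y;  // target
};

// One SGD sweep over samples[first, first + count) for the toy model y = w * x
// with loss 0.5 * (w * x - y)^2; returns the updated weight.
double sgd_pass(double w, const std::vector<Sample>& samples,
                std::size_t first, std::size_t count, double lr) {
    assert(first + count <= samples.size());
    for (std::size_t k = first; k < first + count; ++k) {
        const double err = w * samples[k].x - samples[k].y;
        w -= lr * err * samples[k].x;  // gradient of the loss with respect to w
    }
    return w;
}

// Two-step schedule from the text: pre-training on 300 items with lr = 0.01,
// then fine-tuning on the next 200 items with lr = 0.001.
double train_two_phase(const std::vector<Sample>& samples, double w0) {
    assert(samples.size() >= 500);
    const double w1 = sgd_pass(w0, samples, 0, 300, 0.01);
    return sgd_pass(w1, samples, 300, 200, 0.001);
}
```

The smaller learning rate in the second step makes smaller corrections per item, which is the usual motivation for a fine-tuning phase after coarse pre-training.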
5.3 Computation without the DLB model
This section reports test results and evaluates the computational performance of multi-level 2D LWT without applying any DLB models on both single GPU and multi-GPU platforms.
To begin with, this study tested and compared the processing time of a 2D LWT between a single GPU (using either GPU1 or GPU2 respectively) and two GPUs (using both GPU1 and GPU2) environment without employing any DLB models but the traditional division method to divide the data set for each GPU (see Section 2.3). This test performed 4 levels of forward 2D LWT with CDF (9, 7) wavelet on three data sizes
Fig. 7. The computational performance of three test settings 
5.4 Computation with the DLB model
Then, this study tested and compared the computational performance of the 2D LWT operation between the unbalanced implementation (i.e., each GPU node processes one half of a large data set without consideration of its performance variations) and the data-oriented DLB implementation using the FNN structure in the target multi-GPU system. The processing times of each implementation with different data sizes are shown in Fig. 8. The processing times of the two single GPU settings (i.e., using GPU1 only and GPU2 only) were also recorded for comparison. It can be seen from Fig. 8 that the computational performance of the unbalanced implementation shows no significant difference compared with the two single GPU settings. In contrast, the FNN based DLB implementation steadily improved the computational performance, i.e., it processed a very large data set (e.g.,
Fig. 8. Comparison of processing times between unbalanced and balancing implementations 
5.5 Benchmarking
Lastly, this study carried out a benchmarking test. There are several data-oriented DLB models on multi-GPUs, and the Boyer
Fig. 9. Comparison of processing time between the FNN based DLB model and Boyer's model 
6 Conclusions and future work
To fully utilize the parallel computational power of modern GPUs, this paper presents a novel data-oriented DLB model for multi-GPU systems based on an FNN structure and the corresponding data set division methods. The research started with an investigation and analysis of traditional load balancing models and identified their main drawbacks, e.g., their rigidity when dealing with heterogeneous node specifications and configurations. To alleviate the load balancing issues and to respond effectively to the runtime fluctuation of cluster performance, this research has proposed a novel data-oriented DLB model for balancing and optimizing the overall parallel computational performance across multi-GPU nodes. In this model, five state feedback parameters have been identified, and the FNN structure has been implemented to predict the relative computational performance in an adaptive manner. An improved scheduler can then be activated to automate the data allocation tasks according to the relative computational performance across all nodes in a cluster. Experimental results show that the proposed model can achieve a substantial computational performance gain when compared with conventional techniques, and that the FNN based dynamic model can address the runtime fluctuation issues effectively. The model and its corresponding techniques address the key challenges of large-scale computational applications that are often characterized by extremely large input volumes and highly repetitive operational procedures. Further work will focus on bridging the flexible FNN idealism across the GPU and CPU boundary, especially when facing the new computing device paradigm of cell CPUs, so as to progress towards a truly hybrid and efficient task-data distribution scheme for engineering applications.
References
[1] D. B. Kirk, W. W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach, 3rd ed. New York, USA: Morgan Kaufmann, 2016.
[2] R. Couturier. Designing Scientific Applications on GPUs. Boca Raton, USA: CRC Press, 2013.
[3] S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, D. Glasco. GPUs and the future of parallel computing. IEEE Micro, vol. 31, no. 5, pp. 7–17, 2011. DOI: 10.1109/MM.2011.89.
[4] C. W. Lee, J. Ko, T. Y. Choe. Two-way partitioning of a recursive Gaussian filter in CUDA. EURASIP Journal on Image and Video Processing, vol. 2014, no. 1, Article number 33, 2014. DOI: 10.1186/1687-5281-2014-33.
[5] J. A. Belloch, A. Gonzalez, F. J. Zaldívar Martínez, A. M. Vidal. Real-time massive convolution for audio applications on GPU. The Journal of Supercomputing, vol. 58, no. 3, pp. 449–457, 2011. DOI: 10.1007/s11227-011-0610
[6] F. Nasse, C. Thurau, G. A. Fink. Face detection using GPU-based convolutional neural networks. In Proceedings of the 13th International Conference on Computer Analysis of Images and Patterns, Münster, Germany, pp. 83–90, 2009. DOI: 10.1007/978-3-642-03767-2_10.
[7] NVIDIA. CUDA C Programming Guide v8.0. [Online], Available: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.htm, 2017.
[8] A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017. DOI: 10.1145/3065386.
[9]  C. Szegedy, W. Liu, Y. Q. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich. Going deeper with convolutions. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, USA, 2015. DOI: 10.1109/CVPR.2015.7298594. 
[10] E. Guerra, J. De Lara, A. Malizia, P. Díaz. Supporting user-oriented analysis for multi-view domain-specific visual languages. Information and Software Technology, vol. 51, no. 4, pp. 769–784, 2009. DOI: 10.1016/j.infsof.2008.09.005.
[11] X. J. Jiang, D. J. Whitehouse. Technological shifts in surface metrology. CIRP Annals, vol. 61, no. 2, pp. 815–836, 2012. DOI: 10.1016/j.cirp.2012.05.009.
[12] J. J. Wang, W. L. Lu, X. J. Liu, X. Q. Jiang. High-speed parallel wavelet algorithm based on CUDA and its application in three-dimensional surface texture analysis. In Proceedings of International Conference on Electric Information and Control Engineering, IEEE, Wuhan, China, pp. 2249–2252, 2011. DOI: 10.1109/ICEICE.2011.5778225.
[13]  S. Chen, X. M. Li. A hybrid GPU/CPU FFT library for large FFT problems. In Proceedings of the 32nd International Performance Computing and Communications Conference, IEEE, San Diego, USA, 2013. DOI: 10.1109/PCCC.2013.6742796. 
[14] C. L. Zhang, Y. P. Xu, J. He, J. Lu, L. Lu, Z. J. Xu. Multi-GPUs Gaussian filtering for real-time big data processing. In Proceedings of the 10th International Conference on Software, Knowledge, Information Management & Applications, IEEE, Chengdu, China, 2016. DOI: 10.1109/SKIMA.2016.7916225.
[15] S. Schaetz, M. Uecker. A multi-GPU programming library for real-time applications. In Proceedings of the 12th International Conference on Algorithms and Architectures for Parallel Processing, Fukuoka, Japan, pp. 231–236, 2012. DOI: 10.1007/978-3-642-33078-0_9.
[16] J. A. Stuart, J. D. Owens. Multi-GPU MapReduce on GPU clusters. In Proceedings of 2011 IEEE International Parallel & Distributed Processing Symposium, IEEE, Anchorage, USA, pp. 1068–1079, 2011. DOI: 10.1109/IPDPS.2011.102.
[17] M. Grossman, M. Breternitz, V. Sarkar. HadoopCL: MapReduce on distributed heterogeneous platforms through seamless integration of Hadoop and OpenCL. In Proceedings of the 27th Parallel and Distributed Processing Symposium Workshops & PhD Forum, IEEE, Cambridge, MA, USA, pp. 1918–1927, 2013. DOI: 10.1109/IPDPSW.2013.246.
[18]  M. Boyer, K. Skadron, S. Che, N. Jayasena. Load balancing in a changing world: Dealing with heterogeneity and performance variability. In Proceedings of ACM International Conference on Computing Frontiers, Ischia, Italy, 2013. DOI: 10.1145/2482767.2482794. 
[19] L. Chen, O. Villa, S. Krishnamoorthy, G. R. Gao. Dynamic load balancing on single- and multi-GPU systems. In Proceedings of IEEE International Symposium on Parallel & Distributed Processing, IEEE, Atlanta, USA, 2010. DOI: 10.1109/IPDPS.2010.5470413.
[20] A. Acosta, R. Corujo, V. Blanco, F. Almeida. Dynamic load balancing on heterogeneous multicore/multi-GPU systems. In Proceedings of International Conference on High Performance Computing and Simulation, IEEE, Caen, France, pp. 467–476, 2010. DOI: 10.1109/HPCS.2010.5547097.
[21] A. Acosta, V. Blanco, F. Almeida. Towards the dynamic load balancing on heterogeneous multi-GPU systems. In Proceedings of the 10th IEEE International Symposium on Parallel and Distributed Processing with Applications, IEEE, Leganes, Spain, pp. 646–653, 2012. DOI: 10.1109/ISPA.2012.96.
[22] B. Pérez, E. Stafford, J. L. Bosque, R. Beivide. Energy efficiency of load balancing for data-parallel applications in heterogeneous systems. The Journal of Supercomputing, vol. 73, no. 1, pp. 330–342, 2017. DOI: 10.1007/s11227-016-1864-y.
[23] R. Kaleem, R. Barik, T. Shpeisman, C. L. Hu, B. T. Lewis, K. Pingali. Adaptive heterogeneous scheduling for integrated GPUs. In Proceedings of the 23rd International Conference on Parallel Architecture and Compilation Techniques, IEEE, Edmonton, Canada, pp. 151–162, 2014. DOI: 10.1145/2628071.2628088.
[24] C. L. Zhang, Y. P. Xu, J. L. Zhou, Z. J. Xu, L. Lu, J. Lu. Dynamic load balancing on multi-GPUs system for big data processing. In Proceedings of the 23rd International Conference on Automation and Computing, IEEE, Huddersfield, UK, 2017. DOI: 10.23919/IConAC.2017.8082085.
[25] K. M. He, X. Y. Zhang, S. Q. Ren, J. Sun. Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, USA, pp. 770–778, 2016. DOI: 10.1109/CVPR.2016.90.
[26] H. Zermane, H. Mouss. Development of an internet and fuzzy based control system of manufacturing process. International Journal of Automation and Computing, vol. 14, no. 6, pp. 706–718, 2017. DOI: 10.1007/s11633-016-1027-x.
[27] J. Li, Q. Wang, C. Wang, N. Cao, K. Ren, W. J. Lou. Fuzzy keyword search over encrypted data in cloud computing. In Proceedings of IEEE Conference on Computer Communications, IEEE, San Diego, CA, USA, pp. 1–5, 2010. DOI: 10.1109/INFCOM.2010.5462196.
[28] S. Krinidis, V. Chatzis. A robust fuzzy local information C-means clustering algorithm. IEEE Transactions on Image Processing, vol. 19, no. 5, pp. 1328–1337, 2010. DOI: 10.1109/TIP.2010.2040763.
[29] M. Algabri, H. Mathkour, H. Ramdane. Mobile robot navigation and obstacle-avoidance using ANFIS in unknown environment. International Journal of Computer Applications, vol. 91, no. 14, pp. 36–41, 2014. DOI: 10.5120/15952-5400.
[30] R. J. Kuo, S. Y. Hong, Y. C. Huang. Integration of particle swarm optimization-based fuzzy neural network and artificial neural network for supplier selection. Applied Mathematical Modelling, vol. 34, no. 12, pp. 3976–3990, 2010. DOI: 10.1016/j.apm.2010.03.033.
[31] C. L. P. Chen, Y. J. Liu, G. X. Wen. Fuzzy neural network-based adaptive control for a class of uncertain nonlinear stochastic systems. IEEE Transactions on Cybernetics, vol. 44, no. 5, pp. 583–593, 2014. DOI: 10.1109/TCYB.2013.2262935.
[32] A. Saffar, R. Hooshmand, A. Khodabakhshian. A new fuzzy optimal reconfiguration of distribution systems for loss reduction and load balancing using ant colony search-based algorithm. Applied Soft Computing, vol. 11, no. 5, pp. 4021–4028, 2011. DOI: 10.1016/j.asoc.2011.03.003.
[33] N. Susila, S. Chandramathi, R. Kishore. A fuzzy-based firefly algorithm for dynamic load balancing in cloud computing environment. Journal of Emerging Technologies in Web Intelligence, vol. 6, no. 4, pp. 435–440, 2014. DOI: 10.4304/jetwi.6.4.435-440.
[34] A. N. Toosi, R. Buyya. A fuzzy logic-based controller for cost and energy efficient load balancing in geo-distributed data centers. In Proceedings of the 8th IEEE/ACM International Conference on Utility and Cloud Computing, IEEE, Limassol, Cyprus, pp. 186–194, 2015. DOI: 10.1109/UCC.2015.35.
[35] H. Muhamedsalih, X. Jiang, F. Gao. Accelerated surface measurement using wavelength scanning interferometer with compensation of environmental noise. Procedia CIRP, vol. 10, pp. 70–76, 2013. DOI: 10.1016/j.procir.2013.08.014.
[36] S. H. Lee, J. S. Lim. Forecasting KOSPI based on a neural network with weighted fuzzy membership functions. Expert Systems with Applications, vol. 38, no. 4, pp. 4259–4263, 2011. DOI: 10.1016/j.eswa.2010.09.093.
[37] W. Sweldens. The lifting scheme: A construction of second generation wavelets. SIAM Journal on Mathematical Analysis, vol. 29, no. 2, pp. 511–546, 1998. DOI: 10.1137/S0036141095289051.
[38] S. Mittal, J. S. Vetter. A survey of CPU-GPU heterogeneous computing techniques. ACM Computing Surveys, vol. 47, no. 4, Article number 69, 2015. DOI: 10.1145/2788396.