IEEE/CAA Journal of Automatica Sinica, 2017, Vol. 4, Issue 4: 668-676
A Facial Expression Emotion Recognition Based Human-robot Interaction System
Zhentao Liu, Min Wu, Weihua Cao, Luefeng Chen, Jianping Xu, Ri Zhang, Mengtian Zhou, Junwei Mao
School of Automation, China University of Geosciences, and with the Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan 430074, China
Abstract: A facial expression emotion recognition based human-robot interaction (FEER-HRI) system is proposed, for which a four-layer system framework is designed. The FEER-HRI system enables robots not only to recognize human emotions, but also to generate facial expressions that adapt to human emotions. A facial emotion recognition method based on the 2D-Gabor filter, the uniform local binary pattern (LBP) operator, and a multiclass extreme learning machine (ELM) classifier is presented and applied to real-time facial expression recognition for robots. Facial expressions of robots are represented by simple cartoon symbols displayed on an LED screen mounted on the robots, which humans can easily understand. Four scenarios, i.e., guiding, entertainment, home service, and scene simulation, are performed in the human-robot interaction experiment, in which smooth communication is realized by facial expression recognition of humans and facial expression generation of robots within 2 seconds. As prospective applications, the FEER-HRI system can be applied in home service, smart home, safe driving, and so on.
Key words: Emotion generation; facial expression emotion recognition (FEER); human-robot interaction (HRI); system design
Ⅰ. INTRODUCTION

To make interaction with robots easier and more natural, people have put forward new demands on human-robot interaction (HRI) [1], [2]. It is hoped that robots can recognize human facial expressions, understand emotions, and give appropriate responses [3]-[6]. Emotionally intelligent robots have attracted great attention in recent years. Research on emotional robots involves many fields, such as computer science and psychology [7]-[9]. However, current research is still at a preliminary stage, and there are only a few intelligent service systems with emotion. The Mascot Robot System, which includes five eye robots, was proposed in [10]-[12]; its eye robots achieve friendly interaction with humans through eye-rolling and speech recognition. A face robot called KAPPA is introduced in [13], which can recognize emotions from facial expressions and generate six basic emotions. The Minotaurus robot system is introduced in [14], where a smart human-robot interaction environment is built and the robot can interact with users by gestures, speech, and facial expressions. Although several human-robot interaction systems involve the emotion of robots, only a few researchers study both emotion recognition and emotion expression by robots to facilitate smooth communication between humans and robots.

A facial expression emotion recognition based human-robot interaction (FEER-HRI) system is proposed, which is a sub-system of the multi-modal emotional communication based human-robot interaction (MEC-HRI) system [15]. The MEC-HRI system contains three NAO robots, two mobile robots, a Kinect, a workstation, a server, an eye tracker, a portable electroencephalograph (EEG), and other intelligent devices. The FEER-HRI system is designed primarily for two targets: one is the robot's ability to recognize human emotions based on facial expressions, and the other is the robot's ability to generate emotions for emotional communication with humans, in place of the unemotional communication of traditional systems.

The operation of the system consists of three steps. Firstly, the robot collects human face image data through the Kinect and transmits it to the workstation. Secondly, the facial expression recognition method based on the extreme learning machine (ELM) is used to recognize users' emotions, and the system then generates the robots' facial expressions to adapt to users. Thirdly, the system transmits the affective control signal to the robot, and the robot responds to users by displaying its own facial expressions, which are made up of basic cartoon symbols.

The remainder of this paper is organized as follows. The architecture of the MEC-HRI system and the scenario design are presented in Section Ⅱ. The facial expression feature extraction method is briefly introduced in Section Ⅲ. The experimental setup and results are given in Section Ⅳ.

Ⅱ. ARCHITECTURE OF MEC-HRI SYSTEM

The MEC-HRI system realizes multi-modal emotional communication through speech, facial expressions, body gestures, etc. Emotional robots and emotional information acquisition equipment/sensors are connected to a workstation, in which emotion recognition algorithms for facial expressions are embedded. In the MEC-HRI system, hardware devices can be extended and algorithms can be improved.

A. Hierarchical Structure of MEC-HRI System

The hierarchy of the MEC-HRI system is divided into four layers. From bottom to top, these are the hardware layer, physical interface layer, data processing layer, and application layer, as shown in Fig. 1. The hardware layer is used to capture humans' emotional signals and express robots' emotions; its sensor module is responsible for data collection and preprocessing, while its actuator module is responsible for interaction with users. For instance, a high-resolution camera is used to capture real-time pictures of facial expressions and body gestures; microphones are used to collect speech signals; an eye tracker and other motion tracking devices are used to acquire motion information. The sensor module can be extended based on specific system requirements. For example, wearable equipment such as a smart glove, an intelligent heart-rate belt, and an EEG can be used to detect physiological data of the human body for emotion recognition. The actuator module is used to control the robot to interact with the user based on the emotional analysis and behavior instructions from upper layers. Interaction equipment such as NAO robots, mobile robots, facial expression interactive software, and mobile terminals can be used to extend this module.

The physical interface layer provides the channel for data transmission and is the bridge between software and hardware in the MEC-HRI system. The network module is responsible for initializing the network and communicating with each module. The data processing layer is the key part of the system and provides the following functions.

1) Correlation analysis and feature extraction of speech, facial expressions, gestures, and physiological signals.

2) Multi-modal information fusion based on the two-layer fusion structure, i.e., feature level and decision level.

3) Beyond basic emotion recognition, recognizing humans' emotional states and other deep cognitive information during the interaction.

4) Obtaining the robots' operating instructions through the multi-robot behavioral adaptation mechanism.

The interactive application layer is the highest layer of the system and provides a variety of interaction modes, such as speech, facial expressions, gestures, and multi-modal interaction. In addition, the user can choose between two interactive objects: a physical robot, or a virtual robot presented through a graphical interface.

B. Scenario Design

Four scenarios including guiding, entertainment, home service, and scene simulation are designed as shown in Fig. 2.

In the guiding region, there is a mobile robot that guides and interacts with users. When users enter the robot's field of view, the robot welcomes them according to their historical data. For a new user, the robot shows a happy expression on the LED screen, and then talks with the user and guides them.

In the entertainment region, there are two NAO robots that can play a finger guessing game with users. The camera of the NAO robot captures pictures and transmits them to the workstation, where users' gestures are recognized and the game result is judged. The NAO robot expresses emotions according to the game result by speech, facial expressions, and gestures. Different users can also play this game with each other while NAO acts as a spectator, cheering for the winner and encouraging the loser.

In the home service region, there are three NAO robots. Emotional communication between multiple humans and multiple robots can be carried out here. NAO robots can provide services for the elderly, the disabled, and children. When an elderly person is watching TV, the MEC-HRI system monitors their health condition through wearable sensors and talks to them. In addition, the Kinect can recognize children's gestures and the sign language of the disabled.

In the scene simulation region, scenarios such as a coffee bar can be simulated. Users can drink coffee here and talk with robots casually. Robots recognize users' emotions through multi-modal information. Meanwhile, the MEC-HRI system can change the background music to adjust the atmosphere.

Ⅲ. FACIAL EXPRESSION EMOTION RECOGNITION

In order to communicate with users, the FEER-HRI system needs to collect and analyze facial information. Facial expression images of the user are acquired by the Kinect mounted on the mobile robot and transmitted to the mobile robot through a USB port. They are then sent over the WLAN to the workstation for image processing. Finally, the users' emotional states are obtained, and the system adapts to users in accordance with their emotions. Considering different cultural backgrounds and people's subjective understanding of emotions, facial expressions are divided into six categories: happy, angry, surprise, fear, disgust, and sad [16]. Furthermore, when different classifications of facial expressions are compared with each other, these six categories are considered more universal [17]. Therefore, facial expressions are divided into seven basic categories in this paper, i.e., the six above plus neutral: happy, angry, surprise, fear, disgust, sad, and neutral. An approach to facial expression recognition using multi-feature extraction, which can improve the accuracy of the classifier, consists of three parts [18]: feature collection, feature extraction, and emotion recognition. This process is summarized in Fig. 3.

Fig. 3 Process of emotion recognition by facial expression.

Firstly, images are preprocessed using face detection [19] and segmentation. The facial images are then divided into three regions of interest (ROI), i.e., eyes, nose, and mouth, which contain most of the facial emotion features. Secondly, facial expression features are extracted using the 2D-Gabor filter [20] and the uniform LBP operator [21]. The 2D-Gabor filter is robust against illumination changes and pose rotation in face images. Moreover, it requires little computation, performs well in real time, and can extract local features at different scales and in different directions. Fig. 4 shows the real part of the 2D-Gabor filters at five scales and eight directions. When a face image is filtered by these 2D-Gabor filters, the energy of other texture features is suppressed, and only the texture features corresponding to each filter's characteristic frequency pass through. The texture features are composed of the outputs of all the 2D-Gabor filters. Fig. 5 shows the amplitude spectrum of the segmented eye image after 2D-Gabor feature extraction.

Fig. 4 The real part of the 2D-Gabor filters at five scales and eight directions with the following parameters: $\sigma =2\pi$, $k_{\max}=\pi/2$, and $f=\sqrt{2}$.
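As a rough illustration of such a filter bank, the following Python sketch builds 40 complex kernels (five scales, eight directions) and computes their amplitude responses. The paper does not spell out its kernel formula here, so the sketch assumes the Gabor wavelet form commonly used in face analysis, $\psi(z)=\frac{\|k\|^2}{\sigma^2}e^{-\|k\|^2\|z\|^2/(2\sigma^2)}\left[e^{ik\cdot z}-e^{-\sigma^2/2}\right]$ with $k=(k_{\max}/f^{v})e^{i\pi u/8}$. The value $k_{\max}=\pi/2$ is an assumed reading of the caption, and the kernel size of 31 pixels and the function names are hypothetical choices.

```python
import numpy as np

def gabor_kernel(scale, orientation, size=31, sigma=2 * np.pi,
                 k_max=np.pi / 2, f=np.sqrt(2), n_orientations=8):
    """Complex 2D-Gabor kernel for one scale index and one orientation index."""
    k = k_max / f ** scale                       # radial frequency at this scale
    phi = np.pi * orientation / n_orientations   # direction of the wave vector
    kx, ky = k * np.cos(phi), k * np.sin(phi)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    envelope = (k ** 2 / sigma ** 2) * np.exp(-(k ** 2) * (x ** 2 + y ** 2) / (2 * sigma ** 2))
    # Complex carrier minus a constant that removes the DC response.
    carrier = np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma ** 2 / 2)
    return envelope * carrier

def gabor_bank(n_scales=5, n_orientations=8, **kwargs):
    """Nested list bank[scale][orientation] of complex kernels."""
    return [[gabor_kernel(v, u, n_orientations=n_orientations, **kwargs)
             for u in range(n_orientations)] for v in range(n_scales)]

def gabor_magnitudes(image, bank):
    """Amplitude response to every kernel, using FFT-based circular convolution."""
    image = np.asarray(image, dtype=float)
    spectrum = np.fft.fft2(image)
    responses = []
    for row in bank:
        for kernel in row:
            half = kernel.shape[0] // 2
            filtered = np.fft.ifft2(spectrum * np.fft.fft2(kernel, s=image.shape))
            # Undo the shift caused by the kernel centre sitting at index `half`.
            filtered = np.roll(filtered, shift=(-half, -half), axis=(0, 1))
            responses.append(np.abs(filtered))
    return np.stack(responses)
```

Calling `gabor_magnitudes(roi, gabor_bank())` on a grayscale ROI of at least 31 by 31 pixels returns an array of 40 amplitude maps, one per scale and direction. Circular convolution wraps around the image border, so a real pipeline would likely pad the image first.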

The LBP describes image texture features and is widely used in image processing. The LBP operator compares each pixel with its neighboring pixels and stores the results as a binary number. It is one of the best-performing methods for texture description. In addition, it is computationally efficient and robust against image offset and illumination changes. Faces often move, and face images are easily affected by light from different directions. Therefore, the LBP operator is well suited to feature extraction from facial images. Moreover, since a face can be seen as a composition of local features, the LBP operator describes it well.
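The comparison step can be sketched in a few lines of Python. This is a minimal version of the basic operator on a $3\times 3$ neighborhood (eight neighbors at radius one); the neighbor ordering, the use of $\geq$ for ties, and the function name are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def lbp_basic(image):
    """8-neighbour LBP code for every interior pixel of a 2D grayscale array."""
    img = np.asarray(image, dtype=float)
    h, w = img.shape
    center = img[1:-1, 1:-1]
    # Neighbour offsets, clockwise from top-left; neighbour i sets bit i.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # A neighbour at least as bright as the centre contributes a 1 bit.
        codes |= (neighbour >= center).astype(np.uint8) << bit
    return codes
```

The output is two pixels smaller than the input in each dimension because border pixels lack a full neighborhood. A histogram of the codes over each ROI then serves as the texture descriptor.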

However, the basic LBP operator produces too many kinds of binary patterns. As a result, the LBP histogram is too sparse to describe the texture feature effectively [22]. Excessive binary patterns also occupy more storage space and reduce computational efficiency. To solve this problem, the uniform LBP operator is used, which reduces the number of patterns from $2^p$ to $p(p-1)+2$, where $p$ is the number of sampling points in the neighborhood. This significantly improves the performance of LBP. Fig. 6 shows the face image processed by the uniform LBP operator.
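The pattern count can be checked by enumeration. A pattern is uniform when its circular bit string has at most two 0/1 transitions. The sketch below counts such patterns and builds a lookup table from raw codes to histogram bins. Pooling all non-uniform codes into one extra bin is the usual convention and is assumed here; the function names are hypothetical.

```python
def transitions(code, p=8):
    """Number of 0/1 changes in the circular p-bit representation of code."""
    bits = [(code >> i) & 1 for i in range(p)]
    return sum(bits[i] != bits[(i + 1) % p] for i in range(p))

def uniform_patterns(p=8):
    """All p-bit codes with at most two circular transitions."""
    return [c for c in range(2 ** p) if transitions(c, p) <= 2]

def uniform_lookup(p=8):
    """Map each raw code to a bin; every non-uniform code shares the last bin."""
    uniform = uniform_patterns(p)
    index = {c: i for i, c in enumerate(uniform)}
    return [index.get(c, len(uniform)) for c in range(2 ** p)]
```

For $p=8$, `len(uniform_patterns(8))` gives 58, which matches $p(p-1)+2$, so the histogram shrinks from 256 bins to 59 once the shared non-uniform bin is included.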

2D-Gabor cannot capture the subtle changes in each direction and frequency of the texture feature [21], whereas the LBP operator can extract local texture features. Combining the two methods integrates the advantages of both: it not only extracts features at multiple scales and in multiple directions but also preserves the local features of the face image. In addition, it reduces the dimension of the data so that computational efficiency is improved. The two methods also make up for each other's deficiencies. The filtering process of the 2D-Gabor wavelet transform effectively reduces the influence of noise on the LBP operator, and the uniform LBP operator enhances the local texture characteristics of the 2D-Gabor wavelet transform.

Figs. 7 and 8 show the overall process of facial emotion recognition. Face features are extracted using the method combining 2D-Gabor and LBP. Principal component analysis (PCA) is then used to remove redundant features, which increases computational efficiency. The processed facial features are divided into two parts, one for training and the other for testing. Since emotions are divided into seven categories, a multiclass ELM classifier is used for emotion recognition. Our previous work [23] verified that ELM has distinct characteristics in facial expression recognition compared with other multi-class classification methods: it computes quickly, with modeling and recognition usually taking less than 0.1 second, while the recognition rate is usually above 80%. As a result, ELM is adopted for the FEER-HRI system, as it meets the requirement of real-time facial expression recognition.
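The speed of ELM comes from its training rule: the hidden-layer weights are drawn at random and left fixed, and only the output weights are solved in closed form. The following Python sketch shows a generic multiclass ELM of this kind. The number of hidden units, the tanh activation, the ridge term `reg`, and the class name are all assumptions made for illustration; the paper's own configuration is described in [23] and [25].

```python
import numpy as np

class ELMClassifier:
    """Single-hidden-layer ELM: random fixed hidden layer, closed-form output weights."""

    def __init__(self, n_hidden=100, reg=1e-3, seed=0):
        self.n_hidden = n_hidden
        self.reg = reg                      # small ridge term for numerical stability
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return np.tanh(np.asarray(X, dtype=float) @ self.W + self.b)

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        self.classes_, y_idx = np.unique(np.asarray(y), return_inverse=True)
        T = np.eye(len(self.classes_))[y_idx]            # one-hot targets
        # Random input weights and biases are never trained.
        self.W = self.rng.standard_normal((X.shape[1], self.n_hidden))
        self.b = self.rng.standard_normal(self.n_hidden)
        H = self._hidden(X)
        # Regularised least squares for the output weights.
        A = H.T @ H + self.reg * np.eye(self.n_hidden)
        self.beta = np.linalg.solve(A, H.T @ T)
        return self

    def predict(self, X):
        scores = self._hidden(X) @ self.beta
        return self.classes_[np.argmax(scores, axis=1)]
```

Training reduces to one matrix product and one linear solve, with no iterative weight updates. For the seven-emotion task, `y` would hold the seven class labels and `X` the PCA-reduced feature vectors.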

Ⅳ. EXPERIMENTS ON FEER-HRI SYSTEM

The proposed facial expression recognition algorithm is successfully applied in the FEER-HRI system, which can recognize users' emotions promptly and accurately.

A. Experimental Setup

The MEC-HRI system consists of three NAO robots, two mobile robots, a Kinect, an eye tracker, two high-performance computers (i.e., a server and a workstation), a portable EEG, wearable sensing devices, and data transmission and network-connecting devices. The topology of the MEC-HRI system is shown in Fig. 9.

The two high-performance computers in the system are HP Z840 workstations equipped with two NVIDIA Tesla K40 accelerator cards. This setup achieves 9 Tflops in double-precision floating point and reaches the best configuration of $1:1$ (CPU:GPU), giving faster computing than a general-purpose computer. These advantages enhance the efficiency of affective computing and reduce computing time to ensure smooth human-robot interaction.

When the MEC-HRI system is built up, both the NAO robots and the mobile robots are connected to the wireless router via Wi-Fi. The eye tracker, Kinect, and wearable sensing devices are connected via USB and Wi-Fi to the mobile workstation, which is responsible for capturing emotional information and controlling devices. The mobile workstation and wireless router are connected to the server and workstation via a hub. NAO robots can capture video images and audio data for recognizing human emotions. In turn, NAO robots can express their own emotions through speech, body gestures, and movement according to human emotions.

Fig. 10 shows the structure of the mobile robot, which is mainly composed of an industrial personal computer (IPC), a Kinect, a touch screen, a $16\times 32$ LED screen, and so on. The mobile robot can move in four directions, and its chassis is equipped with laser sensors for obstacle avoidance. The speech synthesis software and microphone in the IPC are used for the mobile robot's emotion expression. The Kinect is a 3D somatosensory camera that can capture dynamic images and perform image recognition and voice recognition. It has an RGB camera and a depth camera, which provide facial recognition and 3D motion capture. The camera resolution is $640\times 480$, and it outputs one face image every 4 ms-8 ms. It works well at distances from 1.2 m to 3.5 m.

The LED screen displays the facial expressions of the mobile robot. Building on humans' seven basic expressions, nine kinds of facial expressions, i.e., happy, angry, disgust, fear, neutral, sad, surprise, doubtful, and pitiful, are designed for the mobile robot. These expressions can fully reflect the emotional state of robots in the process of human-robot interaction. In the FEER-HRI system, facial expressions of robots are represented by simple cartoon symbols, which convey each expression vividly and are easily understood by humans.

Fig. 11 shows the nine facial expressions displayed on the LED screen. Each pattern in Fig. 11 corresponds to a facial expression. For example, a pattern with two opposite triangles represents anger; a pattern with two love images represents happiness; a pattern with two question marks represents doubt; a pattern with two symmetrical check marks represents sadness; a pattern with two inverted U-shapes represents fear. Two extra expressions, i.e., doubtful and pitiful, are added to humans' seven basic expressions. These two expressions are designed according to the characteristics of the robot in human-robot interaction. When the robot cannot recognize users' emotions, it displays the doubtful expression to adapt to users. When users are angry with the robot, the robot displays the pitiful expression in order to gain users' sympathy.
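The two rules stated above can be written as a small decision function. The Python sketch below encodes only what the text specifies: doubtful when recognition fails and pitiful when the user is angry. Mirroring the user's emotion in every other case is an assumption made here to complete the example, and the function and constant names are hypothetical.

```python
BASIC_EMOTIONS = ("happy", "angry", "surprise", "fear", "disgust", "sad", "neutral")
ROBOT_EXPRESSIONS = BASIC_EMOTIONS + ("doubtful", "pitiful")

def choose_robot_expression(user_emotion):
    """Pick the LED expression for a recognized user emotion (None = not recognized)."""
    if user_emotion is None or user_emotion not in BASIC_EMOTIONS:
        return "doubtful"   # rule from the text: recognition failed
    if user_emotion == "angry":
        return "pitiful"    # rule from the text: try to gain the user's sympathy
    return user_emotion     # ASSUMPTION: mirror the user's emotion otherwise
```

For instance, `choose_robot_expression(None)` yields `"doubtful"` and `choose_robot_expression("angry")` yields `"pitiful"`. An actual affective decision module would likely also take interaction history and scenario into account.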

When the system is running, the Kinect and NAO robots capture users' facial images. First, the image data are transmitted to the server, where they are segmented into three ROI, i.e., eyes, nose, and mouth, which contain most of the facial emotion information. After that, the feature extraction and expression recognition methods, i.e., the combination of 2D-Gabor and LBP, PCA, and the ELM classifier, are employed to obtain the final emotional state. Then, the system makes an appropriate affective decision according to the user's emotion. Finally, the server sends the corresponding control instructions to the robots and sensors, which then give emotional feedback to adapt to users. For example, the mobile robot can express emotion by speech, LED display, and movement.

B. Classification of Facial Expression

The standard facial emotion corpus used in this experiment is JAFFE [24]. Fig. 12 shows some images from this corpus. Seven emotions are included in this corpus, i.e., happy, angry, sad, surprise, neutral, disgust, and fear.

The method combining 2D-Gabor and uniform LBP is used to extract facial emotion features. As shown in Fig. 13, 800 facial features are extracted from every face image. This feature dimension is too large and would consume a lot of computing resources. To solve this problem, PCA is adopted to reduce the feature dimension from 800 to 96. These representative features are input into an ELM classifier [25] to obtain the final emotion results.
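The reduction from 800 to 96 dimensions can be sketched with a plain SVD-based PCA in Python. The dimensions come from the text; the function names are hypothetical, and the projection would be fitted on the training features only and then reused for test features.

```python
import numpy as np

def pca_fit(X, n_components=96):
    """Return the feature mean and the top principal directions of X."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    # Economy SVD of the centred data; rows of Vt are principal directions
    # ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def pca_transform(X, mean, components):
    """Project feature vectors onto the retained principal directions."""
    return (np.asarray(X, dtype=float) - mean) @ components.T
```

With a training matrix of shape `(n_images, 800)`, `pca_transform` returns an array of shape `(n_images, 96)`. At least 96 training images are needed for 96 components to exist.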

Fig. 13 Gray level histogram of every segmented region.
C. Application on the Mobile Robot

To enable the mobile robot to recognize human emotions, the feature extraction methods and classification algorithm of Section Ⅲ are deployed on it. C++ is used to program the Kinect to capture real-time facial images. The feature extraction methods and classification algorithm are implemented in MATLAB. To combine them, the MATLAB program is compiled into a DLL file that can be called from the C++ program. Robots are connected to the workstation via WLAN. By connecting to its IP address and port, the mobile robot can be operated from another computer. Fig. 14 shows the operation interface of the MEC-HRI system, which can connect to and control all devices in the system. This operation interface shows interaction information between humans and robots. For example, the mobile robot module in Fig. 15 displays real-time images captured by the Kinect mounted on the robot and shows the facial expression recognition result.