International Journal of Automation and Computing 2018, Vol. 15 Issue (5): 582-592
Learning to Transform Service Instructions into Actions with Reinforcement Learning and Knowledge Base
Meng-Yang Zhang1,2, Guo-Hui Tian1,2, Ci-Ci Li1,2, Jing Gong1
1 School of Control Science and Engineering, Shandong University, Jinan 253000, China;
2 Shenzhen Research Institute, Shandong University, Shenzhen 518000, China
Abstract: In order to improve the learning ability of robots, we present a reinforcement learning approach with a knowledge base for mapping natural language instructions to executable action sequences. A simulated platform with a physics engine is built as the interactive environment. Based on the knowledge base, a reward function with immediate rewards and delayed rewards is designed to handle the sparse rewards problem. In addition, a list of object states is produced by retrieving the knowledge base and serves as a standard for judging the quality of action sequences. Experimental results demonstrate that our approach yields good performance on the accuracy of action sequence production.
Key words: Natural language     robot     knowledge base     reinforcement learning     object state
1 Introduction

In recent years, robots have been applied in many areas, such as industry, healthcare, education and social life. It is widely believed that robots can increase work efficiency and bring convenience to people's lives. However, the ability of robots to perform complex tasks is limited, especially for home service tasks.

As a result, there has been a growing interest in developing and investigating methods to improve robots' operating ability. Previous works can be divided into two types. One type, referred to as rule construction, extracts elements from the obtained information with manually designed rules and provides executable strategies[1, 2]. The other type, the mainstream for improving robots' ability, is knowledge construction[3-5], which works as an associative means by constructing knowledge bases specific to services, as the lack of relevant knowledge is the main cause preventing robots from completing service requirements. The knowledge bases can be constructed with logic[6], Stanford Research Institute Problem Solver (STRIPS)[7], planning domain definition language (PDDL)[8], probability[9], or other representations. Researchers in knowledge representation model scenarios as a dynamic system in a knowledge base and perform control, prediction and analysis tasks by inferring solutions purely based on this model.

Although the ability of robots to provide services can be improved with the above approaches, they demand considerable manual effort, as the construction of service rules or knowledge is a labour-intensive process. Attention has therefore turned to improving robots' operating ability with available resources while reducing human effort.

Text is the universal information carrier stored and shared on the internet. It can be accessed easily, so exploiting text information to increase the ability of robots is important. Hameed[10] built a database recording information about personal habits so as to assist robots in understanding human intentions. Tenorth et al.[11, 12] proposed a method to construct executable plans for robots by parsing online text and representing it with logical expressions. It has been shown that service instructions used as a guide can increase the ability of robots for service operation, but the process of parsing and exploiting texts is complicated and has to be done under the supervision of researchers. Allowing robots to learn automatically is therefore promising.

1.1 Related works

Inspired by recent advances in deep learning[13-16], the combination of deep learning with reinforcement learning has made significant progress[17-20]. Reinforcement learning, which aims to maximize long-term rewards, can enhance the learning ability of robots[21]. Reinforcement learning has also been applied successfully in natural language processing.

He et al.[22] proposed a learning model based on deep reinforcement learning, referred to as deep reinforcement relevance network-bidirectional long short term memory (DRRN-BiLSTM). With the model, text strings can be processed and mapped to a combinatorial, natural language action space. Sentences can be represented in a format of tree structure[23, 24]. Based on this conception, Bowman et al.[25] proposed a novel model with reinforcement learning, and the model can yield good performance on parsing and sentence understanding.

Approaches that take natural language information as a guide in reinforcement learning have also been applied to video games and information retrieval[26, 27], with good performance. Similar to our goal in this paper, Branavan et al.[28] proposed a method that maps instructions into actions, but the interactive environment and the vocabulary involved are relatively simple, which is not enough for performing complex tasks.

In the above-mentioned methods, there is a unified standard for judging the final result, and the environment involved is relatively simple. They remain unsatisfactory in situations with a complex environment and ambiguous criteria.

1.2 Proposed approach

In this paper, we propose a method for transforming instructions related to home service into action sequences, with a knowledge base and reinforcement learning.

The sparse rewards problem may occur in reinforcement learning when the reward function is too simple to promote parameter convergence[29-31]. A complex environment, especially a home environment, can cause sparse rewards, which makes parameters difficult to converge and lowers the accuracy of results.

To address this problem, we built a hierarchical knowledge base of home service, and the knowledge base can provide service information on both object and service level. Based on the knowledge base, a reward function with immediate rewards and delayed rewards is designed. Immediate rewards are produced along with rules guiding the direction of producing proper action sequences, and delayed rewards are given by estimating the final result of task operation.

Unlike video games in virtual environments, which have clear results of win or failure[32-34], home service operation has no uniform standard for measuring its effect. To deal with this, a standard based on object states is designed for judging the result of home service operation. The information related to object states can be obtained by retrieving the knowledge base.

Inspired by the view that feedback obtained from direct communication with the environment can be used immediately for online decision making[35], we build a simulated environment with physical information from the knowledge base, as shown in Fig. 1. Actions can be chosen and executed in the simulated environment, and state transitions occur along with action execution.

Fig. 1. Simulated environment for reinforcement learning. The number of objects related to home service is large. As the platform is used for verifying the logic of produced action sequences, we pay more attention to the intrinsic characteristics of objects than to visual realism.

There are two main contributions in this paper. One is a standard for evaluating the execution of home tasks by introducing object states. The other is the application of a knowledge base, which is essential for environment construction in reinforcement learning and for the acquisition of object states.

The remainder of the paper is organized as follows. The overview of the proposed method is introduced in Section 2. The process of semantic fragment mapping is described in Section 3. The application of the knowledge base is presented in Section 4. And the implementation of reinforcement learning is shown in Section 5. The experiments and conclusion are presented in Sections 6 and 7, respectively.

2 Framework overview

The framework is composed of an information source related to home service, a knowledge base and a dynamic simulation platform, as illustrated in Fig. 2.

The website wikiHow is chosen as the information source. It contains information on a variety of services, and its content is semi-structured, that is, expressed in steps, which facilitates information extraction.

Firstly, the content from the information source is stored in documents, each of which corresponds to a service. Through semantic fragment mapping, information from documents is transformed into initial action sequences, in which the relationships between objects and operators are not yet identified. During the mapping process, the essential elements in documents are mapped to the corresponding action functions and parameters according to word meaning alone, without further rules.

Then, the initial action sequences are sent to the dynamic simulation platform for logic verification. The platform serves as the interactive environment in reinforcement learning. It consists of a virtual scene modeled on our laboratory, object models related to home service, and a library of action functions. As basic elements, object models are characterized by physical properties, a physics engine and object states. Physical properties indicate an object's size, weight, color, etc. Object states are the current states of objects. For example, the object state of a cup full of water is full. These two kinds of information can be obtained by retrieving the knowledge base. The physical effects of object models are reproduced by the physics engine, which is useful for simulating service execution. The action function library includes basic actions such as grab, open and lay down. A state transition occurs when an action function is executed, and we take the composition of object states as a standard to evaluate service execution. We also take into consideration the problem of variable parameters, which is caused by differing sentence structures.

Finally, a reward function with immediate rewards and delayed rewards is designed to produce scores on the effect of initial action sequences. Regarded as a task, each document is composed of subtasks, including sub-actions, relevant objects and object states as is illustrated in Fig. 5. When a subtask is completed, the immediate reward is given. And the value of delayed rewards stands for completion of the task, which is based on object states.

Fig. 2. Framework of the proposed method. The goal of our deep learning model is to raise the ability of robots for home service by acquiring related knowledge from natural language information about family service. The information related to home service is taken as inputs, and action sequences which can be used or referenced by robots are generated as the output. The model can deal with complex family environment and produce actions oriented to home service.

The knowledge base is constructed to provide information about objects and services. Object information contains the physical information corresponding to object models, which is used for setting model parameters. Service information takes home services as tasks; each task includes the objects involved in the corresponding service and their object states. The knowledge base is essential for constructing the simulated platform and designing the reward function.

3 Semantic fragment mapping

The process of semantic fragment mapping is illustrated in Fig. 3. Sentences from information source are sent to the mapping layer for semantic parsing and sense disambiguation. After that, the processed information is mapped to corresponding names of functions in an action function library. Then, the selected functions are performed in the dynamic simulation platform.

Fig. 3. Process for semantic fragment mapping. The key elements of a sentence are extracted and mapped to functions in an action library. Parameters corresponding to objects are chosen randomly without identification of object relationships.

The information source is divided into documents related to home service. Each document is composed of sentences and represents a service topic.

The mapping layer is a bridge between information source and the action function library, through which initial action sequences can be obtained by mapping sentences to action functions. It consists of Stanford Lexical Parser[36], Counter and WordNet. Stanford Lexical Parser is applied here for semantic parsing by extracting noun and verb-phrases from sentences, and the parsed fragments are stored in a list as a candidate set. WordNet is a lexical database which can provide synonym collection. With it, words with similar meaning can be mapped to a unified vocabulary, so as to reduce complexity of information processing. And Counter is used to record the number of nouns in a sentence, indicating the mapping degree of sentences as a factor for function selection.

During semantic fragment mapping, we take into consideration two problems, the scale of vocabulary in information source, and action selection specific to the same behaviour.

Although the area of research is limited to home service, the scale of vocabulary indicating actions and objects is still large, which makes action functions construction difficult. Thus, we construct lists of synonyms in order to reduce the vocabulary scale. For example, the words, take and get, have the same meaning for grabbing, so both of the words will be mapped to the same action function: Grab.

Another problem is action selection with variable parameters. The structure of sentences is complicated, and the meaning of sentences can be different when the number of nouns in a sentence changes. As illustrated in Fig. 3, there are four kinds of action functions corresponding to the word, grab, which are Grab(obj, pos), Grab(obj, pos, loc), Grab(obj, pos, func) and Grab(obj). The number of parameters in action functions reflects different structures of sentences. For example, Grab(obj) means the robot should grab an object labeled as obj, and the position of the object should be obtained by the robot itself. The function Grab(obj, pos) differs from Grab(obj) in that the information indicating object position can be acquired from sentences. Also Grab(obj, pos, loc) means the robot grabs the object, obj, at a specific position, pos, and takes the object to a place, loc. Therefore, the Counter is used to record the number of nouns in a sentence, which is taken as a factor in choosing proper action functions.
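The two mapping steps above (synonym reduction and selection by noun count) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tables `SYNONYM_TO_ACTION` and `GRAB_SIGNATURES` and the function `select_action` are hypothetical names, and nouns are bound to parameters in sentence order, whereas the paper chooses parameters randomly at this stage.

```python
# Hypothetical synonym table; the paper derives such lists from WordNet synsets.
SYNONYM_TO_ACTION = {
    "take": "Grab",
    "get": "Grab",
    "grab": "Grab",
}

# Hypothetical signature table keyed by the noun count reported by the Counter.
# Grab(obj, pos, func) also has three parameters, so the noun count alone
# cannot separate it from Grab(obj, pos, loc); it is left out of this sketch.
GRAB_SIGNATURES = {
    1: ("obj",),
    2: ("obj", "pos"),
    3: ("obj", "pos", "loc"),
}


def select_action(verb, nouns):
    """Map a verb and its nouns to an action name and bound parameters."""
    action = SYNONYM_TO_ACTION.get(verb.lower())
    if action is None:
        return None  # verb is not covered by the synonym lists
    signature = GRAB_SIGNATURES.get(len(nouns))
    if signature is None:
        return None  # no action variant takes this many parameters
    # Assumption: nouns are bound to parameters in the order they appear.
    return action, dict(zip(signature, nouns))
```

For instance, `select_action("take", ["cup", "table"])` selects the two-parameter variant and binds `obj` to the cup and `pos` to the table.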

Through the mapping layer, sentences are transformed into initial action sequences. As the executing orders of actions and parameters are not considered, the sequences need to be sent to the dynamic simulation platform to test the reasonability.

4 Application of knowledge base

A knowledge base is built to provide information on both object and service level. Based on the knowledge base, reinforcement learning is implemented to transform service instructions into action sequences.

As illustrated in Fig. 4, the knowledge base is constructed in hierarchy and divided into object level and service level.

Fig. 4. Structure of the knowledge base about home service. It contains information on both the object level and the service level. Information on object level includes intrinsic characteristics of objects, and service tasks are stored in service level.

The knowledge on object level includes physical property and functional property. The physical property denotes information of objects about their size, colour and material, which are used for model construction by setting parameters of models based on the information. And the functional property involves information about the application and state of objects. The application information is not a description stating the usage of the object, but a network that links the relevant nouns and verbs with the object, which can be used to design rules of the reward functions. The object states indicate the current state of objects on functional aspect. For example, when an empty cup is filled with water, its state has changed from empty to full. Thus, the state of a cup is represented below:

 $[cup:(containable, empty)].$

The knowledge on service level takes home service tasks as units, as illustrated in Fig. 5. Each task can be separated into subtasks representing sub-actions, and the relevant objects and their states are at the bottom of the knowledge. For instance, cleaning room is a task, with subtasks such as wiping the table, sweeping the floor, mopping the floor, etc. Each subtask involves task-relevant objects. Objects in the task of wiping a table include a table and a rag, while a broom and a dustbin are needed for sweeping the floor. The object states stored on service level are the final states after task execution, so a list of final states according to specific tasks can be produced by retrieving the knowledge base on service level.
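The two levels described above can be pictured as nested mappings. The sketch below is only an illustration: the paper stores the knowledge base in RDF, and every entry, state name and value here (for example `cleanliness` or `stored`) is made up for the example, as is the helper `final_states`.

```python
# Object level (hypothetical entries): physical and functional properties.
OBJECT_LEVEL = {
    "cup": {
        "physical": {"size": "small", "colour": "white", "material": "ceramic"},
        "functional": {
            # Application information: nouns and verbs linked with the object.
            "application": {"nouns": ["water"], "verbs": ["grab", "fill"]},
            # Object state as a (name, value) pair.
            "state": ("containable", "empty"),
        },
    },
}

# Service level (hypothetical entries): task -> subtask -> final object states.
SERVICE_LEVEL = {
    "cleaning room": {
        "wiping the table": {
            "table": ("cleanliness", "clean"),
            "rag": ("cleanliness", "dirty"),
        },
        "sweeping the floor": {
            "broom": ("position", "stored"),
            "dustbin": ("containable", "full"),
        },
    },
}


def final_states(task):
    """Collect the final state of every object involved in a task."""
    states = {}
    for subtask_objects in SERVICE_LEVEL.get(task, {}).values():
        states.update(subtask_objects)
    return states
```

Retrieving `final_states("cleaning room")` yields the list of final object states that later serves as the reference for judging an executed action sequence.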

Fig. 5. Service level of knowledge base. It takes tasks or subtasks as units. Elements that make up each unit are object states.

5 Reinforcement learning

Home environment is complex. There is not a unified standard to estimate the task execution of home service, as action sequences specific to tasks are variable. In order to implement reinforcement learning on home service, it is crucial to establish a unified standard.

5.1 Acquisition of object states

We present a standard taking object states as factors for judging the final result of service execution, so object states acquisition is essential. The acquisition process is illustrated in Fig. 6.

Fig. 6. Process of obtaining object states. Objects involved in service instructions can be obtained and attached to corresponding object states through the process. The knowledge base is a key component for the realization of the process.

Firstly, as content from the information source is semi-structured, texts related to home service are represented as a tuple: (request, service). The request part indicates the theme of the service, which can be used to search for corresponding information on the service level of the knowledge base. The service part corresponds to service execution and is described in steps, which is the foundation of subtask construction. After semantic fragment mapping, noun components from the request part are matched against relevant tasks on the service level of the knowledge base, so a list containing task-relevant objects is obtained:

 $N = [{n_1},{n_2}, \cdots ,{n_m}]$ (1)

where $N$ is a list of strings holding the object names related to the matched tasks, and $n_i$ is the $i$-th object name.

Then, object names in service part can be extracted by traversing the list $N$ , and object names satisfying a mapping relationship are stored in another list $H$ :

 $H = [ob{j_1},ob{j_2}, \cdots ,ob{j_n}]$ (2)

where $obj_i$ is the name of object in service part.

The next step is to add object states to objects in list $H$ by referring to the knowledge base, and the processed list is represented as $H$ .

 $H = [ob{j_1}:stat{e_1}, ob{j_2}:stat{e_2}, \cdots , ob{j_h}:stat{e_h}]$ (3)

where $state_i$ is a tuple in the format (name, value), where name indicates the name of the object state and value is its corresponding value. For example, the state of a cup, $state_i$, can be expressed as (containable, full) or (containable, empty).
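The three steps of Eqs. (1)-(3) can be sketched in a few lines. This is a minimal sketch under assumptions: the service level is simplified to a mapping from task name to object states, a task is considered matched when a request noun occurs as a word in its name, and `EXAMPLE_SERVICE_LEVEL` and `acquire_object_states` are hypothetical names.

```python
# Hypothetical service-level excerpt: task name -> object -> (name, value).
EXAMPLE_SERVICE_LEVEL = {
    "wipe table": {
        "table": ("cleanliness", "clean"),
        "rag": ("wetness", "wet"),
    },
    "sweep floor": {
        "broom": ("position", "stored"),
    },
}


def acquire_object_states(request_nouns, service_nouns, service_level):
    """Return the state-annotated object list of Eq. (3) as a dict."""
    # Eq. (1): N, the objects of every task matched by a request noun.
    # Assumption: a task matches when a request noun is a word of its name.
    task_objects = {}
    for task, objects in service_level.items():
        if any(noun in task.split() for noun in request_nouns):
            task_objects.update(objects)
    # Eq. (2): H, the objects of the service part that also appear in N.
    matched = [obj for obj in service_nouns if obj in task_objects]
    # Eq. (3): attach the (name, value) state from the knowledge base.
    return {obj: task_objects[obj] for obj in matched}
```

With the request noun `table` and the service nouns `rag`, `table` and `spoon`, only the rag and the table are kept, because the spoon is not among the objects of the matched task.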

5.2 Design of reward functions

To address the problem of sparse rewards in home environment, a two-level reward function is designed for producing logical action sequences from natural language, by setting rules to estimate the completion of subtasks and the final task.

The first level of the reward function, referred to as immediate rewards, produces scores for estimating the completion of subtasks. Immediate rewards can be divided into two parts. The first part, inspired by the proposal in [12, 13], is produced by mapping semantic fragments in sentences to relevant elements in the simulation platform, including object models and action functions. Taking a sentence as a unit, a positive score is given when nouns in the sentence can be mapped to object models, or when verbs match action functions. A further positive score is given if the mapped models and functions are linked in the application information. The second part estimates the completion of the subtasks involved in a sentence. The object states in the sentence are extracted and compared with the final states of objects in the knowledge base. If they are consistent, the score is positive; otherwise it is negative.
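The rules for immediate rewards can be sketched as a per-sentence scoring function. The paper does not give reward magnitudes, so the values of +1 and -1 below are assumptions, as are the function name `immediate_reward`, its arguments, and the representation of application links as (verb, noun) pairs. Verbs are assumed to be already normalized to action function names.

```python
def immediate_reward(nouns, verbs, sentence_states, object_models,
                     action_functions, application_links, final_states):
    """Score one sentence; magnitudes of +1/-1 are assumed, not from the paper."""
    reward = 0.0
    mapped_nouns = [n for n in nouns if n in object_models]
    mapped_verbs = [v for v in verbs if v in action_functions]
    # Part 1a: nouns map to object models, or verbs match action functions.
    if mapped_nouns or mapped_verbs:
        reward += 1.0
    # Part 1b: a mapped verb and noun are linked in the application information.
    if any((v, n) in application_links
           for v in mapped_verbs for n in mapped_nouns):
        reward += 1.0
    # Part 2: object states in the sentence agree with the final states
    # stored on the service level of the knowledge base.
    if sentence_states:
        consistent = all(final_states.get(obj) == state
                         for obj, state in sentence_states.items())
        reward += 1.0 if consistent else -1.0
    return reward
```

Separating the three checks keeps each rule visible, so the weights could be tuned independently if a different scoring scheme were preferred.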

The second level is referred to as the delayed rewards, which are used to estimate the effect of the final result.

Based on information on service level of the knowledge base, object states can be obtained so as to compare similarity with the object states after executing action sequences.

Taking documents as units, we first get the object states, $S_O$, of a document by traversing the knowledge base.

 ${S_O} = [stat{e_1},stat{e_2}, \cdots ,stat{e_w}]$ (4)

where $state_i$ is the corresponding state of objects, and $w$ is the total number of involved objects.

Then $S_P$, the object states after executing the actions, is constructed with the same order and number of objects as $S_O$. The object states are tested with the following condition:

 $F({S_O},{S_P}) > K$ (5)

where $F$ is a function outputting the proportion of $S_P$ in $S_O$, and $K$ is a threshold in the interval (0, 1). When the output of $F$ is larger than $K$, the produced action sequences are regarded as reasonable.
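A possible reading of Eq. (5) is sketched below. The paper does not spell out $F$; here it is assumed to be the fraction of positions at which the state in $S_P$ equals the expected state in $S_O$. The names `state_match_ratio` and `is_reasonable` and the default threshold of 0.8 are hypothetical.

```python
def state_match_ratio(expected_states, produced_states):
    """Assumed form of F: share of positions where S_P agrees with S_O."""
    if len(expected_states) != len(produced_states):
        raise ValueError("S_O and S_P must list the same objects in the same order")
    if not expected_states:
        return 0.0
    matches = sum(e == p for e, p in zip(expected_states, produced_states))
    return matches / len(expected_states)


def is_reasonable(expected_states, produced_states, threshold=0.8):
    """Eq. (5): the action sequence is accepted when F(S_O, S_P) > K."""
    return state_match_ratio(expected_states, produced_states) > threshold
```

If three of four object states match, the ratio is 0.75, so the sequence passes with a threshold of 0.7 but fails with the assumed default of 0.8.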

5.3 Implementation of reinforcement learning

We adopt the policy gradient algorithm to obtain the optimal policy $p(a|s;\theta)$, which yields the optimal action composition by tuning the parameters $\theta$ for maximum expected rewards[37], where $a$ represents the chosen action and $s$ is the state.

Firstly, the data set is represented as follows.

 $D = [{d_1},{d_2}, \cdots ,{d_N}]$ (6)
 ${d_i} = [{u_1},{u_2}, \cdots ,{u_M}]$ (7)

where $d_i$ is a document representing a specific service task, and $u_j$ is a sentence in the document.

The information in $d_i$ is extracted and mapped into action sequences $a=[a_1, a_2, \cdots]$ , the action $a_i$ is represented in a form of triple, $a_i=(fn, par, W')$ , where $fn$ represents the name of action functions in simulation platform, $par$ are parameters in $fn$ , and $W'$ specifies the mapping words from documents.

Then, the state $s=(\varepsilon, d, j, W)$ from documents to action sequences is constructed, where $\varepsilon$ is the composition of current object states in the simulation environment, $d$ is the document containing the corresponding service information, $j$ indicates the index of the current sentence in the document and $W$ contains the words mapped from the produced action sequences. Following the distribution $p(s'|s, a)$, a new state $s'$ is produced when an action $a$ is executed at the state $s$.
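The action triple and the state tuple defined above can be written as small typed records. This is only a sketch: the class and field names are hypothetical, and the helper `advance` assumes that each executed action consumes one sentence and that the new object states are supplied by the simulation platform, which the paper models through the distribution over next states.

```python
from typing import NamedTuple, Tuple


class Action(NamedTuple):
    fn: str                    # name of the action function
    par: Tuple[str, ...]       # parameters passed to fn
    words: Tuple[str, ...]     # W': words of the document mapped to this action


class State(NamedTuple):
    object_states: Tuple[Tuple[str, str, str], ...]  # epsilon: (object, name, value)
    document: Tuple[str, ...]                        # d: sentences of the document
    sentence_index: int                              # j: index of the current sentence
    mapped_words: Tuple[str, ...]                    # W: words mapped so far


def advance(state, action, new_object_states):
    """Build the next state; new_object_states would come from the simulator."""
    return State(
        object_states=new_object_states,
        document=state.document,
        sentence_index=state.sentence_index + 1,  # assumption: one action per sentence
        mapped_words=state.mapped_words + action.words,
    )
```

Immutable tuples are used so that a history of states and actions can be recorded without later steps altering earlier entries.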

Based on this information, the value function is built as

 ${V_\theta }(s) = {E_{p(h|\theta )}}[r(h)].$ (8)

The history $h=(s_0, a_0, \cdots, s_n)$ records the executed actions and the experienced states during the processing of a document, and $r(h)$ is the reward of the history $h$.

Finally, the policy gradient algorithm is employed to find the parameter $\theta$ that maximizes the value function, using the following gradient:

 $\frac{\partial }{{\partial \theta }}{V_\theta }(s) = {E_{p(h|\theta )}}[r(h)\sum\limits_t {\frac{\partial }{{\partial \theta }}} \log p({a_t}|{s_t};\theta )]$ (9)

where $p(a|s;\theta)$ is a softmax function, and the derivative of its logarithm is

 $\frac{\partial }{{\partial \theta }}\log p(a|s;\theta ) = \varPhi (s,a) - \sum\limits_{a'} \varPhi (s,a')p(a'|s;\theta )$ (10)

where $\varPhi (s, a)$ is the feature representation. Histories are sampled from the distribution $p(h|\theta)$ to estimate the gradient of the value function. The parameter $\theta$ is updated with the following rules:

 $\varDelta = \sum\limits_t {\left( {\varPhi ({s_t},{a_t}) - \sum\limits_{a'} \varPhi ({s_t},a')p(a'|{s_t};\theta )} \right)}$ (11)
 $\theta = \theta + r(h)\varDelta.$ (12)
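The update of Eqs. (10)-(12) can be sketched in plain Python for a single sampled history. This is a minimal sketch, assuming a linear softmax policy over feature vectors; the function names and the `step_size` argument are hypothetical, and `step_size` defaults to 1 so that the update coincides with Eq. (12).

```python
import math


def softmax_policy(theta, features):
    """p(a|s; theta) over candidate actions, with scores theta . Phi(s, a)."""
    scores = [sum(t * f for t, f in zip(theta, phi)) for phi in features]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def log_policy_gradient(theta, features, chosen):
    """Eq. (10): Phi(s, a) minus the expected features under the policy."""
    probs = softmax_policy(theta, features)
    expected = [sum(p * phi[k] for p, phi in zip(probs, features))
                for k in range(len(theta))]
    return [features[chosen][k] - expected[k] for k in range(len(theta))]


def update(theta, history, reward, step_size=1.0):
    """Eqs. (11)-(12): theta <- theta + r(h) * Delta.

    history is a list of (features, chosen) pairs, one per time step, where
    features holds Phi(s_t, a') for every candidate action a'.
    """
    delta = [0.0] * len(theta)
    for features, chosen in history:
        grad = log_policy_gradient(theta, features, chosen)
        delta = [d + g for d, g in zip(delta, grad)]
    return [t + step_size * reward * d for t, d in zip(theta, delta)]
```

With two candidate actions, unit feature vectors and zero initial parameters, choosing the first action with a positive reward raises the probability of that action on the next step, which is the intended effect of the update.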
6 Experiment and discussion

Our main objective in producing action sequences is to find a way to identify the relationships between semantic elements in a sentence effectively. In this section, we first collect information and build the necessary dataset, including the home service information and the knowledge base. Then, the interactive environment, i.e., the simulated platform, is constructed based on the dataset. We evaluate our method against baseline methods in terms of convergence rate and correctness. Finally, other problems concerning the production of action sequences are discussed.

6.1 Dataset

We choose wikiHow, a website which provides information on how to carry out services, as the information source. Based on its sitemap, an XML file which records the categories and paths of related information, we obtain documents corresponding to home service. The specific information on documents is shown in Tables 1 and 2.

Table 1
Basic statistics of documents

Table 2
Statistics about the dataset of home service

Our dataset consists of 1 000 documents related to home service. Based on the sitemap, the documents can be categorized into housekeeping, house decoration, cooking and cleaning. Housekeeping contains a variety of services related to housekeeping, such as how to tidy up a wardrobe, or how to clean the bathroom. House decoration gives instructions on how to make the home environment comfortable, mainly focusing on item placement. Cooking and cleaning indicate execution of simple tasks, like how to make coffee or clean a bidet. In order to reduce the complexity of information processing, the first sentence in each step, which encapsulates the whole paragraph, is extracted and stored, while other information including tips and notes is removed. The representation of information is shown in Fig. 7.
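The extraction of the first sentence of each step can be sketched with a simple regular expression. This is a naive sketch, assuming that steps arrive as plain strings and that a sentence ends at the first full stop, exclamation mark or question mark followed by whitespace; abbreviations would be split incorrectly, and the function name `first_sentences` is hypothetical.

```python
import re

# First run of text ending in ., ! or ? that is followed by whitespace or the end.
FIRST_SENTENCE = re.compile(r"\s*(.+?[.!?])(?:\s|$)", flags=re.DOTALL)


def first_sentences(steps):
    """Keep only the leading sentence of every step of a document."""
    result = []
    for step in steps:
        match = FIRST_SENTENCE.match(step)
        # Fall back to the whole step when no sentence terminator is found.
        result.append(match.group(1) if match else step.strip())
    return result
```

A step such as "Dust the furniture. Use a soft cloth." is reduced to "Dust the furniture.", which is the part passed on to semantic fragment mapping.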

The data exhibits certain qualities that make for a challenging learning problem. We extract nouns and verbs from sentences as key elements, and the vocabulary is reduced with the synsets in WordNet.

Fig. 7. Information on how to clean wood furniture after removing unnecessary information. The description for task execution is separated into several parts and each part is presented in steps. The format is suitable for information processing.

6.2 Construction of simulated platform

In order to realize real-time interaction, the simulated platform with a physics engine is built with Unity 3D. The platform consists of a library of action functions and a simulated scenario with object models, as illustrated in Fig. 8.

Fig. 8. Construction of simulated platform. The knowledge base can provide physical information for model construction. The library of action functions is the bridge between the agent and the environment, and action execution can lead to transition of object states.

Action functions indicate basic behaviors from robots, which have an influence on the states of objects when executed. During the mapping process of semantic fragments, the names of action functions are assigned to verb phrases from sentences. Also, action functions with different parameters are considered due to the diversity of sentences representation.

Unlike the platform constructed in [35], we reduce the visual fidelity of the models because the objective of this platform is to test the logic of action sequences derived from instructions. Therefore, attention has been given to the size, shape and structure of the models rather than the texture, in order to save computational cost. Model construction is based on the physical property information of objects, including colour, size and weight.

In this paper, Protege 3.4.4 is used to construct the knowledge base, and we store the information of both physical and functional property with RDF (resource description framework).

6.3 Experiments

In this section, we describe experimental results along with qualitative analysis. Since the goal of the proposed method is to produce logical action sequences, metrics such as bilingual evaluation understudy (BLEU) and perplexity which are used for dialogue quality evaluation are not suitable. So we evaluate the logic of produced results with human judgments.

Comparative experiments are designed to test the validity of the proposed method. We take the method with only the policy gradient algorithm as the baseline; the difference between the proposed method and the baseline is that the former is integrated with immediate rewards to address the sparse rewards problem. We also design a method whose parameters are tuned by training with 500 correct action sequences constructed manually. These methods are compared in terms of correct rate and rewards, in order to demonstrate the feasibility of the proposed method on home service. The experimental results are shown in Fig. 9.

Fig. 9. Comparison of results based on three methods. The advantages of the proposed method are stated in both aspects of correct rate and rewards.

The results illustrated in Fig. 9 (a) demonstrate that the baseline gets the lowest accuracy, while the method trained with the manually constructed action sequences achieves the highest. The accuracy of the proposed method is lower than, but close to, the best one. It should be noted that the proposed method achieves a result of good quality while saving human effort.

In Fig. 9 (b), the baseline obtains the lowest rewards, and the method trained with correct sequences scores slightly better, because these two methods take the final result as the only standard for judgment. The reward of our proposed method is the highest because of the immediate rewards for subtasks. It also shows a tendency to converge towards a high accuracy rate, which makes clear that immediate rewards guide the production of proper action sequences.

We also compare results across categories, as illustrated in Fig. 10. The method trained with manual samples yields better performance overall, but experimental results indicate that our proposed method performs as well as the best one in housekeeping, house decoration and cleaning. The correct rates of the three approaches are relatively low in cooking compared with other areas. One reason is the small number of documents in the cooking area. We ascribe another to the absence of knowledge on cooking: by analyzing service categories in the knowledge base, we find that it cannot provide enough information on objects and the corresponding states. Also, information on cooking includes various objects which are not as common as items like tables, chairs and brooms.

Fig. 10. Comparison of results based on three methods. The advantages of the proposed method are stated in both aspects of correct rate and rewards.

6.4 Discussions

With the help of the knowledge base and the simulated platform, the proposed method can exploit the self-learning ability of reinforcement learning to obtain optimal policies. Compared with traditional methods on increasing the intelligence of robots, the proposed method can save manual effort.

One aspect to be noted is the feature representation. As the paper aims to find an effective way to produce action sequences from instructions, we take the list of object states and the chosen action functions as the states for reinforcement learning, and do not include image features in the states as is done in [18, 26, 32, 38].

7 Conclusions

In this paper, we presented a reinforcement learning approach for inducing a mapping between instructions of home service and action sequences. Our method provides contributions in two aspects. First, we propose a way to judge the result of home service operation by taking object states as evaluation factors. Second, the knowledge base is employed as associative means for reinforcement learning implementation. The experimental results have demonstrated that the proposed method can yield good performance on producing proper action sequences.

Acknowledgements

This work was supported by National Natural Science Foundation of China (No. 61773239) and Shenzhen Future Industry Special Fund (No. JCYJ20160331174814755).

References