CAAI Transactions on Intelligent Systems  2020, Vol. 15 Issue (2): 317-322  DOI: 10.11992/tis.201809033

SHEN Xiangxiang, HOU Xinwen, YIN Chuanhuan. State attention in deep reinforcement learning[J]. CAAI Transactions on Intelligent Systems, 2020, 15(2): 317-322. DOI: 10.11992/tis.201809033.

State attention in deep reinforcement learning
SHEN Xiangxiang 1, HOU Xinwen 2, YIN Chuanhuan 1
1. Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing 100044, China;
2. Center for Research on Intelligent System and Engineering, Institute of Automation, Chinese Academy of Sciences, Beijing 110016, China
Abstract: Since the emergence of deep reinforcement learning, artificial intelligence has achieved results beyond the human level in board games and video games. However, the real-time strategy game StarCraft remains a very challenging platform for artificial intelligence researchers because of its huge state space and action space. The baseline agents that DeepMind trained with the classical deep reinforcement learning algorithm A3C on the StarCraft II mini-games are still far below the level of ordinary amateur players. To address this problem, an A3C algorithm based on state attention is proposed, which adopts a simpler network structure and combines the attention mechanism with the rewards in reinforcement learning. While using fewer feature layers, the trained agent achieves a highest score that is 71 points above that of DeepMind's baseline agent in individual StarCraft II mini-games.
Key words: deep learning    reinforcement learning    attention mechanism    A3C    StarCraft II mini-games    agent    micromanagement

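1) The network structure adopted is more concise than that of the baseline agent provided by DeepMind.

2) The rewards in reinforcement learning are combined with the attention mechanism, so that at every time step the agent pays more attention to valuable game states.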

1 Reinforcement learning

 ${R_t} = \sum\limits_{k = 0}^\infty {{\gamma ^k}r\left( {{s_{t + k}},{a_{t + k}}} \right)}$
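As a minimal sketch of the discounted return above, the following Python code evaluates $R_t$ for a finite reward sequence. Truncating the infinite sum at the end of the list is an assumption made for illustration, and the function name `discounted_return` and the sample values are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward list.

    rewards[k] plays the role of r(s_{t+k}, a_{t+k}). Cutting the
    infinite sum off at the end of the list is an assumption of this sketch.
    """
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total


rewards = [1.0, 0.0, 2.0]  # hypothetical rewards r_t, r_{t+1}, r_{t+2}
gamma = 0.5                # hypothetical discount factor
print(discounted_return(rewards, gamma))  # 1 + 0 + 0.25 * 2 = 1.5
```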

 ${\max _\pi }{E_{s\sim {P_0}}}\left[ {{R_t}|{s_t} = s} \right]$

where ${P_0}$ is the prior distribution of the state $s$. The action-value function is expressed as

 ${Q^\pi }\left( {s,a} \right) = {E_\pi }\left[ {{R_t}|{s_t} = s,{a_t} = a} \right]$

 ${V^\pi }\left( s \right) = {E_\pi }\left[ {{R_t}|{s_t} = s} \right]$

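The A3C algorithm is a reinforcement learning method that combines a policy function with a value function. For the objective function in Eq. (1):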

 ${J_\theta }\left( s \right) = {E_{{\pi _\theta }}}\left[ {{R_t}|{s_t} = s} \right]$ (1)

 ${\nabla _\theta }{J_\theta }\left( s \right) = {E_{{\pi _\theta }}}\left[ {\nabla _\theta }\ln {\pi _\theta }\left( {{a_t}|{s_t}} \right) {R_t}|{s_t} = s \right]$

 ${\nabla _\theta }{J_\theta }\left( s \right) = {E_{{\pi _\theta }}}\left[ {\nabla _\theta }\ln {\pi _\theta }\left( {{a_t}|{s_t}} \right) \left[ {{R_t} - b\left( {{s_t}} \right)} \right]|{s_t} = s \right]$

 ${\nabla _\theta }{J_\theta }\left( s \right) = {E_{{\pi _\theta }}}\left[ {\nabla _\theta }\ln {\pi _\theta }\left( {{a_t}|{s_t}} \right)\left[ {{Q^\pi }\left( {{s_t},{a_t}} \right) - {V^\pi }\left( {{s_t}} \right)} \right]|{s_t} = s \right]$
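The three gradient expressions above differ only in the term that multiplies ${\nabla _\theta }\ln {\pi _\theta }$. The sketch below checks numerically that subtracting a baseline leaves the expected gradient unchanged. It assumes a single state with a softmax policy over three actions, and the logits and action values are hypothetical numbers chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def policy_gradient(logits, q_values, baseline):
    """Exact E_pi[ grad_theta ln pi(a) * (Q(a) - baseline) ] for a softmax policy.

    For softmax logits theta, d ln pi(a) / d theta_j = 1[a == j] - pi_j.
    """
    pi = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for a in range(n):
        weight = pi[a] * (q_values[a] - baseline)
        for j in range(n):
            grad_log = (1.0 if a == j else 0.0) - pi[j]
            grad[j] += weight * grad_log
    return grad


logits = [0.2, -0.1, 0.5]    # hypothetical policy parameters
q_values = [1.0, 3.0, 2.0]   # hypothetical Q(s, a) for each action
pi = softmax(logits)
v = sum(p * q for p, q in zip(pi, q_values))  # V(s) = E_pi[Q(s, a)]

g_plain = policy_gradient(logits, q_values, 0.0)  # weighted by Q only
g_adv = policy_gradient(logits, q_values, v)      # weighted by Q - V
print(g_plain)
print(g_adv)
```

The two printed gradients agree up to floating-point error, because the expectation of ${\nabla _\theta }\ln {\pi _\theta }$ under the policy is zero. The baseline only changes the variance of a sampled estimate, not its mean.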

 ${\left[ {{G^\pi }\left( {{s_t}} \right) - V_{\theta '}^\pi \left( {{s_t}} \right)} \right]^2}$ (2)

 ${G^\pi }\left( {{s_t}} \right) = {E_{{\pi _\theta }}}\left[ {\sum\limits_{k = 0}^n {\gamma ^k}r\left( {{s_{t + k}},{a_{t + k}}} \right) + {\gamma ^{n + 1}}V_{\theta '}^\pi \left( {{s_{t + n + 1}}} \right) } \right]$
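A minimal sketch of the $n$-step target ${G^\pi }\left( {{s_t}} \right)$ and the squared error of Eq. (2) follows. It replaces the expectation with a single sampled rollout, which is an assumption of this sketch, and all numbers and function names are hypothetical.

```python
def n_step_target(rewards, bootstrap_value, gamma):
    """Compute sum_{k=0}^{n} gamma**k * r_k + gamma**(n+1) * V(s_{t+n+1}).

    rewards holds the n + 1 sampled rewards r(s_{t+k}, a_{t+k}) and
    bootstrap_value is the critic's estimate of V(s_{t+n+1}).
    """
    target = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return target + (gamma ** len(rewards)) * bootstrap_value


def critic_loss(target, value_estimate):
    """Squared error between the n-step target and the critic's V(s_t)."""
    return (target - value_estimate) ** 2


rewards = [1.0, 1.0]     # hypothetical rollout with n = 1
bootstrap_value = 4.0    # hypothetical V(s_{t+2})
gamma = 0.5

g = n_step_target(rewards, bootstrap_value, gamma)
print(g)                    # 1 + 0.5 + 0.25 * 4 = 2.5
print(critic_loss(g, 2.0))  # (2.5 - 2.0)**2 = 0.25
```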

 ${J_\theta }\left( {{s_t}} \right) = {E_{{\pi _\theta }}}\left[ {{A^\pi }\left( {{s_t},{a_t}} \right)} \right] + \delta H\left[ {{\pi _\theta }\left( {{s_t}} \right)} \right]$

 ${\nabla _\theta }{J_\theta }\left( {{s_t}} \right) = {E_{{\pi _\theta }}}\left[ {\nabla _\theta }\ln {\pi _\theta }\left( {{a_t}|{s_t}} \right)\left[ {{Q^\pi }\left( {{s_t},{a_t}} \right) - {V^\pi }\left( {{s_t}} \right)} \right] + \delta {\nabla _\theta }H\left[ {{\pi _\theta }\left( {{s_t}} \right)} \right] \right]$
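To illustrate the entropy term $\delta H\left[ {{\pi _\theta }\left( {{s_t}} \right)} \right]$, the sketch below computes the entropy of a softmax policy and its gradient with respect to the logits. The softmax parameterization and the sample logits are assumptions made for illustration; the network actually used may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(logits):
    """H[pi] = -sum_a pi(a) * ln pi(a) for the softmax policy."""
    return -sum(p * math.log(p) for p in softmax(logits))


def entropy_grad(logits):
    """Gradient of H with respect to the logits: dH/dtheta_j = -pi_j * (ln pi_j + H)."""
    pi = softmax(logits)
    h = entropy(logits)
    return [-p * (math.log(p) + h) for p in pi]


logits = [2.0, 0.0, -1.0]  # hypothetical, strongly peaked policy
print(entropy(logits))
print(entropy_grad(logits))
```

For a peaked policy the gradient is negative on the dominant logit and positive on the others, so adding $\delta {\nabla _\theta }H$ to the update pushes the policy toward more exploration.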

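To improve the stability of learning and speed it up, the A3C algorithm also uses an asynchronous approach: multiple agents run in different threads and jointly update a single policy network.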
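The idea of several workers updating one shared set of parameters can be sketched with Python threads. This is a toy illustration under strong assumptions: the "network" is a single scalar, the loss is the hypothetical quadratic $(\theta - 3)^2$, and a lock guards each update to keep the example deterministic. It is not the training code used in the experiments.

```python
import threading


class SharedParameter:
    """A single shared scalar parameter, standing in for the shared policy network."""

    def __init__(self, value):
        self.value = value
        self.updates = 0
        self.lock = threading.Lock()


def worker(shared, target, learning_rate, steps):
    """One agent thread: read the shared parameter, compute a gradient, apply it."""
    for _ in range(steps):
        with shared.lock:  # the lock is a simplification of this sketch
            local = shared.value                # copy of the shared parameter
            grad = 2.0 * (local - target)       # gradient of the toy loss (theta - target)**2
            shared.value = local - learning_rate * grad
            shared.updates += 1


def train(num_workers=4, steps=50):
    shared = SharedParameter(0.0)
    threads = [
        threading.Thread(target=worker, args=(shared, 3.0, 0.1, steps))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared


shared = train()
print(shared.updates, shared.value)
```

All four threads contribute updates to the same parameter, which ends close to the minimizer of the toy loss.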

2 A3C algorithm based on state attention

 ${R_t} = \sum\limits_{k = 0}^\infty {{\gamma ^k}{w_\vartheta }\left( {{s_{t + k}}} \right)r\left( {{s_{t + k}},{a_{t + k}}} \right)}$

 ${Q^\pi }\left( {{s_t},{a_t}} \right) = {w_\vartheta }\left( {{s_t}} \right)r\left( {{s_t},{a_t}} \right) + \gamma {Q^\pi }\left( {{s_{t + 1}},{a_{t + 1}}} \right)$ (4)

 ${V^\pi }\left( {{s_t}} \right) = {E_{{\pi _\theta }}}\left[ {w_\vartheta }\left( {{s_t}} \right)r\left( {{s_t}} \right) + \gamma {V^\pi }\left( {{s_{t + 1}}} \right) \right]$ (5)
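The recursion in Eq. (4) can be checked against the attention-weighted return on a single finite trajectory. In this sketch the attention weights ${w_\vartheta }\left( {{s_{t + k}}} \right)$ are given as plain numbers rather than produced by a network, the value beyond the end of the trajectory is taken to be zero, and all values are hypothetical.

```python
def weighted_return_direct(rewards, weights, gamma):
    """Direct sum: sum_k gamma**k * w(s_{t+k}) * r(s_{t+k}, a_{t+k})."""
    return sum(
        (gamma ** k) * w * r
        for k, (w, r) in enumerate(zip(weights, rewards))
    )


def weighted_returns_recursive(rewards, weights, gamma):
    """Backward pass of Eq. (4): Q_t = w_t * r_t + gamma * Q_{t+1}.

    The value after the last step is assumed to be zero.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = weights[t] * rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = [1.0, 1.0, 1.0]   # hypothetical rewards
weights = [1.0, 0.5, 0.0]   # hypothetical attention weights w(s_t), w(s_{t+1}), w(s_{t+2})
gamma = 0.5

print(weighted_return_direct(rewards, weights, gamma))      # 1 + 0.25 + 0 = 1.25
print(weighted_returns_recursive(rewards, weights, gamma))  # [1.25, 0.5, 0.0]
```

With all weights equal to 1 the result reduces to the ordinary discounted return, so the attention weight acts as a per-state scaling of the reward.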

 ${J'_{{w_\vartheta }}}\left( {{s_t}} \right) = - {\left[ {G_{{w_\vartheta }}^\pi \left( {{s_t}} \right) - {V^\pi }\left( {{s_t}} \right)} \right]^2}$ (6)

 $G_{{w_\vartheta }}^\pi \left( {{s_t}} \right) = {E_{{\pi _\theta }}}\left[ \displaystyle\sum\limits_{k = 0}^n {\gamma ^k}{w_\vartheta }\left( {{s_{t + k}}} \right) r\left( {{s_{t + k}},{a_{t + k}}} \right) + {\gamma ^{n + 1}}{V^\pi }\left( {{s_{t + n + 1}}} \right) \right]$

 ${\nabla _\vartheta }{{J'}_{{w_\vartheta }}}\left( {{s_t}} \right) = - 2\left[ {G_{{w_\vartheta }}^\pi \left( {{s_t}} \right) - {V^\pi }\left( {{s_t}} \right)} \right] {\nabla _{{w_\vartheta }}}G_{{w_\vartheta }}^\pi \left( {{s_t}} \right){\nabla _\vartheta }{w_\vartheta }\left( {{s_t}} \right)$
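The chain rule in the gradient above can be verified numerically. The sketch below assumes a hypothetical attention function ${w_\vartheta }\left( s \right) = \sigma \left( {\vartheta \phi \left( s \right)} \right)$ with a scalar parameter and a scalar state feature $\phi \left( s \right)$, treats ${V^\pi }$ as a constant with respect to $\vartheta$, uses a single sampled rollout in place of the expectation, and applies the chain rule to every state in the $n$-step window. None of these choices is taken from the experiments.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attention(theta, feature):
    """Hypothetical attention weight w_theta(s) = sigmoid(theta * phi(s))."""
    return sigmoid(theta * feature)


def attention_target(theta, features, rewards, bootstrap_value, gamma):
    """Attention-weighted n-step target G_w for one sampled rollout."""
    total = 0.0
    for k, (f, r) in enumerate(zip(features, rewards)):
        total += (gamma ** k) * attention(theta, f) * r
    return total + (gamma ** len(rewards)) * bootstrap_value


def attention_objective(theta, features, rewards, bootstrap_value, value, gamma):
    """J' = -(G_w - V(s_t))**2, as in Eq. (6)."""
    g = attention_target(theta, features, rewards, bootstrap_value, gamma)
    return -(g - value) ** 2


def attention_objective_grad(theta, features, rewards, bootstrap_value, value, gamma):
    """dJ'/dtheta = -2 * (G_w - V) * sum_k (dG_w/dw_k) * (dw_k/dtheta)."""
    g = attention_target(theta, features, rewards, bootstrap_value, gamma)
    dg_dtheta = 0.0
    for k, (f, r) in enumerate(zip(features, rewards)):
        w = attention(theta, f)
        dg_dw = (gamma ** k) * r        # derivative of G_w with respect to w_k
        dw_dtheta = w * (1.0 - w) * f   # derivative of the sigmoid weight
        dg_dtheta += dg_dw * dw_dtheta
    return -2.0 * (g - value) * dg_dtheta


# Hypothetical rollout data.
theta = 0.3
features = [1.0, -0.5, 2.0]
rewards = [1.0, 0.0, 2.0]
bootstrap_value, value, gamma = 1.5, 2.0, 0.9

analytic = attention_objective_grad(theta, features, rewards, bootstrap_value, value, gamma)
eps = 1e-6
numeric = (
    attention_objective(theta + eps, features, rewards, bootstrap_value, value, gamma)
    - attention_objective(theta - eps, features, rewards, bootstrap_value, value, gamma)
) / (2 * eps)
print(analytic, numeric)
```

The analytic and finite-difference gradients agree closely, which supports the chain-rule form under the assumptions listed above.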

3 Experimental verification

3.1 Network structure

3.2 Verification of experimental results

4 Conclusion

