Ship Science and Technology, 2024, Vol. 46, Issue (12): 84-89. DOI: 10.3404/j.issn.1672-7649.2024.12.015


Heterogeneous UUV cluster task allocation algorithm based on PPO
DONG Jingwei1, YAO Yao1, FENG Jingxiang1, LI Yazhe1, YOU Yue2
1. The 716 Research Institute of CSSC, Lianyungang 222061, China;
2. No.92578 Unit of PLA, Beijing 100071, China
Abstract: Task allocation for a UUV cluster is one of the key problems in forming the cluster's underwater capability. However, owing to limited communication and detection capabilities, a UUV can obtain only partial information underwater, and that information cannot be fully exploited. A task allocation algorithm based on deep reinforcement learning is therefore proposed. To address the scarcity of underwater information and the sparsity of rewards, a curiosity module is added to the proximal policy optimization (PPO) algorithm, giving agents an incentive to reduce uncertainty in the environment and encouraging UUVs to explore its unpredictable parts, thereby achieving optimal task allocation for the UUV cluster. Simulation experiments show that, compared with traditional intelligent algorithms, the proposed algorithm converges faster and is more reliable.
Key words: task allocation; proximal policy optimization; cluster
0 Introduction

1 Environment model

1.1 Problem description

 $\langle U, T, K \rangle .$ (1)

 $U = \{ U_1, U_2, \cdots, U_{N_U} \} .$ (2)

The set of weapon resources carried by a UUV is $res_i = \{ res_{i1}, res_{i2}, \cdots, res_{im} \}$, where $res_{im}$ is the quantity of the $m$-th type of weapon resource carried by the $i$-th UUV.

 $R^{T_i} = [R_1^{T_i}, R_2^{T_i}, \cdots, R_m^{T_i}] .$ (3)

 $Val_{q,t_e} = Val_{q,t_0} \cdot e^{-\beta (t_e - t_0)} .$ (4)
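The exponential decay of task value in Eq. (4) can be sketched as a small helper (a minimal sketch; the function and parameter names are illustrative, not from the paper):

```python
import math

def decayed_task_value(val_t0: float, t0: float, t_e: float, beta: float) -> float:
    """Task value at time t_e, decayed exponentially from its value at t0:
    Val(t_e) = Val(t0) * exp(-beta * (t_e - t0))."""
    return val_t0 * math.exp(-beta * (t_e - t0))
```

With decay rate $\beta = 0$ the value is constant; for $\beta > 0$ a task executed later is worth strictly less, which is what pressures the cluster to allocate tasks promptly.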

2) Task benefit function

 $W = \sum\limits_{q=1}^{N_T} \sum\limits_{i=1}^{N_U} (\omega_1 Val - \omega_2 Cost) .$ (11)
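The overall benefit in Eq. (11) is a weighted sum of value minus cost over every task-UUV pair; a minimal sketch (the matrix layout and names are illustrative assumptions):

```python
def cluster_benefit(vals, costs, w1, w2):
    """Overall benefit W = sum over tasks and UUVs of (w1 * Val - w2 * Cost).

    vals and costs are N_T x N_U matrices (lists of lists): vals[q][i] is the
    value UUV i obtains from task q, and costs[q][i] the cost it incurs.
    """
    return sum(
        w1 * v - w2 * c
        for val_row, cost_row in zip(vals, costs)
        for v, c in zip(val_row, cost_row)
    )
```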

2 Heterogeneous UUV cluster task allocation algorithm based on deep reinforcement learning

2.1 Deep reinforcement learning

 Fig. 1 General framework for reinforcement learning

 $MDP = (S, A, P_{sa}, R) .$ (12)

1) Policy: the agent's action function, which guides the agent to execute actions according to a specific strategy.

2) Value: a measure of how good an action or a state is; the agent selects actions according to the value function.

3) Model: a simulation of the environment in which the agent operates.
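The tuple $(S, A, P_{sa}, R)$ in Eq. (12) can be made concrete with a toy example; all states, actions, transition probabilities, and rewards below are invented purely for illustration:

```python
import random

# Toy MDP (S, A, P_sa, R); every state, action, and number is illustrative.
P = {  # P[s][a] -> list of (next_state, probability) pairs
    "s0": {"stay": [("s0", 1.0)], "move": [("s1", 0.8), ("s0", 0.2)]},
    "s1": {"stay": [("s1", 1.0)], "move": [("s0", 1.0)]},
}
R = {("s0", "move"): 1.0, ("s1", "move"): 0.5}  # unlisted (s, a) pairs yield 0

def step(state, action, rng):
    """Sample one transition from P_sa and look up the reward: (s', r)."""
    next_states, probs = zip(*P[state][action])
    next_state = rng.choices(next_states, weights=probs, k=1)[0]
    return next_state, R.get((state, action), 0.0)
```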

2.2 Policy gradient algorithm

 $\pi_\theta(a \mid s) = P[a \mid s, \theta] .$ (13)

 $\pi_\theta(a \mid s) = P\{ a_t = a \mid s_t = s, \theta_t = \theta \} .$ (14)

 $\theta_{t+1} = \theta_t + \alpha \nabla \hat{J}(\theta_t) .$ (15)
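The update in Eq. (15) is plain gradient ascent on the objective $\hat{J}$; a minimal sketch with the parameter vector as a flat list (names are illustrative):

```python
def gradient_ascent_step(theta, grad_j, alpha):
    """One policy-gradient update: theta <- theta + alpha * grad_J(theta)."""
    return [t + alpha * g for t, g in zip(theta, grad_j)]
```

Ascent (a plus sign) is used rather than descent because the policy objective is a return to be maximized.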

2.3 The PPO algorithm

 $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}, \qquad r_t(\theta_{\text{old}}) = 1 .$ (16)

 $J^{CLIP}(\theta) = E\left[ \min\left( r_t(\theta) A_t^{\pi_{\theta_{\text{old}}}},\; \operatorname{clip}\left(r_t(\theta), 1 - \varepsilon, 1 + \varepsilon\right) A_t^{\pi_{\theta_{\text{old}}}} \right) \right] .$ (17)
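The clipped surrogate in Eq. (17) can be estimated directly on a batch of probability ratios and advantage estimates; a minimal sketch in plain Python (the function and argument names are illustrative):

```python
def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return min(max(x, lo), hi)

def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Batch estimate of J^CLIP = E[min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)]."""
    terms = [
        min(r * a, clip(r, 1.0 - eps, 1.0 + eps) * a)
        for r, a in zip(ratios, advantages)
    ]
    return sum(terms) / len(terms)
```

Because the minimum is taken elementwise, the clipped term only binds when it lowers the objective, which removes any incentive for a single update to push the ratio $r_t(\theta)$ outside $[1-\varepsilon, 1+\varepsilon]$.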
2.4 The PPO+Curiosity algorithm

 Fig. 2 Complete view of the PPO+Curiosity algorithm
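The curiosity module rewards the agent for visiting states its learned forward model predicts poorly, which is what encourages exploration of the unpredictable parts of the environment under sparse extrinsic rewards. A minimal sketch of that reward shaping (the feature vectors, scale factor, and function names are illustrative assumptions; a real module would also train the forward model):

```python
def intrinsic_reward(predicted_next_feat, actual_next_feat, eta=0.5):
    """Curiosity bonus: eta/2 times the squared error between the forward
    model's predicted next-state features and the features actually observed.
    Larger prediction error -> less familiar state -> bigger exploration bonus."""
    sq_err = sum((p - a) ** 2 for p, a in zip(predicted_next_feat, actual_next_feat))
    return 0.5 * eta * sq_err

def shaped_reward(extrinsic, predicted_next_feat, actual_next_feat, eta=0.5):
    """Total reward fed to PPO: environment reward plus the curiosity bonus."""
    return extrinsic + intrinsic_reward(predicted_next_feat, actual_next_feat, eta)
```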
3 Simulation and analysis

3.1 Environment setup

 Fig. 3 Simulation diagram of heterogeneous UUV cluster task scenarios
3.2 Simulation results

One Type-I reconnaissance UUV and three Type-II reconnaissance-strike UUVs were deployed, with 10 task points generated in total.

 Fig. 4 Allocation results with 10 task points

One Type-I reconnaissance UUV and three Type-II reconnaissance-strike UUVs were deployed, with 20 task points generated in total.

 Fig. 5 Allocation results with 20 task points

 Fig. 6 Solution time comparison results
3.3 Comparison between the PPO and PPO+Curiosity algorithms

 Fig. 7 Comparison chart of training rewards
4 Conclusion
