Introduction
- In existing multi-agent reinforcement learning methods, the policy of each agent can be highly sensitive and fragile with respect to the policies of the other agents. To enhance system robustness, we therefore use the minimax strategy to define the objective function, treating the external disturbance received by the agents as part of the cumulative reward. This expands the agents' reward space and enhances their exploration ability (see the schematic objective after this list).
- To address the problem of sparse rewards, we employ a hindsight experience replay mechanism that improves the sampling strategy of experience replay and the utilization efficiency of the sampled data. Unlike existing methods, which avoid repeated sampling within the same batch of data, we divide the experience data into multiple batches and process them in multiple threads, improving both data diversity and the utilization of experience data (a minimal sketch follows this list).
- We verify the effectiveness of the proposed algorithm in the MAMT search scenario. Evaluation results show that the PHER-M3DDPG algorithm adaptively adjusts the agents' strategies and outperforms existing algorithms in terms of convergence speed and task completion time.
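For concreteness, the minimax objective can be written schematically as follows. This is a sketch in the spirit of M3DDPG rather than the paper's exact formulation; the symbols (\(\rho\) for the state distribution, \(\mu_{\theta_i}\) for agent \(i\)'s deterministic policy, \(\hat{a}_{-i}\) for a bounded disturbance of the other agents' actions, and \(Q_i\) for the centralized action-value function) are illustrative:

\[
\max_{\theta_i} J_i(\theta_i) = \mathbb{E}_{s \sim \rho}\left[ \min_{\hat{a}_{-i}:\, \lVert \hat{a}_{-i} - a_{-i} \rVert \le \epsilon} Q_i\big(s, a_i, \hat{a}_{-i}\big) \right], \qquad a_i = \mu_{\theta_i}(s).
\]

Each agent thus optimizes its policy against the worst-case perturbation of the other agents' behavior, which is how the external disturbance enters the cumulative reward.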
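The parallel replay mechanism can be sketched as below. This is a minimal illustration, assuming episodes are stored with achieved goals and relabeled with HER's "final" strategy; all identifiers (`Transition`, `parallel_hindsight_sample`, etc.) are hypothetical rather than the paper's code, and `K` and `k` here count episodes for simplicity, with the paper's values K = 256 and k = 128 used as defaults.

```python
import random
from collections import namedtuple
from concurrent.futures import ThreadPoolExecutor

Transition = namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "goal", "achieved_goal"])

def relabel_with_hindsight(episodes):
    """HER 'final' strategy: pretend the goal actually achieved at the end of
    each episode was the desired goal, and recompute the sparse reward."""
    relabeled = []
    for episode in episodes:
        hindsight_goal = episode[-1].achieved_goal  # goal reached at episode end
        for t in episode:
            reward = 0.0 if t.achieved_goal == hindsight_goal else -1.0
            relabeled.append(t._replace(reward=reward, goal=hindsight_goal))
    return relabeled

def parallel_hindsight_sample(buffer, K=256, k=128, workers=2):
    """Sample K episodes from the replay buffer, split them into batches of
    size k, and relabel each batch in its own thread (the 'parallel' in PHER)."""
    sampled = random.sample(buffer, K)          # buffer: list of episodes
    batches = [sampled[i:i + k] for i in range(0, K, k)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        relabeled_batches = list(pool.map(relabel_with_hindsight, batches))
    # Concatenate the relabeled batches for the learner update.
    return [t for batch in relabeled_batches for t in batch]
```

Splitting the K sampled episodes into K/k thread-local batches is what allows the relabeling work to proceed concurrently instead of serially over one large batch.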
Related work
Traditional algorithms for multi-target search
RL algorithms for multi-target search
Solutions for sparse rewards
Background and preliminary
Markov decision process
Deterministic policy gradient algorithm
PHER-M3DDPG algorithm
Architecture design of the PHER-M3DDPG algorithm
Minimax optimization
Parallel hindsight experience replay
Complexity analysis
Performance evaluation
Simulation settings
| Parameters | Value |
|---|---|
| Discount factor (\(\gamma\)) | 0.95 |
| Batch size (batchsize) | 1024 |
| Inertial update rate (\(\tau\)) | 0.01 |
| Learning rate of the actor and critic networks (\(\alpha\)) | 0.001 |
| Maximum number of episodes (\(S_{\max}\)) | 60,000 |
| Maximum step length (\(E_{\max}\)) | 25,000 |
| Number of agents (\(N\)) | 3 |
| Number of targets (\(N\)) | 3 |
| Number of obstacles (\(J\)) | 2 |
| Sampled data size (\(K\)) | 256 |
| Batch of data size (\(k\)) | 128 |
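For reference, the settings above can be collected into a single configuration; the following Python dictionary is illustrative only, and its key names are not taken from the paper's code:

```python
# Simulation hyperparameters from the table above (key names are illustrative).
CONFIG = {
    "gamma": 0.95,           # discount factor
    "tau": 0.01,             # inertial (soft) update rate for target networks
    "batch_size": 1024,      # samples per gradient update
    "lr": 0.001,             # learning rate of actor and critic networks
    "max_episodes": 60_000,  # S_max
    "max_steps": 25_000,     # E_max
    "num_agents": 3,         # N
    "num_targets": 3,
    "num_obstacles": 2,      # J
    "K": 256,                # sampled data size for PHER
    "k": 128,                # batch-of-data size per thread
}
```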
Results analysis
- As the number of CPUs increases, the agents take less time to complete the task; the shortest time reaches 45 s.
- The time required for the agents to complete the task depends not only on the number of CPUs but also on the choice of k. Considering the selection of k under different numbers of CPUs, we find that the optimal batch of data is k = 128.
- From Fig. 8a, the PHER-M3DDPG algorithm reaches convergence in about 5000 episodes, which shows that the parallel hindsight experience replay mechanism can effectively improve the data utilization rate and accelerate the convergence of the system.
- As shown in Fig. 8b, to demonstrate the superiority of the PHER-M3DDPG algorithm in the MAMT search environment, we compare the reward values of different algorithms over 5000 episodes. Within those 5000 episodes, the PHER-M3DDPG algorithm clearly outperforms the other algorithms. By solving the sparse-reward problem, PHER-M3DDPG trains the agents more effectively and improves the search efficiency of multi-agent cooperation.
- In the original environment, the agents suffer from sparse rewards while searching for the targets, so the HER mechanism is adopted to provide dense rewards during the task and motivate the agents to find the targets faster. As shown in Fig. 8c, because the reward curves fluctuate widely during M3DDPG training, we compare the PHER-M3DDPG algorithm with three other algorithms over 40,000 episodes to better assess task completion time. These experiments show that the PHER-M3DDPG algorithm reaches convergence in about 1000 episodes, and, compared with the other algorithms, its agents complete the task in about 42 s. The PHER-M3DDPG curve is also more stable, indicating that the minimax strategy effectively enhances the robustness of the system.