
Neurocomputing, Volume 233, 12 April 2017, Pages 34-42

Path planning of multi-agent systems in unknown environment with neural kernel smoothing and reinforcement learning

https://doi.org/10.1016/j.neucom.2016.08.108

Abstract

Path planning is a basic task in robot navigation, especially for autonomous robots. It is more complex and difficult for multi-agent systems. Popular reinforcement learning methods cannot solve the path planning problem directly in an unknown environment.

In this paper, the classical multi-agent reinforcement learning algorithm is modified so that it does not require knowledge of unvisited states. Neural networks and kernel smoothing techniques are applied to approximate the greedy actions by estimating the unknown environment. Experimental and simulation results show that the proposed algorithms can generate paths in an unknown environment for multiple agents.

Introduction

A multi-agent system includes several intelligent agents in an environment. Each agent has its own independent behavior and coordinates with the others [34]. An important benefit of using a multi-agent system is that it can model the cooperation found in real-life situations. A multi-agent system can also learn new behaviors, which makes it possible to predict the performance of natural systems [39]. There are many other applications of multi-agent systems, such as robot teams [13], distributed control [16], resource management [27], collaborative decision making [3], and data mining [31].

Path planning generates a path from a starting point to an ending point subject to certain constraints. It is important in several real-life problems, such as mobile robots, logistics, and game design [29]. The path planning problem can be solved by state searching. Single-agent path planning involves a limited number of states in an environment. For multi-agent path planning, many units move simultaneously in the environment, and the dimension of the configuration space grows exponentially [22]. Path planning for multi-agent systems therefore becomes much more difficult [17].

There are several methods for multi-agent path planning. In [16], a high-gain decentralized control law is designed to solve the consensus problem [40]. In [4], [12], a grid-graph method is applied to design a state–task graph. In [37], an undirected graph is used for multi-agent path planning. In [11], a random-tree algorithm is extended to update the paths. Artificial intelligence methods have learning ability and can solve the path planning problem directly [6]. For a single agent, a genetic algorithm is used in [9], kernel smoothing is applied in [23], fuzzy logic is used in [21], and an iterative learning approach is applied in [24].

Reinforcement learning (RL) is the most popular learning method for multi-agent systems [18], [35]. In this setting it is called multi-agent reinforcement learning (MARL) [7]. The objective of MARL is to maximize a reward function defined by the environment and the agents, such that the agents can interact with the environment [18]. At each learning step, the agent senses the environment, takes an action, and transits to a new state of the environment [2]. The quality of each transition is evaluated by a reward function, so the agents learn which action yields the largest reward [32]. In order to generate a good action, RL feedback is needed [10]. The Q-learning algorithm is a popular form of RL feedback [33]. Watkins and Dayan [38] give the Q-learning algorithm for the single-agent case. Abdi et al. [1] propose a multi-agent Q-learning method. Both assume that the environment can be described by a set of states.
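For reference, the single-agent Q-learning rule of Watkins and Dayan [38] updates the state–action value table as follows (written here in standard textbook form with learning rate α, discount factor γ, and received reward r_{t+1}; multi-agent variants such as WoLF-PHC [5] build on this update with a variable learning rate):

    Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]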

In an unknown or dynamic environment, if the desired information is available, supervised learning methods can be applied. In [23], neural networks are used to estimate moving obstacles. In [15], the velocity is estimated in the presence of unknown but bounded disturbances. In [13], single-agent path planning is realized by RL and neural networks. In [19], RL is applied to the multi-agent case with a fuzzy logic technique, but it does not include a learning mechanism. However, if the desired information is not available, the above methods do not work.

This paper takes advantage of both unsupervised learning (kernel smoothing) [26] and supervised learning (neural networks) [25]. We combine them with RL to solve multi-agent path planning in an unknown environment. This hybrid algorithm includes two training phases (a minimal code sketch follows the list):

  1. The Win or Learn Fast Policy Hill-Climbing method (WoLF-PHC) [5] is modified. In this stage, the task model consists of the transitions and the reward functions. Each agent does not know the other agents' actions or reward functions. The agents explore the unknown environment and collect state–action information through unsupervised kernel smoothing and the modified WoLF-PHC.

  2. The neural networks are integrated with the RL algorithm. We use the states obtained in Stage 1 as desired values, so that the supervised learning of the neural networks can be applied. Finally, the action (the controller) is given by a greedy policy from the state–action Q-table.
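The following is a minimal, self-contained sketch of these two phases on a hypothetical 5×5 grid world. Plain ε-greedy Q-learning stands in for the modified WoLF-PHC of Stage 1, and a scikit-learn MLP stands in for the neural network of Stage 2; all names (step, GOAL, policy_net) and parameter values are illustrative, not taken from the paper.

# Minimal sketch of the two-phase idea above, on a hypothetical 5x5 grid world.
# Plain epsilon-greedy Q-learning stands in for the modified WoLF-PHC of Stage 1,
# and a scikit-learn MLP stands in for the paper's neural network of Stage 2.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, GOAL = 25, 4, 24      # states 0..24, actions: up/down/left/right

def step(s, a):
    """Toy deterministic transition; reward 1 when the goal cell is reached."""
    r, c = divmod(s, 5)
    if a == 0:   r = max(r - 1, 0)
    elif a == 1: r = min(r + 1, 4)
    elif a == 2: c = max(c - 1, 0)
    else:        c = min(c + 1, 4)
    s_next = 5 * r + c
    return s_next, float(s_next == GOAL)

# Stage 1: explore the (unknown) environment and fill the state-action Q-table.
Q = np.zeros((N_STATES, N_ACTIONS))
alpha, gamma, eps = 0.5, 0.9, 0.2
for episode in range(300):
    s = 0
    for _ in range(50):
        a = rng.integers(N_ACTIONS) if rng.random() < eps else int(Q[s].argmax())
        s_next, reward = step(s, a)
        Q[s, a] += alpha * (reward + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
        if reward > 0:
            break

# Stage 2: supervised learning. The greedy actions extracted from the Stage-1
# Q-table serve as desired values for a small neural network, which can then
# suggest actions for states that were never tabulated.
X = np.array([[s // 5, s % 5] for s in range(N_STATES)], dtype=float)
y = Q.argmax(axis=1)
policy_net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=3000).fit(X, y)
print(policy_net.predict([[0.0, 0.0]]))     # greedy action suggested at the start cell

The point of Stage 2 is that the trained network can generalize the greedy policy to states whose Q-table entries were never filled during exploration, which is where the tabular method alone fails.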

In this paper, we also show that the Q-values converge after a finite number of iterations. Our algorithm is tested with two Khepera mobile robots in a dynamic environment. The experimental results show that our algorithms are simple and effective for multi-agent path planning in an unknown environment.

Section snippets

Reinforcement learning for path planning of multi-agent systems

The current state of the environment observed by the agent is defined as s_t and the action of the agent is defined as a_t. The model of single-agent reinforcement learning is a Markov decision process. It is defined as:

Definition 1

Single-agent RL process in a dynamic environment is

    f_t : S_t × A_t × S_{t−1} → [0, 1]
    ρ_t : S_t × A_t × S_{t−1} → ℝ

where S_t and S_{t−1} are the environment states at times t and t−1, respectively, A_t is the set of agent actions at time t, f_t is the state transition probability function from time t−1 to time t, and

Reinforcement learning with kernel smoothing and neural networks

The RL algorithm needs all states of the agents. If some states are not available, the MARL discussed above does not work. In this section, we use kernel smoothing and neural networks to estimate the unavailable states and then send them to the MARL algorithm.
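As a hedged illustration of this idea, the sketch below uses a Nadaraya–Watson kernel smoother to estimate a missing state value from nearby observed samples before it would be handed to the MARL update; the Gaussian kernel, bandwidth, and sample values are illustrative assumptions rather than the paper's exact design.

# Hedged sketch: a Nadaraya-Watson kernel smoother estimates a missing state
# value from nearby observed samples. The Gaussian kernel and bandwidth h are
# illustrative choices, not necessarily the paper's exact settings.
import numpy as np

def kernel_smooth(x_query, X_obs, y_obs, h=0.5):
    """Weighted average of observed values y_obs, weighted by distance to x_query."""
    d2 = np.sum((X_obs - x_query) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * h ** 2))            # Gaussian kernel weights
    return float(np.dot(w, y_obs) / np.sum(w))

# Example: estimate the state value at an unvisited position (1.5, 1.5).
X_obs = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
y_obs = np.array([0.2, 0.4, 0.4, 0.8])
print(kernel_smooth(np.array([1.5, 1.5]), X_obs, y_obs))   # equidistant samples, so ≈ 0.45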

Simulation and experimental results

The objective of the simulations is to check and validate the operation of our proposed method. We also use two mobile robots to compare our algorithms with other techniques.

Conclusion

In this paper, we overcome the difficulty of applying reinforcement learning in an unknown environment. Combining kernel approximation and neural networks with the WoLF-PHC algorithm overcomes the drawbacks of reinforcement learning, such as slow learning speed, high computational cost, and the inability to learn in unknown environments.

The robustness to changes in the environment depends on the approximation and generalization ability of the kernel method. The simulations and the experimental


References (41)

  • J. Abdi et al., Emotional temporal difference Q-learning signals in multi-agent system cooperation: real case studies, IET Intell. Transp. Syst. (2013)
  • I. Arel et al., Reinforcement learning-based multi-agent system for network traffic signal control, IET Intell. Transp. Syst. (2010)
  • D. Barbucha, Search modes for the cooperative multi-agent system solving the vehicle routing problem, Neurocomputing (2012)
  • S. Bhattacharya, M. Likhachev, V. Kumar, Multi-agent path planning with multiple tasks and distance constraints, in: ...
  • M. Bowling et al., Multiagent learning using a variable learning rate, Artif. Intell. (2002)
  • S.T. Brassai et al., Artificial intelligence in the path planning optimization of mobile agent navigation, Proc. Econ. Finance (2012)
  • L. Busoniu et al., A comprehensive survey of multiagent reinforcement learning, IEEE Trans. Syst. Man Cybern. Part C: Appl. Rev. (2008)
  • L. Busoniu, R. Babuska, B. De Schutter, Multi-agent reinforcement learning: an overview, Innovations in MASs and ...
  • Z. Cai et al., Cooperative coevolutionary adaptive genetic algorithm in path planning of cooperative multi-mobile robot systems, J. Intell. Robot. Syst. (2002)
  • V. Cherkassky et al., Learning from Data: Concepts, Theory and Methods (1998)
  • V.R. Desaraju, J.P. How, Decentralized path planning for multi-agent teams in complex environments using ...
  • J. Fu et al., Adaptive consensus tracking of high-order nonlinear multi-agent systems with directed communication graphs, Int. J. Control Autom. Syst. (2014)
  • V. Ganapathy, S. Chin, H. Kusama Joe, Neural Q-learning controller for mobile robot, in: IEEE/ASME International ...
  • V. Ganapathy, S.C. Yun, W.L.D. Lui, Utilization of Webots and Khepera II as a platform for neural Q-learning ...
  • H. Hu et al., Second-order consensus of multi-agent systems with unknown but bounded disturbance, Int. J. Control Autom. Syst. (2013)
  • A. Wei et al., Consensus of linear multi-agent systems subject to actuator saturation, Int. J. Control Autom. Syst. (2013)
  • S.-H. Ji, J.-S. Choi, B.-H. Lee, A computational interactive approach to multi-agent motion planning, Int. J. Control ...
  • L.P. Kaelbling et al., Reinforcement learning: a survey, J. Artif. Intell. Res. (1996)
  • M. Kaya et al., Modular fuzzy-reinforcement learning approach with internal model capabilities for multiagent systems, IEEE Trans. Syst. Man Cybern. Part B: Cybern. (2004)
  • K-Team Corporation, 2013. ...

David Luviano Cruz is a Ph.D. student in the Automatic Control Department at CINVESTAV-IPN. He received his undergraduate degree in Electronic Engineering in 2007 and his M.S. in automatic control in 2010, both from CINVESTAV-IPN. His research interests include learning and interaction in multi-agent systems.

Wen Yu received the B.S. degree from Tsinghua University, Beijing, China, in 1990, and the M.S. and Ph.D. degrees, both in Electrical Engineering, from Northeastern University, Shenyang, China, in 1992 and 1995, respectively. From 1995 to 1996, he served as a Lecturer in the Department of Automatic Control at Northeastern University, Shenyang, China. Since 1996, he has been with the Centro de Investigación y de Estudios Avanzados del Instituto Politécnico Nacional (CINVESTAV-IPN), Mexico City, Mexico, where he is currently a Professor in the Departamento de Control Automático. From 2002 to 2003, he held research positions with the Instituto Mexicano del Petróleo. He was a Senior Visiting Research Fellow with Queen's University Belfast, Belfast, U.K., from 2006 to 2007, and a Visiting Associate Professor with the University of California, Santa Cruz, from 2009 to 2010. Since 2006, he has also held a visiting professorship at Northeastern University in China. Dr. Wen Yu serves as an Associate Editor of Neurocomputing and the Journal of Intelligent and Fuzzy Systems. He is a Member of the Mexican Academy of Sciences.
