Guest editorial
Learning from delayed rewards

https://doi.org/10.1016/0921-8890(95)00026-C

Cited by (123)

  • A theoretical demonstration for reinforcement learning of PI control dynamics for optimal speed control of DC motors by using Twin Delay Deep Deterministic Policy Gradient Algorithm

    2023, Expert Systems with Applications
    Citation Excerpt :

    Q-Learning, a special type of RL approach, first came out in 1989. Watkins used the letter Q for the value function, which is based on the theory of Markov decision processes (Watkins, 1989; Watkins & Dayan, 1992). However, Q-Learning did not attract much interest in its domains until DQN algorithms were developed (Mnih et al., 2013).

  • A differential evolution with reinforcement learning for multi-objective assembly line feeding problem

    2022, Computers and Industrial Engineering
    Citation Excerpt :

    Finally, a tuple consisting of {s, a, s’, r} is stored for agent learning. Here, we employ Q-learning (Watkins, 1989), a value-based RL algorithm, as the agent. RL uses the state to describe the properties of the environment.

  • Deep understanding of big geospatial data for self-driving: Data, technologies, and systems

    2022, Future Generation Computer Systems
    Citation Excerpt :

    A widely applied method is to represent the reward function as a linear combination of functions of a number of manually selected features [71–74]. Q-Learning [75] is one of the most commonly used RL algorithms. It is a model-free algorithm that learns an estimation of the utility of a state–action pair.
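The excerpts above characterize Q-learning as a value-based, model-free RL algorithm that updates an estimate of the utility of each state–action pair from stored {s, a, s', r} experience. As a rough illustration only, and not the formulation of the editorial itself, the following minimal tabular Q-learning sketch in Python assumes a hypothetical 5-state chain environment and illustrative hyperparameter values:

```python
import random
from collections import defaultdict

# Hypothetical 5-state chain: action 0 moves left, action 1 moves right;
# reaching the rightmost state yields reward 1.0 and ends the episode.
N_STATES, ACTIONS = 5, (0, 1)

def step(state, action):
    """Return (next_state, reward, done) for the toy chain environment."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state == N_STATES - 1

# Tabular update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # illustrative values, not tuned
Q = defaultdict(float)                   # Q-table over (state, action) pairs

for episode in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection over the current Q estimates
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next, r, done = step(s, a)     # the experience tuple {s, a, s', r}
        target = r + (0.0 if done else gamma * max(Q[(s_next, act)] for act in ACTIONS))
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next

# Learned state values (max over actions) for the toy chain
print({s: max(Q[(s, a)] for a in ACTIONS) for s in range(N_STATES)})
```

The epsilon-greedy rule and the learning-rate/discount values here are placeholders; any environment exposing (next state, reward, done) transitions could be substituted for the toy chain.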


Tel.: +31 20 525-7463, Fax: +31 20 525-7490.
