Published in: Neural Computing and Applications 15/2024

23-02-2024 | Original Article

A double Actor-Critic learning system embedding improved Monte Carlo tree search

Authors: Hongjun Zhu, Yong Xie, Suijun Zheng

Abstract

Overestimation, the bias between the estimated value and the true value, is a fundamental problem in reinforcement learning: it leads to incorrect action decisions and therefore a lower total reward. To reduce the impact of overestimation, we propose a double Actor-Critic learning system embedding an improved Monte Carlo Tree Search (DAC-IMCTS). The proposed learning system consists of a reference module, a simulation module and an outcome module. The reference module and the simulation module compute the upper and lower bounds of the agent's expected reward, respectively, while the outcome module learns the agent's control policies. The reference module, built on the Actor-Critic framework, provides an upper confidence bound on the expected reward. Unlike the classic Actor-Critic learning system, we introduce a simulation module that estimates the lower confidence bound of the expected reward; within this module we propose an improved MCTS to sample the policy distribution more efficiently. Based on the lower and upper confidence bounds, we propose a confidence interval weighted estimation algorithm (CIWE) in the outcome module to generate the target expected reward. We then prove that the target expected reward generated by our method has zero expectation bias, which reduces the overestimation present in the classic Actor-Critic learning system. We evaluate our learning system on OpenAI Gym tasks; the experimental results show that the proposed model and algorithm outperform state-of-the-art learning systems.
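
The abstract only states that CIWE blends the upper bound (from the Actor-Critic reference module) with the lower bound (from the MCTS-based simulation module) into a target expected reward; the exact weighting rule is given in the paper body. The following is a minimal illustrative sketch in Python, assuming a fixed convex weight `beta` and placeholder bound values, not the authors' actual CIWE formulation:

```python
# Illustrative sketch of a confidence-interval-weighted target as described in the
# abstract. The weight `beta`, the names, and the convex-combination form are
# assumptions for illustration; the paper's exact CIWE rule is not reproduced here.
from dataclasses import dataclass


@dataclass
class Bounds:
    upper: float  # upper confidence bound from the Actor-Critic "reference" module
    lower: float  # lower confidence bound from the MCTS-based "simulation" module


def ciwe_target(bounds: Bounds, beta: float = 0.5) -> float:
    """Blend the two bounds into a single target expected reward.

    A convex combination keeps the target inside [lower, upper], which is one
    simple way to counteract the overestimation of relying on the upper bound alone.
    """
    assert 0.0 <= beta <= 1.0
    return beta * bounds.lower + (1.0 - beta) * bounds.upper


if __name__ == "__main__":
    # Hypothetical values: the critic overestimates (upper = 12.0) while the
    # MCTS rollouts suggest a more conservative return (lower = 9.0).
    print(ciwe_target(Bounds(upper=12.0, lower=9.0), beta=0.5))  # -> 10.5
```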


Metadata
Title
A double Actor-Critic learning system embedding improved Monte Carlo tree search
Authors
Hongjun Zhu
Yong Xie
Suijun Zheng
Publication date
23-02-2024
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 15/2024
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-024-09513-4
