3.1 Hierarchical decomposition in the RL domain
Decomposition of MDPs can be broadly classified into two categories. The first is static decomposition, in which the designers must define part or all of the hierarchy [28, 30]. The second is dynamic decomposition, in which the hierarchy components, their positions, and their abstractions are determined during the simulation process [11, 16, 32]. Both techniques focus on decomposing the state or action space into more manageable parts and statically assign each learner to one of these parts. Neither technique allows the migration of agents between different parts of the decomposition.
Parr and Russell [28] proposed an RL approach called HAMQ-learning that combines Q-learning with hierarchical abstract machines (HAMs). This approach effectively reduces the size of the state space by restricting the learnable policies to a set of HAMs. However, state decomposition in this form is hard to apply, since there is no guarantee that it will not affect the modularity of the design or produce HAMs with large state spaces.
Dietterich [11] showed that an MDP can be decomposed into a hierarchy of smaller MDPs, depending on the nature of the problem and how readily it can be split into smaller sub-goals. This work also proposed the MAXQ procedure, which decomposes the value function of an MDP into an additive combination of the value functions of the smaller MDPs. An important advantage of MAXQ decomposition is that it is a dynamic decomposition, unlike the technique used in HAMQ-learning [5].
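To illustrate, the MAXQ decomposition expresses the value of invoking a subtask \(a\) in state \(s\) while solving a parent task \(p\) as the sum of the subtask's own value and a completion term (standard MAXQ notation, sketched here for clarity):
\[ Q^{\pi }(p,s,a) = V^{\pi }(a,s) + C^{\pi }(p,s,a), \]
where \(V^{\pi }(a,s)\) is the expected cumulative reward obtained while executing subtask \(a\) from state \(s\), and \(C^{\pi }(p,s,a)\) is the expected discounted reward for completing \(p\) after \(a\) terminates. Applying this recursively down the task hierarchy yields the additive combination of smaller value functions mentioned above.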
The MAXQ procedure attempts to reduce large problems to smaller ones, but does not take into account the agent's probabilistic prior knowledge about the problem space. This issue can be addressed by incorporating Bayesian reinforcement learning priors on models, value functions or policies [9]. Cao and Ray [9] presented an approach that extends MAXQ by incorporating priors on the primitive environment model and on goal pseudo-rewards. Priors are statistical information from previous policies and problem models that can help a reinforcement learning agent accelerate its learning process. In multi-goal reinforcement learning, priors can be extracted from the models or policies of previously learned goals. This approach is a static decomposition approach, and the probabilistic priors must be given in advance in order to incorporate them into the learning process.
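As a minimal illustration of a prior on the environment model (a generic Bayesian RL construction, not the specific formulation of Cao and Ray [9]), a Dirichlet prior over transition counts can carry knowledge from previously learned goals into a new learning task:

```python
import numpy as np

class DirichletTransitionModel:
    """Bayesian prior over P(s' | s, a) maintained as Dirichlet pseudo-counts."""

    def __init__(self, n_states, n_actions, prior_counts=None, alpha=1.0):
        # prior_counts can be carried over from a previously learned task;
        # otherwise a uniform Dirichlet(alpha) prior is used.
        if prior_counts is not None:
            self.counts = np.array(prior_counts, dtype=float)
        else:
            self.counts = np.full((n_states, n_actions, n_states), alpha)

    def update(self, s, a, s_next):
        # Incorporate one observed transition into the posterior.
        self.counts[s, a, s_next] += 1.0

    def transition_probs(self, s, a):
        # Posterior mean estimate of the transition distribution for (s, a).
        return self.counts[s, a] / self.counts[s, a].sum()
```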
Cai et al. [8] proposed a combined hierarchical reinforcement learning method for multi-robot cooperation in completely unknown environments. The method integrates options with the MAXQ hierarchical reinforcement learning method, where MAXQ is used to identify the problem hierarchy. The proposed method obtains all the required learning parameters through learning, without any need for an explicit environment model, and the cooperation strategy is then built based on the learned parameters. However, multiple simulations are required to build the problem hierarchy, which is a time-consuming process.
Sutton et al. [30] proposed the concept of options, a form of knowledge abstraction for MDPs. An option can be viewed as a primitive task composed of three elements: a learning policy \(\pi : S \rightarrow A \), where S is the state set and A is the action set; a termination condition \(\beta : S^{+} \rightarrow [0,1]\); and an initiation set of states \(I \subseteq S\). An agent can perform an option if \(s_t \in I\), where \(s_t\) is the current state of the agent. The agent chooses an option and then follows its policy until the termination condition is satisfied, at which point it can select another option. A main disadvantage of this approach is that the options need to be determined in advance. In addition, it is difficult to decompose MDPs using this approach because many decomposition elements need to be determined for each option.
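A minimal sketch of the option abstraction and its execution loop is given below (the class layout and the environment interface `env.step` are illustrative assumptions, not code from [30]):

```python
import random
from dataclasses import dataclass
from typing import Callable, Hashable, Set

State = Hashable
Action = Hashable

@dataclass
class Option:
    """An option <I, pi, beta> in the sense of Sutton et al. [30]."""
    initiation_set: Set[State]                  # I: states where the option may start
    policy: Callable[[State], Action]           # pi: maps states to actions
    termination_prob: Callable[[State], float]  # beta: probability of terminating in a state

    def can_start(self, state: State) -> bool:
        return state in self.initiation_set

def run_option(env, option: Option, state: State) -> State:
    """Follow the option's policy until its termination condition fires."""
    assert option.can_start(state)
    while True:
        action = option.policy(state)
        state, reward, done = env.step(action)  # assumed environment interface
        if done or random.random() < option.termination_prob(state):
            return state  # the agent may now select another option
```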
Jardim et al. [20] proposed a dynamic decomposition hierarchical RL method based on the idea that, to reach the goal, the learner must pass through closely connected states (sub-goals). The sub-goals can be detected by intersecting several paths that lead to the goal while the agent is interacting with the environment; temporal abstractions (options) can then be identified from the sub-goals. A drawback of this method is that it requires multiple simulations to define the sub-goals. In addition, it is time-consuming and cannot easily be applied to large learning problems.
Generally, multi-agent cooperation problems can be modelled on the assumption that the state space of \(n\) agents forms a joint state, where each agent \(i\) has access to a partial view \(s_i\) of the joint state \(s=\{s_1, \ldots ,s_{n-1},s_n\}\). In the same manner, the joint action is modelled as \(\{a_1, \ldots , a_{n-1},a_n\}\), where each agent \(i\) may only have access to its partial view \(a_i\). One simple approach to modelling multi-agent coordination is discussed in the survey of Barto and Mahadevan [5]. It shows that this concurrency model of joint state and action spaces can be extended to learn task-level coordination by replacing actions with options. However, the approach does not guarantee convergence to an optimal policy, since the low-level policies keep changing while the high-level policies are being learned.
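A minimal sketch of this concurrency model is a tabular Q-learner defined directly over joint states and joint actions (class and parameter names are illustrative); replacing the primitive joint actions with multi-options gives the task-level coordination discussed above:

```python
from collections import defaultdict

class JointQLearner:
    """Tabular Q-learning over joint states and joint actions.

    Joint states and joint actions are tuples (s_1, ..., s_n) and
    (a_1, ..., a_n) collecting the partial views of the n agents.
    """

    def __init__(self, alpha=0.1, gamma=0.95):
        self.q = defaultdict(float)  # keyed by (joint_state, joint_action)
        self.alpha = alpha
        self.gamma = gamma

    def update(self, joint_s, joint_a, reward, joint_s_next, candidate_joint_actions):
        # Standard one-step Q-learning backup applied to the joint spaces.
        best_next = max((self.q[(joint_s_next, a)] for a in candidate_joint_actions),
                        default=0.0)
        target = reward + self.gamma * best_next
        self.q[(joint_s, joint_a)] += self.alpha * (target - self.q[(joint_s, joint_a)])
```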
The survey of Barto and Mahadevan [5] also discussed another hierarchical reinforcement learning cooperation approach, a hybrid approach that combines options [30] with MAXQ decomposition [11]. An option \(o= \langle I,\pi ,\beta \rangle \) is extended to a multi-option \(\overrightarrow{o}= \langle o_1,\ldots ,o_n \rangle \), where \(o_i\) is the option executed by agent \(i\). The joint action value of a main task \(p\), a state \(s\) and a multi-option \(\overrightarrow{o}\) is denoted \(Q(p,s,\overrightarrow{o})\). The MAXQ decomposition of the Q-function can then be extended to these joint action-values.
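Following the single-agent form given earlier, the extension to multi-options can be sketched as
\[ Q(p,s,\overrightarrow{o}) = V(\overrightarrow{o},s) + C(p,s,\overrightarrow{o}), \]
where \(V(\overrightarrow{o},s)\) is the expected value of executing the multi-option \(\overrightarrow{o}\) from state \(s\) and \(C(p,s,\overrightarrow{o})\) is the expected completion value of the parent task \(p\) after \(\overrightarrow{o}\) terminates (notation carried over from the MAXQ sketch above).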
Hengst [16] proposed HEXQ, a hierarchical RL algorithm that automatically decomposes and solves MDPs. It uses the state variables to construct a hierarchy of sub-MDPs, where the maximum number of hierarchy levels is limited by the number of state variables; the result is a set of interlinked small MDPs. As discussed in [16], the main limitation of HEXQ is that it must discover nested sub-MDPs and find policies that reach their exits (exits are non-predictable state transitions that are not counted as edges of the graph) with probability 1. This requires that the problem space contains state variables that change over long time scales.
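HEXQ begins by ordering the state variables according to how frequently they change during exploration, placing the fastest-changing variables at the lowest hierarchy level. A minimal sketch of that frequency analysis is shown below (function and variable names are illustrative, not from [16]):

```python
from collections import Counter

def variable_change_frequencies(trajectory):
    """Count how often each state variable changes along an exploration trajectory.

    `trajectory` is a sequence of states, each a tuple of state-variable values.
    """
    changes = Counter()
    for prev, curr in zip(trajectory, trajectory[1:]):
        for index, (p, c) in enumerate(zip(prev, curr)):
            if p != c:
                changes[index] += 1
    return changes

def hierarchy_order(trajectory):
    # Variables ordered from most to least frequently changing; the most
    # frequently changing variable defines the lowest level of the hierarchy.
    freqs = variable_change_frequencies(trajectory)
    return [var for var, _ in freqs.most_common()]
```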
Tosic and Vilalta [32] proposed a conceptual RL framework for agent collaboration in large-scale problems. The main idea is to reduce the complexity of RL in large-scale problems by modelling RL as a three-level process: learning by a single agent, co-learning among small groups of agents, and learning at the system level. An important advantage is that the framework supports dynamic adaptation of coalitions among agents, based on continuous exploration and adaptation across the three-layered architecture of the proposed model.
The proposed model of Tosic and Vilalta [32], however, does not specify any communication scheme among its three RL learning levels. Moreover, the model lacks detailed algorithmic specifications of how RL can be implemented in this three-layered learning architecture.
Guestrin and Gordon [14] proposed a distributed planning algorithm for hierarchical factored MDPs that solves large decision problems by decomposing them into sub-decision problems (subsystems). The subsystems are organised in a tree structure. Each subsystem has two types of variables: internal variables, which can be used by the subsystem's value function, and external variables, which cannot be used because their dynamics are unknown. Although the algorithm does not guarantee convergence to an optimal solution, its output plan is equivalent to that of a centralised decision system. The proposal has some limitations. First, although coordination and communication between agents are not centralised, they are restricted by the subsystem tree structure. Second, the definition of a subsystem as an MDP composed of internal and external variables only fits decision problems.
Gunady et al. [15] proposed an RL solution to the problem of territory division in hide-and-seek games, that is, the problem of dividing the search environment between cooperative seekers to achieve optimal seeking performance. The researchers combined a hierarchical RL approach with state aggregation in order to reduce the state space. In state aggregation, similar states are grouped together in two ways: topological aggregation, in which the states are divided into regions based on the distribution of obstacles, and hiding aggregation, in which hiding places are grouped together and treated as the target of the aggregated action. A disadvantage of this algorithm is that it requires the environment model to be known in advance.
The distributed hierarchical learning model described in this paper is based on the structure of modern software systems, where a system is decomposed into multiple subsystems. There are no restrictions on the structure of the system hierarchy. Additionally, there are two levels of learning and coordination between subsystems: at the subsystem level and at the system level. A major goal of this model is to handle dynamic migration of learners between subsystems in a distributed system, in order to increase the overall learning speed.
3.2 Cooperative hunting strategy
Yong and Miikkulainen [37] described two cooperative hunting strategies that can be implemented in the hunter-prey problem. The first is a cooperative hunting strategy for non-communicating teams that involves two different hunter roles: chasers and blockers. A chaser follows the prey's movements, while a blocker moves only horizontally, staying on the prey's vertical axis so as to limit the prey's movement. The second strategy is also a cooperative hunting strategy for non-communicating teams, but involves chasers only: to sandwich the prey, at least two chasers must follow it from opposite directions until the prey agent is surrounded. Both hunting strategies were shown experimentally to be successful. A main advantage of these strategies is that no communication is required between the hunters. However, both strategies require the prey's position to be part of the state definition, so that the chasers and blockers have knowledge of the prey's position.
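A minimal sketch of the two roles on a grid is given below (the coordinate convention and function names are illustrative assumptions, not code from [37]): the chaser steps toward the prey on both axes, while the blocker only adjusts its horizontal position to stay aligned with the prey's vertical axis.

```python
def sign(x):
    return (x > 0) - (x < 0)

def chaser_move(hunter, prey):
    """Chaser: take one step toward the prey on both axes."""
    hx, hy = hunter
    px, py = prey
    return (hx + sign(px - hx), hy + sign(py - hy))

def blocker_move(hunter, prey):
    """Blocker: move horizontally only, aligning with the prey's vertical axis."""
    hx, hy = hunter
    px, _ = prey
    return (hx + sign(px - hx), hy)
```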
Lee [24] proposed a hunting strategy that also involves the chaser and blocker roles. The roles are semantically similar to those in the first strategy of Yong and Miikkulainen [37], but they are described somewhat differently: the chasers drive the prey into a corner of the grid, while the blockers surround the prey so that it does not have enough space to escape. The hunt is considered successful if the prey is captured. A main difference from the blockers of [37] is that the blockers in [24] are required to communicate in order to surround the prey agent. This is a disadvantage of the hunting strategy of [24], because communication between agents requires extra computation.