1 Introduction
2 Background
2.1 Markov decision processes
- \(\mathcal{S}\) is the set of all states the agent can encounter,
- \(\mathcal{A}\) is the set of all actions available,
- \(\mathcal{T}(s,a,s^\prime) = P(s^\prime | s,a)\) is the transition function,
- \(\mathcal{R}(s,a,s^\prime) = E(r | s,a,s^\prime)\) is the reward function, and
- \(\gamma \in [0, 1]\) is the discount factor.
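The tuple above maps directly onto a small data structure. The sketch below is only an illustration of the definition, not code from the paper; the class name `MDP` and the dictionary encodings of \(\mathcal{T}\) and \(\mathcal{R}\) are assumptions made for clarity.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple

State = int
Action = int

@dataclass
class MDP:
    """A finite MDP (S, A, T, R, gamma) as defined above."""
    states: Set[State]                                     # S
    actions: Set[Action]                                   # A
    transition: Dict[Tuple[State, Action, State], float]   # T(s, a, s') = P(s' | s, a)
    reward: Dict[Tuple[State, Action, State], float]       # R(s, a, s') = E[r | s, a, s']
    gamma: float                                           # discount factor in [0, 1]
```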
2.2 Helicopter hovering
Variable | Description |
---|---|
x | x-axis position |
y | y-axis position |
z | z-axis position |
u | x-axis velocity |
v | y-axis velocity |
w | z-axis velocity |
ϕ | Rotation around x-axis (roll) |
θ | Rotation around y-axis (pitch) |
ω | Rotation around z-axis (yaw) |
\(a_1\) | Longitudinal cyclic pitch (aileron) |
\(a_2\) | Latitudinal cyclic pitch (elevator) |
\(a_3\) | Tail rotor collective pitch (rudder) |
\(a_4\) | Main rotor collective pitch |
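For concreteness, the nine observed state variables and the four continuous controls in the table can be bundled into plain containers. This is an illustrative sketch only; the class and field names (`HelicopterState`, `roll`, `pitch`, `yaw`, ...) are assumptions, not identifiers from the competition software.

```python
from dataclasses import dataclass

@dataclass
class HelicopterState:
    """The nine observed state variables from the table above."""
    x: float      # x-axis position
    y: float      # y-axis position
    z: float      # z-axis position
    u: float      # x-axis velocity
    v: float      # y-axis velocity
    w: float      # z-axis velocity
    roll: float   # rotation around the x-axis (phi)
    pitch: float  # rotation around the y-axis (theta)
    yaw: float    # rotation around the z-axis (omega)

@dataclass
class HelicopterAction:
    """The four continuous controls from the table above."""
    a1: float  # longitudinal cyclic pitch (aileron)
    a2: float  # latitudinal cyclic pitch (elevator)
    a3: float  # tail rotor collective pitch (rudder)
    a4: float  # main rotor collective pitch
```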
2.3 Generalized tasks
3 The 2008 generalized helicopter hovering task
- \(wind_u \in [-5, 5]\), the wind velocity in m/s along the x-axis, and
- \(wind_v \in [-5, 5]\), the wind velocity in m/s along the y-axis.
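A generalized instance of the 2008 task is thus fixed by drawing these two wind components. Below is a minimal sampling sketch, assuming a uniform distribution over the stated ranges (the distribution itself is an assumption, not stated above).

```python
import random

def sample_2008_wind() -> dict:
    """Draw one 2008-style task instance: a constant wind in the x/y plane."""
    return {
        "wind_u": random.uniform(-5.0, 5.0),  # m/s along the x-axis
        "wind_v": random.uniform(-5.0, 5.0),  # m/s along the y-axis
    }
```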
3.1 Evolving helicopter policies
Topology | r | σ | \(r_\mathcal{G}\) | \(\sigma_\mathcal{G}\) |
---|---|---|---|---|
SLP | −496.22 | 25.00 | −2.508e6 | 2.345e5 |
MLP | −132.60 | 2.17 | −2001.89 | 46.43 |
3.2 Learning helicopter models
Method | t | \(r_m\) | \(r_a\) | σ |
---|---|---|---|---|
EC-MER | 562.94 | −1.55e4 | −1.184e6 | 3.268e6 |
EC-MENO | 611.10 | −223.19 | −4988.82 | 6722.97 |
LR | 2.05 | −142.25 | −974.24 | 305.68 |
3.3 Model-free approach
3.4 Model-based approach
4 The 2009 generalized helicopter hovering task
- \(amp \in [-5, 5]\), maximum velocity,
- \(freq \in [0, 20\pi]\), cycles per second,
- \(phase \in [0, 2\pi]\), fraction of the wave period, and
- \(center \in [0, 5]\), center amplitude of the sine wave.
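The sketch below shows one plausible reading of these parameters, assuming the wind at time \(t\) follows \(center + amp \cdot \sin(freq \cdot t + phase)\); both this functional form and the uniform sampling are assumptions made for illustration, not taken from the original text.

```python
import math
import random

def sample_2009_wind_params() -> dict:
    """Draw the four parameters of a sinusoidal wind profile (uniform draws assumed)."""
    return {
        "amp": random.uniform(-5.0, 5.0),           # maximum velocity
        "freq": random.uniform(0.0, 20 * math.pi),  # cycles per second
        "phase": random.uniform(0.0, 2 * math.pi),  # fraction of the wave period
        "center": random.uniform(0.0, 5.0),         # center amplitude of the sine wave
    }

def wind_at(t: float, p: dict) -> float:
    """Assumed wind model: a sine wave around `center` with the sampled parameters."""
    return p["center"] + p["amp"] * math.sin(p["freq"] * t + p["phase"])
```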
4.1 Hybrid approach
4.2 Adding safeguards
4.3 Competition and post-competition analysis
5 The fully generalized helicopter hovering task
Parameter | Value | Parameter | Value | Parameter | Value |
---|---|---|---|---|---|
\(C_u\) | −0.18 | \(D_u\) | 0.00 | \(w_u\) | 0.1941 |
\(C_v\) | −0.43 | \(D_v\) | −0.54 | \(w_v\) | 0.2975 |
\(C_w\) | −0.49 | \(D_w\) | −42.15 | \(w_w\) | 0.6058 |
\(C_p\) | −12.78 | \(D_p\) | 33.04 | \(w_p\) | 0.1508 |
\(C_q\) | −10.12 | \(D_q\) | −33.32 | \(w_q\) | 0.2492 |
\(C_r\) | −8.16 | \(D_r\) | 70.54 | \(w_r\) | 0.0734 |
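The coefficients above parameterize the simulated helicopter's dynamics. The sketch below shows one common way such parameters can enter a simplified body-frame model, in which each velocity or angular-rate derivative combines a damping term \(C\), a control term \(D\), and per-axis noise scaled by \(w\); this specific structure is an assumption for illustration only, not the competition simulator's exact equations.

```python
import random

def step_rates(rates: dict, controls: dict, coeffs: dict, dt: float) -> dict:
    """One Euler step of an assumed linear model:
    d(rate)/dt ≈ C * rate + D * control + noise(std = w), per axis."""
    nxt = {}
    for axis in ("u", "v", "w", "p", "q", "r"):
        c = coeffs[f"C_{axis}"]
        d = coeffs[f"D_{axis}"]
        noise = random.gauss(0.0, coeffs[f"w_{axis}"])
        deriv = c * rates[axis] + d * controls.get(axis, 0.0) + noise
        nxt[axis] = rates[axis] + dt * deriv
    return nxt
```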
5.1 Fixed resampling approach
5.2 Selection races approach
1: \(\mathbb{S} = \emptyset\) // selected individuals
2: \(\mathbb{D} = \emptyset\) // discarded individuals
3: \(\mathbb{U} = \{x_i \mid i = 1, \dots, \lambda\}\) // undecided individuals
4: \(t \leftarrow 1\) // current iteration
5: for all \(x_i \in \mathbb{U}\) do
6:   \(X_{i, t} \leftarrow evaluate(x_i)\) // initial evaluation
7:   \(LB_i \leftarrow 0, UB_i \leftarrow 0\) // initial lower and upper bounds
8: end for
9: while \(t < t_{limit} \wedge |\mathbb{S}| < \mu\) do
10:   \(t \leftarrow t + 1\)
11:   // reevaluate undecided policies
12:   for all \(x_i \in \mathbb{U}\) do
13:     \(X_{i, t} \leftarrow evaluate(x_i)\)
14:     \(\hat{X}_i \leftarrow \frac{1}{t} \sum_{j=1}^t X_{i,j}\)
15:     // update \(LB_i\) and \(UB_i\) using Bayesian confidence bounds \(c_{i,t}\)
16:     \(LB_i \leftarrow \hat{X}_i - c_{i, t}, UB_i \leftarrow \hat{X}_i + c_{i, t}\)
17:   end for
18:   for all \(x_i \in \mathbb{U}\) do
19:     if \(|\{x_j \in \mathbb{U} \mid LB_i > UB_j\}| \geq \lambda - \mu - |\mathbb{D}|\) then
20:       \(\mathbb{S} \leftarrow \mathbb{S} \cup \{x_i\}\) // select
21:       \(\mathbb{U} \leftarrow \mathbb{U} \setminus \{x_i\}\)
22:     else if \(|\{x_j \in \mathbb{U} \mid UB_i < LB_j\}| \geq \mu - |\mathbb{S}|\) then
23:       \(\mathbb{D} \leftarrow \mathbb{D} \cup \{x_i\}\) // discard
24:       \(\mathbb{U} \leftarrow \mathbb{U} \setminus \{x_i\}\)
25:     end if
26:   end for
27: end while
28: // update \(t_{limit}\) depending on \(|\mathbb{S}|\)
29: if \(|\mathbb{S}| = \mu\) then
30:   \(t_{limit} = \max(t_{min}, \frac{1}{\alpha} t_{limit})\)
31: else
32:   \(t_{limit} = \min(t_{max}, \alpha\, t_{limit})\)
33: end if
34: // select best undecided policies if \(\mathbb{S}\) is not full
35: while \(|\mathbb{S}| < \mu\) do
36:   \(x_i \leftarrow \arg\max_{x_j \in \mathbb{U}} \hat{X}_j\)
37:   \(\mathbb{S} \leftarrow \mathbb{S} \cup \{x_i\}\)
38:   \(\mathbb{U} \leftarrow \mathbb{U} \setminus \{x_i\}\)
39: end while
40: return \(\mathbb{S}\)
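The selection race can also be sketched in code. The version below is an illustrative reimplementation, not the authors' code: the `evaluate` callback, the use of a normal-approximation confidence half-width in place of the Bayesian bounds \(c_{i,t}\), and all names are assumptions. The adaptive \(t_{limit}\) update of lines 28–33, which shrinks the budget when the race fills \(\mathbb{S}\) early and grows it otherwise, is omitted here for brevity.

```python
import math
import statistics
from typing import Callable, Dict, List, Sequence

def selection_race(population: Sequence,
                   evaluate: Callable[[object], float],
                   mu: int,
                   t_limit: int,
                   z: float = 1.96) -> List:
    """Illustrative racing selection: keep re-evaluating noisy policies until
    mu of them are provably in the top mu, or the budget t_limit runs out."""
    lam = len(population)
    selected, discarded = [], []
    undecided = list(population)
    samples: Dict[int, List[float]] = {id(x): [evaluate(x)] for x in undecided}
    lb = {id(x): -math.inf for x in undecided}
    ub = {id(x): math.inf for x in undecided}
    t = 1
    while t < t_limit and len(selected) < mu:
        t += 1
        # re-evaluate every undecided policy and refresh its confidence interval
        for x in undecided:
            samples[id(x)].append(evaluate(x))
            xs = samples[id(x)]
            mean = statistics.fmean(xs)
            half = z * statistics.stdev(xs) / math.sqrt(len(xs))
            lb[id(x)], ub[id(x)] = mean - half, mean + half
        # select policies that provably beat enough rivals; discard provable losers
        for x in list(undecided):
            beats = sum(1 for y in undecided if lb[id(x)] > ub[id(y)])
            beaten = sum(1 for y in undecided if ub[id(x)] < lb[id(y)])
            if beats >= lam - mu - len(discarded):
                selected.append(x)
                undecided.remove(x)
            elif beaten >= mu - len(selected):
                discarded.append(x)
                undecided.remove(x)
    # if the race timed out, fill the remaining slots by sample-mean performance
    undecided.sort(key=lambda x: statistics.fmean(samples[id(x)]), reverse=True)
    while len(selected) < mu and undecided:
        selected.append(undecided.pop(0))
    return selected
```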