1 Introduction
1.1 Related work
1.1.1 Conventional beam training techniques
1.1.2 Machine learning-based beam training techniques
1.2 Motivations and contributions
- A novel DRL-based beam training algorithm is proposed, in which the DNN learns from historical beam measurements to switch between different beam training techniques so as to maximise the expected long-term reward. A flexible reward model is proposed to control the balance between performance and beam training overhead, so that the DRL model can be trained to meet the power or data rate requirements of different applications.
- An EE/SE maximisation beam training strategy is proposed, which can maximise the EE or SE for data transmission by controlling the number of activated RF chains. The EE/SE maximisation strategy is included in the DRL-based beam training algorithm.
2 System model and performance metrics
2.1 Blockage model
2.2 Signal model
2.3 Performance metrics
- DRL-EE: The DNN is trained to select the best beam training method to maximise the long-term expected reward, where the reward function is a weighted sum of the EE in bit/Joule and the beam training overhead.
- DRL-SE: The DNN is trained to select the best beam training method to maximise the long-term expected reward, where the reward function is a weighted sum of the SE in bit/s/Hz and the beam training overhead.
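Both variants share the weighted-sum form just described. The paper's exact reward expression is given in Sect. 3.1.6; the sketch below is a minimal illustration of the idea, where the weighting factor \(\alpha\in[0,1]\) and the two normalisation constants are placeholders rather than values from the paper.

```python
def weighted_sum_reward(metric, overhead, alpha, metric_max, overhead_max):
    """Weighted-sum reward trading performance against beam training cost.

    metric       : achieved EE (bit/Joule) for DRL-EE or SE (bit/s/Hz) for DRL-SE
    overhead     : number of beam measurements spent at this time step
    alpha        : weight in [0, 1]; larger values favour performance
    metric_max   : normalisation constant for the performance term (assumed)
    overhead_max : normalisation constant for the overhead term (assumed)
    """
    performance_term = metric / metric_max   # normalised EE or SE
    cost_term = overhead / overhead_max      # normalised beam training overhead
    return alpha * performance_term - (1.0 - alpha) * cost_term
```

Larger \(\alpha\) rewards raw performance and tolerates more beam measurements, which is consistent with the trends reported in Sect. 4.2.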
2.3.1 Spectral efficiency (SE)
2.3.2 Energy efficiency (EE)
Parameters | Values |
---|---|
Common power \(P_{\textrm{common}}\) | 10 W |
Power per RF chain \(P_{\textrm{RF}}\) | 100 mW |
Power per transmit or receive antenna \(P_{\textrm{TX}}\) or \(P_{\textrm{RX}}\) | 100 mW |
Power per phase shifter \(P_{\textrm{PS}}\) | 100 mW |
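To illustrate how these figures enter the EE computation, the sketch below assumes a fully-connected hybrid beamforming architecture with one phase shifter per RF-chain/antenna pair at both ends; the paper's exact power model is defined in Sect. 2.3.2, so this mapping is an assumption.

```python
def total_power_consumption(n_rf, n_tx=64, n_rx=16,
                            p_common=10.0, p_rf=0.1,
                            p_tx=0.1, p_rx=0.1, p_ps=0.1):
    """Total power (W) consumed by the transceiver for a given number of RF chains.

    Defaults take the power figures from the table above and the 8-by-8 /
    4-by-4 URAs of Sect. 4; the fully-connected phase-shifter network
    (N_RF * (N_TX + N_RX) shifters) is an assumed architecture.
    """
    p_antennas = n_tx * p_tx + n_rx * p_rx    # transmit and receive antennas
    p_chains = n_rf * p_rf                    # active RF chains
    p_shifters = n_rf * (n_tx + n_rx) * p_ps  # phase-shifter network
    return p_common + p_antennas + p_chains + p_shifters
```

With the table's values and, for example, \(N_{\textrm{RF}}=2\), this gives \(10+6.4+1.6+0.2+16=34.2\) W, the same order of magnitude as the totals reported in Sect. 4.3; the EE in bit/Joule then follows as \(B\cdot \text{SE}/P_{\text{total}}\).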
3 Methods
3.1 Deep reinforcement learning-based beam training algorithm
3.1.1 DRL-based beam training framework
3.1.2 RL learning framework
3.1.3 Environment
3.1.4 State
- The EE or SE values \({\textbf{u}}_t\in {\mathbb {R}}^{T+1}\), depending on which performance metric is considered. The EE or SE can reflect the joint impact of the channel condition, the number of RF chains used and the selected beam training method. For EE, the vector \({\textbf{u}}_t\) is given by \({\textbf{u}}_t=[E_{t-T},E_{t-T+1},\ldots ,E_{t-1},{\bar{E}}_t]^{\text {T}}\), where the first T entries are the EE achieved at the past T time steps, and the last entry \({\bar{E}}_t\) is the EE tested via the pre-assessment at the current time step t. For SE, the vector \({\textbf{u}}_t\) is given by \({\textbf{u}}_t=[R_{t-T},R_{t-T+1},\ldots ,R_{t-1},{\bar{R}}_t]^{\text {T}}\), which contains the corresponding \((T+1)\) SE values.
- The numbers of RF chains \({\textbf{n}}_t\in {\mathbb {R}}^{T+1}\), which provide additional information on the resulting EE or SE values in the vector \({\textbf{u}}_t\). The vector \({\textbf{n}}_t\) is given by \({\textbf{n}}_t=[N^{\star }_{\textrm{RF},{t-T}},N^{\star }_{\textrm{RF},{t-T+1}},\ldots ,N^{\star }_{\textrm{RF}, {t-1}},{\bar{N}}^{\star }_{\textrm{RF},t}]^{\text {T}}\), where \(N^{\star }_{\textrm{RF},t}\) equals either \(N^{\textrm{EE},\star }_{\textrm{RF}}\) or \(N^{\textrm{SE},\star }_{\textrm{RF}}\), depending on which mode is active. The last element is always set to \({\bar{N}}^{\star }_{\textrm{RF},t}=x\), the number of tracked beam pairs used to perform the pre-assessment at the current time step t.
- The indices of the selected beam training methods \({\textbf{a}}_t\in {\mathbb {R}}^{T+1}\), i.e. the indices of the selected actions, which associate each chosen beam training method with its achieved EE or SE value in the vector \({\textbf{u}}_t\). The vector \({\textbf{a}}_t\) is given by \({{\textbf{a}}_t}=[a_{t-T},a_{t-T+1},\ldots ,a_{t-1},{\bar{a}}_t]^{\text {T}}\), where \({\bar{a}}_t\) is a constant representing the operation of performing the pre-assessment at time t. The actions and their indices are introduced in the next subsection.
- The spacings between adjacent beam training steps \({\textbf{d}}_t\in {\mathbb {R}}^{T+1}\), which capture the spatial correlation between channels at different locations. The vector \({\textbf{d}}_t\) is given by \({{\textbf{d}}_t}=[d_{t-T},d_{t-T+1},\ldots ,d_{t-1},d_{t}]^{\text {T}}\), where \(d_{t}\) is the distance in metres from the sample taken at time t to the previous one taken at time \((t-1)\). In practice, acquiring \(d_{t}\) requires the UE's GPS; alternatively, this feature can be replaced with the temporal sampling intervals between adjacent beam training steps. The vector \({\textbf{d}}_t\) can also be viewed as implementing the beam training algorithm at uniform time intervals with the UE moving at a varying speed over time. A sketch of assembling the four feature vectors into the DNN input follows this list.
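This is a minimal sketch assuming the state fed to the DNN is the plain concatenation of \({\textbf{u}}_t\), \({\textbf{n}}_t\), \({\textbf{a}}_t\) and \({\textbf{d}}_t\); any normalisation applied before the DNN is omitted.

```python
import numpy as np

def build_state(ee_or_se_hist, n_rf_hist, action_hist, spacing_hist):
    """Concatenate the four (T+1)-dimensional features into one state vector.

    Each argument is a length-(T+1) sequence whose first T entries describe
    the past T beam training steps and whose last entry describes the
    pre-assessment at the current time step t, as defined above.
    """
    u_t = np.asarray(ee_or_se_hist, dtype=float)  # EE or SE values
    n_t = np.asarray(n_rf_hist, dtype=float)      # numbers of RF chains
    a_t = np.asarray(action_hist, dtype=float)    # beam training method indices
    d_t = np.asarray(spacing_hist, dtype=float)   # spacings between steps (m)
    assert u_t.shape == n_t.shape == a_t.shape == d_t.shape
    return np.concatenate([u_t, n_t, a_t, d_t])   # state of length 4(T+1)
```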
3.1.5 Action
3.1.6 Reward
3.1.7 DRL-based adaptive beam training algorithm
3.2 Energy efficiency and spectral efficiency maximisation beam training strategy
3.2.1 Local beam training method
3.2.2 EE/SE maximisation beam training strategy
Beam pair no. n | BS beam index p | UE beam index q | Average channel gain \({\overline{\nu }}_{n}\) (dB) |
---|---|---|---|
1 | 18 | 4 | 14.39 |
2 | 6 | 1 | 9.66 |
3 | 17 | 7 | 8.43 |
4 | 62 | 5 | 4.83 |
... | ... | ... | ... |
15 | 16 | 10 | − 36.34 |
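The tracked pairs above are ordered by average channel gain, which is what the EE/SE maximisation strategy exploits when deciding how many RF chains (and hence beam pairs) to activate. The sketch below is a rough illustration of the EE variant under strong assumptions: per-stream capacities at unit noise power serve as a crude SE proxy, whereas the paper's exact EE and SE expressions are defined in Sect. 2.3, and `power_fn` stands for a power model such as the `total_power_consumption` sketch above.

```python
import numpy as np

def best_num_rf_chains(sorted_gains_db, bandwidth_hz, power_fn):
    """Pick the number of activated RF chains that maximises EE.

    sorted_gains_db : per-beam-pair average channel gains (dB), best first,
                      as in the tracked-pair table above
    power_fn        : callable mapping N_RF to total power consumption (W)
    """
    best_n, best_ee = 1, -np.inf
    for n_rf in range(1, len(sorted_gains_db) + 1):
        gains = 10.0 ** (np.asarray(sorted_gains_db[:n_rf]) / 10.0)
        # Crude SE proxy: sum of per-stream capacities at unit noise power.
        se = float(np.sum(np.log2(1.0 + gains)))
        ee = bandwidth_hz * se / power_fn(n_rf)  # bit/Joule
        if ee > best_ee:
            best_n, best_ee = n_rf, ee
    return best_n, best_ee
```

Replacing the EE objective with the SE term itself gives the SE-maximisation variant.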
3.2.3 Discussions on beam training overhead
Beam training methods | Maximum no. of beam measurements (BM) |
---|---|
A | x |
B | \((9\times 9)x+x\) |
C | \((25\times 9)x+x\) |
D | \(N_{\textrm{TX}}N_{\textrm{RX}}+x\) |
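The overhead table translates directly into a small helper. Reading the \((9\times 9)\) and \((25\times 9)\) factors as per-tracked-pair local search grid sizes is an interpretation, and the defaults \(N_{\textrm{TX}}=64\) and \(N_{\textrm{RX}}=16\) follow from the 8-by-8 and 4-by-4 URAs used in Sect. 4.

```python
def max_beam_measurements(method, x, n_tx=64, n_rx=16):
    """Maximum number of beam measurements (BM) for each training method.

    Mirrors the table above: each method spends x measurements on the
    pre-assessment of the tracked beam pairs, and methods B-D add a local
    or exhaustive search on top.
    """
    overhead = {
        "A": x,                 # pre-assessment of the x tracked pairs only
        "B": (9 * 9) * x + x,   # narrow local search around each tracked pair
        "C": (25 * 9) * x + x,  # wider local search around each tracked pair
        "D": n_tx * n_rx + x,   # exhaustive search over both codebooks
    }
    return overhead[method]
```

With \(x=3\) tracked beam pairs, methods A to D cost at most 3, 246, 678 and 1027 measurements, respectively.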
3.3 Alternative beam training strategies for benchmarking
3.3.1 Multi-armed bandit-based beam training strategy
3.3.2 Maximum reward beam training strategy
3.3.3 Randomised beam training strategy
4 Results and discussions
Parameters | Values |
---|---|
BS antenna array | 8-by-8 URA |
UE antenna array | 4-by-4 URA |
Carrier frequency | 30 GHz |
No. of subcarriers N | 64 |
No. of NLOS clusters L | 20 |
No. of tracked beam pairs x | 3 |
Channel bandwidth B | 100 MHz |
Noise variance \(\sigma _{\textrm{n}}^{2}\) | 0.1 |
DRL learning rate \(\eta\) | 0.001 |
DRL discount factor \(\gamma\) | 0.9 |
No. of DRL training episodes | 500–2000 |
MAB step-size \(\eta '\) | 0.5 |
DRL/MAB exploration factor \(\epsilon\) | 0.1 |
DRL mini-batch size | 64 |
Length of DRL experience buffer \({\mathcal {D}}\) | 100,000 |
UE velocity v | 1 m/s |
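The DRL hyperparameters listed above map onto a standard DQN-style trainer. The sketch below shows the two places where \(\epsilon\) and \(\gamma\) would enter; a flat \(\epsilon\)-greedy policy without decay and a one-step temporal-difference target are assumptions, since the paper's exact update rule is given in Sect. 3.1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hyperparameters from the simulation table above
EPSILON = 0.1  # DRL/MAB exploration factor
GAMMA = 0.9    # DRL discount factor

def epsilon_greedy(q_values, epsilon=EPSILON):
    """Select a beam training method: explore with probability epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # random action (exploration)
    return int(np.argmax(q_values))              # greedy action (exploitation)

def td_target(reward, next_q_values, done, gamma=GAMMA):
    """One-step TD target used to regress the DNN's Q-value estimates."""
    return reward + (0.0 if done else gamma * float(np.max(next_q_values)))
```

The mini-batch size of 64 and the 100,000-sample experience buffer \({\mathcal {D}}\) would then feed batches of (state, action, reward, next state) tuples into this update at learning rate \(\eta=0.001\).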
4.1 Preliminary experiments
4.2 Effects of reward function
4.2.1 DRL-EE
\(\alpha\) | 0.1 | 0.3 | 0.5 | 0.7 | 0.9 |
---|---|---|---|---|---|
(a) DRL-EE | |||||
DRL | 4 | 176 | 207 | 442 | 961 |
MAB | 44 | 174 | 254 | 338 | 517 |
MR | 1734 | 1735 | 1738 | 1745 | 1749 |
RAND | 433 | ||||
A | 0 | ||||
B | 206 | ||||
C | 483 | ||||
D | 1051 | ||||
(b) DRL-SE | |||||
DRL | 3 | 106 | 217 | 393 | 1080 |
MAB | 45 | 144 | 241 | 317 | 495 |
MR | 1818 | 1804 | 1803 | 1811 | 1819 |
RAND | 451 | ||||
A | 0 | ||||
B | 227 | ||||
C | 506 | ||||
D | 1077 |
4.2.2 DRL-SE
\(\alpha\) | 0.1 | 0.3 | 0.5 | 0.7 | 0.9 |
---|---|---|---|---|---|
(a) DRL-EE | |||||
DRL | 3.0 | 6.5 | 6.6 | 6.8 | 7.0 |
MAB | 3.3 | 5.9 | 6.7 | 6.9 | 7.1 |
MR | 3.0 | 6.0 | 6.8 | 7.1 | 7.3 |
RAND | 6.1 | ||||
A | 3.0 | ||||
B | 6.8 | ||||
C | 7.3 | ||||
D | 7.4 | ||||
(b) DRL-SE | |||||
DRL | 3.0 | 6.4 | 9.3 | 9.8 | 10.5 |
MAB | 3.5 | 6.5 | 8.8 | 9.5 | 9.9 |
MR | 3.0 | 6.6 | 8.9 | 9.4 | 10.1 |
RAND | 8.3 | ||||
A | 3.0 | ||||
B | 9.4 | ||||
C | 10.1 | ||||
D | 10.5 |
4.3 Impact of state vector size
All DRL models obtain higher rewards with fewer beam measurements (BM) than MAB, MR and RAND. Among the DRL models, DRL-7 yields the lowest reward, which implies that including more past measurements in the state vector may complicate the learning process by providing redundant information to the DNN. DRL-5 achieves nearly the highest reward with the lowest beam training overhead, so we consider a DNN trained with \(T=5\) past measurements the optimal model for DRL-EE. The number of past measurements T has little effect on the final EE, but it does affect the number of beam measurements required.

SNR (dB) | 0 | 5 | 10 | 15 | 20 |
---|---|---|---|---|---|
Total power consumption (W) | 33.3 | 34.8 | 37.5 | 43.9 | 63.0 |
SE for DRL-3 (bit/s/Hz) | 11.3 | 17.8 | 26.0 | 36.1 | 49.0 |
SE for DRL-5 (bit/s/Hz) | 10.8 | 17.6 | 26.0 | 36.3 | 48.5 |
SE for DRL-7 (bit/s/Hz) | 11.0 | 17.6 | 26.0 | 36.1 | 49.4 |
Beam training algorithms | Time (s) | Reward |
---|---|---|
DRL-3 | 7.12 | − 0.17 |
DRL-5 | 7.11 | − 0.17 |
DRL-7 | 7.22 | − 0.18 |
MAB | 7.07 | − 0.20 |
RAND | 7.39 | − 0.60 |
MR | 46.56 | − 1.43 |
4.4 Effects of random blockages
SNR (dB) | 0 | | 10 | | 20 | |
---|---|---|---|---|---|---|
\(\gamma\) | 0.1 | 0.9 | 0.1 | 0.9 | 0.1 | 0.9 |
\(\varrho =0.1\) | 70 | 49 | 214 | 224 | 233 | 267 |
\(\varrho =0.3\) | 49 | 101 | 204 | 210 | 230 | 278 |
\(\varrho =0.5\) | 18 | 203 | 183 | 236 | 226 | 283 |