Based on the above ideas, we propose the NOMA power allocation algorithm for the secure game. Considering the inherent relation between S and E, the work mode of E determines the state of S; similarly, S can influence the environment of E by adjusting α. In the first step of the algorithm, we initialize the Q-table, denoted by Q(m, α), which stores the reward values of state-action pairs. In each experiment, E first selects a work mode randomly, which leads S to adopt an instantaneous αt accordingly, where αt denotes the power allocation factor at time t. It should be emphasized that S does not always select the power allocation factor by searching the Q-table: to avoid converging to a local optimum, S chooses a value of α according to an ε-greedy policy. Specifically, S searches for the current optimal α in the Q-table with probability 1−ε, and otherwise chooses a value in the range [αmin, αmax] uniformly at random. In this time slot, S transmits a signal with power αtPS and obtains the system data rate from the environment as the reward value RS. Then, E changes its work mode from mt to mt+1 according to the system data rate. By combining the instantaneous reward RS with the accumulated experience in the Q-table, the update of the Q-table presented by the authors in [33] can be formulated as:
$$ Q(m_{t}, \alpha_{t}) \leftarrow Q(m_{t}, \alpha_{t}) + \zeta\left[R_{S} + \rho \max_{\alpha} Q(m_{t+1}, \alpha) - Q(m_{t}, \alpha_{t})\right], $$
(21)
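The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the number of work modes, the discretized grid of α values, and the stand-in reward and mode-transition functions inside `step` are all illustrative assumptions, since the true system data rate and the behavior of E depend on the channel model defined earlier in the paper.

```python
import random

random.seed(0)  # fixed seed so the toy run is reproducible

N_MODES = 3                                     # work modes of E (assumed)
ALPHA_GRID = [0.05 * k for k in range(1, 20)]   # discretized [α_min, α_max] (assumed)
ZETA, RHO, EPS = 0.5, 0.9, 0.1                  # learning rate ζ, discount ρ, exploration ε

# Q-table Q(m, α): one row per mode of E, one column per candidate α.
Q = [[0.0 for _ in ALPHA_GRID] for _ in range(N_MODES)]

def choose_alpha(m):
    """ε-greedy: exploit the current best α with probability 1 − ε,
    otherwise explore a uniformly random α in [α_min, α_max]."""
    if random.random() < EPS:
        return random.randrange(len(ALPHA_GRID))          # explore
    row = Q[m]
    return max(range(len(row)), key=row.__getitem__)      # exploit

def step(m, a):
    """Stand-in environment: returns the reward R_S (system data rate)
    and the next mode m_{t+1}; a real system would measure both."""
    r_s = 1.0 - abs(ALPHA_GRID[a] - 0.3)    # toy rate peaking at α = 0.3
    m_next = random.randrange(N_MODES)      # toy mode transition of E
    return r_s, m_next

m = random.randrange(N_MODES)               # E starts in a random mode
for t in range(5000):
    a = choose_alpha(m)
    r_s, m_next = step(m, a)
    # Update rule (21): Q(m_t, α_t) ← Q(m_t, α_t) + ζ[R_S + ρ max_α Q(m_{t+1}, α) − Q(m_t, α_t)]
    Q[m][a] += ZETA * (r_s + RHO * max(Q[m_next]) - Q[m][a])
    m = m_next

best = ALPHA_GRID[max(range(len(Q[0])), key=Q[0].__getitem__)]
print(f"learned α for mode 0: {best:.2f}")
```

With the toy reward peaking at α = 0.3, the greedy choice for each mode drifts toward that region as the Q-values converge; the per-mode argmax plays the role of the "current optimal α" that S looks up in the Q-table.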