In reinforcement learning, the agent acts on the environment, the environment changes state in response to the action and feeds a reward back to the agent, and the agent then selects its next action according to the current state. Reinforcement learning methods are divided into value-based and policy-based approaches. Value-based algorithms compute a score for each action and select the action with the highest score, as in the comparison methods of this paper, Q-learning, DQN, and Dueling DQN. In this paper we instead use a policy-based method, PPO, which selects actions according to a policy and uses a value function to evaluate them. In the discrete action space of this paper, an action is selected from the observed information, and its selection probability is then adjusted directly according to the reward returned by the network: the probability of a good action being selected is increased, while that of a bad action is decreased at the next choice. Training alternates between sampling data through interaction with the environment and optimizing a surrogate objective function by stochastic gradient ascent.
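As a rough sketch of this interaction loop, the snippet below samples discrete actions from a softmax policy and stores the resulting transitions for a later optimization step. The `env` interface and the linear policy are hypothetical placeholders, not the network or environment used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def act(W, state):
    """Sample a discrete action from the softmax policy pi(a|s) = softmax(W @ s)."""
    logits = W @ state
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs), probs

def collect_rollout(env, W, steps=128):
    """Interact with the environment and store (state, action, reward) transitions."""
    trajectory, state = [], env.reset()
    for _ in range(steps):
        action, _ = act(W, np.asarray(state))
        next_state, reward, done = env.step(action)   # assumed env interface
        trajectory.append((state, action, reward))
        state = env.reset() if done else next_state
    return trajectory
```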
(1) Clipped Surrogate Objective: PPO is a policy gradient algorithm. Policy gradient methods are very sensitive to the step size, yet an appropriate step size is difficult to choose: if the difference between the old and the new policy becomes too large during training, learning is hampered. To address these problems, PPO adopts an Actor-Critic structure in which the critic network estimates the state value V, enabling single-step updates, and uses importance sampling so that data collected under a fixed behaviour policy different from the current policy can be sampled once and reused several times. The update gradient of the policy gradient method is written as
$$\begin{aligned} \widehat{g}=\widehat{E}_{(s_{t},a_{t})\sim \pi _{\theta }}\left[ \nabla _{\theta }\log \pi _{\theta }(a_{t}|s_{t})\widehat{A}^{\theta }_{t} \right] \end{aligned}$$
(21)
where \(\pi _{\theta }\) is the policy with parameter \(\theta\), \(\widehat{A}^{\theta }_{t}\) is the estimate of the advantage function at time t, indicating the advantage obtained by taking action \(a_{t}\) in state \(s_{t}\), and \(\widehat{E}_{(s_{t},a_{t})\sim \pi _{\theta }}\) denotes the empirical average over the samples. To perform the gradient update, the algorithm usually constructs an objective function whose gradient equals the policy gradient and then performs gradient ascent on this objective, where the objective function is given as
$$\begin{aligned} L^{PG}(\theta )=\widehat{E}_{(s_{t},a_{t})\sim \pi _{\theta }}\left[ \log \pi _{\theta }(a_{t}|s_{t})\widehat{A}^{\theta }_{t} \right] \end{aligned}$$
(22)
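As a concrete illustration of Eqs. (21) and (22), the following NumPy sketch estimates the policy gradient for a softmax policy that is linear in the state. The policy parameterization, the batch of states, and the advantage estimates are illustrative assumptions rather than the model used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, state_dim, batch = 4, 6, 32

W = rng.normal(scale=0.1, size=(n_actions, state_dim))  # policy parameters theta
states = rng.normal(size=(batch, state_dim))             # sampled states s_t
advantages = rng.normal(size=batch)                      # placeholder estimates A_t

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

probs = softmax(states @ W.T)                            # pi_theta(.|s_t)
actions = np.array([rng.choice(n_actions, p=p) for p in probs])

# Objective L^PG = mean( log pi_theta(a_t|s_t) * A_t )   (Eq. 22)
log_pi = np.log(probs[np.arange(batch), actions])
L_pg = np.mean(log_pi * advantages)

# Gradient estimate g = mean( grad_theta log pi * A_t )  (Eq. 21);
# for a softmax-linear policy, grad_logits log pi(a|s) = onehot(a) - probs.
onehot = np.eye(n_actions)[actions]
grad_logits = (onehot - probs) * advantages[:, None]
g_hat = grad_logits.T @ states / batch                   # same shape as W

W += 0.01 * g_hat                                        # one step of gradient ascent
```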
The importance sampling used here can be understood as follows: suppose we need to sample x from a distribution p and then compute the expected value of f(x). If we are unable to sample data from the distribution p, we can instead sample x from another distribution q, as follows.
$$\begin{aligned} \int f(x)p(x)dx=\int f(x)\frac{p(x)}{q(x)}q(x)dx=E_{x\sim q}\left[ f(x)\frac{p(x)}{q(x)}\right] \end{aligned}$$
(23)
where p(x)/q(x) is the importance weight that corrects for the difference between the two distributions; it transforms the expectation over x sampled from p into an expectation over x sampled from q. A small numerical check of Eq. (23) is sketched below; the policy gradient and the objective function after importance sampling are then given in Eqs. (24) and (25), respectively.
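The following check illustrates the identity in Eq. (23) for a pair of discrete distributions; the distributions p and q and the function f are arbitrary choices made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(5)
p = np.array([0.05, 0.10, 0.20, 0.30, 0.35])   # target distribution p
q = np.array([0.20, 0.20, 0.20, 0.20, 0.20])   # behaviour distribution q
f = lambda v: v ** 2

exact = np.sum(f(x) * p)                        # E_{x~p}[f(x)] computed directly

samples = rng.choice(x, size=100_000, p=q)      # we can only sample from q
weights = p[samples] / q[samples]               # importance weights p(x)/q(x)
estimate = np.mean(f(samples) * weights)        # E_{x~q}[f(x) p(x)/q(x)]

print(exact, estimate)                          # the two values should be close
```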
$$\begin{aligned} \widehat{g}^{'}=\widehat{E}_{(s_{t},a_{t})\sim \pi _{\theta ^{'}}}\left[ \frac{\pi _{\theta }(a_{t}|s_{t})}{\pi _{\theta ^{'} }(a_{t}|s_{t})}\nabla _{\theta }\log \pi _{\theta }(a_{t}|s_{t})\widehat{A}^{\theta ^{'}}_{t} \right] \end{aligned}$$
(24)
$$\begin{aligned} L^{'PG}(\theta )=\widehat{E}_{(s_{t},a_{t})\sim \pi _{\theta ^{'}}}\left[ \frac{\pi _{\theta }(a_{t}|s_{t})}{\pi _{\theta ^{'} }(a_{t}|s_{t})} \widehat{A}^{\theta ^{'}}_{t}\right] \end{aligned}$$
(25)
where \(\theta\) is the target policy and \(\theta ^{'}\) is the behaviour policy. The samples used to estimate the policy gradient are drawn from the behaviour policy \(\theta ^{'}\), and \(\widehat{A}^{\theta ^{'}}_{t}\) is also estimated under the behaviour policy \(\theta ^{'}\). If this surrogate objective is optimized without restriction, the ratio \(\frac{\pi _{\theta }(a_{t}|s_{t})}{\pi _{\theta ^{'} }(a_{t}|s_{t})}\) may deviate too far from 1. We adopt the Clipped Surrogate Objective to avoid this situation, as follows.
$$\begin{aligned} L(s,a,\theta _{k},\theta )= & {} \min \left( \frac{\pi _{\theta }(a|s)}{\pi _{\theta _{k} }(a|s)} A^{\pi _{\theta _{k}}}(s,a),\right. \nonumber \\&\left. \text {clip}\left( \frac{\pi _{\theta }(a|s)}{\pi _{\theta _{k} }(a|s)},1-\varepsilon ,1+\varepsilon \right) A^{\pi _{\theta _{k}}}(s,a)\right) \end{aligned}$$
(26)
where \(\varepsilon\) is a parameter that limits the degree of deviation between the new policy and the old policy. When \(A^{\pi _{\theta _{k}}}(s,a)>0\), i.e. the advantage function is positive, increasing the probability \(\pi _{\theta }(a|s)\) of the corresponding action increases the value of the objective function; but once the ratio \(\frac{\pi _{\theta }(a|s)}{\pi _{\theta _{k}}(a|s)}\) exceeds \(1 + \varepsilon\), the objective is truncated to \((1+\varepsilon )A^{\pi _{\theta _{k}}}(s,a)\), and further increasing \(\pi _{\theta }(a|s)\) brings no additional benefit. Similarly, when \(A^{\pi _{\theta _{k}}}(s,a)<0\), the probability \(\pi _{\theta }(a|s)\) of the corresponding action should be reduced to increase the value of the objective function; but once the ratio falls below \(1 - \varepsilon\), the objective is truncated to \((1-\varepsilon )A^{\pi _{\theta _{k}}}(s,a)\), and \(\pi _{\theta }(a|s)\) is not pushed down any further. This prevents \(\frac{\pi _{\theta }(a|s)}{\pi _{\theta _{k}}(a|s)}\) from deviating too far from 1.
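The sketch below evaluates both the plain surrogate of Eq. (25) and the clipped surrogate of Eq. (26) on placeholder ratio and advantage arrays; the value ε = 0.2 is a commonly used setting and is not a value taken from this paper.

```python
import numpy as np

def clipped_surrogate(ratio, adv, eps=0.2):
    """min( r*A, clip(r, 1-eps, 1+eps)*A ), averaged over the batch (Eq. 26)."""
    unclipped = ratio * adv                                  # plain surrogate of Eq. (25)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.mean(np.minimum(unclipped, clipped))

# For A > 0 the objective stops rewarding increases in the ratio once r > 1 + eps;
# for A < 0 it stops rewarding decreases in the ratio once r < 1 - eps.
rng = np.random.default_rng(0)
ratio = np.exp(rng.normal(scale=0.5, size=64))   # placeholder pi_theta / pi_theta_k ratios
adv = rng.normal(size=64)                        # placeholder advantage estimates
print(clipped_surrogate(ratio, adv))
```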
(2) Generalized Advantage Estimation (GAE): PPO uses the GAE method to estimate the advantage function, which is given by the following equation
$$\begin{aligned} A^{GAE(\delta ,\lambda )}_{t}=\sum \limits _{l=0}^{\infty }(\delta \lambda )^{l}[R_{t+l}+\delta V(s_{t+l+1})-V(s_{t+l})] \end{aligned}$$
(27)
where V is the state value function and \(\delta\) and \(\lambda\) are the weighting factors. GAE takes into account the advantage at each subsequent time step after state \(s_{t}\) and forms a weighted sum in which the weights decay with the distance from the current state.
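A short routine for computing Eq. (27) over a finite rollout is sketched below. The rollout length, rewards, and value estimates are placeholders; the discount weight written as \(\delta\) in Eq. (27) is passed as `gamma`, and `values` is assumed to contain one extra bootstrap entry for the state after the last reward.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """A_t = sum_l (gamma*lam)^l [R_{t+l} + gamma*V(s_{t+l+1}) - V(s_{t+l})]  (Eq. 27)."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):                      # backward recursion over the rollout
        td_error = rewards[t] + gamma * values[t + 1] - values[t]
        running = td_error + gamma * lam * running    # accumulate the decaying weighted sum
        adv[t] = running
    return adv

# Example: a 5-step rollout with placeholder rewards and value estimates.
rewards = np.array([1.0, 0.0, 0.5, 0.0, 1.0])
values = np.array([0.2, 0.3, 0.1, 0.4, 0.2, 0.0])     # length T + 1 (bootstrap value last)
print(gae_advantages(rewards, values))
```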