We will now introduce a reinforcement learning (RL) approach based on stochastic approximation to estimate \(\nu (G)\). The underlying idea relies on the notion of tours and regeneration introduced in [2‐4]. We will compare the mean squared error of the new estimator with that of MH-MCMC and RDS, and see how the stability of the sample paths can be controlled.
Estimator
Let
\(V_0 \subset V\) with
\(|V_0|\ll |V|\). We assume that the nodes inside
\(V_0\) are known beforehand. Consider a simple random walk
\(\{X_n\}\) on
G with transition probabilities
\(p(j|i) = 1/d(i)\) if
\((i,j) \in E\) and zero otherwise. A random walk tour is defined as the sequence of nodes visited by the random walk between successive returns to the set \(V_0\). Let \(\tau _n\) denote the successive times at which the walk visits \(V_0\), and let \(\xi _k := \tau _k-\tau _{k-1}\). We denote the nodes visited in the
kth tour as
\(X_1^{(k)}, X_2^{(k)},\ldots , X_{\xi _k}^{(k)}\). Note that considering
\(V_0\) helps to tackle a disconnected graph with RW theory and makes tours shorter. Moreover, the tours are independent of each other and admit a massively parallel implementation. The estimators derived below and later in the “Ratio with tours estimator (RT-estimator)” section exploit the independence of the tours and the result that the expected sum of a function over the nodes visited in a tour is proportional to \(\sum _{u \in V} g(u)\) [4, Lemma 3].
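To make the tour construction concrete, the following Python sketch simulates one SRW tour and records the tour's function sum and length, the two quantities whose expectations define \(h(i)\) and \(T_i\) below. The adjacency-list representation `adj`, the helper names `sample_tour` and `tour_statistics`, and the function handle `g` are illustrative assumptions, not part of the original formulation.

```python
import random
from typing import Callable, Dict, List, Set

def sample_tour(adj: Dict[int, List[int]], start: int, v0: Set[int]) -> List[int]:
    """Simulate one SRW tour: starting from `start` (a node of V_0), walk until
    the first return to V_0 and return the visited nodes X_1, ..., X_{xi}."""
    path, node = [], start
    while True:
        node = random.choice(adj[node])   # SRW step: uniform over neighbors
        path.append(node)
        if node in v0:                    # the tour ends on re-entering V_0
            return path

def tour_statistics(adj: Dict[int, List[int]], start: int, v0: Set[int],
                    g: Callable[[int], float]):
    """One sample each of the tour sum of g and the tour length xi."""
    tour = sample_tour(adj, start, v0)
    return sum(g(u) for u in tour), len(tour)
```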
Define
\(Y_n := X_{\tau _n}\). Then
\(\{(Y_n, \tau _n)\}\) is a semi-Markov process on
\(V_0\) [
18, Chapter 5]. In particular,
\(\{Y_n\}\) is a Markov chain on
\(V_0\) with transition probability matrix (say)
\([p_Y(j|i)]\). We have
\(\xi _1 := \min \{n > 0: X_n \in V_0\}\). For a prescribed
\(g: V \mapsto \mathbb {R}\), define
$$\begin{aligned} T_i&: =\mathbb {E} _i[\xi _1], \\ h(i)&: =\mathbb {E} _i\left[ \sum _{m=1}^{\xi _1}g(X_m)\right] , \ i \in V_0. \end{aligned}$$
In the framework of an average-cost Markov decision problem (MDP), the Poisson equation for the semi-Markov process \(\{(Y_n, \tau _n)\}\) is [18, Chapter 7]
$$\begin{aligned} \mathcal {V}(i) = h(i) - \beta T_i + \sum _{j \in V_0}p_Y(j|i)\mathcal {V}(j), \ i \in V_0, \end{aligned}$$
(6)
which is to be solved for the pair
\((\mathcal {V},\beta )\), where
\(\mathcal {V}: V_0 \mapsto \mathbb {R}\) and
\(\beta \in \mathbb {R}\). Under mild conditions, (6) has a solution \((\mathcal {V}^*, \beta ^*)\). The optimal \(\beta ^*\) is the average expected cost, i.e., the stationary average of g, \( \mathbb {E} _{\pi }[g(X_1)]\) [18, Theorem 7.6]. In the following, we provide numerical ways to solve (
6). This could be achieved using classical MDP methods such as relative value iteration; instead, we look for solutions from reinforcement learning, in which knowledge of the transition probability
\([p_Y(j|i)]\) is not needed. Stochastic approximation provides a simple and easily tunable solution as follows. The relative value iteration algorithm to solve (
6) is
$$\begin{aligned} \mathcal {V}_{n+1}(i) = h(i) - \mathcal {V}_n(i_0) T_i + \sum _j p_Y(j|i)\mathcal {V}_n(j). \end{aligned}$$
(7)
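When the embedded transition matrix \([p_Y(j|i)]\) and the quantities h and T are available, (7) can be run directly; the minimal sketch below illustrates this baseline. The function name and fixed iteration count are illustrative choices, and in the crawling setting these inputs are precisely what is unknown, which motivates the RL scheme that follows.

```python
import numpy as np

def relative_value_iteration(p_Y: np.ndarray, h: np.ndarray, T: np.ndarray,
                             i0: int = 0, n_iter: int = 1000):
    """Iterate (7): V <- h - V[i0]*T + p_Y @ V over the nodes of V_0.
    At convergence, V[i0] approximates beta*, the stationary average of g."""
    V = np.zeros(len(h))
    for _ in range(n_iter):
        V = h - V[i0] * T + p_Y @ V
    return V, V[i0]
```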
We can implement this using stochastic approximation as follows: let \(\{Z_n, n \ge 1 \}\) be samples drawn from the stationary distribution of the underlying RW conditioned on being in \(V_0\). Construct the nth tour, for \(n\ge 1\), by starting an SRW \(X^{(n)}_i, i \ge 0,\) with
\(X^{(n)}_0 = Z_n\) and observing its sample path until it returns to
\(V_0\).
A learning algorithm for (
6) along the lines of [
19] then is, for
\(i \in V_0\),
$$\begin{aligned} \mathcal {V}_{n+1}(i) & = {} \mathcal {V}_n(i) \ + a(n)\mathbb {I}\{Z_n = i\} \times \nonumber \\&\quad \left[ \left( \sum _{m=1}^{\xi _n}g(X^{(n)}_m)\right) - \mathcal {V}_n(i_0)\xi _n + \mathcal {V}_n(X^{(n)}_{\xi _n}) - \mathcal {V}_n(i)\right] , \end{aligned}$$
(8)
where
\(a(n) > 0\) are stepsizes satisfying
\(\sum _n a(n) = \infty , \ \sum _n a(n)^2 < \infty \). (One good choice is
\(a(n) = 1/\lceil \frac{n}{N}\rceil \) for
\(N = 50\) or 100.) Here
\(\mathbb {I}\{A\}\) denotes the indicator function of the set
A. Also,
\(i_0\) is a prescribed element of
\(V_0\). One can use other normalizations in place of
\(\mathcal {V}_n(i_0)\), such as
\(\frac{1}{|V_0|}\sum _j \mathcal {V}_n(j)\) or
\(\min _i \mathcal {V}_n(i)\); see, e.g., [20]. This normalizing term (
\(\mathcal {V}_n(i_0)\) in (
8)) converges to
\(\beta ^*\),
\( \mathbb {E} _{\pi }[g(X_1)]\), as
n increases to
\(\infty .\)
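A minimal sketch of the update (8) is given below, reusing the hypothetical `sample_tour` helper from the earlier sketch and the stepsize \(a(n) = 1/\lceil n/N\rceil \). Drawing \(Z_n\) by sampling nodes of \(V_0\) with probability proportional to their degree (the SRW stationary distribution restricted to \(V_0\)) is an assumption made here for concreteness.

```python
import math
import random

def rl_estimate(adj, v0_list, g, n_tours=10_000, N=50):
    """Run n_tours iterations of (8); returns V[i0], the running estimate of
    beta* = E_pi[g(X_1)] under the SRW."""
    V = {i: 0.0 for i in v0_list}
    i0 = v0_list[0]                              # prescribed reference node i_0
    weights = [len(adj[i]) for i in v0_list]     # SRW stationary law restricted to V_0
    v0_set = set(v0_list)
    for n in range(1, n_tours + 1):
        z = random.choices(v0_list, weights=weights)[0]   # Z_n
        tour = sample_tour(adj, z, v0_set)                # X^(n)_1, ..., X^(n)_xi
        xi, g_sum = len(tour), sum(g(u) for u in tour)
        a_n = 1.0 / math.ceil(n / N)                      # stepsize a(n)
        # incremental correction of (8), applied at the tour's start node z
        V[z] += a_n * (g_sum - V[i0] * xi + V[tour[-1]] - V[z])
    return V[i0]
```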
Taking expectations on both sides of (
8), we obtain a deterministic iteration that can be viewed as an incremental version of the relative value iteration (
7) with suitably scaled stepsize
\(\tilde{a}(n) := \frac{a(n)}{|V|}\). This can be analyzed in the same way as the stochastic approximation scheme, with the same o.d.e. limit and therefore the same (deterministic) asymptotic limit. This establishes the asymptotic unbiasedness of the RL estimator.
The normalizing term used in (8) (\(\mathcal {V}_n(i_0)\), \(\frac{1}{|V_0|}\sum _j \mathcal {V}_n(j)\) or \(\min _i \mathcal {V}_n(i)\)), with the Metropolis–Hastings random walk as the underlying random walk, forms our estimator \(\widehat{\nu }_{\text {RL}}(G)\) in the RL-based approach. The iteration in (8) is the stochastic approximation analogue of (7): it replaces the conditional expectation w.r.t. the transition probabilities with an actual sample and then makes an incremental correction based on that sample, with a slowly decreasing stepsize that ensures averaging. The latter is a standard aspect of stochastic approximation theory. The smaller the stepsize, the smaller the fluctuations, but the slower the convergence; thus, there is a trade-off between the two.
RL methods can be thought of as a cross between a pure deterministic iteration such as the relative value iteration above and pure MCMC, trading off variance against per-iterate computation. The gain is significant if the number of neighbors of a node is much smaller than the number of nodes, because we are essentially replacing averaging over the latter by averaging over the neighbors. The \(\mathcal {V}\)-dependent terms can be thought of as control variates to reduce variance.
Extension of the RL technique to the uniform stationary average case
The stochastic approximation iteration in (
8) converges to \(\beta ^*\), which is \( \mathbb {E} _\pi [g(X_1)],\) where \(\pi \) is the stationary distribution of the underlying walk. To make it converge to \({\nu }(G)\), we can use the Metropolis–Hastings random walk with a uniform target distribution. However, we can avoid the use of the Metropolis–Hastings algorithm by the following modification, motivated by importance sampling, which achieves convergence to
\({\nu }(G)\) with the simple random walk (SRW). We propose
$$\begin{aligned} {\mathcal {V}_{n+1}(i)}&= \mathcal {V}_n(i) \ + a(n)\mathbb {I}\{Z_n = i\} \times \Gamma _{\xi _n}^{(n)} \times \\&\quad \left[ \left( \sum _{m=1}^{\xi _n}g(X^{(n)}_m)\right) - \mathcal {V}_n(i_0)\xi _n + \mathcal {V}_n(X^{(n)}_{\xi _n}) - \mathcal {V}_n(i)\right] , \end{aligned}$$
where
$$\begin{aligned} \Gamma _{m}^{(n)} = \prod _{k=1}^m \left( \frac{p(X_k^{(n)} | X_{k-1}^{(n)})}{q(X_k^{(n)} | X_{k-1}^{(n)})} \right) . \end{aligned}$$
Here
\(q(\cdot |\cdot )\) is the transition probability of the random walk with which we simulate the algorithm and
\(p(\cdot |\cdot )\) corresponds to the transition probability of the random walk with respect to which we need the stationary average. The transition probability
p can belong to any random walk having uniform stationary distribution such that
\(q(\cdot |\cdot ) >0 \) whenever
\(p(\cdot |\cdot ) >0\). One example is to take p as the transition probability of the Metropolis–Hastings algorithm with the uniform distribution as its target and q as the transition probability of a lazy version of the simple random walk, i.e., with transition probability matrix \((\varvec{I}+\varvec{P}_{\scriptscriptstyle \text {SRW}})/2\). In comparison with basic Metropolis–Hastings sampling, such importance sampling avoids the API requests needed to probe the degrees of all the neighboring nodes; it requires only one such request, viz., for the degree of the sampled node. Note that the self-loops wherein the chain immediately re-visits a node are not wasted transitions, because each such transition amounts to a re-application of the map to the earlier iterate, which is distinct from a single application.
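As an illustration of this choice of p and q, the sketch below computes the likelihood ratio \(\Gamma \) along a path simulated with the lazy SRW. It assumes a simple graph stored as adjacency lists, and the helper names `mh_uniform_prob`, `lazy_srw_prob` and `importance_weight` are hypothetical.

```python
def mh_uniform_prob(adj, i, j):
    """p(j|i): Metropolis-Hastings walk with uniform target and SRW proposal.
    Assumes j is either i itself or a neighbor of i in a simple graph."""
    if i == j:  # residual self-loop mass left by rejected proposals
        return 1.0 - sum(mh_uniform_prob(adj, i, k) for k in adj[i])
    d_i, d_j = len(adj[i]), len(adj[j])
    return (1.0 / d_i) * min(1.0, d_i / d_j)

def lazy_srw_prob(adj, i, j):
    """q(j|i): lazy simple random walk with transition matrix (I + P_SRW)/2."""
    return 0.5 if i == j else 0.5 / len(adj[i])

def importance_weight(adj, walk):
    """Gamma for a path X_0, X_1, ..., X_xi simulated under q: the product of
    the ratios p(X_k | X_{k-1}) / q(X_k | X_{k-1})."""
    w = 1.0
    for a, b in zip(walk[:-1], walk[1:]):
        w *= mh_uniform_prob(adj, a, b) / lazy_srw_prob(adj, a, b)
    return w
```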
The reinforcement learning scheme introduced above is the semi-Markov version of the scheme proposed in [
20] and [
21].