Formally,
$$\begin{aligned} || b_1^T U ||_2^2&=\sum _{\ell \in G'}\left( \sum _{i \in G_m} A_{ji} A_{\ell i}\right) ^2 \quad . \end{aligned}$$
(19)
Fixing one
\(\ell \in G{'}\), the inner squared sum becomes
$$\begin{aligned} \left( \sum _{i \in G_m} A_{ji} A_{\ell i}\right) ^2 & = {} \sum _{i \in G_m} \left( A_{ji} A_{\ell i}\right) ^{2} +2 \sum _{i,h \in G_m, i\ne h} A_{ji} A_{\ell i}A_{jh} A_{\ell h} \end{aligned}$$
(20)
$$\begin{aligned}& = {} \sum _{i \in G_m}A_{ji} A_{\ell i}+2 \sum _{i,h \in G_m, i\ne h} A_{ji} A_{\ell i}A_{jh} A_{\ell h} , \end{aligned}$$
(21)
where we used the assumption of a binary adjacency matrix, i.e.
\(A_{ij}^{2}=A_{ij}\). Taking the expected value
$$\begin{aligned} || b_1^T U ||_2^2\approx & {} \mathbb {E}\left[ \sum _{\ell \ne j \in G{'}} \left( \sum _{i \in G_m}A_{ji} A_{\ell i}+2 \sum _{i,h \in G_m, i\ne h} A_{ji} A_{\ell i}A_{jh} A_{\ell h}\right) \right] \end{aligned}$$
(22)
$$\begin{aligned}& = {} \sum _{\ell \ne j \in G{'}} \sum _{i \in G_m} \mathbb {E}\left[ A_{ji} A_{\ell i}\right] +2 \, \sum _{\ell \ne j \in G{'}} \,\sum _{i,h \in G_m, i\ne h} \mathbb {E}\left[ A_{ji} A_{\ell i}A_{jh} A_{\ell h}\right] \quad . \end{aligned}$$
(23)
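As a sanity check of this expectation step, the following sketch (in Python; all sizes and probabilities are illustrative assumptions, not values from the paper) compares a Monte Carlo estimate of \(\sum _{\ell \ne j}\big (\sum _{i} A_{ji} A_{\ell i}\big )^{2}\) with the two terms of Eq. (23), computed from the SBM edge probabilities via the edge-independence factorization introduced next.

```python
# Monte Carlo sketch of the expectation step in Eqs. (22)-(23).
# All sizes and probabilities below are illustrative assumptions,
# not values taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
N, K, M = 300, 3, 30                        # nodes, colors, sample size
p_in, p_out = 0.10, 0.02                    # SBM edge probabilities
colors = np.repeat(np.arange(K), N // K)
P = np.where(colors[:, None] == colors[None, :], p_in, p_out)
np.fill_diagonal(P, 0.0)                    # no self-loops

G_m = np.arange(M)                          # sampled nodes G_m
G_p = np.arange(M, N)                       # G' = non-sampled nodes
j = G_p[0]                                  # one candidate j in G'
ell = G_p[G_p != j]                         # the indices l != j

# Empirical average of sum_l (sum_i A_ji A_li)^2 over graph realizations
emp, T = 0.0, 2000
for _ in range(T):
    A = np.triu(rng.random((N, N)) < P, 1).astype(float)
    A = A + A.T                             # symmetric binary adjacency
    emp += ((A[np.ix_(ell, G_m)] @ A[j, G_m]) ** 2).sum() / T

# The two terms of Eq. (23), with E[A_uv] = P_uv and distinct edges independent
first = (P[j, G_m] * P[np.ix_(ell, G_m)]).sum()
V = P[np.ix_(ell, G_m)] * P[j, G_m]         # V[l, i] = P_li * P_ji
second = (V.sum(axis=1) ** 2 - (V ** 2).sum(axis=1)).sum()  # ordered i != h
print(emp, first + second)                  # the two numbers should be close
```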
In an SBM, a common assumption is that of conditional independence: edges are conditionally independent given the parameters. Thus
\(\sum _{\ell \ne j \in G{'}} \sum _{i \in G_m}\,\mathbb {E}\left[ A_{ji}\, A_{\ell i}\right] =\sum _{i \in G_{m}}\,\left( \mathbb {E}\left[ A_{ji}\right] \,\sum _{\ell \ne j \in G{'}}\mathbb {E}\left[ A_{\ell i}\right] \right)\).
$$\begin{aligned} || b_1^T U ||_2^2\approx & {} \sum _{i \in G_{m}}\,\left( \mathbb {E}\left[ A_{ji}\right] \,\sum _{\ell \in G{'}}\mathbb {E}\left[ A_{\ell i}\right] \right) +2 \sum _{i,h \in G_{m}, h\ne i} \mathbb {E}\left[ A_{ji}\right] \mathbb {E}\left[ A_{jh}\right] \, \sum _{\ell \ne j \in G{'}}\mathbb {E}\left[ A_{\ell i}\right] \,\mathbb {E}\left[ A_{\ell h}\right] \quad . \end{aligned}$$
(24)
The expression above depends on whether the colors of \(i,\,h,\,\ell ,\,j\) are equal to r or not. For instance, if
\(q_{i}=q_{h}=r\), then
$$\begin{aligned} \sum _{\ell \ne j \in G{'}}\mathbb {E}\left[ A_{\ell i}\right] \,\mathbb {E}\left[ A_{\ell h}\right] =\sum _{\ell \ne j \in G{'}}\mathbb {E}^{2} \left[ A_{\ell i}\right] \,=N{'}_{r} p_{in}^{2}+N{'}_{nr} p_{out}^{2}\quad . \end{aligned}$$
(25)
All other color combinations scale similarly. Completing the calculation for the full term \(\sum _{i,h \in G_{m}, h\ne i} \mathbb {E}\left[ A_{ji}\right] \mathbb {E}\left[ A_{jh}\right] \, \sum _{\ell \ne j \in G{'}}\mathbb {E}\left[ A_{\ell i}\right] \,\mathbb {E}\left[ A_{\ell h}\right]\) gives extra factors of order \(\left( N_{r}^{m}\right) ^{2}p_{in}^{2}\). Because \(p_{in}\sim \frac{1}{N}\) in the sparse regime, the total weight of this term is \(\left( N_{r}^{m}\right) ^{2}p_{in}^{2} \times N_{r}{'}\, p_{in}^{2} \sim \frac{1}{N^{2}}\), which is negligible compared to the term \(\sum _{i \in G_{m}}\,\left( \mathbb {E}\left[ A_{ji}\right] \,\sum _{\ell \in G{'}}\mathbb {E}\left[ A_{\ell i}\right] \right)\). The quantity \(\mathbb {E}\left[ A_{ji}\right]\) can be estimated using the same reasoning as for \(b_{1}\) above. The most important terms are thus:
$$\begin{aligned} ||b_{1}^{T}U||_{2}^{2}(r)\approx & {} N^{m}_{r}p_{in} \left( N{'}_{r}p_{in}+(N-M-N{'}_{r})p_{out}\right) \end{aligned}$$
(26)
$$\begin{aligned}&+N^{m}_{nr}p_{out}\left( \frac{N{'}_{nr}}{K-1}\, p_{in}+\left( N-M- \frac{N{'}_{nr}}{K-1}\right) \, p_{out}\right) \end{aligned}$$
(27)
$$\begin{aligned} ||b_{1}^{T}U||_{2}^{2}(nr)\approx & {} \frac{N^{m}_{nr}}{K-1}\,p_{in} \left( \frac{N{'}_{nr}}{K-1}\,p_{in}+\left( N-M-\frac{N{'}_{nr}}{K-1}\right) \,p_{out}\right) \end{aligned}$$
(28)
$$\begin{aligned}&+N^{m}_{nr}\,\left( \frac{K-2}{K-1}\right) \,p_{out}\left( \frac{N{'}_{nr}}{K-1}\, p_{in}+\left( N-M- \frac{N{'}_{nr}}{K-1}\right) \, p_{out}\right) \end{aligned}$$
(29)
$$\begin{aligned}&+N^{m}_{r}\, p_{out}\left( N{'}_{r}\, p_{in}+\left( N-M-N{'}_{r}\right) \, p_{out}\right) \quad . \end{aligned}$$
(30)
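As a numerical illustration of these closed forms, the following sketch (illustrative parameter values, not from the paper) evaluates the dominant term of Eq. (23) directly from the expected adjacency matrix, for one candidate of each color class, and compares it with Eqs. (26)-(30).

```python
# Numerical sketch comparing the dominant term of Eq. (23) with the closed
# forms of Eqs. (26)-(30); all parameter values are illustrative assumptions.
import numpy as np

N, K, M = 1200, 3, 120                     # nodes, colors, sample size
p_in, p_out = 0.04, 0.01                   # SBM edge probabilities
delta = 0.15                               # bias towards color r = 0

colors = np.repeat(np.arange(K), N // K)
P = np.where(colors[:, None] == colors[None, :], p_in, p_out)

# Biased sample: N_r^m = M/K + M*delta nodes of color r, the rest spread
# evenly over the other K-1 colors
Nr_m = int(M / K + M * delta)
per = (M - Nr_m) // (K - 1)
G_m = np.concatenate([np.where(colors == 0)[0][:Nr_m]]
                     + [np.where(colors == c)[0][:per] for c in range(1, K)])
G_p = np.setdiff1d(np.arange(N), G_m)      # G' = non-sampled nodes

def dominant(j):
    """First term of Eq. (23): sum_i E[A_ji] * sum_{l != j} E[A_li]."""
    ell = G_p[G_p != j]
    return (P[j, G_m] * P[np.ix_(ell, G_m)].sum(axis=0)).sum()

Np_r = (colors[G_p] == 0).sum()            # N'_r
Np_nr = len(G_p) - Np_r                    # N'_nr
Nnr_m = len(G_m) - Nr_m                    # N_nr^m
col_r = Np_r * p_in + (N - M - Np_r) * p_out
col_nr = Np_nr / (K - 1) * p_in + (N - M - Np_nr / (K - 1)) * p_out

approx_r = Nr_m * p_in * col_r + Nnr_m * p_out * col_nr     # Eqs. (26)-(27)
approx_nr = (Nnr_m / (K - 1) * p_in * col_nr                # Eqs. (28)-(30)
             + Nnr_m * (K - 2) / (K - 1) * p_out * col_nr
             + Nr_m * p_out * col_r)

j_r = G_p[colors[G_p] == 0][0]             # candidate of color r
j_nr = G_p[colors[G_p] == 1][0]            # candidate of another color
print(dominant(j_r), approx_r)             # each pair should be close
print(dominant(j_nr), approx_nr)
```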
The difference between these two terms is
$$\begin{aligned}&||b_{1}^{T}U||_{2}^{2}(r) -||b_{1}^{T}U||_{2}^{2}(nr) \\&\quad \approx N_{r}^{m}N{'}_{r}\left( p_{in}-p_{out}\right) ^{2}-\frac{N_{nr}^{m}}{K-1}\frac{N{'}_{nr}}{K-1}\, \left( p_{in}-p_{out}\right) ^{2} \\&\qquad +(N-M) \, N^{m}_{r}\, p_{in}^{2} \,(1-\epsilon )\,\epsilon -(N-M) \frac{N^{m}_{nr}}{K-1}p_{in}^{2} \,(1-\epsilon )\,\epsilon \\&\quad = p_{in}^{2}\, (1-\epsilon )\, \left[ \left( N_{r}^{m}N{'}_{r}-\frac{N_{nr}^{m}}{K-1}\frac{N{'}_{nr}}{K-1}\right) \, (1-\epsilon ) +(N-M)\, \epsilon \, \left( N_{r}^{m}-\frac{N_{nr}^{m}}{K-1}\right) \right] . \end{aligned}$$
(31)
The first term inside the bracket is:
$$\begin{aligned} N_{r}^{m}N{'}_{r}-\frac{N_{nr}^{m}}{K-1}\frac{N{'}_{nr}}{K-1}& = {} \left( \frac{M}{K}+M\delta \right) \, \left( \frac{N-M}{K}-M\delta \right) \\&- \left( \frac{M}{K}-\frac{M\delta }{K-1}\right) \,\left( \frac{N-M}{K}+\frac{M\delta }{K-1}\right) \end{aligned}$$
(32)
$$\begin{aligned}& = {} (N-M) \, \frac{M\,\delta }{K-1} - \frac{M^{2}\,\delta }{K-1}-(M\, \delta )^{2}\frac{K-2}{(K-1)^{2}}\, K \end{aligned}$$
(33)
$$\begin{aligned}& = {} \frac{M\, \delta }{K-1}\,\left[ N-2M- M\, \delta \, \frac{K-2}{K-1}\, K\right] \quad . \end{aligned}$$
(34)
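The simplification in Eqs. (32)-(34) can be checked symbolically; a minimal sketch using sympy:

```python
# Symbolic verification of the simplification in Eqs. (32)-(34)
import sympy as sp

N, M, K, delta = sp.symbols('N M K delta', positive=True)
lhs = ((M/K + M*delta) * ((N - M)/K - M*delta)
       - (M/K - M*delta/(K - 1)) * ((N - M)/K + M*delta/(K - 1)))
rhs = M*delta/(K - 1) * (N - 2*M - M*delta*(K - 2)/(K - 1)*K)
print(sp.simplify(lhs - rhs))   # -> 0, confirming Eq. (34)
```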
The last term inside the brackets is:
$$\begin{aligned} (N-M) \, \epsilon \, \left( N_{r}^{m}-\frac{N_{nr}^{m}}{K-1}\right) = (N-M) \, \epsilon \, M \delta \frac{K}{K-1} >0 \quad . \end{aligned}$$
(35)
From this we notice that, in the case \(M \ll N\), i.e. when the initial sample size is very small compared to the system size, we obtain:
$$\begin{aligned} ||b_{1}^{T}U||_{2}^{2}(r) -||b_{1}^{T}U||_{2}^{2}(nr) \approx p_{in}^{2}\, (1-\epsilon )\, \frac{M\, \delta }{K-1}\, \left[ 1+\epsilon (K-1)\right] \, N>0 \quad . \end{aligned}$$
(36)
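A quick numeric spot check (with illustrative values) of this \(M \ll N\) approximation against the exact bracketed expression of Eq. (31):

```python
# Spot check of Eq. (36) against the exact difference of Eq. (31);
# parameter values are illustrative assumptions.
N, K, M, delta, eps, p_in = 1e6, 4, 100, 0.05, 0.3, 1e-4

Q = ((M/K + M*delta) * ((N - M)/K - M*delta)
     - (M/K - M*delta/(K - 1)) * ((N - M)/K + M*delta/(K - 1)))  # Eq. (32)
R = M * delta * K / (K - 1)                                      # Eq. (35)

exact = p_in**2 * (1 - eps) * ((1 - eps) * Q + (N - M) * eps * R)
approx = p_in**2 * (1 - eps) * M * delta / (K - 1) * (1 + eps * (K - 1)) * N
print(exact, approx)    # nearly equal when M << N, and both positive
```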
This means that the best-scoring \(j \in G{'}\), based on the TCEC criterion, has color r. The score difference with respect to candidates of colors other than r also increases with the bias \(\delta\), thus reinforcing this bias as more nodes of color r are added to the sample \(G_{m}\). Instead, if \(M=\alpha \,N\) with \(\alpha\) finite and \(0<\alpha <1\), and we also assume \(\epsilon \ll 1\), so that the dominant term is that of Eq. (32), then:
$$\begin{aligned} ||b_{1}^{T}U||_{2}^{2}(r) -||b_{1}^{T}U||_{2}^{2}(nr) \approx p_{in}^{2}\, (1-\epsilon )\,\frac{M\, \delta }{K-1}\,\left[ N-2M-M\delta \frac{K-2}{K-1}\,K\right] \quad . \end{aligned}$$
(37)
The term inside the brackets can become negative as \(M\) increases. The sampling thus proceeds as follows: at the beginning, when \(M\) is very small compared to \(N\), \(\delta\) increases; in turn, \(M\) also increases, as more nodes are added to the sample. However, at a certain point, as both \(M\) and \(\delta\) grow, the best candidate switches to a color different from r; this happens when the term \(\left[ N-2M-M\delta \frac{K-2}{K-1}\,K\right]\) becomes negative, and it happens earlier when \(K\) is larger, as the sketch below illustrates.
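Solving \(N-2M-M\delta \frac{K-2}{K-1}K = 0\) gives the crossover size \(M^{*}= N/\big (2+\delta \frac{K(K-2)}{K-1}\big )\), which decreases with \(K\). A small numeric sketch (with illustrative values for \(N\) and \(\delta\)):

```python
# Crossover sample size M* at which the bracket of Eq. (37) changes sign;
# N and delta below are illustrative assumptions.
def crossover(N, K, delta):
    """M* solving N - 2M - M*delta*K*(K-2)/(K-1) = 0."""
    return N / (2 + delta * K * (K - 2) / (K - 1))

N, delta = 10_000, 0.1
for K in (2, 3, 5, 10):
    print(f"K={K:2d}  M*={crossover(N, K, delta):7.1f}")
# Output decreases with K: the color switch happens earlier for larger K.
```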
Finally, an Erdős–Rényi case is obtained when \(\epsilon =1\), i.e. \(p_{in}=p_{out}=p\). In this case all the terms are zero, i.e. the TCEC score difference is zero and there is no preferred candidate based on color. With the generalized TCEC score, i.e. \(\alpha >0\), the node with the highest degree \(d_{in}^{G_{m}}(j)\) is chosen as the best candidate. Notice also that in the disassortative case, i.e. \(\epsilon >1\), a change of sign makes nodes of colors different from r score higher. In this case the sampling dynamics keeps jumping to nodes of different colors, hence the sample is more balanced.