Motivated by TGN, FPGN maintains a memory and records the most recent neighbors of each vertex, so that it can build the subgraph of the target vertices and perform follower prediction. However, TGN trains on the samples of each batch separately, and in follower prediction there is no guarantee that a given batch provides the prediction module with features for all vertices that need to be predicted. To solve this problem, we design the modules of FPGN on top of TGN: in each batch, FPGN only updates the memory and neighbors of each vertex, while the prediction module and back-propagation are run only after these updates are completed.
3.3.1 Memory update
To better learn the temporal graph information, FPGN mines vertex memories that carry temporal information. A visitor record triplet (u, v, t) means that visitor u arrives at location v at timestamp t with message \({\textbf {M}}_{(u,v)} = {\textbf {R}}(u) || {\textbf {R}}(v)\). The memory of u at time t is defined as \({\textbf {F}}_u(t)\), with \({\textbf {F}}_u(0) = \{0\}^{1 \times d}\), where d is the dimension of the memory. The memories of u and v are computed separately; for ease of description, we only describe the process of updating u’s memory.
To better explore the historical activity messages of the vertices, all activity timestamps \(T_u(t)\) of u earlier than time t are extracted and collected into a vector \({\textbf {T}}_u(t) \in \mathbb {R}^{1 \times |T_u(t)|}\). The memories of u at those times are also involved in the vertex memory update. FPGN first encodes the activity of u:
$$\begin{aligned} {\textbf {X}}_{t, u} = \cos ({\textbf {W}}_t \times ({\textbf {T}}_u(t) - t_u \cdot {\textbf {I}}_{(t,u)}) + {\textbf {b}}_t), \end{aligned}$$
(8)
where \(\cos (\cdot )\) is the cosine function; \({\textbf {W}}_t\) is a trainable weight matrix and \({\textbf {b}}_t\) is a trainable bias; \(t_u\) is the timestamp of the last activity of u and is initialized to 0; and \({\textbf {I}}_{(t,u)} \in \mathbb {R}^{1 \times |T_u(t)|}\) is an all-ones vector.
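For concreteness, the following PyTorch sketch shows one way to realize the time encoding of Eq. (8). It assumes that \({\textbf {W}}_t\) projects each time difference onto a d-dimensional encoding, as in common temporal-graph time encoders; the class and argument names are ours, not part of FPGN.

```python
import torch
import torch.nn as nn

class TimeEncoder(nn.Module):
    """Cosine time encoding of Eq. (8); class and argument names are illustrative."""

    def __init__(self, dim: int):
        super().__init__()
        # W_t and b_t are trainable; here they map each scalar time difference to `dim` features
        self.proj = nn.Linear(1, dim)

    def forward(self, past_times: torch.Tensor, t_u: float) -> torch.Tensor:
        # past_times: shape (|T_u(t)|,), all activity timestamps of u before t
        deltas = (past_times - t_u).unsqueeze(-1)   # T_u(t) - t_u * I_(t,u)
        return torch.cos(self.proj(deltas))         # X_{t,u}, one row per past activity
```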
To extract information about the past activities of u, we retrieve all neighbors of u before time t from the adjacency matrix set A. The set of neighbors of u at time t is defined as follows:
$$\begin{aligned} \mathcal {N}_u^t = \{(v', t') |\ {\textbf {A}}_{u,v'}^{t'}=1, 0 \le t' < t\}, \end{aligned}$$
(9)
where \({\textbf {A}}^{t}\) is the adjacency matrix at time t, and \({\textbf {A}}_{u, v}^{t} = 1\) if u and v are connected at time t. After obtaining these neighboring vertices, we can gather the messages between u and them, those neighbors’ memories at the corresponding times, and the past memories of u at all timestamps before t:
$$\begin{aligned} \begin{aligned}&{\textbf {M}}_{\mathcal {N}_u^t} = \bigcup _{v' ,\exists (v', t') \in \mathcal {N}_u^t} {\textbf {M}}_{(u, v')}, \\&{\textbf {X}}_{\mathcal {N}_u^t} = \bigcup _{v',\exists (v', t') \in \mathcal {N}_u^t} {\textbf {F}}_{v'}(t'), \\&{\textbf {X}}_u = \bigcup _{t', \exists (v', t') \in \mathcal {N}_u^t} {\textbf {F}}_{u}(t'), \end{aligned} \end{aligned}$$
(10)
then \({\textbf {X}}_u\), \({\textbf {X}}_{\mathcal {N}_u^t}\), \({\textbf {M}}_{\mathcal {N}_u^t}\), and \({\textbf {X}}_{t, u}\) are concatenated to obtain the initial memory of u:
$$\begin{aligned} {\textbf {X}}_{t, u}^{init} = {\textbf {X}}_u || {\textbf {X}}_{\mathcal {N}_u^t} || {\textbf {M}}_{\mathcal {N}_u^t} || {\textbf {X}}_{t, u}, \end{aligned}$$
(11)
where \({\textbf {X}}_{t, u}^{init}\) (also written \({\textbf {X}}_u^{init}(t)\) below) is the initial memory of u at timestamp t.
Early activity records should not have too much influence on the present, and each vertex memory already summarizes the activity before that moment. To make FPGN focus on recent activity information and to reduce its computational cost, we therefore only keep the latest \(\mathcal {K}\) sets of initial memory, \(\bar{{\textbf {X}}}_u^{init}(t)\), from \({\textbf {X}}_u^{init}(t)\).
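The sketch below illustrates how Eqs. (9)-(11) and the \(\mathcal {K}\)-truncation fit together. The data layouts (dictionaries keyed by timestamps and vertex pairs) and the per-event row construction are assumptions made for illustration; it reuses the TimeEncoder sketch above.

```python
import torch

def build_initial_memory(u, t, adj, messages, memories, time_encoder, K):
    """Sketch of Eqs. (9)-(11) and the K-truncation; the data layouts are assumptions.

    adj      : dict  t' -> set of (i, j) pairs with A^{t'}_{i,j} = 1
    messages : dict  (i, j) -> message vector M_(i,j)
    memories : dict  (i, t') -> memory vector F_i(t')
    """
    # Eq. (9): neighbors (v', t') of u connected strictly before time t, oldest first
    neigh = sorted(
        [(v2, t2) for t2, edges in adj.items() if t2 < t for (i, v2) in edges if i == u],
        key=lambda e: e[1],
    )
    if not neigh:
        return None

    # Eq. (8): encode u's past activity timestamps relative to its last activity t_u
    past_times = torch.tensor([t2 for _, t2 in neigh], dtype=torch.float32)
    x_time = time_encoder(past_times, past_times[-1].item())

    # Eqs. (10)-(11): per past event, concatenate u's memory, the neighbor's memory,
    # the message between them, and the time encoding
    rows = [
        torch.cat([memories[(u, t2)], memories[(v2, t2)], messages[(u, v2)], x_time[k]])
        for k, (v2, t2) in enumerate(neigh)
    ]

    # Keep only the latest K rows as the truncated initial memory
    return torch.stack(rows[-K:])
```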
LSTM is a widely used recurrent neural network architecture that is particularly suited to processing sequential data such as speech and text [35, 38], and many works handle temporal graphs with LSTM and achieve good performance [6, 50]. However, recent research has found that the newer Gated Recurrent Unit (GRU) [10] offers performance comparable to LSTM while being more computationally efficient [5, 51], making it an attractive alternative for processing temporal graph datasets. Therefore, in this work, we choose GRU to obtain the vertex memories.
To be more precise, at each time step a GRU unit uses the initial memories \(\bar{{\textbf {X}}}_u^{init}(t)\) and the most recent memory of u, denoted \({\textbf {F}}_{u}(t')\), to compute the reset gate \({\textbf {p}}_{t}\) and the update gate \({\textbf {q}}_{t}\) as follows:
$$\begin{aligned} \begin{aligned} {\textbf {p}}_{t}&= \sigma ({\textbf {W}}_{px}\bar{{\textbf {X}}}_u^{init}(t) + {\textbf {W}}_{pf}{\textbf {F}}_{u}(t') + {\textbf {b}}_{p}) \\ {\textbf {q}}_{t}&= \sigma ({\textbf {W}}_{qx}\bar{{\textbf {X}}}_u^{init}(t) + {\textbf {W}}_{qf}{\textbf {F}}_{u}(t') + {\textbf {b}}_{q}), \end{aligned} \end{aligned}$$
(12)
where \(\sigma \) is a non-linear activation function; \({\textbf {W}}_{px}\), \({\textbf {W}}_{pf}\), \({\textbf {W}}_{qx}\), and \({\textbf {W}}_{qf}\) are weight matrices; and \({\textbf {b}}_{p}\), \({\textbf {b}}_{q}\) are bias vectors. The reset gate controls how much of the previous memory should be ignored, while the update gate controls how much new information should be added to the memory. The candidate memory \(\widetilde{{\textbf {F}}}_{u}(t)\) is then computed as follows:
$$\begin{aligned} \widetilde{{\textbf {F}}}_{u}(t) = \sigma ({\textbf {W}}_{x}\bar{{\textbf {X}}}_u^{init}(t) + {\textbf {W}}_{f}({\textbf {p}}_{t} \odot {\textbf {F}}_{u}(t')) + {\textbf {b}}_f), \end{aligned}$$
(13)
where \(\odot \) denotes the element-wise product, \({\textbf {W}}_{x}\) and \({\textbf {W}}_{f}\) are weight matrices, and \({\textbf {b}}_f\) is a bias vector. Finally, the new memory \({\textbf {F}}_{u}(t)\) of u is calculated as a linear interpolation between the previous memory and the candidate memory, controlled by the update gate:
$$\begin{aligned} {\textbf {F}}_{u}(t) = (1-{\textbf {q}}_{t})\odot {\textbf {F}}_{u}(t') + {\textbf {q}}_{t}\odot \widetilde{{\textbf {F}}}_{u}(t). \end{aligned}$$
(14)
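A minimal GRU cell matching Eqs. (12)-(14) could look as follows. It is a sketch of the gating mechanism rather than the authors' exact layer, and it follows the paper in applying \(\sigma \) to the candidate memory (a standard GRU would use tanh there). How the \(\mathcal {K}\) rows of \(\bar{{\textbf {X}}}_u^{init}(t)\) are consumed is not fixed by this sketch; feeding them to the cell one per step while carrying \({\textbf {F}}_u\) forward is one natural option.

```python
import torch
import torch.nn as nn

class MemoryGRU(nn.Module):
    """GRU memory update of Eqs. (12)-(14): a sketch of the gating, not the authors' exact layer."""

    def __init__(self, in_dim: int, mem_dim: int):
        super().__init__()
        self.reset_gate = nn.Linear(in_dim + mem_dim, mem_dim)   # W_px, W_pf and b_p
        self.update_gate = nn.Linear(in_dim + mem_dim, mem_dim)  # W_qx, W_qf and b_q
        self.cand_x = nn.Linear(in_dim, mem_dim, bias=False)     # W_x
        self.cand_f = nn.Linear(mem_dim, mem_dim)                # W_f and b_f

    def forward(self, x_init: torch.Tensor, f_prev: torch.Tensor) -> torch.Tensor:
        z = torch.cat([x_init, f_prev], dim=-1)
        p_t = torch.sigmoid(self.reset_gate(z))                  # Eq. (12): reset gate
        q_t = torch.sigmoid(self.update_gate(z))                 # Eq. (12): update gate
        f_tilde = torch.sigmoid(self.cand_x(x_init) + self.cand_f(p_t * f_prev))  # Eq. (13)
        return (1.0 - q_t) * f_prev + q_t * f_tilde              # Eq. (14): new memory F_u(t)
```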
GRU’s mechanism allows it to process temporal graphs by selectively retaining or discarding information as time passes, which lets it store and analyze information over an extended period and makes it well suited to dynamic, evolving datasets. After all data have been processed with the above method, we obtain the final vertex memories:
$$\begin{aligned} {\textbf {F}} = \bigcup _{n \in U \cup V}{} {\textbf {F}}_{n}(\Gamma _{n}), \end{aligned}$$
(15)
where
\(\Gamma _{n}\) is the last timestamp of vertex
n.
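Collecting the final memories of Eq. (15) then amounts to keeping, for each vertex, its memory at the last activity timestamp \(\Gamma _{n}\); a small sketch under the same assumed layouts as above:

```python
def collect_final_memories(memories: dict, last_time: dict, vertices: set) -> dict:
    """Eq. (15): keep each vertex's memory at its last activity timestamp Gamma_n.

    `memories[(n, t)]` and `last_time[n]` follow the assumed layouts of the
    earlier sketches; `vertices` stands for U ∪ V.
    """
    return {n: memories[(n, last_time[n])] for n in vertices}
```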
3.3.2 Neighbors update
Each round of memory update for a vertex requires tracing the vertices that were previously connected to it. Therefore, after each round of memory update for a triplet (u, v, t), the edge involved in that round needs to be saved into the adjacency matrix \({\textbf {A}}^t\), that is, \({\textbf {A}}^t_{u,v} = 1\).
Additionally, after acquiring \({\textbf {F}}\), all the connection information should be used to improve follower prediction. We therefore merge the adjacency matrices of all time steps into a single matrix:
$$\begin{aligned} {\textbf {A}}_a = \max _{t=1}^{T} \{ {\textbf {A}}^{t} \} \end{aligned}$$
(16)