Introduction
- We introduce a deep RL framework for task-driven discovery of incomplete networks. This formulation allows us to train models of environment dynamics and reward offline.
- We show that, for a variety of complex learning scenarios, the ability to learn from closely related scenarios leads to substantial performance improvements over existing online discovery methods.
- We show that network embedding can play an important role in the convergence properties of the RL algorithm, by imposing structure on the network state space and prioritizing navigation over this space.
- Among a class of embedding algorithms, we identify personalized PageRank (PPR) as a suitable network embedding for the selective harvesting task (see the sketch after this list). Our combined approach of PPR embedding and offline planning achieves substantial reductions in training time.
- Leveraging several evaluation metrics, we delineate learning regimes where embedding alone ceases to be effective and planning is required.
- Our approach generalizes well to unseen real network topologies and new downstream tasks. Specifically, we show that policies learned on synthetically generated networks translate well to the detection of anomalous nodes in real-world networks.
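To make the PPR contribution concrete, the snippet below is a minimal sketch (not the paper's implementation) of computing personalized PageRank scores on a partially observed graph, seeded on the already-discovered target nodes; the function name, the seeding choice, and the toy graph are illustrative assumptions.

```python
# Illustrative sketch: personalized PageRank (PPR) scores on a partially
# observed graph, seeded on nodes already labeled as targets. Such scores
# impose structure on the network state space and can prioritize which
# boundary node to query next.
import networkx as nx

def ppr_scores(observed_graph, target_nodes, alpha=0.85):
    """Return PPR scores for every observed node, personalized on targets."""
    if not target_nodes:
        return nx.pagerank(observed_graph, alpha=alpha)
    personalization = {v: (1.0 if v in target_nodes else 0.0)
                       for v in observed_graph.nodes}
    return nx.pagerank(observed_graph, alpha=alpha,
                       personalization=personalization)

# Toy usage: a small observed subgraph with node 0 known to be a target.
G_t = nx.Graph([(0, 1), (1, 2), (2, 3), (1, 4)])
scores = ppr_scores(G_t, target_nodes={0})
print(sorted(scores, key=scores.get, reverse=True))  # nodes ranked by PPR score
```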
Related work
Method | State space | Action space | Observability | Learning goal | Learning framework | Policy training | State embedding |
---|---|---|---|---|---|---|---|
NOL (LaRock et al. 2018) | Large | Dynamic | Partial | Vertex property | MDP | Online | No |
\(D^3TS\) (Murai et al. 2017) | Large | Dynamic | Partial | Vertex property | Supervised | Online | No |
GCPN (You et al. 2018) | Small | Fixed | Full | Graph property | MDP | Offline on given dataset | No |
NAC | Large | Dynamic | Partial | Vertex property | MDP | Offline and online on designed dataset | Yes |
Problem definition
- State space: Let \(\{M^i\}\) be a set of random graph models, each generating a set of network instances \(\{G^i\}\) over a set of nodes V. We define the state space as \({{\mathcal {S}}}= \bigcup _{M^i}\{s_t=G^i_t\}\), the set of partially observed network instances \(G_t^i =\{V_t^i, E_t^i\}\) discovered at intermediate time-steps t, where \(V^i_t \subseteq V\) and \(E^i_t \subseteq E\). Each \(v \in V^i_t\) has a label \(C(v) \in \{0, 1, *\}\), representing non-target, target and unobserved node states, respectively. \(E^i_t\) is the set of edges induced by \(V^i_t\). The state space includes the initial state \(G^i_0\), which contains at least one target node, as well as the terminal state, which is the fully observed network instance \(G^i\).
- Action space: For each network instance \(G^i\), we define the action space as \({{\mathcal {A}}}=\{A_t\}\), where \(A_t=\{a=v : v \in V_t, C(v)=*\}\) is the set of boundary nodes observed at time-step t.
- Transition function: \(T(s_t,a_t,s_{t+1})=P(s_{t+1}|s_t,a_t)\) encodes the probability of transitioning from state \(s_t\) to \(s_{t+1}\) given that action \(a_t\) was taken. Let v be the node selected by action \(a_t\). Then \(s_{t+1}= s_t \cup C(v) \cup N_v \cup E_{N_v}\), where \(N_v\) are the neighbors of node v and \(E_{N_v}\) are the edges incident to \(N_v\). For each network instance \(G^i\), the transition function is deterministic: \(T(s_t,a_t,s_{t+1})=1\).
- Reward function: \(r(s_t, a_t)\) returns the reward gained by executing action \(a_t\) in state \(s_t\) and is defined as \(r(s_t, a_t) = 1\) if \(C(a_t)=1\), and 0 otherwise. The total cumulative, action-specific reward, also referred to as the action-value function Q, is defined as
$$\begin{aligned} Q(s,a)=\mathbb {E}\left[ \sum _{t=0}^h \gamma ^t r_{t+1} \,\Big |\, s,a\right] , \end{aligned}$$
(1)
where \(\gamma \in [0, 1]\) is a discount factor that captures the utility of exploring future graph states and h is the horizon length. Figure 1 illustrates how this cumulative reward is computed over a network topology; a minimal code sketch of this MDP follows the list. In the next section, we describe our deep reinforcement learning algorithm in detail.
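To make the MDP above concrete, the sketch below implements the deterministic transition, the boundary-node action set, the binary reward, and a rollout estimate of the discounted return in Eq. (1) for a single, fully known network instance \(G^i\). Class and function names are our own illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the discovery MDP for one fully known instance G^i
# (names and structure are illustrative, not the paper's implementation).
import networkx as nx

class DiscoveryEnv:
    def __init__(self, full_graph, labels, seed_node):
        self.G = full_graph            # ground-truth instance G^i
        self.labels = labels           # C(v) in {0, 1} for every v in V
        self.observed = nx.Graph()     # current state s_t = G^i_t
        self.known = {seed_node}       # nodes whose label C(v) has been queried
        self._expand(seed_node)        # initial state G^i_0

    def _expand(self, v):
        # s_{t+1} = s_t ∪ C(v) ∪ N_v ∪ E_{N_v}: reveal v's neighbors and the
        # edges between revealed nodes and nodes already observed.
        self.observed.add_node(v)
        for u in self.G.neighbors(v):
            self.observed.add_edge(v, u)
            for w in self.G.neighbors(u):
                if w in self.observed:
                    self.observed.add_edge(u, w)

    def actions(self):
        # A_t: boundary nodes, i.e. observed nodes with unknown label C(v)=*.
        return [v for v in self.observed.nodes if v not in self.known]

    def step(self, v):
        # Deterministic transition plus binary reward: r = 1 iff C(v) = 1.
        self.known.add(v)
        self._expand(v)
        return 1.0 if self.labels[v] == 1 else 0.0

def discounted_return(env, policy, horizon, gamma=0.99):
    # Rollout estimate of Eq. (1) under a given policy.
    total = 0.0
    for t in range(horizon):
        candidates = env.actions()
        if not candidates:
            break
        total += (gamma ** t) * env.step(policy(env, candidates))
    return total
```

Here a policy is any callable that picks one of the boundary nodes; NAC's learned actor would play that role.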
Network actor critic (NAC) algorithm
Compression of network state space
Offline learning and policy optimization
Training and network details
NAC performance results
Datasets
Synthetic datasets
Model | Type | Parameters |
---|---|---|
SBM | Background | \(k=[1,10], p_i=[0.01,0.4],r=[0.005,0.25],i=1 \ldots k\) |
LFR | Background | \(\tau _1=[3,2], \tau _2=(1,1.9], \mu =[0.1,0.4], \langle d \rangle =[32,256]\), \(d_{\max }=[256,2048],\min _{c}=[256,1000], \max _{c}=[512,2000]\) |
ER | Foreground | \(n_f=\{30,40,80\},k_f=\{1,2,4\},p_f=[0.5,1]\) |
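For illustration, the sketch below assembles one training instance in the spirit of the table above: an SBM background with a planted, dense ER foreground of target nodes. How the foreground is wired to the background (a few random edges per target node) is our assumption for illustration, not necessarily the paper's exact procedure.

```python
# Illustrative sketch: one synthetic instance with an SBM background and a
# planted ER foreground of target nodes. The foreground-to-background
# attachment is an assumption made for illustration only.
import random
import networkx as nx

def make_instance(k=4, block_size=250, p_in=0.2, p_out=0.01,
                  n_f=40, p_f=0.8, n_attach=3, seed=0):
    rng = random.Random(seed)
    # Background: stochastic block model with k blocks.
    probs = [[p_in if i == j else p_out for j in range(k)] for i in range(k)]
    G = nx.stochastic_block_model([block_size] * k, probs, seed=seed)
    labels = {v: 0 for v in G.nodes}                 # non-target background
    # Foreground: dense ER subgraph on n_f new target nodes.
    fg = nx.erdos_renyi_graph(n_f, p_f, seed=seed)
    offset = G.number_of_nodes()
    fg = nx.relabel_nodes(fg, {v: v + offset for v in fg.nodes})
    G = nx.union(G, fg)
    labels.update({v: 1 for v in fg.nodes})          # target nodes
    # Attach each foreground node to a few random background nodes.
    background = list(range(offset))
    for v in fg.nodes:
        for u in rng.sample(background, n_attach):
            G.add_edge(v, u)
    return G, labels
```

The resulting (G, labels) pair plays the role of one fully observed instance \(G^i\) with node labels C(v), from which discovery episodes such as the one sketched earlier can be rolled out.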
Real datasets
Name | # Nodes | # Edges | Target type | Target size |
---|---|---|---|---|
Facebook politician | 5908 | 41,729 | Synthetic | 80 |
Facebook TV shows | 3892 | 17,262 | Synthetic | 80 |
Livejournal | ≈ 4000 k | ≈ 35,000 k | Real | ≈ 1400 |
Baselines