1 Background
Social network is a multidisciplinary research area for both academia and industry, including social network modeling, social network analysis, and data mining. An interesting problem in social network analysis is influence maximization (IM), which can be applied in marketing to deploy business strategies. Typically, IM is the problem that given a graph
G as a social network, an influence spread model and an integer
k select the top
k nodes as seeds to maximize the expected influence spread (EIS) through
G. One corresponding issue in marketing is product promotion. In order to advertise a new product efficiently within a limited budget, a company may choose a few people as seeds who will be given free samples. It is likely that those people will recommend others, such as their friends, relatives or co-workers, to try this product. Eventually, a great number of people may adopt the product due to such ‘word-of-mouth’ effect [
1]-[
6]. Intuitively, the initial seed selection is a key factor that will impact on the success of the product promotion. Therefore, it is important to design applicative influence spread model and efficient search algorithm to find the most influential people in social networks.
IM was first investigated as an combinatorial optimization problem by Kempe et al. in [
5]. They considered two influence spread models, namely, Independent Cascade (IC; [
2],[
3]) and Linear Threshold (LT; [
7],[
8]), and proved a series of theoretical results. After that, the two models have been extensively studied (please see, e.g., [
9]-[
15] for recent works). In this paper, we focus upon the LT model. Let
S be a set of initially active nodes; the influence, under the LT model, propagates in a threshold manner. That is, a node
v is activated if and only if the sum of influence it receives from its active neighbors exceeds a threshold
λ(
v) chosen uniformly at random.
As we understand, a crucial part of IM is how to compute the EIS given a node, since only we know the EIS of each node, and then we could find a seed set to maximize the combinatorial EIS. The exact EIS computation was left as an open problem in [
5] and has attracted a great deal of attentions in recent years (see, e.g., [
9]-[
11],[
13],[
15],[
16]). In [
11], Chen et al. proved that computing the exact EIS under the LT model is #P-hard. Therefore, a polynomial time exact solution does not exist unless
P= NP. But based on the observations in [
11],[
15], the influence diminishes rapidly during the diffusion in many real-world social networks under the LT model. In other words, the influence spread of a seed is limited within a small number of hops. It has been shown that the influence spread under the LT model can be computed by searching simple paths starting from the seeds [
11],[
15]. Therefore, we can define a hop constraint
T such that given a seed
v, we only take paths within
T hops to estimate the EIS of
v. The main contributions of this paper are as follows:
1.
We develop an exact algorithm for computing the EIS within four hops. Instead of finding simple paths, we compute the EIS of a node by finding cycles through it. In this study, a cycle of length l is defined as a path visiting a node twice and visiting other l−2 nodes exactly once. The detailed algorithm is given in the ‘Methods’ section.
2.
For the case that T>4, we develop an approximation algorithm to estimate EIS based on random walk. The experimental results in the ‘Results and discussion’ section show that more precise and quick results can be obtained by using a combination of our exact and approximation algorithms rather than using methods based on simple path.
3.
When applying the standard greedy algorithm to IM, it will repeatedly run EIS estimation (EISE) until the top k influential nodes are selected. To further reduce the running time, we construct two lists to save the influence diffused by each node and the active probability of each node, respectively. Moreover, we develop two algorithms to update the two lists when adding a new seed so that the next one with the maximum marginal gain can be directly obtained without running the EISE. The update algorithms are represented in the ‘Influence maximization’ section. It is able to say that the two lists contain all the information for doing the seed selection, and they can be easily and quickly updated by our update algorithms.
4.
We compare our algorithm with some well-known IM algorithms on four real-world social networks. The experimental results show that our algorithm is more accurate than others in finding the most influential nodes, and it is also better than or competitive with them in terms of running time.
The rest of this paper is organized as follows: The ‘Related work’ section introduces the related works. ‘Problem description’ section gives the problem descriptions of both EISE and IM. ‘Methods’ and ‘Influence maximization’ sections study the two problems, respectively. In detail, ‘A deterministic algorithm’ section efficiently solves the EISE assuming that the influence spread is negligible after four hops. ‘A randomized algorithm’ section presents an approximation algorithm for general EISE. The ‘Influence maximization’ section presents a fast method to solve IM by using the algorithms proposed in the ‘Methods’ section. Finally, ‘Results and discussion’ section gives the simulation results, and the ‘Conclusion’ section concludes this paper.
In the literature, the IM problem has been extensively studied under the IC and LT models. Kempe et al. in [
5] first showed that it is NP-hard to determine the optimum for IM under the two models, and by showing that the EIS function is monotone and submodular, they proved that the standard greedy algorithm brings a
-factor approximation solution. In mathematics, a set function
is monotone and submodular if ∀
S2⊆
S1, we have
f(
S1)≥
f(
S2) and
f(
S1∪{
u})−
f(
S1)≥
f(
S2∪{
u})−
f(
S2), where
u is an arbitrary item. In such cases, a
-factor approximation solution can be obtained by picking the item with the maximum marginal gain repeatedly [
17]. In [
5], how to compute the exact marginal gain (i.e., compute the EIS increment when adding a node) under the two models was left as an open problem, and they estimated it by running the Monte Carlo (MC) simulation, which is not computational efficient (e.g., it takes days to select 50 seeds in a moderate size graph of 30K nodes [
11]). Motivated by improving the running time performance, many algorithms have been proposed. Leskovec et al. developed a Cost-Effective Lazy Forward (CELF) algorithm, which is up to 700 times faster than the greedy algorithm with Monte Carlo simulation [
16]. But as the results shown in [
9], CELF still cannot be applied to find seeds in large social networks, and it takes several hours to select 50 seeds in a graph with tens of thousands of nodes. To further reduce the running time, Goyal et al. [
13] developed an extension of CELF, called CELF++, which was showed 0.35 to 0.55 faster than CELF. In [
9], Chen et al. proposed two new greedy algorithms, namely NewGreedy and MixedGreedy. NewGreedy reduces the running time by deleting edges having no contribution to influence spread (similar idea was also proposed in [
18]), and MixedGreedy which is a combination of NewGreedy and CELF (it uses NewGreedy as the first step and applies CELF for the remaining rounds). Based on the experiments, they showed that MixedGreedy is much faster than both NewGreedy and CELF.
Based on the IC model, Chen et al. also proposed a new influence spread model, called Maximum Influence Arborescence (MIA), to further reduce the running time of EISE. The efficiency of MIA was demonstrated in [
10]. Besides selecting nodes greedily, Wang et al. [
19] proposed a community-based algorithm for mining the top
k influential nodes under the IC model, and Jiang et al. in [
14] proposed a heuristic algorithm based on Simulated Annealing.
In terms of LT model, after Kempe et al. proposed the greedy algorithm [
5], the most recent works for IM under this model are [
10],[
12],[
15]. In [
10], Chen et al. proved that the EIS under LT model can be computed in linear time in a directed acyclic graph, and they proposed an algorithm called Local Directed Acyclic Graph (LDAG). Given a general graph, it first converts the original graph into small acyclic graphs, and it only considers the EIS of a node within its local graph when computing the marginal gain. In [
12], Narayanam and Narahari developed an algorithm for the LT model that selects the nodes based on the Shapley Value. In [
15], Goyal et al. proposed an algorithm called SIMPATH, which estimates the EIS by searching for the simple paths starting from seeds. Since it is computationally expensive to find all the simple paths, they adopted a parameter
η to prune them. They also applied the vertex cover optimization to cut down the number of iterations. Based on their experimental results, SIMPATH showed its merits from the aspects of running time and seed quality.
3 Problem description
Many introductions about the LT model and IM problem can be found in detail in papers cited above. Here, for the sake of completeness, we give a brief description for the LT model and formal definitions for IM and EISE.
Definition 1
Let G(V,E) be a directed graph; we define
-
Nin(v) (respectively Nout(v)) to be the set of incoming (respectively outgoing) neighbors of v (∀v∈V).
-
λ(v) to be the threshold of v, which is a real number in the range of [ 0,1] chosen uniformly at random.
-
x(v) to be a 0 to 1 variable which indicates whether v is active or not.
According to Definition 1, given a weighted directed graph G(V,E,w), where w(e)∈[ 0,1] (∀e∈E) is a weight function, the sum of influence v receives can be formulated as . Without loss of generality, we assume (∀v∈V). In the LT model, time is discrete. Given a seed set S, at time 0, we have ∀v∈S, x(v)=1, and ∀u∈(V∖S), x(u)=0. At any particular time t, a node v∈V becomes active if . Finally, the influence spread process stops at a time slot when there is no newly activated node.
Definition 2.
EISE: Given a weighted directed graph G(V,E,w) and a set S⊆V of nodes, EISE is the problem of estimating the expected number of active nodes at the end of the influence spread. EISE
T
is the problem that given an integer T, estimates the expected number of nodes that are active at time T.
For the rest of this paper, given a seed set S, we denote by σ(S) the expected number of nodes that are eventually active and denote by σ
T
(S) the expected number of nodes that are activated within T time slots. We can say that σ(S) is an expected number among the probability distributions of active nodes given S and σ
T
(S) is a time limited version of σ(S).
Definition 3.
IM: Given a weighted directed graph G(V,E,w) and a parameter k, the IM problem is to find a seed set S of cardinality k to maximize σ(S).
As the experimental results shown in [
15], under the LT model, the EIS is negligible after a small number of hops (usually three or four hops) in many real-world social networks. Therefore, to solve the IM problem, it is sufficient to compute
σ
T
(
S) instead of
σ(
S) for some small value of
T.
5 Influence maximization
Considering the computational efficiency, we define a hop constraint for EISE, and we present two algorithms in ‘Methods’ section to compute σ
T
(v) in v’s local area (T hops). The proposed algorithms are worth applying to solve the IM problem greedily. Given a weighted directed graph G(V,E,w), the standard greedy algorithm will run EISE O(n) times to select a seed, where n denotes the number of nodes. To further reduce the running time, we construct an influence list IL to store the EIS of nodes in the induced graph of G∖S, where S is the current seed set. Let v1,v2,⋯,v
n
be the nodes in the input graph. Given a parameter T, initially we have IL={l1=σ
T
(v1),⋯,l
n
=σ
T
(v
n
)}, since S=∅. After adding a node v
i
into S, all the nodes, whose local area include v
i
, have to be updated. Instead of running EISE, we update them by building an incoming tree rooted at v
i
(Algorithm 3).
In Algorithm 3, the incoming tree is node repeated, including all the simple path of length T ending at v. denotes the EIS from i
j
to i0 via path (i
j
,ij−1,⋯,i0), where i0=v, and denotes the EIS of v in the induce graph of V∖S∖{i1,⋯,i
j
}. Thus, denotes the entire influence diffused from i
j
through path (i
j
,ij−1,⋯,i1,π), where π is a path of length no more than T−j starting from v and does not contain any node in {i1,i2,⋯,i
j
}. It is clear that after steps 2 to 7, ∀u∈(V∖S), the influence diffused from u through v is removed from the corresponding element in IL. Consider now the running time. Algorithm 3 generates at most O(Δ
j
) nodes in depth j (1≤j≤T). For each node i
j
in depth j, can be computed by building an outgoing tree of depth T−j rooted at v, which can be done by DFS in O(ΔT−j) time. Therefore, Algorithm 3 runs in O(Δ
T
) time, considering T as a constant. Compared with running EISE for all the nodes, it is much faster when T and Δ are relatively small.
In addition to IL, we construct another list, namely, probability list PL, to store the nodes’ active probabilities at time T. When S=∅, obviously PL={p1=0,⋯,p
n
=0}. Similarly, after adding a node v
i
into S, the active probabilities of nodes in v
i
’s local area need to be updated. The algorithm of updating PL is given in Algorithm 4.
Algorithm 4 searches the simple paths of length T starting from v and updates the active probability of a node i
j
according to step 5, in which is the influence spread from v to i
j
through path (i0,⋯,i
j
), and 1−p
v
is the increment of v’ active probability when it is added into S. In the outgoing tree, there are O(Δ
T
) nodes; thus, PL can be updated in O(Δ
T
) time.
Assume v
i
is a newly added node; then, the marginal gain is l
i
(1−p
i
). Since both Algorithms 3 and 4 run in O(Δ
T
) time, we can find the node with the maximum marginal gain in O(Δ
T
+n) time. Next, we present an algorithm, which consists of two steps, for influence maximization based on a time parameter T (IMT). Given a weighted directed graph G(V,E,w), the first step is to compute the EIS of each node v∈V. Such computation is based on the assumption that the EIS is negligible after T hops. The second step contains two parts, the first part is to choose a node with the maximum marginal gain and the second part is to update the two lists: IL and PL. Let v be the last added node; the updating is limited to the local area of v (T hops from v).
The running time of Algorithm 5 highly depends on
T and the maximum degree
Δ. In [
15], when estimating the EIS of a node by searching simple paths, a parameter
η is used to prune a path once its influence spread is less than
η. To further reduce the running time, when building the incoming and outgoing trees (step 6), we prune the paths in the same way. It is worthy to mention that in [
15], the EISE of a node
v misses all the outgoing simple paths of
v whose product of weights is less than
η. When building the incoming (respectively outgoing) tree rooted at
v, our algorithm also neglects a number of paths; however, the losses are now evenly distributed to all the nodes in
v’s local area. Thus, the impact is less significant.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
ZL, LF, and KY formulated the problem and did the algorithm design and implementation. WW and BT contributed to the theoretical part of algorithm design and organized this research. All authors read and approved the final manuscript.