Researchers have long been interested in assessing influence in social networks, and many approaches have been proposed to rank users according to their influence [8]. Some studies are based on network topology and centrality measures [9]. Other approaches rank nodes using diffusion-based or random-walk-based algorithms such as HITS [10] or PageRank [11]. A more recent family extends network topology approaches with information fusion over the different interactions that can be considered in influence assessment. In the following, we present major works on Twitter for each type of approach.
When measuring user influence on Twitter, many criteria can be considered. Leavitt et al. [2] use four features to measure influence: replies, retweets, and mentions, in addition to the number of followers. They report statistics for these measures but do not offer a global influence score combining all the proposed criteria. Cha et al. [
12] define three influence measures on Twitter: indegree influence, the number of followers, which indicates the size of a user’s audience or popularity; mention influence, the number of mentions of a user, which indicates his ability to engage others; and retweet influence, the number of retweets, which indicates the ability of a user to write content that others forward. The authors compute the value of each relation for 6 million users and compare them. To do this, they rank users according to each relation and then quantify how a user’s rank varies across relations, using Spearman’s rank correlation to measure the strength of association between two rankings. They found that the number of followers represents a user’s popularity but is not related to other important relations, such as retweets and mentions. Their result suggests that the number of followers alone reveals very little about a user’s influence. This research does not provide a global influence measure; it only measures influence according to each relation separately. Chen et al. [
13] propose a local ranking method named ClusterRank, which considers the number of neighbors and the clustering coefficient. Bakshy et al. [14] follow a different approach to identify influential users: they use shortened URL diffusion cascades and consider that users producing the largest cascades are the most influential. The presented results are obtained from a survey of 1.6 million users over a period of two months in 2009. In this work, the definition of influence is limited to the ability to be the first to publish a URL that is then retweeted by followers. Brown et al. [15] argue that the location of a node in the network may play a more important role than its indegree. For example, a node located in the center of the network with a few highly influential neighbors may be more influential than a node with a larger number of less influential neighbors. Considering this fact, the
k-shell decomposition algorithm can be useful [16]. The principle of k-shell decomposition is to assign a core index ks to each node such that nodes with the lowest values are located at the periphery of the network, while nodes with the highest values are located in the center. The innermost nodes thus form the core of the network. The authors observe that the results of k-shell decomposition on the Twitter network are highly skewed. They therefore propose a modified algorithm that uses a logarithmic mapping to produce fewer and more meaningful k-shell values. In [17], correlation between users was studied to identify and measure social influence as a source of correlation between the behaviors of individuals with social ties. The authors study the phenomenon that a user’s behavior can induce his or her friends to behave in a similar way. To do this, they use logistic regression to quantify social correlation, measured as a function of a single variable: the number of active friends the user has. The shuffle test is then used to decide whether influence is a likely source of the correlation. These techniques provide only a qualitative indication that influence exists, not a quantitative measure. Qasem et al. [
18] present a new approach to detecting influential users. The proposed approach detects the users who increase the size of the social network by attracting new users into it. In [19], the authors review the features that can be extracted from Twitter for user classification and for detecting users who are influential in real life based on their Twitter profile. They cite many features, such as scalar features (e.g., number of followers), user interactions, and term occurrences (URLs, punctuation, etc.). The authors then use non-linear classifiers in the form of a kernelized SVM and logistic regression, representing each user as various forms of bags of words. The results are interesting but are valid only for the considered data set and restricted to the domains used (automotive and banking).
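The k-shell principle described above can be illustrated with a short sketch. It implements the standard pruning procedure on an undirected graph, not the modified logarithmic variant of [15]; the function name `k_shell` and the toy edge list are assumptions made for illustration only.

```python
from collections import defaultdict

def k_shell(edges):
    """Return {node: core index ks} for an undirected graph given as (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    degree = {n: len(nbrs) for n, nbrs in adj.items()}
    remaining = set(adj)
    shell = {}
    k = 0
    while remaining:
        # Nodes whose remaining degree is at most k belong to shell k.
        peel = [n for n in remaining if degree[n] <= k]
        if not peel:
            k += 1  # nothing left to prune at this level, move inward
            continue
        while peel:
            n = peel.pop()
            if n not in remaining:
                continue  # already pruned via another neighbor
            remaining.discard(n)
            shell[n] = k
            for nb in adj[n]:
                if nb in remaining:
                    degree[nb] -= 1
                    if degree[nb] <= k:
                        peel.append(nb)  # pruning cascades to this neighbor
    return shell

# Hypothetical toy network: a triangle (a, b, c) with a two-node tail (d, e).
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("a", "d"), ("d", "e")]
toy_shells = k_shell(toy_edges)
```

In this toy network the tail nodes d and e receive core index 1 (periphery), while the triangle a, b, c receives core index 2 (core), even though d has the same degree as b and c.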
A drawback of network topology-based algorithms is that they consider information about individual users but not the interaction among users through a sequence of relations. On Twitter, a user’s influence depends on information diffusion between users. Nevertheless, these studies help identify the criteria to take into account in influence assessment.
Other studies propose to rank nodes using diffusion-based or random-walk-based algorithms, with the common assumption that a node is expected to be influential if it points to many highly influential neighbors. In this context, users’ influence has been ranked with classical random-walk algorithms such as PageRank. The main idea behind PageRank is that “more important pages (web sites) are likely to receive more links from other pages.” Many variants of the PageRank algorithm have been proposed to improve it and adapt it to Twitter. A notable one is TunkRank [20], which uses a constant to represent the retweet probability, combined with the people the user follows and the fans who follow this user. In TunkRank, a user’s influence is the expected number of people influenced by the information the user releases. Ghosh et al. [
21] propose Collusionrank, a PageRank-like approach, to combat link farming on Twitter. They negatively bias the initial scores of nodes identified as spammers. Then, since a user should be penalized for following spammers and not for being followed by them, the Collusionrank score of a node is computed from the scores of its followings (instead of its followers). Thus, users who follow a larger number of spammers, or who follow those who in turn follow spammers, get a negative score of higher magnitude and are pushed down in the ranking. Building on PageRank, LeaderRank [22] introduces a ground node g connected to every node in the original network by two directed links (one in each direction), so that the network becomes strongly connected. As a result, LeaderRank converges faster. The results showed that LeaderRank outperformed PageRank in ranking effectiveness, as well as in robustness against manipulation and noisy data. Li et al. [23] improve LeaderRank by introducing a weighting mechanism: nodes with different in-degrees receive different scores from the ground node. In [
24], the authors define a measure based on topic similarity and on the structure of the links between users. Influence is viewed as the act of following other users because of shared topic interests. In this context, the authors propose TwitterRank, an extension of the PageRank algorithm, to measure the topic-sensitive influence of users. Although the idea is promising, the experimental results show that some follow links between users are not due to topic similarity; moreover, the method ignores other important relations, such as mentions and replies. Ashwini et al. [25] consider Twitter as a platform for information diffusion and study the problem of identifying influential users. They propose ProfileRank, an information diffusion model based on random walks that estimates user influence. ProfileRank is based on the principle that an influential user creates relevant content. The limitation of this approach is that influence is assessed based only on the retweet relation, ignoring the other relations.
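To make the ground-node idea behind LeaderRank concrete, the following is a minimal sketch and not the reference implementation of [22]. It assumes the commonly used formulation in which every user starts with one unit of score, the ground node starts with zero, each node splits its score equally among the nodes it points to, and the ground node's final score is shared equally among all users. The function name `leader_rank`, its parameters, and the toy follow list are hypothetical.

```python
def leader_rank(follows, tol=1e-10, max_iter=1000):
    """follows: iterable of (follower, leader) pairs. Returns {user: score}."""
    follows = list(follows)
    nodes = {n for pair in follows for n in pair}
    if not nodes:
        return {}
    ground = object()  # sentinel standing in for the ground node g
    out = {n: set() for n in nodes}
    for follower, leader in follows:
        if follower != leader:
            out[follower].add(leader)
    # The ground node is linked to and from every user (two directed links each).
    for n in nodes:
        out[n].add(ground)
    out[ground] = set(nodes)

    score = {n: 1.0 for n in nodes}
    score[ground] = 0.0
    for _ in range(max_iter):
        new = dict.fromkeys(out, 0.0)
        for n, targets in out.items():
            share = score[n] / len(targets)  # equal split over out-links
            for t in targets:
                new[t] += share
        delta = max(abs(new[n] - score[n]) for n in out)
        score = new
        if delta < tol:
            break
    # Redistribute the ground node's score evenly among the real users.
    bonus = score.pop(ground) / len(nodes)
    return {n: s + bonus for n, s in score.items()}

# Hypothetical toy data: b, c and d follow a, and a follows b.
toy_scores = leader_rank([("b", "a"), ("c", "a"), ("d", "a"), ("a", "b")])
toy_ranking = sorted(toy_scores, key=toy_scores.get, reverse=True)
```

Because every node has at least one outgoing link (to the ground node), no damping factor is needed, and the total score stays equal to the number of users throughout the iteration.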
In the same context of diffusion-based algorithms, some researchers have proposed variants of the HITS (hyperlink-induced topic search) algorithm, a link analysis algorithm for rating Web pages developed by Jon Kleinberg. HITS assigns two scores to each page: its authority, which estimates the value of the page’s content, and its hub value, which estimates the value of its links to other pages. Romero et al. [26] propose the IP algorithm to measure influence. In this paper, influence is defined as the degree of content propagation in the network (retweets). In addition, the authors argue that a user’s influence depends not only on the size of the influenced audience but also on its passivity. The passivity of a user is the tendency to consume information passively without forwarding the content to the network. The algorithm showed better accuracy than other influence measures, such as PageRank, the number of followers, and the number of mentions. Although passivity seems to be a good influence indicator, this work ignores other important relations such as reply. The diffusion-based algorithms, such as variants of PageRank and HITS, were designed with information propagation in the network in mind. Their shortcoming is that they do not combine different relations and interactions.
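The two HITS scores mentioned above can be computed by a simple mutual-reinforcement iteration. The sketch below shows plain HITS only, not the IP algorithm of [26]; the function name `hits`, the fixed iteration count, and the toy link list are assumptions for illustration.

```python
import math

def hits(links, iterations=50):
    """links: iterable of (source, target) pairs. Returns (hub, authority) dicts."""
    links = list(links)
    nodes = {n for edge in links for n in edge}
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)

    def normalize(scores):
        # Scale to unit Euclidean length so the values stay bounded.
        norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
        return {n: v / norm for n, v in scores.items()}

    for _ in range(iterations):
        # Authority of a node: sum of hub scores of the nodes linking to it.
        auth = dict.fromkeys(nodes, 0.0)
        for src, dst in links:
            auth[dst] += hub[src]
        auth = normalize(auth)
        # Hub value of a node: sum of authority scores of the nodes it links to.
        hub = dict.fromkeys(nodes, 0.0)
        for src, dst in links:
            hub[src] += auth[dst]
        hub = normalize(hub)
    return hub, auth

# Hypothetical toy graph: p1, p2 and p3 link to a; p1 also links to b.
toy_hub, toy_auth = hits([("p1", "a"), ("p2", "a"), ("p3", "a"), ("p1", "b")])
```

In the toy graph, a obtains the highest authority because three pages link to it, and p1 obtains the highest hub value because it links to both authorities.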
In recent works, information fusion is used to address the limitations of existing methods. In [27], the authors propose a combination of two models for ranking users’ influence: the PageRank algorithm [11] and a hidden Markov model (HMM). They build an HMM to observe the evolution of influence over time and use three relations: retweet, mention, and reply. The model is evaluated using a survey as ground truth for influence ranking. The proposed model differs from the others by combining the important relations. However, because its purpose is to rank users, a user’s score does not reveal the degree of his or her influence (high or low), and the model’s output is only useful for ranking users. Moreover, the authors do not offer a measure of influence that exploits the combination of criteria together with its inherent uncertainty. Yet it seems important to consider the degree of uncertainty on the weights assigned to the different relations and interactions according to their importance.
To this end, recent research uses belief functions theory to assess user influence in weighted networks [28, 29] and complex networks [30]. To the best of our knowledge, this is the first time belief functions theory is exploited to assess influence on the Twitter network using different forms of interactions instead of centrality measures.
Belief functions theory
Every day, a huge volume of incomplete and imperfect information is produced by social network applications. Reasoning with uncertainty has therefore become a major concern in the analysis of social network data.
Belief functions theory is considered a general framework for reasoning with uncertainty and is well connected to other frameworks, such as probability, possibility, and imprecise probability theories [31]. The theory of belief functions, also known as evidence theory or Dempster–Shafer theory, was first introduced by Dempster in the context of statistical inference and was later developed by Shafer as a general framework for modeling epistemic uncertainty, that is, uncertainty due to a lack of knowledge [32].
In the following, we recall the basic concepts of belief functions theory. Let \(\Omega\) be a finite set, and denote by \(2^\Omega\) the set of all subsets of \(\Omega\). In the context of Dempster–Shafer theory, \(\Omega\), often called a frame of discernment, represents the set of possible answers to a certain question. A mass function \(m\) is a function \(m\,{:} \,2^\Omega \longrightarrow [0,1]\) such that:
$$\begin{aligned} \underset{X\in 2^\Omega }{\sum }m(X) = 1 \; \text {and} \; m(\emptyset )=0 \end{aligned}$$
(1)
The mass \(m(X)\) expresses the part of belief that supports the subset \(X\) of \(\Omega\), and \(m(\Omega )\) represents the degree of ignorance. Under the closed-world assumption, \(\Omega\) is exhaustive and the hypotheses are mutually exclusive, so \(m(\emptyset )=0\).
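As an illustration of the definition above, a mass function on a small frame can be stored as a mapping from subsets to masses and checked against the two conditions of Eq. (1). This is a minimal sketch; the function name `is_valid_mass`, the numeric tolerance, and the two-element frame with labels "influential" and "passive" are hypothetical choices made only for the example.

```python
def is_valid_mass(mass, frame, tol=1e-9):
    """Check that `mass` (dict: frozenset -> float) is a mass function on `frame`."""
    frame = frozenset(frame)
    # Every focal element must be a subset of the frame of discernment.
    if any(not focal <= frame for focal in mass):
        return False
    # Each mass must lie in [0, 1].
    if any(value < 0.0 or value > 1.0 for value in mass.values()):
        return False
    # Closed world: no mass on the empty set.
    if mass.get(frozenset(), 0.0) > tol:
        return False
    # The masses must sum to one.
    return abs(sum(mass.values()) - 1.0) <= tol

# Hypothetical frame and mass function, for illustration only.
example_frame = frozenset({"influential", "passive"})
example_mass = {
    frozenset({"influential"}): 0.6,  # belief committed to "influential"
    example_frame: 0.4,               # m(Omega): degree of ignorance
}
example_ok = is_valid_mass(example_mass, example_frame)
```

Here 0.6 of the belief supports the hypothesis "influential", while the remaining 0.4 is assigned to the whole frame and therefore expresses ignorance rather than support for "passive".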
Belief functions theory allows not only the representation of partial knowledge but also information fusion under uncertainty [33]. Given different sources of information expressed on the same frame of discernment, we would like to combine this information into a single belief mass. This is done with the conjunctive combination rule [7], which assumes that all sources are reliable and consistent. Considering two mass functions \(m_1\) and
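\(m_2\), the conjunctive combination rule is defined as:
$$\begin{aligned} m_{1 \cap 2}(X) = \sum _{Y \cap Z = X} m_1(Y)\, m_2(Z), \quad \forall X \in 2^\Omega \end{aligned}$$
(2)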
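In words, the conjunctive rule assigns to each subset X the sum of the products m1(Y) m2(Z) over all pairs of focal elements whose intersection is X. The sketch below follows this standard unnormalized definition; the function name `conjunctive` and the two example sources are hypothetical. Note that under this definition conflicting sources can place mass on the empty set.

```python
from collections import defaultdict

def conjunctive(m1, m2):
    """Unnormalized conjunctive combination of two mass functions.

    m1, m2: dicts mapping frozenset (focal element) -> mass.
    """
    combined = defaultdict(float)
    for y, mass_y in m1.items():
        for z, mass_z in m2.items():
            # The product of the two masses goes to the intersection.
            combined[y & z] += mass_y * mass_z
    return dict(combined)

# Hypothetical sources on the frame {"influential", "passive"}.
omega = frozenset({"influential", "passive"})
source_1 = {frozenset({"influential"}): 0.6, omega: 0.4}
source_2 = {frozenset({"passive"}): 0.3, omega: 0.7}
fused = conjunctive(source_1, source_2)
```

With these two sources, the fused masses are 0.42 on {influential}, 0.12 on {passive}, 0.28 on the whole frame, and 0.18 on the empty set. The last value measures the conflict between the sources; the sketch leaves it in place rather than renormalizing.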
To make a decision, we try to select the most likely hypothesis, which may be difficult to do directly with the basics of belief functions theory, where masses are assigned not only to singletons but also to subsets of hypotheses. Several solutions exist for decision making within belief functions theory. The best known is the pignistic probability [34]. In contrast to mass functions, which are defined on \(2^\Omega\), the pignistic probability is a probability measure defined on \(\Omega\). It was proposed in the transferable belief model (TBM) [35], which is based on two levels: the “credal level,” where beliefs are entertained and represented by belief functions, and the “pignistic level,” where beliefs are used to make decisions and are represented by probability functions called pignistic probabilities, denoted bet:
$$\begin{aligned} \mathrm {bet}(x)=\sum _{X \in 2^\Omega ,\, x \in X} \frac{m(X)}{|X|} \end{aligned}$$
(3)
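The pignistic transformation of Eq. (3) can be sketched in a few lines: the mass of each focal element is split equally among the hypotheses it contains. The function name `pignistic` and the example mass function are hypothetical, and the sketch assumes a closed world, that is, no mass on the empty set.

```python
from collections import defaultdict

def pignistic(mass):
    """Pignistic probability from a mass function (dict: frozenset -> float).

    Assumes m(empty set) = 0, as required by Eq. (1).
    """
    bet = defaultdict(float)
    for focal, value in mass.items():
        for x in focal:
            # Each hypothesis in X receives an equal share m(X) / |X|.
            bet[x] += value / len(focal)
    return dict(bet)

# Hypothetical mass function on the frame {"influential", "passive"}.
decision_mass = {
    frozenset({"influential"}): 0.6,
    frozenset({"influential", "passive"}): 0.4,
}
bet_values = pignistic(decision_mass)
decision = max(bet_values, key=bet_values.get)
```

In this example the 0.4 assigned to the whole frame is split equally, giving bet(influential) = 0.8 and bet(passive) = 0.2, so the decision is "influential".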