Nepotistic relationships in Twitter and their impact on rank prestige algorithms

https://doi.org/10.1016/j.ipm.2013.06.003Get rights and content

Highlights

  • Micro-blogging services can develop into valuable sources of up-to-date information provided the spam problem is overcome.

  • Finding “authoritative” sources/users is the main topic of this paper.

  • Five rank prestige methods are compared: PageRank, HITS, NodeRanking, TunkRank and TwitterRank.

  • Reciprocal links are most of the time “counterfeit currency” to increase followers count.

  • The lower the ranking spammers reach (while relevant users remain atop), the better a method is.

Abstract

Micro-blogging services such as Twitter allow anyone to publish anything, anytime. Needless to say, many of the available contents can be diminished as babble or spam. However, given the number and diversity of users, some valuable pieces of information should arise from the stream of tweets. Thus, such services can develop into valuable sources of up-to-date information (the so-called real-time web) provided a way to find the most relevant/trustworthy/authoritative users is available. Hence, this makes a highly pertinent question for which graph centrality methods can provide an answer. In this paper the author offers a comprehensive survey of feasible algorithms for ranking users in social networks, he examines their vulnerabilities to linking malpractice in such networks, and suggests an objective criterion against which to compare such algorithms. Additionally, he suggests a first step towards “desensitizing” prestige algorithms against cheating by spammers and other abusive users.

Introduction

Twitter is a service which allows users to publish short text messages (tweets) which are shown to other users following the author of the message. In case the author is not protecting his tweets, they appear in the so-called public timeline and they are served as search results in response to user submitted queries. Thus, Twitter can be a source of valuable real-time information and, in fact, several major search engines were including tweets as search results at the moment of this writing.

Given that tweets are published by individual users, ranking them to find the most relevant information is a crucial matter. Indeed, at the moment of this writing, Google seemed to be applying the PageRank method to rank Twitter users to that end (Talbot, 2010). Nevertheless, the behavior of different graph centrality methods and their vulnerabilities when confronted with the Twitter user graph, in general, and Twitter spammers in particular, are still little-known.

Thus, this paper aims to shed some light on this particular issue besides providing some recommendations for future research in the area. As it will be later discussed, user ranking in social networks cannot be an end in itself, but a tool to be used for other tasks. Hence, this author is not considering any a priori “good” ranking and, instead, he suggests measuring the performance of the different methods on the basis of two desirable features: on one hand presumed relevant users should rank atop – although the actual ordering among them is irrelevant; and, on the other hand, spammers should achieve lower rankings.

The paper is organized as follows. First of all, a comprehensive literature review is provided. It deals with several rank prestige algorithms (some well-known and others lesser-known) which are applicable to social networks; their known vulnerabilities; and some partially related work and proprietary tools outside the scope of this study. In addition to that, Twitter spam is discussed with a focus on link spam (known as follow spam in Twitter). Then, the different strategies to fight spam in social websites are overviewed. Finally, the research questions are stated and the feasibility of “desensitizing” prestige ranking algorithms against follow spam is analyzed. After that, the experimental framework in which this study was conducted is described: the dataset crawled from Twitter; the elaboration of the subset of relevant and abusive users; and the straightforward nature of the evaluation. Afterwards, results obtained with each of the different ranking methods are discussed along with the implications of the study. Finally, an in-depth analysis of the collected dataset is provided in an appendix: it provides details on the nature of the social network, in addition to some demographical analysis.

Section snippets

Literature review

A social network, despite the current association with online services, is any interconnected system whose connections are a product of social relations or interactions among persons or groups. That way, families, companies, groups of friends, or scientific production are social networks.

Social networks can be mathematically modeled as graphs and, thus, graph theory has become inextricably related to social network analysis with a long history of research. Think, for instance, of bibliometric

Spam countermeasures

From the literature review regarding Twitter spam it can be concluded that false negatives (uncatched spammers) and false positives (legitimate users labeled as spammers) are unavoidable and, even worse, as it was stated in Benevenuto et al., 2010, Chowdhury, 2010, Yardi et al., 2010 the “war on spam” is a never ending fight because every deployed anti-spam measure will be eventually overcome by spammers.

However, all of those works approach Twitter spam as a detection task while there are other

Research design

The main goal of this study was to compare the performance of different rank prestige algorithms when applied to social networks. To attain that, a relative large dataset was needed, in addition to an objective criterion against which to compare the performance of the different methods. The dataset, described in the following subsection, was crawled by the author from Twitter. Within that dataset two different subsets were prepared: one of presumed relevant users and another of abusive users.

Prestige of abusive users and verified accounts when applying PageRank

As it has been aforementioned, 4884 verified accounts appear in the collected dataset. They amount for a mere 0.3% of the users in the graph but, still, they accumulate 12.7% of the total PageRank in the graph. Regarding their positioning in the global ranking (see Table 5, and Fig. 3, Fig. 4), 85% of the verified accounts appear among the top 10% of users.

With regards to the subset of abusive users, about 50% of the spammers detected in the collection of tweets did not appear in the graph.

Discussion of results

As we have explained before, our approach to evaluate the available ranking algorithms in the context of social networks was not based on an a priori “good” ranking but, instead, their capacity of ranking relevant users (i.e. verified accounts) atop as a whole group while, at the same time, being robust to “gaming” by abusive users, i.e. their “ability” to penalize spammers and aggressive marketers who try to reach better positions by exchanging links instead of providing better content.

Because

Implications, conclusions, and future work

This study makes four main contributions. First, when applying graph prestige algorithms to social networks, ranking is not only a measure of a user’s value but also the result of “gaming” the algorithm by means of relationship links. The fact that spammers – who contribute no valuable content – are consistently better positioned than marketers – who contribute somewhat valuable information – no matter the method employed supports this assert.

Second, evaluating ranking should not be a point in

Acknowledgements

The author would like to thank David J. Brenes, Miguel Fernández, Fernando Zapico, and Diego Guerra for their help during the Twitter dataset collection. He is also in debt with Miguel Fernández for comments on an early draft of this paper. This work was partially financed by grant UNOV-09-RENOV-MB-2 from the University of Oviedo. Finally, the author would like to thank the anonymous reviewers for their constructive suggestions to improve this paper, especially regarding the use of ground truth

References (52)

  • S. Brin et al.

    The anatomy of a large-scale hypertextual web search engine

    Computer Networks and ISDN Systems

    (1998)
  • Bakshy, E., et al. (2011). Everyone’s an influencer: quantifying influence on Twitter. In WSDM ‘11 Proceedings of the...
  • Benevenuto, F., et al. (2010). Detecting spammers on Twitter. In CEAS 2010, Proceedings of the seventh annual...
  • Bharat, K., & Henzinger, M. (1998). Improved algorithms for topic distillation in hyper-linked environments. In...
  • E. Broadman

    Choosing physiology journals

    Bulletin of the Medical Library Association

    (1944)
  • Cha, M., et al. (2010). Measuring user influence in Twitter: The million follower fallacy. In ICWSM’10: Proceedings of...
  • S. Chakrabarti

    Mining the web – Discovering knowledge from hypertext data

    (2002)
  • Choudhury, M. D., Sundaram, H., John, A., Seligmann, D. D., & Kelliher, A. (2010). Birds of a feather: Does attribute...
  • Chowdhury, A. (2010). State of Twitter spam....
  • Chu, Z., et al. (2010). Who is tweeting on Twitter: Human, bot, or cyborg? In Proceedings of annual computer security...
  • Chu, Z., et al. (2012). Detecting social spam campaigns on Twitter. In Proceedings of conference on applied...
  • Fagin, R., Kumar, R., & Sivakumar, D. (2003). Comparing top k lists. In Proceedings of the 14th anual ACM-SIAM...
  • H.H. Fussler

    Characteristics of the research literature used by chemists and physicists in the United States 1949

  • E. Garfield

    Citation analysis as a tool in journal evaluation

    Science

    (1972)
  • D. Gayo-Avello

    De retibus socialibus et legibus momenti

    Europhysics Letters

    (2011)
  • Ghosh, S., et al. (2011). Spammers’ networks within online social networks: a case-study on Twitter. In Proceedings of...
  • Ghosh, S., et al. (2012). Understanding and combating link farming in the Twitter social network. In Proceedings of the...
  • Goyal, A., Bonchi, F., & Lakshmanan, L. V. S. (2010). Learning influence probabilities in social networks. In...
  • Grier, C., et al. 2010. @spam: The underground on 140 characters or less, 2010. In Proceedings of the 17th ACM...
  • P.L.K. Gross et al.

    College libraries and chemical education, 1927

  • Gyöngyi, Z., & Garcia-Molina, H. (2005). Web spam taxonomy. In Proceedings of the first international workshop on...
  • Gyöngyi, Z., Garcia-Molina, H., & Pedersen, J. (2004). Combating web spam with TrustRank. In Proceedings of the 30th...
  • Heil, B., & Piskorski, M. (2009). New Twitter research: Men follow men and nobody tweets....
  • P. Heymann et al.

    Fighting spam on social web sites: A survey of approaches and future challenges

    IEEE Internet Computing

    (2007)
  • B.A. Huberman et al.

    Social networks that matter: Twitter under the microscope

    First Monday

    (2009)
  • Irani, D., et al. (2010). Study of trend-stuffing on twitter through text classification. In CEAS 2010, Proceedings of...
  • Cited by (44)

    • Identification of influential users on Twitter: A novel weighted correlated influence measure for Covid-19

      2020, Chaos, Solitons and Fractals
      Citation Excerpt :

      Follower-Rank is another such influence metric proposed by [8], which is expressed as the number of followers of an individual and Follower-Followee ratio [2] in the Twitter network. Another measure based upon followers and followee ratio is a paradoxical discounted measure [15]. To diminish the effect of trade-offs between several followers and followee of the user, a new metric namely, popularity [1] was developed.

    • A Centrality for Social Media Users Focusing on Information-Gathering Ability

      2023, HT 2023 - The 34th ACM Conference on Hypertext and Social Media
    View all citing articles on Scopus
    View full text