Elsevier

Computers & Security

Volume 76, July 2018, Pages 265-284
Twitter spam detection: Survey of new approaches and comparative study

https://doi.org/10.1016/j.cose.2017.11.013

Highlights

  • Detailed taxonomy summary and analysis of the state of the art.

  • Comparative studies: performance comparison of various typical methods.

  • Open issues in current Twitter spam detection techniques.

Abstract

Twitter spam has long been a critical but difficult problem to address. Researchers have proposed many detection and defence methods to protect Twitter users from spamming activities. In the last three years in particular, many innovative methods have been developed that greatly improve detection accuracy and efficiency over earlier proposals. We are therefore motivated to present a new survey of Twitter spam detection techniques. This survey has three parts: 1) A literature review of the state of the art: this part provides detailed analysis (e.g. taxonomies and biases in feature selection) and discussion (e.g. the pros and cons of each typical method); 2) Comparative studies: we compare the performance of various typical methods on a universal testbed (i.e. the same datasets and ground truths) to provide a quantitative understanding of current methods; 3) Open issues: the final part summarises the unsolved challenges in current Twitter spam detection techniques. Solutions to these open issues are of great significance to both academia and industry. This survey is intended for readers with or without expertise in this area, and for those seeking a deep understanding of the field in order to develop new methods.

Introduction

Online Social Networks (OSNs) are popular collaboration and communication tools for millions of Internet users. As a major social networking platform, Twitter attracts users by providing free microblogging services that let them broadcast or discover messages of up to 140 characters, follow other users and so on, from different devices such as mobile phones and desktops (Chu et al., 2012). Every day, millions of Twitter users share their moments or post their discoveries, such as breaking news, to their followers (Ghosh et al., 2012). However, the openness and convenience of the Twitter platform also attract criminal accounts (spammers), who attack the platform to make money illegitimately. These attacks include spam, scams and phishing (Adewole et al, 2017, Zhu et al, 2012). Because the length of tweets is restricted, it is common for spammers to broadcast unsolicited spam tweets that redirect users to external malicious websites (Lee and Kim, 2013). Compared with traditional spam, which spreads through email, Twitter spam is more dangerous and more sophisticated in deceiving Internet users (Thomas et al., 2011). According to a report (Grier et al., 2010), the click-through rate of Twitter spam reaches 0.13%, whereas that of email spam is only 0.0003% to 0.0006%.
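To put the two click-through rates quoted above on the same scale, the following minimal sketch computes how many times higher the Twitter figure is than each end of the email range. The variable and function names are illustrative only; the three percentages are the ones given in the text.

```python
# Click-through rates quoted in the text, expressed as percentages.
twitter_ctr_pct = 0.13
email_ctr_pct_low = 0.0003
email_ctr_pct_high = 0.0006


def ctr_ratio(twitter_pct: float, email_pct: float) -> float:
    """Return how many times higher the Twitter rate is than the email rate."""
    return twitter_pct / email_pct


# Ratio against the upper and lower ends of the email spam range.
ratio_vs_high = ctr_ratio(twitter_ctr_pct, email_ctr_pct_high)
ratio_vs_low = ctr_ratio(twitter_ctr_pct, email_ctr_pct_low)

print(f"Twitter spam CTR is roughly {ratio_vs_high:.0f}x to {ratio_vs_low:.0f}x the email spam CTR")
```

On these figures the Twitter rate is roughly two to four hundred times the email rate.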

To address the problem of Twitter spam, many detection schemes have been put forward in recent years. Current Twitter spam detection methods fall into three main categories: detection based on Syntax Analysis, Feature Analysis and Blacklisting Techniques. As text is the only format Twitter users can use, many researchers focus on analysing tweet semantics to detect spam (Chu et al, 2012, Chu et al, 2012, Gao et al, 2010, Hu et al, 2013, Hu et al, 2014, Lee et al, 2011, Lee, Kim, 2012, Lee, Kim, 2013, Thomas et al, 2011, Wang et al, 2013, Wu et al, 2017, Wu et al, 2017, Yang et al, 2012, Yardi et al, 2009, Zhang et al, 2012). More work has used features from both the account and the message and applied a statistical method to them (i.e. Feature Analysis) (Ahmed, Abulaish, 2013, Benevenuto et al, 2010, Cao et al, 2012, Castillo et al, 2011, Chen et al, 2015, Chen et al, 2015, Chu et al, 2012, Costa et al, 2013, Egele et al, 2013, Gao et al, 2012, Ghosh et al, 2012, Grier et al, 2010, Hu et al, 2013, Hu et al, 2014, Jin et al, 2011, Lee et al, 2010, Lee, Kim, 2013, Liu et al, 2017, Liu et al, 2016, Sabottke et al, 2015, Sala et al, 2010, Song et al, 2011, Stringhini et al, 2010, Tan et al, 2013, Thomas et al, 2011, Wang, 2010, Yang et al, 2011, Yang et al, 2012, Yang et al, 2013, Yang et al, 2014, Zhang et al, 2012, Zhang et al, 2016, Zhu et al, 2012). In addition, researchers have relied on third-party services such as blacklisting techniques to block malicious information (Ghosh et al, 2012, Gilani et al, 2017, Grier et al, 2010, Ma et al, 2009, Ma et al, 2011, Zhang et al, 2012).

Why do we need this survey? Many efforts have been made to develop effective Twitter spam detection methods. In the last three years especially, several innovative techniques have been developed (Ahmed, Abulaish, 2013, Chen et al, 2015, Chen et al, 2015, Chen et al, 2015, Costa et al, 2013, Egele et al, 2013, Ghosh et al, 2013, Hu et al, 2013, Hu et al, 2014, Jiang et al, 2013, Lee, Kim, 2013, Liu et al, 2016, Liu et al, 2016, Oliver et al, 0000, Stringhini et al, 2013, Symantec, 2015, Tan et al, 2013, Wang et al, 2013, Yang et al, 2013, Yang et al, 2014, Zhang et al, 2016). These newly developed methods improve on almost all research issues in the Twitter spam detection field, such as selection mechanisms for spam features, sampling techniques and detection engines with better accuracy. It is therefore necessary to provide a survey that organises both past and new spam detection methods for future research. In this survey, we also run comparative studies among different detection techniques.

How to use this survey? In terms of usage, the survey contains three parts. First, we collect and list existing related techniques for a literature review. This part employs a taxonomy and a series of analyses to divide the state of the art into categories. Readers will obtain the details of each type of method and its pros and cons in Twitter spam detection. Second, we select typical methods from each branch of this research field and carry out comparative studies among them. This part gives readers a numerical description of current methods, especially of their advantages and weaknesses under different scenarios. Finally, we summarise the related work analysis and the comparative studies, and present open issues together with potential solutions. The summary contributes to future research by identifying several subsequent research areas.

The rest of this survey is organised as follows. Section 2 illustrates the taxonomy of the state of the art. Sections 3, 4 and 5 analyse existing Twitter spam detection methods based on syntax analysis, feature analysis and blacklisting techniques, respectively. Section 6 presents comparative studies of the above techniques. The summary and open issues are discussed in Section 7. Finally, we conclude this survey in Section 8.

Section snippets

Taxonomy

It is challenging to categorise current technologies into several parts. Because most methods borrow ideas from technologies that span different categories of Twitter spam detection, it is difficult to divide the state of the art so that each work falls exclusively into one category. Therefore, to be specific, we build the taxonomy according to extraction and classification methods. Although this taxonomy creation method cannot avoid the case of one method dropping into

Literature review part I: detection based on syntax analysis

In this section, we analyse the detection methods based on syntax analysis and discuss their pros and cons. These methods can be categorised into two parts: 1) key segment and 2) tweet content. Before going into the details, we briefly analyse the generic Twitter spam detection framework. As shown in Fig. 2, Twitter posts are first collected by the data collection module. Suspicious tweets are then selected by a set of basic rules and used for feature extraction. For supervised

Literature review part II: detection based on feature analysis

In this section, we analyse detection methods based on feature analysis and discuss their pros and cons. We divide this category into two sub-categories: detection approaches based on statistical information and those based on the social graph.
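As a rough illustration of the statistical-information sub-category, the sketch below turns one tweet and its author's profile into a small numeric feature vector of the kind a classifier could consume. The `Tweet` record, the feature names and the choice of features are assumptions made for this example; they are not the feature set of any method surveyed here.

```python
import re
from dataclasses import dataclass
from typing import Dict

# Simple pattern for http/https links; an assumption, not a full URL grammar.
URL_RE = re.compile(r"https?://\S+")


@dataclass
class Tweet:
    """Hypothetical record combining message text and account statistics."""
    text: str
    follower_count: int
    following_count: int
    account_age_days: int


def extract_features(tweet: Tweet) -> Dict[str, float]:
    """Compute illustrative account-based and message-based features."""
    words = tweet.text.split()
    return {
        # Account-based features.
        "account_age_days": float(tweet.account_age_days),
        "follower_ratio": tweet.follower_count / max(tweet.following_count, 1),
        # Message-based features.
        "num_urls": float(len(URL_RE.findall(tweet.text))),
        "num_hashtags": float(sum(w.startswith("#") for w in words)),
        "num_mentions": float(sum(w.startswith("@") for w in words)),
        "num_chars": float(len(tweet.text)),
    }


example = Tweet(
    text="Win a prize http://a.example/x #free @bob",
    follower_count=10,
    following_count=200,
    account_age_days=3,
)
features = extract_features(example)
```

Vectors like `features` would then be passed to a statistical or machine learning classifier; the social graph sub-category would need follower and following links, which this sketch does not model.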

Literature review part III: blacklist

In this section, we analyse the detection methods that rely on third-party blacklisting techniques. We also discuss the pros and cons of this type of method. Blacklisting techniques are widely used in current work for Twitter spam detection or dataset labelling (Chu et al, 2012, Ghosh et al, 2012, Grier et al, 2010, Ma et al, 2009, Thomas et al, 2011, Zhang et al, 2012). In particular, the works (Lee, Kim, 2013, Zhang et al, 2016) were published in the last three years.
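The core idea of blacklist-based detection or labelling can be sketched as a lookup of each URL in a tweet against a list of known-bad domains. The example below uses a small in-memory set as a stand-in for a third-party service; the domain names, the `BLACKLISTED_DOMAINS` set and the `tweet_is_blacklisted` helper are hypothetical and do not reproduce any real blacklist API.

```python
import re
from typing import Iterable
from urllib.parse import urlparse

# Hypothetical blacklist; a real system would query a third-party service.
BLACKLISTED_DOMAINS = {"malicious.example", "phish.example"}

# Simple pattern for http/https links; an assumption, not a full URL grammar.
URL_RE = re.compile(r"https?://\S+")


def tweet_is_blacklisted(text: str, blacklist: Iterable[str] = BLACKLISTED_DOMAINS) -> bool:
    """Return True if any URL in the tweet points at a blacklisted domain."""
    domains = set(blacklist)
    for url in URL_RE.findall(text):
        host = (urlparse(url).hostname or "").lower()
        # Match the listed domain itself or any of its subdomains.
        if any(host == d or host.endswith("." + d) for d in domains):
            return True
    return False


flagged = tweet_is_blacklisted("Click http://login.phish.example/a now")
clean = tweet_is_blacklisted("News at https://news.example/story")
```

In practice a shortened URL would have to be resolved to its final destination before the lookup; that step is omitted here.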

Detailed

Comparative study

In this section, we compare typical Twitter spam detection methods from several angles, such as features and performance. As the blacklisting techniques apply third-party services to block spam, we mainly focus on comparing syntax-based methods with feature-analysis-based methods. The experimental results provide a quantitative understanding of the pros and cons of existing spam detection methods.
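Comparing methods on a shared ground truth amounts to scoring each method's predictions against the same labels. The sketch below assumes precision, recall and F-measure as the metrics and uses made-up labels and predictions for two placeholder methods; the values are illustrative and are not results from this survey.

```python
from typing import Dict, Sequence


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Score binary predictions (1 = spam, 0 = non-spam) against ground truth."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f_measure": f_measure}


# Made-up ground truth and predictions, shared by both placeholder methods.
ground_truth = [1, 1, 1, 0, 0, 0, 0, 1]
predictions = {
    "method_a": [1, 1, 0, 0, 0, 1, 0, 1],
    "method_b": [1, 0, 0, 0, 0, 0, 0, 1],
}

results = {name: evaluate(ground_truth, pred) for name, pred in predictions.items()}
for name, scores in results.items():
    print(name, scores)
```

Because both methods are scored on the same labels, the differences between their scores reflect the methods rather than the data.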

For the purpose of empirical study, a 10-day ground-truth dataset was

Summary and open issues

In this survey, we have discussed a series of Twitter spam detection methods, particularly the most recent ones published in the past three years. As the taxonomy and literature review show, the majority of current methods rely on machine learning techniques (e.g. supervised or unsupervised) to identify Twitter spamming activities. Among these machine learning based methods, the major differences lie in how and where features are collected. Therefore,

Conclusion

In this paper, we reviewed the state of the art in Twitter spam detection techniques. We first categorised existing Twitter spam detection methods into three groups and discussed the pros and cons of each method. We then carried out comparative studies on typical methods, focusing mainly on performance comparison. We found that most current spam detection techniques are based on feature selection and machine learning classification. Finally, we made a brief summary and discussed

Tingmin Wu received the Bachelor of Information Technology degree (with first class Hons.) from Deakin University Australia in 2016. Currently, she is a PhD student with Swinburne University of Technology and CSIRO Data61, Australia. Her research interests include cyber security, especially in social spam detection.

References (97)

  • A. Aizawa

    The feature quantity: an information theoretic perspective of Tfidf-like measures

  • F. Benevenuto et al.

    Detecting spammers on twitter

  • T. Bhraguram et al.

    An adaptive subspace clustering dimension reduction framework for time series indexing in Knime workflows

  • A. Bifet et al.

    Sentiment knowledge discovery in twitter streaming data

  • P.E. Black

    Manhattan distance

    Dictionary Algorithms Data Struct

    (2006)
  • E. Blanzieri et al.

    A survey of learning-based techniques of email spam filtering

    Artif Intell Rev

    (2008)
  • P.F. Brown et al.

    Class-based n-gram models of natural language

    Comput Linguist

    (1992)
  • Q. Cao et al.

    Aiding the detection of fake accounts in large scale social online services

  • C. Castillo et al.

    Information credibility on twitter

  • M. Cha et al.

    Twitter datdata homepage

  • C. Chen et al.

    Asymmetric self-learning for tackling twitter spam drift

    (2015)
  • C. Chen et al.

    6 million spam tweets: A large ground truth for timely twitter spam detection

    (2015)
  • C. Chen et al.

    A performance evaluation of machine learning-based streaming spam tweets detection

    IEEE Trans Comput Soc Syst

    (2015)
  • C. Chen et al.

    Spammers Are Becoming “Smarter” on Twitter

    IT Prof

    (2016)
  • Chen C.S., Su S.-A., Hung Y.-C. Protecting computer users from online frauds, US Patent 7,958,555 (Jun. 7...
  • Z. Chu et al.

    Who is tweeting on twitter: human, bot, or cyborg?

  • Z. Chu et al.

    Detecting social spam campaigns on twitter

  • Z. Chu et al.

    Detecting automation of twitter accounts: are you a human, bot, or cyborg?

    IEEE Trans Dependable Secure Comput

    (2012)
  • H. Costa et al.

    Detecting tip spam in location-based social networks

  • M. Egele et al.

    Compa: Detecting compromised accounts on social networks

  • R.-E. Fan et al.

    Liblinear: a library for large linear classification

    J Mach Learn Res

    (2008)
  • J. Friedman et al.

    The elements of statistical learning

    (2001)
  • H. Gao et al.

    Detecting and characterizing social spam campaigns

  • H. Gao et al.

    Towards online spam filtering in social networks

  • S. Ghosh et al.

    Understanding and combating link farming in the twitter social network

  • S. Ghosh et al.

    On sampling the wisdom of crowds: random vs. expert sampling of the twitter stream

  • Z. Gilani et al.

    Do bots impact twitter activity?

  • Google Developers

    Google safe browsing api

  • C. Grier et al.

    @ spam: the underground on 140 characters or less

  • M. Hall et al.

    The WEKA data mining software: an update

    ACM SIGKDD Explor Newsl

    (2009)
  • P. Hayati et al.

    Spam 2.0 state of the art

    (2012)
  • P. Heymann et al.

    Fighting spam on social web sites: a survey of approaches and future challenges

    IEEE Internet Comput

    (2007)
  • P. Holme et al.

    Growing scale-free networks with tunable clustering

    Phys Rev E

    (2002)
  • X. Hu et al.

    Social spammer detection in microblogging

  • X. Hu et al.

    Online social spammer detection

  • J. Jiang et al.

    Understanding latent interactions in online social networks

    ACM Trans Web

    (2013)
  • X. Jin et al.

    A data mining-based spam detection system for social media networks

    Proc VLDB Endowment

    (2011)
  • T. Joachims

    Text categorization with support vector machines: learning with many relevant features

    Sheng Wen received the degree in computer science from the Central South University of China in 2012, and the Ph.D. degree from the School of Information Technology, Deakin University, Australia, in 2015. Currently, he is a Senior Lecturer with Swinburne University of Technology. His focus is on modeling of virus spread, information dissemination, and defense strategies for the Internet threats. He is also interested in the techniques of identifying information sources in networks.

    Yang Xiang received his PhD in Computer Science from Deakin University, Australia. He is the Dean of Digital Research & Innovation Capability Platform, Swinburne University of Technology, Australia. His research interests include cyber security, which covers network and system security, data analytics, distributed systems, and networking. In particular, he is currently leading his team developing active defense systems against large-scale distributed network attacks. He is the Chief Investigator of several projects in network and system security, funded by the Australian Research Council (ARC).

    Wanlei Zhou received the B.Eng and M.Eng degrees from Harbin Institute of Technology, Harbin, China in 1982 and 1984, respectively, and the PhD degree from The Australian National University, Canberra, Australia, in 1991, all in Computer Science and Engineering. He also received a DSc degree (a higher Doctorate degree) from Deakin University in 2002. He is currently the Alfred Deakin Professor (the highest honour the University can bestow on a member of academic staff), Chair of Information Technology, and Associate Dean (International Research Engagement) of Faculty of Science, Engineering and Built Environment, Deakin University.