Twitter spam detection: Survey of new approaches and comparative study
Introduction
Online Social Networks (OSNs) are popular collaboration and communication tools for millions of Internet users. As a major social networking platform, Twitter attracts users by providing free microblogging services that let them broadcast or discover messages of up to 140 characters, follow other users and so on, through different devices such as mobile phones and desktops (Chu et al., 2012). Every day, millions of Twitter users share their moments or post their discoveries, such as breaking news, to their followers (Ghosh et al., 2012). However, the openness and convenience of the Twitter platform also attract criminal accounts (spammers), which attack the platform to make money illegitimately. These attacks include spam, scams and phishing (Adewole et al, 2017, Zhu et al, 2012). Because the length of tweets is restricted, it is common for spammers to broadcast unsolicited spam tweets, which can redirect users to external malicious websites (Lee and Kim, 2013). Compared to traditional spam spread through email, Twitter spam is more dangerous and more sophisticated in deceiving Internet users (Thomas et al., 2011). According to Grier et al. (2010), the click-through rate of Twitter spam reaches 0.13%, whereas that of email spam is only 0.0003% to 0.0006%, more than two orders of magnitude lower.
To address the problem of Twitter spam, many detection schemes have been put forward in recent years. Current Twitter spam detection methods fall into three main categories: detection based on Syntax Analysis, Feature Analysis and Blacklisting Techniques. As text is the only format Twitter users can use, many researchers focus on analysing tweet semantics to detect spam (Chu et al, 2012, Chu et al, 2012, Gao et al, 2010, Hu et al, 2013, Hu et al, 2014, Lee et al, 2011, Lee, Kim, 2012, Lee, Kim, 2013, Thomas et al, 2011, Wang et al, 2013, Wu et al, 2017, Wu et al, 2017, Yang et al, 2012, Yardi et al, 2009, Zhang et al, 2012). Further work extracts features from both the account and the message and applies statistical methods to them (i.e. Feature Analysis) (Ahmed, Abulaish, 2013, Benevenuto et al, 2010, Cao et al, 2012, Castillo et al, 2011, Chen et al, 2015, Chen et al, 2015, Chu et al, 2012, Costa et al, 2013, Egele et al, 2013, Gao et al, 2012, Ghosh et al, 2012, Grier et al, 2010, Hu et al, 2013, Hu et al, 2014, Jin et al, 2011, Lee et al, 2010, Lee, Kim, 2013, Liu et al, 2017, Liu et al, 2016, Sabottke et al, 2015, Sala et al, 2010, Song et al, 2011, Stringhini et al, 2010, Tan et al, 2013, Thomas et al, 2011, Wang, 2010, Yang et al, 2011, Yang et al, 2012, Yang et al, 2013, Yang et al, 2014, Zhang et al, 2012, Zhang et al, 2016, Zhu et al, 2012). In addition, researchers have relied on third-party services, such as blacklisting techniques, to block malicious information (Ghosh et al, 2012, Gilani et al, 2017, Grier et al, 2010, Ma et al, 2009, Ma et al, 2011, Zhang et al, 2012).
Why do we need this survey? Much effort has been devoted to developing effective Twitter spam detection methods. In the last three years in particular, several innovative techniques have been developed (Ahmed, Abulaish, 2013, Chen et al, 2015, Chen et al, 2015, Chen et al, 2015, Costa et al, 2013, Egele et al, 2013, Ghosh et al, 2013, Hu et al, 2013, Hu et al, 2014, Jiang et al, 2013, Lee, Kim, 2013, Liu et al, 2016, Liu et al, 2016, Oliver et al, 0000, Stringhini et al, 2013, Symantec, 2015, Tan et al, 2013, Wang et al, 2013, Yang et al, 2013, Yang et al, 2014, Zhang et al, 2016). The improvements brought by these newly developed methods cover almost all research issues in the Twitter spam detection field, such as feature selection mechanisms, sampling techniques and more accurate detection engines. It is therefore necessary to provide a survey that organises both past and new spam detection methods for future research. In this survey, we also run comparative studies among different detection techniques.
How to use this survey? The survey contains three parts in terms of usage. Firstly, we collect and list existing related techniques for a literature review. This part employs a taxonomy and a series of analyses to divide the state of the art into categories. Readers will obtain the details of each type of method and its pros and cons in Twitter spam detection. Secondly, we select typical methods in each branch of this research field and carry out comparative studies among them. This part provides readers with quantitative descriptions of current methods, especially of their advantages and weaknesses under different scenarios. Finally, we summarise the related work analysis and comparative studies, and identify open issues and potential solutions. The summary contributes to future research by presenting several follow-up research areas.
The rest of this survey is organised as follows. Section 2 presents the taxonomy of the state of the art, followed by Sections 3, 4 and 5, which analyse existing Twitter spam detection methods based on syntax analysis, feature analysis and blacklisting techniques respectively. Section 6 presents comparative studies of the above techniques. The summary and open issues are discussed in Section 7. Finally, we conclude this survey in Section 8.
Section snippets
Taxonomy
It is challenging to categorise current technologies into several parts. Because most methods borrow ideas from technologies that span different categories of Twitter spam detection, it is difficult to divide the state of the art so that each work falls exclusively into one category. Therefore, to be specific, we build up the taxonomy according to extraction and classification methods. Although this taxonomy creation method cannot avoid the case of one method dropping into
Literature review part I: detection based on syntax analysis
In this section, we analyse the detection methods based on syntax analysis and discuss their pros and cons. These methods can be categorised into two parts: 1) key segment and 2) tweet content. Before going into the details, we briefly analyse the generic Twitter spam detection framework. As shown in Fig. 2, Twitter posts are first collected by the data collection module. Suspicious tweets are then selected by a set of basic rules and used for feature extraction. For supervised
Literature review part II: detection based on feature analysis
In this section, we analyse detection methods based on feature analysis and discuss their pros and cons. We divide this category into two sub-categories: detection approaches based on statistical information and those based on the social graph.
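To make the idea of account-level and message-level statistical features concrete, the following is a minimal sketch. The `Tweet` fields and the particular features chosen here are illustrative assumptions, not the feature set of any specific method covered in this survey.

```python
import re
from dataclasses import dataclass


@dataclass
class Tweet:
    # Hypothetical record layout, assumed for illustration only.
    text: str
    followers: int
    followings: int
    account_age_days: int


URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")


def extract_features(tweet: Tweet) -> dict:
    """Return a flat dictionary of account- and message-based features."""
    total_links = tweet.followers + tweet.followings
    return {
        # Account-based features
        "followers": tweet.followers,
        "followings": tweet.followings,
        "follower_ratio": tweet.followers / total_links if total_links else 0.0,
        "account_age_days": tweet.account_age_days,
        # Message-based features
        "num_chars": len(tweet.text),
        "num_digits": sum(ch.isdigit() for ch in tweet.text),
        "num_urls": len(URL_RE.findall(tweet.text)),
        "num_hashtags": len(HASHTAG_RE.findall(tweet.text)),
        "num_mentions": len(MENTION_RE.findall(tweet.text)),
    }


if __name__ == "__main__":
    example = Tweet("Win a free phone http://spam.example/x #free #win @bob", 10, 990, 3)
    print(extract_features(example))
```

Feature vectors of this kind would then be passed to a statistical or machine learning classifier; which classifier to use is a separate design choice and is not fixed by this sketch.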
Literature review part III: blacklist
In this section, we analyse the detection methods that rely on third-party blacklisting techniques. We also discuss the pros and cons of this type of method. Blacklisting techniques are widely used in current works for Twitter spam detection or dataset labelling (Chu et al, 2012, Ghosh et al, 2012, Grier et al, 2010, Ma et al, 2009, Thomas et al, 2011, Zhang et al, 2012). In particular, the works (Lee, Kim, 2013, Zhang et al, 2016) were published in the last three years.
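As a rough illustration of blacklist-based filtering, the sketch below flags a tweet when any embedded URL points to a blacklisted domain. The in-memory `BLACKLISTED_DOMAINS` set and all function names are hypothetical stand-ins; the works cited above rely on third-party services, whose interfaces are not reproduced here.

```python
import re
from urllib.parse import urlsplit

# Hypothetical local blacklist, assumed for illustration; a deployed system
# would consult a third-party blacklisting service instead.
BLACKLISTED_DOMAINS = {"malicious.example", "phish.example"}

URL_RE = re.compile(r"https?://\S+")


def extract_hosts(text: str) -> list:
    """Return the lower-cased hostnames of all URLs found in the tweet text."""
    hosts = []
    for url in URL_RE.findall(text):
        host = urlsplit(url).hostname
        if host:
            hosts.append(host.lower())
    return hosts


def host_is_blacklisted(host: str, blacklist: set) -> bool:
    """Match the host itself or any of its parent domains against the blacklist."""
    parts = host.split(".")
    return any(".".join(parts[i:]) in blacklist for i in range(len(parts)))


def tweet_is_flagged(text: str, blacklist: set = BLACKLISTED_DOMAINS) -> bool:
    """Flag the tweet if at least one embedded URL resolves to a blacklisted domain."""
    return any(host_is_blacklisted(h, blacklist) for h in extract_hosts(text))


if __name__ == "__main__":
    print(tweet_is_flagged("Cheap pills http://shop.malicious.example/buy now"))
    print(tweet_is_flagged("Read this https://news.example/story"))
```

Note that this sketch inspects URLs only as they appear in the text; it does not resolve shortened or redirecting links.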
Detailed
Comparative study
In this section, we compare typical Twitter spam detection methods from several angles, such as features and performance. As the blacklisting techniques rely on third-party services to block spam, we mainly focus on comparing syntax-based methods with feature-analysis-based methods. The experimental results give a quantitative understanding of the pros and cons of the existing spam detection methods.
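A performance comparison of binary spam classifiers is typically expressed through metrics derived from the confusion matrix. The sketch below assumes precision, recall, F-measure and false positive rate as the metrics of interest; this excerpt does not state which metrics the comparative study actually reports, so the choice is an assumption.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels where 1 = spam and 0 = non-spam."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def evaluate(y_true, y_pred):
    """Return precision, recall, F-measure and false positive rate."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "fpr": fpr,
    }


if __name__ == "__main__":
    truth = [1, 1, 0, 0, 1]
    predicted = [1, 0, 0, 1, 1]
    print(evaluate(truth, predicted))
```

Computing the same metrics for every method on the same labelled dataset is what makes the comparison between methods meaningful.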
For the purpose of empirical study, a 10-day ground-truth dataset was
Summary and open issues
In this survey, we have discussed a series of Twitter spam detection methods, particularly the most recent ones published in the past three years. As the taxonomy and literature review show, the majority of current methods rely mainly on machine-learning-based techniques (e.g. supervised or unsupervised) to identify Twitter spamming activities. Among these machine-learning-based methods, the major differences lie in how and where features are collected. Therefore,
Conclusion
In this paper, we review the state of the art in Twitter spam detection techniques. We first categorised existing Twitter spam detection methods into three groups and discussed the pros and cons of every method. We further carried out comparative studies on typical methods, focusing mainly on performance comparison. We found that most current spam detection techniques are based on feature selection and machine learning classification. Finally, we made a brief summary and discussed
Tingmin Wu received the Bachelor of Information Technology degree (with first class Hons.) from Deakin University Australia in 2016. Currently, she is a PhD student with Swinburne University of Technology and CSIRO Data61, Australia. Her research interests include cyber security, especially in social spam detection.
References (97)
- et al. Malicious accounts: dark of the social networks. J Netw Comput Appl (2017)
- et al. A generic statistical approach for spam detection in online social networks. Comput Commun (2013)
- et al. Investigating the deceptive information in twitter spam. Future Gener Comput Syst (2017)
- Euclidean distance mapping. Comput Graph Image Process (1980)
- et al. Addressing the class imbalance problem in twitter spam detection using ensemble learning. Comput Secur (2017)
- The vector distance transform in two and three dimensions. CVGIP (1992)
- A measure of betweenness centrality based on random walks. Soc Networks (2005)
- Applications of Menger's graph theorem. J Math Anal Appl (1968)
- et al. Term-weighting approaches in automatic text retrieval. Inf Process Manage (1988)
- et al. Detecting spam and promoting campaigns in twitter. ACM Trans Web (2016)
- The feature quantity: an information theoretic perspective of Tfidf-like measures
- Detecting spammers on twitter
- An adaptive subspace clustering dimension reduction framework for time series indexing in Knime workflows
- Sentiment knowledge discovery in twitter streaming data
- Manhattan distance. Dictionary Algorithms Data Struct
- A survey of learning-based techniques of email spam filtering. Artif Intell Rev
- Class-based n-gram models of natural language. Comput linguistic
- Aiding the detection of fake accounts in large scale social online services
- Information credibility on twitter
- Twitter datdata homepage
- Asymmetric self-learning for tackling twitter spam drift
- 6 million spam tweets: A large ground truth for timely twitter spam detection
- A performance evaluation of machine learning-based streaming spam tweets detection. IEEE Trans Comput Soc Syst
- Spammers Are Becoming “Smarter” on Twitter. IT professional
- Who is tweeting on twitter: human, bot, or cyborg?
- Detecting social spam campaigns on twitter
- Detecting automation of twitter accounts: are you a human, bot, or cyborg? IEEE Trans Dependable Secure Comput
- Detecting tip spam in location-based social networks
- Compa: Detecting compromised accounts on social networks
- Liblinear: a library for large linear classification. J Mach Learn Res
- The elements of statistical learning
- Detecting and characterizing social spam campaigns
- Towards online spam filtering in social networks
- Understanding and combating link farming in the twitter social network
- On sampling the wisdom of crowds: random vs. expert sampling of the twitter stream
- Do bots impact twitter activity?
- Google safe browsing api
- @ spam: the underground on 140 characters or less
- The WEKA data mining software: an update. ACM SIGKDD Explor Newsl
- Spam 2.0 state of the art
- Fighting spam on social web sites: a survey of approaches and future challenges. IEEE Internet Comput
- Growing scale-free networks with tunable clustering. Phys Rev E
- Social spammer detection in microblogging
- Online social spammer detection
- Understanding latent interactions in online social networks. ACM Trans Web
- A data mining-based spam detection system for social media networks. Proc VLDB Endowment
- Text categorization with support vector machines: learning with many relevant features
Sheng Wen received the degree in computer science from the Central South University of China in 2012, and the Ph.D. degree from the School of Information Technology, Deakin University, Australia, in 2015. Currently, he is a Senior Lecturer with Swinburne University of Technology. His focus is on modeling of virus spread, information dissemination, and defense strategies for the Internet threats. He is also interested in the techniques of identifying information sources in networks.
Yang Xiang received his PhD in Computer Science from Deakin University, Australia. He is the Dean of Digital Research & Innovation Capability Platform, Swinburne University of Technology, Australia. His research interests include cyber security, which covers network and system security, data analytics, distributed systems, and networking. In particular, he is currently leading his team developing active defense systems against large-scale distributed network attacks. He is the Chief Investigator of several projects in network and system security, funded by the Australian Research Council (ARC).
Wanlei Zhou received the B.Eng and M.Eng degrees from Harbin Institute of Technology, Harbin, China in 1982 and 1984, respectively, and the PhD degree from The Australian National University, Canberra, Australia, in 1991, all in Computer Science and Engineering. He also received a DSc degree (a higher Doctorate degree) from Deakin University in 2002. He is currently the Alfred Deakin Professor (the highest honour the University can bestow on a member of academic staff), Chair of Information Technology, and Associate Dean (International Research Engagement) of Faculty of Science, Engineering and Built Environment, Deakin University.