Abstract
In this manuscript, we study the problem of detecting coordinated free-text campaigns in large-scale social media. These campaigns, which range from coordinated spam messages to promotional and advertising campaigns to political astro-turfing, are growing in significance and reach with the commensurate rise of massive-scale social systems. Specifically, we propose and evaluate a content-driven framework for effectively linking free-text posts with common “talking points” and extracting campaigns from large-scale social media. Three salient features of the campaign extraction framework are: (i) we investigate graph mining techniques for isolating coherent campaigns from large message-based graphs; (ii) we conduct a comprehensive comparative study of text-based message correlation at the message and user levels; and (iii) we analyze the temporal behaviors of various campaign types. Through an experimental study over millions of Twitter messages, we identify five major types of campaigns (Spam, Promotion, Template, News, and Celebrity campaigns) and show how these campaigns may be extracted with high precision and recall.
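The idea of linking posts that share "talking points" into a message graph and then isolating campaigns can be illustrated with a minimal sketch. This is not the framework evaluated in the manuscript: it assumes word-bigram shingles with Jaccard similarity as the message correlation measure and uses connected components as a simple stand-in for the graph mining step. The function names, the `threshold` and `min_size` parameters, and the sample messages are all hypothetical.

```python
from itertools import combinations


def shingles(text, k=2):
    """Return the set of k-word shingles of a message (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def extract_campaigns(messages, threshold=0.5, min_size=3):
    """Group near-duplicate messages into candidate campaigns.

    Assumed pipeline (illustrative only): connect every pair of messages
    whose shingle Jaccard similarity is >= threshold, then return the
    connected components with at least min_size messages, each as a
    sorted list of message indices.
    """
    sets = [shingles(m) for m in messages]
    neighbors = {i: set() for i in range(len(messages))}
    # All-pairs comparison is O(n^2); acceptable only at toy scale.
    for i, j in combinations(range(len(messages)), 2):
        if jaccard(sets[i], sets[j]) >= threshold:
            neighbors[i].add(j)
            neighbors[j].add(i)

    seen, campaigns = set(), []
    for start in range(len(messages)):
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        if len(component) >= min_size:
            campaigns.append(sorted(component))
    return campaigns


# Hypothetical sample messages: three variants of one pitch plus two unrelated posts.
sample = [
    "win a free phone now at example.com",
    "win a free phone today at example.com",
    "win a free phone now at example.com !!",
    "lovely weather in austin this morning",
    "reading a good book tonight",
]
print(extract_campaigns(sample))  # → [[0, 1, 2]]
```

Connected components are the loosest possible notion of a campaign, since one borderline edge can merge unrelated groups. A stricter cohesion criterion, such as requiring dense subgraphs or cliques, would trade recall for precision.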
Index Terms
- Campaign extraction from social media