Campaign extraction from social media

Published: 03 January 2014

Abstract

In this manuscript, we study the problem of detecting coordinated free-text campaigns in large-scale social media. These campaigns, which range from coordinated spam messages to promotional and advertising campaigns to political astro-turfing, are growing in significance and reach with the commensurate rise of massive-scale social systems. Specifically, we propose and evaluate a content-driven framework for effectively linking free-text posts with common “talking points” and extracting campaigns from large-scale social media. The campaign extraction framework has three salient features: (i) we investigate graph mining techniques for isolating coherent campaigns from large message-based graphs; (ii) we conduct a comprehensive comparative study of text-based message correlation at the message and user levels; and (iii) we analyze the temporal behaviors of various campaign types. Through an experimental study over millions of Twitter messages, we identify five major types of campaigns (Spam, Promotion, Template, News, and Celebrity campaigns) and show how these campaigns may be extracted with high precision and recall.
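The general idea of linking posts by content and then mining the resulting message graph can be illustrated with a small sketch. This is not the authors' method, which the abstract does not specify. It assumes word-level shingles with Jaccard similarity as the message correlation measure and connected components as the graph mining step. All function names and the parameters `k`, `threshold`, and `min_size` are hypothetical choices made for illustration only.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a message (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_message_graph(messages, threshold=0.5):
    """Link every pair of messages whose shingle similarity meets the threshold."""
    sets = [shingles(m) for m in messages]
    graph = {i: set() for i in range(len(messages))}
    for i, j in combinations(range(len(messages)), 2):
        if jaccard(sets[i], sets[j]) >= threshold:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def candidate_campaigns(graph, min_size=2):
    """Return connected components containing at least min_size messages."""
    seen, campaigns = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        if len(component) >= min_size:
            campaigns.append(sorted(component))
    return campaigns
```

The all-pairs comparison is quadratic in the number of messages, so a collection of millions of posts would need a scalable candidate-generation step in front of it. Connected components are also the loosest possible notion of a coherent group. Denser structures such as cliques would give tighter campaign candidates at a higher computational cost.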

Published in

ACM Transactions on Intelligent Systems and Technology, Volume 5, Issue 1
Special Section on Intelligent Mobile Knowledge Discovery and Management Systems and Special Issue on Social Web Mining
December 2013
520 pages
ISSN: 2157-6904
EISSN: 2157-6912
DOI: 10.1145/2542182

Copyright © 2014 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 3 January 2014
• Revised: 1 September 2012
• Accepted: 1 September 2012
• Received: 1 February 2012

Qualifiers

• research-article
• Research
• Refereed
