DOI: 10.1145/2488388.2488499
research-article

Groundhog day: near-duplicate detection on Twitter

Published: 13 May 2013

ABSTRACT

With more than 340 million messages posted on Twitter every day, the amount of duplicate content, as well as the demand for appropriate duplicate detection mechanisms, is increasing tremendously. Yet little research aims at detecting near-duplicate content on microblogging platforms. We investigate the problem of near-duplicate detection on Twitter and introduce a framework that analyzes tweets by comparing (i) syntactical characteristics, (ii) semantic similarity, and (iii) contextual information. Our framework provides different duplicate detection strategies that, among others, make use of external Web resources referenced from microposts. Machine learning is exploited to learn patterns that help identify duplicate content. We put our duplicate detection framework into practice by integrating it into Twinder, a search engine for Twitter streams. An in-depth analysis shows that it allows Twinder to diversify search results and improve the quality of Twitter search. We conduct extensive experiments in which we (1) evaluate the quality of different strategies for detecting duplicates, (2) analyze the impact of various features on duplicate detection, (3) investigate the quality of strategies that classify to what exact level two microposts can be considered duplicates, and (4) optimize the process of identifying duplicate content on Twitter. Our results show that semantic features extracted by our framework can boost the performance of duplicate detection.
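To make the idea of comparing tweet pairs on syntactical and contextual grounds more concrete, the following is a minimal sketch in Python. It is not the paper's feature set: the `Tweet` type, the function names, and the three features (normalized edit distance, word overlap, time gap) are hypothetical choices for illustration, since the abstract does not name specific features. Semantic features, which the abstract says rely on external resources, are left out.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    # Hypothetical minimal representation of a tweet (assumption).
    text: str
    timestamp: float  # seconds since epoch


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def word_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercased whitespace tokens."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def pair_features(t1: Tweet, t2: Tweet) -> dict:
    """Illustrative feature vector for one tweet pair (assumed features)."""
    max_len = max(len(t1.text), len(t2.text)) or 1
    return {
        # Syntactical: how far apart the raw strings are.
        "norm_edit_distance": levenshtein(t1.text, t2.text) / max_len,
        # Syntactical: how many words the two tweets share.
        "word_overlap": word_overlap(t1.text, t2.text),
        # Contextual: how far apart in time the tweets were posted.
        "time_gap_seconds": abs(t1.timestamp - t2.timestamp),
    }
```

In a setup like the one the abstract describes, feature vectors of this kind, computed for labeled tweet pairs, would be the input to a learned classifier; which classifier and which features work best is exactly what the paper's experiments evaluate.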


Published in

WWW '13: Proceedings of the 22nd international conference on World Wide Web
May 2013
1628 pages
ISBN: 9781450320351
DOI: 10.1145/2488388

Copyright © 2013 Copyright is held by the International World Wide Web Conference Committee (IW3C2).

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

WWW '13 paper acceptance rate: 125 of 831 submissions, 15%. Overall acceptance rate: 1,899 of 8,196 submissions, 23%.
