ABSTRACT
With more than 340 million messages posted on Twitter every day, both the amount of duplicate content and the demand for appropriate duplicate detection mechanisms are increasing tremendously. Yet little research aims at detecting near-duplicate content on microblogging platforms. We investigate the problem of near-duplicate detection on Twitter and introduce a framework that analyzes tweets by comparing (i) syntactical characteristics, (ii) semantic similarity, and (iii) contextual information. Our framework provides different duplicate detection strategies that, among others, make use of external Web resources referenced from microposts. We use machine learning to learn patterns that help identify duplicate content. We put our duplicate detection framework into practice by integrating it into Twinder, a search engine for Twitter streams. An in-depth analysis shows that it allows Twinder to diversify search results and improve the quality of Twitter search. We conduct extensive experiments in which we (1) evaluate the quality of different strategies for detecting duplicates, (2) analyze the impact of various features on duplicate detection, (3) investigate the quality of strategies that classify the exact level to which two microposts can be considered duplicates, and (4) optimize the process of identifying duplicate content on Twitter. Our results show that semantic features extracted by our framework can boost the performance of duplicate detection.
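To make the idea of comparing syntactical characteristics of two tweets concrete, the following is a minimal Python sketch. It is an illustration only and not the framework's implementation: the function names, the tokenizer, and the choice of word overlap and normalized edit distance as features are assumptions made for this example, and the abstract does not specify which syntactic features the framework uses.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase a tweet and split it into a set of tokens (hypothetical tokenizer)."""
    return set(re.findall(r"[a-z0-9#@']+", text.lower()))


def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of the two token sets, in [0, 1]."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 1.0  # two empty texts are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def syntactic_features(a: str, b: str) -> dict:
    """Assumed feature vector for a tweet pair; the feature set is illustrative."""
    longest = max(len(a), len(b))
    norm_dist = levenshtein(a, b) / longest if longest else 0.0
    return {"word_overlap": word_overlap(a, b),
            "norm_edit_distance": norm_dist}
```

A feature dictionary of this kind could be passed to any standard classifier to learn which tweet pairs count as duplicates. Semantic and contextual features would need additional resources and are not covered by this sketch.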
Index Terms
- Groundhog day: near-duplicate detection on Twitter