Groundhog day: near-duplicate detection on Twitter

Authors:
Ke Tao (Delft University of Technology, Delft, Netherlands),
Fabian Abel (XING AG, Hamburg, Germany),
Claudia Hauff (Delft University of Technology, Delft, Netherlands),
Geert-Jan Houben (Delft University of Technology, Delft, Netherlands),
Ujwal Gadiraju (Delft University of Technology, Delft, Netherlands)

WWW '13: Proceedings of the 22nd international conference on World Wide Web, May 2013, Pages 1273–1284
https://doi.org/10.1145/2488388.2488499

Published: 13 May 2013

ABSTRACT

With more than 340 million messages posted on Twitter every day, both the amount of duplicate content and the demand for appropriate duplicate detection mechanisms are increasing tremendously. Yet little research aims at detecting near-duplicate content on microblogging platforms. We investigate the problem of near-duplicate detection on Twitter and introduce a framework that analyzes tweets by comparing (i) syntactical characteristics, (ii) semantic similarity, and (iii) contextual information. Our framework provides different duplicate detection strategies that, among others, make use of external Web resources referenced from microposts. Machine learning is used to learn patterns that help identify duplicate content. We put our duplicate detection framework into practice by integrating it into Twinder, a search engine for Twitter streams. An in-depth analysis shows that it allows Twinder to diversify search results and improve the quality of Twitter search. We conduct extensive experiments in which we (1) evaluate the quality of different strategies for detecting duplicates, (2) analyze the impact of various features on duplicate detection, (3) investigate the quality of strategies that classify to what exact level two microposts can be considered duplicates, and (4) optimize the process of identifying duplicate content on Twitter. Our results show that semantic features extracted by our framework can boost the performance of duplicate detection.
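
The idea of comparing a pair of tweets on several kinds of evidence can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of Jaccard overlap, and the specific features (word, hashtag, and URL overlap as syntactic signals, posting-time gap as a contextual signal) are assumptions made here for illustration. Semantic similarity is left out because it would require external resources the abstract does not specify.

```python
import re
from typing import Dict, Set

# Hypothetical patterns for this sketch; real tweet tokenization is more involved.
URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"#\w+")
WORD_RE = re.compile(r"[a-z0-9']+")


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def words(text: str) -> Set[str]:
    """Lowercased word tokens, with URLs, hashtags, and mentions stripped."""
    stripped = re.sub(r"https?://\S+|[#@]\w+", " ", text.lower())
    return set(WORD_RE.findall(stripped))


def pair_features(t1: str, t2: str, seconds_apart: float) -> Dict[str, float]:
    """Illustrative feature vector for one tweet pair (assumed features)."""
    return {
        # Syntactic signals: surface overlap between the two texts.
        "word_overlap": jaccard(words(t1), words(t2)),
        "hashtag_overlap": jaccard(
            set(TAG_RE.findall(t1.lower())), set(TAG_RE.findall(t2.lower()))
        ),
        "url_overlap": jaccard(set(URL_RE.findall(t1)), set(URL_RE.findall(t2))),
        # Contextual signal: how far apart the two tweets were posted.
        "seconds_apart": float(seconds_apart),
    }
```

A vector like this, computed for labeled tweet pairs, could then be passed to any standard classifier to learn which feature combinations indicate duplicates.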

Index Terms

1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Document filtering
      2. Information extraction

Published in
WWW '13: Proceedings of the 22nd international conference on World Wide Web
May 2013
1628 pages
ISBN: 9781450320351
DOI: 10.1145/2488388
General Chairs:
Daniel Schwabe (PUC-Rio, Brazil),
Virgílio Almeida (UFMG, Brazil),
Hartmut Glaser (CGI.br, Brazil)
Program Chairs:
Ricardo Baeza-Yates (Yahoo! Labs, Spain & Chile),
Sue Moon (KAIST, South Korea)
Copyright © 2013 Copyright is held by the International World Wide Web Conference Committee (IW3C2).
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
diversification
duplicate detection
search
twitter
Qualifiers
- research-article
Acceptance Rates
WWW '13 paper acceptance rate: 125 of 831 submissions, 15%
Overall acceptance rate: 1,899 of 8,196 submissions, 23%