
Predicting quality flaws in user-generated content: the case of Wikipedia

Authors:
Maik Anderka, Bauhaus-Universität Weimar, Weimar, Germany
Benno Stein, Bauhaus-Universität Weimar, Weimar, Germany
Nedim Lipka, Bauhaus-Universität Weimar, Weimar, Germany
SIGIR '12: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, August 2012, Pages 981–990
https://doi.org/10.1145/2348283.2348413

Published: 12 August 2012

ABSTRACT

The detection and improvement of low-quality information is a key concern in Web applications that are based on user-generated content; a popular example is the online encyclopedia Wikipedia. Existing research on quality assessment of user-generated content deals with the classification as to whether the content is high-quality or low-quality. This paper goes one step further: it targets the prediction of quality flaws, this way providing specific indications in which respects low-quality content needs improvement. The prediction is based on user-defined cleanup tags, which are commonly used in many Web applications to tag content that has some shortcomings. We apply this approach to the English Wikipedia, which is the largest and most popular user-generated knowledge source on the Web. We present an automatic mining approach to identify the existing cleanup tags, which provides us with a training corpus of labeled Wikipedia articles. We argue that common binary or multiclass classification approaches are ineffective for the prediction of quality flaws and hence cast quality flaw prediction as a one-class classification problem. We develop a quality flaw model and employ a dedicated machine learning approach to predict Wikipedia's most important quality flaws. Since in the Wikipedia setting the acquisition of significant test data is intricate, we analyze the effects of a biased sample selection. In this regard we illustrate the classifier effectiveness as a function of the flaw distribution in order to cope with the unknown (real-world) flaw-specific class imbalances. The flaw prediction performance is evaluated with 10,000 Wikipedia articles that have been tagged with the ten most frequent quality flaws: provided test data with little noise, four flaws can be detected with a precision close to 1.
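The abstract's point about classifier effectiveness depending on the flaw distribution can be illustrated with a short sketch. This is not the authors' evaluation code. It assumes a flaw classifier whose true-positive rate and false-positive rate stay fixed, and computes the expected precision for different shares of flawed articles. The function names and the example rates are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def precision_at_prior(tpr: float, fpr: float, flaw_prior: float) -> float:
    """Expected precision when a fraction `flaw_prior` of all articles has the flaw.

    tpr: share of flawed articles the classifier flags (assumed fixed).
    fpr: share of flawless articles the classifier flags by mistake (assumed fixed).
    """
    if not 0.0 <= flaw_prior <= 1.0:
        raise ValueError("flaw_prior must lie in [0, 1]")
    true_positives = tpr * flaw_prior
    false_positives = fpr * (1.0 - flaw_prior)
    flagged = true_positives + false_positives
    if flagged == 0.0:
        return 0.0  # nothing is flagged, so precision is reported as 0 here
    return true_positives / flagged


def precision_curve(tpr: float, fpr: float, priors: List[float]) -> List[Tuple[float, float]]:
    """Pair each assumed flaw prior with the resulting expected precision."""
    return [(p, precision_at_prior(tpr, fpr, p)) for p in priors]


if __name__ == "__main__":
    # Hypothetical rates; real values would come from an evaluation on tagged articles.
    for prior, precision in precision_curve(0.9, 0.05, [0.5, 0.2, 0.1, 0.05, 0.01]):
        print(f"flaw prior {prior:.2f} -> expected precision {precision:.3f}")
```

With the rates held fixed, precision falls as the flaw becomes rarer, which is why a result measured on a balanced test sample cannot be transferred directly to a setting where the real flaw frequency is unknown.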


Index Terms

1. Human-centered computing
  1. Collaborative and social computing
    1. Collaborative and social computing design and evaluation methods
2. Information systems
  1. Information retrieval

Published in
SIGIR '12: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
August 2012
1236 pages
ISBN: 9781450314725
DOI: 10.1145/2348283
General Chair: William Hersh (Oregon Health & Science University, USA)
Program Chairs: Jamie Callan (Carnegie Mellon University, USA), Yoelle Maarek (Yahoo! Research, Israel), Mark Sanderson (Royal Melbourne Institute of Technology, Australia)
Copyright © 2012 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
information quality
one-class classification
quality flaw prediction
user-generated content analysis
wikipedia
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%