ABSTRACT
The detection and improvement of low-quality information is a key concern in Web applications that are based on user-generated content; a popular example is the online encyclopedia Wikipedia. Existing research on quality assessment of user-generated content focuses on classifying content as either high-quality or low-quality. This paper goes one step further: it targets the prediction of quality flaws, thereby indicating specifically in which respects low-quality content needs improvement. The prediction is based on user-defined cleanup tags, which are commonly used in many Web applications to mark content that has some shortcomings. We apply this approach to the English Wikipedia, which is the largest and most popular user-generated knowledge source on the Web. We present an automatic mining approach to identify the existing cleanup tags, which provides us with a training corpus of labeled Wikipedia articles. We argue that common binary or multiclass classification approaches are ineffective for the prediction of quality flaws and hence cast quality flaw prediction as a one-class classification problem. We develop a quality flaw model and employ a dedicated machine learning approach to predict Wikipedia's most important quality flaws. Since the acquisition of significant test data is difficult in the Wikipedia setting, we analyze the effects of a biased sample selection. In this regard, we report classifier effectiveness as a function of the flaw distribution in order to cope with the unknown (real-world) flaw-specific class imbalances. The flaw prediction performance is evaluated with 10,000 Wikipedia articles that have been tagged with the ten most frequent quality flaws: provided that the test data contains little noise, four flaws can be detected with a precision close to 1.
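The dependence of classifier effectiveness on the flaw distribution can be illustrated with a small worked example. The sketch below is not the paper's method or code; it only applies the standard relation between precision, true positive rate, false positive rate, and class prior. The function name `precision_at_prior` and the rates used in the demo are hypothetical values chosen for illustration.

```python
def precision_at_prior(tpr: float, fpr: float, prior: float) -> float:
    """Expected precision when a fraction `prior` of all articles has the flaw.

    tpr:   true positive rate of the classifier (recall on flawed articles)
    fpr:   false positive rate (share of flawless articles wrongly flagged)
    prior: assumed share of flawed articles in the real-world collection
    """
    if not 0.0 <= prior <= 1.0:
        raise ValueError("prior must lie in [0, 1]")
    true_pos = tpr * prior            # flawed articles that are flagged
    false_pos = fpr * (1.0 - prior)   # flawless articles that are flagged
    flagged = true_pos + false_pos
    # If nothing is flagged, precision is undefined; report 0.0 by convention.
    return true_pos / flagged if flagged > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical rates, as they might be measured on a balanced test sample.
    tpr, fpr = 0.9, 0.05
    for prior in (0.5, 0.1, 0.01, 0.001):
        p = precision_at_prior(tpr, fpr, prior)
        print(f"flaw ratio {prior:>6}: precision {p:.3f}")
```

With fixed rates, precision drops as the flaw becomes rarer, which is why a result measured on a balanced sample cannot be carried over to an unknown real-world flaw ratio without such a sweep.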
Index Terms
- Predicting quality flaws in user-generated content: the case of Wikipedia