DOI: 10.1145/2348283.2348413
research-article

Predicting quality flaws in user-generated content: the case of Wikipedia

Published: 12 August 2012

ABSTRACT

The detection and improvement of low-quality information is a key concern in Web applications that are based on user-generated content; a popular example is the online encyclopedia Wikipedia. Existing research on quality assessment of user-generated content deals with the classification as to whether the content is high-quality or low-quality. This paper goes one step further: it targets the prediction of quality flaws, this way providing specific indications in which respects low-quality content needs improvement. The prediction is based on user-defined cleanup tags, which are commonly used in many Web applications to tag content that has some shortcomings. We apply this approach to the English Wikipedia, which is the largest and most popular user-generated knowledge source on the Web. We present an automatic mining approach to identify the existing cleanup tags, which provides us with a training corpus of labeled Wikipedia articles. We argue that common binary or multiclass classification approaches are ineffective for the prediction of quality flaws and hence cast quality flaw prediction as a one-class classification problem. We develop a quality flaw model and employ a dedicated machine learning approach to predict Wikipedia's most important quality flaws. Since in the Wikipedia setting the acquisition of significant test data is intricate, we analyze the effects of a biased sample selection. In this regard we illustrate the classifier effectiveness as a function of the flaw distribution in order to cope with the unknown (real-world) flaw-specific class imbalances. The flaw prediction performance is evaluated with 10,000 Wikipedia articles that have been tagged with the ten most frequent quality flaws: provided test data with little noise, four flaws can be detected with a precision close to 1.
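The dependence of classifier effectiveness on the flaw distribution can be illustrated with a small calculation. The following is a minimal sketch, not the authors' evaluation code: it assumes a classifier whose true positive rate and false positive rate stay fixed, and computes the expected precision for a given share of flawed articles. The function name `precision_at_flaw_ratio` and the example rates (0.9 and 0.05) are hypothetical and chosen only for illustration.

```python
def precision_at_flaw_ratio(tpr: float, fpr: float, flaw_ratio: float) -> float:
    """Expected precision of a flaw classifier for a given class distribution.

    tpr        -- true positive rate (recall) on flawed articles
    fpr        -- false positive rate on flawless articles
    flaw_ratio -- assumed fraction of articles that actually have the flaw
    """
    for name, value in (("tpr", tpr), ("fpr", fpr), ("flaw_ratio", flaw_ratio)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")

    true_positives = tpr * flaw_ratio            # flawed articles that are flagged
    false_positives = fpr * (1.0 - flaw_ratio)   # flawless articles that are flagged
    flagged = true_positives + false_positives
    if flagged == 0.0:
        return 0.0  # nothing is flagged, so precision is undefined; report 0
    return true_positives / flagged


# Hypothetical rates: the same classifier looks very different as flaws get rarer.
for ratio in (0.5, 0.1, 0.01):
    print(f"flaw ratio {ratio}: precision {precision_at_flaw_ratio(0.9, 0.05, ratio):.3f}")
```

Under these assumed rates, precision drops sharply as the flaw becomes rarer, which is why a result measured on a balanced sample cannot be carried over directly to the unknown real-world distribution.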

      • Published in
        SIGIR '12: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
        August 2012
        1236 pages
        ISBN:9781450314725
        DOI:10.1145/2348283

        Copyright © 2012 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 792 of 3,983 submissions, 20%
