Research article · DOI: 10.1145/1555400.1555449

Automatic quality assessment of content created collaboratively by web communities: a case study of Wikipedia

Published: 15 June 2009

ABSTRACT

The old dream of a universal repository containing all human knowledge and culture is becoming possible through the Internet and the Web, and it is happening with the direct, collaborative participation of people. Wikipedia is a prime example: an enormous repository of information that anyone can freely access and edit, built by its community in a collaborative manner. However, this large amount of information, made available democratically and with virtually no editorial control, raises questions about its quality. In this work we explore a significant number of quality indicators, some of them proposed by us and used here for the first time, and study their capability to assess the quality of Wikipedia articles. Furthermore, we explore machine learning techniques to combine these quality indicators into a single assessment judgment. Through experiments, we show that the most important quality indicators are also the easiest to extract, namely, textual features related to length, structure, and style. We were also able to determine which indicators did not contribute significantly to the quality assessment; these were, coincidentally, the most complex features, such as those based on link analysis. Finally, we compare our combination method with a state-of-the-art solution and show significant improvements in quality prediction effectiveness.
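The abstract describes a two-step pipeline: extract quality indicators from each article's text, then combine them with a learned model into a single quality score. The sketch below illustrates that idea in Python under assumptions not stated in the abstract: the specific indicator set (word count, sentence count, average sentence and word length, paragraph count) and the learner (SVM regression via scikit-learn) are illustrative choices only, not the authors' actual features or model.

import re

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def textual_indicators(text: str) -> list[float]:
    """Length-, structure-, and style-related indicators for one article (illustrative set)."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)
    n_sents = max(len(sentences), 1)
    n_chars = sum(len(w) for w in words)
    return [
        float(n_words),                 # length: total number of words
        float(n_sents),                 # structure: number of sentences
        n_words / n_sents,              # style: average sentence length
        n_chars / n_words,              # style: average word length
        float(text.count("\n\n") + 1),  # structure: rough paragraph count
    ]


def train_quality_model(articles: list[str], quality_labels: list[float]):
    """Fit a regressor that maps indicator vectors to a single quality score.

    The paper only says "machine learning techniques"; SVM regression is
    assumed here purely as one plausible example.
    """
    X = [textual_indicators(a) for a in articles]
    model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
    model.fit(X, quality_labels)
    return model

Given such a trained model, a new article would be scored with model.predict([textual_indicators(new_article_text)]); richer structure, style, and link-analysis indicators would simply extend the feature vector.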

Published in

JCDL '09: Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries
June 2009, 502 pages
ISBN: 9781605583228
DOI: 10.1145/1555400

Copyright © 2009 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 15 June 2009


Qualifiers

research-article

Acceptance Rates

Overall acceptance rate: 415 of 1,482 submissions (28%)
