ABSTRACT
The old dream of a universal repository containing all of human knowledge and culture is becoming possible through the Internet and the Web. Moreover, this is happening with the direct collaborative participation of people. Wikipedia is a great example. It is an enormous repository of information with free access and editing, created by the community in a collaborative manner. However, this large amount of information, made available democratically and virtually without any control, raises questions about its relative quality. In this work we explore a significant number of quality indicators, some of them proposed by us and used here for the first time, and study their capability to assess the quality of Wikipedia articles. Furthermore, we explore machine learning techniques to combine these quality indicators into one single assessment judgment. Through experiments, we show that the most important quality indicators are the easiest ones to extract, namely, textual features related to length, structure and style. We were also able to determine which indicators did not contribute significantly to the quality assessment. These were, coincidentally, the most complex features, such as those based on link analysis. Finally, we compare our combination method with a state-of-the-art solution and show significant improvements in terms of effective quality prediction.
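The approach described above, extracting simple textual indicators and feeding them to a learned combination model, can be sketched in a few lines. The indicator names, the crude vowel-group syllable heuristic, and the linear stand-in for the learned model below are illustrative assumptions, not the paper's actual feature set or classifier:

```python
import re

def extract_indicators(text: str) -> dict:
    """Compute a few simple textual quality indicators
    (length, structure, readability) -- illustrative only."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(1, len(words))
    n_sents = max(1, len(sentences))

    def syllables(word: str) -> int:
        # Crude vowel-group heuristic, not a real syllabifier.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    # Flesch reading ease, using the approximate syllable count above.
    flesch = (206.835
              - 1.015 * n_words / n_sents
              - 84.6 * sum(syllables(w) for w in words) / n_words)
    return {
        "char_length": len(text),
        "word_count": len(words),
        "avg_sentence_length": n_words / n_sents,
        "section_count": text.count("\n== "),  # naive wiki-markup headings
        "flesch_reading_ease": flesch,
    }

def quality_score(indicators: dict, weights: dict) -> float:
    """Linear combination standing in for the learned model;
    in practice the weights would come from a trained regressor."""
    return sum(weights.get(k, 0.0) * v for k, v in indicators.items())
```

For example, `quality_score(extract_indicators(article_text), weights)` yields a single assessment judgment from the individual indicators, mirroring the combination step the paper studies with machine learning techniques.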