ABSTRACT
The old dream of a universal repository containing all of human knowledge and culture is becoming possible through the Internet and the Web. Moreover, this is happening with the direct collaborative participation of people. Wikipedia is a great example. It is an enormous repository of information with free access and editing, created by the community in a collaborative manner. However, this large amount of information, made available democratically and virtually without any control, raises questions about its relative quality. In this work we explore a significant number of quality indicators, some of them proposed by us and used here for the first time, and study their capability to assess the quality of Wikipedia articles. Furthermore, we explore machine learning techniques to combine these quality indicators into one single assessment judgment. Through experiments, we show that the most important quality indicators are the easiest ones to extract, namely, textual features related to length, structure and style. We were also able to determine which indicators did not contribute significantly to the quality assessment. These were, coincidentally, the most complex features, such as those based on link analysis. Finally, we compare our combination method with a state-of-the-art solution and show significant improvements in terms of effective quality prediction.
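The approach described above, extracting simple textual indicators and feeding them to a learned combination model, can be sketched in a few lines. The indicator names, the crude vowel-group syllable heuristic, and the linear stand-in for the learned model below are illustrative assumptions, not the paper's actual feature set or classifier:

```python
import re

def extract_indicators(text: str) -> dict:
    """Compute a few simple textual quality indicators
    (length, structure, readability) -- illustrative only."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(1, len(words))
    n_sents = max(1, len(sentences))

    def syllables(word: str) -> int:
        # Crude vowel-group heuristic, not a real syllabifier.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    # Flesch reading ease, using the approximate syllable count above.
    flesch = (206.835
              - 1.015 * n_words / n_sents
              - 84.6 * sum(syllables(w) for w in words) / n_words)
    return {
        "char_length": len(text),
        "word_count": len(words),
        "avg_sentence_length": n_words / n_sents,
        "section_count": text.count("\n== "),  # naive wiki-markup headings
        "flesch_reading_ease": flesch,
    }

def quality_score(indicators: dict, weights: dict) -> float:
    """Linear combination standing in for the learned model;
    in practice the weights would come from a trained regressor."""
    return sum(weights.get(k, 0.0) * v for k, v in indicators.items())
```

For example, `quality_score(extract_indicators(article_text), weights)` yields a single assessment judgment from the individual indicators, mirroring the combination step the paper studies with machine learning techniques.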