ABSTRACT
In this paper we propose a measure for estimating the lexical quality of the Web, that is, the representational aspect of textual web content. Our lexical quality measure is based on a small corpus of spelling errors, and we apply it to English and Spanish. We first compute the correlation of our measure with web popularity measures to show that it gives independent information, and then we apply it to different web segments, including social media. Our results shed light on the lexical quality of the Web and show that authoritative websites have several orders of magnitude fewer misspellings than the overall Web. We also present an analysis of the geographical distribution of lexical quality across English- and Spanish-speaking countries, as well as of how this measure changes over about one year.
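The abstract does not state the formula behind the measure, so the following is only a minimal sketch of one way a misspelling-based score could be computed. It assumes the score is the mean, over a list of word pairs, of the ratio between the occurrence count of a misspelling and the count of its correct form, so lower values mean better lexical quality. The function name `lexical_quality`, the word pairs, and all counts are hypothetical and are not taken from the paper.

```python
from statistics import mean


def lexical_quality(counts):
    """Mean ratio of misspelled to correct occurrence counts.

    `counts` maps (misspelled, correct) word pairs to a tuple
    (misspelled_count, correct_count). This definition is an
    assumption made for illustration, not the paper's formula.
    """
    # Skip pairs whose correct form never occurs, to avoid dividing by zero.
    ratios = [
        wrong / right
        for wrong, right in counts.values()
        if right > 0
    ]
    if not ratios:
        raise ValueError("no word pair has a nonzero correct-spelling count")
    return mean(ratios)


# Hypothetical occurrence counts for one web segment.
sample_counts = {
    ("recieve", "receive"): (5, 1000),
    ("seperate", "separate"): (20, 2000),
}
score = lexical_quality(sample_counts)
print(score)
```

Under this assumed definition, computing the score separately for each web segment (for example, a single site or a country domain) would make the segments directly comparable.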
Index Terms
- On measuring the lexical quality of the web