DOI: 10.1145/2184305.2184307

On measuring the lexical quality of the web

Published: 16 April 2012

ABSTRACT

In this paper we propose a measure for estimating the lexical quality of the Web, that is, the representational aspect of textual web content. Our lexical quality measure is based on a small corpus of spelling errors, and we apply it to English and Spanish. We first compute the correlation of our measure with web popularity measures to show that it gives independent information, and then we apply it to different web segments, including social media. Our results shed light on the lexical quality of the Web and show that authoritative websites have several orders of magnitude fewer misspellings than the overall Web. We also present an analysis of the geographical distribution of lexical quality throughout English- and Spanish-speaking countries, as well as how this measure changes over about one year.
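
As an illustration only, a measure of this kind can be sketched as follows. The abstract does not state the formula, so the sketch assumes one plausible form: for each word in a small set of known misspellings, take the ratio of the misspelled count to the correctly spelled count, then average the ratios. The function name `lexical_error_rate`, the word pairs, and all counts are hypothetical and are not data from the paper.

```python
from statistics import mean


def lexical_error_rate(counts):
    """Average ratio of misspelled to correctly spelled occurrences.

    `counts` maps a misspelling to a pair
    (misspelled_count, correct_count). Lower values suggest
    higher lexical quality. This is an assumed form of the
    measure, not the definition given by the authors.
    """
    ratios = [bad / good for bad, good in counts.values() if good > 0]
    if not ratios:
        raise ValueError("no word pairs with a nonzero correct-spelling count")
    return mean(ratios)


# Hypothetical counts for one web segment (not from the paper).
sample_counts = {
    "recieve": (20, 1000),    # 20 misspelled vs. 1000 correct
    "definately": (5, 500),   # 5 misspelled vs. 500 correct
}

rate = lexical_error_rate(sample_counts)
print(f"estimated error rate: {rate:.3f}")
```

Computing the same quantity separately for different web segments, for example authoritative sites versus the overall Web, would allow the kind of comparison the abstract describes.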


    • Published in

      ACM Other conferences
      WebQuality '12: Proceedings of the 2nd Joint WICOW/AIRWeb Workshop on Web Quality
      April 2012
      71 pages
      ISBN: 9781450312370
      DOI: 10.1145/2184305

      Copyright © 2012 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Qualifiers

      • research-article
