The WaCky wide web: a collection of very large linguistically processed web-crawled corpora

Language Resources and Evaluation

Abstract

This article introduces ukWaC, deWaC and itWaC, three very large corpora of English, German, and Italian built by web crawling, and describes the methodology and tools used in their construction. The corpora contain more than a billion words each, and are thus among the largest resources for the respective languages. The paper also provides an evaluation of their suitability for linguistic research, focusing on ukWaC and itWaC. A comparison in terms of lexical coverage with existing resources for the languages of interest produces encouraging results. Qualitative evaluation of ukWaC versus the British National Corpus was also conducted, so as to highlight differences in corpus composition (text types and subject matters). The article concludes with practical information about format and availability of corpora and tools.

Notes

  1. http://www.dmoz.org.

  2. http://www.catedratelefonica.upf.es/cucweb.

  3. http://corpus.leeds.ac.uk/internet.html.

  4. The Google API facility was used in the construction of deWaC, itWaC and ukWaC. While this functionality is no longer offered to new users, similar ones are offered by, e.g., Microsoft Live Search and Yahoo!.

  5. http://wordlist.sourceforge.net/.

  6. http://www.bardito.com/language/italian_english_wordlist.html and http://homepage.bluewin.ch/cusipage/.

  7. The slightly different seed construction strategy used for Italian is not by design. It is an alternative course of action due to different people performing the procedure for different languages at different times.

  8. http://crawler.archive.org/.

  9. http://web.archive.org/web/*/www.smi.ucd.ie/hyppia/; our re-implementation of the Hyppia method is also available for download (see Sect. 5).

  10. These experiments were conducted by Jan Pomikálek, whose contribution we gratefully acknowledge.

  11. http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/.

  12. http://sslmit.unibo.it/morphit.

  13. http://www.natcorp.ox.ac.uk.

  14. http://sslmit.unibo.it/repubblica.

  15. Notice that here we are estimating noise based on the type lists only. Clearly, we cannot rule out the possibility that concordances from the web corpora are also less informative due to a larger presence of duplicates. More thorough evaluations of the noise rates are needed to shed light on this issue. On the other hand, depending on the uses we envisage for the corpora, we might in fact be overestimating noise: for instance, non-standard spellings would have linguistic relevance, e.g., if one were interested in grammaticalization and language change, or in collecting statistics to train spellcheckers.

  16. Given two corpora C+ and C, drawn from the same population of words, with C+ N times the size of C, a word occurring n times in C should occur about Nn times in C+. Both WaCky corpora are more than 10 times larger than their reference counterparts. Thus, if the WaCky and reference corpora were random samples from the same populations, enrichment as defined in the text should be trivially at 100% when going from reference to WaCky. However, we know that the WaCky corpora are sampling from rather different populations than the ones of the BNC and la Repubblica, and thus the fact that the enrichment proportion is very high is a non-trivial positive result, indicating that, despite the different sample source, the WaCky corpora also contain occurrences of the same words encountered in traditional corpora, in larger proportions than in the latter.

  17. http://www.urbandictionary.com/define.php?term=wacky+backy.

  18. Full lists are available from the WaCky site (see Sect. 5). A more extensive analysis, that also covers adjectives, verbs and function words, is presented in Ferraresi (2007).

  19. http://wacky.sslmit.unibo.it.

  20. http://cwb.sourceforge.net/.

  21. http://www.sketchengine.co.uk/.
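
As a rough illustration of the scaling argument in note 16: if two corpora were random samples from the same population, a word seen n times in the smaller corpus would be expected about Nn times in a corpus N times larger. The Python sketch below only spells out that arithmetic. The function name and the token counts are hypothetical, chosen for illustration, and are not taken from the article.

```python
def expected_count_in_larger(n_small: int, tokens_small: int, tokens_large: int) -> float:
    """Expected count of a word in the larger corpus, assuming both corpora
    are random samples from the same population of words (note 16)."""
    if tokens_small <= 0 or tokens_large <= 0:
        raise ValueError("corpus sizes must be positive")
    size_ratio = tokens_large / tokens_small  # N: how many times larger C+ is than C
    return size_ratio * n_small               # about N * n occurrences


# Hypothetical sizes (not from the article): 100 million vs. 1 billion tokens, so N = 10.
example = expected_count_in_larger(3, 100_000_000, 1_000_000_000)
print(example)
```

Under these assumed sizes, a word attested three times in the smaller corpus would be expected roughly thirty times in the larger one. The note's point is that this expectation holds only for samples of the same population, which the web corpora and the reference corpora are known not to be.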

References

  • Baayen, R. H. (2001). Word frequency distributions. Dordrecht: Kluwer.

  • Baroni, M., & Bernardini, S. (Eds.). (2006). Wacky! Working papers on the web as corpus. Bologna: Gedit.

  • Baroni, M., & Kilgarriff, A. (2006). Large linguistically-processed web corpora for multiple languages. In Proceedings of the 11th conference of the European chapter of the association for computational linguistics, Trento, Italy, pp. 87–90.

  • Baroni, M., & Ueyama, M. (2006). Building general- and special-purpose corpora by web crawling. In Proceedings of the 13th NIJL international symposium, language corpora: Their compilation and application, Tokyo, Japan, pp. 31–40.

  • Boleda, G., Bott, S., Meza, R., Castillo, C., Badia, T., & López, V. (2006). CUCWeb: A Catalan corpus built from the web. In Kilgarriff and Baroni (2006), pp. 19–26.

  • Brants, T., & Franz, A. (2006). Web 1T 5-gram, version 1. Philadelphia: Linguistic Data Consortium.

  • Broder, A., Glassman, S., Manasse, M., & Zweig, G. (1997). Syntactic clustering of the web. In Proceedings of the sixth international world wide web conference, Santa Clara, California, pp. 391–404.

  • Ciaramita, M., & Baroni, M. (2006). Measuring web corpus randomness: A progress report. In Baroni and Bernardini (2006), pp. 127–158.

  • Clarke, C., Cormack, G., Laszlo, M., Lynam, T., & Terra, E. (2002). The impact of corpus size on question answering performance. In Proceedings of the 25th annual international ACM SIGIR conference on research and development in information retrieval, Tampere, Finland, pp. 369–370.

  • Clarke, C., Craswell, N., & Soboroff, I. (2005). The TREC terabyte retrieval track. SIGIR Forum, 39(1), 25.

  • Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74.

  • Emerson, T., & O’Neil, J. (2006). Experience building a large corpus for Chinese lexicon construction. In Baroni and Bernardini (2006), pp. 41–62.

  • Fairon, C., Naets, H., Kilgarriff, A., & de Schryver, G.-M. (Eds.). (2007). Building and exploring web corpora. In Proceedings of the 3rd web as corpus workshop, incorporating Cleaneval. Louvain: Presses Universitaires de Louvain.

  • Ferraresi, A. (2007). Building a very large corpus of English obtained by web crawling: ukWaC. MA Dissertation, University of Bologna. Retrieved January 28, 2008, from http://wacky.sslmit.unibo.it

  • Fletcher, W. (2004). Making the web more useful as a source for linguistic corpora. In U. Connor & T. Upton (Eds.), Corpus linguistics in North America 2002 (pp. 191–205). Amsterdam: Rodopi.

  • Hundt, M., Nesselhauf, N., & Biewer, C. (Eds.). (2007). Corpus linguistics and the web. Amsterdam: Rodopi.

  • Kilgarriff, A., & Baroni, M. (Eds.). (2006). Proceedings of the 2nd international workshop on the web as corpus. East Stroudsburg, PA: ACL.

  • Kilgarriff, A., & Grefenstette, G. (2003). Introduction to the special issue on the web as corpus. Computational Linguistics, 29(3), 333–347.

  • Kornai, A., Halácsy, P., Nagy, V., Oravecz, C., Trón, V., & Varga, D. (2006). Web-based frequency dictionaries for medium density languages. In Kilgarriff and Baroni (2006), pp. 1–8.

  • Lee, D. (2001). Genres, registers, text types, domains, and styles: Clarifying the concepts and navigating a path through the BNC jungle. Language Learning & Technology, 5(3), 37–72.

  • Liu, V., & Curran, J. (2006). Web text corpus for natural language processing. In Proceedings of the 11th conference of the European chapter of the association for computational linguistics, Trento, Italy, pp. 233–240.

  • Santini, M., & Sharoff, S. (Eds.). (2007). Proceedings of the CL 2007 colloquium: Towards a reference corpus of web genres, Birmingham, UK.

  • Shaoul, C., & Westbury, C. (2007). A USENET corpus (2005–2007). Retrieved January 28, 2008, from http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html

  • Sharoff, S. (2006). Creating general-purpose corpora using automated search engine queries. In Baroni and Bernardini (2006), pp. 63–98.

  • Sinclair, J. McH. (1996). The search for units of meaning. Textus, 9(1), 71–106.

  • Sinclair, J. McH. (2005). Corpus and text—Basic principles. In M. Wynne (Ed.), Developing linguistic corpora: A guide to good practice (pp. 1–16). Oxford: Oxbow Books.

  • Thelwall, M. (2005). Creating and using web corpora. International Journal of Corpus Linguistics, 10(4), 517–541.

  • Ueyama, M. (2006). Evaluation of Japanese web-based reference corpora: Effects of seed selection and time interval. In Baroni and Bernardini (2006), pp. 99–126.

Acknowledgements

We would like to thank the members of the World Wide WaCky community for many useful interactions, in particular: Sara Castagnoli, Tom Emerson, Stefan Evert, Bill Fletcher, Federico Gaspari, Adam Kilgarriff, Jan Pomikálek and Serge Sharoff. We would also like to thank the LREJ reviewers for useful comments.

Author information

Correspondence to Marco Baroni.

About this article

Cite this article

Baroni, M., Bernardini, S., Ferraresi, A. et al. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Lang Resources & Evaluation 43, 209–226 (2009). https://doi.org/10.1007/s10579-009-9081-4

  • DOI: https://doi.org/10.1007/s10579-009-9081-4
