The Sketch Engine: ten years on

Adam Kilgarriff; Vít Baisa; Jan Bušta; Miloš Jakubíček; Vojtěch Kovář; Jan Michelfeit; Pavel Rychlý; Vít Suchomel

doi:10.1007/s40607-014-0009-9

Authors

Adam Kilgarriff Lexical Computing Ltd.
Vít Baisa Lexical Computing Ltd.
Jan Bušta Lexical Computing Ltd.
Miloš Jakubíček Lexical Computing Ltd.
Vojtěch Kovář Lexical Computing Ltd.
Jan Michelfeit Lexical Computing Ltd.
Pavel Rychlý Lexical Computing Ltd.
Vít Suchomel Lexical Computing Ltd.

Keywords:

Corpora, Corpus lexicography, Corpus tools, Word sketches, Sketch Engine

Abstract

The Sketch Engine is a leading corpus tool, widely used in lexicography. Now, at 10 years old, it is mature software. The Sketch Engine website offers many ready-to-use corpora, and tools for users to build, upload and install their own corpora. The paper describes the core functions (word sketches, concordancing, thesaurus). It outlines the different kinds of users, and the approach taken to working with many different languages. It then reviews the kinds of corpora available in the Sketch Engine, gives a brief tour of some of the innovations from the last few years, and surveys other corpus tools and websites.
