2018 | OriginalPaper | Chapter

7. Treat Texts as Data but Remember They Are Made of Words: Compiling and Pre-processing Corpora

Abstract

When corpora are analysed by automatic and statistical means, it should be remembered that the raw material is language, and that its specific nature ought to be considered at every stage of research. Since language cannot be investigated per se, corpora can only reveal the characteristics of limited instances of linguistic behaviour. Even exhaustive corpora supply only a finite set of texts, which should be assessed in the light of the extra-linguistic factors that shape linguistic traits: the sender’s and recipient’s region of origin, social and educational background, and gender; the channel of communication; the topic under discussion; the formality of the situation; and the historical period in which the texts were produced. Such factors help define the linguistic properties of every single text (or text fragment) in the corpus, and their overall balance should be considered during the preliminary stages of corpus design and compilation.
Once the texts to be included in the corpus have been selected, the linguistic data need to be prepared for automatic processing. This stage, too, is far from intuitive or automatic: from the identification of tokens to the extraction of lemmas, researchers should take qualitative aspects into account. Neither corpus compilation nor pre-processing can be considered a neutral operation with respect to the results of automatic analysis, and both should be made explicit so that results can be assessed and the same corpus can be exploited further.
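
The claim that token identification is not a neutral operation can be illustrated with a minimal Python sketch. It is not the chapter's own procedure: the sample sentences, the three tokenisation rules and all function names are assumptions introduced here for illustration only. The sketch shows that the same text yields different token counts depending on whether apostrophes are treated as word-internal or as word boundaries, a choice that matters for both English contractions and Italian elided forms.

```python
import re

# Illustrative sentences (invented for this sketch, not taken from the chapter).
SAMPLES = [
    "Don't forget the reader's point of view.",
    "L'analisi dell'italiano non è un'operazione neutra.",
]

def tokenize_whitespace(text):
    """Split on whitespace only; punctuation stays attached to the words."""
    return text.split()

def tokenize_keep_apostrophes(text):
    """Treat apostrophes as word-internal: "dell'italiano" is one token."""
    return re.findall(r"\w+(?:['’]\w+)*", text)

def tokenize_split_apostrophes(text):
    """Treat apostrophes as boundaries: "dell'italiano" becomes two tokens."""
    return re.findall(r"\w+", text)

def token_counts(text):
    """Return the number of tokens produced by each (hypothetical) rule."""
    return {
        "whitespace": len(tokenize_whitespace(text)),
        "keep_apostrophes": len(tokenize_keep_apostrophes(text)),
        "split_apostrophes": len(tokenize_split_apostrophes(text)),
    }

if __name__ == "__main__":
    for sentence in SAMPLES:
        print(sentence)
        print(token_counts(sentence))
```

Any frequency or lexical-richness measure computed downstream inherits whichever rule is chosen, which is one reason to document the tokenisation criteria alongside the corpus.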

Footnotes
1
In keeping with the studies illustrated in this book, the examples provided in this chapter are mostly in English and Italian.
 
Metadata
Title
Treat Texts as Data but Remember They Are Made of Words: Compiling and Pre-processing Corpora
Author
Stefano Ondelli
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-319-97064-6_7
