2023 | Original Paper | Book Chapter

A German Parallel Clausal Coordinate Ellipsis Corpus that Aligns Sentences from the TüBa-D/Z Treebank with Reconstructed Canonical Forms

Written by: Denis Memmesheimer, Karin Harbusch

Published in: Text, Speech, and Dialogue

Publisher: Springer Nature Switzerland

Abstract

This paper presents a new German resource for coordinated sentences, including asyndetons. The aim is to align cases of Clausal Coordinate Ellipsis (CCE) with the ellipsis-reconstructed sentences. The latter are called canonical forms. CCE is a challenging linguistic phenomenon in which constituents can be omitted under certain conditions. Often, several elision phenomena occur simultaneously. Even state-of-the-art constituency parsers have difficulties with CCE sentences. Although CCE examples occur in sufficient numbers in both written and spoken corpora, they are often among those with the lowest F1 scores. We surmise that elided verbforms, in particular, lead to incorrect hypotheses about phrase boundaries. Our new parallel corpus is designed to support the development of effective models for machine learning or natural language processing components that can automatically reconstruct CCE phenomena.
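
To make the idea of aligning an elliptical sentence with its canonical form concrete, the following is a minimal sketch and not the authors' method or corpus format. It assumes that a sentence pair is available as two whitespace-tokenized strings and uses Python's standard `difflib` to list the tokens that appear only in the canonical form. The function name `elided_tokens` and the Gapping example sentence are hypothetical and constructed for illustration; they are not taken from the corpus.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def elided_tokens(elliptical: str, canonical: str) -> list[tuple[int, str]]:
    """Return (position in canonical form, token) for every token that
    is present in the canonical form but missing from the elliptical one.

    Assumption: both sentences are tokenized by whitespace and the
    canonical form differs from the elliptical one only by insertions.
    """
    ell = elliptical.split()
    can = canonical.split()
    # autojunk=False keeps frequent tokens (e.g. articles) in the alignment.
    matcher = SequenceMatcher(a=ell, b=can, autojunk=False)
    restored = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":  # tokens that exist only in the canonical form
            restored.extend((j, can[j]) for j in range(j1, j2))
    return restored


# Constructed Gapping example (hypothetical, not from the corpus):
# the finite verb is omitted in the second conjunct.
pair = (
    "Hans kauft ein Buch und Maria eine Zeitung",
    "Hans kauft ein Buch und Maria kauft eine Zeitung",
)
print(elided_tokens(*pair))
```

For this constructed pair, the sketch reports the second occurrence of `kauft` (position 6 in the canonical form) as the restored token. A purely string-based diff like this ignores syntactic structure, so it only illustrates what a parallel pair encodes; it does not reconstruct canonical forms itself.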

Footnotes
1
Here, we follow the terminology in [7] where all identity types for CCE are outlined in detail.
 
2
TüBa-D/S uses a very similar encoding scheme for spoken dialogues (see [26]). However, it provides neither morphological nor lemma specifications, so we focus on TüBa-D/Z.
 
5
SGF is not necessarily judged as an ellipsis phenomenon (see, e.g., [13] for a psycholinguistic argument).
 
6
Originally, in [14], this licensing condition was referred to as lemma-identity.
 
7
In the CCE corpus, we do not spell out all word-order variants for Gapping, but rather adhere as closely as possible to the order in the first conjunct.
 
8
OPIELLE stands for ELLEIPO read in reverse, indicating that it reverses the generation process. However, it is important to note that OPIELLE has to hypothesize the scope of a coordination along with all possible canonical forms, whereas ELLEIPO only tests conditions for omitting given constituents in the predefined scope of conjuncts. Due to space limitations, we have to skip all the details here. The advantages of OPIELLE are: (1) reusing a parser’s initial chart data structure, and (2) using an efficient dynamic programming algorithm to produce reconstructed syntax trees for an entire input sentence. These factors contribute to the efficient production of canonical forms.
 
References
1.
Brants, S., et al.: TIGER: linguistic interpretation of a German corpus. Res. Lang. Comput. 2(4), 597–620 (2004)
2.
Callison-Burch, C., Osborne, M., Koehn, P.: Re-evaluating the role of BLEU in machine translation research. In: Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Trento, Italy, pp. 249–256 (2006)
4.
Foth, K., Köhn, A., Beuck, N., Menzel, W.: Because size does matter: the Hamburg dependency treebank. Fachbereich Informatik, Universität Hamburg, Germany, Technical report (2014)
5.
Gatt, A., Krahmer, E.: Survey of the state of the art in natural language generation: core tasks, applications and evaluation. J. AI Res. 61(1), 65–170 (2018)
6.
Harbusch, K.: Incremental sentence production inhibits clausal coordinate ellipsis: a treebank study into Dutch and German. Dialogue Discourse 2(1), 313–332 (2011)
7.
Harbusch, K., van Breugel, C., Koch, U., Kempen, G.: Interactive sentence combining and paraphrasing in support of integrated writing and grammar instruction: a new application area for natural language sentence generators. In: Proceedings of the 11th European Workshop on Natural Language Generation (ENLG), pp. 65–68. Saarbrücken, Germany (2007)
8.
Harbusch, K., Kempen, G.: ELLEIPO: a module that computes coordinative ellipsis for language generators that don’t. In: Proceedings of the 11th EACL: Posters & Demonstrations, Trento, Italy, pp. 115–118 (2006)
9.
Harbusch, K., Kempen, G.: Clausal coordinate ellipsis in German: the TIGER treebank as a source of evidence. In: Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007), Tartu, Estonia, pp. 81–88 (2007)
10.
Harbusch, K., Kempen, G.: Generating clausal coordinate ellipsis multilingually: a uniform approach based on postediting. In: Proceedings of the 12th ENLG, Athens, Greece, pp. 138–145 (2009)
11.
Harbusch, K., Memmesheimer, D., Franek, J., Kwasnik, W.: Polish clausal coordination with and without ellipsis. In: Guz, W., Szymanek, B. (eds.) Canonical and non-canonical structures in Polish, vol. 12, pp. 97–121. Wydawnictwo KUL, Lublin, Poland (2018)
12.
Haspelmath, M.: Coordination. In: Shopen, T. (ed.) Language Typology and Linguistic Description, vol. 2, pp. 1–51, 2nd edn. Cambridge University Press, Cambridge (2007)
13.
Kempen, G.: Clausal coordination and coordinative ellipsis in a model of the speaker. Linguistics 47(3), 653–696 (2009)
14.
Kempen, G., Huijbers, P.: The lexicalization process in sentence production and naming: indirect election of words. Cognition 14, 185–209 (1983)
15.
Khullar, P., Majmundar, K., Shrivastava, M.: NoEl: an annotated corpus for noun ellipsis in English. In: Proceedings of the Twelfth Language Resources and Evaluation Conference, pp. 34–43. European Language Resources Association, Marseille, France (2020)
16.
Kitaev, N., Cao, S., Klein, D.: Multilingual constituency parsing with self-attention and pre-training. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3499–3505 (2019)
17.
Koehn, P.: Europarl: a parallel corpus for statistical machine translation. In: Proceedings of Machine Translation Summit X: Papers, Phuket, Thailand, pp. 79–86 (2005)
18.
Kubota, Y., Levine, R.: Against ellipsis: arguments for the direct licensing of ‘noncanonical’ coordinations. Linguist. Philos. 38, 521–576 (2015)
19.
Kupietz, M., Lüngen, H., Diewald, N.: Das Gesamtkonzept des Deutschen Referenzkorpus DeReKo. In: Deppermann, A., Fandrych, C., Kupietz, M., Schmidt, T. (eds.) Korpora in der germanistischen Sprachwissenschaft: Mündlich, schriftlich, multimedial, pp. 1–28. de Gruyter, Berlin, Germany/Boston, USA (2023)
20.
Laskar, S.R., Manna, R., Pakray, P., Bandyopadhyay, S.: Investigation of multilingual neural machine translation for Indian languages. In: Proceedings of the 9th Workshop on Asian Translation, Gyeongju, Republic of Korea, pp. 78–81 (2022)
21.
Matzke, M., Mai, H., Nager, W., Rüsseler, J., Münte, T.: The costs of freedom: an ERP - study of non-canonical sentences. Clin. Neurophysiol. 113(6), 844–852 (2002)
22.
Memmesheimer, D., Harbusch, K.: Exploring the feasibility of accurate reconstruction of clausal coordinate ellipsis in German. In: Experimental and Corpus-based Approaches to Ellipsis, 5th edn. (ECBAE 2023). University of Massachusetts, Amherst, MA, USA (2023)
23.
Mrini, K., Dernoncourt, F., Tran, Q.H., Bui, T., Chang, W., Nakashole, N.: Rethinking self-attention: towards interpretability in neural parsing. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 731–742 (2020)
24.
Muhonen, K., Purtonen, T.: Rule-based detection of clausal coordinate ellipsis. In: Proceedings of the 8th LREC, Istanbul, Turkey, pp. 1955–1959 (2012)
25.
Shiraïshi, A., Abeillé, A., Hemforth, B., Miller, P.: Verbal mismatch in right-node raising. Glossa: J. Gen. Linguist. 4(1) (2019)
26.
Stegmann, R., Telljohann, H., Hinrichs, E.W.: Stylebook for the German Treebank in VERBMOBIL. Technical report, 239, DFKI, Saarbrücken, Germany (2000)
27.
Telljohann, H., Hinrichs, E.W., Kübler, S., Zinsmeister, H., Beck, K.: Stylebook for the Tübingen Treebank of Written German (TüBa-D/Z). Seminar für Sprachwissenschaft, Universität Tübingen, Germany, Technical report (2017)
28.
Tiedemann, J.: Parallel data, tools and interfaces in OPUS. In: Proceedings of the 8th LREC, Istanbul, Turkey, pp. 2214–2218 (2012)
Metadata
Title
A German Parallel Clausal Coordinate Ellipsis Corpus that Aligns Sentences from the TüBa-D/Z Treebank with Reconstructed Canonical Forms
Written by
Denis Memmesheimer
Karin Harbusch
Copyright Year
2023
DOI
https://doi.org/10.1007/978-3-031-40498-6_11