
2018 | Original Paper | Book Chapter

10. Statistical Machine Translation and Turkish

Written by: Kemal Oflazer, Reyyan Yeniterzi, İlknur Durgar-El Kahlout

Published in: Turkish Natural Language Processing

Publisher: Springer International Publishing


Abstract

Machine translation is one of the most important applications of natural language processing. The last 25 years have seen tremendous progress in machine translation, enabled by the development of statistical techniques and the availability of large-scale parallel sentence corpora from which statistical models of translation can be learned. As alluded to in Chap. 1, Turkish poses many challenges for statistical machine translation, owing mainly to its complex morphology. This chapter discusses these challenges in more detail and describes two widely different approaches that have been employed over the last several years for English-to-Turkish machine translation.


Footnotes
1
www.statmt.org/wmt16/ (Accessed Sept. 14, 2017).
 
2
International Workshop on Spoken Language Translation: workshop2013.iwslt.org/ (Accessed Sept. 14, 2017).
 
3
Note that on the English side, the filler for [something] would come in the middle of this phrase.
 
4
See Chap. 1 for details.
 
5
This disambiguator has about 94% accuracy.
 
6
Ideally, one would also perform derivational morphological analysis on the English side, so that one could, for example, analyze accession into access plus a marker indicating nominalization.
 
7
The training set in the first row of Table 10.2 was limited to sentences whose Turkish side had at most 90 tokens (roots and bound morphemes) in total, in order to comply with the limitations of the GIZA++ alignment tool. However, when only the content words are included, more sentences can be used, since far fewer sentences violate the length restriction once morphemes and function words are removed.
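
The length restriction described in this footnote can be sketched as a simple filter over a morpheme-segmented parallel corpus. This is only an illustration, not the authors' preprocessing code: the function name `filter_by_length`, the whitespace-separated token format, and the sample sentence pairs are assumptions; only the limit of 90 Turkish-side tokens comes from the text.

```python
from typing import Iterable, List, Tuple

MAX_TURKISH_TOKENS = 90  # limit stated in the footnote


def filter_by_length(
    pairs: Iterable[Tuple[str, str]],
    max_tokens: int = MAX_TURKISH_TOKENS,
) -> List[Tuple[str, str]]:
    """Keep (english, turkish) pairs whose Turkish side has at most max_tokens tokens.

    Assumes the Turkish side is already segmented so that roots and bound
    morphemes are separate whitespace-delimited tokens (a hypothetical format).
    """
    return [(en, tr) for en, tr in pairs if len(tr.split()) <= max_tokens]


if __name__ == "__main__":
    # Hypothetical sample data: one short pair and one pair that is too long.
    short_pair = ("in the house", "ev +Loc")
    long_pair = ("a very long sentence", " ".join(["tok"] * 91))
    kept = filter_by_length([short_pair, long_pair])
    print(f"kept {len(kept)} of 2 sentence pairs")
```

Because the count is taken over the segmented Turkish side, removing bound morphemes and function words before filtering lets more sentence pairs pass the same limit, which is the effect the footnote describes.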
 
8
It should be noted that what to selectively attach to the root should be considered on a per-language basis; if Turkish were to be aligned with a language with similar morphological markers, this perhaps would not have been needed.
 
9
Using the content word data improved performance for all representations except the baseline.
 
10
We ran MERT on the baseline model and the morphologically segmented models, forcing -weight-d to range over a very small interval around 0.1 but letting the other parameters vary within their suggested ranges. Even though the procedure reported a better BLEU score on the tune set, running the new model on the test set did not show any improvement at all. This may be because the initial choice of -weight-d, along with -dl set to -1, provides such a drastic improvement that perturbations in the other parameters do not have much impact.
 
11
We arrived at this combination by experimenting with the decoder to avoid the almost monotonic translation we were getting with the default parameters. These parameters boosted the BLEU scores substantially compared to default parameters used by the decoder.
 
12
We should also note that all sentences were lowercased so that we would not have to deal with capitalization issues at that stage.
 
13
The meanings of the various tags are as follows. Dependency labels: PMOD = Preposition Modifier; POS = Possessive. Part-of-speech tags for the English words: +IN = Preposition; +PRP$ = Possessive Pronoun; +JJ = Adjective; +NN = Noun; +NNS = Plural Noun. Morphological feature tags in the Turkish sentence: +A3pl = 3rd person plural; +P3sg = 3rd person singular possessive; +Loc = Locative case. Note that we mark an English plural noun as +NN_NNS to indicate that the root is a noun and there is a plural morpheme on it. Note also that economic is related to relations, but we are not interested in such content words and their relations.
 
14
We use _ to prefix such syntactic tags on the English side.
 
15
The order is important in that we would like to attach the same sequence of function words in the same order so that the resulting tags on the English side are the same.
 
16
We outline two additional rules later when we see a more complex example in Fig. 10.4.
 
17
For example, the morphological analyzer outputs +A3sg to mark a singular noun, if there is no explicit plural morpheme. Such markers are removed.
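
The marker removal described in this footnote can be sketched as follows. This is a hedged illustration rather than the authors' code: the function name `strip_default_markers`, the `root+Tag+Tag` string layout, and the sample analyses are assumptions, and only `+A3sg` is taken from the text as a marker to remove.

```python
from typing import FrozenSet

# Only A3sg is named in the footnote; callers may pass a larger set.
DEFAULT_MARKERS: FrozenSet[str] = frozenset({"A3sg"})


def strip_default_markers(
    analysis: str, markers: FrozenSet[str] = DEFAULT_MARKERS
) -> str:
    """Remove tags that do not correspond to an explicit morpheme.

    Assumes an analysis is a root followed by '+'-prefixed tags,
    for example 'ev+A3sg+Loc' (a hypothetical layout).
    """
    root, *tags = analysis.split("+")
    kept = [tag for tag in tags if tag not in markers]
    return "+".join([root, *kept])


if __name__ == "__main__":
    # Hypothetical analyses: a singular and a plural locative noun.
    for analysis in ("ev+A3sg+Loc", "ev+A3pl+Loc"):
        print(analysis, "->", strip_default_markers(analysis))
```

Tags that mark an overt morpheme, such as `+A3pl`, pass through unchanged, so only the default markers are dropped.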
 
18
The tune set was not used in this work but reserved for future work so that meaningful comparisons could be made.
 
19
It is possible that the ten test sets are not mutually exclusive.
 
20
These allow unlimited distortion without penalty, but increase decoding time.
 
21
In Moses, factors are separated by a ‘|’ symbol.
 
22
Concatenating Root and Tags gives the Surface form, in that the surface is unique given this concatenation.
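
The factored token format mentioned in footnotes 21 and 22 can be illustrated with a small parser. This is a sketch under stated assumptions: the factor order (surface, root, tags), the names `FactoredToken`, `parse_factored`, `root_plus_tags`, and `to_moses`, and the sample token are all hypothetical; only the `|` separator and the idea of concatenating root and tags come from the text.

```python
from typing import NamedTuple

FACTOR_SEPARATOR = "|"  # factor separator used by Moses, per the footnote


class FactoredToken(NamedTuple):
    surface: str
    root: str
    tags: str


def parse_factored(token: str) -> FactoredToken:
    """Split a 'surface|root|tags' token into its three factors.

    The three-factor layout and its order are assumptions made for
    this example.
    """
    parts = token.split(FACTOR_SEPARATOR)
    if len(parts) != 3:
        raise ValueError(f"expected 3 factors, got {len(parts)}: {token!r}")
    return FactoredToken(*parts)


def root_plus_tags(token: FactoredToken) -> str:
    """Concatenate root and tags into a single lexical-morphological string."""
    return token.root + token.tags


def to_moses(token: FactoredToken) -> str:
    """Serialize the factors back into a '|'-separated token."""
    return FACTOR_SEPARATOR.join(token)


if __name__ == "__main__":
    token = parse_factored("evlerinde|ev|+A3pl+P3sg+Loc")  # hypothetical token
    print(token.surface, root_plus_tags(token), to_moses(token))
```

Since the surface form is unique given the root and tags, a system could translate on the concatenated representation and generate the surface form afterwards.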
 
23
Note that for Turkish, this representation is equivalent to surface words in that the surface is unique given this representation.
 
24
Note that in this case, the translations would be generated in the same format, but we then split such postpositions from the words they are attached to, during decoding, and then evaluate the BLEU score.
 
25
In order to provide a simple and clear representation, the example sentences contain the surface form of the words as opposed to the morphemic representation used earlier.
 
26
For instance, consider the example in Fig. 10.4 involving if with some additional modifiers added to the intervening noun phrase.
 
References
Bisazza A, Federico M (2009) Morphological pre-processing for Turkish to English statistical machine translation. In: Proceedings of IWSLT, Tokyo, pp 129–135
Carpuat M (2009) Toward using morphology in French-English phrase-based SMT. In: Proceedings of WMT, Athens, pp 150–154
Chung T, Gildea D (2009) Unsupervised tokenization for machine translation. In: Proceedings of EMNLP, Singapore, pp 718–726
Corston-Oliver S, Gamon M (2004) Normalizing German and English inflectional morphology to improve statistical word alignment. In: Proceedings of AMTA, Washington, DC
Durgar-El Kahlout İ (2009) A prototype English-Turkish statistical machine translation system. PhD thesis, Sabancı University, Istanbul
Durgar-El Kahlout İ, Oflazer K (2006) Initial explorations in English to Turkish statistical machine translation. In: Proceedings of WMT, New York, NY, pp 7–14
Durgar-El Kahlout İ, Oflazer K (2010) Exploiting morphology and local word reordering in English-to-Turkish phrase-based statistical machine translation. IEEE Trans Audio Speech Lang Process 18(6):1313–1322
Durgar-El Kahlout İ, Mermer C, Doğan MU (2012) Recent improvements in statistical machine translation between Turkish and English. In: Vertan C, von Hahn W (eds) Multilingual processing in Eastern and Southern EU languages: low-resourced technologies and translation. Cambridge Scholars Publishing, Cambridge
Eyigöz E, Gildea D, Oflazer K (2013a) Multi-rate HMMs for word alignment. In: Proceedings of WMT, Sofia, pp 494–502
Eyigöz E, Gildea D, Oflazer K (2013b) Simultaneous word-morpheme alignment for statistical machine translation. In: Proceedings of NAACL-HLT, Atlanta, GA, pp 32–40
Goldwater S, McClosky D (2005) Improving statistical MT through morphological analysis. In: Proceedings of EMNLP, Vancouver, BC, pp 676–683
Koehn P, Och FJ, Marcu D (2003) Statistical phrase-based translation. In: Proceedings of NAACL-HLT, Edmonton, AB, pp 127–133
Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical machine translation. In: Proceedings of ACL, Prague, pp 177–180
Lee YS (2004) Morphological analysis for statistical machine translation. In: Proceedings of NAACL-HLT, Boston, MA, pp 57–60
Luong MT, Nakov P, Kan MY (2010) A hybrid morpheme-word representation for machine translation of morphologically rich languages. In: Proceedings of EMNLP, Cambridge, MA, pp 148–157
Mermer C, Akın AA (2010) Unsupervised search for the optimal segmentation for statistical machine translation. In: Proceedings of the ACL student research workshop, Uppsala, pp 31–36
Minkov E, Toutanova K, Suzuki H (2007) Generating complex morphology for machine translation. In: Proceedings of ACL, Prague, pp 128–135
Naradowsky J, Toutanova K (2011) Unsupervised bilingual morpheme segmentation and alignment with context-rich hidden semi-markov models. In: Proceedings of ACL-HLT, Portland, OR, pp 895–904
Nguyen T, Vogel S, Smith NA (2010) Nonparametric word segmentation for machine translation. In: Proceedings of COLING, Beijing, pp 815–823
Niessen S, Ney H (2004) Statistical machine translation with scarce resources using morpho-syntactic information. Comput Linguist 30(2):181–204
Nivre J, Hall J, Nilsson J, Chanev A, Eryiğit G, Kübler S, Marinov S, Marsi E (2007) MaltParser: a language-independent system for data-driven dependency parsing. Nat Lang Eng 13(2):95–135
Oflazer K (1994) Two-level description of Turkish morphology. Lit Linguist Comput 9(2):137–148
Oflazer K (1996) Error-tolerant finite-state recognition with applications to morphological analysis and spelling correction. Comput Linguist 22(1):73–99
Oflazer K, Durgar-El Kahlout İ (2007) Exploring different representational units in English-to-Turkish statistical machine translation. In: Proceedings of WMT, Prague, pp 25–32
Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of ACL, Philadelphia, PA, pp 311–318
Popovic M, Ney H (2004) Towards the use of word stems and suffixes for statistical machine translation. In: Proceedings of LREC, Lisbon, pp 1585–1588
Sadat F, Habash N (2006) Combination of Arabic preprocessing schemes for statistical machine translation. In: Proceedings of COLING-ACL, Sydney, pp 1–8
Schmid H (1994) Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the international conference on new methods in language processing, Manchester
Stolcke A (2002) SRILM – an extensible language modeling toolkit. In: Proceedings of ICSLP, Denver, CO, vol 2, pp 901–904
Talbot D, Osborne M (2006) Modelling lexical redundancy for machine translation. In: Proceedings of COLING-ACL, Sydney, pp 969–976
Tantuğ AC, Oflazer K, Durgar-El Kahlout İ (2008) BLEU+: a tool for fine-grained BLEU computation. In: Proceedings of LREC, Marrakesh, pp 1493–1499
Toutanova K, Klein D, Manning CD, Singer Y (2003) Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proceedings of NAACL-HLT, Edmonton, AB, pp 252–259
Yang M, Kirchhoff K (2006) Phrase-based backoff models for machine translation of highly inflected languages. In: Proceedings of EACL, Trento, pp 41–48
Yeniterzi R (2009) Syntax-to-morphology alignment and constituent reordering in factored phrase-based statistical machine translation from English to Turkish. Master’s thesis, Sabancı University, Istanbul
Yeniterzi R, Oflazer K (2010) Syntax-to-morphology mapping in factored phrase-based statistical machine translation from English to Turkish. In: Proceedings of ACL, Uppsala, pp 454–464
Yılmaz E, Durgar-El Kahlout İ (2014) The use of recurrent neural networks language model in Turkish-English machine translation. In: Proceedings of IEEE signal processing and communications applications conference, Trabzon, pp 1247–1250
Yılmaz E, Durgar-El Kahlout İ, Aydın B, Özil ZS (2013) TÜBİTAK Turkish-English submissions for IWSLT 2013. In: Proceedings of IWSLT, Heidelberg, pp 152–159
Yuret D, Türe F (2006) Learning morphological disambiguation rules for Turkish. In: Proceedings of NAACL-HLT, New York, NY, pp 328–334
Zollmann A, Venugopal A, Vogel S (2006) Bridging the inflection morphology gap for Arabic statistical machine translation. In: Proceedings of NAACL-HLT, New York, NY, pp 201–204
Metadata
Title
Statistical Machine Translation and Turkish
Written by
Kemal Oflazer
Reyyan Yeniterzi
İlknur Durgar-El Kahlout
Copyright year
2018
DOI
https://doi.org/10.1007/978-3-319-90165-7_10