Skip to main content

2018 | OriginalPaper | Buchkapitel

Bi-text Alignment of Movie Subtitles for Spoken English-Arabic Statistical Machine Translation

verfasst von : Fahad Al-Obaidli, Stephen Cox, Preslav Nakov

Erschienen in: Computational Linguistics and Intelligent Text Processing

Verlag: Springer International Publishing

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

We describe efforts towards getting better resources for English-Arabic machine translation of spoken text. In particular, we look at movie subtitles as a unique, rich resource, as subtitles in one language often get translated into other languages. Movie subtitles are not new as a resource and have been explored in previous research; however, here we create a much larger bi-text (the biggest to date), and we further generate better quality alignment for it. Given the subtitles for the same movie in different languages, a key problem is how to align them at the fragment level. Typically, this is done using length-based alignment, but for movie subtitles, there is also time information. Here we exploit this information to develop an original algorithm that outperforms the current best subtitle alignment tool, subalign. The evaluation results show that adding our bi-text to the IWSLT training bi-text yields an improvement of over two BLEU points absolute.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Fußnoten
2
The data can be found at http://​workshop2013.​iwslt.​org.
 
3
For a broader discussion see also [6, 21].
 
4
This is the official scoring method for the translation tracks into Arabic at IWSLT’13: http://​alt.​qcri.​org/​tools/​arabic-normalizer.
 
5
We set M to be at most 5 in order to prevent the algorithm from unreasonably iterating up to the last segment looking for a match.
 
Literatur
1.
Zurück zum Zitat Abdelali, A., Guzmán, F., Sajjad, H., Vogel, S.: The AMARA corpus: building parallel language resources for the educational domain. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2014, Reykjavik, Iceland (2014) Abdelali, A., Guzmán, F., Sajjad, H., Vogel, S.: The AMARA corpus: building parallel language resources for the educational domain. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2014, Reykjavik, Iceland (2014)
2.
Zurück zum Zitat Brown, P.F., Pietra, V.J.D., Pietra, S.A.D., Mercer, R.L.: The mathematics of statistical machine translation: parameter estimation. Comput. Linguist. 19(2), 263–311 (1993) Brown, P.F., Pietra, V.J.D., Pietra, S.A.D., Mercer, R.L.: The mathematics of statistical machine translation: parameter estimation. Comput. Linguist. 19(2), 263–311 (1993)
3.
Zurück zum Zitat Cettolo, M., Niehues, J., Stüker, S., Bentivogli, L., Frederico, M.: Report on the 10th IWSLT evaluation campaign. In: Proceedings of the International Workshop on Spoken Language Translation, IWSLT 2013, Heidelberg, Germany, pp. 15–24 (2013) Cettolo, M., Niehues, J., Stüker, S., Bentivogli, L., Frederico, M.: Report on the 10th IWSLT evaluation campaign. In: Proceedings of the International Workshop on Spoken Language Translation, IWSLT 2013, Heidelberg, Germany, pp. 15–24 (2013)
4.
Zurück zum Zitat Foster, G., Kuhn, R.: Stabilizing minimum error rate training. In: Proceedings of the Fourth Workshop on Statistical Machine Translation, WMT 2009, Athens, Greece, pp. 242–249 (2009) Foster, G., Kuhn, R.: Stabilizing minimum error rate training. In: Proceedings of the Fourth Workshop on Statistical Machine Translation, WMT 2009, Athens, Greece, pp. 242–249 (2009)
5.
Zurück zum Zitat Gale, W.A., Church, K.W.: A program for aligning sentences in bilingual corpora. Comput. linguist. 19(1), 75–102 (1993) Gale, W.A., Church, K.W.: A program for aligning sentences in bilingual corpora. Comput. linguist. 19(1), 75–102 (1993)
6.
Zurück zum Zitat Guzmán, F., Nakov, P., Vogel, S.: Analyzing optimization for statistical machine translation: MERT learns verbosity, PRO learns length. In: Proceedings of the Nineteenth Conference on Computational Natural Language Learning, CoNLL 2015, Beijing, China, pp. 62–72 (2015) Guzmán, F., Nakov, P., Vogel, S.: Analyzing optimization for statistical machine translation: MERT learns verbosity, PRO learns length. In: Proceedings of the Nineteenth Conference on Computational Natural Language Learning, CoNLL 2015, Beijing, China, pp. 62–72 (2015)
7.
Zurück zum Zitat Heafield, K.: KenLM: Faster and smaller language model queries. In: Proceedings of the 6th Workshop on Statistical Machine Translation, WMT 2011, Edinburgh, Scotland, UK, pp. 187–197 (2011) Heafield, K.: KenLM: Faster and smaller language model queries. In: Proceedings of the 6th Workshop on Statistical Machine Translation, WMT 2011, Edinburgh, Scotland, UK, pp. 187–197 (2011)
8.
Zurück zum Zitat Hopkins, M., May, J.: Tuning as ranking. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, Edinburgh, Scotland, UK, pp. 1352–1362 (2011) Hopkins, M., May, J.: Tuning as ranking. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, Edinburgh, Scotland, UK, pp. 1352–1362 (2011)
10.
Zurück zum Zitat El Kholy, A., Habash, N.: Orthographic and morphological processing for English-Arabic statistical machine translation. Mach. Transl. 26(1–2), 25–45 (2012)CrossRef El Kholy, A., Habash, N.: Orthographic and morphological processing for English-Arabic statistical machine translation. Mach. Transl. 26(1–2), 25–45 (2012)CrossRef
11.
Zurück zum Zitat Koehn, P.: Europarl: a parallel corpus for statistical machine translation. In: Proceedings of the Tenth Machine Translation Summit, MT Summit 2005, Phuket, Thailand, pp. 79–86 (2005) Koehn, P.: Europarl: a parallel corpus for statistical machine translation. In: Proceedings of the Tenth Machine Translation Summit, MT Summit 2005, Phuket, Thailand, pp. 79–86 (2005)
12.
Zurück zum Zitat Koehn, P., Axelrod, A., Mayne, A.B., Callison-Burch, C., Osborne, M., Talbot, D.: Edinburgh system description for the 2005 IWSLT speech translation evaluation. In: Proceedings of the International Workshop on Spoken Language Translation, IWSLT 2005, Pittsburgh, PA, USA (2005) Koehn, P., Axelrod, A., Mayne, A.B., Callison-Burch, C., Osborne, M., Talbot, D.: Edinburgh system description for the 2005 IWSLT speech translation evaluation. In: Proceedings of the International Workshop on Spoken Language Translation, IWSLT 2005, Pittsburgh, PA, USA (2005)
13.
Zurück zum Zitat Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E.: Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th Annual Meeting of the ACL: Interactive Poster and Demonstration Sessions, ACL 2007, Prague, Czech Republic, pp. 177–180 (2007) Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E.: Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th Annual Meeting of the ACL: Interactive Poster and Demonstration Sessions, ACL 2007, Prague, Czech Republic, pp. 177–180 (2007)
14.
Zurück zum Zitat Koehn, P., Och, F.J., Marcu, D.: Statistical phrase-based translation. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, HLT-NAACL 2003, Edmonton, Canada, vol. 1, pp. 48–54 (2003) Koehn, P., Och, F.J., Marcu, D.: Statistical phrase-based translation. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, HLT-NAACL 2003, Edmonton, Canada, vol. 1, pp. 48–54 (2003)
15.
Zurück zum Zitat Lavecchia, C., Smaïli, K., Langlois, D.: Building parallel corpora from movies. In: Proceedings of the 4th International Workshop on Natural Language Processing and Cognitive Science, NLPCS 2007, Funchal, Madeira, Portugal (2007) Lavecchia, C., Smaïli, K., Langlois, D.: Building parallel corpora from movies. In: Proceedings of the 4th International Workshop on Natural Language Processing and Cognitive Science, NLPCS 2007, Funchal, Madeira, Portugal (2007)
16.
Zurück zum Zitat Mangeot, M., Giguet, E.: Multilingual aligned corpora from movie subtitles. Report in Laboratoire d’Informatique, Systèmes, Traitement de l’Information et de la Connaissance (2005) Mangeot, M., Giguet, E.: Multilingual aligned corpora from movie subtitles. Report in Laboratoire d’Informatique, Systèmes, Traitement de l’Information et de la Connaissance (2005)
17.
Zurück zum Zitat Monroe, W., Green, S., Manning, C.D.: Word segmentation of informal Arabic with domain adaptation. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), ACL 2014, Baltimore, MD, USA, pp. 206–211 (2014) Monroe, W., Green, S., Manning, C.D.: Word segmentation of informal Arabic with domain adaptation. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), ACL 2014, Baltimore, MD, USA, pp. 206–211 (2014)
18.
Zurück zum Zitat Nakov, P.: Improving English-Spanish statistical machine translation: experiments in domain adaptation, sentence paraphrasing, tokenization, and recasing. In: Proceedings of the Third Workshop on Statistical Machine Translation, WMT 2008, Columbus, Ohio, USA, pp. 147–150 (2008) Nakov, P.: Improving English-Spanish statistical machine translation: experiments in domain adaptation, sentence paraphrasing, tokenization, and recasing. In: Proceedings of the Third Workshop on Statistical Machine Translation, WMT 2008, Columbus, Ohio, USA, pp. 147–150 (2008)
19.
Zurück zum Zitat Nakov, P., Al Obaidli, F., Guzman, F., Vogel, S.: Parameter optimization for statistical machine translation: it pays to learn from hard examples. In: Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2013, Hissar, Bulgaria, pp. 504–510 (2013) Nakov, P., Al Obaidli, F., Guzman, F., Vogel, S.: Parameter optimization for statistical machine translation: it pays to learn from hard examples. In: Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2013, Hissar, Bulgaria, pp. 504–510 (2013)
20.
Zurück zum Zitat Nakov, P., Guzmán, F., Vogel, S.: Optimizing for sentence-level BLEU+1 yields short translations. In: Proceedings of the 24th International Conference on Computational Linguistics, COLING 2012, Mumbai, India, pp. 1979–1994 (2012) Nakov, P., Guzmán, F., Vogel, S.: Optimizing for sentence-level BLEU+1 yields short translations. In: Proceedings of the 24th International Conference on Computational Linguistics, COLING 2012, Mumbai, India, pp. 1979–1994 (2012)
21.
Zurück zum Zitat Nakov, P., Guzmán, F., Vogel, S.: A tale about PRO and monsters. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), ACL 2013, Sofia, Bulgaria, pp. 12–17 (2013) Nakov, P., Guzmán, F., Vogel, S.: A tale about PRO and monsters. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), ACL 2013, Sofia, Bulgaria, pp. 12–17 (2013)
22.
Zurück zum Zitat Nakov, P., Ng, H.T.: Improved statistical machine translation for resource-poor languages using related resource-rich languages. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, EMNLP 2009, Singapore, vol. 3, pp. 1358–1367 (2009) Nakov, P., Ng, H.T.: Improved statistical machine translation for resource-poor languages using related resource-rich languages. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, EMNLP 2009, Singapore, vol. 3, pp. 1358–1367 (2009)
23.
Zurück zum Zitat Roth, R., Rambow, O., Habash, N., Diab, M., Rudin, C.: Arabic morphological tagging, diacritization, and lemmatization using lexeme models and feature ranking. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, ACL 2008, Columbus, OH, USA, pp. 117–120 (2008) Roth, R., Rambow, O., Habash, N., Diab, M., Rudin, C.: Arabic morphological tagging, diacritization, and lemmatization using lexeme models and feature ranking. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, ACL 2008, Columbus, OH, USA, pp. 117–120 (2008)
24.
Zurück zum Zitat Sajjad, H., Guzmán, F., Nakov, P., Abdelali, A., Murray, K., Al Obaidli, F., Vogel, S.: QCRI at IWSLT 2013: experiments in Arabic-English and English-Arabic spoken language translation. In Proceedings of the 10th International Workshop on Spoken Language Translation, IWSLT 2013, Heidelberg, Germany (2013) Sajjad, H., Guzmán, F., Nakov, P., Abdelali, A., Murray, K., Al Obaidli, F., Vogel, S.: QCRI at IWSLT 2013: experiments in Arabic-English and English-Arabic spoken language translation. In Proceedings of the 10th International Workshop on Spoken Language Translation, IWSLT 2013, Heidelberg, Germany (2013)
25.
Zurück zum Zitat Tiedemann, J.: Building a multilingual parallel subtitle corpus. In: Proceedings of the Computational Linguistics in the Netherlands, CLIN 2007, Nijmegen, Netherlands (2007) Tiedemann, J.: Building a multilingual parallel subtitle corpus. In: Proceedings of the Computational Linguistics in the Netherlands, CLIN 2007, Nijmegen, Netherlands (2007)
26.
Zurück zum Zitat Tiedemann, J.: Improved sentence alignment for movie subtitles. In: Proceedings of the Conference on Recent Advances in Natural Language Processing, RANLP 2007, Borovets, Bulgaria, pp. 582–588 (2007) Tiedemann, J.: Improved sentence alignment for movie subtitles. In: Proceedings of the Conference on Recent Advances in Natural Language Processing, RANLP 2007, Borovets, Bulgaria, pp. 582–588 (2007)
27.
Zurück zum Zitat Tiedemann, J.: Synchronizing translated movie subtitles. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation, LREC 2008 (2008) Tiedemann, J.: Synchronizing translated movie subtitles. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation, LREC 2008 (2008)
28.
Zurück zum Zitat Tiedemann. J.: Parallel data, tools and interfaces in OPUS. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, Istanbul, Turkey, pp. 2214–2218 (2012) Tiedemann. J.: Parallel data, tools and interfaces in OPUS. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, Istanbul, Turkey, pp. 2214–2218 (2012)
29.
Zurück zum Zitat Volk, M.: The automatic translation of film subtitles. A machine translation success story? J. Lang. Technol. Comput. Linguist. (JLCL) 24(3), 115–128 (2009) Volk, M.: The automatic translation of film subtitles. A machine translation success story? J. Lang. Technol. Comput. Linguist. (JLCL) 24(3), 115–128 (2009)
30.
Zurück zum Zitat Volk, M., Harder, S.: Evaluating MT with translations or translators: what is the difference? In: Proceedings of the Machine Translation Summit XI, MT-Summit 2007, Copenhagen, Denmark (2007) Volk, M., Harder, S.: Evaluating MT with translations or translators: what is the difference? In: Proceedings of the Machine Translation Summit XI, MT-Summit 2007, Copenhagen, Denmark (2007)
31.
Zurück zum Zitat Wang, P., Nakov, P., Ng, H.T.: Source language adaptation for resource-poor machine translation. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL 2012, Jeju Island, Korea, pp. 286–296 (2012) Wang, P., Nakov, P., Ng, H.T.: Source language adaptation for resource-poor machine translation. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL 2012, Jeju Island, Korea, pp. 286–296 (2012)
32.
Zurück zum Zitat Wang, P., Nakov, P., Ng, H.T.: Source language adaptation approaches for resource-poor machine translation. Comput. Linguist. 42, 277–306 (2016)MathSciNetCrossRef Wang, P., Nakov, P., Ng, H.T.: Source language adaptation approaches for resource-poor machine translation. Comput. Linguist. 42, 277–306 (2016)MathSciNetCrossRef
Metadaten
Titel
Bi-text Alignment of Movie Subtitles for Spoken English-Arabic Statistical Machine Translation
verfasst von
Fahad Al-Obaidli
Stephen Cox
Preslav Nakov
Copyright-Jahr
2018
DOI
https://doi.org/10.1007/978-3-319-75487-1_11

Premium Partner