
2018 | OriginalPaper | Chapter

Bi-text Alignment of Movie Subtitles for Spoken English-Arabic Statistical Machine Translation

Authors: Fahad Al-Obaidli, Stephen Cox, Preslav Nakov

Published in: Computational Linguistics and Intelligent Text Processing

Publisher: Springer International Publishing


Abstract

We describe efforts towards getting better resources for English-Arabic machine translation of spoken text. In particular, we look at movie subtitles as a unique, rich resource, as subtitles in one language often get translated into other languages. Movie subtitles are not new as a resource and have been explored in previous research; however, here we create a much larger bi-text (the biggest to date), and we further generate better quality alignment for it. Given the subtitles for the same movie in different languages, a key problem is how to align them at the fragment level. Typically, this is done using length-based alignment, but for movie subtitles, there is also time information. Here we exploit this information to develop an original algorithm that outperforms the current best subtitle alignment tool, subalign. The evaluation results show that adding our bi-text to the IWSLT training bi-text yields an improvement of over two BLEU points absolute.
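To make the idea of time-based alignment concrete, the following is a minimal sketch in Python. It is not the authors' algorithm, which this page does not describe in detail; it only illustrates how start and end timestamps can drive alignment instead of fragment length. The names `Segment`, `overlap`, `align_by_time`, `max_lookahead`, and `min_ratio` are hypothetical. The `max_lookahead` cap is loosely inspired by the footnote that limits M to at most 5, but its exact role in the paper is an assumption here.

```python
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str


def overlap(a: Segment, b: Segment) -> float:
    """Length in seconds of the interval shared by a and b (0.0 if disjoint)."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def align_by_time(
    src: List[Segment],
    tgt: List[Segment],
    max_lookahead: int = 5,   # hypothetical cap on how far ahead to search
    min_ratio: float = 0.5,   # hypothetical minimum overlap ratio for a match
) -> List[Tuple[str, str]]:
    """Greedy one-to-one alignment of two time-sorted subtitle tracks."""
    pairs = []
    j = 0  # index of the first target segment still worth considering
    for s in src:
        # Skip target segments that end before this source segment starts.
        while j < len(tgt) and tgt[j].end <= s.start:
            j += 1
        best_k, best_ratio = None, 0.0
        # Inspect at most max_lookahead candidates instead of the whole track.
        for k in range(j, min(j + max_lookahead, len(tgt))):
            t = tgt[k]
            shorter = min(s.end - s.start, t.end - t.start)
            if shorter <= 0:
                continue  # ignore zero-length or malformed segments
            ratio = overlap(s, t) / shorter
            if ratio > best_ratio:
                best_k, best_ratio = k, ratio
        if best_k is not None and best_ratio >= min_ratio:
            pairs.append((s.text, tgt[best_k].text))
            j = best_k + 1
    return pairs


if __name__ == "__main__":
    # Toy data with placeholder target text; timestamps are invented.
    en = [
        Segment(1.0, 3.0, "Hello."),
        Segment(4.0, 6.0, "How are you?"),
        Segment(10.0, 12.0, "Goodbye."),
    ]
    ar = [
        Segment(1.2, 3.1, "AR-1"),
        Segment(7.0, 8.0, "AR-extra"),
        Segment(10.1, 12.2, "AR-3"),
    ]
    for pair in align_by_time(en, ar):
        print(pair)
```

In this toy run the second English fragment has no overlapping counterpart and is left unaligned, while the first and third are paired with the target fragments that share their time spans. A real system would also need to handle many-to-one alignments and clock drift between subtitle files, which this sketch ignores.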


Footnotes
2
The data can be found at http://workshop2013.iwslt.org.
 
3
For a broader discussion see also [6, 21].
 
4
This is the official scoring method for the translation tracks into Arabic at IWSLT'13: http://alt.qcri.org/tools/arabic-normalizer.
 
5
We set M to be at most 5 in order to prevent the algorithm from unreasonably iterating up to the last segment looking for a match.
 
Literature
1.
Abdelali, A., Guzmán, F., Sajjad, H., Vogel, S.: The AMARA corpus: building parallel language resources for the educational domain. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2014, Reykjavik, Iceland (2014)
2.
Brown, P.F., Pietra, V.J.D., Pietra, S.A.D., Mercer, R.L.: The mathematics of statistical machine translation: parameter estimation. Comput. Linguist. 19(2), 263–311 (1993)
3.
Cettolo, M., Niehues, J., Stüker, S., Bentivogli, L., Federico, M.: Report on the 10th IWSLT evaluation campaign. In: Proceedings of the International Workshop on Spoken Language Translation, IWSLT 2013, Heidelberg, Germany, pp. 15–24 (2013)
4.
Foster, G., Kuhn, R.: Stabilizing minimum error rate training. In: Proceedings of the Fourth Workshop on Statistical Machine Translation, WMT 2009, Athens, Greece, pp. 242–249 (2009)
5.
Gale, W.A., Church, K.W.: A program for aligning sentences in bilingual corpora. Comput. Linguist. 19(1), 75–102 (1993)
6.
Guzmán, F., Nakov, P., Vogel, S.: Analyzing optimization for statistical machine translation: MERT learns verbosity, PRO learns length. In: Proceedings of the Nineteenth Conference on Computational Natural Language Learning, CoNLL 2015, Beijing, China, pp. 62–72 (2015)
7.
Heafield, K.: KenLM: Faster and smaller language model queries. In: Proceedings of the 6th Workshop on Statistical Machine Translation, WMT 2011, Edinburgh, Scotland, UK, pp. 187–197 (2011)
8.
Hopkins, M., May, J.: Tuning as ranking. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, Edinburgh, Scotland, UK, pp. 1352–1362 (2011)
10.
El Kholy, A., Habash, N.: Orthographic and morphological processing for English-Arabic statistical machine translation. Mach. Transl. 26(1–2), 25–45 (2012)
11.
Koehn, P.: Europarl: a parallel corpus for statistical machine translation. In: Proceedings of the Tenth Machine Translation Summit, MT Summit 2005, Phuket, Thailand, pp. 79–86 (2005)
12.
Koehn, P., Axelrod, A., Mayne, A.B., Callison-Burch, C., Osborne, M., Talbot, D.: Edinburgh system description for the 2005 IWSLT speech translation evaluation. In: Proceedings of the International Workshop on Spoken Language Translation, IWSLT 2005, Pittsburgh, PA, USA (2005)
13.
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E.: Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th Annual Meeting of the ACL: Interactive Poster and Demonstration Sessions, ACL 2007, Prague, Czech Republic, pp. 177–180 (2007)
14.
Koehn, P., Och, F.J., Marcu, D.: Statistical phrase-based translation. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, HLT-NAACL 2003, Edmonton, Canada, vol. 1, pp. 48–54 (2003)
15.
Lavecchia, C., Smaïli, K., Langlois, D.: Building parallel corpora from movies. In: Proceedings of the 4th International Workshop on Natural Language Processing and Cognitive Science, NLPCS 2007, Funchal, Madeira, Portugal (2007)
16.
Mangeot, M., Giguet, E.: Multilingual aligned corpora from movie subtitles. Report in Laboratoire d’Informatique, Systèmes, Traitement de l’Information et de la Connaissance (2005)
17.
Monroe, W., Green, S., Manning, C.D.: Word segmentation of informal Arabic with domain adaptation. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), ACL 2014, Baltimore, MD, USA, pp. 206–211 (2014)
18.
Nakov, P.: Improving English-Spanish statistical machine translation: experiments in domain adaptation, sentence paraphrasing, tokenization, and recasing. In: Proceedings of the Third Workshop on Statistical Machine Translation, WMT 2008, Columbus, Ohio, USA, pp. 147–150 (2008)
19.
Nakov, P., Al Obaidli, F., Guzmán, F., Vogel, S.: Parameter optimization for statistical machine translation: it pays to learn from hard examples. In: Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2013, Hissar, Bulgaria, pp. 504–510 (2013)
20.
Nakov, P., Guzmán, F., Vogel, S.: Optimizing for sentence-level BLEU+1 yields short translations. In: Proceedings of the 24th International Conference on Computational Linguistics, COLING 2012, Mumbai, India, pp. 1979–1994 (2012)
21.
Nakov, P., Guzmán, F., Vogel, S.: A tale about PRO and monsters. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), ACL 2013, Sofia, Bulgaria, pp. 12–17 (2013)
22.
Nakov, P., Ng, H.T.: Improved statistical machine translation for resource-poor languages using related resource-rich languages. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, EMNLP 2009, Singapore, vol. 3, pp. 1358–1367 (2009)
23.
Roth, R., Rambow, O., Habash, N., Diab, M., Rudin, C.: Arabic morphological tagging, diacritization, and lemmatization using lexeme models and feature ranking. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, ACL 2008, Columbus, OH, USA, pp. 117–120 (2008)
24.
Sajjad, H., Guzmán, F., Nakov, P., Abdelali, A., Murray, K., Al Obaidli, F., Vogel, S.: QCRI at IWSLT 2013: experiments in Arabic-English and English-Arabic spoken language translation. In: Proceedings of the 10th International Workshop on Spoken Language Translation, IWSLT 2013, Heidelberg, Germany (2013)
25.
Tiedemann, J.: Building a multilingual parallel subtitle corpus. In: Proceedings of the Computational Linguistics in the Netherlands, CLIN 2007, Nijmegen, Netherlands (2007)
26.
Tiedemann, J.: Improved sentence alignment for movie subtitles. In: Proceedings of the Conference on Recent Advances in Natural Language Processing, RANLP 2007, Borovets, Bulgaria, pp. 582–588 (2007)
27.
Tiedemann, J.: Synchronizing translated movie subtitles. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation, LREC 2008 (2008)
28.
Tiedemann, J.: Parallel data, tools and interfaces in OPUS. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, Istanbul, Turkey, pp. 2214–2218 (2012)
29.
Volk, M.: The automatic translation of film subtitles. A machine translation success story? J. Lang. Technol. Comput. Linguist. (JLCL) 24(3), 115–128 (2009)
30.
Volk, M., Harder, S.: Evaluating MT with translations or translators: what is the difference? In: Proceedings of the Machine Translation Summit XI, MT-Summit 2007, Copenhagen, Denmark (2007)
31.
Wang, P., Nakov, P., Ng, H.T.: Source language adaptation for resource-poor machine translation. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL 2012, Jeju Island, Korea, pp. 286–296 (2012)
32.
Wang, P., Nakov, P., Ng, H.T.: Source language adaptation approaches for resource-poor machine translation. Comput. Linguist. 42, 277–306 (2016)
Metadata
Title
Bi-text Alignment of Movie Subtitles for Spoken English-Arabic Statistical Machine Translation
Authors
Fahad Al-Obaidli
Stephen Cox
Preslav Nakov
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-319-75487-1_11
