2021 | OriginalPaper | Book chapter

Machine Translation Customization via Automatic Training Data Selection from the Web

Written by: Thuy Vu, Alessandro Moschitti

Published in: Advances in Information Retrieval

Publisher: Springer International Publishing

Abstract

Machine translation (MT) systems, especially those designed for industrial settings, are trained on general parallel data derived from the Web. As a result, their style is typically driven by word and structure distributions averaged over many domains. In contrast, MT customers want translations specialized to their own domain, for which they can typically provide text samples. We describe an approach for customizing MT systems to specific domains by selecting data similar to the target customer data and using it to train neural translation models. We build document classifiers from monolingual target data, e.g., data provided by the customers, and use them to select parallel training data from Web-crawled data. Finally, we train MT models on the automatically selected data, obtaining a system specialized to the target domain. We tested our approach on the benchmark from the WMT-18 Translation Task for the News domain, which enables comparison with state-of-the-art MT systems. The results show that our models outperform the top systems while using less data and smaller models.
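
The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the classifier, its features, or the selection threshold, so everything below is an assumption made for illustration: a small Naive Bayes log-odds model stands in for the document classifier, and the names `train_domain_classifier`, `domain_score`, and `select_parallel_data`, the whitespace tokenizer, and the threshold of 0.0 are all hypothetical. The sketch trains on monolingual in-domain text versus general text, then keeps only those parallel pairs whose target side looks in-domain.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, chosen only for brevity.
    return text.lower().split()


def train_domain_classifier(in_domain_docs, general_docs):
    """Fit a two-class multinomial Naive Bayes model with add-one smoothing.

    Returns a dict mapping each token to its log-odds of being in-domain.
    """
    in_counts = Counter(tok for doc in in_domain_docs for tok in tokenize(doc))
    gen_counts = Counter(tok for doc in general_docs for tok in tokenize(doc))
    vocab = set(in_counts) | set(gen_counts)
    in_total = sum(in_counts.values()) + len(vocab)
    gen_total = sum(gen_counts.values()) + len(vocab)
    return {
        tok: math.log((in_counts[tok] + 1) / in_total)
        - math.log((gen_counts[tok] + 1) / gen_total)
        for tok in vocab
    }


def domain_score(model, text):
    """Average per-token log-odds; tokens unseen in training contribute 0."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(model.get(tok, 0.0) for tok in tokens) / len(tokens)


def select_parallel_data(model, pairs, threshold=0.0):
    """Keep (source, target) pairs whose target side scores above threshold."""
    return [(src, tgt) for src, tgt in pairs if domain_score(model, tgt) > threshold]


# Toy data (invented for this example): news-like customer samples
# versus generic Web text, and two candidate German-English pairs.
customer_samples = [
    "the government announced new election results",
    "the minister said the parliament will vote",
]
general_web_text = [
    "buy cheap shoes online free shipping",
    "click here to download the free app",
]
pairs = [
    ("Die Regierung hat die Wahl angekündigt", "the government announced the election"),
    ("Kostenloser Versand für günstige Schuhe", "free shipping for cheap shoes"),
]

model = train_domain_classifier(customer_samples, general_web_text)
selected = select_parallel_data(model, pairs)
```

With this toy data, only the news-like pair survives selection. In a realistic setting the selected pairs would then be passed to an NMT training toolkit; that step is omitted here because the abstract gives no details about it.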

Footnotes
1
As of May 2020, Google Translate provided riunione condominiale, which, although correct, is a bit too formal a term for this kind of meeting.
 
Metadata
Title
Machine Translation Customization via Automatic Training Data Selection from the Web
Written by
Thuy Vu
Alessandro Moschitti
Copyright year
2021
DOI
https://doi.org/10.1007/978-3-030-72113-8_44
