2018 | OriginalPaper | Chapter

Finding Better Subword Segmentation for Neural Machine Translation

Authors : Yingting Wu, Hai Zhao

Published in: Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data

Publisher: Springer International Publishing

Abstract

Across language pairs, word-level neural machine translation (NMT) models with a fixed-size vocabulary share the problem of representing out-of-vocabulary (OOV) words. Common practice replaces all such rare or unknown words with a \(\langle\)UNK\(\rangle\) token, which limits translation performance. Most recent work handles this problem by splitting words into characters or other specially extracted subword units to enable open-vocabulary translation. Byte pair encoding (BPE) is one successful approach that has been shown to be extremely competitive, providing effective subword segmentation for NMT systems. In this paper, we extend BPE-style segmentation to a general unsupervised framework with three statistical measures: frequency (FRQ), accessor variety (AV) and description length gain (DLG). We test our approach on two translation tasks: German to English and Chinese to English. The experimental results show that the AV- and DLG-enhanced systems outperform the FRQ baseline in the frequency-weighted schemes at different significance levels.
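To make the idea of a BPE-style framework with interchangeable statistical measures concrete, the following is a minimal sketch, not the authors' released implementation. It runs the usual greedy merge loop but takes the scoring function as a parameter. `frq_score` is plain pair frequency, as in standard BPE. `av_weighted_score` multiplies frequency by accessor variety (the smaller of the numbers of distinct left and right neighbours); this particular weighting, the function names, the `<w>`/`</w>` boundary markers, and the toy word counts are all assumptions made for illustration, since the abstract does not give the exact formulas. DLG is omitted for the same reason.

```python
from collections import Counter


def pair_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def accessor_variety(vocab, pair):
    """min(#distinct left neighbours, #distinct right neighbours) of `pair`.

    Simplification (assumption): a word boundary counts as one neighbour type.
    """
    left, right = set(), set()
    for symbols in vocab:
        for i in range(len(symbols) - 1):
            if (symbols[i], symbols[i + 1]) == pair:
                left.add(symbols[i - 1] if i > 0 else "<w>")
                right.add(symbols[i + 2] if i + 2 < len(symbols) else "</w>")
    return min(len(left), len(right))


def merge_pair(vocab, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def frq_score(vocab, pairs, pair):
    """FRQ: standard BPE criterion, the pair's corpus frequency."""
    return pairs[pair]


def av_weighted_score(vocab, pairs, pair):
    """Hypothetical frequency-weighted AV score (illustrative assumption)."""
    return pairs[pair] * accessor_variety(vocab, pair)


def learn_merges(word_freqs, num_merges, score):
    """Greedy BPE-style merge learning with a pluggable goodness measure."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = pair_stats(vocab)
        if not pairs:
            break
        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(pairs), key=lambda p: score(vocab, pairs, p))
        merges.append(best)
        vocab = merge_pair(vocab, best)
    return merges


# Toy word counts, invented for this example.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(learn_merges(word_freqs, 3, frq_score))
print(learn_merges(word_freqs, 3, av_weighted_score))
```

On this toy vocabulary the two measures already diverge at the first merge: frequency alone favours the suffix pair `e`+`s`, while the AV-weighted score favours `w`+`e`, which occurs slightly less often but in more varied contexts.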

The source code has been released at https://github.com/Lindsay125/gbpe.

Although DLG is already frequency weighted by definition, our preliminary experiments empirically verified that the proposed extra frequency weight is effective.

https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer.

“++” indicates that the corresponding BLEU score is significantly better than the best score of FRQ\('\)-BPE at significance level p < 0.01; “+” indicates p < 0.05.

Title: Finding Better Subword Segmentation for Neural Machine Translation
Authors: Yingting Wu, Hai Zhao
Publisher: Springer International Publishing
Book: Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data
Print ISBN: 978-3-030-01715-6
Electronic ISBN: 978-3-030-01716-3

Copyright Year: 2018
DOI: https://doi.org/10.1007/978-3-030-01716-3_5
