Skip to main content
Top

2019 | OriginalPaper | Chapter

Enriching Character-Based Neural Machine Translation with Modern Documents for Achieving an Orthography Consistency in Historical Documents

Authors : Miguel Domingo, Francisco Casacuberta

Published in: New Trends in Image Analysis and Processing – ICIAP 2019

Publisher: Springer International Publishing

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

The nature of human language and the lack of a spelling convention make historical documents hard to handle for natural language processing. Spelling normalization tackles this problem by adapting their spelling to modern standards in order to get an orthography consistency. In this work, we compare several character-based machine translation approaches, and propose a method to profit from modern documents to enrich neural machine translation models. We tested our proposal with four different data sets, and observed that the enriched models successfully improved the normalization quality of the neural models. Statistical models, however, yielded a better result.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literature
1.
2.
go back to reference Baron, A., Rayson, P.: VARD2: a tool for dealing with spelling variation in historical corpora. In: Postgraduate Conference in Corpus Linguistics (2008) Baron, A., Rayson, P.: VARD2: a tool for dealing with spelling variation in historical corpora. In: Postgraduate Conference in Corpus Linguistics (2008)
3.
go back to reference Bollmann, M.: Normalization of historical texts with neural network models. Ph.D. thesis, Sprachwissenschaftliches Institut, Ruhr-Universität (2018) Bollmann, M.: Normalization of historical texts with neural network models. Ph.D. thesis, Sprachwissenschaftliches Institut, Ruhr-Universität (2018)
4.
go back to reference Bollmann, M., Søgaard, A.: Improving historical spelling normalization with bi-directional LSTMs and multi-task learning. In: Proceedings of the International Conference on the Computational Linguistics, pp. 131–139 (2016) Bollmann, M., Søgaard, A.: Improving historical spelling normalization with bi-directional LSTMs and multi-task learning. In: Proceedings of the International Conference on the Computational Linguistics, pp. 131–139 (2016)
5.
go back to reference Brown, P.F., Pietra, V.J.D., Pietra, S.A.D., Mercer, R.L.: The mathematics of statistical machine translation: parameter estimation. Comput. Linguist. 19(2), 263–311 (1993) Brown, P.F., Pietra, V.J.D., Pietra, S.A.D., Mercer, R.L.: The mathematics of statistical machine translation: parameter estimation. Comput. Linguist. 19(2), 263–311 (1993)
6.
go back to reference Chatterjee, R., Farajian, M.A., Negri, M., Turchi, M., Srivastava, A., Pal, S.: Multi-source neural automatic post-editing: FBK’s participation in the WMT 2017 ape shared task. In: Proceedings of the Second Conference on Machine Translation, pp. 630–638 (2017) Chatterjee, R., Farajian, M.A., Negri, M., Turchi, M., Srivastava, A., Pal, S.: Multi-source neural automatic post-editing: FBK’s participation in the WMT 2017 ape shared task. In: Proceedings of the Second Conference on Machine Translation, pp. 630–638 (2017)
7.
go back to reference Chung, J., Cho, K., Bengio, Y.: A character-level decoder without explicit segmentation for neural machine translation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 1693–1703 (2016) Chung, J., Cho, K., Bengio, Y.: A character-level decoder without explicit segmentation for neural machine translation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 1693–1703 (2016)
8.
go back to reference Costa-Jussà, M.R., Aldón, D., Fonollosa, J.A.: Chinese-Spanish neural machine translation enhanced with character and word bitmap fonts. Mach. Transl. 31, 35–47 (2017)CrossRef Costa-Jussà, M.R., Aldón, D., Fonollosa, J.A.: Chinese-Spanish neural machine translation enhanced with character and word bitmap fonts. Mach. Transl. 31, 35–47 (2017)CrossRef
9.
go back to reference Costa-Jussà, M.R., Fonollosa, J.A.: Character-based neural machine translation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 357–361 (2016) Costa-Jussà, M.R., Fonollosa, J.A.: Character-based neural machine translation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 357–361 (2016)
10.
go back to reference Domingo, M., Casacuberta, F.: Spelling normalization of historical documents by using a machine translation approach. In: Proceedings of the Annual Conference of the European Association for Machine Translation, pp. 129–137 (2018) Domingo, M., Casacuberta, F.: Spelling normalization of historical documents by using a machine translation approach. In: Proceedings of the Annual Conference of the European Association for Machine Translation, pp. 129–137 (2018)
11.
go back to reference Jehle, F.: Works of Miguel de Cervantes in Old- and Modern-Spelling. Indiana University Purdue University Fort Wayne (2001) Jehle, F.: Works of Miguel de Cervantes in Old- and Modern-Spelling. Indiana University Purdue University Fort Wayne (2001)
12.
go back to reference Gao, Q., Vogel, S.: Parallel implementations of word alignment tool. In: Proceedings of the Association for Computational Linguistics Software Engineering, Testing, and Quality Assurance Workshop, pp. 49–57 (2008) Gao, Q., Vogel, S.: Parallel implementations of word alignment tool. In: Proceedings of the Association for Computational Linguistics Software Engineering, Testing, and Quality Assurance Workshop, pp. 49–57 (2008)
13.
14.
go back to reference Gers, F.A., Schmidhuber, J., Cummins, F.: Learning to forget: continual prediction with LSTM. Neural Comput. 12(10), 2451–2471 (2000)CrossRef Gers, F.A., Schmidhuber, J., Cummins, F.: Learning to forget: continual prediction with LSTM. Neural Comput. 12(10), 2451–2471 (2000)CrossRef
15.
go back to reference Hämäläinen, M., Säily, T., Rueter, J., Tiedemann, J., Mäkelä, E.: Normalizing early English letters to present-day English spelling. In: Proceedings of the Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pp. 87–96 (2018) Hämäläinen, M., Säily, T., Rueter, J., Tiedemann, J., Mäkelä, E.: Normalizing early English letters to present-day English spelling. In: Proceedings of the Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pp. 87–96 (2018)
17.
go back to reference Klein, G., Kim, Y., Deng, Y., Senellart, J., Rush, A.M.: OpenNMT: open-source toolkit for neural machine translation. In: Proceedings of the Association for Computational Linguistics: System Demonstration, pp. 67–72 (2017) Klein, G., Kim, Y., Deng, Y., Senellart, J., Rush, A.M.: OpenNMT: open-source toolkit for neural machine translation. In: Proceedings of the Association for Computational Linguistics: System Demonstration, pp. 67–72 (2017)
18.
go back to reference Koehn, P., et al.: Moses: open source toolkit for statistical machine translation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 177–180 (2007) Koehn, P., et al.: Moses: open source toolkit for statistical machine translation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 177–180 (2007)
19.
go back to reference Koehn, P., Och, F.J., Marcu, D.: Statistical phrase-based translation. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pp. 48–54 (2003) Koehn, P., Och, F.J., Marcu, D.: Statistical phrase-based translation. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pp. 48–54 (2003)
20.
go back to reference Korchagina, N.: Normalizing medieval German texts: from rules to deep learning. In: Proceedings of the Nordic Conference on Computational Linguistics Workshop on Processing Historical Language, pp. 12–17 (2017) Korchagina, N.: Normalizing medieval German texts: from rules to deep learning. In: Proceedings of the Nordic Conference on Computational Linguistics Workshop on Processing Historical Language, pp. 12–17 (2017)
21.
go back to reference Laing, M.: The linguistic analysis of medieval vernacular texts: Two projects at Edinburgh’. In: Rissanen, M., Kytd, M., Wright, S. (eds.) Corpora across the Centuries: Proceedings of the First International Colloquium on English Diachronic Corpora, vol. 25427, pp. 121–141. St Catharine’s College Cambridge (1993) Laing, M.: The linguistic analysis of medieval vernacular texts: Two projects at Edinburgh’. In: Rissanen, M., Kytd, M., Wright, S. (eds.) Corpora across the Centuries: Proceedings of the First International Colloquium on English Diachronic Corpora, vol. 25427, pp. 121–141. St Catharine’s College Cambridge (1993)
22.
23.
go back to reference Lison, P., Tiedemann, J.: OpenSubtitles 2016: extracting large parallel corpora from movie and TV subtitles. In: Proceedings of the International Conference on Language Resources Association (2016) Lison, P., Tiedemann, J.: OpenSubtitles 2016: extracting large parallel corpora from movie and TV subtitles. In: Proceedings of the International Conference on Language Resources Association (2016)
25.
go back to reference Ljubešic, N., Zupan, K., Fišer, D., Erjavec, T.: Normalising slovene data: historical texts vs. user-generated content. In: Proceedings of the Conference on Natural Language Processing, pp. 146–155 (2016) Ljubešic, N., Zupan, K., Fišer, D., Erjavec, T.: Normalising slovene data: historical texts vs. user-generated content. In: Proceedings of the Conference on Natural Language Processing, pp. 146–155 (2016)
26.
go back to reference Nakov, P., Tiedemann, J.: Combining word-level and character-level models for machine translation between closely-related languages. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 301–305 (2012) Nakov, P., Tiedemann, J.: Combining word-level and character-level models for machine translation between closely-related languages. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 301–305 (2012)
27.
go back to reference Och, F.J.: Minimum error rate training in statistical machine translation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 160–167 (2003) Och, F.J.: Minimum error rate training in statistical machine translation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 160–167 (2003)
28.
go back to reference Och, F.J., Ney, H.: Discriminative training and maximum entropy models for statistical machine translation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 295–302 (2002) Och, F.J., Ney, H.: Discriminative training and maximum entropy models for statistical machine translation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 295–302 (2002)
29.
go back to reference Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Comput. Linguist. 29(1), 19–51 (2003)CrossRef Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Comput. Linguist. 29(1), 19–51 (2003)CrossRef
30.
go back to reference Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002) Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)
31.
go back to reference Poncelas, A., Shterionov, D., Way, A., Maillette de Buy Wenniger, G., Passban, P.: Investigation backtranslation in neural machine translation. In: Proceedings of the Annual Conference of the European Association for Machine Translation, pp. 249–258 (2018) Poncelas, A., Shterionov, D., Way, A., Maillette de Buy Wenniger, G., Passban, P.: Investigation backtranslation in neural machine translation. In: Proceedings of the Annual Conference of the European Association for Machine Translation, pp. 249–258 (2018)
32.
go back to reference Porta, J., Sancho, J.L., Gómez, J.: Edit transducers for spelling variation in old Spanish. In: Proceedings of the Workshop on Computational Historical Linguistics, pp. 70–79 (2013) Porta, J., Sancho, J.L., Gómez, J.: Edit transducers for spelling variation in old Spanish. In: Proceedings of the Workshop on Computational Historical Linguistics, pp. 70–79 (2013)
33.
go back to reference Post, M.: A call for clarity in reporting BLEU scores. In: Proceedings of the Third Conference on Machine Translation, pp. 186–191 (2018) Post, M.: A call for clarity in reporting BLEU scores. In: Proceedings of the Third Conference on Machine Translation, pp. 186–191 (2018)
34.
go back to reference Riezler, S., Maxwell, J.T.: On some pitfalls in automatic evaluation and significance testing for MT. In: Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 57–64 (2005) Riezler, S., Maxwell, J.T.: On some pitfalls in automatic evaluation and significance testing for MT. In: Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 57–64 (2005)
36.
go back to reference Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323(6088), 533 (1986)CrossRef Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323(6088), 533 (1986)CrossRef
37.
go back to reference Scherrer, Y., Erjavec, T.: Modernizing historical Slovene words with character-based SMT. In: Proceedings of the Biennial International Workshop on Balto-Slavic Natural Language Processing, pp. 58–62 (2013) Scherrer, Y., Erjavec, T.: Modernizing historical Slovene words with character-based SMT. In: Proceedings of the Biennial International Workshop on Balto-Slavic Natural Language Processing, pp. 58–62 (2013)
38.
go back to reference Sennrich, R., Haddow, B., Birch, A.: Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709 (2015) Sennrich, R., Haddow, B., Birch, A.: Improving neural machine translation models with monolingual data. arXiv preprint arXiv:​1511.​06709 (2015)
39.
go back to reference Snover, M., Dorr, B., Schwartz, R., Micciulla, L., Makhoul, J.: A study of translation edit rate with targeted human annotation. In: Proceedings of the Association for Machine Translation in the Americas, pp. 223–231 (2006) Snover, M., Dorr, B., Schwartz, R., Micciulla, L., Makhoul, J.: A study of translation edit rate with targeted human annotation. In: Proceedings of the Association for Machine Translation in the Americas, pp. 223–231 (2006)
40.
go back to reference Stolcke, A.: SRILM - an extensible language modeling toolkit. In: Proceedings of the International Conference on Spoken Language Processing, pp. 257–286 (2002) Stolcke, A.: SRILM - an extensible language modeling toolkit. In: Proceedings of the International Conference on Spoken Language Processing, pp. 257–286 (2002)
41.
go back to reference Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. Proc. Adv. Neural Inf. Process. Syst. 27, 3104–3112 (2014) Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. Proc. Adv. Neural Inf. Process. Syst. 27, 3104–3112 (2014)
42.
go back to reference Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015) Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
43.
go back to reference Tang, G., Cap, F., Pettersson, E., Nivre, J.: An evaluation of neural machine translation models on historical spelling normalization. In: Proceedings of the International Conference on Computational Linguistics, pp. 1320–1331 (2018) Tang, G., Cap, F., Pettersson, E., Nivre, J.: An evaluation of neural machine translation models on historical spelling normalization. In: Proceedings of the International Conference on Computational Linguistics, pp. 1320–1331 (2018)
44.
go back to reference Tiedemann, J.: Character-based PSMT for closely related languages. In: Proceedings of the Annual Conference of the European Association for Machine Translation, pp. 12–19 (2009) Tiedemann, J.: Character-based PSMT for closely related languages. In: Proceedings of the Annual Conference of the European Association for Machine Translation, pp. 12–19 (2009)
45.
go back to reference Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017) Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
46.
go back to reference Wu, Y., et al.: Google’s neural machine translation system: bridging the gap between human and machine translation (2016). arXiv:1609.08144 Wu, Y., et al.: Google’s neural machine translation system: bridging the gap between human and machine translation (2016). arXiv:​1609.​08144
Metadata
Title
Enriching Character-Based Neural Machine Translation with Modern Documents for Achieving an Orthography Consistency in Historical Documents
Authors
Miguel Domingo
Francisco Casacuberta
Copyright Year
2019
DOI
https://doi.org/10.1007/978-3-030-30754-7_7

Premium Partner