
2023 | Original Paper | Book Chapter

Translated Texts Under the Lens: From Machine Translation Detection to Source Language Identification

Written by: Massimo La Morgia, Alessandro Mei, Eugenio Nerio Nemmi, Luca Sabatini, Francesco Sassi

Published in: Advances in Intelligent Data Analysis XXI

Publisher: Springer Nature Switzerland

Abstract

Machine translation systems are now widely used to break down linguistic barriers. People from different countries who speak different languages can interact with each other thanks to state-of-the-art translators from prominent software companies like Google and Microsoft. However, these tools are also used to expand the audience of phishing attacks and scam emails, or to generate fake reviews that promote a product on different e-commerce platforms. In all these cases, knowing whether a text has been translated can be crucial. In this work, we tackle the detection of translated texts from different angles. In addition to addressing the classic task of machine translation detection, we investigate and find common patterns across different machine translation systems that are unrelated to the source language of the original text. We then show that it is possible to identify the machine translation system used to generate a translated text with high performance (F1-score of 88.5%), and that it is also possible to identify the source language of the original text. We evaluate our models on two datasets: Books, a new dataset we built from scratch from excerpts of novels, and the well-known Europarl dataset, based on the proceedings of the European Parliament.
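
For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of how an F1-score can be computed for a multi-class task such as identifying the translation system. The abstract does not say how the score is averaged across classes, so macro-averaging is an assumption here. The labels `system_a` and `system_b` and the toy predictions are purely illustrative and are not taken from the paper.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return {label: F1} computed from parallel lists of true and predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_label, pred_label in zip(y_true, y_pred):
        if true_label == pred_label:
            tp[true_label] += 1
        else:
            fp[pred_label] += 1  # predicted class gains a false positive
            fn[true_label] += 1  # true class gains a false negative
    scores = {}
    for label in set(y_true) | set(y_pred):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        # F1 is the harmonic mean of precision and recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical toy labels, not data from the paper
y_true = ["system_a", "system_a", "system_b", "system_b"]
y_pred = ["system_a", "system_b", "system_b", "system_b"]

per_class = f1_per_class(y_true, y_pred)
overall = macro_f1(y_true, y_pred)
print(per_class, overall)
```

Macro-averaging weights every class equally regardless of its size; a weighted or micro average would give different numbers on imbalanced data.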

Metadata
Title
Translated Texts Under the Lens: From Machine Translation Detection to Source Language Identification
Written by
Massimo La Morgia
Alessandro Mei
Eugenio Nerio Nemmi
Luca Sabatini
Francesco Sassi
Copyright Year
2023
DOI
https://doi.org/10.1007/978-3-031-30047-9_18
