2018 | OriginalPaper | Chapter

Bilingual Contexts from Comparable Corpora to Mine for Translations of Collocations

Authors: Shiva Taslimipoor, Ruslan Mitkov, Gloria Corpas Pastor, Afsaneh Fazly

Published in: Computational Linguistics and Intelligent Text Processing

Publisher: Springer International Publishing

Abstract

Because parallel data are scarce for many languages, we propose a methodology that uses comparable corpora to find translation equivalents for collocations, a specific type of difficult-to-translate multi-word expression. Finding translations is known to be more difficult for collocations than for single words. Our method extracts bilingual contexts (English–Spanish in our case) and builds a distributional word representation model from them. We show that the bilingual context construction is effective for learning translation equivalents and that our method outperforms a simplified distributional similarity baseline in finding them.
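
The idea of bilingual contexts can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the chapter: the comparable corpus is given as document-aligned pairs, collocations are pre-joined into single tokens (for example `take_place`), and every token's context is the set of other tokens on both sides of its document pair. It also swaps in sparse count vectors with cosine similarity in place of a trained embedding model. The toy corpus and the names `bilingual_context_vectors`, `cosine`, and `rank_translations` are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def bilingual_context_vectors(doc_pairs):
    """Map each language-tagged token to counts of its bilingual contexts.

    Assumption: a token's contexts are all other tokens found on either
    side of the (English, Spanish) document pair it occurs in.
    """
    vectors = defaultdict(Counter)
    for en_tokens, es_tokens in doc_pairs:
        # Tag tokens with their language so identical strings do not collide.
        tagged = ["en:" + t for t in en_tokens] + ["es:" + t for t in es_tokens]
        pair_counts = Counter(tagged)
        for target in pair_counts:
            for context, count in pair_counts.items():
                if context != target:
                    vectors[target][context] += count
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_translations(source, vectors, target_prefix="es:"):
    """Rank target-language tokens by similarity to the source token."""
    source_vec = vectors.get(source, Counter())
    scored = [
        (token, cosine(source_vec, vec))
        for token, vec in vectors.items()
        if token.startswith(target_prefix)
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy corpus of document-aligned comparable pairs.
doc_pairs = [
    (["meeting", "take_place", "city"], ["reunión", "tener_lugar", "ciudad"]),
    (["concert", "take_place", "park"], ["concierto", "tener_lugar", "parque"]),
    (["meeting", "city", "mayor"], ["reunión", "ciudad", "alcalde"]),
]

vectors = bilingual_context_vectors(doc_pairs)
ranked = rank_translations("en:take_place", vectors)
for token, score in ranked[:3]:
    print(f"{token}\t{score:.3f}")
```

Because both languages share one context vocabulary, an English collocation and a Spanish candidate can be compared directly, without first translating context words through a seed lexicon. A monolingual distributional baseline would typically need that extra mapping step.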

We use the lexicon built by applying GIZA++ to the Spanish–English portion of Europarl.

The software is available on the websites of the authors of [13].

We set the frequency threshold to 10 in our experiments.

http://www.abc.es and http://www.abc.net.au.

http://es.noticias.yahoo.com and http://uk.news.yahoo.com.

http://cnnespanol.cnn.com and http://cnn.com.

http://www.sport.es/es and http://www.sport-english.com/en.

http://es.euronews.com and http://euronews.net.

http://www.accurat-project.eu.

The comparable corpora that we prepared are available at https://github.com/shivaat/EnEsCC.

Note that we add noise in both Spanish–English and English–Spanish directions.

1. Aker, A., Kanoulas, E., Gaizauskas, R.: A light way to collect comparable corpora from the web. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012) (2012)

2. Bannard, C.: A measure of syntactic flexibility for automatically identifying multiword expressions in corpora. In: Proceedings of the Workshop on a Broader Perspective on Multiword Expressions, pp. 1–8. Association for Computational Linguistics (2007)

3. Bouamor, D., Semmar, N., Zweigenbaum, P.: Identifying bilingual multi-word expressions for statistical machine translation. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey. European Language Resources Association (ELRA) (2012)

4. Bouamor, D., Semmar, N., Zweigenbaum, P.: Context vector disambiguation for bilingual lexicon extraction from comparable corpora. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, Short Papers, vol. 2, pp. 759–764. Association for Computational Linguistics (2013)

5. Corpas Pastor, G.: Collocations in e-bilingual dictionaries: from underlying theoretical assumptions to practical lexicography and translation issues. In: Torner, S., Bernal, E. (eds.) Collocations and Other Lexical Combinations in Spanish: Theoretical and Applied Approaches, pp. 173–199. Routledge, Abingdon (2017)

6. Evert, S.: The statistics of word cooccurrences: word pairs and collocations. Ph.D. thesis, Universität Stuttgart, Holzgartenstr. 16, 70174 Stuttgart (2005)

7. Fazly, A.: Automatic acquisition of lexical knowledge about multiword predicates. Ph.D. thesis, Department of Computer Science, University of Toronto (2007)

8. Fung, P.: A statistical view on bilingual lexicon extraction: from parallel corpora to non-parallel corpora. In: Farwell, D., Gerber, L., Hovy, E. (eds.) AMTA 1998. LNCS (LNAI), vol. 1529, pp. 1–17. Springer, Heidelberg (1998). https://doi.org/10.1007/3-540-49478-2_1

9. Fung, P., McKeown, K.: Finding terminology translations from non-parallel corpora. In: Proceedings of the 5th Annual Workshop on Very Large Corpora, pp. 192–202 (1997)

10. Ion, R.: PEXACC: a parallel sentence mining algorithm from comparable corpora. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012) (2012)

11. Ismail, A., Manandhar, S.: Bilingual lexicon extraction from comparable corpora using in-domain terms. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 481–489. Association for Computational Linguistics (2010)

12. Koehn, P.: Europarl: a parallel corpus for statistical machine translation. In: Conference Proceedings: The Tenth Machine Translation Summit, Phuket, Thailand, pp. 79–86 (2005)

13. Levy, O., Goldberg, Y.: Dependency-based word embeddings. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, Maryland, Short Papers, vol. 2, pp. 302–308. Association for Computational Linguistics (2014)

14. McEnery, A., Xiao, R.: Parallel and comparable corpora: what is happening. In: Incorporating Corpora: The Linguist and the Translator, pp. 18–31 (2007)

15. Mendoza Rivera, O., Mitkov, R., Corpas Pastor, G.: A flexible framework for collocation retrieval and translation from parallel and comparable corpora. In: Workshop on Multi-word Units in Machine Translation and Translation Technology (2013)

16. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)

17. Munteanu, D.S., Marcu, D.: Improving machine translation performance by exploiting non-parallel corpora. Comput. Linguist. 31(4), 477–504 (2005)

18. Pal, S., Pakray, P., Naskar, S.K.: Automatic building and using parallel resources for SMT from comparable corpora. In: Proceedings of the 3rd Workshop on Hybrid Approaches to Translation (HyTra) @ EACL, pp. 48–57 (2014)

19. Pecina, P.: An extensive empirical study of collocation extraction methods. In: Proceedings of the ACL Student Research Workshop, ACLstudent 2005, Stroudsburg, PA, USA, pp. 13–18. Association for Computational Linguistics (2005)

20. Pekar, V., Mitkov, R., Blagoev, D., Mulloni, A.: Finding translations for low-frequency words in comparable corpora. Mach. Transl. 20(4), 247–266 (2006)

21. Rapp, R.: Automatic identification of word translations from unrelated English and German corpora. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, pp. 519–526. Association for Computational Linguistics (1999)

22. Rapp, R., Sharoff, S.: Extracting multiword translations from aligned comparable documents. In: Proceedings of the 3rd Workshop on Hybrid Approaches to Translation (HyTra) @ EACL 2014, Gothenburg, Sweden, pp. 83–91 (2014)

23. Sag, I.A., Baldwin, T., Bond, F., Copestake, A., Flickinger, D.: Multiword expressions: a pain in the neck for NLP. In: Gelbukh, A. (ed.) CICLing 2002. LNCS, vol. 2276, pp. 1–15. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45715-1_1

24. Smadja, F.: Retrieving collocations from text: Xtract. Comput. Linguist. 19, 143–177 (1993)

25. Smith, J.R., Quirk, C., Toutanova, K.: Extracting parallel sentences from comparable corpora using document level alignment. In: Proceedings of Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2010), pp. 403–411 (2010)

26. Su, F., Babych, B.: Measuring comparability of documents in non-parallel corpora for efficient extraction of (semi-)parallel translation equivalents. In: Proceedings of the Joint Workshop on Exploiting Synergies Between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra), EACL 2012, Stroudsburg, PA, USA, pp. 10–19. Association for Computational Linguistics (2012)

27. Tiedemann, J.: Extraction of translation equivalents from parallel corpora. In: Proceedings of the 11th Nordic Conference on Computational Linguistics, pp. 120–128 (1998)

Title: Bilingual Contexts from Comparable Corpora to Mine for Translations of Collocations
Authors: Shiva Taslimipoor, Ruslan Mitkov, Gloria Corpas Pastor, Afsaneh Fazly
Publisher: Springer International Publishing
Book: Computational Linguistics and Intelligent Text Processing
Print ISBN: 978-3-319-75486-4

Electronic ISBN: 978-3-319-75487-1

Copyright Year: 2018
DOI: https://doi.org/10.1007/978-3-319-75487-1_10
