2023 | OriginalPaper | Chapter

Neural Approaches to Multilingual Information Retrieval

Authors: Dawn Lawrie, Eugene Yang, Douglas W. Oard, James Mayfield

Published in: Advances in Information Retrieval

Publisher: Springer Nature Switzerland

Abstract

Providing access to information across languages has been a goal of Information Retrieval (IR) for decades. While progress has been made on Cross-Language IR (CLIR), where queries are expressed in one language and documents in another, the multilingual IR (MLIR) task of creating a single ranked list of documents across many languages is considerably more challenging. This paper investigates whether advances in neural document translation and pretrained multilingual neural language models enable improvements in the state of the art over earlier MLIR techniques. The results show that combining neural document translation with neural ranking yields the best Mean Average Precision (MAP). However, 98% of that MAP score can be achieved with an 84% reduction in indexing time by using a pretrained XLM-R multilingual language model to index documents in their native language, and the 2% difference in effectiveness is not statistically significant. Key to achieving these results for MLIR is fine-tuning XLM-R using mixed-language batches from neural translations of MS MARCO passages.
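
The mixed-language batching mentioned in the abstract can be pictured with a small sketch. It follows one plausible reading of footnote 2 (each batch holds the same query paired with passages translated into each language). The `Triple` and `Example` types, the function name `mixed_language_batches`, the language codes, and the toy passages are all hypothetical; none of them is taken from the authors' code, and no model or training loop is shown.

```python
from typing import Dict, Iterator, List, NamedTuple


class Triple(NamedTuple):
    """One MS MARCO-style training triple with translations (hypothetical layout)."""
    query: str                # query text, left untranslated
    positive: Dict[str, str]  # language code -> translated relevant passage
    negative: Dict[str, str]  # language code -> translated non-relevant passage


class Example(NamedTuple):
    """One (query, positive, negative) training example in a single document language."""
    query: str
    language: str
    positive: str
    negative: str


def mixed_language_batches(
    triples: List[Triple], languages: List[str], queries_per_batch: int
) -> Iterator[List[Example]]:
    """Yield batches in which every query appears once per document language."""
    for start in range(0, len(triples), queries_per_batch):
        batch: List[Example] = []
        for triple in triples[start:start + queries_per_batch]:
            # Assumption: the same query is paired with each translation in turn.
            for lang in languages:
                batch.append(
                    Example(triple.query, lang, triple.positive[lang], triple.negative[lang])
                )
        yield batch


# Toy data: placeholder strings, not real MS MARCO passages or translations.
langs = ["zh", "fa", "ru"]
toy = [
    Triple(
        query=f"query {i}",
        positive={lang: f"pos {i} [{lang}]" for lang in langs},
        negative={lang: f"neg {i} [{lang}]" for lang in langs},
    )
    for i in range(3)
]
batches = list(mixed_language_batches(toy, langs, queries_per_batch=2))
```

Whether the languages are interleaved per query, as here, or sampled in some other way is an assumption of this sketch; the property it is meant to show is only that a single batch mixes several document languages.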

Footnotes
2
Batches include the same query paired with document passages translated into each language.
 
5
Although Marian [23] is faster than Sockeye 2, benchmark results from Sockeye 1 [20] and Sockeye 2 [19] confirm that Sockeye 2 is within a factor of 2 to 3 of Marian’s speed, leaving our conclusions unchanged.
 
Literature
1.
Aljlayl, M., Frieder, O.: Effective Arabic-English cross-language information retrieval via machine-readable dictionaries and machine translation. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, pp. 295–302 (2001)
4.
Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media, Inc., Sebastopol (2009)
10.
Choudhury, M., Deshpande, A.: How linguistically fair are multilingual pre-trained language models? In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 12710–12718 (2021)
13.
Dai, Z., Callan, J.: Deeper text understanding for IR with contextual neural language modeling. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 985–988 (2019)
14.
Darwish, K., Oard, D.W.: Probabilistic structured query methods. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 338–344 (2003)
15.
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Association for Computational Linguistics, Minneapolis, June 2019. https://aclanthology.org/N19-1423
16.
Domhan, T., Denkowski, M., Vilar, D., Niu, X., Hieber, F., Heafield, K.: The Sockeye 2 neural machine translation toolkit at AMTA 2020. In: Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), pp. 110–115, Association for Machine Translation in the Americas, Virtual, October 2020
17.
18.
Granell, X.: Multilingual Information Management: Information, Technology and Translators. Chandos Publishing, Cambridge (2014)
21.
Honnibal, M., Montani, I., Van Landeghem, S., Boyd, A.: spaCy: industrial-strength natural language processing in Python. Technical report, Explosion (2020)
22.
Hull, D.A., Grefenstette, G.: Querying across languages: a dictionary-based approach to multilingual information retrieval. In: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 49–57 (1996)
23.
Junczys-Dowmunt, M., Heafield, K., Hoang, H., Grundkiewicz, R., Aue, A.: Marian: cost-effective high-quality neural machine translation in C++. arXiv preprint arXiv:1805.12096 (2018)
25.
Kassner, N., Dufter, P., Schütze, H.: Multilingual LAMA: investigating knowledge in multilingual pretrained language models. arXiv preprint arXiv:2102.00894 (2021)
26.
Khattab, O., Zaharia, M.: ColBERT: efficient and effective passage search via contextualized late interaction over BERT. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 39–48 (2020)
30.
McCarley, J.S.: Should we translate the documents or the queries in cross-language information retrieval? In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pp. 208–214 (1999)
31.
Mitamura, T., et al.: Overview of the NTCIR-7 ACLIA tasks: advanced cross-lingual information access. In: NTCIR (2008)
34.
Oard, D.W., Dorr, B.J.: A survey of multilingual text retrieval. Technical report, UMIACS-TR-96019 CS-TR-3615, UMIACS (1996)
36.
Peters, C., Braschler, M.: The importance of evaluation for cross-language system development: the CLEF experience. In: LREC (2002)
39.
Rehder, B., Littman, M.L., Dumais, S.T., Landauer, T.K.: Automatic 3-language cross-language information retrieval with latent semantic indexing. In: TREC, pp. 233–239. Citeseer (1997)
40.
Robertson, S., Zaragoza, H., et al.: The probabilistic relevance framework: BM25 and beyond. Found. Trends® Inf. Retrieval 3(4), 333–389 (2009)
41.
Santhanam, K., Khattab, O., Potts, C., Zaharia, M.: PLAID: an efficient engine for late interaction retrieval. arXiv preprint arXiv:2205.09707 (2022)
43.
Si, L., Callan, J., Cetintas, S., Yuan, H.: An effective and efficient results merging strategy for multilingual information retrieval in federated search environments. Inf. Retrieval 11(1), 1–24 (2008)
45.
Tsai, M.F., Wang, Y.T., Chen, H.H.: A study of learning a merge model for multilingual information retrieval. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 195–202 (2008)
46.
Xu, H., Van Durme, B., Murray, K.: BERT, mBERT, or BiBERT? A study on contextualized embeddings for neural machine translation. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6663–6675. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, November 2021. https://aclanthology.org/2021.emnlp-main.534
47.
48.
Yang, E., Nair, S., Chandradevan, R., Iglesias-Flores, R., Oard, D.W.: C3: continued pretraining with contrastive weak supervision for cross language ad-hoc retrieval. In: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) (2022). https://arxiv.org/abs/2204.11989
49.
Zhang, X., Ma, X., Shi, P., Lin, J.: Mr. TyDi: a multi-lingual benchmark for dense retrieval. In: Proceedings of the 1st Workshop on Multilingual Representation Learning, pp. 127–137. Association for Computational Linguistics, Punta Cana, Dominican Republic, November 2021. https://aclanthology.org/2021.mrl-1.12
Metadata
Title
Neural Approaches to Multilingual Information Retrieval
Authors
Dawn Lawrie
Eugene Yang
Douglas W. Oard
James Mayfield
Copyright Year
2023
DOI
https://doi.org/10.1007/978-3-031-28244-7_33