
2022 | OriginalPaper | Chapter

Match Your Words! A Study of Lexical Matching in Neural Information Retrieval

Authors : Thibault Formal, Benjamin Piwowarski, Stéphane Clinchant

Published in: Advances in Information Retrieval

Publisher: Springer International Publishing


Abstract

Neural Information Retrieval models hold the promise to replace lexical matching models, e.g., BM25, in modern search engines. While their capabilities have fully shone on in-domain datasets such as MS MARCO, they have recently been challenged in out-of-domain, zero-shot settings (the BEIR benchmark), calling into question their actual generalization capabilities compared to bag-of-words approaches. In particular, we ask whether these shortcomings could (partly) be a consequence of the inability of neural IR models to perform lexical matching off-the-shelf. In this work, we propose a measure of discrepancy between the lexical matching performed by any (neural) model and an “ideal” one. Based on this, we study the behavior of different state-of-the-art neural IR models, focusing on whether they are able to perform lexical matching when it is actually useful, i.e., for important terms. Overall, we show that neural IR models fail to properly generalize term importance to out-of-domain collections or to terms almost unseen during training.
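The abstract does not reproduce the paper's formal discrepancy measure, so the following is only an illustrative sketch under stated assumptions: it implements the standard BM25 term weight (the lexical baseline named in the abstract) and a toy pairwise-disagreement score between a model's per-term scores and a reference scoring. All function names, parameters, and numbers are hypothetical and are not the authors' definitions.

```python
import math

def bm25_term_weight(tf, df, N, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Standard BM25 weight of one query term in one document.

    tf: term frequency in the document; df: document frequency of the term;
    N: collection size; k1, b: the usual BM25 hyperparameters.
    """
    idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

def rank_discrepancy(model_scores, ideal_scores):
    """Fraction of term pairs ordered differently by the two scorings.

    A toy stand-in for a model-vs-ideal lexical-matching discrepancy:
    0.0 means identical term orderings, 1.0 means fully reversed.
    """
    terms = list(model_scores)
    pairs = [(a, b) for i, a in enumerate(terms) for b in terms[i + 1:]]
    if not pairs:
        return 0.0
    flips = sum(
        (model_scores[a] - model_scores[b])
        * (ideal_scores[a] - ideal_scores[b]) < 0
        for a, b in pairs
    )
    return flips / len(pairs)
```

For instance, a rare term (low document frequency) receives a much larger BM25 weight than a common one at equal term frequency, which is exactly the kind of term-importance signal the study asks whether neural models reproduce.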


Footnotes
1. We excluded doc2query-T5 from the analysis, due to the high computational cost of obtaining the expanded collections.
 
Metadata
Copyright Year
2022
DOI
https://doi.org/10.1007/978-3-030-99739-7_14