
2022 | OriginalPaper | Chapter

Match Your Words! A Study of Lexical Matching in Neural Information Retrieval

Authors : Thibault Formal, Benjamin Piwowarski, Stéphane Clinchant

Published in: Advances in Information Retrieval

Publisher: Springer International Publishing


Abstract

Neural Information Retrieval models hold the promise to replace lexical matching models, e.g., BM25, in modern search engines. While their capabilities have fully shone on in-domain datasets such as MS MARCO, they have recently been challenged in out-of-domain, zero-shot settings (the BEIR benchmark), calling into question their actual generalization capabilities compared to bag-of-words approaches. In particular, we ask whether these shortcomings could (partly) be a consequence of the inability of neural IR models to perform lexical matching off-the-shelf. In this work, we propose a measure of discrepancy between the lexical matching performed by any (neural) model and an “ideal” one. Based on this, we study the behavior of different state-of-the-art neural IR models, focusing on whether they are able to perform lexical matching when it is actually useful, i.e., for important terms. Overall, we show that neural IR models fail to properly generalize term importance to out-of-domain collections or to terms almost unseen during training.
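The abstract does not reproduce the paper's formal discrepancy measure, so the following is only an illustrative sketch under stated assumptions: it implements the standard BM25 term weight (the lexical baseline named in the abstract) and a toy pairwise-disagreement score between a model's per-term scores and a reference scoring. All function names, parameters, and numbers are hypothetical and are not the authors' definitions.

```python
import math

def bm25_term_weight(tf, df, N, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Standard BM25 weight of one query term in one document.

    tf: term frequency in the document; df: document frequency of the term;
    N: collection size; k1, b: the usual BM25 hyperparameters.
    """
    idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

def rank_discrepancy(model_scores, ideal_scores):
    """Fraction of term pairs ordered differently by the two scorings.

    A toy stand-in for a model-vs-ideal lexical-matching discrepancy:
    0.0 means identical term orderings, 1.0 means fully reversed.
    """
    terms = list(model_scores)
    pairs = [(a, b) for i, a in enumerate(terms) for b in terms[i + 1:]]
    if not pairs:
        return 0.0
    flips = sum(
        (model_scores[a] - model_scores[b])
        * (ideal_scores[a] - ideal_scores[b]) < 0
        for a, b in pairs
    )
    return flips / len(pairs)
```

For instance, a rare term (low document frequency) receives a much larger BM25 weight than a common one at equal term frequency, which is exactly the kind of term-importance signal the study asks whether neural models reproduce.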


Footnotes
1. We excluded doc2query-T5 from the analysis, due to the high computational cost of obtaining the expanded collections.
 
Metadata
Copyright Year
2022
DOI
https://doi.org/10.1007/978-3-030-99739-7_14