Published in: Discover Computing 4/2022

06.08.2022

Highlighting exact matching via marking strategies for ad hoc document ranking with pretrained contextualized language models

Authors: Lila Boualili, Jose G. Moreno, Mohand Boughanem


Abstract

Pretrained language models (PLMs), exemplified by BERT, have proven remarkably effective for ad hoc ranking. Unlike pre-BERT models, which required specialized neural components to capture different aspects of query-document relevance, PLMs are based solely on transformers, in which attention is the only mechanism for extracting signals from term interactions. Thanks to the transformer's cross-match attention, BERT has been found to be an effective soft matching model. However, beyond semantic matching, exact matching remains an essential signal for assessing the relevance of a document to an information-seeking query. We hypothesize that BERT might benefit from explicit exact-match cues to better adapt to the relevance classification task. In this work, we explore strategies for integrating exact matching signals using marker tokens that highlight exact term matches between the query and the document. We find that this simple marking approach significantly improves over the common vanilla baseline. We empirically demonstrate the effectiveness of our approach through extensive experiments on three standard ad hoc benchmarks. Results show that the explicit exact-match cues conveyed by marker tokens help both BERT and ELECTRA variants achieve higher or at least comparable performance. Our findings indicate that traditional information retrieval cues such as exact matching remain valuable for large pretrained contextualized models such as BERT.
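
To make the marking strategy concrete, the following minimal sketch shows one plausible implementation: every document token whose lowercased, punctuation-stripped form exactly matches a query term is wrapped in a pair of marker tokens before the query-document pair is fed to the cross-encoder. The marker strings "[e]" and "[/e]", the whitespace tokenization, and the absence of stemming or stopword filtering are our illustrative assumptions, not necessarily the paper's exact recipe.

    import re

    def mark_exact_matches(query: str, document: str,
                           open_marker: str = "[e]",
                           close_marker: str = "[/e]") -> str:
        """Wrap document tokens that exactly match a query term in marker tokens."""
        # Collect the query's terms, lowercased for case-insensitive matching.
        query_terms = {t.lower() for t in re.findall(r"\w+", query)}
        marked = []
        for token in document.split():
            # Strip punctuation so "ranking." still matches the query term "ranking".
            core = re.sub(r"\W+", "", token).lower()
            if core and core in query_terms:
                marked.append(f"{open_marker} {token} {close_marker}")
            else:
                marked.append(token)
        return " ".join(marked)

    query = "exact matching for ad hoc ranking"
    document = "Exact term matching remains a strong relevance signal in ranking."
    print(mark_exact_matches(query, document))
    # -> [e] Exact [/e] term [e] matching [/e] remains a strong relevance
    #    signal in [e] ranking. [/e]

In a full reranking pipeline, the marked document would then be paired with the query in the usual cross-encoder input ([CLS] query [SEP] marked document [SEP]), and the marker strings would typically be registered as special tokens so that the subword tokenizer does not split them apart.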

Metadata
Title
Highlighting exact matching via marking strategies for ad hoc document ranking with pretrained contextualized language models
Authors
Lila Boualili
Jose G. Moreno
Mohand Boughanem
Publication date
06.08.2022
Publisher
Springer Netherlands
Published in
Discover Computing / Issue 4/2022
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI
https://doi.org/10.1007/s10791-022-09414-x
