2022 | Original Paper | Book Chapter

Establishing Strong Baselines For TripClick Health Retrieval

Authors: Sebastian Hofstätter, Sophia Althammer, Mete Sertkan, Allan Hanbury

Published in: Advances in Information Retrieval

Publisher: Springer International Publishing

Abstract

We present strong Transformer-based re-ranking and dense retrieval baselines for the recently released TripClick health ad-hoc retrieval collection. We improve the training data, which was originally too noisy, with a simple negative sampling policy, and with it achieve large gains over BM25 in the re-ranking task of TripClick that the original baselines did not reach. Furthermore, we study the impact of different domain-specific pre-trained models on TripClick. Finally, we show that dense retrieval outperforms BM25 by considerable margins, even with simple training procedures.
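
The negative sampling policy mentioned in the abstract is easiest to grasp with a concrete sketch. The exact policy is described in the paper itself; the Python fragment below only illustrates the general idea, under the assumption that, for each query, negatives are drawn from the BM25 candidate pool with every clicked passage excluded, so that relevant documents never land on the negative side of a training triple. All names (sample_training_triples, bm25_top_k, clicks) are illustrative, not taken from the paper.

    import random

    def sample_training_triples(queries, bm25_top_k, clicks, num_negatives=1, seed=42):
        """Sketch of a simple negative sampling policy for click-log training data:
        pair each clicked (positive) passage with negatives drawn from the BM25
        candidate pool, excluding all clicked passages for that query."""
        rng = random.Random(seed)
        triples = []
        for qid, query in queries.items():
            positives = clicks.get(qid, set())
            # candidate negatives: retrieved by BM25 but never clicked for this query
            pool = [doc for doc in bm25_top_k.get(qid, []) if doc not in positives]
            if not positives or not pool:
                continue
            for pos in positives:
                for neg in rng.sample(pool, min(num_negatives, len(pool))):
                    triples.append((query, pos, neg))
        return triples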
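
"Dense retrieval" in the abstract refers to the standard bi-encoder setup: queries and passages are encoded independently into fixed-size vectors and scored by a dot product, in contrast to BM25's exact term matching. The numpy sketch below shows only the scoring step and assumes the embeddings were already produced by some trained encoder; it is a generic illustration, not the paper's exact model.

    import numpy as np

    def dense_retrieve(query_vec, passage_matrix, k=10):
        """Score all passages against one query by dot product and return the
        indices and scores of the k best. query_vec has shape (d,), and
        passage_matrix has shape (num_passages, d); both come from a bi-encoder."""
        scores = passage_matrix @ query_vec       # one dot product per passage
        top = np.argsort(-scores)[:k]             # highest-scoring passages first
        return [(int(i), float(scores[i])) for i in top]

In practice the passage matrix is served from an approximate-nearest-neighbor index for speed; the exhaustive version above has the same semantics.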

Footnotes
1
The TripDatabase allows users to apply different ranking schemes, such as popularity, source quality, and pure relevance, as well as to filter results by facets. Unfortunately, this information is not available in the public dataset.
 
Metadata
Title
Establishing Strong Baselines For TripClick Health Retrieval
Authors
Sebastian Hofstätter
Sophia Althammer
Mete Sertkan
Allan Hanbury
Copyright Year
2022
DOI
https://doi.org/10.1007/978-3-030-99739-7_17