
2023 | Original Paper | Book Chapter

An Experimental Study on Pretraining Transformers from Scratch for IR

Authors: Carlos Lassance, Hervé Dejean, Stéphane Clinchant

Published in: Advances in Information Retrieval

Publisher: Springer Nature Switzerland

Abstract

Finetuning Pretrained Language Models (PLMs) for IR has been the de facto standard practice since their breakthrough effectiveness a few years ago. But is this approach well understood? In this paper, we study the impact of the pretraining collection on the final IR effectiveness. In particular, we challenge the current hypothesis that PLMs must be pretrained on a large enough generic collection, and we show that pretraining from scratch on the collection of interest is surprisingly competitive with the current approach. We benchmark first-stage rankers and cross-encoder rerankers on general passage retrieval on MSMARCO, on Mr-Tydi for Arabic, Japanese and Russian, and on TripClick for a specific domain. Contrary to popular belief, we show that, for finetuning first-stage rankers, models pretrained solely on their target collection have equivalent or better effectiveness than more general models. However, there is a slight effectiveness drop for rerankers pretrained only on the target collection. Overall, our study sheds new light on the role of the pretraining collection and should make our community ponder building specialized models by pretraining from scratch. Last but not least, doing so could enable better control over efficiency, data bias and replicability, which are key research questions for the IR community.
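
The central manipulation in the paper is where the masked-language-model pretraining text comes from: a generic corpus versus only the collection that will later be searched. As a rough, hedged illustration of the latter (not the authors' exact recipe), the sketch below pretrains a randomly initialised BERT-style encoder with MLM on a plain-text dump of the target collection using Hugging Face Transformers; the file name msmarco_passages.txt, the reuse of the bert-base-uncased vocabulary, and all hyperparameters are assumptions made for the example.

```python
# Minimal sketch (assumptions, not the authors' exact setup): pretrain a
# BERT-style encoder *from scratch* with masked language modeling, using only
# text from the target retrieval collection.
from datasets import load_dataset
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Assumed input: one passage per line in a plain-text dump of the collection.
corpus = load_dataset("text", data_files={"train": "msmarco_passages.txt"})["train"]

# For simplicity we reuse an existing WordPiece vocabulary; the key point is
# that the pretraining *text* comes solely from the collection of interest.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

corpus = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Randomly initialised weights: no knowledge inherited from a generic PLM.
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="bert-from-scratch-collection",
                         per_device_train_batch_size=64,
                         num_train_epochs=3,
                         learning_rate=1e-4)
Trainer(model=model, args=args, train_dataset=corpus,
        data_collator=collator).train()
```

The resulting checkpoint would then be finetuned as a first-stage ranker or cross-encoder reranker in the usual way.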


Footnotes
1
For instance, freezing the BERT encoder and learning an additional linear layer is sufficient to obtain good performance in NLP [11], while such an approach is not as effective in IR (a minimal sketch of this probing setup appears at the end of the footnotes).
 
2
We could not find in the literature an easy/practical way to perform statistical significance testing over BEIR.
 
3
We were not able to find the parameters used in the experiments.
 
4
Note that, since MContriever TyDi (first row) is not available, statistical tests cannot be performed; we do our best to evaluate it fairly under our training setting (second row).
 
5
We suspect they use more compute, but could not find accurate compute information.
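
Footnote 1 contrasts NLP, where a frozen encoder plus a linear probe already works well, with IR, where it does not. Purely as an illustrative sketch of that probing setup (an assumption, not code from the paper), the snippet below freezes a BERT encoder and trains only a linear relevance head on query-passage pairs; the model name, the binary-relevance framing and the hyperparameters are hypothetical.

```python
# Illustrative sketch of "frozen encoder + linear layer" probing for relevance
# classification (hypothetical setup, not the paper's code).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

# Freeze every encoder parameter: only the probe below receives gradients.
for p in encoder.parameters():
    p.requires_grad = False

probe = torch.nn.Linear(encoder.config.hidden_size, 1)  # relevance score head
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
loss_fn = torch.nn.BCEWithLogitsLoss()

def relevance_logit(query: str, passage: str) -> torch.Tensor:
    """Score a query-passage pair from the frozen [CLS] representation."""
    inputs = tokenizer(query, passage, return_tensors="pt",
                       truncation=True, max_length=256)
    with torch.no_grad():          # the encoder stays fixed
        cls = encoder(**inputs).last_hidden_state[:, 0]
    return probe(cls).squeeze(-1)

# One illustrative training step on a (hypothetical) relevant pair.
logit = relevance_logit("what is information retrieval",
                        "Information retrieval is the task of finding documents...")
loss = loss_fn(logit, torch.ones_like(logit))
loss.backward()
optimizer.step()
```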
 
References
5. Bonifacio, L.H., Campiotti, I., Jeronymo, V., Lotufo, R., Nogueira, R.: MMARCO: a multilingual version of the MS MARCO passage ranking dataset. arXiv preprint arXiv:2108.13897 (2021)
8.
9. Craswell, N., Zoeter, O., Taylor, M., Ramsey, B.: An experimental comparison of click position-bias models. In: Proceedings of the 2008 International Conference on Web Search and Data Mining, pp. 87–94 (2008)
10. Dehghani, M., Zamani, H., Severyn, A., Kamps, J., Croft, W.B.: Neural ranking models with weak supervision. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2017, pp. 65–74. Association for Computing Machinery, New York (2017). https://doi.org/10.1145/3077136.3080832
12. El-Nouby, A., Izacard, G., Touvron, H., Laptev, I., Jegou, H., Grave, E.: Are large-scale datasets necessary for self-supervised pre-training? arXiv preprint arXiv:2112.10740 (2021)
13. Formal, T., Lassance, C., Piwowarski, B., Clinchant, S.: From distillation to hard negative sampling: making sparse neural IR models more effective. In: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2022, pp. 2353–2359. Association for Computing Machinery, New York (2022). https://doi.org/10.1145/3477495.3531857
18. Guo, Y., et al.: Webformer: pre-training with web pages for information retrieval. In: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2022, pp. 1502–1512. Association for Computing Machinery, New York (2022). https://doi.org/10.1145/3477495.3532086
20. Hofstätter, S., Althammer, S., Schröder, M., Sertkan, M., Hanbury, A.: Improving efficient neural ranking models with cross-architecture knowledge distillation (2020)
21. Hofstätter, S., Althammer, S., Sertkan, M., Hanbury, A.: Establishing strong baselines for TripClick health retrieval (2022)
22. Hofstätter, S., Lin, S.C., Yang, J.H., Lin, J., Hanbury, A.: Efficiently teaching an effective dense retriever with balanced topic aware sampling. In: Proceedings of SIGIR (2021)
23. Izacard, G., et al.: Towards unsupervised dense information retrieval with contrastive learning (2021)
24. Kaplan, J., et al.: Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020)
26. Khattab, O., Zaharia, M.: ColBERT: efficient and effective passage search via contextualized late interaction over BERT. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2020, pp. 39–48. Association for Computing Machinery, New York (2020). https://doi.org/10.1145/3397271.3401075
28. Lassance, C., Clinchant, S.: An efficiency study for SPLADE models. In: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2022, pp. 2220–2226. Association for Computing Machinery, New York (2022). https://doi.org/10.1145/3477495.3531833
33. Ma, X., Guo, J., Zhang, R., Fan, Y., Ji, X., Cheng, X.: B-PROP: bootstrapped pre-training with representative words prediction for ad-hoc retrieval. In: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (2021)
34. Ma, X., Guo, J., Zhang, R., Fan, Y., Ji, X., Cheng, X.: PROP: pre-training with representative words prediction for ad-hoc retrieval. In: Proceedings of the 14th ACM International Conference on Web Search and Data Mining (2021)
35. Ma, Z., et al.: Pre-training for ad-hoc retrieval: hyperlink is also you need. In: Proceedings of the 30th ACM International Conference on Information and Knowledge Management (2021)
38. Nair, S., Yang, E., Lawrie, D., Mayfield, J., Oard, D.W.: Learning a sparse representation model for neural CLIR. In: Design of Experimental Search and Information REtrieval Systems (DESIRES) (2022)
39. Nguyen, T., et al.: MS MARCO: a human generated machine reading comprehension dataset. In: CoCo@NIPS (2016)
40. Nogueira, R., Cho, K.: Passage re-ranking with BERT (2019)
41. Paria, B., Yeh, C.K., Yen, I.E.H., Xu, N., Ravikumar, P., Póczos, B.: Minimizing FLOPs to learn efficient sparse representations (2020)
42. Qu, Y., et al.: RocketQA: an optimized training approach to dense passage retrieval for open-domain question answering. In: Proceedings of NAACL (2021)
43. Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2019). http://arxiv.org/abs/1908.10084
44. Rekabsaz, N., Lesota, O., Schedl, M., Brassey, J., Eickhoff, C.: TripClick: the log files of a large health web search engine. In: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2507–2513 (2021). https://doi.org/10.1145/3404835.3463242
46. Robertson, S.E., Walker, S., Beaulieu, M., Gatford, M., Payne, A.: Okapi at TREC-4. NIST Special Publication SP, pp. 73–96 (1996)
47. Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019)
48. Santhanam, K., Khattab, O., Saad-Falcon, J., Potts, C., Zaharia, M.: ColBERTv2: effective and efficient retrieval via lightweight late interaction (2021)
50. Tay, Y., et al.: Scale efficiently: insights from pre-training and fine-tuning transformers. arXiv preprint arXiv:2109.10686 (2022)
51.
52. Wu, Y., et al.: Google's neural machine translation system: bridging the gap between human and machine translation (2016)
54. Zhang, X., Ma, X., Shi, P., Lin, J.: Mr. TyDi: a multi-lingual benchmark for dense retrieval. In: Proceedings of the 1st Workshop on Multilingual Representation Learning, pp. 127–137 (2021)
Metadata
Title
An Experimental Study on Pretraining Transformers from Scratch for IR
Authors
Carlos Lassance
Hervé Dejean
Stéphane Clinchant
Copyright Year
2023
DOI
https://doi.org/10.1007/978-3-031-28244-7_32
