Discover Computing

Published: 6 March 2020

A passage-based approach to learning to rank documents

Authors: Eilon Sheetrit, Anna Shtok, Oren Kurland

Published in: Discover Computing | Issue 2/2020

Abstract

According to common relevance-judgment regimes, such as TREC’s, a document can be deemed relevant to a query even if it contains only a very short passage of text with pertinent information. This fact has motivated work on passage-based document retrieval: document ranking methods that induce information from the document’s passages. However, the main source of passage-based information utilized so far has been passage-query similarities. In this paper, we address the challenge of utilizing richer sources of passage-based information to improve document retrieval effectiveness. Specifically, we devise a suite of learning-to-rank-based document retrieval methods that utilize an effective ranking of passages produced in response to the query. Some of the methods quantify the ranking of the passages of a document. Others utilize the feature-based representation of the document’s passages. Empirical evaluation attests to the clear merits of our methods with respect to highly effective baselines. Our best performing method learns a document ranking function using document-query features together with the passage-query features of the document’s most highly ranked passage; the passage-query features are those used to learn a highly effective passage ranker.
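The feature construction behind the best performing method can be pictured as concatenating two feature vectors before a ranker is learned. The following is a minimal sketch under stated assumptions: the function name `joint_features`, the identifiers and the toy feature values are all hypothetical, and the actual feature sets and learner are not specified in this abstract.

```python
from typing import Dict, List, Tuple


def joint_features(
    doc_features: List[float],
    passage_ranking: List[Tuple[str, str]],
    passage_features: Dict[str, List[float]],
    doc_id: str,
) -> List[float]:
    """Append the passage-query features of the document's most highly
    ranked passage to the document-query features."""
    # passage_ranking is ordered best-first; each entry is
    # (passage_id, id of the document the passage belongs to).
    for passage_id, owner in passage_ranking:
        if owner == doc_id:
            return doc_features + passage_features[passage_id]
    raise ValueError(f"no passage of {doc_id} appears in the ranking")


# Hypothetical toy data: three passages from two documents.
passage_ranking = [("p3", "d2"), ("p1", "d1"), ("p2", "d1")]
passage_features = {"p1": [0.7, 0.2], "p2": [0.1, 0.9], "p3": [0.8, 0.4]}

vec = joint_features([1.5, 0.3], passage_ranking, passage_features, "d1")
print(vec)  # [1.5, 0.3, 0.7, 0.2]
```

The resulting vectors, one per document-query pair, would then be passed to whichever learning-to-rank method is used.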

Note that these passages are also the passages of documents in \(\mathcal {D}_{init}\) since \(\mathcal {D}_{LTR}\) is a re-rank of \(\mathcal {D}_{init}\).

Experiments—actual numbers are omitted as they convey no additional insight—showed that simply using the passage-based document ranking without the additional fusion often yields performance (substantially) inferior to that of FPD.

www.lemurproject.org.

Unless otherwise stated, we used the jforests implementation of LambdaMART: https://code.google.com/p/jforests/. In Sect. 5.1.7 we also present the performance results of our best performing method when using the LightGBM implementation of LambdaMART (https://github.com/microsoft/LightGBM).

https://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html.

www.research.microsoft.com/en-us/projects/mslr.

The only exception was that the passage LTR method applied on TREC corpora was learned using all queries in the INEX dataset.

Not smoothing these language models was shown to yield highly effective RM3 performance (Raiber and Kurland 2013).

The finding that init-LMart underperforms init-SVM can be attributed to the fact that LMart is a non-linear ranker while SVM is linear, and the number of queries used for training is not very large.

We note that using the lowest ranked passage did not result in a substantial performance decrease because of the length of the passages used here: 300. Such passages can incorporate a decent amount of information from the entire document, especially in cases of relatively short documents.

To avoid having the same features used for the two passages, the following features were removed from the feature vector of the second ranked passage: DocQuerySim, MaxPDSim, AvgPDSim, StdPDSim and QueryLength.
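The removal described in this note can be sketched as a simple filter applied to the second passage's features before the two vectors are concatenated. This is an illustration only: the function name `two_passage_vector`, the feature `PsgQuerySim` and all values are hypothetical, while the five removed feature names are taken from the note.

```python
from typing import Dict, List

# Features the note lists as removed from the second ranked passage.
SHARED = ("DocQuerySim", "MaxPDSim", "AvgPDSim", "StdPDSim", "QueryLength")


def two_passage_vector(
    first: Dict[str, float], second: Dict[str, float]
) -> List[float]:
    """Keep every feature of the first passage, and only the
    non-shared features of the second passage."""
    kept_second = {k: v for k, v in second.items() if k not in SHARED}
    return list(first.values()) + list(kept_second.values())


# Hypothetical toy feature dictionaries for the two top passages.
first = {"DocQuerySim": 0.5, "QueryLength": 3.0, "PsgQuerySim": 0.8}
second = {"DocQuerySim": 0.5, "QueryLength": 3.0, "PsgQuerySim": 0.6}

combined = two_passage_vector(first, second)
print(combined)  # [0.5, 3.0, 0.8, 0.6]
```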

We do not present the comparison for the JPDm approach as it is independent of the passage ranking.

JPDs-SVM uses 24 features and JPDs-LMart uses 25 features—the additional feature is the query length which is not useful for a linear ranker.
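One likely reason a query-length feature does not help a linear ranker is that it takes the same value for every document retrieved for a query, so it shifts all scores by the same constant and leaves the ranking unchanged. The sketch below illustrates this with hypothetical weights and feature values; it is not the configuration used in the experiments.

```python
from typing import List


def linear_scores(weights: List[float], rows: List[List[float]]) -> List[float]:
    """Score each feature vector with a dot product."""
    return [sum(w * x for w, x in zip(weights, row)) for row in rows]


def ranking(scores: List[float]) -> List[int]:
    """Return document indices ordered by descending score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


# Hypothetical document features for a single query.
docs = [[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]]
weights = [1.0, 0.5]
base = ranking(linear_scores(weights, docs))

# Append the same query length to every document and give it any weight.
query_length = 4.0
docs_q = [row + [query_length] for row in docs]
with_q = ranking(linear_scores(weights + [2.0], docs_q))

print(base, with_q)  # [1, 2, 0] [1, 2, 0]
```

A non-linear ranker such as a tree ensemble can instead condition on the query-level feature when deciding how to use the other features, which is consistent with the feature being kept for JPDs-LMart.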

In this analysis we set \(\nu\), the free parameter of SMPD, to a value which is effective across the train folds.

https://sourceforge.net/p/lemur/wiki/RankLib/

https://github.com/microsoft/LightGBM

www.research.microsoft.com/en-us/projects/mslr.

Abdul-Jaleel, N., Allan, J., Croft, W. B., Diaz, F., Larkey, L., Li, X., et al. (2004). UMASS at TREC 2004: Novelty and hard. In Proceedings of TREC.

Arvola, P., Geva, S., Kamps, J., Schenkel, R., Trotman, A., & Vainio, J. (2011). Overview of the INEX 2010 ad hoc track. In Comparative evaluation of focused retrieval (pp. 1–32).

Bendersky, M., & Kurland, O. (2008). Re-ranking search results using document-passage graphs. In Proceedings of SIGIR (pp. 853–854).

Bendersky, M., & Kurland, O. (2010). Utilizing passage-based language models for ad hoc document retrieval. Information Retrieval, 13(2), 157–187.

Bendersky, M., Croft, W. B., & Diao, Y. (2011). Quality-biased ranking of web documents. In Proceedings of WSDM (pp. 95–104).

Buffoni, D., Usunier, N., & Gallinari, P. (2010). Lip6 at INEX: OWPC for ad hoc track. In Focused retrieval and evaluation (pp. 59–69).

Burges, C. J. (2010). From RankNet to LambdaRank to LambdaMART: An overview. Technical report, Microsoft Research.

Callan, J. P. (1994). Passage-level evidence in document retrieval. In Proceedings of SIGIR (pp. 302–310).

Carmel, D., Shtok, A., & Kurland, O. (2013). Position-based contextualization for passage retrieval. In Proceedings of CIKM (pp. 1241–1244).

Chen, R., Spina, D., Croft, W. B., Sanderson, M., & Scholer, F. (2015). Harnessing semantics for answer sentence retrieval. In Proceedings of ESAIR (pp. 21–27).

Chen, R. C., Yulianti, E., Sanderson, M., & Croft, W. B. (2017). On the benefit of incorporating external features in a neural architecture for answer sentence selection. In Proceedings of SIGIR (pp. 1017–1020).

Cormack, G. V., Clarke, C. L., & Buettcher, S. (2009). Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In Proceedings of SIGIR (pp. 758–759).

Cormack, G. V., Smucker, M. D., & Clarke, C. L. (2011). Efficient and effective spam filtering and re-ranking for large web datasets. Information Retrieval, 14(5), 441–465.

Dehghani, M., Zamani, H., Severyn, A., Kamps, J., & Croft, W. B. (2017). Neural ranking models with weak supervision. In Proceedings of SIGIR (pp. 65–74).

Denoyer, L., Zaragoza, H., & Gallinari, P. (2001). HMM-based passage models for document classification and ranking. In Proceedings of ECIR.

Fan, Y., Guo, J., Lan, Y., Xu, J., Zhai, C., & Cheng, X. (2018). Modeling diverse relevance patterns in ad-hoc retrieval. In Proceedings of SIGIR (pp. 375–384).

Fernández, R. T., & Losada, D. E. (2012). Effective sentence retrieval based on query-independent evidence. Information Processing and Management, 48(6), 1203–1229.

Fernández, R. T., Losada, D. E., & Azzopardi, L. A. (2011). Extending the language modeling framework for sentence retrieval to include local context. Information Retrieval, 14(4), 355–389.

Ferragina, P., & Scaiella, U. (2012). Fast and accurate annotation of short texts with Wikipedia pages. IEEE Software, 29(1), 70–75.

Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29, 1189–1232.

Gabrilovich, E., & Markovitch, S. (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of IJCAI (pp. 1606–1611).

Geva, S., Kamps, J., Lehtonen, M., Schenkel, R., Thom, J. A., & Trotman, A. (2010). Overview of the INEX 2009 ad hoc track. In Focused retrieval and evaluation (pp. 4–25).

Hearst, M. A., & Plaunt, C. (1993). Subtopic structuring for full-length document access. In Proceedings of SIGIR (pp. 59–68).

Jiang, J., & Zhai, C. (2004). UIUC in HARD 2004: Passage retrieval using HMMs. In Proceedings of TREC.

Joachims, T. (2006). Training linear SVMs in linear time. In Proceedings of KDD (pp. 217–226).

Kaszkiel, M., & Zobel, J. (1997). Passage retrieval revisited. In Proceedings of SIGIR (pp. 178–185).

Kaszkiel, M., & Zobel, J. (2001). Effective ranking with arbitrary passages. Journal of the American Society for Information Science and Technology, 52(4), 344–364.

Keikha, M., Park, J. H., & Croft, W. B. (2014a). Evaluating answer passages using summarization measures. In Proceedings of SIGIR (pp. 963–966).

Keikha, M., Park, J. H., Croft, W. B., & Sanderson, M. (2014b). Retrieving passages and finding answers. In Proceedings of ADCS (p. 81).

Krikon, E., Kurland, O., & Bendersky, M. (2010). Utilizing inter-passage and inter-document similarities for reranking search results. ACM Transactions on Information Systems, 29(1), 3:1–3:28.

Kurland, O., & Domshlak, C. (2008). A rank-aggregation approach to searching for optimal query-specific clusters. In Proceedings of SIGIR (pp. 547–554).

Kurland, O., & Krikon, E. (2011). The opposite of smoothing: A language model approach to ranking query-specific document clusters. Journal of Artificial Intelligence Research, 41, 367–395.

Lang, H., Metzler, D., Wang, B., & Li, J. (2010). Improved latent concept expansion using hierarchical Markov random fields. In Proceedings of CIKM (pp. 249–258).

Lin, J. (2018). The neural hype and comparisons against weak baselines. SIGIR Forum, 52(2), 40–51.

Liu, T. Y. (2009). Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3), 225–331.

Liu, X., & Croft, W. B. (2002). Passage retrieval based on language models. In Proceedings of CIKM (pp. 375–382).

Liu, X., & Croft, W. B. (2004). Cluster-based retrieval using language models. In Proceedings of SIGIR (pp. 186–193).

Lv, Y., & Zhai, C. (2009). Positional language models for information retrieval. In Proceedings of SIGIR (pp. 299–306).

Lv, Y., & Zhai, C. (2010). Positional relevance model for pseudo-relevance feedback. In Proceedings of SIGIR (pp. 579–586).

Macdonald, C., Santos, R. L., & Ounis, I. (2012). On the usefulness of query features for learning to rank. In Proceedings of CIKM (pp. 2559–2562).

Metzler, D., & Croft, W. B. (2005). A Markov random field model for term dependencies. In Proceedings of SIGIR (pp. 472–479).

Metzler, D., & Croft, W. B. (2007a). Latent concept expansion using Markov random fields. In Proceedings of SIGIR (pp. 311–318).

Metzler, D., & Croft, W. B. (2007b). Linear feature-based models for information retrieval. Information Retrieval, 10(3), 257–274.

Metzler, D., & Kanungo, T. (2008). Machine learned sentence selection strategies for query-biased summarization. In Proceedings of SIGIR (pp. 40–47).

Miao, J., Huang, J. X., & Ye, Z. (2012). Proximity-based Rocchio’s model for pseudo relevance. In Proceedings of SIGIR (pp. 535–544).

Mittendorf, E., & Schäuble, P. (1994). Document and passage retrieval based on hidden Markov models. In Proceedings of SIGIR (pp. 318–327). New York: Springer.

Murdock, V., & Croft, W. B. (2005). A translation model for sentence retrieval. In Proceedings of HLT/EMNLP (pp. 684–691). Association for Computational Linguistics.

Murdock, V. G. (2006). Aspects of sentence retrieval. PhD thesis, University of Massachusetts Amherst.

Na, S., Kang, I., Lee, Y., & Lee, J. (2008). Completely-arbitrary passage retrieval in language modeling approach. In Proceedings of AIRS (pp. 22–33).

Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., et al. (2016). MS MARCO: A human generated machine reading comprehension dataset. CoRR. arXiv:1611.09268.

Ntoulas, A., Najork, M., Manasse, M., & Fetterly, D. (2006). Detecting spam web pages through content analysis. In Proceedings of WWW (pp. 83–92).

Raiber, F., & Kurland, O. (2013). Ranking document clusters using Markov random fields. In Proceedings of SIGIR (pp. 333–342).

Raifer, N., Raiber, F., Tennenholtz, M., & Kurland, O. (2017). Information retrieval meets game theory: The ranking competition between documents’ authors. In Proceedings of SIGIR (pp. 465–474).

Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M., & Gatford, M. (1995). Okapi at TREC-3. In Proceedings of TREC (Vol. 109, p. 109).

Salton, G., Allan, J., & Buckley, C. (1993). Approaches to passage retrieval in full text information systems. In Proceedings of SIGIR (pp. 49–58).

Sheetrit, E., & Kurland, O. (2019). Cluster-based focused retrieval. In Proceedings of CIKM (pp. 2305–2308).

Soboroff, I. (2004). Overview of the TREC 2004 novelty track. In Proceedings of TREC.

Soboroff, I., & Harman, D. (2003). Overview of the TREC 2003 novelty track. In Proceedings of TREC (pp. 38–53).

Tao, T., & Zhai, C. (2007). An exploration of proximity measures in information retrieval. In Proceedings of SIGIR (pp. 295–302).

Voorhees, E. M., & Harman, D. K. (2005). TREC: Experiments and evaluation in information retrieval. Cambridge: MIT Press.

Wan, X., Yang, J., & Xiao, J. (2008). Towards a unified approach to document similarity search using manifold-ranking of blocks. Information Processing and Management, 44(3), 1032–1048.

Wang, M., & Si, L. (2008). Discriminative probabilistic models for passage based retrieval. In Proceedings of SIGIR (pp. 419–426).

Wilkinson, R. (1994). Effective retrieval of structured documents. In Proceedings of SIGIR (pp. 311–317).

Yang, L., Ai, Q., Spina, D., Chen, R. C., Pang, L., Croft, W. B., Guo, J., & Scholer, F. (2016). Beyond factoid QA: Effective methods for non-factoid answer sentence retrieval. In Proceedings of ECIR (pp. 115–128). Berlin: Springer.

Yulianti, E., Chen, R., Scholer, F., & Sanderson, M. (2016). Using semantic and context features for answer summary extraction. In Proceedings of ADCS (pp. 81–84).

Yulianti, E., Chen, R., Scholer, F., Croft, W. B., & Sanderson, M. (2018). Ranking documents by answer-passage quality. In Proceedings of SIGIR (pp. 335–344).

Zhai, C., & Lafferty, J. (2001). A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of SIGIR (pp. 334–342).

Zhao, J., & Yun, Y. (2009). A proximity language model for information retrieval. In Proceedings of SIGIR (pp. 291–298).

Title: A passage-based approach to learning to rank documents
Authors: Eilon Sheetrit, Anna Shtok, Oren Kurland
Publication date: 6 March 2020
Publisher: Springer Netherlands
Published in: Discover Computing / Issue 2/2020
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI: https://doi.org/10.1007/s10791-020-09369-x
