Published in: Discover Computing 4/2016

01.08.2016

The effect of pooling and evaluation depth on IR metrics

Authors: Xiaolu Lu, Alistair Moffat, J. Shane Culpepper

Abstract

Batch IR evaluations are usually performed in a framework that consists of a document collection, a set of queries, a set of relevance judgments, and one or more effectiveness metrics. A large number of evaluation metrics have been proposed, with two primary families having emerged: recall-based metrics and utility-based metrics. In both families, the pragmatics of forming judgments mean that it is usual to evaluate the metric to some chosen depth such as \(k=20\) or \(k=100\), without necessarily fully considering the ramifications associated with that choice. Our aim in this paper is to explore the relative risks arising with fixed-depth evaluation in the two families, and document the complex interplay between metric evaluation depth and judgment pooling depth. Using a range of TREC resources including NewsWire data and the ClueWeb collection, we: (1) examine the implications of finite pooling on the subsequent usefulness of different test collections, including specifying options for truncated evaluation; and (2) determine the extent to which various metrics correlate with themselves when computed to different evaluation depths using those judgments. We demonstrate that the judgment pools constructed for the ClueWeb collections lack resilience, and are suited primarily to the application of top-heavy utility-based metrics rather than recall-based metrics; and that on the majority of the established test collections, and across a range of evaluation depths, recall-based metrics tend to be more volatile in the system rankings they generate than are utility-based metrics. That is, experimentation using utility-based metrics is more robust to choices such as the evaluation depth employed than is experimentation using recall-based metrics. This distinction should be noted by researchers as they plan and execute system-versus-system retrieval experiments.
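The abstract contrasts utility-based and recall-based metrics computed to a fixed evaluation depth k. The Python sketch below is a minimal illustration of that distinction, not the authors' experimental code: it scores a hypothetical binary-relevance run with rank-biased precision (RBP, the utility-based metric of Moffat and Zobel 2008) and with average precision truncated at depth k as a stand-in for the recall-based family. The run vector, the persistence parameter p = 0.8, and R (the number of relevant documents found in the judgment pool) are all illustrative assumptions.

```python
# Minimal sketch contrasting a utility-based metric (RBP) with a
# recall-based metric (truncated AP) at two evaluation depths.
# The run vector, p = 0.8, and R = 8 are illustrative assumptions only.

def rbp_at_k(run, k, p=0.8):
    """Rank-biased precision to depth k: (1 - p) * sum_i r_i * p^(i-1)."""
    return (1 - p) * sum(r * p ** i for i, r in enumerate(run[:k]))

def ap_at_k(run, k, R):
    """Average precision truncated at depth k, normalized by R, the
    number of relevant documents identified in the judgment pool."""
    hits, score = 0, 0.0
    for rank, r in enumerate(run[:k], start=1):
        if r:
            hits += 1
            score += hits / rank
    return score / R if R else 0.0

# One run scored at two depths: the top-heavy RBP score changes little
# beyond depth 10 (tail ranks carry geometrically shrinking weight),
# whereas truncated AP keeps accruing mass as deeper relevant documents
# -- and hence deeper judgments -- are folded in.
run = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0] * 2  # toy 0/1 relevance to depth 20
for k in (10, 20):
    print(f"k={k}: RBP={rbp_at_k(run, k):.3f}  AP@k={ap_at_k(run, k, R=8):.3f}")
```

On this toy run the RBP score moves only from about 0.43 at depth 10 to 0.47 at depth 20, while truncated AP jumps from about 0.33 to 0.55, consistent with the abstract's observation that utility-based metrics are more robust to the choice of evaluation depth than recall-based metrics.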

Metadata
Title
The effect of pooling and evaluation depth on IR metrics
Authors
Xiaolu Lu
Alistair Moffat
J. Shane Culpepper
Publication date
01.08.2016
Publisher
Springer Netherlands
Published in
Discover Computing / Issue 4/2016
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI
https://doi.org/10.1007/s10791-016-9282-6