
2021 | Original Paper | Book Chapter

On the Instability of Diminishing Return IR Measures

Author: Tetsuya Sakai

Published in: Advances in Information Retrieval

Publisher: Springer International Publishing


Abstract

The diminishing return property of ERR (Expected Reciprocal Rank) is highly intuitive and attractive: its user model says, for example, that after the users have found a highly relevant document at rank r, few of them will continue to examine rank \((r+1)\) and beyond. Recently, another IR evaluation measure based on diminishing return called iRBU (intentwise Rank-Biased Utility) was proposed, and it was reported that nDCG (normalised Discounted Cumulative Gain) and iRBU align surprisingly well with users’ SERP (Search Engine Result Page) preferences. The present study conducts offline evaluations of diminishing return measures including ERR and iRBU along with other popular measures such as nDCG, using four test collections and the associated runs from recent TREC tracks and NTCIR tasks. Our results show that the diminishing return measures generally underperform other graded relevance measures in terms of system ranking consistency across two disjoint topic sets as well as discriminative power. The results generalise a previous finding on ERR regarding its limited discriminative power, showing that the diminishing return user model hurts the stability of evaluation measures regardless of the utility function part of the measure. Hence, while we do recommend iRBU along with nDCG for evaluating adhoc IR systems from multiple user-oriented angles, iRBU should be used under the awareness that it can be much less statistically stable than nDCG.
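
To make the diminishing return user model concrete, the following is a minimal Python sketch of ERR along the lines of the standard definition by Chapelle et al. [9], where a user stops at rank i with probability R_i = (2^{g_i} - 1) / 2^{g_max}. The relevance grades and the cutoff of 10 used below are illustrative assumptions, not the settings of the paper's experiments.

def err(grades, g_max=3, cutoff=10):
    # Expected Reciprocal Rank (cf. Chapelle et al. [9]).
    # grades: relevance grades of the ranked documents (0 = non-relevant).
    # A user scans from the top and stops at rank i with probability
    # R_i = (2^g_i - 1) / 2^g_max, so documents ranked after a highly
    # relevant one contribute little: the diminishing return property.
    score = 0.0
    p_continue = 1.0                       # probability the user is still scanning
    for i, g in enumerate(grades[:cutoff], start=1):
        r_i = (2 ** g - 1) / (2 ** g_max)  # stopping probability at rank i
        score += p_continue * r_i / i      # utility 1/i if the user stops here
        p_continue *= (1.0 - r_i)          # only the remaining users read on
    return score

print(err([3, 3, 3]))  # ~0.934: ranks 2 and 3 add little once rank 1 is highly relevant
print(err([3, 0, 0]))  # 0.875: close to the score above despite no further relevant documents

This toy comparison reflects the property described above: once a highly relevant document appears at rank 1, later documents barely change the score.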


Footnotes
1
Section 2 discusses an alternative framework for defining a family of measures [24].
 
2
For example, the TREC 2014 Web Track used 20 as the document cutoff [13]; the NTCIR We Want Web tasks have used 10 [23].
 
3
Topic set sizes can also be theoretically determined based on statistical power, given some pilot data for variance estimation [30].
 
4
For example, Amigó et al. [2] refer to the correlation of system rankings across data sets as robustness.
 
5
Not all adaptive measures are diminishing return measures. Moffat et al. [24] classify Reciprocal Rank (RR) as adaptive, but RR does not accommodate diminishing return: once a relevant document is found, there is no further return.
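As an illustration of this distinction (not part of the original footnote), a minimal sketch shows that RR is determined entirely by the rank of the first relevant document, so nothing retrieved below that rank can change the score:

def rr(grades):
    # Reciprocal Rank: adaptive (the stopping rank depends on the ranking),
    # but with no diminishing return beyond the first relevant document.
    for i, g in enumerate(grades, start=1):
        if g > 0:
            return 1.0 / i
    return 0.0

print(rr([0, 2, 3, 3]))  # 0.5
print(rr([0, 2, 0, 0]))  # 0.5 -- identical: documents after the first relevant one are ignored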
 
7
The relevance assessments of the four test collections we use in our experiments are expected to be incomplete: see the “rel. per topic” column in Table 2. Hence, using a large cutoff L probably would not give us reliable results.
 
10
The search results for the first 80 topics (i.e., the reused WWW-2 topics) were copied from a run from the NTCIR-14 WWW-2 task [23] and the other 80 topics (i.e., the new WWW-3 test topics) were processed by a new system.
 
References
1. Al-Maskari, A., Sanderson, M., Clough, P., Airio, E.: The good and the bad system: does the test collection predict users' effectiveness. In: Proceedings of ACM SIGIR 2008, pp. 59–66 (2008)
2. Amigó, E., Gonzalo, J., Mizzaro, S., de Albornoz, J.C.: An effectiveness metric for ordinal classification: formal properties and experimental results. In: Proceedings of ACL 2020 (2020)
3. Amigó, E., Spina, D., de Albornoz, J.C.: An axiomatic analysis of diversity evaluation metrics: introducing the rank-biased utility metric. In: Proceedings of ACM SIGIR 2018, pp. 625–634 (2018)
4. Anelli, V.W., Di Noia, T., Di Sciascio, E., Pomo, C., Ragone, A.: On the discriminative power of hyper-parameters in cross-validation and how to choose them. In: Proceedings of ACM RecSys 2019, pp. 447–451 (2019)
5. Ashkan, A., Metzler, D.: Revisiting online personal search metrics with the user in mind. In: Proceedings of ACM SIGIR 2019, pp. 625–634 (2019)
6. Azzopardi, L., Thomas, P., Craswell, N.: Measuring the utility of search engine result pages. In: Proceedings of ACM SIGIR 2018, pp. 605–614 (2018)
7. Buckley, C., Voorhees, E.M.: Retrieval system evaluation. In: Voorhees, E.M., Harman, D.K. (eds.) TREC: Experiment and Evaluation in Information Retrieval, pp. 53–75. The MIT Press (2005)
8. Carterette, B.: Multiple testing in statistical analysis of systems-based information retrieval experiments. ACM TOIS 30(1), 1–34 (2012)
9. Chapelle, O., Metzler, D., Zhang, Y., Grinspan, P.: Expected reciprocal rank for graded relevance. In: Proceedings of ACM CIKM 2009, pp. 621–630 (2009)
10. Chuklin, A., Serdyukov, P., de Rijke, M.: Click model-based information retrieval metrics. In: Proceedings of ACM SIGIR 2013, pp. 493–502 (2013)
11. Clarke, C.L., Craswell, N., Soboroff, I., Ashkan, A.: A comparative analysis of cascade measures for novelty and diversity. In: Proceedings of ACM WSDM 2011, pp. 75–84 (2011)
12. Clarke, C.L., Vtyurina, A., Smucker, M.D.: Offline evaluation without gain. In: Proceedings of ICTIR 2020, pp. 185–192 (2020)
13. Collins-Thompson, K., Macdonald, C., Bennett, P., Diaz, F., Voorhees, E.M.: TREC 2014 web track overview. In: Proceedings of TREC 2014 (2015)
14. Craswell, N., Mitra, B., Yilmaz, E., Campos, D., Voorhees, E.M.: Overview of the TREC 2019 deep learning track. In: Proceedings of TREC 2019 (2020)
15. Dou, Z., Yang, X., Li, D., Wen, J.R., Sakai, T.: Low-cost, bottom-up measures for evaluating search result diversification. Inform. Retrieval J. 23, 86–113 (2020)
16. Ferro, N., Kim, Y., Sanderson, M.: Using collection shards to study retrieval performance effect sizes. ACM TOIS 37(3), 1–40 (2019)
17. Golbus, P.B., Aslam, J.A., Clarke, C.L.: Increasing evaluation sensitivity to diversity. Inform. Retrieval 16, 530–555 (2013)
18. Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques. ACM Trans. Inform. Syst. 20(4), 422–446 (2002)
19. Kanoulas, E., Aslam, J.A.: Empirical justification of the gain and discount function for nDCG. In: Proceedings of ACM CIKM 2009, pp. 611–620 (2009)
20. Leelanupab, T., Zuccon, G., Jose, J.M.: A comprehensive analysis of parameter settings for novelty-biased cumulative gain. In: Proceedings of ACM CIKM 2012, pp. 1950–1954 (2012)
21. Lu, X., Moffat, A., Culpepper, J.S.: The effect of pooling and evaluation depth on IR metrics. Inform. Retrieval J. 19(4), 416–445 (2016)
22. Luo, J., Wing, C., Yang, H., Hearst, M.A.: The water filling model and the cube test: multi-dimensional evaluation for professional search. In: Proceedings of ACM CIKM 2013, pp. 709–714 (2013)
24. Moffat, A., Bailey, P., Scholer, F., Thomas, P.: Incorporating user expectations and behavior into the measurement of search effectiveness. ACM TOIS 35(3), 1–38 (2017)
25. Moffat, A., Zobel, J.: Rank-biased precision for measurement of retrieval effectiveness. ACM TOIS 27(1), 1–27 (2008)
26. Robertson, S.E., Kanoulas, E., Yilmaz, E.: Extending average precision to graded relevance judgements. In: Proceedings of ACM SIGIR 2010, pp. 603–610 (2010)
28. Sakai, T.: Evaluating evaluation metrics based on the bootstrap. In: Proceedings of ACM SIGIR 2006, pp. 525–532 (2006)
31. Sakai, T., Dou, Z.: Summaries, ranked retrieval and sessions: a unified framework for information access evaluation. In: Proceedings of ACM SIGIR 2013, pp. 473–482 (2013)
32. Sakai, T., Kando, N.: On information retrieval metrics designed for evaluation with incomplete relevance assessments. Inform. Retrieval 11(5), 447–470 (2008)
33. Sakai, T., Robertson, S.: Modelling a user population for designing information retrieval metrics. In: Proceedings of EVIA 2008, pp. 30–41 (2008)
34. Sakai, T., Song, R.: Diversified search evaluation: lessons from the NTCIR-9 INTENT task. Inform. Retrieval 16(4), 504–529 (2013)
35. Sakai, T., et al.: Overview of the NTCIR-15 We Want Web with CENTRE task. In: Proceedings of NTCIR-15, pp. 219–234 (2020)
36. Sakai, T., Zeng, Z.: Which diversity evaluation measures are “good”? In: Proceedings of ACM SIGIR 2019, pp. 595–604 (2019)
37. Sakai, T., Zeng, Z.: Good evaluation measures based on document preferences. In: Proceedings of ACM SIGIR 2020, pp. 359–368 (2020)
38. Sakai, T., Zeng, Z.: Retrieval evaluation measures that agree with users' SERP preferences: traditional, preference-based, and diversity measures. ACM TOIS 39(2), 1–35 (2020)
39. Sanderson, M., Paramita, M.L., Clough, P., Kanoulas, E.: Do user preferences and evaluation measures line up? In: Proceedings of ACM SIGIR 2010, pp. 555–562 (2010)
40. Sanderson, M., Zobel, J.: Information retrieval evaluation: effort, sensitivity, and reliability. In: Proceedings of ACM SIGIR 2005, pp. 162–169 (2005)
41. Shang, L., et al.: Overview of the NTCIR-13 short text conversation task. In: Proceedings of NTCIR-13, pp. 194–210 (2017)
42. Smucker, M.D., Clarke, C.L.: Time-based calibration of effectiveness measures. In: Proceedings of ACM SIGIR 2012, pp. 95–104 (2012)
43. Turpin, A., Scholer, F.: User performance versus precision measures for simple search tasks. In: Proceedings of ACM SIGIR 2006, pp. 11–18 (2006)
44. Urbano, J.: Test collection reliability: a study of bias and robustness to statistical assumptions via stochastic simulation. Inform. Retrieval J. 19(3), 313–350 (2016)
46. Voorhees, E.M.: Variations in relevance judgments and the measurement of retrieval effectiveness. Inform. Process. Manag. 36, 697–716 (2000)
47. Voorhees, E.M.: Topic set size redux. In: Proceedings of ACM SIGIR 2009, pp. 806–807 (2009)
48. Voorhees, E.M., Buckley, C.: The effect of topic set size on retrieval experiment error. In: Proceedings of ACM SIGIR 2002, pp. 316–323 (2002)
49. Wang, X., Wen, J.R., Dou, Z., Sakai, T., Zhang, R.: Search result diversity evaluation based on intent hierarchies. IEEE Trans. Knowl. Data Eng. 30(1), 156–169 (2018)
50. Zhang, F., Liu, Y., Li, X., Zhang, M., Xu, Y., Ma, S.: Evaluating web search with a bejeweled player model. In: Proceedings of ACM SIGIR 2017, pp. 425–434 (2017)
51. Zhang, F., et al.: Models versus satisfaction: towards a better understanding of evaluation metrics. In: Proceedings of ACM SIGIR 2020, pp. 379–388 (2020)
52. Zhou, K., Lalmas, M., Sakai, T., Cummins, R., Jose, J.M.: On the reliability and intuitiveness of aggregated search metrics. In: Proceedings of ACM CIKM 2013, pp. 689–698 (2013)
53. Zobel, J.: How reliable are the results of large-scale information retrieval experiments? In: Proceedings of ACM SIGIR 1998, pp. 307–314 (1998)
Metadata
Title
On the Instability of Diminishing Return IR Measures
Author
Tetsuya Sakai
Copyright Year
2021
DOI
https://doi.org/10.1007/978-3-030-72113-8_38
