
01.06.2016 | Information Retrieval Evaluation Using Test Collections

Test collection reliability: a study of bias and robustness to statistical assumptions via stochastic simulation

By: Julián Urbano

Published in: Discover Computing | Issue 3/2016

Abstract

The number of topics that a test collection contains has a direct impact on how well the evaluation results reflect the true performance of systems. However, large collections can be prohibitively expensive, so researchers must balance reliability and cost. This issue arises both when researchers have an existing collection and would like to know how much they can trust their results, and when they are building a new collection and would like to know how many topics it should contain before they can trust the results. Several measures have been proposed in the literature to quantify how accurately a collection estimates the true scores, as well as different ways to estimate the expected accuracy of hypothetical collections with a certain number of topics. These include ad hoc measures such as the Kendall tau correlation and swap rates, and statistical measures such as statistical power and indexes from generalizability theory. Each measure focuses on different aspects of evaluation, has a different theoretical basis, and makes a number of assumptions that are not met in practice, such as normality of distributions, homoscedasticity, uncorrelated effects and random sampling. However, how good these estimates are in practice remains a largely open question. In this paper we first compare measures and estimators of test collection accuracy, and propose unbiased statistical estimators of the Kendall tau and tau AP correlation coefficients. Second, we detail a method for the stochastic simulation of evaluation results under different statistical assumptions, which can be used in a variety of evaluation research where the true scores of systems need to be known. Third, through large-scale simulation from TREC data, we analyze the bias of a range of estimators of test collection accuracy. Fourth, we analyze the robustness of these estimators to statistical assumptions, in order to understand which aspects of an evaluation are affected by which assumptions, and to guide the development of new collections and new measures. All the results in this paper are fully reproducible with data and code available online.
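
To make the two correlation coefficients concrete, the sketch below computes the plain Kendall tau and the tau AP of Yilmaz et al. (2008) between a vector of true system scores and a vector of observed scores. It is a minimal illustration with hypothetical toy data, not the paper's code, and it implements the ordinary coefficients rather than the unbiased estimators the paper proposes.

```python
from itertools import combinations

def kendall_tau(true_scores, observed_scores):
    """Plain Kendall tau between two score vectors (ties ignored for brevity)."""
    n = len(true_scores)
    concordant = sum(
        (true_scores[i] - true_scores[j]) * (observed_scores[i] - observed_scores[j]) > 0
        for i, j in combinations(range(n), 2)
    )
    return 2 * concordant / (n * (n - 1) // 2) - 1  # in [-1, 1]

def tau_ap(true_scores, observed_scores):
    """tau AP (Yilmaz et al. 2008): like Kendall tau, but swaps near the
    top of the observed ranking are penalized more heavily."""
    n = len(true_scores)
    order = sorted(range(n), key=lambda s: -observed_scores[s])  # best first
    total = 0.0
    for i in range(1, n):
        # fraction of the systems ranked above position i that are truly better
        c = sum(true_scores[order[j]] > true_scores[order[i]] for j in range(i))
        total += c / i
    return 2 * total / (n - 1) - 1

# Toy data: five systems, with the top two swapped in the observed ranking
true_scores = [0.31, 0.28, 0.25, 0.22, 0.20]
observed_scores = [0.27, 0.29, 0.24, 0.23, 0.19]
print(kendall_tau(true_scores, observed_scores))  # 0.8: one discordant pair out of ten
print(tau_ap(true_scores, observed_scores))       # 0.5: the same single swap, but at the top
```

Because the single swap involves the two best systems, tau AP penalizes it more than Kendall tau, which weighs all pairs equally.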


Footnotes
1
Actually, they assume that the residuals are normal, not the score distributions.
 
2
Some models assume independence, which is an even stronger assumption. The statistical measures we review assume uncorrelated effects, but not independence.
 
3
We loosely use the notation \(\varvec{E}_rf(X)\) to refer to the expected value of f(X) over the population, restricted by r, from which X is sampled.
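
Read formally, and only as an interpretation of this footnote rather than a definition taken from the paper, the shorthand could be written as
\[
\varvec{E}_r f(X) = \int f(x)\,\mathrm{d}P_r(x),
\]
where \(P_r\) denotes the distribution of the population restricted by \(r\).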
 
4
We use \(|\mathbf {X}|\) to denote the number of topics in \(\mathbf {X}\).
 
5
In both papers, Sakai uses total variance rather than error variance in the denominator of \(F_1\), so statistical power is even more underestimated and there is virtually no difference between one- and two-way ANOVA. Sakai (2015) reports the results with error variance.
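
To illustrate the contrast drawn in this footnote, the sketch below computes the mean squares of a two-way ANOVA without replication on a hypothetical system-by-topic score matrix, and compares an F statistic for the system effect built on the error variance against one built on the total variance. All data and names here are invented for the example; this is a sketch of the general computation, not a reproduction of Sakai's.

```python
import numpy as np

def two_way_mean_squares(scores):
    """Mean squares of a two-way ANOVA without replication on a
    (systems x topics) matrix: system effect, topic effect, residual error."""
    n_s, n_t = scores.shape
    grand = scores.mean()
    ms_system = n_t * np.sum((scores.mean(axis=1) - grand) ** 2) / (n_s - 1)
    ms_topic = n_s * np.sum((scores.mean(axis=0) - grand) ** 2) / (n_t - 1)
    residual = (scores - scores.mean(axis=1, keepdims=True)
                - scores.mean(axis=0, keepdims=True) + grand)
    ms_error = np.sum(residual ** 2) / ((n_s - 1) * (n_t - 1))
    return ms_system, ms_topic, ms_error

# Hypothetical scores: small system differences, large topic differences
rng = np.random.default_rng(42)
n_s, n_t = 20, 50
scores = (0.3 + rng.normal(0, 0.02, (n_s, 1))     # system effects
              + rng.normal(0, 0.15, (1, n_t))     # topic effects dominate, as is typical
              + rng.normal(0, 0.05, (n_s, n_t)))  # residual noise

ms_system, ms_topic, ms_error = two_way_mean_squares(scores)
f_error = ms_system / ms_error             # denominator: error variance
f_total = ms_system / scores.var(ddof=1)   # denominator: total variance
# The topic effect inflates the total variance, so f_total is far below f_error,
# which is why power is further underestimated with a total-variance denominator.
print(f"F with error variance: {f_error:.1f}, with total variance: {f_total:.2f}")
```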
 
References
Allen, M. J., & Yen, W. M. (1979). Introduction to measurement theory. California: Wadsworth.
Bailey, P., Craswell, N., Soboroff, I., Thomas, P., de Vries, A. P., & Yilmaz, E. (2008). Relevance assessment: Are judges exchangeable and does it matter? In International ACM SIGIR conference on research and development in information retrieval, pp. 667–674.
Bodoff, D., & Li, P. (2007). Test theory for assessing IR test collections. In International ACM SIGIR conference on research and development in information retrieval, pp. 367–374.
Boytsov, L., Belova, A., & Westfall, P. (2013). Deciding on an adjustment for multiplicity in IR experiments. In International ACM SIGIR conference on research and development in information retrieval, pp. 403–412.
Brennan, R. L., & Kane, M. T. (1977). An index of dependability for mastery tests. Journal of Educational Measurement, 14(3), 277–289.
Buckley, C., & Voorhees, E. M. (2000). Evaluating evaluation measure stability. In International ACM SIGIR conference on research and development in information retrieval, pp. 33–34.
Buckley, C., Dimmick, D., Soboroff, I., & Voorhees, E. M. (2007). Bias and the limits of pooling for large collections. Journal of Information Retrieval, 10(6), 491–508.
Carterette, B. (2009). On rank correlation and the distance between rankings. In International ACM SIGIR conference on research and development in information retrieval, pp. 436–443.
Carterette, B. (2012). Multiple testing in statistical analysis of systems-based information retrieval experiments. ACM Transactions on Information Systems, 30(1), 4. doi:10.1145/2094072.2094076.
Carterette, B., & Soboroff, I. (2010). The effect of assessor error on IR system evaluation. In International ACM SIGIR conference on research and development in information retrieval, pp. 539–546.
Carterette, B., Pavlu, V., Kanoulas, E., Aslam, J. A., & Allan, J. (2009). If I had a million queries. In European conference on information retrieval, pp. 288–300.
Carterette, B., Kanoulas, E., & Yilmaz, E. (2011). Simulating simple user behavior for system effectiveness evaluation. In ACM international conference on information and knowledge management, pp. 611–620.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. New Jersey: Lawrence Erlbaum Associates.
Cormack, G. V., & Lynam, T. R. (2006). Statistical precision of information retrieval evaluation. In International ACM SIGIR conference on research and development in information retrieval, pp. 533–540.
Cornfield, J., & Tukey, J. W. (1956). Average values of mean squares in factorials. The Annals of Mathematical Statistics, 27(4), 907–949.
Cramér, H. (1928). On the composition of elementary errors II. Scandinavian Actuarial Journal, 11(1), 141–180.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. London: Wiley.
Hull, D. (1993). Using statistical testing in the evaluation of retrieval experiments. In International ACM SIGIR conference on research and development in information retrieval, pp. 329–338.
Joe, H. (2014). Dependence modeling with copulas. Boca Raton: CRC Press.
Kekäläinen, J. (2005). Binary and graded relevance in IR evaluations: Comparison of the effects on ranking of IR systems. Information Processing and Management, 41(5), 1019–1033.
Lin, W. H., & Hauptmann, A. (2005). Revisiting the effect of topic set size on retrieval error. In International ACM SIGIR conference on research and development in information retrieval, pp. 637–638.
Melucci, M. (2007). On rank correlation in information retrieval evaluation. ACM SIGIR Forum, 41(1), 18–33.
Robertson, S., & Kanoulas, E. (2012). On per-topic variance in IR evaluation. In International ACM SIGIR conference on research and development in information retrieval, pp. 891–900.
Sakai, T. (2006). Evaluating evaluation metrics based on the bootstrap. In International ACM SIGIR conference on research and development in information retrieval, pp. 525–532.
Sakai, T. (2007). On the reliability of information retrieval metrics based on graded relevance. Information Processing and Management, 43(2), 531–548.
Sakai, T. (2014a). Designing test collections for comparing many systems. In ACM international conference on information and knowledge management, pp. 61–70.
Sakai, T. (2014b). Topic set size design with variance estimates from two-way ANOVA. In International workshop on evaluating information access, pp. 1–8.
Sakai, T., & Kando, N. (2008). On information retrieval metrics designed for evaluation with incomplete relevance assessments. Journal of Information Retrieval, 11(5), 447–470.
Sanderson, M. (2010). Test collection based evaluation of information retrieval systems. Foundations and Trends in Information Retrieval, 4(4), 247–375.
Sanderson, M., & Zobel, J. (2005). Information retrieval system evaluation: Effort, sensitivity, and reliability. In International ACM SIGIR conference on research and development in information retrieval, pp. 162–169.
Sanderson, M., Turpin, A., Zhang, Y., & Scholer, F. (2012). Differences in effectiveness across sub-collections. In ACM international conference on information and knowledge management, pp. 1965–1969.
Searle, S. R., Casella, G., & McCulloch, C. E. (2006). Variance components. London: Wiley.
Smucker, M. D., Allan, J., & Carterette, B. (2007). A comparison of statistical significance tests for information retrieval evaluation. In ACM international conference on information and knowledge management, pp. 623–632.
Smucker, M. D., Allan, J., & Carterette, B. (2009). Agreement among statistical significance tests for information retrieval evaluation at varying sample sizes. In International ACM SIGIR conference on research and development in information retrieval, pp. 630–631.
Tague-Sutcliffe, J. (1992). The pragmatics of information retrieval experimentation, revisited. Information Processing and Management, 28(4), 467–490.
Urbano, J., & Marrero, M. (2015). How do gain and discount functions affect the correlation between DCG and user satisfaction? In European conference on information retrieval, pp. 197–202.
Urbano, J., Marrero, M., & Martín, D. (2013a). A comparison of the optimality of statistical significance tests for information retrieval evaluation. In International ACM SIGIR conference on research and development in information retrieval, pp. 925–928.
Urbano, J., Marrero, M., & Martín, D. (2013b). On the measurement of test collection reliability. In International ACM SIGIR conference on research and development in information retrieval, pp. 393–402.
van Rijsbergen, C. J. (1979). Information retrieval. London: Butterworths.
von Mises, R. (1931). Wahrscheinlichkeitsrechnung und ihre Anwendungen in der Statistik und theoretischen Physik.
Voorhees, E. M. (1998). Variations in relevance judgments and the measurement of retrieval effectiveness. In International ACM SIGIR conference on research and development in information retrieval, pp. 315–323.
Voorhees, E. M. (2001). Evaluation by highly relevant documents. In International ACM SIGIR conference on research and development in information retrieval, pp. 74–82.
Voorhees, E. M. (2009). Topic set size redux. In International ACM SIGIR conference on research and development in information retrieval, pp. 806–807.
Voorhees, E. M., & Buckley, C. (2002). The effect of topic set size on retrieval experiment error. In International ACM SIGIR conference on research and development in information retrieval, pp. 316–323.
Webb, N. M., Shavelson, R. J., & Haertel, E. H. (2006). Reliability coefficients and generalizability theory. Handbook of Statistics, 26, 81–124.
Webber, W., Moffat, A., & Zobel, J. (2008). Statistical power in retrieval experimentation. In ACM international conference on information and knowledge management, pp. 571–580.
Yilmaz, E., Aslam, J. A., & Robertson, S. (2008). A new rank correlation coefficient for information retrieval. In International ACM SIGIR conference on research and development in information retrieval, pp. 587–594.
Zobel, J. (1998). How reliable are the results of large-scale information retrieval experiments? In International ACM SIGIR conference on research and development in information retrieval, pp. 307–314.
Metadata
Title
Test collection reliability: a study of bias and robustness to statistical assumptions via stochastic simulation
Author
Julián Urbano
Publication date
01.06.2016
Publisher
Springer Netherlands
Published in
Discover Computing / Issue 3/2016
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI
https://doi.org/10.1007/s10791-015-9274-y
