
01.06.2016 | Information Retrieval Evaluation Using Test Collections

Test collection reliability: a study of bias and robustness to statistical assumptions via stochastic simulation

By: Julián Urbano

Published in: Discover Computing | Issue 3/2016

Abstract

The number of topics that a test collection contains has a direct impact on how well the evaluation results reflect the true performance of systems. However, large collections can be prohibitively expensive, so researchers must balance reliability and cost. This issue arises both when researchers have an existing collection and would like to know how much they can trust their results, and when they are building a new collection and would like to know how many topics it should contain before they can trust the results. Several measures have been proposed in the literature to quantify how accurately a collection estimates the true scores, as well as different ways to estimate the expected accuracy of hypothetical collections with a certain number of topics. These include ad hoc measures such as the Kendall tau correlation and swap rates, and statistical measures such as statistical power and indexes from generalizability theory. Each measure focuses on different aspects of evaluation, has a different theoretical basis, and makes a number of assumptions that are not met in practice, such as normality of distributions, homoscedasticity, uncorrelated effects and random sampling. However, how good these estimates are in practice remains a largely open question. In this paper we first compare measures and estimators of test collection accuracy, and propose unbiased statistical estimators of the Kendall tau and tau AP correlation coefficients. Second, we detail a method for the stochastic simulation of evaluation results under different statistical assumptions, which can be used in a variety of evaluation research where the true scores of systems need to be known. Third, through large-scale simulation from TREC data, we analyze the bias of a range of estimators of test collection accuracy. Fourth, we analyze the robustness of these estimators to statistical assumptions, in order to understand which aspects of an evaluation are affected by which assumptions, and to guide the development of new collections and new measures. All the results in this paper are fully reproducible with data and code available online.
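
To make the two correlation coefficients concrete, the sketch below computes the plain Kendall tau and the tau AP of Yilmaz et al. (2008) between a vector of true system scores and a vector of observed scores. It is a minimal illustration with hypothetical toy data, not the paper's code, and it implements the ordinary coefficients rather than the unbiased estimators the paper proposes.

```python
from itertools import combinations

def kendall_tau(true_scores, observed_scores):
    """Plain Kendall tau between two score vectors (ties ignored for brevity)."""
    n = len(true_scores)
    concordant = sum(
        (true_scores[i] - true_scores[j]) * (observed_scores[i] - observed_scores[j]) > 0
        for i, j in combinations(range(n), 2)
    )
    return 2 * concordant / (n * (n - 1) // 2) - 1  # in [-1, 1]

def tau_ap(true_scores, observed_scores):
    """tau AP (Yilmaz et al. 2008): like Kendall tau, but swaps near the
    top of the observed ranking are penalized more heavily."""
    n = len(true_scores)
    order = sorted(range(n), key=lambda s: -observed_scores[s])  # best first
    total = 0.0
    for i in range(1, n):
        # fraction of the systems ranked above position i that are truly better
        c = sum(true_scores[order[j]] > true_scores[order[i]] for j in range(i))
        total += c / i
    return 2 * total / (n - 1) - 1

# Toy data: five systems, with the top two swapped in the observed ranking
true_scores = [0.31, 0.28, 0.25, 0.22, 0.20]
observed_scores = [0.27, 0.29, 0.24, 0.23, 0.19]
print(kendall_tau(true_scores, observed_scores))  # 0.8: one discordant pair out of ten
print(tau_ap(true_scores, observed_scores))       # 0.5: the same single swap, but at the top
```

Because the single swap involves the two best systems, tau AP penalizes it more than Kendall tau, which weighs all pairs equally.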


Footnotes
1
Actually, they assume that the residuals are normal, not the score distributions.
 
2
Some models assume independence, which is an even stronger assumption. The statistical measures we review assume uncorrelated effects, but not independence.
 
3
We loosely use the notation \(\varvec{E}_rf(X)\) to refer to the expected value of f(X) over the population, restricted by r, from which X is sampled.
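
Read formally, and only as an interpretation of this footnote rather than a definition taken from the paper, the shorthand could be written as
\[
\varvec{E}_r f(X) = \int f(x)\,\mathrm{d}P_r(x),
\]
where \(P_r\) denotes the distribution of the population restricted by \(r\).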
 
4
We use \(|\mathbf {X}|\) to denote the number of topics in \(\mathbf {X}\).
 
5
In both papers, Sakai uses total variance rather than error variance in the denominator of \(F_1\), so statistical power is even more underestimated and there is virtually no difference between one- and two-way ANOVA. Sakai (2015) reports the results with error variance.
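
To illustrate the contrast drawn in this footnote, the sketch below computes the mean squares of a two-way ANOVA without replication on a hypothetical system-by-topic score matrix, and compares an F statistic for the system effect built on the error variance against one built on the total variance. All data and names here are invented for the example; this is a sketch of the general computation, not a reproduction of Sakai's.

```python
import numpy as np

def two_way_mean_squares(scores):
    """Mean squares of a two-way ANOVA without replication on a
    (systems x topics) matrix: system effect, topic effect, residual error."""
    n_s, n_t = scores.shape
    grand = scores.mean()
    ms_system = n_t * np.sum((scores.mean(axis=1) - grand) ** 2) / (n_s - 1)
    ms_topic = n_s * np.sum((scores.mean(axis=0) - grand) ** 2) / (n_t - 1)
    residual = (scores - scores.mean(axis=1, keepdims=True)
                - scores.mean(axis=0, keepdims=True) + grand)
    ms_error = np.sum(residual ** 2) / ((n_s - 1) * (n_t - 1))
    return ms_system, ms_topic, ms_error

# Hypothetical scores: small system differences, large topic differences
rng = np.random.default_rng(42)
n_s, n_t = 20, 50
scores = (0.3 + rng.normal(0, 0.02, (n_s, 1))     # system effects
              + rng.normal(0, 0.15, (1, n_t))     # topic effects dominate, as is typical
              + rng.normal(0, 0.05, (n_s, n_t)))  # residual noise

ms_system, ms_topic, ms_error = two_way_mean_squares(scores)
f_error = ms_system / ms_error             # denominator: error variance
f_total = ms_system / scores.var(ddof=1)   # denominator: total variance
# The topic effect inflates the total variance, so f_total is far below f_error,
# which is why power is further underestimated with a total-variance denominator.
print(f"F with error variance: {f_error:.1f}, with total variance: {f_total:.2f}")
```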
 
References
Allen, M. J., & Yen, W. M. (1979). Introduction to measurement theory. California: Wadsworth.
Bailey, P., Craswell, N., Soboroff, I., Thomas, P., de Vries, A. P., & Yilmaz, E. (2008). Relevance assessment: Are judges exchangeable and does it matter? In International ACM SIGIR conference on research and development in information retrieval, pp. 667–674.
Bodoff, D., & Li, P. (2007). Test theory for assessing IR test collections. In International ACM SIGIR conference on research and development in information retrieval, pp. 367–374.
Boytsov, L., Belova, A., & Westfall, P. (2013). Deciding on an adjustment for multiplicity in IR experiments. In International ACM SIGIR conference on research and development in information retrieval, pp. 403–412.
Brennan, R. L., & Kane, M. T. (1977). An index of dependability for mastery tests. Journal of Educational Measurement, 14(3), 277–289.
Buckley, C., & Voorhees, E. M. (2000). Evaluating evaluation measure stability. In International ACM SIGIR conference on research and development in information retrieval, pp. 33–34.
Buckley, C., Dimmick, D., Soboroff, I., & Voorhees, E. M. (2007). Bias and the limits of pooling for large collections. Journal of Information Retrieval, 10(6), 491–508.
Carterette, B. (2009). On rank correlation and the distance between rankings. In International ACM SIGIR conference on research and development in information retrieval, pp. 436–443.
Carterette, B. (2012). Multiple testing in statistical analysis of systems-based information retrieval experiments. ACM Transactions on Information Systems, 30(1), 4. doi:10.1145/2094072.2094076.
Carterette, B., & Soboroff, I. (2010). The effect of assessor error on IR system evaluation. In International ACM SIGIR conference on research and development in information retrieval, pp. 539–546.
Carterette, B., Pavlu, V., Kanoulas, E., Aslam, J. A., & Allan, J. (2009). If I had a million queries. In European conference on information retrieval, pp. 288–300.
Carterette, B., Kanoulas, E., & Yilmaz, E. (2011). Simulating simple user behavior for system effectiveness evaluation. In ACM international conference on information and knowledge management, pp. 611–620.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. New Jersey: Lawrence Erlbaum Associates.
Cormack, G. V., & Lynam, T. R. (2006). Statistical precision of information retrieval evaluation. In International ACM SIGIR conference on research and development in information retrieval, pp. 533–540.
Cornfield, J., & Tukey, J. W. (1956). Average values of mean squares in factorials. The Annals of Mathematical Statistics, 27(4), 907–949.
Cramér, H. (1928). On the composition of elementary errors II. Scandinavian Actuarial Journal, 11(1), 141–180.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. London: Wiley.
Hull, D. (1993). Using statistical testing in the evaluation of retrieval experiments. In International ACM SIGIR conference on research and development in information retrieval, pp. 329–338.
Joe, H. (2014). Dependence modeling with copulas. Boca Raton: CRC Press.
Kekäläinen, J. (2005). Binary and graded relevance in IR evaluations: Comparison of the effects on ranking of IR systems. Information Processing and Management, 41(5), 1019–1033.
Lin, W. H., & Hauptmann, A. (2005). Revisiting the effect of topic set size on retrieval error. In International ACM SIGIR conference on research and development in information retrieval, pp. 637–638.
Melucci, M. (2007). On rank correlation in information retrieval evaluation. ACM SIGIR Forum, 41(1), 18–33.
Robertson, S., & Kanoulas, E. (2012). On per-topic variance in IR evaluation. In International ACM SIGIR conference on research and development in information retrieval, pp. 891–900.
Sakai, T. (2006). Evaluating evaluation metrics based on the bootstrap. In International ACM SIGIR conference on research and development in information retrieval, pp. 525–532.
Sakai, T. (2007). On the reliability of information retrieval metrics based on graded relevance. Information Processing and Management, 43(2), 531–548.
Sakai, T. (2014a). Designing test collections for comparing many systems. In ACM international conference on information and knowledge management, pp. 61–70.
Sakai, T. (2014b). Topic set size design with variance estimates from two-way ANOVA. In International workshop on evaluating information access, pp. 1–8.
Sakai, T., & Kando, N. (2008). On information retrieval metrics designed for evaluation with incomplete relevance assessments. Journal of Information Retrieval, 11(5), 447–470.
Sanderson, M. (2010). Test collection based evaluation of information retrieval systems. Foundations and Trends in Information Retrieval, 4(4), 247–375.
Sanderson, M., & Zobel, J. (2005). Information retrieval system evaluation: Effort, sensitivity, and reliability. In International ACM SIGIR conference on research and development in information retrieval, pp. 162–169.
Sanderson, M., Turpin, A., Zhang, Y., & Scholer, F. (2012). Differences in effectiveness across sub-collections. In ACM international conference on information and knowledge management, pp. 1965–1969.
Searle, S. R., Casella, G., & McCulloch, C. E. (2006). Variance components. London: Wiley.
Smucker, M. D., Allan, J., & Carterette, B. (2007). A comparison of statistical significance tests for information retrieval evaluation. In ACM international conference on information and knowledge management, pp. 623–632.
Smucker, M. D., Allan, J., & Carterette, B. (2009). Agreement among statistical significance tests for information retrieval evaluation at varying sample sizes. In International ACM SIGIR conference on research and development in information retrieval, pp. 630–631.
Tague-Sutcliffe, J. (1992). The pragmatics of information retrieval experimentation, revisited. Information Processing and Management, 28(4), 467–490.
Urbano, J., & Marrero, M. (2015). How do gain and discount functions affect the correlation between DCG and user satisfaction? In European conference on information retrieval, pp. 197–202.
Urbano, J., Marrero, M., & Martín, D. (2013a). A comparison of the optimality of statistical significance tests for information retrieval evaluation. In International ACM SIGIR conference on research and development in information retrieval, pp. 925–928.
Urbano, J., Marrero, M., & Martín, D. (2013b). On the measurement of test collection reliability. In International ACM SIGIR conference on research and development in information retrieval, pp. 393–402.
van Rijsbergen, C. J. (1979). Information retrieval. London: Butterworths.
von Mises, R. (1931). Wahrscheinlichkeitsrechnung und ihre Anwendungen in der Statistik und theoretischen Physik.
Voorhees, E. M. (1998). Variations in relevance judgments and the measurement of retrieval effectiveness. In International ACM SIGIR conference on research and development in information retrieval, pp. 315–323.
Voorhees, E. M. (2001). Evaluation by highly relevant documents. In International ACM SIGIR conference on research and development in information retrieval, pp. 74–82.
Voorhees, E. M. (2009). Topic set size redux. In International ACM SIGIR conference on research and development in information retrieval, pp. 806–807.
Voorhees, E. M., & Buckley, C. (2002). The effect of topic set size on retrieval experiment error. In International ACM SIGIR conference on research and development in information retrieval, pp. 316–323.
Webb, N. M., Shavelson, R. J., & Haertel, E. H. (2006). Reliability coefficients and generalizability theory. Handbook of Statistics, 26, 81–124.
Webber, W., Moffat, A., & Zobel, J. (2008). Statistical power in retrieval experimentation. In ACM international conference on information and knowledge management, pp. 571–580.
Yilmaz, E., Aslam, J. A., & Robertson, S. (2008). A new rank correlation coefficient for information retrieval. In International ACM SIGIR conference on research and development in information retrieval, pp. 587–594.
Zobel, J. (1998). How reliable are the results of large-scale information retrieval experiments? In International ACM SIGIR conference on research and development in information retrieval, pp. 307–314.
Metadata
Title
Test collection reliability: a study of bias and robustness to statistical assumptions via stochastic simulation
Author
Julián Urbano
Publication date
01.06.2016
Publisher
Springer Netherlands
Published in
Discover Computing / Issue 3/2016
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI
https://doi.org/10.1007/s10791-015-9274-y
