
2018 | OriginalPaper | Book chapter

2. t-Tests

Written by: Tetsuya Sakai

Published in: Laboratory Experiments in Information Retrieval

Publisher: Springer Singapore

Abstract

This chapter first explains how three classical significance tests for comparing two means work: the paired t-test for paired data (Sect. 2.2), and Student’s two-sample t-test and Welch’s two-sample t-test for unpaired data (Sects. 2.3 and 2.4). You have paired data if, for example, you evaluate two search engines on the same topic set with an evaluation measure such as normalised Discounted Cumulative Gain (nDCG) (Järvelin and Kekäläinen, ACM TOIS 20(4):422–446, 2002). For a survey of IR evaluation measures, see Sakai (Metrics, statistics, tests. In: PROMISE winter school 2013: bridging between information retrieval and databases. LNCS 8173, pp 116–163, 2014). You have unpaired data if, for example, you evaluate System 1 with User Group A and System 2 with User Group B; the group sizes may differ. The chapter then discusses the relationship between the two two-sample t-tests (Sect. 2.5) and shows how all three t-tests can easily be conducted in Excel (Sect. 2.6) and R (Sect. 2.7). Finally, it describes how confidence intervals for the mean differences can be constructed under the assumptions that underlie the three t-tests (Sect. 2.8).
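
The chapter itself carries out these tests in Excel and R. As a rough illustration only, the following Python sketch computes the test statistic and the degrees of freedom for each of the three tests from their standard textbook definitions. The function names and the score lists are made up for this example and do not come from the chapter; p-values are left out because they require the t distribution's CDF, which the Python standard library does not provide.

```python
import math
from statistics import mean, variance  # variance() is the unbiased sample variance


def paired_t(x, y):
    """Paired t-test: t = mean(d) / sqrt(var(d) / n), with n - 1 degrees of freedom."""
    if len(x) != len(y):
        raise ValueError("paired data need samples of equal length")
    d = [a - b for a, b in zip(x, y)]  # per-topic score differences
    n = len(d)
    return mean(d) / math.sqrt(variance(d) / n), n - 1


def student_t(x, y):
    """Student's two-sample t-test: pooled variance, n1 + n2 - 2 degrees of freedom."""
    n1, n2 = len(x), len(y)
    pooled = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    t = (mean(x) - mean(y)) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


def welch_t(x, y):
    """Welch's two-sample t-test: separate variances, Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(x), len(y)
    a, b = variance(x) / n1, variance(y) / n2
    t = (mean(x) - mean(y)) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df


if __name__ == "__main__":
    # Hypothetical nDCG scores, invented purely for illustration.
    system1 = [0.42, 0.55, 0.31, 0.67, 0.50]
    system2 = [0.40, 0.49, 0.35, 0.58, 0.44]
    print("paired :", paired_t(system1, system2))   # same topics for both systems
    print("student:", student_t(system1, system2))  # treats the samples as unpaired
    print("welch  :", welch_t(system1, system2))    # unpaired, unequal variances allowed
```

When the two groups have the same size, Student's and Welch's statistics coincide and only the degrees of freedom differ; with unequal group sizes the two tests can give different statistics as well.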

Footnotes
1
See Chap. 1, Sect. 1.2.2 for a brief discussion of the question: how large is “sufficiently large”?

2
In contrast, in 1998, Zobel [21] conducted topic set splitting experiments with early TREC data to compare parametric and nonparametric tests and recommended the Wilcoxon signed-rank test over the paired t-test and ANOVA. Moreover, in 2013, Urbano, Marrero, and Martín [17] reported that the Wilcoxon test, the paired t-test, and the bootstrap are more reliable than the randomisation test.

3
In 1908, William Sealy Gosset, who worked for Arthur Guinness Son & Co., Ltd., published his seminal paper on the t distribution under the pseudonym “Student” [15, 19]. There is no mention of “t” in Gosset’s original paper [15]; the test statistic is referred to as “z” there. “In 1912, Fisher, while still an undergraduate at Cambridge, made a tiny correction to Gosset’s z, and in 1922 they agreed to rename the corrected tables and test ‘Student’s’ t” [20].

4
At the time of writing, I am using Microsoft Office 2013, but later versions will probably support all the functionality discussed in this book.

5
Set the third argument to 1 if a one-sided test is needed.

References
1.
M.J. Crawley, Statistics: An Introduction Using R, 2nd edn. (Wiley, Chichester, 2015)
2.
D. Hull, Using statistical testing in the evaluation of retrieval experiments, in Proceedings of ACM SIGIR’93, Pittsburgh, 1993, pp. 329–338
3.
K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques. ACM TOIS 20(4), 422–446 (2002)
4.
J.P. Lander, R for Everyone (Addison Wesley, Upper Saddle River, 2014)
5.
Y. Nagata, Introduction to Statistical Analysis (in Japanese) (JUSE Press, Shibuya, 1992)
6.
Y. Nagata, How to Understand Statistical Methods (in Japanese) (JUSE Press, Shibuya, 1996)
7.
T. Sakai, Evaluating evaluation metrics based on the bootstrap, in Proceedings of ACM SIGIR, Seattle, 2006, pp. 525–532
8.
T. Sakai, Metrics, statistics, tests, in PROMISE Winter School 2013: Bridging Between Information Retrieval and Databases, Bressanone. LNCS 8173, 2014, pp. 116–163
10.
T. Sakai, Two-sample t-tests for IR evaluation: Student or Welch? in Proceedings of ACM SIGIR, Pisa, 2016, pp. 1045–1048
11.
G. Salton, M.E. Lesk, Computer evaluation of indexing and text processing. J. ACM 15(1), 8–36 (1968)
12.
J. Savoy, Statistical inference in retrieval effectiveness evaluation. Inf. Process. Manag. 33(4), 495–512 (1997)
13.
M.D. Smucker, J. Allan, B. Carterette, A comparison of statistical significance tests for information retrieval evaluation, in Proceedings of ACM CIKM, Lisbon, 2007, pp. 623–632
14.
K. Sparck Jones, P. Willett (eds.), Readings in Information Retrieval (Morgan Kaufmann, San Francisco, 1997)
15.
Student, The probable error of a mean. Biometrika 6, 1–25 (1908)
16.
J. Tague-Sutcliffe, The pragmatics of information retrieval experimentation, revisited. Inf. Process. Manag. 28, 467–490 (1992)
17.
J. Urbano, M. Marrero, D. Martín, A comparison of the optimality of statistical significance tests for information retrieval evaluation, in Proceedings of ACM SIGIR, Dublin, 2013, pp. 925–928
18.
C.J. van Rijsbergen, Information Retrieval, Chap. 7 (Butterworths, London, 1979)
19.
S.L. Zabell, On Student’s 1908 article “The probable error of a mean”. J. Am. Stat. Assoc. 103(481), 1–7 (2008)
20.
S.T. Ziliak, D.N. McCloskey, The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives (The University of Michigan Press, Ann Arbor, 2008)
21.
J. Zobel, How reliable are the results of large-scale information retrieval experiments? in Proceedings of ACM SIGIR, Melbourne, 1998, pp. 307–314
Metadata
Title: t-Tests
Written by: Tetsuya Sakai
Copyright year: 2018
Publisher: Springer Singapore
DOI: https://doi.org/10.1007/978-981-13-1199-4_2
