2018 | OriginalPaper | Book Chapter

2. t-Tests

Written by: Tetsuya Sakai

Published in: Laboratory Experiments in Information Retrieval

Publisher: Springer Singapore

Abstract

This chapter first explains how the following classical significance tests for comparing two means work: the paired t-test for paired data (Sect. 2.2) and (Student’s) two-sample t-test and Welch’s two-sample t-test for unpaired data (Sects. 2.3 and 2.4). You have paired data if, for example, you evaluate two search engines using the same topic set with some evaluation measure such as normalised Discounted Cumulative Gain (nDCG) (Järvelin and Kekäläinen, ACM TOIS 20(4):422–446, 2002). (For a survey on IR evaluation measures, see Sakai (Metrics, statistics, tests. In: PROMISE winter school 2013: bridging between information retrieval and databases. LNCS 8173, pp 116–163, 2014).) You have unpaired data if, for example, you evaluated System 1 with User Group A and System 2 with User Group B; the group sizes may differ. This chapter then discusses the relationship between the aforementioned two two-sample t-tests (Sect. 2.5) and shows how the three t-tests can easily be conducted using Excel (Sect. 2.6) and R (Sect. 2.7). Finally, it describes how confidence intervals for the mean differences can be constructed, based on the assumptions that form the basis of the three t-tests (Sect. 2.8).
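The three test statistics named in the abstract can be sketched with the standard textbook formulas. This is a minimal illustration using only the Python standard library, not the chapter's own Excel or R procedure; the function names and the sample scores are hypothetical. It returns only the t statistic and the degrees of freedom; turning these into a p-value or a confidence interval additionally requires the t distribution, which the standard library does not provide.

```python
import math
from statistics import mean, variance  # variance() is the unbiased sample variance


def paired_t(x, y):
    """Paired t-test: t statistic of the per-item differences, df = n - 1."""
    if len(x) != len(y):
        raise ValueError("paired data require samples of equal length")
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    t = mean(d) / math.sqrt(variance(d) / n)
    return t, n - 1


def student_t(x, y):
    """Student's two-sample t-test: pooled variance, df = n1 + n2 - 2."""
    n1, n2 = len(x), len(y)
    pooled = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    t = (mean(x) - mean(y)) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


def welch_t(x, y):
    """Welch's two-sample t-test: separate variances, Welch-Satterthwaite df."""
    n1, n2 = len(x), len(y)
    a, b = variance(x) / n1, variance(y) / n2
    t = (mean(x) - mean(y)) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df


# Hypothetical per-topic scores (e.g. nDCG) for two systems on the same topics.
system1 = [0.5, 0.6, 0.7, 0.8]
system2 = [0.4, 0.4, 0.6, 0.6]

print("paired :", paired_t(system1, system2))
print("student:", student_t(system1, system2))
print("welch  :", welch_t(system1, system2))
```

With equal sample sizes, the Student and Welch statistics coincide and only the degrees of freedom differ, which is one way to see how the two two-sample tests relate.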

See Chap. 1, Sect. 1.2.2 for a brief discussion of how large “sufficiently large” is.

In contrast, in 1998, Zobel [21] conducted topic set splitting experiments with early TREC data to compare parametric and nonparametric tests and recommended the Wilcoxon signed-rank test over the paired t-test and ANOVA. Moreover, in 2013, Urbano, Marrero, and Martín [17] reported that the Wilcoxon test, the paired t-test, and the bootstrap are more reliable than the randomisation test.

In 1908, William Sealy Gosset, who worked for Arthur Guinness Son & Co., Ltd., published his seminal paper on the t distribution under the pseudonym “Student” [15, 19]. There is no mention of “t” in Gosset’s original paper [15]; the test statistic is referred to as “z” there. “In 1912, Fisher, while still an undergraduate at Cambridge, made a tiny correction to Gosset’s z, and in 1922 they agreed to rename the corrected tables and test “Student’s” t” [20].

At the time of this writing, I am using Microsoft Office 2013, but later versions will probably support all the functionalities discussed in this book.

Set the third argument to 1 if a one-sided test is needed.

1. M.J. Crawley, Statistics: An Introduction Using R, 2nd edn. (Wiley, Chichester, 2015)

2. D. Hull, Using statistical testing in the evaluation of retrieval experiments, in Proceedings of ACM SIGIR’93, Pittsburgh, 1993, pp. 329–338

3. K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques. ACM TOIS 20(4), 422–446 (2002)

4. J.P. Lander, R for Everyone (Addison Wesley, Upper Saddle River, 2014)

5. Y. Nagata, Introduction to Statistical Analysis (in Japanese) (JUSE Press, Shibuya, 1992)

6. Y. Nagata, How to Understand Statistical Methods (in Japanese) (JUSE Press, Shibuya, 1996)

7. T. Sakai, Evaluating evaluation metrics based on the bootstrap, in Proceedings of ACM SIGIR, Seattle, 2006, pp. 525–532

8. T. Sakai, Metrics, statistics, tests, in PROMISE Winter School 2013: Bridging Between Information Retrieval and Databases, Bressanone. LNCS 8173, 2014, pp. 116–163

9. T. Sakai, Statistical reform in information retrieval? SIGIR Forum 48(1), 3–12 (2014)

10. T. Sakai, Two-sample t-tests for IR evaluation: Student or Welch? in Proceedings of ACM SIGIR, Pisa, 2016, pp. 1045–1048

11. G. Salton, M.E. Lesk, Computer evaluation of indexing and text processing. J. ACM 15(1), 8–36 (1968)

12. J. Savoy, Statistical inference in retrieval effectiveness evaluation. Inf. Process. Manag. 33(4), 495–512 (1997)

13. M.D. Smucker, J. Allan, B. Carterette, A comparison of statistical significance tests for information retrieval evaluation, in Proceedings of ACM CIKM, Lisbon, 2007, pp. 623–632

14. K. Sparck Jones, P. Willett (eds.), Readings in Information Retrieval (Morgan Kaufmann, San Francisco, 1997)

15. Student, The probable error of a mean. Biometrika 6, 1–25 (1908)

16. J. Tague-Sutcliffe, The pragmatics of information retrieval experimentation, revisited. Inf. Process. Manag. 28, 467–490 (1992)

17. J. Urbano, M. Marrero, D. Martín, A comparison of the optimality of statistical significance tests for information retrieval evaluation, in Proceedings of ACM SIGIR, Dublin, 2013, pp. 925–928

18. C.J. van Rijsbergen, Information Retrieval, Chap. 7 (Butterworths, London, 1979)

19. S.L. Zabell, On Student’s 1908 article “The probable error of a mean”. J. Am. Stat. Assoc. 103(481), 1–7 (2008)

20. S.T. Ziliak, D.N. McCloskey, The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives (The University of Michigan Press, Ann Arbor, 2008)

21. J. Zobel, How reliable are the results of large-scale information retrieval experiments? in Proceedings of ACM SIGIR, Melbourne, 1998, pp. 307–314

Title: t-Tests
Written by: Tetsuya Sakai
Publisher: Springer Singapore
Book: Laboratory Experiments in Information Retrieval
Print ISBN: 978-981-13-1198-7
Electronic ISBN: 978-981-13-1199-4
Copyright year: 2018
DOI: https://doi.org/10.1007/978-981-13-1199-4_2
