2018 | OriginalPaper | Book chapter

5. The Correct Ways to Use Significance Tests

Written by: Tetsuya Sakai

Published in: Laboratory Experiments in Information Retrieval

Publisher: Springer Singapore

Abstract

Statistical significance testing has been under attack for decades. This chapter first discusses the criticisms and limitations of significance testing (Sect. 5.1). It then argues for the importance of effect sizes, which typically represent the magnitude of the difference between systems (Sect. 5.2), and finally proposes how researchers should present their significance test results in technical papers and reports (Sect. 5.3). Reporting individual results effectively means that the research community as a whole can accumulate reproducible pieces of evidence and draw general conclusions from them; if researchers adhere to bad practices, the result is a community whose members learn very little from one another.

A book edited by Harlow, Mulaik, and Steiger [11] contains a small collection of good arguments for and against classical significance testing.

Pr(H|D) can be directly addressed using Bayesian statistics [2, 29], but this is beyond the scope of this book; see also Chap. 8.

In his influential paper that advocated the use of parametric tests for IR evaluation, Hull described the p-value as “a measurement of the probability that the observed difference could have occurred by chance” [14]. Nooo! On the other hand, he also noted: “Researchers should simply be cautioned to consider both the statistical significance and the magnitude of the difference” and thus correctly pointed out the importance of effect size.

As was mentioned earlier, the standardised mean difference measures the effect in standard deviation units. Thus, given the same raw difference \(\bar {d}\), the effect size is considered relatively small for high-variance distributions and relatively large for low-variance ones.
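The point about variance can be illustrated with a minimal Python sketch. It assumes, for illustration only, that the standardised mean difference is the mean of the per-topic score differences divided by their sample standard deviation; the chapter's exact estimator is not reproduced in this excerpt, and the function name and both data sets are hypothetical.

```python
from statistics import mean, stdev


def standardised_mean_difference(diffs):
    """Mean per-topic difference in units of the sample standard deviation.

    Assumption: this simple ratio stands in for the chapter's effect size
    estimator, which is not shown in this excerpt.
    """
    return mean(diffs) / stdev(diffs)


# Hypothetical per-topic differences; both sets share the same raw mean.
low_variance = [0.04, 0.05, 0.06, 0.05, 0.05]
high_variance = [-0.15, 0.25, 0.05, -0.05, 0.15]

es_low = standardised_mean_difference(low_variance)
es_high = standardised_mean_difference(high_variance)

print(f"raw mean difference (low variance):  {mean(low_variance):.3f}")
print(f"raw mean difference (high variance): {mean(high_variance):.3f}")
print(f"effect size (low variance):  {es_low:.3f}")
print(f"effect size (high variance): {es_high:.3f}")
```

Both sets have the same raw mean difference, yet the low-variance set yields a much larger standardised effect than the high-variance set, which is the behaviour described above.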

To add to the confusion, in a book on effect sizes by Ellis [6], the standard deviation formulas given for Cohen’s d and Hedges’ g are actually equivalent (pp. 26–27). The difference between Hedges’ g and Cohen’s d is also discussed in Grissom and Kim [10] (p. 58).

When \(n_1 = n_2\), \(\hat {\delta }\) is the unique minimum variance unbiased estimator of \(\delta\) [13].

The estimates \(\hat {\sigma }_{A}^2\) and \(\hat {\sigma }_{B}^2\) have different forms because while A (system) is a fixed factor (i.e. we are interested in a particular set of systems and no other system), B (topic) is considered to be a random factor (i.e. we could have had a different set of topics); see Kline [19] (Chapter 6, pp. 185–196).

The formula for \(\hat {\omega }_{p}^2\) provided in Okubo and Okada [24] (Chapter 3 Eq. 3.69) contains an error, despite their claim that they substituted Eq. 5.27 into Eq. 5.25 (using a set of notations different from this book). For this reason, Eq. 13 in Sakai [27] also contains an error: as Eq. 5.29 shows, the denominator involves mn, not n.

Both A and B are treated as fixed factors; see Kline [19] (Chapter 7, p. 232).

Not significant.

While Sakai [27] recommended reporting ANOVA results prior to discussing an RTHSD result, we omit the ANOVA step in this book for the reason given at the beginning of Chap. 4.

If the above description of \(ES_{E2}\) seems too lengthy, it might be a good idea to just cite this book instead!

1. D. Bakan, The test of significance in psychological research. Psychol. Bull. 66(6), 423–437 (1966)

2. B. Carterette, Bayesian inference for information retrieval evaluation, in Proceedings of ACM ICTIR, Northampton, 2015, pp. 31–40

3. J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd edn. (Psychology Press, New York, 1988)

4. J. Cohen, The earth is round (p < .05). Am. Psychol. 49(12), 997–1003 (1994)

5. W. Edwards Deming, On probability as a basis for action. Am. Stat. 29(4), 146–152 (1975)

6. P.D. Ellis, The Essential Guide to Effect Sizes (Cambridge University Press, Cambridge/New York, 2010)

7. A. Field, G. Hole, How to Design and Report Experiments (Sage Publications, London, 2003)

8. G.V. Glass, B. McGaw, M.L. Smith, Meta-Analysis in Social Research (Sage Publications, Beverly Hills, 1981)

9. S. Greenland, S.J. Senn, K.J. Rothman, J.B. Carlin, C. Poole, S.N. Goodman, D.G. Altman, Statistical tests, p values, confidence intervals, and power: a guide to misinterpretations. Eur. J. Epidemiol. 31(4), 337–350 (2016)

10. R.J. Grissom, J.J. Kim, Effect Sizes for Research, 2nd edn. (Routledge, New York, 2012)

11. L.L. Harlow, S.A. Mulaik, J.H. Steiger, What If There Were No Significance Tests? (Classic Edition) (Routledge, London, 2016)

12. W.L. Hays, Statistics (Fifth Edition/International Edition) (Harcourt Brace College Publishers, Fort Worth, 1994)

13. L.V. Hedges, I. Olkin, Statistical Methods for Meta-Analysis (Academic Press, San Diego, 1985)

14. D. Hull, Using statistical testing in the evaluation of retrieval experiments, in Proceedings of ACM SIGIR’93, Pittsburgh, 1993, pp. 329–338

15. D.H. Johnson, The insignificance of statistical significance testing. J. Wildlife Manag. 63(3), 763–772 (1999)

16. E.M. Keen, Presenting results of experimental retrieval comparisons. Inf. Process. Manag. 28(4), 491–502 (1992)

17. K. Kelley, K.J. Preacher, On effect size. Psychol. Methods 17(2), 137–152 (2012)

18. G. Keren, C. Lewis, Partial omega squared for ANOVA designs. Educ. Psychol. Meas. 39(1), 119–128 (1979)

19. R.B. Kline, Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research (American Psychological Association, Washington, 2004)

20. H.C. Kraemer, C. Blasey, How Many Subjects? Statistical Power Analysis in Research, 2nd edn. (SAGE Publications, Los Angeles, 2016)

21. J.P. Lander, R for Everyone (Addison Wesley, Upper Saddle River, 2014)

22. G.R. Loftus, On the tyranny of hypothesis testing in the social sciences. Contemp. Psychol. 36(2), 102–105 (1991)

23. R.E. McGrath, G.J. Meyer, When effect sizes disagree: the case of r and d. Psychol. Methods 11(4), 386–401 (2006)

24. M. Okubo, K. Okada, Psychological Statistics to Tell Your Story: Effect Size, Confidence Interval (in Japanese) (Keiso Shobo, Bunkyo, 2012)

25. S. Olejnik, J. Algina, Generalized eta and omega squared statistics: measures of effect size for some common research designs. Psychol. Methods 8(4), 434–447 (2003)

26. K.J. Rothman, Writing for epidemiology. Epidemiology 9(3), 333–337 (1998)

27. T. Sakai, Statistical reform in information retrieval? SIGIR Forum 48(1), 3–12 (2014)

28. T. Sakai, Statistical significance, power, and sample sizes: a systematic review of SIGIR and TOIS, in Proceedings of ACM SIGIR, Pisa, 2016, pp. 5–14

29. T. Sakai, The probability that your hypothesis is correct, credible intervals, and effect sizes for IR evaluation, in Proceedings of ACM SIGIR, Shinjuku, 2017, pp. 25–34

30. F.L. Schmidt, Statistical significance testing and cumulative knowledge in psychology: implications for training of researchers. Psychol. Methods 1(2), 115–129 (1996)

31. K. Sparck Jones, Automatic indexing 1974: a state of the art review. Technical report, Computer Laboratory, University of Cambridge, British Library Research and Development Report No. 5193 (1974)

32. K. Sparck Jones, Retrieval system tests 1958–1978, in Information Retrieval Experiment, chap. 12, ed. by K. Sparck Jones (Butterworths, London, 1981)

33. R.L. Wasserstein, N.A. Lazar, The ASA’s statement on p-values: context, process, and purpose. Am. Stat. 70(2), 129–133 (2016)

34. S.T. Ziliak, D.N. McCloskey, The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives (The University of Michigan Press, Ann Arbor, 2008)

Title: The Correct Ways to Use Significance Tests
Written by: Tetsuya Sakai
Publisher: Springer Singapore
Book: Laboratory Experiments in Information Retrieval
Print ISBN: 978-981-13-1198-7

Electronic ISBN: 978-981-13-1199-4

Copyright year: 2018
DOI: https://doi.org/10.1007/978-981-13-1199-4_5
