
2018 | OriginalPaper | Book chapter

5. The Correct Ways to Use Significance Tests

Written by: Tetsuya Sakai

Published in: Laboratory Experiments in Information Retrieval

Publisher: Springer Singapore


Abstract

Statistical significance testing has been under attack for decades. This section first discusses the criticisms and limitations of significance testing (Sect. 5.1). It then argues for the importance of effect sizes, which typically represent the magnitude of the difference between systems (Sect. 5.2), and finally proposes how researchers should present their significance test results in technical papers and reports (Sect. 5.3). Reporting individual results effectively means that the research community as a whole can accumulate reproducible pieces of evidence and draw general conclusions from them; if researchers adhere to bad practices, the result is a community in which very little is learnt from one another.


Footnotes
1
A book edited by Harlow, Mulaik, and Steiger [11] contains a small collection of good arguments for and against classical significance testing.
 
2
Pr(H|D) can be directly addressed using Bayesian statistics [2, 29], but this is beyond the scope of this book; see also Chap. 8.
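As a brief illustration of why Pr(H|D) is not the quantity a classical test reports (a sketch in assumed notation, with H a hypothesis and D the observed data, not necessarily the book's own formulation), Bayes' rule relates the two conditional probabilities as

\[
\Pr(H \mid D) = \frac{\Pr(D \mid H)\,\Pr(H)}{\Pr(D)} ,
\]

so a small \(\Pr(D \mid H)\) does not by itself imply a small \(\Pr(H \mid D)\); the prior \(\Pr(H)\) and the marginal \(\Pr(D)\) also enter.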
 
3
In his influential paper that advocated the use of parametric tests for IR evaluation, Hull described the p-value as “a measurement of the probability that the observed difference could have occurred by chance” [14]. Nooo! On the other hand, he also noted: “Researchers should simply be cautioned to consider both the statistical significance and the magnitude of the difference” and thus correctly pointed out the importance of effect size.
 
4
As was mentioned earlier, the standardised mean difference measures the effect in standard deviation units. Thus, given the same raw difference \(\bar {d}\), the effect size is considered relatively small for high-variance distributions and relatively large for low-variance ones.
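The point can be checked numerically with a minimal sketch. It assumes a paired setting in which the effect size is the mean per-topic difference divided by the sample standard deviation of those differences; the chapter's exact formula is not reproduced here, and the function name and the two toy score lists are hypothetical.

```python
from statistics import mean, stdev


def standardised_mean_difference(diffs):
    """Mean difference expressed in standard deviation units.

    Assumption: the standardiser is the sample standard deviation of the
    per-topic differences (n - 1 in the denominator).
    """
    return mean(diffs) / stdev(diffs)


# Two hypothetical sets of per-topic differences with the same mean (0.05)
# but very different spread.
low_variance = [0.04, 0.05, 0.06, 0.05, 0.05]
high_variance = [-0.15, 0.25, 0.05, -0.05, 0.15]

es_low = standardised_mean_difference(low_variance)
es_high = standardised_mean_difference(high_variance)

print(f"raw difference: {mean(low_variance):.3f} vs {mean(high_variance):.3f}")
print(f"effect size:    {es_low:.3f} vs {es_high:.3f}")
```

With identical raw differences, the low-variance list yields the larger standardised effect, which is the behaviour the footnote describes.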
 
5
To add to the confusion, in a book on effect sizes by Ellis [6], the standard deviation formulas given for Cohen’s d and Hedges’ g are actually equivalent (pp. 26–27). The difference between Hedges’ g and Cohen’s d is also discussed in Grissom and Kim [10] (p. 58).
 
6
When \(n_1 = n_2\), \(\hat {\delta }\) is the unique minimum variance unbiased estimator of \(\delta\) [13].
 
7
The estimates \(\hat {\sigma }_{A}^2\) and \(\hat {\sigma }_{B}^2\) have different forms because while A (system) is a fixed factor (i.e. we are interested in a particular set of systems and no other system), B (topic) is considered to be a random factor (i.e. we could have had a different set of topics); see Kline [19] (Chapter 6, pp. 185–196).
 
8
The formula for \(\hat {\omega }_{p}^2\) provided in Okubo and Okada [24] (Chapter 3 Eq. 3.69) contains an error, despite their claim that they substituted Eq. 5.27 into Eq. 5.25 (using a set of notations different from this book). For this reason, Eq. 13 in Sakai [27] also contains an error: as Eq. 5.29 shows, the denominator involves mn, not n.
 
9
Both A and B are treated as fixed factors; see Kline [19] (Chapter 7, p. 232).
 
10
Not significant.
 
11
While Sakai [27] recommended reporting ANOVA results prior to discussing an RTHSD result, we omit the ANOVA step in this book for the reason given at the beginning of Chap. 4.
 
12
If the above description of ES E2 seems too lengthy, it might be a good idea to just cite this book instead!
 
References
1.
D. Bakan, The test of significance in psychological research. Psychol. Bull. 66(6), 423–437 (1966)
2.
B. Carterette, Bayesian inference for information retrieval evaluation, in Proceedings of ACM ICTIR, Northampton, 2015, pp. 31–40
3.
J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd edn. (Psychology Press, New York, 1988)
4.
J. Cohen, The earth is round (p < .05). Am. Psychol. 49(12), 997–1003 (1994)
5.
W. Edwards Deming, On probability as a basis for action. Am. Stat. 29(4), 146–152 (1975)
6.
P.D. Ellis, The Essential Guide to Effect Sizes (Cambridge University Press, Cambridge/New York, 2010)
7.
A. Field, G. Hole, How to Design and Report Experiments (Sage Publications, London, 2003)
8.
G.V. Glass, B. McGaw, M.L. Smith, Meta-Analysis in Social Research (Sage Publications, Beverly Hills, 1981)
9.
S. Greenland, S.J. Senn, K.J. Rothman, J.B. Carlin, C. Poole, S.N. Goodman, D.G. Altman, Statistical tests, p values, confidence intervals, and power: a guide to misinterpretations. Eur. J. Epidemiol. 31(4), 337–350 (2016)
10.
R.J. Grissom, J.J. Kim, Effect Sizes for Research, 2nd edn. (Routledge, New York, 2012)
11.
L.L. Harlow, S.A. Mulaik, J.H. Steiger, What If There Were No Significance Tests? (Classic Edition) (Routledge, London, 2016)
12.
W.L. Hays, Statistics (Fifth Edition/International Edition) (Harcourt Brace College Publishers, Fort Worth, 1994)
13.
L.V. Hedges, I. Olkin, Statistical Methods for Meta-Analysis (Academic Press, San Diego, 1985)
14.
D. Hull, Using statistical testing in the evaluation of retrieval experiments, in Proceedings of ACM SIGIR’93, Pittsburgh, 1993, pp. 329–338
15.
D.H. Johnson, The insignificance of statistical significance testing. J. Wildlife Manag. 63(3), 763–772 (1999)
16.
E.M. Keen, Presenting results of experimental retrieval comparisons. Inf. Process. Manag. 28(4), 491–502 (1992)
17.
K. Kelley, K.J. Preacher, On effect size. Psychol. Methods 17(2), 137–152 (2012)
18.
G. Keren, C. Lewis, Partial omega squared for ANOVA designs. Educ. Psychol. Meas. 39(1), 119–128 (1979)
19.
R.B. Kline, Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research (American Psychological Association, Washington, 2004)
20.
H.C. Kraemer, C. Blasey, How Many Subjects? Statistical Power Analysis in Research, 2nd edn. (SAGE Publications, Los Angeles, 2016)
21.
J.P. Lander, R for Everyone (Addison Wesley, Upper Saddle River, 2014)
22.
G.R. Loftus, On the tyranny of hypothesis testing in the social sciences. Contemp. Psychol. 36(2), 102–105 (1991)
23.
R.E. McGrath, G.J. Meyer, When effect sizes disagree: the case of r and d. Psychol. Methods 11(4), 386–401 (2006)
24.
M. Okubo, K. Okada, Psychological Statistics to Tell Your Story: Effect Size, Confidence Interval (in Japanese) (Keiso Shobo, Bunkyo, 2012)
25.
S. Olejnik, J. Algina, Generalized eta and omega squared statistics: measures of effect size for some common research designs. Psychol. Methods 8(4), 434–447 (2003)
26.
K.J. Rothman, Writing for epidemiology. Epidemiology 9(3), 333–337 (1998)
28.
T. Sakai, Statistical significance, power, and sample sizes: a systematic review of SIGIR and TOIS, in Proceedings of ACM SIGIR, Pisa, 2016, pp. 5–14
29.
T. Sakai, The probability that your hypothesis is correct, credible intervals, and effect sizes for IR evaluation, in Proceedings of ACM SIGIR, Shinjuku, 2017, pp. 25–34
30.
F.L. Schmidt, Statistical significance testing and cumulative knowledge in psychology: implications for training of researchers. Psychol. Methods 1(2), 115–129 (1996)
31.
K. Sparck Jones, Automatic indexing 1974: a state of the art review. Technical report, Computer Laboratory, University of Cambridge, British Library Research and Development Report No. 5193 (1974)
32.
K. Sparck Jones, Retrieval system tests 1958–1978, in Information Retrieval Experiment, chap. 12, ed. by K. Sparck Jones. (Butterworths, London, 1981)
33.
R.L. Wasserstein, N.A. Lazar, The ASA’s statement on p-values: context, process, and purpose. Am. Stat. 70(2), 129–133 (2016)
34.
S.T. Ziliak, D.N. McCloskey, The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives (The University of Michigan Press, Ann Arbor, 2008)
Metadata
Title
The Correct Ways to Use Significance Tests
Written by
Tetsuya Sakai
Copyright year
2018
Publisher
Springer Singapore
DOI
https://doi.org/10.1007/978-981-13-1199-4_5
