2018 | OriginalPaper | Chapter

1. Preliminaries

Author: Tetsuya Sakai

Published in: Laboratory Experiments in Information Retrieval

Publisher: Springer Singapore

Abstract

This chapter discusses the basic principles of classical statistical significance testing (Sect. 1.1) and defines some well-known probability distributions that are necessary for discussing parametric significance tests (Sect. 1.2). (“A problem is parametric if the form of the underlying distribution is known, and it is nonparametric if we have no knowledge concerning the distribution(s) from which the observations are drawn.” Good (Permutation, parametric, and bootstrap tests of hypothesis, 3rd edn. Springer, New York, 2005, p. 14). For example, the paired t-test is a parametric test for paired data as it relies on the assumption that the observed data independently obey normal distributions (see Chap. 2 Sect. 2.2); the sign test is a nonparametric test, and it may be applied to the same data even when the normality assumption does not hold. This book only discusses parametric tests for comparing means, namely, t-tests and ANOVAs. See Chap. 2 for a discussion on the robustness of the t-test to violations of the normality assumption.) As this book is intended for IR researchers such as myself, not statisticians, well-known theorems are presented without proofs; only brief proofs for corollaries are given. In the next two chapters, we shall use these basic theorems and corollaries as black boxes, just as programmers utilise standard libraries when writing their own code. This chapter also defines less well-known distributions called noncentral distributions (Sect. 1.3), which we shall need for discussing sample size design and power analysis in Chaps. 6 and 7. Hence Sect. 1.3 may be skipped if the reader only wishes to learn about the principles and limitations of significance testing; however, such readers should read up to Chap. 5 before abandoning this book.
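To make the parametric/nonparametric distinction above concrete, here is a minimal R sketch that applies a paired t-test and a sign test to the same hypothetical per-topic scores of two systems (the scores, topic count, and variable names are invented for illustration):

  # Hypothetical per-topic evaluation scores for Systems X and Y on 10 topics
  x <- c(0.42, 0.55, 0.31, 0.60, 0.48, 0.51, 0.38, 0.66, 0.45, 0.58)
  y <- c(0.40, 0.50, 0.35, 0.52, 0.45, 0.49, 0.36, 0.60, 0.44, 0.50)

  # Parametric: the paired t-test assumes the per-topic differences are
  # independent draws from a normal distribution (see Chap. 2 Sect. 2.2)
  t.test(x, y, paired = TRUE)

  # Nonparametric: a sign test, realised here as an exact binomial test on the
  # number of positive differences; it makes no normality assumption
  d <- x - y
  binom.test(sum(d > 0), sum(d != 0), p = 0.5)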


Footnotes
1
Henceforth, for simplicity, I will basically ignore the difference between “topics” (i.e., information need statements) and “queries” (i.e., the sets of keywords input to the search engine) and that between “queries” and the “scores” computed for their search results.
 
2
While this book primarily discusses sample means over n topics, some IR researchers have explored the approach of regarding the document collection used in an experiment as a sample from a large population of documents [4, 14, 15].
 
3
Is a TREC (Text REtrieval Conference) topic set a random sample? Probably not. However, the reader should be aware that IR researchers who rely on significance tests such as t-tests and ANOVA for comparing system means implicitly rely on the basic assumption that a topic set is a random sample. The exact assumptions for t-tests and ANOVA are stated in Chaps. 2 and 3. Also, a computer-based significance test that relies on neither random sampling nor any distributional assumptions will be described in Chap. 4 Sect. 4.5.
 
4
In this book, a random variable and its realisations are denoted by the same symbol, e.g. x.
 
5
In contrast to classical significance testing, Bayesian statistics [2, 10, 17–19] treats population parameters as random variables. See also Chap. 8.
 
6
Not to be confused with two-sample tests, in which you have a sample for System X and a different sample for System Y, possibly with different sample sizes (see Chap. 2 Sect. 2.3).
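For illustration only (hypothetical scores), the corresponding R calls differ in the paired argument; note that a two-sample call also accepts samples of different sizes:

  x <- c(0.42, 0.55, 0.31, 0.60, 0.48)        # scores for System X (5 topics)
  y <- c(0.40, 0.50, 0.35, 0.52, 0.45, 0.49)  # scores for System Y (6 different topics)
  t.test(x, y)                  # two-sample (Welch) t-test; unequal sizes are fine
  # t.test(x, y, paired = TRUE) would instead require paired samples of equal length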
 
7
A one-sided test of the form \(H_{1}: \mu _{X} > \mu _{Y}\) would make sense if either \(\mu _{X} < \mu _{Y}\) is simply impossible, or if you do not need to consider the case where \(\mu _{X} < \mu _{Y}\) even if this is possible [12]. For example, if you are measuring the effect of introducing an aggressive stemming algorithm into your IR system in terms of recall (not precision) and you know that this can never hurt recall, a one-sided test may be appropriate. But in practice, when you propose a new IR algorithm and want to compare it with a competitive baseline, it is rarely the case that you know in advance that your proposal is better. Hence I recommend the two-sided test as the default. Whichever you choose, hypotheses \(H_{0}\) and \(H_{1}\) must be set up before actually examining the data.
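In R, the choice between the two is expressed through the alternative argument of t.test; a small sketch with hypothetical per-topic score differences:

  d <- c(0.02, 0.05, -0.04, 0.08, 0.03, 0.02, 0.02, 0.06, 0.01, 0.08)  # hypothetical X - Y differences
  t.test(d)                             # two-sided (default): H1 is that the population mean differs from 0
  t.test(d, alternative = "greater")    # one-sided: H1 is that the population mean exceeds 0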
 
8
“Dichotomous thinking,” one of the major reasons why classical significance testing has been severely criticised for decades, will be discussed in Chap. 5.
 
9
For a discussion on the difference between convergence in probability (which is used in the weak law of large numbers) and almost sure convergence (which is used in the strong law of large numbers), see, for example, https://stats.stackexchange.com/questions/2230/convergence-in-probability-vs-almost-sure-convergence.
 
10
With Microsoft Excel, \(z_{\mathit{inv}}(P)\) can be obtained as NORM.S.INV(1 − P); with R, it can be obtained as qnorm(P, lower.tail=FALSE).
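For example (value rounded), the upper 5% point of the standard normal distribution:

  qnorm(0.05, lower.tail = FALSE)   # roughly 1.645; equivalently NORM.S.INV(0.95) in Excel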
 
11
It is often recommended that a binomial distribution be approximated with a normal distribution provided nP ≥ 5 and n(1 − P) ≥ 5 hold [12]. Note that when n = 15 and P = 0.5, we have nP = n(1 − P) = 7.5 > 5, whereas, when n = 15 and P = 0.2, we have nP = 3 < 5. In the latter situation, the above recommendation suggests that we should increase the sample size to n = 25: however, note that even this is not a large number.
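The rule of thumb is easy to check in R, and the exact binomial probability can be compared against its normal approximation (arbitrary example values):

  n <- 15; P <- 0.2
  c(n * P, n * (1 - P))                                  # 3 and 12: nP < 5 here
  pbinom(3, n, P)                                        # exact P(X <= 3)
  pnorm(3.5, mean = n * P, sd = sqrt(n * P * (1 - P)))   # normal approximation with continuity correction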
 
12
The sample mean uses n as the denominator because it is based on n independent pieces of information, namely, the \(x_{i}\)’s. In contrast, while Eq. 1.13 shows that S is based on the \((x_{i}-\bar {x})\)’s, these are not actually n independent pieces of information, since \(\sum _{i=1}^{n}(x_{i}-\bar {x}) = \sum _{i=1}^{n}x_{i} - n\bar {x} = 0\) holds. There are only (n − 1) independent pieces of information in the sense that, once we have decided on (n − 1) values out of n, the last one is automatically determined due to the above constraint. For this reason, dividing S by n − 1 makes sense [12]. See also Sect. 1.2.4 where the degrees of freedom of S are discussed.
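A small R sketch (hypothetical numbers) of the constraint and the resulting divisor:

  x <- c(0.42, 0.55, 0.31, 0.60, 0.48)   # hypothetical per-topic scores
  sum(x - mean(x))                       # essentially 0: the deviations are constrained
  S <- sum((x - mean(x))^2)              # sum of squared deviations
  S / (length(x) - 1)                    # divide by n - 1 ...
  var(x)                                 # ... which is exactly what R's var() returns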
 
13
To be more specific, \(E(S/n)=\frac {n-1}{n}\sigma ^2 < \sigma ^2\), and hence S/n underestimates \(\sigma ^2\) [12]. Also of note is that the sample standard deviation \(\sqrt {V}\) is not an unbiased estimator of the population standard deviation σ despite the fact that \(E(V) = \sigma ^2\).
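A quick simulation sketch of the bias, assuming normal data with true variance 4 (the setup is arbitrary):

  set.seed(1)
  n <- 5
  sims <- replicate(100000, {
    x <- rnorm(n, mean = 0, sd = 2)           # true variance is 4
    c(biased = sum((x - mean(x))^2) / n,      # S/n
      unbiased = var(x))                      # S/(n - 1)
  })
  rowMeans(sims)   # the first average should be close to (n-1)/n * 4 = 3.2, the second close to 4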
 
14
It is known that the population mean and the population variance of \(\chi ^2\) are given by \(E(\chi ^2) = \phi \) and \(V(\chi ^2) = 2\phi \), respectively.
 
15
With Microsoft Excel, \(\chi _{\mathit {inv}}^{2}(\phi ; P)\) can be obtained as CHISQ.INV.RT(P, ϕ); with R, it can be obtained as qchisq(P, ϕ, lower.tail=FALSE).
 
16
It is known that the population mean and the population variance of t are given by \(E(t) = 0\) (for ϕ ≥ 2) and \(V(t)=\frac {\phi }{\phi -2}\) (for ϕ ≥ 3), respectively.
 
17
With Microsoft Excel, \(t_{\mathit{inv}}(\phi ; P)\) can be obtained as T.INV.2T(P, ϕ); with R, it can be obtained as qt(P∕2, ϕ, lower.tail=FALSE).
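For example (value rounded), the two-sided 5% critical value with ϕ = 10 degrees of freedom:

  qt(0.05 / 2, 10, lower.tail = FALSE)   # roughly 2.228; equivalently T.INV.2T(0.05, 10) in Excel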
 
18
It is known that the population mean and the population variance of F are given by \(E(F)=\frac {\phi _{2}}{\phi _{2}-2}\) (for \(\phi _{2} \ge 3\)) and \(V(F)=\frac {2 \phi _{2}^2(\phi _{1}+\phi _{2}-2)}{\phi _{1}(\phi _{2}-2)^2(\phi _{2}-4)}\) (for \(\phi _{2} \ge 5\)), respectively.
 
19
With Microsoft Excel, \(F_{\mathit{inv}}(\phi _{1}, \phi _{2}; P)\) can be obtained as F.INV.RT(P, ϕ₁, ϕ₂); with R, it can be obtained as qf(P, ϕ₁, ϕ₂, lower.tail=FALSE).
 
20
It is known that the population mean and the population variance of \(t^{\prime }\) are given by \(E(t^{\prime })= \frac {\lambda \sqrt {\phi /2} {\Gamma }((\phi -1)/2)}{{\Gamma }(\phi /2)}\) (for ϕ ≥ 2) and \(V(t^{\prime })= \frac {\phi (1+\lambda ^2)}{\phi -2} - \{E(t^{\prime })\}^2\) (for ϕ ≥ 3), respectively.
 
21
It is known that the population mean and the population variance of \(\chi ^{\prime 2}\) are given by \(E(\chi ^{\prime 2}) = \phi + \lambda \) and \(V(\chi ^{\prime 2}) = 2(\phi + 2\lambda )\), respectively.
 
22
It is known that the population mean and the population variance of \(F^{\prime }\) are given by \(E(F^{\prime }) = \frac {\phi _{2}(\phi _{1} + \lambda )}{\phi _{1}(\phi _{2}-2)}\) (for \(\phi _{2} \ge 3\)) and \(V(F^{\prime })=2\left (\frac {\phi _{2}}{\phi _{1}}\right )^2 \frac {(\phi _{1}+\lambda )^2 + (\phi _{1}+2\lambda )(\phi _{2}-2)}{(\phi _{2}-2)^2(\phi _{2}-4)}\) (for \(\phi _{2} \ge 5\)), respectively.
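As a practical aside (not stated in the footnotes above), R's distribution functions for the t, chi-square, and F distributions also accept a noncentrality parameter ncp, which is convenient when working with the noncentral distributions needed for the sample size design and power analysis of Chaps. 6 and 7; arbitrary example values:

  pt(2.0, df = 10, ncp = 1.5)           # noncentral t distribution function
  pchisq(15, df = 10, ncp = 3)          # noncentral chi-square
  pf(2.5, df1 = 2, df2 = 20, ncp = 4)   # noncentral F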
 
Literature
1.
C. Buckley, E.M. Voorhees, Retrieval system evaluation, in TREC: Experiment and Evaluation in Information Retrieval, Chap. 3, ed. by E.M. Voorhees, D.K. Harman (The MIT Press, Cambridge, 2005), pp. 53–75
2.
B. Carterette, Bayesian inference for information retrieval evaluation, in Proceedings of ACM ICTIR, Northampton, 2015, pp. 31–40
3.
J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd edn. (Psychology Press, New York, 1988)
4.
G.V. Cormack, C.R. Palmer, C.L.A. Clarke, Efficient construction of large test collections, in Proceedings of ACM SIGIR, Melbourne, 1998, pp. 282–289
5.
G. Cumming, Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis (Routledge, New York/London, 2012)
6.
P.D. Ellis, The Essential Guide to Effect Sizes (Cambridge University Press, Cambridge/New York, 2010)
7.
P. Good, Permutation, Parametric, and Bootstrap Tests of Hypothesis, 3rd edn. (Springer, New York, 2005)
8.
R.J. Grissom, J.J. Kim, Effect Sizes for Research, 2nd edn. (Routledge, New York, 2012)
9.
K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques. ACM TOIS 20(4), 422–446 (2002)
10.
J.K. Kruschke, Doing Bayesian Data Analysis, 2nd edn. (Elsevier, Amsterdam, 2015)
11.
K.R. Murphy, B. Myors, A. Wolach, Statistical Power Analysis, 4th edn. (Routledge, New York, 2014)
12.
Y. Nagata, How to Understand Statistical Methods (in Japanese) (JUSE Press, Shibuya, 1996)
13.
Y. Nagata, How to Design the Sample Size (in Japanese) (Asakura Shoten, Shinjuku, 2003)
14.
S.E. Robertson, On document populations and measures of IR effectiveness, in Proceedings of ICTIR, Budapest, 2007, pp. 9–22
15.
S.E. Robertson, E. Kanoulas, On per-topic variance in IR evaluation, in Proceedings of ACM SIGIR, Portland, 2012, pp. 891–900
16.
17.
T. Sakai, The probability that your hypothesis is correct, credible intervals, and effect sizes for IR evaluation, in Proceedings of ACM SIGIR, Shinjuku, 2017, pp. 25–34
18.
H. Toyoda (ed.), Fundamentals of Bayesian Statistics: Practical Getting Started by Hamiltonian Monte Carlo Method (in Japanese) (Asakura Shoten, Shinjuku, 2015)
19.
H. Toyoda, An Introduction to Statistical Data Analysis: Bayesian Statistics for ‘post p-value era’ (in Japanese) (Asakura Shoten, Shinjuku, 2016)
Metadata
Title
Preliminaries
Author
Tetsuya Sakai
Copyright Year
2018
Publisher
Springer Singapore
DOI
https://doi.org/10.1007/978-981-13-1199-4_1