2018 | OriginalPaper | Chapter

1. Preliminaries

Author: Tetsuya Sakai

Published in: Laboratory Experiments in Information Retrieval

Publisher: Springer Singapore

Abstract

This chapter discusses the basic principles of classical statistical significance testing (Sect. 1.1) and defines some well-known probability distributions that are necessary for discussing parametric significance tests (Sect. 1.2). (“A problem is parametric if the form of the underlying distribution is known, and it is nonparametric if we have no knowledge concerning the distribution(s) from which the observations are drawn.” Good (Permutation, parametric, and bootstrap tests of hypothesis, 3rd edn. Springer, New York, 2005, p. 14). For example, the paired t-test is a parametric test for paired data as it relies on the assumption that the observed data independently obey normal distributions (see Chap. 2 Sect. 2.2); the sign test is a nonparametric test, and it may be applied to the same data even when the normality assumption does not hold. This book only discusses parametric tests for comparing means, namely, t-tests and ANOVAs. See Chap. 2 for a discussion on the robustness of the t-test to violations of the normality assumption.) As this book is intended for IR researchers such as myself, not statisticians, well-known theorems are presented without proofs; only brief proofs for corollaries are given. In the next two chapters, we shall use these basic theorems and corollaries as black boxes, just as programmers utilise standard libraries when writing their own code. This chapter also defines less well-known distributions called noncentral distributions (Sect. 1.3), which we shall need for discussing sample size design and power analysis in Chaps. 6 and 7. Hence Sect. 1.3 may be skipped if the reader only wishes to learn about the principles and limitations of significance testing; however, such readers should read up to Chap. 5 before abandoning this book.
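To make the parametric/nonparametric distinction above concrete, here is a minimal R sketch that applies a paired t-test and a sign test to the same hypothetical per-topic scores of two systems (the scores, topic count, and variable names are invented for illustration):

  # Hypothetical per-topic evaluation scores for Systems X and Y on 10 topics
  x <- c(0.42, 0.55, 0.31, 0.60, 0.48, 0.51, 0.38, 0.66, 0.45, 0.58)
  y <- c(0.40, 0.50, 0.35, 0.52, 0.45, 0.49, 0.36, 0.60, 0.44, 0.50)

  # Parametric: the paired t-test assumes the per-topic differences are
  # independent draws from a normal distribution (see Chap. 2 Sect. 2.2)
  t.test(x, y, paired = TRUE)

  # Nonparametric: a sign test, realised here as an exact binomial test on the
  # number of positive differences; it makes no normality assumption
  d <- x - y
  binom.test(sum(d > 0), sum(d != 0), p = 0.5)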


Footnotes
1
Henceforth, for simplicity, I will basically ignore the difference between “topics” (i.e., information need statements) and “queries” (i.e., the sets of keywords input to the search engine) and that between “queries” and the “scores” computed for their search results.
 
2
While this book primarily discusses sample means over n topics, some IR researchers have explored the approach of regarding the document collection used in an experiment as a sample from a large population of documents [4, 14, 15].
 
3
Is a TREC (Text REtrieval Conference) topic set a random sample? Probably not. However, the reader should be aware that IR researchers who rely on significance tests such as t-tests and ANOVA for comparing system means implicitly rely on the basic assumption that a topic set is a random sample. The exact assumptions for t-tests and ANOVA are stated in Chaps. 2 and 3. Also, a computer-based significance test that relies on neither random sampling nor any distributional assumptions will be described in Chap. 4 Sect. 4.5.
 
4
In this book, a random variable and its realisations are denoted by the same symbol, e.g. x.
 
5
In contrast to classical significance testing, Bayesian statistics [2, 10, 17–19] treats population parameters as random variables. See also Chap. 8.
 
6
Not to be confused with two-sample tests, in which you have a sample for System X and a different sample for System Y, possibly with different sample sizes (see Chap. 2 Sect. 2.3).
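For illustration only (hypothetical scores), the corresponding R calls differ in the paired argument; note that a two-sample call also accepts samples of different sizes:

  x <- c(0.42, 0.55, 0.31, 0.60, 0.48)        # scores for System X (5 topics)
  y <- c(0.40, 0.50, 0.35, 0.52, 0.45, 0.49)  # scores for System Y (6 different topics)
  t.test(x, y)                  # two-sample (Welch) t-test; unequal sizes are fine
  # t.test(x, y, paired = TRUE) would instead require paired samples of equal length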
 
7
A one-sided test of the form \(H_{1}: \mu _{X} > \mu _{Y}\) would make sense if either \(\mu _{X} < \mu _{Y}\) is simply impossible, or if you do not need to consider the case where \(\mu _{X} < \mu _{Y}\) even if this is possible [12]. For example, if you are measuring the effect of introducing an aggressive stemming algorithm into your IR system in terms of recall (not precision) and you know that this can never hurt recall, a one-sided test may be appropriate. But in practice, when you propose a new IR algorithm and want to compare it with a competitive baseline, it is rarely the case that you know in advance that your proposal is better. Hence I recommend the two-sided test as the default. Whichever you choose, hypotheses \(H_{0}\) and \(H_{1}\) must be set up before actually examining the data.
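In R, the choice between the two is expressed through the alternative argument of t.test; a small sketch with hypothetical per-topic score differences:

  d <- c(0.02, 0.05, -0.04, 0.08, 0.03, 0.02, 0.02, 0.06, 0.01, 0.08)  # hypothetical X - Y differences
  t.test(d)                             # two-sided (default): H1 is that the population mean differs from 0
  t.test(d, alternative = "greater")    # one-sided: H1 is that the population mean exceeds 0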
 
8
“Dichotomous thinking,” one of the major reasons why classical significance testing has been severely criticised for decades, will be discussed in Chap. 5.
 
9
For a discussion on the difference between convergence in probability (which is used in the weak law of large numbers) and almost sure convergence (which is used in the strong law of large numbers), see, for example, https://stats.stackexchange.com/questions/2230/convergence-in-probability-vs-almost-sure-convergence.
 
10
With Microsoft Excel, \(z_{\mathit{inv}}(P)\) can be obtained as NORM.S.INV(1 − P); with R, it can be obtained as qnorm(P, lower.tail=FALSE).
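For example (value rounded), the upper 5% point of the standard normal distribution:

  qnorm(0.05, lower.tail = FALSE)   # roughly 1.645; equivalently NORM.S.INV(0.95) in Excel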
 
11
It is often recommended that a binomial distribution be approximated with a normal distribution provided nP ≥ 5 and n(1 − P) ≥ 5 hold [12]. Note that when n = 15 and P = 0.5, we have nP = n(1 − P) = 7.5 > 5, whereas, when n = 15 and P = 0.2, we have nP = 3 < 5. In the latter situation, the above recommendation suggests that we should increase the sample size to n = 25: however, note that even this is not a large number.
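The rule of thumb is easy to check in R, and the exact binomial probability can be compared against its normal approximation (arbitrary example values):

  n <- 15; P <- 0.2
  c(n * P, n * (1 - P))                                  # 3 and 12: nP < 5 here
  pbinom(3, n, P)                                        # exact P(X <= 3)
  pnorm(3.5, mean = n * P, sd = sqrt(n * P * (1 - P)))   # normal approximation with continuity correction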
 
12
The sample mean uses n as the denominator because it is based on n independent pieces of information, namely, the \(x_{i}\)’s. In contrast, while Eq. 1.13 shows that S is based on the \((x_{i}-\bar {x})\)’s, these are not actually n independent pieces of information, since \(\sum _{i=1}^{n}(x_{i}-\bar {x}) = \sum _{i=1}^{n}x_{i} - n\bar {x} = 0\) holds. There are only (n − 1) independent pieces of information in the sense that, once we have decided on (n − 1) values out of n, the last one is automatically determined due to the above constraint. For this reason, dividing S by n − 1 makes sense [12]. See also Sect. 1.2.4 where the degrees of freedom of S are discussed.
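A small R sketch (hypothetical numbers) of the constraint and the resulting divisor:

  x <- c(0.42, 0.55, 0.31, 0.60, 0.48)   # hypothetical per-topic scores
  sum(x - mean(x))                       # essentially 0: the deviations are constrained
  S <- sum((x - mean(x))^2)              # sum of squared deviations
  S / (length(x) - 1)                    # divide by n - 1 ...
  var(x)                                 # ... which is exactly what R's var() returns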
 
13
To be more specific, \(E(S/n)=\frac {n-1}{n}\sigma ^2 < \sigma ^2\), and hence S/n underestimates \(\sigma ^2\) [12]. Also of note is that the sample standard deviation \(\sqrt {V}\) is not an unbiased estimator of the population standard deviation σ despite the fact that \(E(V) = \sigma ^2\).
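A quick simulation sketch of the bias, assuming normal data with true variance 4 (the setup is arbitrary):

  set.seed(1)
  n <- 5
  sims <- replicate(100000, {
    x <- rnorm(n, mean = 0, sd = 2)           # true variance is 4
    c(biased = sum((x - mean(x))^2) / n,      # S/n
      unbiased = var(x))                      # S/(n - 1)
  })
  rowMeans(sims)   # the first average should be close to (n-1)/n * 4 = 3.2, the second close to 4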
 
14
It is known that the population mean and the population variance of \(\chi ^2\) are given by \(E(\chi ^2) = \phi \) and \(V(\chi ^2) = 2\phi \), respectively.
 
15
With Microsoft Excel, \(\chi _{\mathit {inv}}^{2}(\phi ; P)\) can be obtained as CHISQ.INV.RT(P, ϕ); with R, it can be obtained as qchisq(P, ϕ, lower.tail=FALSE).
 
16
It is known that the population mean and the population variance of t are given by \(E(t) = 0\) (for ϕ ≥ 2) and \(V(t)=\frac {\phi }{\phi -2}\) (for ϕ ≥ 3), respectively.
 
17
With Microsoft Excel, \(t_{\mathit{inv}}(\phi ; P)\) can be obtained as T.INV.2T(P, ϕ); with R, it can be obtained as qt(P∕2, ϕ, lower.tail=FALSE).
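For example (value rounded), the two-sided 5% critical value with ϕ = 10 degrees of freedom:

  qt(0.05 / 2, 10, lower.tail = FALSE)   # roughly 2.228; equivalently T.INV.2T(0.05, 10) in Excel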
 
18
It is known that the population mean and the population variance of F are given by \(E(F)=\frac {\phi _{2}}{\phi _{2}-2}\) (for \(\phi _{2} \ge 3\)) and \(V(F)=\frac {2 \phi _{2}^2(\phi _{1}+\phi _{2}-2)}{\phi _{1}(\phi _{2}-2)^2(\phi _{2}-4)}\) (for \(\phi _{2} \ge 5\)), respectively.
 
19
With Microsoft Excel, \(F_{\mathit{inv}}(\phi _{1}, \phi _{2}; P)\) can be obtained as F.INV.RT(P, ϕ₁, ϕ₂); with R, it can be obtained as qf(P, ϕ₁, ϕ₂, lower.tail=FALSE).
 
20
It is known that the population mean and the population variance of \(t^{\prime }\) are given by \(E(t^{\prime })= \frac {\lambda \sqrt {\phi /2} {\Gamma }((\phi -1)/2)}{{\Gamma }(\phi /2)}\) (for ϕ ≥ 2) and \(V(t^{\prime })= \frac {\phi (1+\lambda ^2)}{\phi -2} - \{E(t^{\prime })\}^2\) (for ϕ ≥ 3), respectively.
 
21
It is known that the population mean and the population variance of \(\chi ^{\prime 2}\) are given by \(E(\chi ^{\prime 2}) = \phi + \lambda \) and \(V(\chi ^{\prime 2}) = 2(\phi + 2\lambda )\), respectively.
 
22
It is known that the population mean and the population variance of \(F^{\prime }\) are given by \(E(F^{\prime }) = \frac {\phi _{2}(\phi _{1} + \lambda )}{\phi _{1}(\phi _{2}-2)}\) (for \(\phi _{2} \ge 3\)) and \(V(F^{\prime })=2\left (\frac {\phi _{2}}{\phi _{1}}\right )^2 \frac {(\phi _{1}+\lambda )^2 + (\phi _{1}+2\lambda )(\phi _{2}-2)}{(\phi _{2}-2)^2(\phi _{2}-4)}\) (for \(\phi _{2} \ge 5\)), respectively.
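As a practical aside (not stated in the footnotes above), R's distribution functions for the t, chi-square, and F distributions also accept a noncentrality parameter ncp, which is convenient when working with the noncentral distributions needed for the sample size design and power analysis of Chaps. 6 and 7; arbitrary example values:

  pt(2.0, df = 10, ncp = 1.5)           # noncentral t distribution function
  pchisq(15, df = 10, ncp = 3)          # noncentral chi-square
  pf(2.5, df1 = 2, df2 = 20, ncp = 4)   # noncentral F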
 
Literature
1.
C. Buckley, E.M. Voorhees, Retrieval system evaluation, in TREC: Experiment and Evaluation in Information Retrieval, Chap. 3, ed. by E.M. Voorhees, D.K. Harman (The MIT Press, Cambridge, 2005), pp. 53–75
2.
B. Carterette, Bayesian inference for information retrieval evaluation, in Proceedings of ACM ICTIR, Northampton, 2015, pp. 31–40
3.
J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd edn. (Psychology Press, New York, 1988)
4.
G.V. Cormack, C.R. Palmer, C.L.A. Clarke, Efficient construction of large test collections, in Proceedings of ACM SIGIR, Melbourne, 1998, pp. 282–289
5.
G. Cumming, Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis (Routledge, New York/London, 2012)
6.
P.D. Ellis, The Essential Guide to Effect Sizes (Cambridge University Press, Cambridge/New York, 2010)
7.
P. Good, Permutation, Parametric, and Bootstrap Tests of Hypothesis, 3rd edn. (Springer, New York, 2005)
8.
R.J. Grissom, J.J. Kim, Effect Sizes for Research, 2nd edn. (Routledge, New York, 2012)
9.
K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques. ACM TOIS 20(4), 422–446 (2002)
10.
J.K. Kruschke, Doing Bayesian Data Analysis, 2nd edn. (Elsevier, Amsterdam, 2015)
11.
K.R. Murphy, B. Myors, A. Wolach, Statistical Power Analysis, 4th edn. (Routledge, New York, 2014)
12.
Y. Nagata, How to Understand Statistical Methods (in Japanese) (JUSE Press, Shibuya, 1996)
13.
Y. Nagata, How to Design the Sample Size (in Japanese) (Asakura Shoten, Shinjuku, 2003)
14.
S.E. Robertson, On document populations and measures of IR effectiveness, in Proceedings of ICTIR, Budapest, 2007, pp. 9–22
15.
S.E. Robertson, E. Kanoulas, On per-topic variance in IR evaluation, in Proceedings of ACM SIGIR, Portland, 2012, pp. 891–900
16.
17.
T. Sakai, The probability that your hypothesis is correct, credible intervals, and effect sizes for IR evaluation, in Proceedings of ACM SIGIR, Shinjuku, 2017, pp. 25–34
18.
H. Toyoda (ed.), Fundamentals of Bayesian Statistics: Practical Getting Started by Hamiltonian Monte Carlo Method (in Japanese) (Asakura Shoten, Shinjuku, 2015)
19.
H. Toyoda, An Introduction to Statistical Data Analysis: Bayesian Statistics for ‘post p-value era’ (in Japanese) (Asakura Shoten, Shinjuku, 2016)
Metadata
Title
Preliminaries
Author
Tetsuya Sakai
Copyright Year
2018
Publisher
Springer Singapore
DOI
https://doi.org/10.1007/978-981-13-1199-4_1