
2018 | Book

Laboratory Experiments in Information Retrieval

Sample Sizes, Effect Sizes, and Statistical Power


About this book

Covering aspects from principles and limitations of statistical significance tests to topic set size design and power analysis, this book guides readers to statistically well-designed experiments. Although classical statistical significance tests are to some extent useful in information retrieval (IR) evaluation, they can harm research unless they are used appropriately with the right sample sizes and statistical power and unless the test results are reported properly. The first half of the book is mainly targeted at undergraduate students, and the second half is suitable for graduate students and researchers who regularly conduct laboratory experiments in IR, natural language processing, recommendations, and related fields.

Chapters 1–5 review parametric significance tests for comparing system means, namely, t-tests and ANOVAs, and show how easily they can be conducted using Microsoft Excel or R. These chapters also discuss a few multiple comparison procedures for researchers who are interested in comparing every system pair, including a randomised version of Tukey's Honestly Significant Difference test. The chapters then deal with known limitations of classical significance testing and provide practical guidelines for reporting research results regarding comparison of means.

Chapters 6 and 7 discuss statistical power. Chapter 6 introduces topic set size design to enable test collection builders to determine an appropriate number of topics to create. Readers can easily use the author’s Excel tools for topic set size design based on the paired and two-sample t-tests, one-way ANOVA, and confidence intervals. Chapter 7 describes power-analysis-based methods for determining an appropriate sample size for a new experiment based on a similar experiment done in the past, detailing how to utilize the author’s R tools for power analysis and how to interpret the results. Case studies from IR for both Excel-based topic set size design and R-based power analysis are also provided.

Table of Contents

Frontmatter
Chapter 1. Preliminaries
Abstract
This chapter discusses the basic principles of classical statistical significance testing (Sect. 1.1) and defines some well-known probability distributions that are necessary for discussing parametric significance tests (Sect. 1.2). (“A problem is parametric if the form of the underlying distribution is known, and it is nonparametric if we have no knowledge concerning the distribution(s) from which the observations are drawn.” Good (Permutation, parametric, and bootstrap tests of hypothesis, 3rd edn. Springer, New York, 2005, p. 14). For example, the paired t-test is a parametric test for paired data as it relies on the assumption that the observed data independently obey normal distributions (see Chap. 2 Sect. 2.2); the sign test is a nonparametric test; the latter may be applied to the same data when the normality assumption does not hold. This book only discusses parametric tests for comparing means, namely, t-tests and ANOVAs. See Chap. 2 for a discussion on the robustness of the t-test to violations of the normality assumption.) As this book is intended for IR researchers such as myself, not statisticians, well-known theorems are presented without proofs; only brief proofs for corollaries are given. In the next two chapters, we shall use these basic theorems and corollaries as black boxes, just as programmers utilise standard libraries when writing their own code. This chapter also defines less well-known distributions called noncentral distributions (Sect. 1.3), which we shall need for discussing sample size design and power analysis in Chaps. 6 and 7. Hence Sect. 1.3 may be skipped if the reader only wishes to learn about the principles and limitations of significance testing; however, such readers should read up to Chap. 5 before abandoning this book.
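Because the noncentral distributions of Sect. 1.3 may be unfamiliar, here is a minimal R sketch (my own illustration, not from the book) of how the noncentral t distribution enters a power calculation; the degrees of freedom and noncentrality parameter below are hypothetical placeholders.

# Power of a two-sided t-test at alpha = 0.05 via the noncentral t distribution
dof <- 24                                # degrees of freedom (hypothetical)
ncp <- 2.5                               # noncentrality parameter (hypothetical)
tcrit <- qt(0.975, dof)                  # two-sided critical value
1 - pt(tcrit, dof, ncp = ncp) + pt(-tcrit, dof, ncp = ncp)   # probability of rejecting H0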
Tetsuya Sakai
Chapter 2. t-Tests
Abstract
This chapter first explains how the following classical significance tests for comparing two means work: the paired t-test for paired data (Sect. 2.2) and (Student’s) two-sample t-test and Welch’s two-sample t-test for unpaired data (Sects. 2.3 and 2.4). You have paired data if, for example, you evaluate two search engines using the same topic set with some evaluation measure such as normalised Discounted Cumulative Gain (nDCG) (Järvelin and Kekäläinen, ACM TOIS 20(4):422–446, 2002). (For a survey on IR evaluation measures, see Sakai (Metrics, statistics, tests. In: PROMISE winter school 2013: bridging between information retrieval and databases. LNCS 8173, pp 116–163, 2014).) You have unpaired data if, for example, you evaluated System 1 with User Group A and System 2 with User Group B; the group sizes may differ. This chapter then discusses the relationship between the aforementioned two two-sample t-tests (Sect. 2.5) and shows how the three t-tests can easily be conducted using Excel (Sect. 2.6) and R (Sect. 2.7). Finally, it describes how confidence intervals for the mean differences can be constructed, based on the assumptions that form the basis of the three t-tests (Sect. 2.8).
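As a concrete illustration of Sects. 2.6–2.8, the following minimal R sketch (hypothetical nDCG scores, not the book's own example) runs the paired t-test and Welch's two-sample t-test with t.test, which also reports 95% confidence intervals.

x <- c(0.42, 0.55, 0.38, 0.61, 0.47)   # System 1, scores on the same five topics
y <- c(0.39, 0.50, 0.41, 0.57, 0.44)   # System 2, scores on the same five topics
t.test(x, y, paired = TRUE)            # paired t-test; reports the CI of the mean difference
a <- c(0.42, 0.55, 0.38, 0.61, 0.47)   # System 1, evaluated with User Group A
b <- c(0.35, 0.48, 0.52, 0.40)         # System 2, evaluated with User Group B (sizes may differ)
t.test(a, b, var.equal = FALSE)        # Welch's two-sample t-test (R's default for unpaired data)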
Tetsuya Sakai
Chapter 3. Analysis of Variance
Abstract
This chapter first describes the following classical analysis of variance (ANOVA) tests for comparing more than two means: one-way ANOVA, a generalised form of the unpaired t-test (Sect. 3.1); two-way ANOVA without replication, a generalised form of the paired t-test (Sect. 3.2); and two-way ANOVA with replication, which considers the interaction between two factors (e.g. topic and system). The first two types of ANOVA are particularly important for IR researchers, since, in laboratory experiments where systems are evaluated using topics, there is usually only one evaluation measure score for a given topic-system pair (unless, for example, the system is considered to be nondeterministic and produces a different search result page every time the same query is entered), so the topic-system interaction cannot be discussed. (Banks et al. (Inf Retr 1:7–34, 1999) applied Tukey’s single-degree-of-freedom test for nonadditivity and Mandel’s bundle-of-lines approach to discuss topic-system interaction given two-way ANOVA without replication data from TREC-3 and reported: “there is a strong interaction between system and topic in terms of average precision. The presence of interaction implies that one cannot find simple descriptions of the data in terms of topics and systems alone.” These tests are beyond the scope of this book.) This chapter then describes how one-way ANOVA and two-way ANOVA without replication can easily be conducted using Excel (Sect. 3.4) and R (Sect. 3.5). (For handling other types of ANOVA with R, we refer readers to Crawley (Statistics: an introduction using R, 2nd edn. Wiley, Chichester, 2015), Chapter 8.) Finally, it describes how a confidence interval for each system can be constructed based on the data from the first two types of ANOVA (Sect. 3.6).
One-way ANOVA is applicable for comparing (say) m systems using m different user groups; moreover, we shall use one-way ANOVA to discuss topic set size design in Chap. 6. Two-way ANOVA without replication is applicable when comparing (say) m systems with the same topic set. However, the reader should be aware that the question addressed with ANOVA is: “are all the population means equal or not?” It does not tell us where the differences lie. If the researcher is interested in the difference between every system pair, then ANOVA is not the right test; instead, consider a multiple comparison procedure such as the randomised Tukey HSD (honestly significant difference) test (see Chap. 4).
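To make Sects. 3.4–3.5 concrete, here is a minimal R sketch (hypothetical scores, not the author's tool) of one-way ANOVA and two-way ANOVA without replication with aov, treating topic as the blocking factor.

score  <- c(0.42, 0.55, 0.38, 0.61,    # System S1 on topics t1-t4
            0.39, 0.50, 0.41, 0.57,    # System S2 on topics t1-t4
            0.45, 0.52, 0.36, 0.59)    # System S3 on topics t1-t4
system <- factor(rep(c("S1", "S2", "S3"), each = 4))
topic  <- factor(rep(paste0("t", 1:4), times = 3))
summary(aov(score ~ system))           # one-way ANOVA: are all population means equal?
summary(aov(score ~ system + topic))   # two-way ANOVA without replication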
Tetsuya Sakai
Chapter 4. Multiple Comparison Procedures
Abstract
This chapter first discusses the familywise error rate problem (Sect. 4.2), which may arise when a researcher applies statistical significance tests multiple times in an experiment. For example, if the researcher has four experimental systems and is interested in comparing every system pair, it is not advisable to conduct a regular t-test six times. This chapter then discusses two approaches to lower the familywise error rate, namely, the widely used but arguably obsolete Bonferroni correction (Sect. 4.3) and the more recommendable Tukey HSD (Honestly Significant Difference) test (Sect. 4.4). While many multiple comparison procedures for suppressing the familywise error rate have been proposed, the above two methods are parametric, single-step methods (Multiple comparison procedures in which the outcome of one hypothesis test determines what to do next are called stepwise methods. In contrast, multiple comparison procedures that can process all hypotheses at the same time are called single-step methods.) that are suitable for comparing every system pair (Nagata and Yoshida, Introduction to multiple comparison procedures (in Japanese). Scientist Press, 1997). However, the reader should be aware that the Bonferroni correction has low statistical power when handling many hypotheses. Finally, we discuss a distribution-free, computer-based version of the latter test, known as the randomised Tukey HSD test (Carterette, ACM TOIS 30(1):1–34, 2012; Sakai, Evaluation with informational and navigational intents. In: Proceedings of WWW 2012, pp 499–508, 2012), for situations where we have a matrix of scores such as a topic-by-run matrix of nDCG values (Sect. 4.5). The paired randomisation test is also discussed as a special case of this test.
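The randomised Tukey HSD test of Sect. 4.5 can be sketched in a few lines of R. The function below is my own reading of the idea, not the author's or Carterette's code: given a topic-by-run matrix m of (say) nDCG scores, it shuffles each topic's scores across runs, records the largest mean difference in each permutation, and compares every observed pairwise difference against that null distribution.

rand_tukey_hsd <- function(m, B = 1000) {
  obs <- outer(colMeans(m), colMeans(m), "-")   # observed pairwise mean differences
  maxdiff <- replicate(B, {
    perm <- t(apply(m, 1, sample))              # permute scores within each topic (row)
    d <- colMeans(perm)
    max(d) - min(d)                             # largest mean difference under the null
  })
  # familywise p-value for each run pair: how often the permuted maximum
  # is at least as large as the observed absolute difference
  apply(obs, c(1, 2), function(x) mean(maxdiff >= abs(x)))
}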
Tetsuya Sakai
Chapter 5. The Correct Ways to Use Significance Tests
Abstract
Statistical significance testing has been under attack for decades. This chapter first discusses the criticisms and limitations of significance testing (Sect. 5.1). It then argues for the importance of effect sizes, which typically represent the magnitude of the difference between systems (Sect. 5.2), and finally proposes how researchers should present their significance test results in technical papers and reports (Sect. 5.3). Reporting individual results effectively means that the research community as a whole can accumulate reproducible pieces of evidence and draw general conclusions from them; if researchers adhere to bad practices, very little will be learnt from one another's work.
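As a small illustration of Sect. 5.2 (hypothetical numbers, and only one of several possible effect sizes), a standardised mean difference for paired data can be computed in R as follows.

d <- c(0.03, 0.05, -0.03, 0.04, 0.03)   # per-topic score differences (System 1 minus System 2)
mean(d) / sd(d)                          # mean difference measured in standard deviation units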
Tetsuya Sakai
Chapter 6. Topic Set Size Design Using Excel
Abstract
This chapter discusses topic set size design, which enables test collection builders to determine the number of topics to create based on statistical requirements. First, an overview of five topic set size design methods is provided (Sect. 6.1), followed by details on each method (Sects. 6.2, 6.3, 6.4, 6.5, and 6.6). These methods are based on a desired statistical power (for the paired t-test, the two-sample t-test, and one-way ANOVA) or on a desired cap on the expected width of the confidence interval of the difference in means for paired and unpaired data. The simple Excel tools that I devised are based on the sample size design techniques described in Nagata Y (How to design the sample size (in Japanese). Asakura Shoten, 2003). As these methods require an estimate of the population within-system variance for a given evaluation measure (or the variance of the score differences in the case of paired data), this chapter then describes how the variance can be estimated from pilot data (Sect. 6.7). Finally, it discusses the relationship among the different topic set size design methods (Sect. 6.8).
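The book's tools are Excel based, but the analogous power-based calculation for the paired t-test can be sketched in R with power.t.test; the minimum detectable difference and the variance estimate below are hypothetical placeholders, not values from the book.

minDt   <- 0.05    # smallest mean difference we want to detect (hypothetical)
sigma_d <- 0.12    # estimated SD of per-topic score differences, e.g. from pilot data (hypothetical)
power.t.test(delta = minDt, sd = sigma_d, sig.level = 0.05,
             power = 0.80, type = "paired")   # solves for the required number of topics n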
Tetsuya Sakai
Chapter 7. Power Analysis Using R
Abstract
This chapter describes how power analysis on published papers can be done using a suite of simple R scripts, so that better-designed experiments can be conducted in the future. Here, “better” means “ensuring appropriate statistical power”. First, an overview of the five R scripts is given (Sect. 7.2), followed by a description of each script (Sects. 7.3, 7.4, 7.5, 7.6, and 7.7). The five scripts, which are for the paired t-test, the two-sample t-test, one-way ANOVA, two-way ANOVA without replication, and two-way ANOVA with replication, respectively, were adapted from the R scripts of Toyoda (Introduction to statistical power analysis: a tutorial with R (in Japanese). Tokyo Tosyo, 2009); his original scripts, which contain Japanese character codes, are available from his book’s website (http://www.tokyo-tosho.co.jp/download/DL02065.zip). Toyoda’s scripts (and therefore mine as well) rely on the R libraries stats and pwr. (The present author is solely responsible for any problems caused by modifying the original scripts of Toyoda.) Finally, the chapter provides a summary while touching upon a survey I conducted with these R scripts over a decade’s worth of IR papers from ACM SIGIR (http://sigir.org/) and TOIS (https://tois.acm.org/) (Sakai, Statistical significance, power, and sample sizes: a systematic review of SIGIR and TOIS. In: Proceedings of ACM SIGIR 2016, pp 5–14, 2016), which demonstrated that the IR literature contains both highly overpowered and highly underpowered experiments. Highly overpowered experiments use far more resources than necessary, while highly underpowered experiments are likely to miss real differences because the samples are too small. We can probably do better by learning from previous studies and/or from pilot studies.
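For readers who want to experiment before obtaining the author's scripts, the pwr library mentioned above already covers the simplest cases. The following minimal sketch (hypothetical effect size and topic set size, not Toyoda's or the author's code) checks the achieved power of a past paired t-test experiment and the topic set size needed for 80% power.

library(pwr)
pwr.t.test(n = 50, d = 0.3, sig.level = 0.05, type = "paired")        # achieved power of a past experiment
pwr.t.test(d = 0.3, sig.level = 0.05, power = 0.80, type = "paired")  # topics needed for 80% power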
Tetsuya Sakai
Chapter 8. Conclusions
Abstract
This chapter first provides a quick summary of the topics covered in this book (Sect. 8.1). It then very briefly touches upon Bayesian approaches to hypothesis testing, not covered in the previous chapters, and concludes the book by proposing a statistical reform in IR (Sect. 8.2).
Tetsuya Sakai
Backmatter
Metadata
Title
Laboratory Experiments in Information Retrieval
Author
Prof. Tetsuya Sakai
Copyright Year
2018
Publisher
Springer Singapore
Electronic ISBN
978-981-13-1199-4
Print ISBN
978-981-13-1198-7
DOI
https://doi.org/10.1007/978-981-13-1199-4
