
2018 | Original Paper | Book Chapter

6. Topic Set Size Design Using Excel

Written by: Tetsuya Sakai

Published in: Laboratory Experiments in Information Retrieval

Publisher: Springer Singapore


Abstract

This chapter discusses topic set size design, which enables test collection builders to determine the number of topics to create based on statistical requirements. First, an overview of five topic set size design methods is provided (Sect. 6.1), followed by details on each method (Sects. 6.2, 6.3, 6.4, 6.5, and 6.6). These methods are based on a desired statistical power (for the paired t-test, the two-sample t-test, and one-way ANOVA) or on a desired cap on the expected width of the confidence interval of the difference in means for paired and unpaired data. The simple Excel tools that I devised are based on the sample size design techniques described by Nagata (How to Design the Sample Size (in Japanese), Asakura Shoten, 2003). As these methods require an estimate of the population within-system variance for a given evaluation measure (or the variance of the score differences in the case of paired data), this chapter then describes how the variance can be estimated from pilot data (Sect. 6.7). Finally, it discusses the relationship across the different topic set size design methods (Sect. 6.8).


Footnotes
1
This chapter relies heavily on Nagata’s formula derivations for sample size design [16], but the book is in Japanese. For discussions in English on sample sizes and power analysis, the reader is referred to Ryan [18], Murphy, Myors, and Wolach [15], and Kraemer and Blasey [14].
 
2
Gilbert and Sparck Jones [11] (page A4) do report on a table that shows the required number of topics as a function of the number of relevant or retrieved documents per topic. For example, if the number of relevant documents per topic is five and we want 5% Type I error probability and 95% statistical power with the sign test, 830 topics are required according to their analysis.
 
3
Precision at document cutoff 10.
 
5
These tools are slightly easier to use than their earlier versions, samplesizeTTEST.xlsx, samplesizeANOVA.xlsx, and samplesizeCI.xlsx, in that there is no need for the user to scroll down the Excel sheet to find the right topic set size anymore.
 
6
The achieved power is computed in Column K, although not shown in Fig. 6.1.
 
7
In Corollary 9, let \(\mu = \mu _{1}-\mu _{2}, \sigma ^2 = \sigma _{1}^2 + \sigma _{2}^2, \mu _{0}=0, \lambda = \lambda _{t}\).
 
8
Recall that with Microsoft Excel, \(z_{inv}(P)\) can be obtained as NORM.S.INV(1 − P).
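For readers working outside Excel, the same quantity can be sketched in Python. This is a minimal illustration that assumes the inverse function denotes the upper-tail point of the standard normal distribution, as the NORM.S.INV(1 − P) expression implies; the function name `z_inv` is chosen here for illustration and is not taken from the chapter.

```python
from statistics import NormalDist


def z_inv(p: float) -> float:
    """Return z such that Pr(Z >= z) = p for a standard normal Z.

    Mirrors the Excel expression NORM.S.INV(1 - P): the lower-tail
    quantile at 1 - p is the upper-tail point at p.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return NormalDist().inv_cdf(1.0 - p)


# Upper 2.5% point, as used for a two-sided test at the 5% level.
print(round(z_inv(0.025), 4))
```

Because the standard normal distribution is symmetric, `z_inv(p)` equals `-z_inv(1 - p)`, which gives a quick sanity check.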
 
9
This table corrects a typo in Table 1 of Sakai [22] for \((\alpha, \beta, min\Delta_{t}) = (0.05, 0.20, 1.0)\), and additionally provides the sample sizes for \(min\Delta_{t} = 1.5, 2.0\).
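To give a feel for how a triple of Type I error probability, Type II error probability, and minimum detectable effect maps to a sample size, the sketch below uses the textbook normal approximation for a two-sided test. It is an assumption-laden illustration: it treats the third parameter as a standardized effect size, the function name `approx_sample_size` is hypothetical, and the approximation is not the chapter's own procedure, so its values need not match the table referred to in this footnote.

```python
import math
from statistics import NormalDist


def approx_sample_size(alpha: float, beta: float, effect: float) -> int:
    """Normal-approximation sample size for a two-sided test.

    alpha:  Type I error probability
    beta:   Type II error probability (power = 1 - beta)
    effect: standardized minimum detectable difference (assumed > 0)
    """
    if effect <= 0:
        raise ValueError("effect must be positive")
    quantile = NormalDist().inv_cdf
    z_alpha = quantile(1.0 - alpha / 2.0)  # two-sided critical value
    z_beta = quantile(1.0 - beta)          # quantile for the target power
    return math.ceil(((z_alpha + z_beta) / effect) ** 2)


for effect in (0.5, 1.0):
    print(effect, approx_sample_size(0.05, 0.20, effect))
```

Halving the standardized effect roughly quadruples the required sample size, which is why the choice of the minimum detectable difference dominates the design.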
 
10
The achieved power is computed in Column K, although not shown in Fig. 6.2.
 
11
An earlier version of this tool, samplesizeANOVA, accommodates only α = 0.01, 0.05 and β = 0.10, 0.20 [22].
 
12
The achieved power is computed in Column I, although not shown in Fig. 6.3.
 
13
Let \(A = \max_{i} a_{i}\) and \(a = \min_{i} a_{i}\). Then \(D^2/2=(A^2+a^2-2Aa)/2 \leq A^2 + a^2 \leq \sum _{i=1}^{m} a_{i}^2\). The equality holds when \(A = D/2\), \(a = -D/2\) and \(a_{i} = 0\) for all other systems.
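The bound in this footnote can be checked numerically. The sketch below assumes D is the range of the values (largest minus smallest), as the expansion of D²/2 in the footnote implies, and that there are at least two systems; the function name `range_bound` and the sample values are illustrative only.

```python
import random


def range_bound(effects):
    """Return (D^2 / 2, sum of squares) for a list of at least two values,
    where D is the difference between the largest and smallest value."""
    if len(effects) < 2:
        raise ValueError("need at least two values")
    d = max(effects) - min(effects)
    return d * d / 2.0, sum(a * a for a in effects)


# Equality case: one value at +D/2, one at -D/2, all others zero.
print(range_bound([0.5, -0.5, 0.0, 0.0]))

# Random check: the first component never exceeds the second.
rng = random.Random(0)
for _ in range(1000):
    values = [rng.uniform(-1.0, 1.0) for _ in range(5)]
    lower, total = range_bound(values)
    assert lower <= total + 1e-12
```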
 
14
Let \(\chi^2\) be a random variable that obeys \(\chi^2(\phi)\). Then \(c^{\ast}\) represents the population mean of the random variable \(\sqrt {\chi ^2/\phi }\). That is, \(E(\sqrt {\chi ^2/\phi })=c^{\ast }\). This is the same \(c^{\ast}\) used in Theorem 11 (Chap. 1 Sect. 1.3.1).
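The expectation \(E(\sqrt{\chi^2/\phi})\) has a standard closed form in terms of the gamma function, namely \(\sqrt{2/\phi}\,\Gamma((\phi+1)/2)/\Gamma(\phi/2)\), which follows from the mean of the chi distribution. The closed form is not stated in this footnote, so the sketch below should be read as an illustration under that assumption; the function name `c_star` is chosen here.

```python
import math


def c_star(phi: int) -> float:
    """Mean of sqrt(chi2 / phi) when chi2 follows a chi-square
    distribution with phi degrees of freedom.

    Uses log-gamma so that large phi does not overflow.
    """
    if phi <= 0:
        raise ValueError("phi must be positive")
    log_ratio = math.lgamma((phi + 1) / 2.0) - math.lgamma(phi / 2.0)
    return math.sqrt(2.0 / phi) * math.exp(log_ratio)


for phi in (1, 2, 10, 100):
    print(phi, c_star(phi))
```

The value is below 1 for every phi and approaches 1 as phi grows, reflecting that the square root of an unbiased variance estimator slightly underestimates the standard deviation on average.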
 
15
Recall Corollary 5 (Chap. 1 Sect. 1.2.4): if \(u = \frac {\bar {x}-\mu }{\sqrt {\sigma ^2/n}} \sim N(0,1^2)\), then \(t = \frac {\bar {x}-\mu }{\sqrt {V/n}} \sim t(n-1)\) where \(E(V) = \sigma^2\). That is, a t-distribution is like the standard normal distribution, except that there is uncertainty about the estimator of \(\sigma^2\), whose accuracy increases with n.
 
16
The covariance of two random variables x and y is defined as COV(x, y) = E((x − E(x))(y − E(y))); note that COV(x, x) = V(x), i.e. the population variance of x (see Chap. 1 Sect. 1.2.1). Now, in general, V(x − y) = V(x) + V(y) − 2COV(x, y) holds. If COV(x, y) = 0, we say that x and y are uncorrelated, and then V(x − y) = V(x) + V(y).
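The variance identity for a difference can be verified on a small paired data set. The scores below are made up for illustration, and the helper names `pop_cov` and `pop_var` are chosen here; the identity also holds exactly for empirical moments, which is what the sketch computes.

```python
import math
from statistics import fmean


def pop_cov(x, y):
    """Population covariance E((x - E(x))(y - E(y))) of paired values."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    mean_x, mean_y = fmean(x), fmean(y)
    return fmean([(a - mean_x) * (b - mean_y) for a, b in zip(x, y)])


def pop_var(x):
    """Population variance, computed as COV(x, x)."""
    return pop_cov(x, x)


# Hypothetical per-topic scores of two systems (illustrative values only).
x = [0.42, 0.10, 0.77, 0.55, 0.31]
y = [0.40, 0.20, 0.60, 0.58, 0.12]
diff = [a - b for a, b in zip(x, y)]

lhs = pop_var(diff)
rhs = pop_var(x) + pop_var(y) - 2.0 * pop_cov(x, y)
print(math.isclose(lhs, rhs, rel_tol=1e-9, abs_tol=1e-12))
```

When the two systems' scores are positively correlated across topics, the covariance term makes the variance of the differences smaller than the sum of the two variances.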
 
18
The high variances of nERR reflect the fact that it is a measure designed primarily for navigational intents. That is, this measure relies heavily on the first retrieved relevant document, while the other measures rely on the other retrieved relevant documents as well.
 
19
Start from the left hand side of Eq. 6.61.
$$\displaystyle \begin{aligned} n_{1}\bar{x}_{1\bullet}^2 + n_{2}\bar{x}_{2\bullet}^2 - 2\bar{x}(n_{1}\bar{x}_{1\bullet} + n_{2}\bar{x}_{2\bullet}) + (n_{1}+n_{2})\bar{x}^2 = n_{1}\bar{x}_{1\bullet}^2 + n_{2}\bar{x}_{2\bullet}^2 - 2N\bar{x}^2 + N\bar{x}^2 \end{aligned}$$
$$\displaystyle \begin{aligned} = n_{1}\bar{x}_{1\bullet}^2 + n_{2}\bar{x}_{2\bullet}^2 - N \frac{ (n_{1}\bar{x}_{1\bullet} + n_{2}\bar{x}_{2\bullet} )^2}{N^2} = n_{1}\bar{x}_{1\bullet}^2 + n_{2}\bar{x}_{2\bullet}^2 - \frac{ n_{1}^2\bar{x}_{1\bullet}^2 + n_{2}^2\bar{x}_{2\bullet}^2 + 2n_{1}n_{2}\bar{x}_{1\bullet}\bar{x}_{2\bullet} }{N} \end{aligned}$$
$$\displaystyle \begin{aligned} = \frac{1}{N} ( (n_{1}+n_{2})n_{1}\bar{x}_{1\bullet}^2 + (n_{1}+n_{2})n_{2}\bar{x}_{2\bullet}^2 - n_{1}^2\bar{x}_{1\bullet}^2 - n_{2}^2\bar{x}_{2\bullet}^2 - 2n_{1}n_{2}\bar{x}_{1\bullet}\bar{x}_{2\bullet}) \end{aligned}$$
$$\displaystyle \begin{aligned} = \frac{1}{N} (n_{1}n_{2}\bar{x}_{1\bullet}^2 + n_{1}n_{2}\bar{x}_{2\bullet}^2 - 2n_{1}n_{2}\bar{x}_{1\bullet}\bar{x}_{2\bullet}) =\frac{n_{1}n_{2}}{N} (\bar{x}_{1\bullet}-\bar{x}_{2\bullet})^2 \ , \end{aligned}$$
which equals the right hand side of Eq. 6.61.
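The algebra in this footnote can be spot-checked numerically. The sketch below evaluates the expression the derivation starts from and the closed form it ends with, for hypothetical group sizes and group means; the function names are illustrative, and the grand mean is taken to be the size-weighted average of the two group means, as the derivation uses.

```python
import math


def expanded_form(n1, mean1, n2, mean2):
    """Starting expression of the derivation, with the grand mean
    defined as the size-weighted average of the two group means."""
    total = n1 + n2
    grand = (n1 * mean1 + n2 * mean2) / total
    return (n1 * mean1 ** 2 + n2 * mean2 ** 2
            - 2.0 * grand * (n1 * mean1 + n2 * mean2)
            + total * grand ** 2)


def closed_form(n1, mean1, n2, mean2):
    """Final expression of the derivation: n1 * n2 / N * (difference)^2."""
    return n1 * n2 / (n1 + n2) * (mean1 - mean2) ** 2


# Hypothetical group sizes and means (illustrative values only).
args = (30, 0.41, 50, 0.36)
print(math.isclose(expanded_form(*args), closed_form(*args),
                   rel_tol=1e-9, abs_tol=1e-12))
```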
 
References
1.
J. Allan, B. Carterette, J.A. Aslam, V. Pavlu, B. Dachev, E. Kanoulas, Million query track 2007 overview, in Proceedings of TREC 2007, Gaithersburg, 2008
2.
J. Allan, J.A. Aslam, B. Carterette, V. Pavlu, E. Kanoulas, Million query track 2008 overview, in Proceedings of TREC 2008, Gaithersburg, 2009
3.
C. Buckley, E.M. Voorhees, Retrieval system evaluation, in TREC: Experiment and Evaluation in Information Retrieval, ed. by E.M. Voorhees, D.K. Harman, chapter 3, pp. 53–75 (The MIT Press, Cambridge, MA, 2005)
4.
C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, G. Hullender, Learning to rank using gradient descent, in Proceedings of ACM ICML, Bonn, 2005, pp. 89–96
5.
B. Carterette, J. Allan, R. Sitaraman, Minimal test collections for retrieval evaluation, in Proceedings of ACM SIGIR, Seattle, 2006, pp. 268–275
6.
B. Carterette, V. Pavlu, E. Kanoulas, J.A. Aslam, J. Allan, Evaluation over thousands of queries, in Proceedings of ACM SIGIR, Singapore, 2008, pp. 651–658
7.
B. Carterette, V. Pavlu, H. Fang, E. Kanoulas, Million query track 2009 overview, in Proceedings of TREC 2009, Gaithersburg, 2010
8.
O. Chapelle, D. Metzler, Y. Zhang, P. Grinspan, Expected reciprocal rank for graded relevance, in Proceedings of ACM CIKM, Hong Kong, 2009, pp. 621–630
9.
C.L.A. Clarke, N. Craswell, I. Soboroff, E.M. Voorhees, Overview of the TREC 2011 web track, in Proceedings of TREC 2011, Gaithersburg, 2012
10.
C.L.A. Clarke, N. Craswell, E.M. Voorhees, Overview of the TREC 2012 web track, in Proceedings of TREC 2012, Gaithersburg, 2013
11.
H. Gilbert, K. Sparck Jones, Statistical bases of relevance assessment for the ‘ideal’ information retrieval test collection. Technical report, Computer Laboratory, University of Cambridge, British Library Research and Development Report No. 5481 (1979)
12.
D.K. Harman, The TREC test collections, in TREC: Experiment and Evaluation in Information Retrieval, ed. by E.M. Voorhees, D.K. Harman, chapter 2 (The MIT Press, Cambridge, MA, 2005)
13.
K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques. ACM TOIS 20(4), 422–446 (2002)
14.
H.C. Kraemer, C. Blasey, How Many Subjects? Statistical Power Analysis in Research, 2nd edn. (SAGE Publications, Los Angeles, 2016)
15.
K.R. Murphy, B. Myors, A. Wolach, Statistical Power Analysis: A Simple and General Model for Traditional and Modern Hypothesis Tests, 4th edn. (Routledge, London, 2014)
16.
Y. Nagata, How to Design the Sample Size (in Japanese) (Asakura Shoten, Shinjuku, 2003)
17.
Y. Nagata, M. Yoshida, Introduction to Multiple Comparison Procedures (in Japanese) (Scientist Press, Shibuya, 1997)
18.
T.P. Ryan, Sample Size Determination and Power (Wiley, Chichester, 2013)
19.
T. Sakai, Ranking the NTCIR systems based on multigrade relevance, in Proceedings of AIRS 2004, Beijing. LNCS 3411, 2004, pp. 251–262
20.
T. Sakai, Evaluating evaluation metrics based on the bootstrap, in Proceedings of ACM SIGIR, Seattle, 2006, pp. 525–532
21.
T. Sakai, Metrics, statistics, tests, in PROMISE Winter School 2013: Bridging between Information Retrieval and Databases (LNCS 8173), 2014, pp. 116–163
22.
23.
T. Sakai, Evaluating evaluation measures with worst-case confidence interval widths, in Proceedings of EVIA, Chiyoda, 2017, pp. 16–19
24.
T. Sakai, How to run an evaluation task, in Information Retrieval Evaluation in a Changing World: Lessons Learned from 20 Years of CLEF, ed. by N. Ferro, C. Peters, chapter 3. (Springer, 2019)
25.
T. Sakai, L. Shang, On estimating variances for topic set size design, in Proceedings of EVIA, Chiyoda, 2016, pp. 9–12
26.
M. Sanderson, J. Zobel, Information retrieval evaluation: effort, sensitivity, and reliability, in Proceedings of ACM SIGIR, Salvador, 2005, pp. 162–169
27.
K. Sparck Jones, C.J. van Rijsbergen, Report on the need for and provision of an ‘ideal’ information retrieval test collection. Technical report, Computer Laboratory, University of Cambridge, British Library Research and Development Report No. 5266, 1975
28.
K. Sparck Jones, R.G. Bates, Report on a design study for the ‘ideal’ information retrieval test collection. Technical report, Computer Laboratory, University of Cambridge, British Library Research and Development Report No. 5481, 1977
29.
E.M. Voorhees, Overview of the TREC 2003 robust retrieval track, in Proceedings of TREC 2003, Gaithersburg, 2004
30.
E.M. Voorhees, Overview of the TREC 2004 robust retrieval track, in Proceedings of TREC 2004, Gaithersburg, 2005
31.
E.M. Voorhees, Topic set size redux, in Proceedings of ACM SIGIR, Boston, 2009, pp. 806–807
32.
E.M. Voorhees, C. Buckley, The effect of topic set sizes on retrieval experiment error, in Proceedings of ACM SIGIR, Tampere, 2002, pp. 162–169
33.
W. Webber, A. Moffat, J. Zobel, Statistical power in retrieval experimentation, in Proceedings of ACM CIKM, Napa Valley, 2008, pp. 571–580
34.
J. Zobel, How reliable are the results of large-scale information retrieval experiments? in Proceedings of ACM SIGIR, Melbourne, 1998, pp. 307–314
Metadata
Title
Topic Set Size Design Using Excel
Written by
Tetsuya Sakai
Copyright year
2018
Publisher
Springer Singapore
DOI
https://doi.org/10.1007/978-981-13-1199-4_6
