2018 | OriginalPaper | Book chapter

6. Topic Set Size Design Using Excel

Written by: Tetsuya Sakai

Published in: Laboratory Experiments in Information Retrieval

Publisher: Springer Singapore

Abstract

This chapter discusses topic set size design, which enables test collection builders to determine the number of topics to create based on statistical requirements. First, an overview of five topic set size design methods is provided (Sect. 6.1), followed by details on each method (Sects. 6.2–6.6). These methods are based either on a desired statistical power (for the paired t-test, the two-sample t-test, and one-way ANOVA) or on a desired cap on the expected width of the confidence interval of the difference in means for paired and unpaired data. The simple Excel tools that I devised are based on the sample size design techniques described in Nagata's book How to Design the Sample Size (in Japanese; Asakura Shoten, 2003). As these methods require an estimate of the population within-system variance for a given evaluation measure (or the variance of the score differences in the case of paired data), this chapter then describes how the variance can be estimated from pilot data (Sect. 6.7). Finally, it discusses the relationships among the different topic set size design methods (Sect. 6.8).

This chapter relies heavily on Nagata's formula derivations for sample size design [16], but that book is in Japanese. For discussions in English of sample sizes and power analysis, the reader is referred to Ryan [18], Murphy, Myors, and Wolach [15], and Kraemer and Blasey [14].

Gilbert and Sparck Jones [11] (page A4) do report on a table that shows the required number of topics as a function of the number of relevant or retrieved documents per topic. For example, if the number of relevant documents per topic is five and we want 5% Type I error probability and 95% statistical power with the sign test, 830 topics are required according to their analysis.

Precision at document cutoff 10.

http://www.ccs.neu.edu/home/jaa/papers/drafts/statAP.pdf

These tools are slightly easier to use than their earlier versions, samplesizeTTEST.xlsx, samplesizeANOVA.xlsx, and samplesizeCI.xlsx, in that there is no need for the user to scroll down the Excel sheet to find the right topic set size anymore.

The achieved power is computed in Column K, although not shown in Fig. 6.1.

In Corollary 9, let $\mu = \mu _{1}-\mu _{2}, \sigma ^2 = \sigma _{1}^2 + \sigma _{2}^2, \mu _{0}=0, \lambda = \lambda _{t}$.

Recall that with Microsoft Excel, $z_{inv}(P)$ can be obtained as NORM.S.INV(1 − P).
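The same relation can be mirrored outside Excel as a quick sanity check. The following is a minimal sketch, assuming that z_inv(P) denotes the upper 100P% point of the standard normal distribution, which is what the NORM.S.INV(1 − P) form implies. The Python function name `z_inv` is only borrowed from the notation and is not part of the chapter's tools.

```python
from statistics import NormalDist


def z_inv(p: float) -> float:
    """Upper 100p% point of N(0, 1): the value z with Pr(Z >= z) = p.

    Mirrors the Excel expression NORM.S.INV(1 - p).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return NormalDist().inv_cdf(1.0 - p)


if __name__ == "__main__":
    # Typical probabilities used in power-based sample size design.
    for p in (0.025, 0.05, 0.10, 0.20):
        print(f"z_inv({p}) = {z_inv(p):.4f}")
```

For example, `z_inv(0.025)` gives the familiar two-sided 5% critical value of roughly 1.96.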

This table corrects a typo in Table 1 of Sakai [22] for (α, β, minΔ_t) = (0.05, 0.20, 1.0), and additionally provides the sample sizes for minΔ_t = 1.5 and 2.0.

The achieved power is computed in Column K, although not shown in Fig. 6.2.

An earlier version of this tool, samplesizeANOVA, accommodates only α = 0.01, 0.05 and β = 0.10, 0.20 [22].

The achieved power is computed in Column I, although not shown in Fig. 6.3.

Let $A = \max_i a_i$ and $a = \min_i a_i$, so that $D = A - a$. Then $D^2/2=(A^2+a^2-2Aa)/2 \leq A^2 + a^2 \leq \sum _{i=1}^{m} a_{i}^2$. Equality holds when $A = D/2$, $a = -D/2$, and $a_i = 0$ for all other systems.
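The bound above can be checked numerically. The sketch below is an illustration only: the function names are hypothetical, the effects are arbitrary made-up numbers, and D is taken to be the range A − a, as the expansion (A² + a² − 2Aa)/2 implies.

```python
import random


def half_squared_range(effects):
    """Return D^2 / 2, where D = max(effects) - min(effects)."""
    d = max(effects) - min(effects)
    return d * d / 2.0


def sum_of_squares(effects):
    """Return the sum of a_i^2 over all systems."""
    return sum(a * a for a in effects)


if __name__ == "__main__":
    rng = random.Random(0)
    # Random effects for m = 5 hypothetical systems: the bound should always hold.
    for _ in range(3):
        effects = [rng.uniform(-1.0, 1.0) for _ in range(5)]
        print(half_squared_range(effects) <= sum_of_squares(effects))

    # Equality case from the text: A = D/2, a = -D/2, all other effects zero.
    d = 1.0
    tight = [d / 2, -d / 2, 0.0, 0.0]
    print(half_squared_range(tight), sum_of_squares(tight))
```

In the equality case, all of the sum of squares is concentrated in the two extreme systems, which is why that configuration attains the lower bound on the sum of squared effects for a given range D.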

Let $\chi^2$ be a random variable that obeys $\chi^2(\phi)$. Then $c^{\ast}$ represents the population mean of the random variable $\sqrt {\chi ^2/\phi }$. That is, $E(\sqrt {\chi ^2/\phi })=c^{\ast }$. This is the same $c^{\ast}$ used in Theorem 11 (Chap. 1 Sect. 1.3.1).
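To make the definition of c* concrete, the sketch below evaluates it in two ways. The chapter's own expression for c* is not shown in this excerpt, so the closed form used here is an assumption: it is the standard formula for the mean of a chi distribution, $\sqrt{2/\phi}\,\Gamma((\phi+1)/2)/\Gamma(\phi/2)$. The Monte Carlo estimate follows the definition directly, drawing each χ² value as a sum of ϕ squared standard normal variates. Both function names are hypothetical.

```python
import math
import random


def c_star(phi: int) -> float:
    """Closed-form E[sqrt(chi2 / phi)] for chi2 ~ chi-square(phi).

    Uses log-gamma so that large degrees of freedom do not overflow.
    """
    if phi <= 0:
        raise ValueError("phi must be a positive integer")
    log_ratio = math.lgamma((phi + 1) / 2.0) - math.lgamma(phi / 2.0)
    return math.sqrt(2.0 / phi) * math.exp(log_ratio)


def c_star_monte_carlo(phi: int, trials: int = 20000, seed: int = 0) -> float:
    """Simulation estimate of the same mean, straight from the definition."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(phi))
        total += math.sqrt(chi2 / phi)
    return total / trials


if __name__ == "__main__":
    for phi in (1, 5, 20, 100):
        print(phi, round(c_star(phi), 5), round(c_star_monte_carlo(phi), 5))
```

Under this closed form, c* is below 1 for every ϕ and approaches 1 as ϕ grows.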

Recall Corollary 5 (Chap. 1 Sect. 1.2.4): if $u = \frac {\bar {x}-\mu }{\sqrt {\sigma ^2/n}} \sim N(0,1^2)$, then $t = \frac {\bar {x}-\mu }{\sqrt {V/n}} \sim t(n-1)$ where $E(V) = \sigma^2$. That is, a t-distribution is like the standard normal distribution, except that there is an uncertainty about the estimator of $\sigma^2$, whose accuracy increases with n.

The covariance of two random variables x and y is defined as COV(x, y) = E((x − E(x))(y − E(y))); note that COV(x, x) = V(x), i.e. the population variance of x (see Chap. 1 Sect. 1.2.1). In general, V(x − y) = V(x) + V(y) − 2COV(x, y) holds. If COV(x, y) = 0, we say that x and y are uncorrelated, in which case V(x − y) = V(x) + V(y).
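The variance identity for a difference can be verified on a small example. This is a minimal sketch that treats a finite list of paired scores as the whole population, so expectations become plain averages. The scores are made-up illustrative numbers and the helper names are hypothetical.

```python
def mean(values):
    """Population mean of a finite list."""
    return sum(values) / len(values)


def cov(x, y):
    """Population covariance E((x - E(x)) (y - E(y))) for paired lists."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def var(x):
    """Population variance, using COV(x, x) = V(x)."""
    return cov(x, x)


if __name__ == "__main__":
    # Hypothetical per-topic scores for two systems on the same five topics.
    x = [0.42, 0.55, 0.31, 0.78, 0.60]
    y = [0.40, 0.50, 0.35, 0.70, 0.52]
    diff = [a - b for a, b in zip(x, y)]

    lhs = var(diff)
    rhs = var(x) + var(y) - 2.0 * cov(x, y)
    print(lhs, rhs)  # the two values agree up to floating-point rounding
```

With paired data such as these, the covariance is typically positive, so the variance of the differences is smaller than V(x) + V(y).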

http://research.nii.ac.jp/ntcir/index-en.html

The high variances of nERR reflect the fact that it is a measure designed primarily for navigational intents. That is, this measure relies heavily on the first retrieved relevant document, while the other measures rely on the other retrieved relevant documents as well.

Start from the left hand side of Eq. 6.61.

$$\displaystyle \begin{aligned} n_{1}\bar{x}_{1\bullet}^2 + n_{2}\bar{x}_{2\bullet}^2 - 2\bar{x}(n_{1}\bar{x}_{1\bullet} + n_{2}\bar{x}_{2\bullet}) + (n_{1}+n_{2})\bar{x}^2 = n_{1}\bar{x}_{1\bullet}^2 + n_{2}\bar{x}_{2\bullet}^2 - 2N\bar{x}^2 + N\bar{x}^2 \end{aligned}$$

$$\displaystyle \begin{aligned} = n_{1}\bar{x}_{1\bullet}^2 + n_{2}\bar{x}_{2\bullet}^2 - N \frac{ (n_{1}\bar{x}_{1\bullet} + n_{2}\bar{x}_{2\bullet} )^2}{N^2} = n_{1}\bar{x}_{1\bullet}^2 + n_{2}\bar{x}_{2\bullet}^2 - \frac{ n_{1}^2\bar{x}_{1\bullet}^2 + n_{2}^2\bar{x}_{2\bullet}^2 + 2n_{1}n_{2}\bar{x}_{1\bullet}\bar{x}_{2\bullet} }{N} \end{aligned}$$

$$\displaystyle \begin{aligned} = \frac{1}{N} ( (n_{1}+n_{2})n_{1}\bar{x}_{1\bullet}^2 + (n_{1}+n_{2})n_{2}\bar{x}_{2\bullet}^2 - n_{1}^2\bar{x}_{1\bullet}^2 - n_{2}^2\bar{x}_{2\bullet}^2 - 2n_{1}n_{2}\bar{x}_{1\bullet}\bar{x}_{2\bullet}) \end{aligned}$$

$$\displaystyle \begin{aligned} = \frac{1}{N} (n_{1}n_{2}\bar{x}_{1\bullet}^2 + n_{1}n_{2}\bar{x}_{2\bullet}^2 - 2n_{1}n_{2}\bar{x}_{1\bullet}\bar{x}_{2\bullet}) =\frac{n_{1}n_{2}}{N} (\bar{x}_{1\bullet}-\bar{x}_{2\bullet})^2 \ , \end{aligned}$$

which equals the right hand side of Eq. 6.61.
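The derivation above can be spot-checked numerically. Eq. 6.61 itself is not reproduced in this excerpt, so the sketch assumes its left-hand side is the between-group sum of squares $n_{1}(\bar{x}_{1\bullet}-\bar{x})^2 + n_{2}(\bar{x}_{2\bullet}-\bar{x})^2$, which is what the first line of the derivation expands, with $N = n_1 + n_2$ and $\bar{x}$ the grand mean. The function names and data are hypothetical.

```python
def between_group_ss(group1, group2):
    """n1 (m1 - grand)^2 + n2 (m2 - grand)^2, with grand the overall mean."""
    n1, n2 = len(group1), len(group2)
    m1 = sum(group1) / n1
    m2 = sum(group2) / n2
    grand = (sum(group1) + sum(group2)) / (n1 + n2)
    return n1 * (m1 - grand) ** 2 + n2 * (m2 - grand) ** 2


def closed_form(group1, group2):
    """(n1 n2 / N) (m1 - m2)^2, the final expression of the derivation."""
    n1, n2 = len(group1), len(group2)
    m1 = sum(group1) / n1
    m2 = sum(group2) / n2
    return n1 * n2 / (n1 + n2) * (m1 - m2) ** 2


if __name__ == "__main__":
    # Made-up scores for two systems with unequal topic set sizes.
    g1 = [0.30, 0.45, 0.52, 0.61]
    g2 = [0.25, 0.33, 0.48, 0.50, 0.58, 0.40]
    print(between_group_ss(g1, g2), closed_form(g1, g2))
```

The two printed values should coincide up to rounding for any pair of non-empty groups, including groups of unequal size.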

1. J. Allan, B. Carterette, J.A. Aslam, V. Pavlu, B. Dachev, E. Kanoulas, Million query track 2007 overview, in Proceedings of TREC 2007, Gaithersburg, 2008

2. J. Allan, J.A. Aslam, B. Carterette, V. Pavlu, E. Kanoulas, Million query track 2008 overview, in Proceedings of TREC 2008, Gaithersburg, 2009

3. C. Buckley, E.M. Voorhees, Retrieval system evaluation, in TREC: Experiment and Evaluation in Information Retrieval, ed. by E.M. Voorhees, D.K. Harman, chapter 3, pp. 53–75 (The MIT Press, Cambridge, MA, 2005)

4. C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, G. Hullender, Learning to rank using gradient descent, in Proceedings of ACM ICML, Bonn, 2005, pp. 89–96

5. B. Carterette, J. Allan, R. Sitaraman, Minimal test collections for retrieval evaluation, in Proceedings of ACM SIGIR, Seattle, 2006, pp. 268–275

6. B. Carterette, V. Pavlu, E. Kanoulas, J.A. Aslam, J. Allan, Evaluation over thousands of queries, in Proceedings of ACM SIGIR, Singapore, 2008, pp. 651–658

7. B. Carterette, V. Pavlu, H. Fang, E. Kanoulas, Million query track 2009 overview, in Proceedings of TREC 2009, Gaithersburg, 2010

8. O. Chapelle, D. Metzler, Y. Zhang, P. Grinspan, Expected reciprocal rank for graded relevance, in Proceedings of ACM CIKM, Hong Kong, 2009, pp. 621–630

9. C.L.A. Clarke, N. Craswell, I. Soboroff, E.M. Voorhees, Overview of the TREC 2011 web track, in Proceedings of TREC 2011, Gaithersburg, 2012

10. C.L.A. Clarke, N. Craswell, E.M. Voorhees, Overview of the TREC 2012 web track, in Proceedings of TREC 2012, Gaithersburg, 2013

11. H. Gilbert, K. Sparck Jones, Statistical bases of relevance assessment for the ‘ideal’ information retrieval test collection. Technical report, Computer Laboratory, University of Cambridge, British Library Research and Development Report No. 5481 (1979)

12. D.K. Harman, The TREC test collections, in TREC: Experiment and Evaluation in Information Retrieval, ed. by E.M. Voorhees, D.K. Harman, chapter 2 (The MIT Press, Cambridge, MA, 2005)

13. K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques. ACM TOIS 20(4), 422–446 (2002)

14. H.C. Kraemer, C. Blasey, How Many Subjects? Statistical Power Analysis in Research, 2nd edn. (SAGE Publications, Los Angeles, 2016)

15. K.R. Murphy, B. Myors, A. Wolach, Statistical Power Analysis: A Simple and General Model for Traditional and Modern Hypothesis Tests, 4th edn. (Routledge, London, 2014)

16. Y. Nagata, How to Design the Sample Size (in Japanese) (Asakura Shoten, Shinjuku, 2003)

17. Y. Nagata, M. Yoshida, Introduction to Multiple Comparison Procedures (in Japanese) (Scientist Press, Shibuya, 1997)

18. T.P. Ryan, Sample Size Determination and Power (Wiley, Chichester, 2013)

19. T. Sakai, Ranking the NTCIR systems based on multigrade relevance, in Proceedings of AIRS 2004, Beijing. LNCS 3411, 2004, pp. 251–262

20. T. Sakai, Evaluating evaluation metrics based on the bootstrap, in Proceedings of ACM SIGIR, Seattle, 2006, pp. 525–532

21. T. Sakai, Metrics, statistics, tests, in PROMISE Winter School 2013: Bridging between Information Retrieval and Databases (LNCS 8173), 2014, pp. 116–163

22. T. Sakai, Topic set size design. Inf. Retr. 19(3), 256–283 (2016)

23. T. Sakai, Evaluating evaluation measures with worst-case confidence interval widths, in Proceedings of EVIA, Chiyoda, 2017, pp. 16–19

24. T. Sakai, How to run an evaluation task, in Information Retrieval Evaluation in a Changing World: Lessons Learned from 20 Years of CLEF, ed. by N. Ferro, C. Peters, chapter 3 (Springer, 2019)

25. T. Sakai, L. Shang, On estimating variances for topic set size design, in Proceedings of EVIA, Chiyoda, 2016, pp. 9–12

26. M. Sanderson, J. Zobel, Information retrieval evaluation: effort, sensitivity, and reliability, in Proceedings of ACM SIGIR, Salvador, 2005, pp. 162–169

27. K. Sparck Jones, C.J. van Rijsbergen, Report on the need for and provision of an ‘ideal’ information retrieval test collection. Technical report, Computer Laboratory, University of Cambridge, British Library Research and Development Report No. 5266, 1975

28. K. Sparck Jones, R.G. Bates, Report on a design study for the ‘ideal’ information retrieval test collection. Technical report, Computer Laboratory, University of Cambridge, British Library Research and Development Report No. 5481, 1977

29. E.M. Voorhees, Overview of the TREC 2003 robust retrieval track, in Proceedings of TREC 2003, Gaithersburg, 2004

30. E.M. Voorhees, Overview of the TREC 2004 robust retrieval track, in Proceedings of TREC 2004, Gaithersburg, 2005

31. E.M. Voorhees, Topic set size redux, in Proceedings of ACM SIGIR, Boston, 2009, pp. 806–807

32. E.M. Voorhees, C. Buckley, The effect of topic set sizes on retrieval experiment error, in Proceedings of ACM SIGIR, Tampere, 2002, pp. 162–169

33. W. Webber, A. Moffat, J. Zobel, Statistical power in retrieval experimentation, in Proceedings of ACM CIKM, Napa Valley, 2008, pp. 571–580

34. J. Zobel, How reliable are the results of large-scale information retrieval experiments? in Proceedings of ACM SIGIR, Melbourne, 1998, pp. 307–314

Title: Topic Set Size Design Using Excel
Written by: Tetsuya Sakai
Publisher: Springer Singapore
Book: Laboratory Experiments in Information Retrieval
Print ISBN: 978-981-13-1198-7
Electronic ISBN: 978-981-13-1199-4
Copyright year: 2018
DOI: https://doi.org/10.1007/978-981-13-1199-4_6
