Published in: Discover Computing | Issue 1/2020

08.05.2019

Fewer topics? A million topics? Both?! On topics subsets in test collections

Authors: Kevin Roitero, J. Shane Culpepper, Mark Sanderson, Falk Scholer, Stefano Mizzaro


Abstract

When evaluating IR run effectiveness using a test collection, a key question is: What search topics should be used? We explore what happens to measurement accuracy when the number of topics in a test collection is reduced, using the Million Query 2007, TeraByte 2006, and Robust 2004 TREC collections, which all feature more than 50 topics, something that has not been examined in past work. Our analysis finds that a subset of topics can be found that is as accurate as the full topic set at ranking runs. Further, we show that the size of the subset, relative to the full topic set, can be substantially smaller than was shown in past work. We also study the topic subsets in the context of the power of statistical significance tests. We find that there is a trade-off when using such sets: significant results may be missed, but the loss of statistical significance is much smaller than when selecting random subsets. We also find topic subsets that can result in a low-accuracy test collection, even when the number of queries in the subset is quite large. These negatively correlated subsets suggest that we still lack good methodologies that provide stability guarantees on topic selection in new collections. Finally, we examine whether clustering of topics is an appropriate strategy to find and characterize good topic subsets. Our results contribute to the understanding of information retrieval effectiveness evaluation, and offer insights for the construction of test collections.
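The accuracy of a topic subset at ranking runs is naturally quantified by a rank correlation, such as Kendall's tau, between the run ordering induced by the subset and the ordering induced by the full topic set. The following minimal Python sketch illustrates that core measurement; the random score matrix, its dimensions, and the `subset_correlation` helper are illustrative assumptions, not the authors' code or data.

```python
# Minimal sketch: how well does a topic subset rank runs compared to the
# full topic set? Assumes a hypothetical runs-by-topics matrix of per-topic
# effectiveness values (e.g., AP), one row per run.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
n_runs, n_topics = 30, 150                 # illustrative sizes only
scores = rng.random((n_runs, n_topics))    # stand-in for real evaluation data

def subset_correlation(scores, topic_ids):
    """Kendall's tau between the run ranking induced by a topic subset
    and the run ranking induced by the full topic set."""
    full_ranking = scores.mean(axis=1)                    # MAP over all topics
    subset_ranking = scores[:, list(topic_ids)].mean(axis=1)  # MAP over the subset
    tau, _ = kendalltau(full_ranking, subset_ranking)
    return tau

print(subset_correlation(scores, topic_ids=range(10)))
```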


Footnotes
1
Guiver et al. (2009) use the terminology Best/Average/Worst, and we adopt it in this paper in order to be consistent with past work.
 
2
It is important to remark that this line of research focuses on an a posteriori, i.e., after-evaluation setting: it is not aimed at predicting in advance a good topic subset, but only at determining if such a subset exists.
 
3
Consistent with this line of research (see Footnote 2), we investigate clustering of topics in an a posteriori setting; that is, we study an after-evaluation characterization of Best topic subsets, but do not aim to provide a methodology for finding such subsets in practice.
 
4
The effect of statMAP, on which we focus in this paper, is discussed in more detail in Sect. 3.3.
 
5
Note that several versions of statMAP exist; we used statAP_MQ_eval_v3.pl: http://trec.nist.gov/data/million.query07.html.
 
6
We use the suffix B/A/W to indicate the correlation curve for the best/average/worst topic set.
 
7
Note that the overlap we find might be an effect of the heuristic used; we can say no more than that it is possible to build a Best and a Worst set of topics with a high overlap.
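For illustration only, the sketch below shows one plausible greedy heuristic for growing a Best-style topic subset of the kind that produces the B curves mentioned in Footnote 6. It is not necessarily the heuristic actually used in the paper, and the `subset_correlation` helper mirrors the assumed one sketched after the abstract.

```python
# A generic greedy heuristic (not necessarily the paper's): at each
# cardinality, add the topic that most improves agreement with the
# full-topic-set run ranking.
import numpy as np
from scipy.stats import kendalltau

def subset_correlation(scores, topic_ids):
    # Kendall's tau between run rankings from a topic subset and from all topics.
    full = scores.mean(axis=1)
    sub = scores[:, list(topic_ids)].mean(axis=1)
    return kendalltau(full, sub)[0]

def greedy_best_subsets(scores, max_size):
    # Returns the selected topics and the correlation at each cardinality
    # (an approximation of a "B" curve).
    selected, curve = [], []
    remaining = set(range(scores.shape[1]))
    for _ in range(max_size):
        best = max(remaining, key=lambda t: subset_correlation(scores, selected + [t]))
        selected.append(best)
        remaining.remove(best)
        curve.append(subset_correlation(scores, selected))
    return selected, curve
```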
 
8
Here, for reasons of computational speed, only a single random topic subset is drawn from the set of all topic subsets of a given cardinality. The histograms for the random series are consequently more “spiky” than if we had averaged several random subsets. However, the broad signal of the result is still visible in the plots.
 
9
We tried up to 1 million repetitions, but the series are already stable with 1000 repetitions.
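A hedged sketch of the random baseline described in Footnotes 8 and 9: draw random topic subsets of a given cardinality, repeat, and average the resulting correlations. The function name and defaults are illustrative; only the figure of 1000 repetitions comes from the footnote, and `scores` is again an assumed runs-by-topics matrix.

```python
# Average correlation of repeated random topic subsets of a fixed size
# against the run ranking produced by the full topic set.
import numpy as np
from scipy.stats import kendalltau

def mean_random_tau(scores, cardinality, repetitions=1000, seed=0):
    rng = np.random.default_rng(seed)
    full = scores.mean(axis=1)            # ranking from all topics
    taus = []
    for _ in range(repetitions):
        ids = rng.choice(scores.shape[1], size=cardinality, replace=False)
        taus.append(kendalltau(full, scores[:, ids].mean(axis=1))[0])
    return float(np.mean(taus))
```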
 
11
For an exhaustive list, see the R package “proxy” (https://cran.r-project.org/web/packages/proxy/proxy.pdf) and the “Distance computations” section of the SciPy documentation for Python 3 (https://docs.scipy.org/doc/scipy/reference/spatial.distance.html).
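As an illustration of the distance computations referenced above, one might represent each topic by its vector of per-run scores and cluster topics hierarchically. The `cluster_topics` helper, the cosine metric, and average linkage are assumptions made for this sketch, not the paper's exact configuration.

```python
# Pairwise topic distances and a hierarchical clustering cut into k groups.
# `scores` is an assumed runs-by-topics matrix; each topic is the column of
# per-run effectiveness values.
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_topics(scores, n_clusters=5, metric="cosine"):
    dist = pdist(scores.T, metric=metric)            # condensed distance matrix
    tree = linkage(dist, method="average")           # agglomerative clustering
    return fcluster(tree, t=n_clusters, criterion="maxclust")
```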
 
References
Allan, J., Carterette, B., Aslam, J. A., Pavlu, V., Dachev, B., & Kanoulas, E. (2007). Million query track 2007 overview. In Proceedings of TREC.
Bartlett, J. E., Kotrlik, J. W., & Higgins, C. C. (2001). Organizational research: Determining appropriate sample size in survey research. Information Technology, Learning, and Performance Journal, 19(1), 43–50.
Berto, A., Mizzaro, S., & Robertson, S. (2013). On using fewer topics in information retrieval evaluations. In Proceedings of the ICTIR, (p. 9).
Bodoff, D., & Li, P. (2007). Test theory for assessing IR test collections. In Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval, (pp. 367–374). New York: ACM.
Buckley, C., & Voorhees, E. (2000). Evaluating evaluation measure stability. In Proceedings of the 23rd SIGIR, (pp. 33–40).
Carterette, B., Allan, J., & Sitaraman, R. (2006). Minimal test collections for retrieval evaluation. In Proceedings of the 29th SIGIR, (pp. 268–275).
Carterette, B., Pavlu, V., Fang, H., & Kanoulas, E. (2009a). Million query track 2009 overview. In Proceedings of TREC.
Carterette, B., Pavlu, V., Kanoulas, E., Aslam, J. A., & Allan, J. (2009b). If I had a million queries. In Proceedings of the 31st ECIR, ECIR ’09, (pp. 288–300).
Carterette, B., & Smucker, M. D. (2007). Hypothesis testing with incomplete relevance judgments. In Proceedings of the sixteenth ACM conference on information and knowledge management, CIKM ’07, (pp. 643–652). New York: ACM. https://doi.org/10.1145/1321440.1321530.
Carterette, B. A. (2012). Multiple testing in statistical analysis of systems-based information retrieval experiments. ACM Transactions on Information Systems (TOIS), 30(1), 4.
Cattelan, M., & Mizzaro, S. (2009). IR evaluation without a common set of topics. In Proceedings of the ICTIR, (pp. 342–345).
Feise, R. (2002). Do multiple outcome measures require p-value adjustment? BMC Medical Research Methodology, 2, 8.
Guiver, J., Mizzaro, S., & Robertson, S. (2009). A few good topics: Experiments in topic set reduction for retrieval evaluation. ACM Transactions on Information Systems, 27(4), 21.
Hauff, C., Hiemstra, D., Azzopardi, L., & de Jong, F. (2010). A case for automatic system evaluation. In Proceedings of the ECIR, (pp. 153–165).
Hauff, C., Hiemstra, D., de Jong, F., & Azzopardi, L. (2009). Relying on topic subsets for system ranking estimation. In Proceedings of the 18th CIKM, (pp. 1859–1862).
Hosseini, M., Cox, I. J., Milic-Frayling, N., Shokouhi, M., & Yilmaz, E. (2012). An uncertainty-aware query selection model for evaluation of IR systems. In Proceedings of the 35th international ACM SIGIR conference on research and development in information retrieval, SIGIR ’12, (pp. 901–910). New York, NY, USA: ACM. https://doi.org/10.1145/2348283.2348403.
Hosseini, M., Cox, I. J., Milic-Frayling, N., Sweeting, T., & Vinay, V. (2011a). Prioritizing relevance judgments to improve the construction of IR test collections. In Proceedings of the 20th CIKM 2011, (pp. 641–646).
Hosseini, M., Cox, I. J., Milic-Frayling, N., Vinay, V., & Sweeting, T. (2011b). Selecting a subset of queries for acquisition of further relevance judgements. In Proceedings of the ICTIR, LNCS 6931, (pp. 113–124).
Mehrotra, R., & Yilmaz, E. (2015). Representative & informative query selection for learning to rank using submodular functions. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval, SIGIR ’15, (pp. 545–554). New York, NY, USA: ACM. https://doi.org/10.1145/2766462.2767753.
Mizzaro, S., & Robertson, S. (2007). HITS hits TREC—Exploring IR evaluation results with network analysis. In Proceedings of the 30th SIGIR, (pp. 479–486).
Moffat, A., Scholer, F., & Thomas, P. (2012). Models and metrics: IR evaluation as a user process. In Proceedings of the Australasian document computing symposium, Dunedin, New Zealand, (pp. 47–54).
Pavlu, V., & Aslam, J. (2007). A practical sampling strategy for efficient retrieval evaluation. Technical report, College of Computer and Information Science, Northeastern University.
Rajaraman, A., & Ullman, J. D. (2011). Mining of massive datasets (1st ed.). Cambridge: Cambridge University Press.
Robertson, S. (2011). On the contributions of topics to system evaluation. In Proceedings of the ECIR, LNCS 6611, (pp. 129–140).
Roitero, K., Maddalena, E., & Mizzaro, S. (2017). Do easy topics predict effectiveness better than difficult topics? In J. M. Jose, C. Hauff, I. S. Altıngovde, D. Song, D. Albakour, S. Watt, & J. Tait (Eds.), Advances in information retrieval (pp. 605–611). Cham: Springer International Publishing.
Roitero, K., Soprano, M., Brunello, A., & Mizzaro, S. (2018a). Reproduce and improve: An evolutionary approach to select a few good topics for information retrieval evaluation. ACM Journal of Data and Information Quality, 10(3), 12:1–12:21. https://doi.org/10.1145/3239573.
Roitero, K., Soprano, M., & Mizzaro, S. (2018b). Effectiveness evaluation with a subset of topics: A practical approach. In The 41st international ACM SIGIR conference on research and development in information retrieval, SIGIR ’18, (pp. 1145–1148). New York, NY, USA: ACM. https://doi.org/10.1145/3209978.3210108.
Rose, D. E., & Levinson, D. (2004). Understanding user goals in web search. In Proceedings of the 13th international conference on World Wide Web, (pp. 13–19). New York, NY: ACM Press.
Sakai, T. (2014). Designing test collections for comparing many systems. In Proceedings of the 23rd CIKM 2014, (pp. 61–70).
Sakai, T. (2016a). Statistical significance, power, and sample sizes: A systematic review of SIGIR and TOIS, 2006–2015. In Proceedings of the 39th SIGIR, (pp. 5–14). ACM.
Sakai, T. (2016b). Topic set size design. Information Retrieval Journal, 19(3), 256–283.
Sanderson, M., & Soboroff, I. (2007). Problems with Kendall’s Tau. In Proceedings of the 30th SIGIR, (pp. 839–840).
Sanderson, M., & Zobel, J. (2005). Information retrieval system evaluation: Effort, sensitivity, and reliability. In Proceedings of the 28th SIGIR, (pp. 162–169).
Sheskin, D. (2007). Handbook of parametric and nonparametric statistical procedures (4th ed.). Boca Raton: CRC Press.
Urbano, J., Marrero, M., & Martín, D. (2013). On the measurement of test collection reliability. In Proceedings of the 36th SIGIR, (pp. 393–402).
Urbano, J., & Nagler, T. (2018). Stochastic simulation of test collections: Evaluation scores. In The 41st international ACM SIGIR conference on research & development in information retrieval, SIGIR ’18, (pp. 695–704). New York, NY, USA: ACM. https://doi.org/10.1145/3209978.3210043.
Voorhees, E., & Buckley, C. (2002). The effect of topic set size on retrieval experiment error. In Proceedings of the 25th SIGIR, (pp. 316–323).
Webber, W., Moffat, A., & Zobel, J. (2008). Statistical power in retrieval experimentation. In Proceedings of the 17th CIKM, (pp. 571–580).
Metadata
Title
Fewer topics? A million topics? Both?! On topics subsets in test collections
Authors
Kevin Roitero
J. Shane Culpepper
Mark Sanderson
Falk Scholer
Stefano Mizzaro
Publication date
08.05.2019
Publisher
Springer Netherlands
Published in
Discover Computing / Issue 1/2020
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI
https://doi.org/10.1007/s10791-019-09357-w