Discover Computing

Published: 08.05.2019

Fewer topics? A million topics? Both?! On topics subsets in test collections

Authors: Kevin Roitero, J. Shane Culpepper, Mark Sanderson, Falk Scholer, Stefano Mizzaro

Published in: Discover Computing | Issue 1/2020

Abstract

When evaluating IR run effectiveness using a test collection, a key question is: What search topics should be used? We explore what happens to measurement accuracy when the number of topics in a test collection is reduced, using the Million Query 2007, TeraByte 2006, and Robust 2004 TREC collections, which all feature more than 50 topics, something that has not been examined in past work. Our analysis finds that a subset of topics can be found that is as accurate as the full topic set at ranking runs. Further, we show that the size of the subset, relative to the full topic set, can be substantially smaller than was shown in past work. We also study the topic subsets in the context of the power of statistical significance tests. We find that there is a trade-off in using such sets, in that significant results may be missed, but the loss of statistical significance is much smaller than when selecting random subsets. We also find topic subsets that can result in a low-accuracy test collection, even when the number of queries in the subset is quite large. These negatively correlated subsets suggest we still lack good methodologies that provide stability guarantees on topic selection in new collections. Finally, we examine whether clustering of topics is an appropriate strategy to find and characterize good topic subsets. Our results contribute to the understanding of information retrieval effectiveness evaluation, and offer insights for the construction of test collections.

Guiver et al. (2009) use the terminology Best/Average/Worst, and we adopt it in this paper in order to be consistent with past work.

It is important to remark that this line of research focuses on an a posteriori, i.e., after-evaluation setting: it is not aimed at predicting in advance a good topic subset, but only at determining if such a subset exists.

Consistent with this line of research (see Footnote 2), we investigate clustering of topics in an a posteriori setting; thus, we study an after-evaluation characterization of Best topic subsets, but do not aim to provide a methodology for finding such subsets in practice.

The effect of statMAP, on which we focus in this paper, is discussed in more detail in Sect. 3.3.

Note that several versions of statMAP exist; we used statAP_MQ_eval_v3.pl: http://trec.nist.gov/data/million.query07.html.

We use the suffix B/A/W to indicate the correlation curve for the best/average/worst topic set.
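As a rough illustration of what the B/A/W curves measure, the following sketch enumerates every topic subset of a given cardinality on a small invented score matrix and reports the highest, mean, and lowest correlation with the run ranking induced by the full topic set. Kendall's tau is assumed here as the correlation measure; the matrix `SCORES`, the function names, and the exhaustive enumeration are all assumptions made for illustration. Exhaustive search is only feasible at toy sizes and is not the heuristic referred to in these notes.

```python
from itertools import combinations
from statistics import mean


def kendall_tau(x, y):
    """Kendall's tau-a between two equal-length score lists (ties count as neither)."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def run_means(scores, topics):
    """Mean effectiveness of each run over the given topic indices."""
    return [mean(row[t] for t in topics) for row in scores]


def best_average_worst(scores, cardinality):
    """Return (best, average, worst) tau over all topic subsets of one cardinality."""
    n_topics = len(scores[0])
    full = run_means(scores, range(n_topics))
    taus = [
        kendall_tau(full, run_means(scores, subset))
        for subset in combinations(range(n_topics), cardinality)
    ]
    return max(taus), mean(taus), min(taus)


# Invented toy data: rows are runs, columns are topics (hypothetical values).
SCORES = [
    [0.9, 0.2, 0.8, 0.1],
    [0.6, 0.5, 0.7, 0.5],
    [0.1, 0.8, 0.2, 0.6],
]

if __name__ == "__main__":
    for k in range(1, 5):
        print(k, best_average_worst(SCORES, k))
```

On this toy matrix, single-topic subsets give a best tau of 1/3 and a worst tau of -1/3, while the full set trivially gives 1 for all three values.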

Note that the overlap we find might be an effect of the heuristic used; we can say no more than that it is possible to build a best and a worst set of topics with a high overlap.

Here, to keep the calculation fast, only a single random topic subset is drawn from the set of all topic subsets of a given cardinality. The histograms for the random series are consequently more “spiky” than if we had averaged several random subsets. However, the broad signal of the result is still visible in the plots.

We tried with up to 1 million repetitions, but the series are already stable with 1000 repetitions.
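A minimal sketch of how such a repetition-based average can be computed is given below, assuming Kendall's tau as the correlation measure and uniform sampling of topic subsets without replacement. The function names, the `seed` parameter, and the toy matrix are hypothetical and only serve the illustration.

```python
import random
from statistics import mean


def kendall_tau(x, y):
    """Kendall's tau-a between two equal-length score lists (ties count as neither)."""
    n = len(x)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            total += (s > 0) - (s < 0)
    return total / (n * (n - 1) / 2)


def average_tau(scores, cardinality, repetitions, seed=0):
    """Mean tau between the full-set ranking and randomly drawn topic subsets."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_topics = len(scores[0])
    full = [mean(row) for row in scores]
    taus = []
    for _ in range(repetitions):
        subset = rng.sample(range(n_topics), cardinality)
        sub = [mean(row[t] for t in subset) for row in scores]
        taus.append(kendall_tau(full, sub))
    return mean(taus)


# Invented toy data: rows are runs, columns are topics (hypothetical values).
SCORES = [
    [0.9, 0.2, 0.8, 0.1],
    [0.6, 0.5, 0.7, 0.5],
    [0.1, 0.8, 0.2, 0.6],
]

if __name__ == "__main__":
    # Compare the estimate at increasing numbers of repetitions.
    for reps in (10, 100, 1000):
        print(reps, average_tau(SCORES, 2, reps))
```

Comparing the value returned for increasing `repetitions` is one simple way to check whether the series has stabilised.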

See the R function “kmeans” in the “stats” package (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/kmeans.html), and the “KMeans” class of “scikit-learn” for Python 3 (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).

For an exhaustive list see the R package “proxy” (https://cran.r-project.org/web/packages/proxy/proxy.pdf), and the “Distance computations” section of the SciPy documentation for Python 3 (https://docs.scipy.org/doc/scipy/reference/spatial.distance.html).
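To make the role of the distance function concrete, the sketch below represents each topic as the vector of scores it receives across runs and computes pairwise distances under three common choices. It is written in plain Python so that it runs without dependencies; the function names and the toy matrix are assumptions for illustration, and the libraries cited above offer equivalent, better-tested implementations.

```python
import math


def topic_vectors(scores):
    """Transpose a runs-by-topics matrix so each topic becomes a vector of run scores."""
    return [list(column) for column in zip(*scores)]


def euclidean(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_distance(u, v):
    """One minus the cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(vectors, dist):
    """Pairwise distances between all vectors under the given distance function."""
    return [[dist(u, v) for v in vectors] for u in vectors]


# Invented toy data: rows are runs, columns are topics (hypothetical values).
SCORES = [
    [0.9, 0.2, 0.8, 0.1],
    [0.6, 0.5, 0.7, 0.5],
    [0.1, 0.8, 0.2, 0.6],
]

if __name__ == "__main__":
    topics = topic_vectors(SCORES)
    for name, dist in [("euclidean", euclidean),
                       ("manhattan", manhattan),
                       ("cosine", cosine_distance)]:
        print(name, distance_matrix(topics, dist)[0])
```

A clustering algorithm such as k-means would then group topics whose vectors lie close together under the chosen distance; which distance is appropriate depends on whether absolute score levels or only score patterns across runs should matter.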

Allan, J., Carterette, B., Aslam, J. A., Pavlu, V., Dachev, B., & Kanoulas, E. (2007). Million query track 2007 overview. In Proceedings of TREC.

Bartlett, J. E., Kotrlik, J. W., & Higgins, C. C. (2001). Organizational research: Determining appropriate sample size in survey research. Information Technology, Learning, and Performance Journal, 19(1), 43–50.

Berto, A., Mizzaro, S., & Robertson, S. (2013). On using fewer topics in information retrieval evaluations. In Proceedings of the ICTIR, (p. 9).

Bodoff, D., & Li, P. (2007). Test theory for assessing IR test collections. In Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval, (pp. 367–374). New York: ACM.

Buckley, C., & Voorhees, E. (2000). Evaluating evaluation measure stability. In Proceedings of the 23rd SIGIR, (pp. 33–40).

Carterette, B., Allan, J., & Sitaraman, R. (2006). Minimal test collections for retrieval evaluation. In Proceedings of the 29th SIGIR, (pp. 268–275).

Carterette, B., Pavlu, V., Fang, H., & Kanoulas, E. (2009a). Million query track 2009 overview. In Proceedings of TREC.

Carterette, B., Pavlu, V., Kanoulas, E., Aslam, J. A., & Allan, J. (2009b). If I had a million queries. In Proceedings of the 31st ECIR, ECIR ’09, (pp. 288–300).

Carterette, B., & Smucker, M. D. (2007). Hypothesis testing with incomplete relevance judgments. In Proceedings of the sixteenth ACM conference on conference on information and knowledge management, (pp. 643–652). New York: ACM. CIKM ’07. https://doi.org/10.1145/1321440.1321530.

Carterette, B. A. (2012). Multiple testing in statistical analysis of systems-based information retrieval experiments. ACM Transactions on Information Systems (TOIS), 30(1), 4.

Cattelan, M., & Mizzaro, S. (2009). IR evaluation without a common set of topics. In Proceedings of the ICTIR, (pp. 342–345).

Feise, R. (2002). Do multiple outcome measures require \(p\)-value adjustment? BMC Medical Research Methodology, 2, 8.

Guiver, J., Mizzaro, S., & Robertson, S. (2009). A few good topics: Experiments in topic set reduction for retrieval evaluation. ACM Transactions on Information Systems, 21(1–21), 26.

Hauff, C., Hiemstra, D., Azzopardi, L., & de Jong, F. (2010). A case for automatic system evaluation. In Proceedings of the ECIR, (pp. 153–165).

Hauff, C., Hiemstra, D., de Jong, F., & Azzopardi, L. (2009). Relying on topic subsets for system ranking estimation. In Proceedings of the 18th CIKM, (pp. 1859–1862).

Hosseini, M., Cox, I. J., Milic-Frayling, N., Shokouhi, M., & Yilmaz, E. (2012). An uncertainty-aware query selection model for evaluation of IR systems. In Proceedings of the 35th international ACM SIGIR conference on research and development in information retrieval, (pp. 901–910). New York, NY, USA: ACM. SIGIR ’12. https://doi.org/10.1145/2348283.2348403

Hosseini, M., Cox, I. J., Milic-Frayling, N., Sweeting, T., & Vinay, V. (2011a). Prioritizing relevance judgments to improve the construction of IR test collections. In Proceedings of the 20th CIKM 2011, (pp. 641–646).

Hosseini, M., Cox, I. J., Milic-Frayling, N., Vinay, V., & Sweeting, T. (2011b). Selecting a subset of queries for acquisition of further relevance judgements. In Proceedings of the ICTIR, (pp. 113–124). LNCS 6931.

Kutlu, M., Elsayed, T., & Lease, M. (2018). Intelligent topic selection for low-cost information retrieval evaluation: A new perspective on deep vs. shallow judging. Information Processing and Management, 54(1), 37–59. https://doi.org/10.1016/j.ipm.2017.09.002.

Mehrotra, R., & Yilmaz, E. (2015). Representative & informative query selection for learning to rank using submodular functions. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval, (pp. 545–554). New York, NY, USA: ACM, SIGIR ’15. https://doi.org/10.1145/2766462.2767753

Mizzaro, S., & Robertson, S. (2007). HITS hits TREC—Exploring IR evaluation results with network analysis. In Proceedings of the 30th SIGIR, (pp. 479–486).

Moffat, A., Scholer, F., & Thomas, P. (2012). Models and metrics: IR evaluation as a user process. In Proceedings of the Australasian document computing symposium, Dunedin, New Zealand, (pp. 47–54).

Pavlu, V., & Aslam, J. (2007). A practical sampling strategy for efficient retrieval evaluation. Technical report, College of Computer and Information Science, Northeastern University.

Rajaraman, A., & Ullman, J. D. (2011). Mining of massive datasets (1st ed.). Cambridge: Cambridge University Press.

Robertson, S. (2011). On the contributions of topics to system evaluation. In Proceedings of the ECIR, LNCS 6611, (pp. 129–140).

Roitero, K., Maddalena, E., & Mizzaro, S. (2017). Do easy topics predict effectiveness better than difficult topics? In J. M. Jose, C. Hauff, I. S. Altıngovde, D. Song, D. Albakour, S. Watt, & J. Tait (Eds.), Advances in information retrieval (pp. 605–611). Cham: Springer International Publishing.

Roitero, K., Soprano, M., Brunello, A., & Mizzaro, S. (2018a). Reproduce and improve: An evolutionary approach to select a few good topics for information retrieval evaluation. ACM Journal of Data and Information Quality, 10(3), 12:1–12:21. https://doi.org/10.1145/3239573.

Roitero, K., Soprano, M., & Mizzaro, S. (2018b). Effectiveness evaluation with a subset of topics: A practical approach. In The 41st international ACM SIGIR conference on research and development in information retrieval, (pp. 1145–1148). New York, NY, USA: ACM, SIGIR ’18. https://doi.org/10.1145/3209978.3210108

Rose, D. E., & Levinson, D. (2004). Understanding user goals in web search. In Proceedings of the 13th international conference on World Wide Web, (pp. 13–19). New York, NY: ACM Press.

Sakai, T. (2007). Alternatives to bpref. In Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval, (pp. 71–78). New York, NY: ACM, SIGIR ’07. https://doi.org/10.1145/1277741.1277756

Sakai, T. (2014). Designing test collections for comparing many systems. In Proceedings of the 23rd CIKM 2014, (pp. 61–70).

Sakai, T. (2016a). Statistical significance, power, and sample sizes: A systematic review of SIGIR and TOIS, 2006–2015. In Proceedings of the 39th SIGIR, (pp. 5–14). ACM.

Sakai, T. (2016b). Topic set size design. Information Retrieval Journal, 19(3), 256–283.

Sanderson, M., & Soboroff, I. (2007). Problems with Kendall’s Tau. In Proceedings of the 30th SIGIR, (pp. 839–840).

Sanderson, M., & Zobel, J. (2005). Information retrieval system evaluation: Effort, sensitivity, and reliability. In Proceedings of the 28th SIGIR, (pp. 162–169).

Sheskin, D. (2007). Handbook of parametric and nonparametric statistical procedures (4th ed.). Boca Raton: CRC Press.

Urbano, J. (2016). Test collection reliability: A study of bias and robustness to statistical assumptions via stochastic simulation. Information Retrieval Journal, 19(3), 313–350. https://doi.org/10.1007/s10791-015-9274-y.

Urbano, J., Marrero, M., & Martín, D. (2013). On the measurement of test collection reliability. In Proceedings of the 36th SIGIR, (pp. 393–402).

Urbano, J., & Nagler, T. (2018). Stochastic simulation of test collections: Evaluation scores. In The 41st international ACM SIGIR conference on research & development in information retrieval, (pp. 695–704). New York, NY, USA: ACM, SIGIR ’18. https://doi.org/10.1145/3209978.3210043.

Voorhees, E., & Buckley, C. (2002). The effect of topic set size on retrieval experiment error. In Proceedings of the 25th SIGIR, (pp. 316–323).

Webber, W., Moffat, A., & Zobel, J. (2008). Statistical power in retrieval experimentation. In Proceedings of the 17th CIKM, (pp. 571–580).

Title: Fewer topics? A million topics? Both?! On topics subsets in test collections
Authors: Kevin Roitero, J. Shane Culpepper, Mark Sanderson, Falk Scholer, Stefano Mizzaro
Publication date: 08.05.2019
Publisher: Springer Netherlands
Published in: Discover Computing / Issue 1/2020
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI: https://doi.org/10.1007/s10791-019-09357-w
