
01.06.2016 | Information Retrieval Evaluation Using Test Collections

Predicting relevance based on assessor disagreement: analysis and practical applications for search evaluation

Authors: Thomas Demeester, Robin Aly, Djoerd Hiemstra, Dong Nguyen, Chris Develder

Published in: Discover Computing | Issue 3/2016


Abstract

Evaluation of search engines relies on assessments of search results for selected test queries, from which we would ideally like to draw conclusions in terms of relevance of the results for general (e.g., future, unknown) users. In practice, however, most evaluation scenarios only allow us to conclusively determine relevance with respect to the particular assessor who provided the judgments. A factor that cannot be ignored when extending conclusions from assessors to users is the possible disagreement on relevance, assuming that a single gold-truth label does not exist. This paper presents and analyzes the predicted relevance model (PRM), which allows predicting a particular result's relevance for a random user, based on an observed assessment and knowledge of the average disagreement between assessors. With the PRM, existing evaluation metrics designed to measure binary assessor relevance can be transformed into more robust, effectively graded measures that evaluate relevance towards a random user. It also leads to a principled way of quantifying multiple graded or categorical relevance levels for use as gains in established graded relevance measures, such as normalized discounted cumulative gain (nDCG), which nowadays often rely on heuristic, data-independent gain values. Given a set of test topics with graded relevance judgments, the PRM allows evaluating systems under different scenarios, such as their capability to retrieve top results, or how well they filter out non-relevant ones. Its use in actual evaluation scenarios is illustrated on several information retrieval test collections.
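
To make the idea of data-dependent gains concrete, the following minimal sketch shows how predicted relevance probabilities could serve as nDCG gains. The label names, the probability values in predicted_relevance, and the helper functions are illustrative assumptions, not the paper's exact PRM formulation; in practice such probabilities would be estimated from overlapping (doubly assessed) judgments.

    # Minimal sketch (assumed names and values, not the paper's exact formulation):
    # the gain of a result is the predicted probability that a random user would
    # consider it relevant, given the single observed assessor label.
    import math

    # Hypothetical mapping, to be estimated from overlapping assessments in practice.
    predicted_relevance = {
        "non_relevant": 0.05,
        "relevant": 0.60,
        "highly_relevant": 0.90,
    }

    def dcg(gains):
        # Discounted cumulative gain with the standard log2 rank discount.
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

    def ndcg_from_labels(ranked_labels):
        # nDCG where each gain is a predicted relevance probability rather than
        # a heuristic, data-independent constant.
        gains = [predicted_relevance[label] for label in ranked_labels]
        ideal = dcg(sorted(gains, reverse=True))
        return dcg(gains) / ideal if ideal > 0 else 0.0

    # Example: a ranked list judged by one assessor.
    print(ndcg_from_labels(["relevant", "highly_relevant", "non_relevant"]))

Under this reading the gain attached to a label is data dependent: two test collections with different observed levels of assessor disagreement would yield different gains for the same nominal relevance label.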


Footnotes
1
Vakkari and Sormunen (2004) adopt the term ‘users’ for the persons reassessing documents. In our terminology, such persons are referred to as assessors.
 
2
The UDM was actually defined based on the probability that at least M out of N assessors, including the observed one, assign the top level. However, based on the binomial distribution, this is a straightforward extension from the case of \(M=1\) and \(N=2\), which is described here and corresponds best to the PRM formulation.
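As a sketch of that binomial extension (assuming, for illustration only, that each assessor independently assigns the top level with the same probability \(p\), and leaving aside the exact conditioning on the observed assessor used in the full UDM definition), the probability that at least \(M\) out of \(N\) assessors assign the top level is \(\sum_{k=M}^{N} \binom{N}{k} p^{k} (1-p)^{N-k}\); for \(M=1\) and \(N=2\) this reduces to \(1-(1-p)^{2}\).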
 
5
The TREC results are available at http://trec.nist.gov/results/.
 
References
Agrawal, R., Gollapudi, S., Halverson, A., & Ieong, S. (2009). Diversifying search results. In Proceedings of the 2nd ACM international conference on web search and data mining (WSDM 2009) (pp. 5–14), Barcelona. doi:10.1145/1498759.1498766.
Al-Harbi, A. L., & Smucker, M. D. (2014). A qualitative exploration of secondary assessor relevance judging behavior. In Proceedings of the 5th information interaction in context symposium (IIiX 2014) (pp. 195–204), Regensburg. doi:10.1145/2637002.2637025.
Bailey, P., Craswell, N., Soboroff, I., & Thomas, P. (2008). Relevance assessment: Are judges exchangeable and does it matter? In Proceedings of the 31st international ACM SIGIR conference on research and development in information retrieval (SIGIR 2008), Singapore. doi:10.1145/1390334.1390447.
Carterette, B., & Soboroff, I. (2010). The effect of assessor errors on IR system evaluation. In Proceedings of the 33rd international ACM SIGIR conference on research and development in information retrieval (SIGIR 2010) (pp. 539–546), Geneva. doi:10.1145/1835449.1835540.
Carterette, B., Bennett, P. N., Chickering, D. M., & Dumais, S. T. (2008). Here or there: Preference judgments for relevance. In Proceedings of the 30th European conference on advances in information retrieval (ECIR 2008) (pp. 16–27). Berlin: Springer.
Carterette, B., Kanoulas, E., & Yilmaz, E. (2012). Incorporating variability in user behavior into systems based evaluation. In Proceedings of the 21st ACM international conference on information and knowledge management (CIKM 2012) (pp. 135–144). New York, NY: ACM. doi:10.1145/2396761.2396782.
Chapelle, O., Metzler, D., Zhang, Y., & Grinspan, P. (2009). Expected reciprocal rank for graded relevance. In Proceedings of the 18th ACM international conference on information and knowledge management (CIKM 2009) (pp. 621–630), New York, NY. doi:10.1145/1645953.1646033.
Demeester, T., Trieschnigg, D., Nguyen, D., & Hiemstra, D. (2013). Overview of the TREC 2013 federated web search track. In Proceedings of the 22nd text retrieval conference (TREC 2013), Gaithersburg, MD.
Demeester, T., Aly, R., Hiemstra, D., Nguyen, D., Trieschnigg, D., & Develder, C. (2014). Exploiting user disagreement for web search evaluation: An experimental approach. In Proceedings of the 7th ACM international conference on web search and data mining (WSDM 2014) (pp. 33–42), New York, NY. doi:10.1145/2556195.2556268.
Demeester, T., Trieschnigg, D., Zhou, K., Nguyen, D., & Hiemstra, D. (2015). FedWeb greatest hits: Presenting the new test collection for federated web search. In Proceedings of the 24th international world wide web conference (WWW 2015), Florence. doi:10.1145/2740908.2742755.
Hosseini, M., Cox, I. J., Milić-Frayling, N., Kazai, G., & Vinay, V. (2012). On aggregating labels from multiple crowd workers to infer relevance of documents. In Proceedings of the 34th European conference on advances in information retrieval (ECIR 2012) (pp. 182–194), Barcelona.
Kanoulas, E., & Aslam, J. A. (2009). Empirical justification of the gain and discount function for nDCG. In Proceedings of the 18th ACM international conference on information and knowledge management (CIKM 2009) (pp. 611–620), Hong Kong. doi:10.1145/1645953.1646032.
Kazai, G., Yilmaz, E., Craswell, N., & Tahaghoghi, S. (2013). User intent and assessor disagreement in web search evaluation. In Proceedings of the 22nd ACM international conference on information and knowledge management (CIKM 2013) (pp. 699–708). New York, NY: ACM. doi:10.1145/2505515.2505716.
Nguyen, D., Demeester, T., Trieschnigg, D., & Hiemstra, D. (2012). Federated search in the wild: The combined power of over a hundred search engines. In Proceedings of the 21st ACM international conference on information and knowledge management (CIKM 2012), Maui, HI. doi:10.1145/2396761.2398535.
Robertson, S. E., Kanoulas, E., & Yilmaz, E. (2010). Extending average precision to graded relevance judgments. In Proceedings of the 33rd international ACM SIGIR conference on research and development in information retrieval (SIGIR 2010) (pp. 603–610), Geneva. doi:10.1145/1835449.1835550.
Sakai, T., Dou, Z., Yamamoto, T., Liu, Y., Zhang, M., & Song, R. (2013). Overview of the NTCIR-10 INTENT-2 task. In Proceedings of the 10th NTCIR conference (pp. 94–123), Tokyo.
Smucker, M. D., & Clarke, C. L. (2012). Modeling user variance in time-biased gain. In Proceedings of the symposium on human–computer interaction and information retrieval (HCIR 2012), Cambridge, MA. doi:10.1145/2391224.2391227.
Song, R., Zhang, M., Sakai, T., Kato, M. P., Liu, Y., Sugimoto, M., Wang, Q., & Orii, N. (2011). Overview of the NTCIR-9 INTENT task. In Proceedings of the 9th NTCIR workshop meeting (pp. 82–105), Tokyo.
Sormunen, E. (2002). Liberal relevance criteria of TREC: Counting on negligible documents? In Proceedings of the 25th international ACM SIGIR conference on research and development in information retrieval (SIGIR 2002) (pp. 324–330), Tampere.
Turpin, A., Scholer, F., Järvelin, K., Wu, M., & Culpepper, J. S. (2009). Including summaries in system evaluation. In Proceedings of the 32nd international ACM SIGIR conference on research and development in information retrieval (SIGIR 2009) (pp. 508–515), Boston, MA. doi:10.1145/1571941.1572029.
Vakkari, P., & Sormunen, E. (2004). The influence of relevance levels on the effectiveness of interactive information retrieval. Journal of the American Society for Information Science and Technology, 55(11), 963–969. doi:10.1002/asi.20046.
Voorhees, E. M. (2001). Evaluation by highly relevant documents. In Proceedings of the 24th international ACM SIGIR conference on research and development in information retrieval (SIGIR 2001) (pp. 74–82), New Orleans, LA. doi:10.1145/383952.383963.
Webber, W., Chandar, P., & Carterette, B. (2012). Alternative assessor disagreement and retrieval depth. In Proceedings of the 21st ACM international conference on information and knowledge management (CIKM 2012) (pp. 125–134), New York, NY. doi:10.1145/2396761.2396781.
Yilmaz, E., Shokouhi, M., Craswell, N., & Robertson, S. (2010). Expected browsing utility for web search evaluation. In Proceedings of the 19th ACM international conference on information and knowledge management (CIKM 2010) (pp. 1561–1564), Toronto, ON. doi:10.1145/1871437.1871672.
Zhai, C. X., Cohen, W. W., & Lafferty, J. (2003). Beyond independent relevance: Methods and evaluation metrics for subtopic retrieval. In Proceedings of the 26th international ACM SIGIR conference on research and development in information retrieval (SIGIR 2003) (pp. 10–17), Toronto, ON. doi:10.1145/860435.860440.
Zhou, K., Zha, H., Chang, Y., & Xue, G. R. (2014). Learning the gain values and discount factors of discounted cumulative gains. IEEE Transactions on Knowledge and Data Engineering, 26(2), 391–404. doi:10.1109/TKDE.2012.252.
Metadata
Title
Predicting relevance based on assessor disagreement: analysis and practical applications for search evaluation
Authors
Thomas Demeester
Robin Aly
Djoerd Hiemstra
Dong Nguyen
Chris Develder
Publication date
01.06.2016
Publisher
Springer Netherlands
Published in
Discover Computing / Issue 3/2016
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI
https://doi.org/10.1007/s10791-015-9275-x
