
01.06.2016 | Information Retrieval Evaluation Using Test Collections

Predicting relevance based on assessor disagreement: analysis and practical applications for search evaluation

Authors: Thomas Demeester, Robin Aly, Djoerd Hiemstra, Dong Nguyen, Chris Develder

Published in: Discover Computing | Issue 3/2016


Abstract

Evaluation of search engines relies on assessments of search results for selected test queries, from which we would ideally like to draw conclusions in terms of relevance of the results for general (e.g., future, unknown) users. In practice, however, most evaluation scenarios only allow us to conclusively determine relevance with respect to the particular assessor who provided the judgments. A factor that cannot be ignored when extending conclusions from assessors to users is the possible disagreement on relevance, assuming that a single gold-truth label does not exist. This paper presents and analyzes the predicted relevance model (PRM), which allows predicting a particular result's relevance for a random user, based on an observed assessment and knowledge of the average disagreement between assessors. With the PRM, existing evaluation metrics designed to measure binary assessor relevance can be transformed into more robust, effectively graded measures that evaluate relevance towards a random user. It also leads to a principled way of quantifying multiple graded or categorical relevance levels for use as gains in established graded relevance measures, such as normalized discounted cumulative gain (nDCG), which nowadays often rely on heuristic, data-independent gain values. Given a set of test topics with graded relevance judgments, the PRM allows evaluating systems under different scenarios, such as their capability to retrieve top results, or how well they filter out non-relevant ones. Its use in actual evaluation scenarios is illustrated on several information retrieval test collections.
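
To make the idea of data-dependent gains concrete, the following minimal sketch shows how predicted relevance probabilities could serve as nDCG gains. The label names, the probability values in predicted_relevance, and the helper functions are illustrative assumptions, not the paper's exact PRM formulation; in practice such probabilities would be estimated from overlapping (doubly assessed) judgments.

    # Minimal sketch (assumed names and values, not the paper's exact formulation):
    # the gain of a result is the predicted probability that a random user would
    # consider it relevant, given the single observed assessor label.
    import math

    # Hypothetical mapping, to be estimated from overlapping assessments in practice.
    predicted_relevance = {
        "non_relevant": 0.05,
        "relevant": 0.60,
        "highly_relevant": 0.90,
    }

    def dcg(gains):
        # Discounted cumulative gain with the standard log2 rank discount.
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

    def ndcg_from_labels(ranked_labels):
        # nDCG where each gain is a predicted relevance probability rather than
        # a heuristic, data-independent constant.
        gains = [predicted_relevance[label] for label in ranked_labels]
        ideal = dcg(sorted(gains, reverse=True))
        return dcg(gains) / ideal if ideal > 0 else 0.0

    # Example: a ranked list judged by one assessor.
    print(ndcg_from_labels(["relevant", "highly_relevant", "non_relevant"]))

Under this reading the gain attached to a label is data dependent: two test collections with different observed levels of assessor disagreement would yield different gains for the same nominal relevance label.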


Footnotes
1
Vakkari and Sormunen (2004) adopt the term ‘users’ for the persons reassessing documents. In our terminology, such persons are referred to as assessors.
 
2
The UDM was actually defined based on the probability that at least M out of N assessors, including the observed one, assign the top level. However, based on the binomial distribution, this is a straightforward extension from the case of \(M=1\) and \(N=2\), which is described here and corresponds best to the PRM formulation.
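As a sketch of that binomial extension (assuming, for illustration only, that each assessor independently assigns the top level with the same probability \(p\), and leaving aside the exact conditioning on the observed assessor used in the full UDM definition), the probability that at least \(M\) out of \(N\) assessors assign the top level is \(\sum_{k=M}^{N} \binom{N}{k} p^{k} (1-p)^{N-k}\); for \(M=1\) and \(N=2\) this reduces to \(1-(1-p)^{2}\).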
 
5
The TREC results are available at http://trec.nist.gov/results/.
 
References
Agrawal, R., Gollapudi, S., Halverson, A., & Ieong, S. (2009). Diversifying search results. In Proceedings of the 2nd ACM international conference on web search and data mining (WSDM 2009) (pp. 5–14), Barcelona. doi:10.1145/1498759.1498766.
Al-Harbi, A. L., & Smucker, M. D. (2014). A qualitative exploration of secondary assessor relevance judging behavior. In Proceedings of the 5th information interaction in context symposium (IIiX 2014) (pp. 195–204), Regensburg. doi:10.1145/2637002.2637025.
Bailey, P., Craswell, N., Soboroff, I., & Thomas, P. (2008). Relevance assessment: Are judges exchangeable and does it matter? In Proceedings of the 31st international ACM SIGIR conference on research and development in information retrieval (SIGIR 2008), Singapore. doi:10.1145/1390334.1390447.
Carterette, B., & Soboroff, I. (2010). The effect of assessor errors on IR system evaluation. In Proceedings of the 33rd international ACM SIGIR conference on research and development in information retrieval (SIGIR 2010) (pp. 539–546), Geneva. doi:10.1145/1835449.1835540.
Carterette, B., Bennett, P. N., Chickering, D. M., & Dumais, S. T. (2008). Here or there: Preference judgments for relevance. In Proceedings of the 30th European conference on advances in information retrieval (ECIR 2008) (pp. 16–27). Berlin: Springer.
Carterette, B., Kanoulas, E., & Yilmaz, E. (2012). Incorporating variability in user behavior into systems based evaluation. In Proceedings of the 21st ACM international conference on information and knowledge management (CIKM 2012) (pp. 135–144). New York, NY: ACM. doi:10.1145/2396761.2396782.
Chapelle, O., Metzler, D., Zhang, Y., & Grinspan, P. (2009). Expected reciprocal rank for graded relevance. In Proceedings of the 18th ACM international conference on information and knowledge management (CIKM 2009) (pp. 621–630), New York, NY. doi:10.1145/1645953.1646033.
Demeester, T., Trieschnigg, D., Nguyen, D., & Hiemstra, D. (2013). Overview of the TREC 2013 federated web search track. In Proceedings of the 22nd text retrieval conference (TREC 2013), Gaithersburg, MD.
Demeester, T., Aly, R., Hiemstra, D., Nguyen, D., Trieschnigg, D., & Develder, C. (2014). Exploiting user disagreement for web search evaluation: An experimental approach. In Proceedings of the 7th ACM international conference on web search and data mining (WSDM 2014) (pp. 33–42), New York, NY. doi:10.1145/2556195.2556268.
Demeester, T., Trieschnigg, D., Zhou, K., Nguyen, D., & Hiemstra, D. (2015). FedWeb greatest hits: Presenting the new test collection for federated web search. In Proceedings of the 24th international world wide web conference (WWW 2015), Florence. doi:10.1145/2740908.2742755.
Hosseini, M., Cox, I. J., Milić-Frayling, N., Kazai, G., & Vinay, V. (2012). On aggregating labels from multiple crowd workers to infer relevance of documents. In Proceedings of the 34th European conference on advances in information retrieval (ECIR 2012) (pp. 182–194), Barcelona.
Kanoulas, E., & Aslam, J. A. (2009). Empirical justification of the gain and discount function for nDCG. In Proceedings of the 18th ACM international conference on information and knowledge management (CIKM 2009) (pp. 611–620), Hong Kong. doi:10.1145/1645953.1646032.
Kazai, G., Yilmaz, E., Craswell, N., & Tahaghoghi, S. (2013). User intent and assessor disagreement in web search evaluation. In Proceedings of the 22nd ACM international conference on information and knowledge management (CIKM 2013) (pp. 699–708). New York, NY: ACM. doi:10.1145/2505515.2505716.
Nguyen, D., Demeester, T., Trieschnigg, D., & Hiemstra, D. (2012). Federated search in the wild: The combined power of over a hundred search engines. In Proceedings of the 21st ACM international conference on information and knowledge management (CIKM 2012), Maui, HI. doi:10.1145/2396761.2398535.
Robertson, S. E., Kanoulas, E., & Yilmaz, E. (2010). Extending average precision to graded relevance judgments. In Proceedings of the 33rd international ACM SIGIR conference on research and development in information retrieval (SIGIR 2010) (pp. 603–610), Geneva. doi:10.1145/1835449.1835550.
Sakai, T., Dou, Z., Yamamoto, T., Liu, Y., Zhang, M., & Song, R. (2013). Overview of the NTCIR-10 INTENT-2 task. In Proceedings of the 10th NTCIR conference (pp. 94–123), Tokyo.
Smucker, M. D., & Clarke, C. L. (2012). Modeling user variance in time-biased gain. In Proceedings of the symposium on human–computer interaction and information retrieval (HCIR 2012), Cambridge, MA. doi:10.1145/2391224.2391227.
Song, R., Zhang, M., Sakai, T., Kato, M. P., Liu, Y., Sugimoto, M., Wang, Q., & Orii, N. (2011). Overview of the NTCIR-9 INTENT task. In Proceedings of the 9th NTCIR workshop meeting (pp. 82–105), Tokyo.
Sormunen, E. (2002). Liberal relevance criteria of TREC: Counting on negligible documents? In Proceedings of the 25th international ACM SIGIR conference on research and development in information retrieval (SIGIR 2002) (pp. 324–330), Tampere.
Turpin, A., Scholer, F., Järvelin, K., Wu, M., & Culpepper, J. S. (2009). Including summaries in system evaluation. In Proceedings of the 32nd international ACM SIGIR conference on research and development in information retrieval (SIGIR 2009) (pp. 508–515), Boston, MA. doi:10.1145/1571941.1572029.
Vakkari, P., & Sormunen, E. (2004). The influence of relevance levels on the effectiveness of interactive information retrieval. Journal of the American Society for Information Science and Technology, 55(11), 963–969. doi:10.1002/asi.20046.
Voorhees, E. M. (2001). Evaluation by highly relevant documents. In Proceedings of the 24th international ACM SIGIR conference on research and development in information retrieval (SIGIR 2001) (pp. 74–82), New Orleans, LA. doi:10.1145/383952.383963.
Webber, W., Chandar, P., & Carterette, B. (2012). Alternative assessor disagreement and retrieval depth. In Proceedings of the 21st ACM international conference on information and knowledge management (CIKM 2012) (pp. 125–134), New York, NY. doi:10.1145/2396761.2396781.
Yilmaz, E., Shokouhi, M., Craswell, N., & Robertson, S. (2010). Expected browsing utility for web search evaluation. In Proceedings of the 19th ACM international conference on information and knowledge management (CIKM 2010) (pp. 1561–1564), Toronto, ON. doi:10.1145/1871437.1871672.
Zhai, C. X., Cohen, W. W., & Lafferty, J. (2003). Beyond independent relevance: Methods and evaluation metrics for subtopic retrieval. In Proceedings of the 26th international ACM SIGIR conference on research and development in information retrieval (SIGIR 2003) (pp. 10–17), Toronto, ON. doi:10.1145/860435.860440.
Zhou, K., Zha, H., Chang, Y., & Xue, G. R. (2014). Learning the gain values and discount factors of discounted cumulative gains. IEEE Transactions on Knowledge and Data Engineering, 26(2), 391–404. doi:10.1109/TKDE.2012.252.
Metadata
Title
Predicting relevance based on assessor disagreement: analysis and practical applications for search evaluation
Authors
Thomas Demeester
Robin Aly
Djoerd Hiemstra
Dong Nguyen
Chris Develder
Publication date
01.06.2016
Publisher
Springer Netherlands
Published in
Discover Computing / Issue 3/2016
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI
https://doi.org/10.1007/s10791-015-9275-x
