
08.06.2020

Assessing ranking metrics in top-N recommendation

Authors: Daniel Valcarce, Alejandro Bellogín, Javier Parapar, Pablo Castells

Published in: Discover Computing | Issue 4/2020


Abstract

The evaluation of recommender systems is an area with unsolved questions at several levels. Choosing the appropriate evaluation metric is one such open issue. Ranking accuracy is generally identified as a prerequisite for recommendation to be useful, and ranking metrics have accordingly been adapted from the Information Retrieval field to the recommendation task. In this article, we undertake a principled analysis of the robustness and the discriminative power of different ranking metrics for the offline evaluation of recommender systems, drawing from previous studies in the Information Retrieval field. We measure robustness to different sources of incompleteness that arise from the sparsity and popularity biases in recommendation data. Among other results, we find that precision provides high robustness, while normalized discounted cumulative gain (nDCG) offers the best discriminative power. When dealing with cold users, we also find that the geometric mean is more robust than the arithmetic mean as an aggregation function over users.
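The metrics and aggregation functions named in the abstract are straightforward to make concrete. The following is a minimal Python sketch, not the authors' evaluation code: Precision@N and nDCG@N under binary relevance, plus the arithmetic and geometric means as per-user aggregation functions. The function names, the toy data, and the epsilon used to keep zero scores from collapsing the geometric mean are all illustrative assumptions.

```python
import math

def precision_at_n(ranking, relevant, n):
    """Fraction of the top-n recommended items that are relevant."""
    return sum(1 for item in ranking[:n] if item in relevant) / n

def ndcg_at_n(ranking, relevant, n):
    """Binary-relevance nDCG@n: each relevant item at 0-based rank i
    contributes 1/log2(i + 2); the sum is normalized by the ideal DCG
    (all relevant items ranked first)."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranking[:n]) if item in relevant)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), n)))
    return dcg / idcg if idcg > 0 else 0.0

def arithmetic_mean(scores):
    return sum(scores) / len(scores)

def geometric_mean(scores, eps=1e-6):
    # eps is an illustrative choice (not from the article): it prevents a
    # single zero score from driving the whole mean to zero while barely
    # perturbing nonzero scores.
    return math.exp(sum(math.log(s + eps) for s in scores) / len(scores)) - eps

# Toy example: per-user top-5 rankings and held-out relevant (test) items.
rankings = {"u1": ["a", "b", "c", "d", "e"], "u2": ["f", "g", "h", "i", "j"]}
relevant = {"u1": {"b", "e", "x"}, "u2": {"f"}}

per_user = [ndcg_at_n(rankings[u], relevant[u], 5) for u in sorted(rankings)]
print("nDCG@5 per user:", per_user)          # ~[0.478, 1.0]
print("arithmetic mean:", arithmetic_mean(per_user))
print("geometric mean: ", geometric_mean(per_user))
```

The contrast between the two aggregation functions is what matters for cold users: because the geometric mean is dragged down sharply by near-zero per-user scores, a system that fails badly on a few users is penalized more heavily than under the arithmetic mean, which is one reason to examine its robustness separately.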


Metadata
Title
Assessing ranking metrics in top-N recommendation
Authors
Daniel Valcarce
Alejandro Bellogín
Javier Parapar
Pablo Castells
Publication date
08.06.2020
Publisher
Springer Netherlands
Published in
Discover Computing / Issue 4/2020
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI
https://doi.org/10.1007/s10791-020-09377-x
