ABSTRACT
A crucial operation in memory-based collaborative filtering (CF) is determining nearest neighbors (NNs) of users/items. This paper addresses two phenomena that emerge when CF algorithms perform NN search in high-dimensional spaces that are typical in CF applications. The first is similarity concentration and the second is the appearance of hubs (i.e. points which appear in $k$-NN lists of many other points). Through theoretical analysis and experimental evaluation we show that these phenomena are inherent properties of high-dimensional space, unrelated to other data properties like sparsity, and that they can impact CF algorithms by questioning the meaning and representativeness of discovered NNs. Moreover, we show that it is not easy to mitigate the phenomena using dimensionality reduction. Studying these phenomena aims to provide a better understanding of the limitations of memory-based CF and motivate the development of new algorithms that would overcome them.
- G. Casella and R. L. Berger. Statistical Inference, 2nd ed. Duxbury, 2002.Google Scholar
- D. Francois, V. Wertz, and M. Verleysen. The concentration of fractional distances. IEEE T. Knowl. Data. En., 19(7):873--886, 2007. Google ScholarDigital Library
- L. A. Goodman. On the exact variance of products. J. Am. Stat. Assoc., 55(292):708--713, 1960.Google ScholarCross Ref
- M. Grcar, D. Mladenic, B. Fortuna, and M. Grobelnik. Data sparsity issues in the collaborative filtering framework. In Proc. WebKDD Workshop, pages 58--76, 2005. Google ScholarDigital Library
- A. Hinneburg, C. C. Aggarwal, and D. A. Keim. What is the nearest neighbor in high dimensional spaces? In Proc. Int. Conf. on Very Large Data Bases (VLDB), pages 506--515, 2000. Google ScholarDigital Library
- M. Radovanovic, A. Nanopoulos, and M. Ivanovic. Nearest neighbors in high-dimensional data: The emergence and influence of hubs. In Proc. Int. Conf. on Machine Learning (ICML), pages 865--872, 2009. Google ScholarDigital Library
- B. Sarwar, G. Karypis, J. Konstan, and J. Reidl. Application of dimensionality reduction in recommender system. In Proc. WebKDD Workshop, 2000.Google Scholar
- B. Sarwar, G. Karypis, J. Konstan, and J. Reidl. Item-based collaborative filtering recommendation algorithms. In Proc. World Wide Web Conf. (WWW), pages 285--295, 2001. Google ScholarDigital Library
- J. Wang, A. P. de Vries, and M. J. T. Reinders. Unifying user-based and item-based collaborative filtering approaches by similarity fusion. In Proc. ACM Conf. on Research and Development in Information Retrieval (SIGIR), pages 501--508, 2006. Google ScholarDigital Library
- K. Yu, X. Xu, M. Ester, and H.-P. Kriegel. Feature weighting and instance selection for collaborative filtering: An information--theoretic approach. Knowl. Inf. Syst., 5(2):201--224, 2003. Google ScholarDigital Library
Index Terms
- How does high dimensionality affect collaborative filtering?
Recommendations
Improving Neighborhood-Based Collaborative Filtering by Reducing Hubness
ICMR '14: Proceedings of International Conference on Multimedia RetrievalFor recommending multimedia items, collaborative filtering (CF) denotes the technique of automatically predicting a user's rating or preference for an item by exploiting item preferences of a (large) group of other users. In traditional memory-based (or ...
The wisdom of the few: a collaborative filtering approach based on expert opinions from the web
SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrievalNearest-neighbor collaborative filtering provides a successful means of generating recommendations for web users. However, this approach suffers from several shortcomings, including data sparsity and noise, the cold-start problem, and scalability. In ...
On the existence of obstinate results in vector space models
SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrievalThe vector space model (VSM) is a popular and widely applied model in information retrieval (IR). VSM creates vector spaces whose dimensionality is usually high (e.g., tens of thousands of terms). This may cause various problems, such as susceptibility ...
Comments