In the context of recommendation, researchers and practitioners in RSs are concerned with user satisfaction, so that the predictions provide more value to the user. The reason is that RSs must be useful to the user, not merely suggest that they consume "more of the same". Researchers worry about the users' interaction and consumption experience in the system. Recently, researchers have attempted to address this problem by considering different concepts of evaluation, rather than simply using predictive accuracy and machine learning techniques [14]. The performance of the suggestions provided by an RS should be measured by the value it generates for the user [14]. Many concepts regarding the evaluation of recommendations, such as coverage, novelty, diversity and surprise, have been evaluated by different studies.
This methodology of evaluation in recommendation has different names in the literature. Kotkov et al. [20, 21] use the word concept when referring to novelty, relevance and serendipity. However, distinct names have been used. For instance, Adamopoulos et al. [1] use the word dimensions to refer to improvements that can increase the performance and usefulness of the recommendation list for users. Additionally, Herlocker et al. [14] prefer the name measures of recommender system evaluation, referring to coverage, confidence, novelty and learning rate. Despite the different names, we adopt the word concept to refer to the different aspects used in the assessment of RSs.
There are other concepts besides the six aforementioned ones; however, due to space limitations and the lack of research about them, we decided to focus on the most studied ones. Some concepts that are not covered in this paper are trust, risk, robustness, privacy, adaptability and scalability [29]. Even though the handbook by Ricci et al. [29] contains a review of evaluation concepts, our survey focuses on the evaluation concepts and their metrics; moreover, it includes recent metrics for unexpectedness and serendipity published after the review by Ricci et al. [29].
2.1 Utility
Utility has been mentioned in the literature under many names, such as relevance, usefulness, recommendation value and satisfaction. In the Recommender Systems Handbook, Ricci et al. [29] argue that utility represents the value that users receive from being recommended. According to this definition, if the user enjoys the recommended items, he/she received useful recommendations. Moreover, utility has been defined as an order of preference of consumption: if users would consume what they like most first, then recommending such items would help them find those items easily, bringing usefulness to the recommendation (Herlocker et al. [14]). Adamopoulos et al. [1] cite the use of utility theory from economics for improving user satisfaction. Kotkov et al. [20, 21] also mention in a survey that utility, or relevance, relates to what the user is interested in consuming and is therefore related to the user's tastes.
As can be seen, most of the definitions associate utility with the user's consumption desires and with whether the user enjoyed the recommendations. Under such a definition, metrics for assessing utility should focus on how the user might react to the predictions made by a recommender. Ricci et al. [29] mention that utility could be measured by evaluating the rating that the user gives to predicted items after consuming them. This method is likely to capture whether the recommendations brought value to the user; however, it would involve a costly online evaluation.
For offline evaluation, Herlocker et al. [14] mention the use of accuracy-based metrics for evaluating utility. The authors discuss the use of predictive accuracy metrics to evaluate whether the user consumed the recommended items, usually in a train/test experiment. In this paper, we use the notation \(util\left( {{R_u}} \right)\) for utility; the following subsections present several metrics for this purpose.
2.1.1 Error metrics
Error metrics are widely used for predictive accuracy. Mean Absolute Error (MAE) evaluates the difference between the ratings predicted by the recommender and those given by the users [14]. Equation 1 shows the MAE metric.
$$util\left( {{R_u}} \right)=~MAE=~\frac{{\mathop \sum \nolimits_{{i \in {R_u}}} \left| {p\left( i \right) - r(i)} \right|}}{{|{R_u}|}}$$
(1)
Moreover, Root Mean Squared Error (RMSE) is another error metric, as shown in Eq. 2. RMSE penalizes large errors in the rating prediction more heavily [29]. Both MAE and RMSE are calculated on the prediction list; therefore, the metrics are divided by \(|{R_u}|\). In addition, there are other error metrics, such as Average RMSE, Average MAE and Mean Squared Error.
$$util\left( {{R_u}} \right)=~RMSE=~\sqrt {\frac{{\mathop \sum \nolimits_{{i \in {R_u}}} {{\left( {p\left( i \right) - r(i)} \right)}^2}}}{{|{R_u}|}}}$$
(2)
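For illustration, a minimal Python sketch of Eqs. 1 and 2 is given below; the (predicted, observed) rating pairs and function names are our own assumptions and are not taken from any surveyed work.

```python
import math

def mae(pairs):
    """Mean Absolute Error (Eq. 1) over (predicted, observed) rating pairs."""
    return sum(abs(p - r) for p, r in pairs) / len(pairs)

def rmse(pairs):
    """Root Mean Squared Error (Eq. 2); large errors weigh more than in MAE."""
    return math.sqrt(sum((p - r) ** 2 for p, r in pairs) / len(pairs))

# Hypothetical predicted vs. observed ratings for the items in R_u
pairs = [(4.2, 4.0), (3.1, 5.0), (2.5, 2.0)]
print(mae(pairs), rmse(pairs))
```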
2.1.2 Precision and Recall
According to Ricci et al. [29], the precision of a recommendation is the proportion of consumed (or rated) items in the recommendation list, as stated in Eq. 3. Precision measures the rate of items in the recommendation list that the user likes and therefore consumed.
$$util\left( {{R_u}} \right)=~precision=~\frac{{|{C_u}\mathop \cap \nolimits^{} {R_u}|}}{{|{R_u}|}}$$
(3)
Recall, on the other hand, is calculated as the number of consumed items in the recommendation list out of the total number of items the user consumed [29]. Equation 4 shows the recall calculation. Authors have also referred to precision and recall as \(precision@N\) and \(recall@N\), where \(N\) stands for the size of the recommendation list.
$$util\left( {{R_u}} \right)=~recall=~\frac{{|{C_u}\mathop \cap \nolimits^{} {R_u}|}}{{|{C_u}|}}$$
(4)
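The following sketch illustrates how precision@N and recall@N (Eqs. 3 and 4) could be computed in a typical train/test protocol; the item identifiers and the sets R_u and C_u are hypothetical.

```python
def precision_at_n(recommended, consumed, n):
    """Eq. 3: fraction of the top-N recommendations that the user consumed."""
    top_n = set(recommended[:n])
    return len(top_n & set(consumed)) / n

def recall_at_n(recommended, consumed, n):
    """Eq. 4: fraction of the user's consumed items found in the top-N list."""
    top_n = set(recommended[:n])
    return len(top_n & set(consumed)) / len(consumed)

R_u = ["i1", "i2", "i3", "i4", "i5"]  # ranked recommendation list
C_u = ["i2", "i5", "i9"]              # items the user consumed (test set)
print(precision_at_n(R_u, C_u, 5), recall_at_n(R_u, C_u, 5))  # 0.4 0.666...
```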
In applications, Zhang et al. [38] evaluated their recommender against novelty, diversity and serendipity, and also used rank and recall in their metrics. Hurley and Zhang [15] also use precision in their evaluations.
2.1.3 ROC curves
Ricci et al. [29] mention the use of ROC curves in accuracy-based evaluation of recommendations. ROC curves measure the rate of items that the user likes in the recommendation list. Differently from error, precision and recall metrics, the calculation of ROC curves accentuates items that were suggested but the user disliked. Evaluation of algorithms in different scenarios can use the Area Under the ROC Curve (AUC) [29].
Herlocker et al. [14] also mention and exemplify that ROC curves can be plotted using the rate of useful and not useful items in a recommendation list. In this sense, an item can be defined as useful depending on whether the user liked/consumed it or not [14].
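As a sketch of how such a curve could be summarized offline, the example below assumes binary liked/disliked labels for the recommended items and the scores the recommender assigned to them; scikit-learn is used here only for convenience.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical ground truth (1 = liked/consumed, 0 = disliked) and the
# scores the recommender assigned to the same candidate items.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.75, 0.6, 0.55, 0.4, 0.35, 0.1]

fpr, tpr, _ = roc_curve(y_true, y_score)  # points of the ROC curve
print(roc_auc_score(y_true, y_score))     # area under that curve (AUC)
```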
2.1.4 Ranking score
Herlocker et al. [14] state that rank metrics are useful in evaluating recommendation lists. Recommenders usually predict ranked lists; however, users rarely browse through all of the items. Therefore, ranking metrics are interesting for measuring utility and rank information together. One example is the R-Score metric, which deducts value from recommendations according to their rank position: top-ranked items are valued more than items in the tail of the list [29].
Equation 5 shows the R-Score metric, where \(r({i_j})\) is the rating of the item at position \(j\) in the rank, \(d\) is a neutral (median) rating and \(\alpha\) represents a half-life decay value. Besides the R-Score, there are other ranking score metrics, such as the Kendall and Spearman rank correlations and the Normalized Distance-based Performance Measure [14].
$$util\left( {{R_u}} \right)=~rank({R_u})=~\mathop \sum \limits_{{j=1}}^{{|{R_u}|}} \frac{{{\text{max}}(r\left( {{i_j}} \right) - d,0)}}{{{2^{\frac{{j - 1}}{{\alpha - 1}}}}}}$$
(5)
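A minimal sketch of the R-Score of Eq. 5 follows; the neutral rating d and the half-life alpha are illustrative values, not parameters recommended by the surveyed works.

```python
def r_score(ratings_in_rank_order, d=3.0, alpha=5.0):
    """R-Score (Eq. 5): ratings above the neutral value d contribute,
    discounted by a half-life decay controlled by alpha (alpha > 1)."""
    score = 0.0
    for j, r in enumerate(ratings_in_rank_order, start=1):
        score += max(r - d, 0.0) / 2 ** ((j - 1) / (alpha - 1))
    return score

# Hypothetical observed ratings of the items in R_u, in rank order
print(r_score([5.0, 4.0, 2.0, 5.0, 3.0]))
```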
2.1.5 Utility-based metrics for online evaluation
Utility is also evaluated with users in online experiments. In this sense, researchers usually conduct user experiments to test the utility of their recommender systems, or evaluate it when the system is deployed in industry. Such experiments are also a good way to measure the overall goals of the system [32]. For these kinds of online experiments, some metrics are employed to evaluate the working recommender system.
Click-through rate (CTR) is the ratio of clicked/interacted recommended items out of the number of items recommended. It has been used since the early stages of the web in web/mobile advertisement and online marketing campaigns. CTR is also a major metric in the recommender systems industry, as it helps to study how many of the items recommended to users are effectively consumed. It has been mentioned or used in many works in the area, such as Farris et al. [10], Chu and Park [7] and Gomez-Uribe and Hunt [13]. The premise is that by clicking/interacting/consuming a recommended item, the user considers that recommendation useful. From a business point of view, it shows how effective the recommender system is in predicting useful items for the user. The metric can be seen in Eq. 6.
$$util\left( {{R_u}} \right)=~CTR=\frac{{|{C_u}|}}{{|{R_u}|}}$$
(6)
Retention is another useful metric for the online evaluation of recommender systems [32], both for user utility and for business. Retention measures the impact of the recommender system in keeping users consuming items or using the system. It has been applied in many scenarios, and it has been a focus of evaluation in systems such as Netflix [13]. In online services based on a monthly subscription fee, retention is an important business metric; evaluating the retention induced by recommendations helps keep track of how long users will stay on the system. Many algorithms therefore try to predict items that maximize such metrics. In [13], online retention experiments are performed as A/B tests with users and the retention delta is calculated, as shown in Eq. 7. In the authors' research, retention is calculated as the difference between the control (\({p_c}\)) and test (\({p_t}\)) groups of the A/B test the authors performed.
$$util\left( {{R_u}} \right)={\Delta _{retention}}~=~{p_t} - {p_c}$$
(7)
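A sketch of the two online metrics (Eqs. 6 and 7) is shown below; the item identifiers and retention figures are hypothetical.

```python
def click_through_rate(consumed, recommended):
    """CTR (Eq. 6): share of the recommended items the user clicked/consumed."""
    return len(set(consumed) & set(recommended)) / len(recommended)

def retention_delta(p_test, p_control):
    """Retention lift of an A/B test (Eq. 7): test retention minus control."""
    return p_test - p_control

print(click_through_rate(["i2", "i7"], ["i1", "i2", "i3", "i7", "i9"]))  # 0.4
print(retention_delta(0.83, 0.79))                                       # ~0.04
```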
Lastly, it is important to mention that the previously presented offline metrics for utility evaluation are also applicable to online evaluation. For instance, accuracy-based metrics, such as error metrics, precision and recall, are suitable for online evaluation as well.
2.3 Diversity
Diversity is a concept concerned with the variety of items in the recommendation list. It has been widely studied by previous researchers. For diversity metrics, the notation used in this paper is \(div({R_u})\).
According to Ricci et al. [29], diversity in RSs has the opposite effect of similarity. The authors state that recommendation lists with low variety may not be of interest to the user. Moreover, one of the earliest works concerned with diversification in recommendation is [41]. The authors argue that RSs usually predict items that are similar to the user's consumption history. Therefore, diversity means balancing recommendation lists to cover the user's whole set of interests [41]. In addition, Vargas and Castells [36] state that diversity refers to the variety of the items in the recommendation list. Moreover, Hurley and Zhang [15] and Zhang et al. [38] reinforce the definition of diversity from [41], stating that it is related to the variation of items in the predictions of an RS.
Differently from novelty, the definitions of diversity are largely consistent in the literature. All the authors surveyed in this work agree that diversity represents variety of items in recommendation lists.
As a result of this definition, the proposed metrics tend to calculate diversity as a dissimilarity between the items in the recommendation list. Ziegler et al. [41] proposed a metric for intra-list similarity, as Eq. 13 shows. The function \(d(i,j)\) compares items \(i\) and \(j\) in the recommendation list \({R_u}\), typically through a similarity function. This metric therefore captures the overall similarity of the list: low values indicate a more diverse list, in which the items are less similar to one another.
$$div({R_u})=~\mathop \sum \limits_{{i \in {R_u}}} \mathop \sum \limits_{{j \in {R_u}~i \ne j}} d(i,j)$$
(13)
The intra-list similarity metric has also been used by other works on diversity. Zhang et al. [38] used the metric proposed by Ziegler et al. [41] and chose the cosine similarity as the distance function, as can be seen in Eq. 14. Moreover, Hurley and Zhang [15] used a diversity metric similar to Eq. 13, where the distance function \(d(i,j)\) is computed by a memory-based Collaborative Filtering similarity measure.
$$div({R_u})=~\mathop \sum \limits_{{i \in {R_u}}} \mathop \sum \limits_{{j \in {R_u}~i \ne j}} cossim(i,j)$$
(14)
Separately, Vargas and Castells [36] proposed a distinct metric for calculating similarity. Their metric, stated in Eq. 15, is a more specific case of intra-list similarity. The metric takes into consideration a relative rank discount for the position of each pair of items being analyzed (\(disc(k)\) and \(disc(l|k)\)). Moreover, the metric also uses a distance function between the items (\(d({i_k},{i_l})\)), for instance the cosine distance. The approach proposed by Vargas and Castells [36] resembles techniques used in ranked Information Retrieval, for both the authors' novelty and diversity metrics.
$$div({R_u})=~\mathop \sum \limits_{{k=1}}^{{|{R_u}|}} \mathop \sum \limits_{{l=1}}^{{|{R_u}|}} disc\left( k \right)disc\left( {l{\text{|}}k} \right)d\left( {{i_k},{i_l}} \right)~\forall {i_k} \ne {i_l}$$
(15)
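To illustrate the intra-list formulation of Eqs. 13-14, the sketch below sums pairwise cosine similarities over the list (each unordered pair counted once, so values may differ from the equations by a constant factor); a rank discount as in Eq. 15 could be added by weighting each pair. The item feature vectors are hypothetical.

```python
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine similarity between two item feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def intra_list_similarity(items, sim=cosine_similarity):
    """Eqs. 13-14: sum of pairwise similarities inside the recommendation
    list; higher values indicate a more homogeneous (less diverse) list."""
    return sum(sim(i, j) for i, j in combinations(items, 2))

# Hypothetical content vectors (e.g., genre indicators) of the items in R_u
R_u = [(1, 0, 1), (1, 0, 0), (0, 1, 1)]
print(intra_list_similarity(R_u))
```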
2.4 Unexpectedness
Unexpectedness is a concept that has been increasingly mentioned in the literature, but its definitions are still uncertain. It is usually linked to surprise and to the avoidance of obviousness in recommendation. The notation used for the unexpectedness metrics in this paper is \(unexp({R_u})\).
Unexpectedness was first stated as a component of serendipity by McNee et al. [24] and Ge et al. [12]. In both works, the authors use the term unexpectedness to convey the idea of surprise in recommendation. Moreover, Kaminskas and Bridge [18] also mention that unexpectedness represents surprise in recommendation.
Unexpectedness has also been defined as a divergence from expected recommendations. Murakami et al. [25] also state that unexpectedness is a part of serendipity and describe it as a deviation from the items that the user expects to consume, although the authors mostly focus on serendipity. Adamopoulos et al. [1] also explain that the serendipity and unexpectedness concepts have been overlapping. The user's expectations consist of the set of items the user would like to consume next or the items the user forecasts will be recommended [1]. Therefore, unexpectedness would be a deviation from these expected items, evading obvious and uninteresting recommendations, with the possibility of surprising the user [1].
Measuring unexpectedness is not trivial, due to these overlapping definitions. Two sets of metrics have been proposed in the literature: metrics based on a primitive recommender and metrics based on principles that do not involve a primitive method. We present both sets of metrics as follows.
2.4.1 Primitive recommender based unexpectedness
According to Ge et al. [12], a primitive recommender usually predicts items that the user expects to consume. This consideration is reasonable when unexpectedness is seen as a deviation from expected recommendations. Therefore, Eq. 16 presents unexpectedness as the items in a recommendation list (\({R_u}\)) that are not in the set of predictions made by a primitive recommender (\(P{M_u}\)), as proposed by Ge et al. [12].
$$unexp\left( {{R_u}} \right)={R_u} - ~P{M_u}$$
(16)
The primitive recommender idea was later enhanced by Adamopoulos et al. [1], where the authors measure the rate of unexpected items in the recommendation list \(({R_u})\), as shown in Eq. 17. In this metric, \({E_u}\) is the set of expected items for the user; in short, \({E_u}\) plays the same role as \(P{M_u}\).
$$unexp\left( {{R_u}} \right)=\frac{{|{R_u} - ~P{M_u}|}}{{|{R_u}|}}$$
(17)
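A minimal sketch of the primitive recommender based metrics (Eqs. 16 and 17); the lists R_u and PM_u are hypothetical.

```python
def unexpected_items(recommended, primitive):
    """Eq. 16: recommended items not predicted by the primitive recommender."""
    return set(recommended) - set(primitive)

def unexpectedness_ratio(recommended, primitive):
    """Eq. 17: fraction of the recommendation list that is unexpected."""
    return len(unexpected_items(recommended, primitive)) / len(recommended)

R_u = ["i1", "i2", "i3", "i4"]   # recommendation list under evaluation
PM_u = ["i1", "i4", "i7"]        # predictions of a primitive recommender
print(unexpectedness_ratio(R_u, PM_u))  # 0.5
```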
The problem with primitive recommender based metrics lies in choosing an appropriate primitive recommender. The choice should be made considering the recommendation’s context. Users may have different expectations for movies and songs, for example. Moreover, different primitive recommenders will lead to different unexpectedness values. Therefore, using a primitive recommender may not be a trivial way to measure unexpectedness.
2.4.2 Non primitive recommender based unexpectedness
Metrics that assess unexpectedness without relying on a primitive recommender also exist in the literature. Kaminskas and Bridge [18] attempt to calculate surprise using the metrics represented by Eqs. 18 and 19. The point-wise mutual information function \(PMI(i,j)\) considers the probability of two items \(i\) and \(j\) being rated by the same users. The PMI function is \(PMI\left( {i,j} \right)=~\frac{{lo{g_2}\frac{{p(i,j)}}{{p\left( i \right)p(j)}}}}{{ - lo{g_2}p(i,j)}}\), where \(p\left( i \right)\) is the probability of item \(i\) being rated by users. In this case, the metric compares the recommended items with the history of the user, checking whether the user is likely to already know the predictions. Nevertheless, the metric may not effectively measure whether the user is surprised by the recommendations. Besides, as the authors explain, the \(PMI\) function may be biased towards rare items, which would always be considered unexpected to the user.
$$unexp\left( {{R_u}} \right)=\mathop \sum \limits_{{i \in {R_u}}} \mathop \sum \limits_{{j \in {H_u}}} PMI(i,j)$$
(18)
$$unexp\left( {{R_u}} \right)=\mathop \sum \limits_{{i \in {R_u}}} ma{x_{j \in {H_u}}}PMI(i,j)$$
(19)
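A sketch of the co-rating based surprise of Eqs. 18-19, assuming a small dictionary mapping users to rated items; the normalized PMI follows the formula transcribed above, and the handling of never/always co-rated pairs is our own assumption.

```python
import math

def pmi(i, j, ratings_by_user):
    """Normalized PMI of items i and j being rated by the same users."""
    n = len(ratings_by_user)
    p_i = sum(1 for items in ratings_by_user.values() if i in items) / n
    p_j = sum(1 for items in ratings_by_user.values() if j in items) / n
    p_ij = sum(1 for items in ratings_by_user.values()
               if i in items and j in items) / n
    if p_ij == 0:
        return -1.0  # never co-rated: assumed maximally surprising
    if p_ij == 1:
        return 1.0   # always co-rated: assumed fully expected
    return math.log2(p_ij / (p_i * p_j)) / -math.log2(p_ij)

def unexpectedness_max(recommended, history, ratings_by_user):
    """Eq. 19-style score: for each recommended item, its strongest PMI link
    to the user's history, summed over the list (lower = more surprising)."""
    return sum(max(pmi(i, j, ratings_by_user) for j in history)
               for i in recommended)

ratings_by_user = {"u1": {"a", "b"}, "u2": {"a", "c"},
                   "u3": {"b", "c"}, "u4": {"a", "b", "c"}}
print(unexpectedness_max(["c"], ["a", "b"], ratings_by_user))
```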
Akiyama et al. [3] proposed a non-personalized metric for unexpectedness that does not consider the users' information. The metric, shown in Eq. 20, uses an idea of co-occurrence, but it is limited to items and their features. For instance, \({I_v}\) denotes the number of items that have feature \(v\), and \({I_{v,w}}\) denotes the number of items that have both features \(v\) and \(w\). The probability of co-occurrence uses the items' features to measure how similar these items are. The authors explain how to calculate the unexpectedness of a single item; however, one could calculate the unexpectedness of an entire recommendation list \({R_u}\) by summing or averaging the unexpectedness of its items. Since this metric is not personalized, it is unlikely to measure unexpectedness from the users' perspective.
$$unexp\left( {{R_u}} \right)=~\frac{1}{{\frac{1}{{|{F_i}|}}\mathop \sum \nolimits_{{v,w \in {F_i}}} \frac{{{I_v}}}{{{I_v}+{I_w} - {I_{v,w}}}}}}$$
(20)
In summary, the definitions and metrics for unexpectedness are still unclear in the literature. In general, unexpectedness means surprise and the avoidance of the users' expectations. However, the definitions presented somewhat overlap with other concepts, such as serendipity, and different metrics exist to measure unexpectedness. In Silveira et al. [34], the authors summarize metrics for unexpectedness and propose a methodology for unexpectedness evaluation in recommender systems.
2.5 Serendipity
Serendipity has been increasingly used in recommender systems; however, it has a complicated definition. In this section, metrics for serendipity use the notation \(ser({R_u})\).
The term serendipity means a lucky finding or a satisfying surprise. According to Ricci et al. [29], serendipity represents surprising recommendations. One of the earliest mentions of serendipity in the literature comes from [14], where the authors use the word serendipity for items that are surprising and interesting to the user. The same is stated by Iaquinta et al. [17], who mention that serendipity represents items that the users would hardly find by themselves. Moreover, Ge et al. [12] define serendipitous items as surprising and pleasant. As can be seen, most authors agree that the serendipity concept involves a good and pleasant surprise. However, it is necessary to state that serendipity is a perception of the users with regard to the recommendations they receive [12, 20].
Other definitions of serendipity can be found in the literature. For instance, Zhang et al. [38] and Kotkov et al. [20, 21] say that serendipitous recommendations are unusual and surprising to the users. Furthermore, serendipity can also be seen as a good emotional response to a novel recommendation that the users were not expecting to receive [1]. In this definition, Adamopoulos et al. [1] conclude that serendipitous recommendations are novel, unexpected and useful. Ge et al. [12] and Murakami et al. [25] also associate serendipity with unexpectedness.
Several other works have studied the serendipity problem in recommendation. For example, Lu et al. [11], Onuma et al. [22] and Gemmis [27] are some of the works that propose algorithms for serendipity. Additionally, Kotkov et al. [20, 21] conducted a large survey on serendipity. In summary, it can be concluded that even though serendipity has a hard-to-pin-down definition, most authors agree that it represents a delightful surprise, providing useful and surprising items to the user.
Metrics have been proposed to measure serendipity in recommendation lists, and most of them relate to the concepts serendipity involves: level 2 novelty, unexpectedness and utility. Some metrics use the unexpectedness notion of a primitive recommender. Therefore, we divide the metrics into primitive recommender based metrics and non primitive recommender based metrics.
2.5.1 Primitive recommender based serendipity
To our knowledge, the first metric proposed to evaluate serendipity was presented by Murakami et al. [25]. The metric can be seen in Eq. 21. \(P{M_u}\) denotes the predictions of the primitive recommender. Moreover, the metric uses a relevance function (\(rel\)), which indicates whether a predicted item is relevant to the user: \(1\) for relevant and \(0\) for irrelevant. The position in the recommendation rank is also taken into consideration through the factor \(\frac{{coun{t_u}(k)}}{k}\).
$$ser({R_u})=\mathop \sum \limits_{{k=1}}^{{|{R_u}|}} {\text{max}}({R_u}\left[ k \right] - ~P{M_u}\left[ k \right],0)rel({i_k})\frac{{coun{t_k}(k)}}{k}$$
(21)
The metric proposed by Murakami et al. [25] was also used by Ge et al. [12], as can be seen in Eq. 22. The metric was simplified and the rank of the items in the list is no longer considered. Moreover, \(UNEX{P_u}\) represents the surprising items for the user \(u\), which is calculated as an unexpectedness metric (\(UNEX{P_u}=~{R_u} - ~P{M_u}\)). The authors maintain the utility function.
$$ser({R_u})=\frac{{\mathop \sum \nolimits_{{i\epsilon UNEX{P_u}}} {\text{utility}}({\text{i}})}}{{|{R_u}|}}$$
(22)
Adamopoulos et al. [1] also utilize the same metric with some terminology variations, as can be seen in Eq. 23. For the authors, serendipity is the rate of unexpected items \(({R_u} - {E_u})\) that are also useful. Again, \({E_u}\) represents the set of expected items and plays the same role as \(P{M_u}\). \(USEFU{L_u}\) is the set of useful items in the recommendation list.
$$ser({R_u})=\frac{{|({R_u} - {E_u})\mathop \cap \nolimits^{} USEFU{L_u}|}}{{|{R_u}|}}$$
(23)
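With a binary utility judgment, the serendipity metrics of Eqs. 22 and 23 reduce to counting items that are both unexpected and useful; the sketch below makes that assumption explicit, and the sets are hypothetical.

```python
def serendipity(recommended, expected, useful):
    """Eqs. 22-23: share of the list that is unexpected (not predicted by the
    primitive recommender / not expected) and also judged useful."""
    unexpected = set(recommended) - set(expected)
    return len(unexpected & set(useful)) / len(recommended)

R_u = ["i1", "i2", "i3", "i4"]      # recommendation list under evaluation
E_u = ["i1", "i4"]                  # expected (primitive recommender) items
USEFUL_u = ["i2", "i4"]             # items judged useful, e.g. rated highly
print(serendipity(R_u, E_u, USEFUL_u))  # 0.25
```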
As mentioned in the unexpectedness subsection, primitive recommender based metrics depend on choosing a primitive recommender; selecting different primitive recommenders will result in different values of serendipity. For instance, in Adamopoulos et al. [1], to select the expected items for the user, the authors use the users' profiles and a set of rules about the two datasets evaluated. Moreover, a utility function must be appropriately selected to calculate the usefulness of the items to the user. Again, different utility functions will result in different values for serendipity.
2.5.2 Non primitive recommender based serendipity
Zhang et al. [38] proposed a metric that calculates the cosine similarity between the recommended items (\({R_u}\)) and the consumption history of the user (\({H_u}\)). The metric is shown in Eq. 24. In this case, low values represent more serendipitous recommendation lists. The metric is reasonable, since the recommended items should not be very similar to the user's consumption profile. However, the metric does not evaluate the usefulness of the recommendations; only the element of surprise is considered. Therefore, it may in fact be evaluating novelty or unexpectedness instead.
$$ser({R_u})=\frac{1}{{|{H_u}|}}\mathop \sum \limits_{{i\epsilon {H_u}}} \mathop \sum \limits_{{j\epsilon {R_u}}} \frac{{cossim(i,j)}}{{|{R_u}|}}$$
(24)
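A sketch of Eq. 24 follows, assuming each item is represented by a content vector so that the cosine similarity between the recommendation list and the user's history can be averaged; the vectors are hypothetical.

```python
def cosine(a, b):
    """Cosine similarity between two item feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def zhang_serendipity(recommended, history):
    """Eq. 24: mean cosine similarity between recommended items and the
    user's consumption history; lower values suggest a more surprising list."""
    total = sum(cosine(i, j) for i in history for j in recommended)
    return total / (len(history) * len(recommended))

H_u = [(1, 0, 1), (1, 1, 0)]  # hypothetical history item vectors
R_u = [(0, 1, 1), (0, 0, 1)]  # hypothetical recommended item vectors
print(zhang_serendipity(R_u, H_u))
```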
In short, serendipity is a complex concept. Even though most authors agree that it represents usefulness and surprise, it is not known whether the existing metrics effectively evaluate serendipity. Metrics such as the one by Murakami et al. [25] are sensitive to the selection of a primitive recommender. Moreover, the literature uses distinct metrics to evaluate the same concept.
To further enrich the discussion on how good Recommender Systems are and how they are evaluated, we present a review of works in the literature that evaluated state-of-art recommenders with regard to the aforementioned concepts. This brief review does not focus on comparisons, since different works evaluate their recommender systems with regard to different concepts; furthermore, even the same concepts are evaluated with different metrics. Therefore, we center our attention on summarizing the main achievements in performance with regard to the six concepts of evaluation. For the sake of summarization, Table 4 in the appendix summarizes the performances of the state-of-art recommenders reviewed in this subsection.
With regard to utility, most articles evaluate their recommender systems with offline accuracy metrics. As an example, some collaborative filtering algorithms were initially evaluated through the MAE and RMSE metrics. Wen [37] reviewed Item-based KNN, Item-based EM and Sparse SVD methods for the recommender system problem. The author evaluated the RMSE metric (Eq. 2) on a Netflix database and obtained test-set results of 0.95, 0.91 and 0.8974 for the Item-KNN, Item-EM and SVD methods, respectively.
Novelty has been evaluated in many ways in the literature, using different metrics and definitions. For instance, Vargas and Castells [36] evaluated state-of-art methods using the level 3 novelty metric stated in Eq. 10. The authors used Matrix Factorization, IA-Select and MMR methods in their evaluation of novelty and, considering a relevance discount, obtained novelty values of 0.058, 0.0639 and 0.0620, respectively, on the Movie Lens dataset. Without the relevance discount, the authors achieved novelty results of 0.76, 0.8080 and 0.7605, respectively, also on the Movie Lens dataset. The authors also evaluated the same methods on the Last.fm dataset, where they obtained novelty ratios of 0.2671, 0.3462 and 0.2439, respectively, considering the relevance discount, and 0.8949, 0.8912 and 0.9133, respectively, disregarding the relevance discount. It can be seen that more novelty was achieved on the Last.fm dataset. However, it is noteworthy that, in this case, high novelty may come with low values of accuracy or other utility measurements, because novel items are not necessarily highly useful to the users.
Vargas and Castells [36] also evaluated the state-of-art recommenders with regard to their diversity metric. In this case, the authors used the diversity metric described in Eq. 15. For the Movie Lens dataset, the recommenders Matrix Factorization, IA-Select and MMR achieved 0.0471, 0.0537 and 0.0510, respectively, considering a relevance discount. Ignoring the relevance, the resulting diversity ratios increased to 0.7164, 0.8289 and 0.7191. Again, similarly to the novelty case above, high diversity may negatively impact utility. In addition, De Pessemier et al. [8] evaluated group recommendation on the Movie Lens dataset; diversity was evaluated for User-based and Item-based Collaborative Filtering algorithms and an SVD method, although the authors do not further specify which SVD algorithm was used. The metric used for diversity was the intra-list similarity, as Eq. 13 states. When considering a group size equal to 1, the diversity (intra-list similarity) was 0.7, 0.81 and 0.64 for the User-based CF, Item-based CF and SVD methods, respectively.
Regarding the concept of unexpectedness, it is not usually assessed by works in the literature. Adamopoulos et al. [1] is one of the few works to evaluate unexpectedness and compare their own method with state-of-art baselines. The authors evaluated Item- and User-based KNN and Matrix Factorization methods. However, the authors evaluate many combinations of parameters and do not mention which baselines are compared against their recommending methods; instead, they report the average values over their experimental settings. The authors evaluate unexpectedness through the metric shown in Eq. 17. Two datasets were analyzed in this case: Movie Lens and Book Crossing. For the Movie Lens dataset, the unexpectedness ratios for their baselines are 0.71 and 0.75 for recommendation lists of size 10 and 100, respectively, using related movies as expected recommendations. For the Book Crossing dataset, the values of unexpectedness are more modest: 0.3 and 0.38 for sizes 10 and 100, respectively, using related books as expected recommendations.
Although there are works which study the serendipity concept and propose new methods and metrics, few of them effectively use state-of-art baselines for comparison. Lu et al. [22] evaluated serendipity on two datasets, Netflix and Yahoo! Music, using popular methods: SVD, SVD++ and SVDNbr. The metric used for assessing serendipity was analogous to Eq. 22. The authors used different loss functions and optimization methods; their best serendipity results for each method on the Netflix dataset were 0.2534, 0.3036 and 0.2978 for SVD, SVD++ and SVDNbr, respectively. For the Yahoo! Music dataset, the best results were 0.1995, 0.2771 and 0.3834 for the three methods, respectively.
There have also been attempts to evaluate a temporally evolving system with regard to the aforementioned concepts. Shi et al. [33] addressed the performance of recommender systems in a temporally dynamic setting. The authors described the datasets (Movie Lens and Netflix) as bipartite networks of users and items divided into a series of subsets that consider the recommendations and the time series. The authors used a recommendation-based evolution method to simulate the temporal dynamics of three common collaborative filtering strategies. They evaluate their results at each time step with utility (RMSE, Eq. 2), intra-list similarity (similar to Eq. 13) and system-level novelty (a metric similar to Eq. 9). For the Movie Lens dataset, for instance, the authors were able to maintain low levels of similarity, around 0.025, with Item-based CF throughout the temporal evolution process.
Lastly, as mentioned earlier, the coverage concept is also under-explored, both in metric definitions and in works concerning this concept. Nevertheless, one of the few works that evaluate coverage is Adamopoulos et al. [1]. In the authors' work, catalog coverage is evaluated using a metric similar to Eq. 25. For the Movie Lens dataset, the authors obtained rates of 0.05 and 0.15 of items ever recommended to the users for prediction list sizes of 10 and 100, respectively. Moreover, for the Book Crossing dataset, the coverage ratio is about 0.2 and 0.5 for prediction list sizes of 10 and 100.
It is important to notice that the reviewed works consider different datasets, methods, evaluation metrics, implementations and evaluation methodologies when studying the impact of the user satisfaction concepts on recommendation. We emphasize the need for new work that can effectively provide user satisfaction evaluations of state-of-art recommenders under the same evaluation scenario and datasets, considering the mentioned concepts. Moreover, to complement such research, this work could also analyse the impact of evaluating and optimizing all of those concepts simultaneously and attempt to combine recommendations with different objectives, similarly to what was performed, to a limited extent, by Zhang et al. [38] and Ribeiro et al. [28].