

01.12.2019 | Research | Issue 1/2019 | Open Access

# Recommendation algorithm based on user score probability and project type

Journal:
EURASIP Journal on Wireless Communications and Networking > Issue 1/2019
Authors:
Chunxue Wu, Jing Wu, Chong Luo, Qunhui Wu, Cong Liu, Yan Wu, Fan Yang
## Abbreviations

- **GSCF**: collaborative filtering recommendation algorithm based on the graph structure
- **IBCF**: item-based collaborative filtering recommendation algorithm
- **MSE**: mean square error
- **RMSE**: root mean square error
- **UBCF**: user-based collaborative filtering recommendation algorithm
- **UPCF**: collaborative filtering recommendation algorithm based on user score probability and project type

## 1 Introduction

The recommendation algorithm is an important tool for helping users cope with information overload in the era of big data [1]. In the scoring matrix, the user's scoring behavior and scoring values are the basis on which the recommendation algorithm recommends products. Because the number of commodities is so large in the era of information explosion, a user can score only a few projects of interest. This results in sparse and incomplete scoring data in the user-product scoring matrix, which can make it impossible to find similar neighbors of the target user. Without similar neighbors, the recommendation algorithm either cannot recommend products to the user or recommends inappropriate ones.
The primary cause of sparse data in the scoring matrix of a recommendation system is that users do not take the initiative to score commodities. The scores in the scoring matrix are therefore not distributed randomly but depend on the user's subjective choice. Traditional recommendation algorithms assume that users randomly choose commodities to score, and that a high score indicates the user likes the product while a low score indicates dislike. In [2], it is shown that this hypothesis of the traditional recommendation algorithm is inaccurate and does not match the reality of the massive-information era, because conventional algorithms ignore the user's subjective behavior.
In the big data era of information explosion, the number of commodities is enormous. Users can access only a small number of commodities and then choose the types they prefer from that small number to score. This makes the scoring data in the user-commodity scoring matrix sparse, which degrades recommendation accuracy. Choosing products and scoring them is an invisible embodiment of the user's interest preference.
On the basis of the conventional algorithm, this paper integrates the user's subjective scoring behavior and proposes an algorithm that fuses score preference and project type. Compared with the traditional assumption that users randomly choose products to score, the improved recommendation algorithm (UPCF) makes full use of the user's subjective behavior in choosing and scoring commodities. The two-step predictive recommendation algorithm proposed in [3] and the probabilistic latent semantic recommendation algorithm based on autonomous prediction are similar to this work, but the improved algorithm differs from them in two ways. First, the method for calculating the probability that a user scores a product is different. Second, the UPCF algorithm takes the product type into account in the similarity calculation.
1. For the accuracy problem, the project type is integrated into the similarity calculation. Combining the similarity calculated from the scoring matrix with the similarity obtained from the commodity type makes the similarity calculation more accurate.

2. For the problem of data sparsity, the fundamental cause is that users do not take the initiative to score projects. This paper calculates the user score probability by analyzing the user's historical scoring behavior and the commodity type. From the score probability and commodity type, the similarity S2 of two users is calculated; the similarity S1 is calculated from the score matrix. Combining the two similarities overcomes the problem that data sparsity prevents calculating user similarity.

## 2 The recommendation algorithm model of score preference and project type

### 2.1 User behavior information

The core idea of the recommendation algorithm is to obtain the information implied in the user’s behavior, identify the user’s behavior, use the collective wisdom [4] to match the user, and recommend the product to the user.
The traditional recommendation algorithm pays attention only to the values of the scores a user gives [5, 6], ignoring the implicit information in the user's scoring behavior. Traditional algorithms assume that the user randomly selects some commodities and scores them according to his degree of preference: a high score shows the user likes the commodity, and a low score shows he does not. In the era of online-shopping information explosion, a user's shopping behavior is driven by his own needs and preferences [7–10]. The user scores goods according to commodity quality, customer-service attitude, logistics speed, and other factors. If the user is not interested in a commodity, he will not buy it at all; a low score therefore only indicates that the user is not satisfied with that particular product. In the traditional algorithm, this dissatisfaction spreads to commodities of the same type, making the system believe the user's preference for similar products has dropped and affecting recommendations for them. Selecting a product and scoring it is thus a subjective behavior of the user, and this behavior is an invisible embodiment of the user's interest preference [11–13].
About users’ behavior, there are some different views between this paper and the traditional algorithms:
1. Different views on scoring behavior. The traditional algorithm considers the user's scoring behavior random. This paper considers scoring behavior an implicit embodiment of the user's interest preference: the user scores only the commodities he is interested in.

2. Different explanations of scoring-data sparsity [14–17]. Traditional algorithms attribute it to the randomness of scoring behavior. This paper holds that users choose and score only products they are interested in, so the missing data result from the user's subjective choice.

3. Different views on the level of the score value. Traditional algorithms consider a high score to mean the user likes the commodity and a low score to mean he does not. This paper holds that as long as the user gives a score, regardless of its value, the user has a preference for this kind of commodity.

For example, as shown in Fig. 1, user A loves action movies but not science fiction or comedy. Because the number of movies on the video site is very large, user A chooses his favorite kind of movie to watch: he selects an action movie on the site and scores it after watching. Because he does not like comedies or science fiction films, he never evaluates such films. User A finds a movie called “IP MAN” [18, 19] in the action category; after watching it, he feels the picture clarity is poor and gives the film a low score. If user A did not like action movies, he would not watch this category at all. The traditional collaborative filter [20, 21] takes a low score to mean the user does not like this type of film, but in fact, if the user did not like the type, he would pay no attention to it, let alone watch and score it.

### 2.2 Recommendation algorithm model of user score preference and project type

This paper holds that the score value alone cannot indicate whether the user likes this kind of commodity; it only indicates whether the user is satisfied with the current product. Users pay attention to and consume the products they are interested in: they give high scores to products they are interested in and satisfied with, and low scores to products they are interested in but not satisfied with. Therefore, regardless of whether the score is high or low, the very act of scoring fully indicates that the user is interested in this type of product.
Based on the above viewpoints and the implicit expression of the user score behavior [22, 23], this paper designs a recommendation algorithm model based on user score preference and project type. By analyzing the user score behavior, our algorithm obtains the user’s preference for the commodity. According to the preference of the user, we can predict the probability of the user to score the target commodities. The recommendation system combines the similarity calculated from the score value with the similarity calculated from the user score probability and the project type to make the similarity between the users more accurate. The similarity calculation of recommendation algorithm framework based on score preference and the project type is shown in Fig. 2.
As shown in Fig. 2, the behavior that a user gives a score to a commodity is called the user rating behavior. The score value that the user gives to the commodity is called the user rating value. The frame calculates the score probability P based on the user rating behavior and calculates the similarity S2 according to the score probability P and the project type. Then, we calculate the similarity S1 according to the user rating value. Finally, we combine the two similarity values S1 and S2 to obtain the ultimate similarity S [24, 25].
On the basis of the conventional neighborhood-based algorithm, the ideas about user behavior in Fig. 1 and the improved similarity of Fig. 2 are integrated. The result is the improved framework of this paper, called the recommendation algorithm framework of score preference and project type. This framework makes full use of the user's preference information and calculates the user's interest in a kind of commodity, that is, the probability of scoring it. The scoring probability and the improved similarity are then obtained by calculation [26]. The process of the score-preference and project-type model is shown in Fig. 3.
The recommendation algorithm based on the user score preference and project type combines the probability of the user to score the commodity with the user’s prediction value to the product. Some studies have shown that making full use of the user’s behavior of scoring the product can effectively improve the recommendation accuracy of recommendation algorithm. The model takes full advantage of the user’s scoring behavior, excavates the user’s implied hobbies, and predicts the product type that the user may be interested in.
The core idea of the model based on user score preference and project type is shown in Fig. 3. First, we calculate the similarity S1 from the score matrix. Then, we calculate the probability Pro of the user scoring a product from the preference information implied in the user rating behavior. The second similarity S2 is calculated from Pro and the product type. Because the two similarities carry different weights, the final user similarity S is obtained by combining S1 and S2 [27, 28]. For all users, the pairwise similarities form an M×M similarity matrix (M is the number of users).
For example, to calculate the similarity of user A and user B, the algorithm reads the scoring information from the data set, obtains a 943×1682 score matrix (943 users, 1682 commodities), and calculates S1(A, B) from the score matrix with Pearson's formula. Then, a 943×18 score count matrix and a 943×18 score probability matrix are created (943 users, 18 types). Next, the algorithm traverses the score matrix and records the number of scores the user gave to each type into the 943×18 score count matrix. It then traverses the score count matrix, calculates the probability of the user scoring each type, and records it in the scoring probability matrix. The second similarity S2(A, B) is obtained from the probability matrix with Pearson's formula, and the final S(A, B) is obtained by combining S1(A, B) and S2(A, B). Calculating similarity this way for all pairs yields a 943×943 similarity matrix.
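The worked example above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the random ratings, the uniform genre assignment, and the weight beta = 0.5 are all assumptions made only so the pipeline runs end to end.

```python
import numpy as np

# Shapes from the example: 943 users, 1682 movies, 18 types (synthetic data).
n_users, n_items, n_types = 943, 1682, 18
rng = np.random.default_rng(0)
grade = rng.integers(0, 6, size=(n_users, n_items))   # 0 means "not scored"
item_type = rng.integers(0, n_types, size=n_items)    # type of each commodity

# 943x18 score count matrix: how many commodities of each type a user scored.
count = np.zeros((n_users, n_types))
for m in range(n_items):
    scored = grade[:, m] != 0
    count[scored, item_type[m]] += 1

# 943x18 score probability matrix: Pr(u, j) = N(j) / M  (Eq. 2).
totals = count.sum(axis=1, keepdims=True)
prob = np.divide(count, totals, out=np.zeros_like(count), where=totals != 0)

def pearson(x, y):
    """Pearson correlation of two vectors; 0 when undefined."""
    xd, yd = x - x.mean(), y - y.mean()
    denom = np.sqrt((xd ** 2).sum() * (yd ** 2).sum())
    return float(xd @ yd / denom) if denom > 0 else 0.0

# Final similarity S(A, B) = beta*S1 + (1 - beta)*S2 for a pair of users.
beta = 0.5                                  # assumed weight
A, B = 0, 1
common = (grade[A] != 0) & (grade[B] != 0)  # commodities scored by both
s1 = pearson(grade[A, common].astype(float), grade[B, common].astype(float))
s2 = pearson(prob[A], prob[B])
s = beta * s1 + (1 - beta) * s2
```

Looping this over every user pair produces the 943×943 similarity matrix described above.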
The UPCF algorithm takes full advantage of the user's scoring behavior and the product's type information, which is the main difference between UPCF and the traditional recommendation algorithms. The two-step prediction recommendation algorithm proposed in [3] and the probabilistic latent semantic recommendation algorithm based on autonomous prediction proposed in [13, 29] also make full use of the user's scoring behavior. The difference is that the UPCF algorithm uses a different approach to calculate the score probability and considers the project's type information when calculating similarity.
In order to verify the effectiveness of the framework, this paper combines IBCF with the framework in Fig. 2 to propose an algorithm of fusing score preferences and project types. The next section will detail the UPCF algorithm.

## 3 The algorithm of fusing score preference and project type

This section takes the user's subjective scoring behavior into consideration on the basis of the traditional neighborhood-based recommendation algorithm and proposes the algorithm of fusing score preference and item type. UPCF is short for the collaborative filtering recommendation algorithm based on user score probability and project type.

### 3.1 Prediction of user score probability

The user’s scoring preferences can also be used to calculate the user’s score probability. The scoring value of all commodities in the score matrix can be regarded as an n-dimensional score vector, as follows:
$$P(U)=\left({I}_1,{I}_2\dots \dots {I}_n\right)$$
(1)
If a component is not 0, the user has scored that commodity; otherwise, there is no score. We traverse the target user's n-dimensional score vector, count the types of the commodities and the number of times each type is scored, and put the statistics into a list. Each item in the list is an <i, n> pair, where i is the commodity type and n is the number of times commodities of that type have been scored. We predict the user's interest in a type of product, that is, the probability that the user will score such a commodity, from the number of scores the user gave to that type. If the target commodity type is j, N(j) is the number of scores u gave to type j, and M is the total number of commodities the user scored. The user score probability is calculated as follows:
$$\Pr \left(u,j\right)=N(j)/M$$
(2)
The specific implementation of the score probability prediction is shown in the following pseudo code:
```
Proba(int[][] grade)                      // prediction of scoring probability
    create a new 943×18 matrix pro
    traverse grade:
        if grade[a][m] != 0:
            get the type k of the movie m
            pro[a][k] += 1                // count scores per user and type
    divide each row of pro by that user's total number of scores  // Pr(u, j) = N(j)/M
```

### 3.2 Improvements in similarity calculations

Cosine similarity and the Pearson correlation coefficient [30–32] are the most common ways to calculate similarity in conventional collaborative filtering algorithms. Regardless of the method, however, the basis of the calculation is the scores on commodities co-rated by the users, which ignores the commodity type [33]. This section incorporates the commodity type and the user's scoring probability into the traditional similarity calculation and adjusts the ratio of the two similarities. The improved similarity formula is as follows:
$$\mathrm{Sim}\left({U}_i,{U}_j\right)=\beta S\left({U}_i,{U}_j\right)+\left(1-\beta \right){S}_{\mathrm{sort}}\left({U}_i,{U}_j\right)$$
(3)
where Sim(Ui, Uj) is the final similarity, S(Ui, Uj) is the similarity calculated from the user's score values, and Ssort(Ui, Uj) is the similarity calculated from the commodity type and the user's scoring probability. The formula for Ssort(Ui, Uj) is as follows:
$${S}_{\mathrm{sort}}\left({U}_i,{U}_j\right)=\frac{\sum \limits_{k\in L\left({U}_i\right)\cap L\left({U}_j\right)}\left({P}_{U_ik}-{\overline{P}}_{U_i}\right)\left({P}_{U_jk}-{\overline{P}}_{U_j}\right)}{\sqrt{\sum \limits_{k\in L\left({U}_i\right)\cap L\left({U}_j\right)}{\left({P}_{U_ik}-{\overline{P}}_{U_i}\right)}^2}\sqrt{\sum \limits_{k\in L\left({U}_i\right)\cap L\left({U}_j\right)}{\left({P}_{U_jk}-{\overline{P}}_{U_j}\right)}^2}}$$
(4)
where L(Ui) is a type collection of commodities that are scored by Ui. L(Uj) is a type collection of commodities that are scored by Uj. $${P}_{U_ik}$$ is the scoring probability of Ui for the k type, and $${\overline{P}}_{U_i}$$ is the average of the scoring probability of Ui for all the types. The k type is one of the intersection types scored by Ui and Uj.
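Eq. (4) can be sketched directly as a function of two per-type probability vectors. This is one reading of the formula: the paper does not specify whether the averages run over all 18 types or only each user's own scored types, so taking the mean over each user's scored types is an assumption noted in the docstring.

```python
import numpy as np

def s_sort(p_i, p_j):
    """Pearson-style similarity of Eq. (4) over the types both users scored.

    p_i, p_j: per-type scoring-probability vectors of users Ui and Uj.
    A type counts as "scored" when its probability is positive, and each
    mean is taken over that user's own scored types (an assumption where
    the paper is ambiguous).
    """
    common = (p_i > 0) & (p_j > 0)            # k in L(Ui) ∩ L(Uj)
    if common.sum() < 2:
        return 0.0                            # too few shared types
    di = p_i[common] - p_i[p_i > 0].mean()    # P_{Ui k} - mean for Ui
    dj = p_j[common] - p_j[p_j > 0].mean()    # P_{Uj k} - mean for Uj
    denom = np.sqrt((di ** 2).sum()) * np.sqrt((dj ** 2).sum())
    return float((di * dj).sum() / denom) if denom > 0 else 0.0
```

The result plugs into Eq. (3) as Ssort(Ui, Uj), to be mixed with the score-based similarity via the weight β.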
The selection of the nearest neighbor, the score prediction, and the scoring criteria are described in detail in the following subsections.

### 3.3 The selection of the nearest neighbor

There are two conditions for choosing the target user's nearest neighbors [34–37]. First, the selected neighbor is highly similar to the target user. Second, the selected neighbor has scored the target commodity.
When selecting the best neighbors, a threshold must be set to prevent weakly similar individuals from affecting the final result of collaborative filtering. Only a neighbor who has scored the target product and whose similarity exceeds the threshold can become a target neighbor. The selection of neighbors is as follows:
$$\mathrm{KN}\left({U}_m\right)=\left\{{U}_n \mid \mathrm{Sim}\left({U}_m,{U}_n\right)>\sigma,\ \mathrm{Score}\left({U}_n,I\right)\ne 0,\ m\ne n\right\}$$
(5)
where KN(Um) is the neighbor list of the user Um and σ is the threshold, which can be set to the average similarity of all users similar to Um. The specific implementation of selecting neighbors is as follows:
```
FindNeighbor(int i, int k, int[][] grade, int[][] similar)
    // i: target user, k: target commodity,
    // grade: score matrix, similar: similarity matrix
    for each user j != i:
        if grade[j][k] != 0 and similar[i][j] != 0:
            List_N[j] = similar[i][j]     // List_N is the neighbor list
    Sort(List_N)                          // sort neighbors by similarity
    choose the N nearest neighbors we need
```

### 3.4 Calculation of prediction score

In calculating the predicted score, the traditional collaborative filtering algorithm focuses only on the similarity [38, 39] between the neighbor and the target user and on the neighbor's score for the prediction item. However, each user has different scoring criteria: some users give three points to show they like a product, while others need to give five points to express the same meaning. To resolve this difference between users, this algorithm takes the average value of the user's scores into account. The predicted rating of user Ui for item v is
$$\mathrm{Score}\left({U}_i,v\right)=\left({\overline{r}}_{ui}+\frac{\sum \limits_{U_j\in \mathrm{KN}\left({U}_i\right)}\mathrm{Sim}\left({U}_i,{U}_j\right)\left({r}_{U_jv}-{\overline{r}}_{uj}\right)}{\sum \limits_{U_j\in \mathrm{KN}\left({U}_i\right)}\mathrm{Sim}\left({U}_i,{U}_j\right)}\right)f$$
(6)
where $$f=\exp \left\{-1+\alpha \left({\overline{r}}_{ui}-{\overline{r}}_{uj}\right)\right\}$$, exp is the exponential function with base e, $${\overline{r}}_{ui}$$ is the average score of Ui, $${r}_{U_jv}$$ is the score of Uj for v, α is the attenuation factor, and Score(Ui, v) is the predicted score of Ui for v.
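A small sketch of Eq. (6). Note the formula writes a single factor f even though f depends on Uj; applying f to each neighbor's deviation, as below, is one plausible reading rather than the paper's confirmed implementation, and alpha = 0.05 is an assumed value.

```python
import math

def predict_score(r_bar_i, neighbors, alpha=0.05):
    """Predicted score of Ui for item v (sketch of Eq. 6).

    r_bar_i   : average score of the target user Ui
    neighbors : list of (sim, r_jv, r_bar_j) tuples for each Uj in KN(Ui)
    alpha     : attenuation factor (assumed value)
    """
    num = den = 0.0
    for sim, r_jv, r_bar_j in neighbors:
        # attenuation factor f = exp(-1 + alpha*(r_bar_i - r_bar_j)),
        # folded in per neighbor (an assumption; see lead-in)
        f = math.exp(-1 + alpha * (r_bar_i - r_bar_j))
        num += sim * (r_jv - r_bar_j) * f
        den += sim
    return r_bar_i + num / den if den else r_bar_i
```

Subtracting each neighbor's mean before weighting is what compensates for users with different scoring criteria.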

### 3.5 Evaluation indicators

The evaluation indicators of a recommendation system can be grouped into accuracy and other indicators beyond accuracy [40, 41]. Accuracy in this paper mainly refers to the accuracy of the prediction score, judged by comparing the difference between the predicted score and the real score. The most commonly used indicator is the MAE (mean absolute error), where |test| is the size of the test set, $${r}_{uv}$$ is the predicted score of U for V, and $${r}_{uv}^{\mathrm{test}}$$ is the real score of U for V in the test set.
MAE is calculated as follows:
$$\mathrm{MAE}=\frac{\sum \limits_{\left(U,V\right)\in \mathrm{test}}\left|{r}_{uv}-{r}_{uv}^{\mathrm{test}}\right|}{\left|\mathrm{test}\right|}$$
(7)
MAE is easy to understand and calculate, but it has the shortcoming of weighting all errors equally, so a few large prediction errors contribute no more than many small ones. The RMSE (root mean square error) is a related evaluation indicator, calculated as follows:
$$\mathrm{RMSE}=\sqrt{\frac{\sum \limits_{\left(u,v\right)\in \mathrm{test}}{\left|{r}_{uv}-{r}_{uv}^{\mathrm{test}}\right|}^2}{\left|\mathrm{test}\right|}}$$
(8)
In RMSE, each absolute error is squared, which amplifies the larger errors.
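Eqs. (7) and (8) translate directly into code. The sample scores in the usage note are invented purely to show how squaring makes RMSE exceed MAE when errors are uneven.

```python
import math

def mae(pred, true):
    """Mean absolute error, Eq. (7)."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def rmse(pred, true):
    """Root mean square error, Eq. (8); squaring amplifies large errors."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))
```

For predictions [4, 3, 5] against real scores [4, 2, 3], the errors are 0, 1, and 2: MAE is 1.0, while RMSE is sqrt(5/3) ≈ 1.29, larger because the error of 2 is squared.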

## 4 Experimental design and analysis

### 4.1 Data sources

To verify the validity of the collaborative filtering recommendation algorithm of score preference and project type, the experiment is run on the MovieLens data set provided by GroupLens. The MovieLens data set is collected by the GroupLens research group at the University of Minnesota [42–44] and comes in three different versions. This paper selects the ml-100K data set, which has 943 users, 1682 movies, and 100,000 score records in a 943×1682 matrix. A score is 0 or an integer between 1 and 5: 0 means the user did not score the product, and a higher score indicates a stronger preference for the commodity. Ninety percent of the data set was randomly selected as the training set, and the rest was used as the test set.

### 4.2 Experimental design

The experimental data samples come from the MovieLens data set [45–47] provided by GroupLens; we select the ml-100K version. The traditional nearest-neighbor collaborative filtering algorithm, the GSCF algorithm, and the UPCF algorithm are run on data sets train1/test1 and train2/test2, and the MAE and RMSE values of the algorithms are then compared and analyzed [48–50]. The specific steps of the experimental design are as follows:
The first step, division of the data set: the data set is divided into two parts in the proportion 9:1, completely at random. The larger part is called the training set train, and the smaller the test set test. The ml-100K data set is divided several times, yielding training sets train1, train2, train3, and so on, with corresponding test sets test1, test2, test3, and so on.
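The first step can be sketched as a small helper. The `seed` parameter is an assumption added here so that repeated divisions (train1/test1, train2/test2, ...) are reproducible; the paper only specifies a completely random 9:1 split.

```python
import random

def split_ratings(records, ratio=0.9, seed=0):
    """Randomly split score records 9:1 into a training and a test set."""
    rng = random.Random(seed)      # assumed seed for reproducible splits
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]
```

Calling it with different seeds produces the independent train/test pairs used in the experiments.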
The second step, prediction of the scoring probability: the probability of the user scoring each type is calculated according to Section 3.1, yielding a 943×18 score probability matrix. The 943 rows represent 943 users, and the 18 columns represent 18 movie types.
The third step, similarity calculation: the similarity S1 is calculated from the score matrix, and the similarity S2 from the score probability. Combining the two similarities gives the similarity matrix S:
$$S=\left[\begin{array}{cccc}{S}_{11}& {S}_{12}& \dots & {S}_{1m}\\ {}{S}_{21}& \dots & \dots & \dots \\ {}\dots & \dots & \dots & \dots \\ {}{S}_{m1}& \dots & \dots & {S}_{mm}\end{array}\right]$$
Sm1 is the similarity between user m and user 1 in the matrix.
The fourth step is to choose the nearest neighbor according to the similarity.
The fifth step is to calculate the scoring error MAE, RMSE.

### 4.3 Experimental results and analysis

This section compares the traditional user-based recommendation algorithm UBCF, the traditional item-based algorithm IBCF, and the recommendation algorithm GSCF based on graph structure and project type against the UPCF algorithm. We compare the influence of the number of neighbors on the MAE and RMSE of these four recommendation algorithms: with the same number of neighbors, the difference between the MAE and RMSE values of the four algorithms is obtained. The number of neighbors is varied from 10 to 80, and the MAE and RMSE performance of the four algorithms (UBCF, IBCF, GSCF, UPCF) is measured on data sets train1/test1 and train2/test2. On data sets train1 and test1, the MAE values of the four recommendation algorithms are compared in Table 1.
Table 1
Comparison table for MAE value

| Number of neighbors | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 |
|---|---|---|---|---|---|---|---|---|
| UBCF | 0.792 | 0.777 | 0.773 | 0.772 | 0.772 | 0.772 | 0.772 | 0.773 |
| IBCF | 0.863 | 0.839 | 0.829 | 0.824 | 0.820 | 0.818 | 0.817 | 0.815 |
| GSCF | 0.779 | 0.767 | 0.763 | 0.763 | 0.763 | 0.764 | 0.764 | 0.765 |
| UPCF | 0.770 | 0.761 | 0.758 | 0.757 | 0.757 | 0.758 | 0.759 | 0.762 |
In Table 1, UBCF is the user-based algorithm, IBCF is the conventional item-based algorithm, GSCF is the algorithm based on graph structure and project type, and UPCF is the algorithm based on user score preference and project type. With the same number of neighbors, the MAE value of the improved algorithm UPCF is the smallest; that is, its prediction error is the smallest and its recommendation performance is the best. As the number of neighbors grows, the MAE values of the four algorithms first decrease and then increase, showing that the MAE, and hence the performance, is affected by the number of neighbors. To make the comparison of the four algorithms more visible, the MAE values are drawn as line graphs in Fig. 4.
Taking the number of neighbors as the variable, the MAE values of the four algorithms first decrease and then flatten as the number of neighbors increases. This shows that the number of neighbors has a certain impact on the scoring error; once there are enough neighbors, this effect gradually weakens. For the same number of neighbors, the MAE value of UPCF is lower than that of GSCF, which is in turn lower than UBCF and IBCF. Thus, the error between the real score and the predicted score of the improved algorithm is the lowest, and its prediction is more accurate. On data sets train1 and test1, the RMSE values of the four recommendation algorithms are compared in Table 2.
Table 2
Comparison of RMSE value

| Number of neighbors | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 |
|---|---|---|---|---|---|---|---|---|
| UBCF | 1.045 | 1.027 | 1.022 | 1.020 | 1.019 | 1.019 | 1.019 | 1.019 |
| IBCF | 1.079 | 1.048 | 1.035 | 1.028 | 1.024 | 1.022 | 1.020 | 1.019 |
| GSCF | 1.036 | 1.019 | 1.016 | 1.013 | 1.014 | 1.013 | 1.013 | 1.013 |
| UPCF | 1.026 | 1.016 | 1.013 | 1.011 | 1.012 | 1.012 | 1.012 | 1.013 |
As can be seen from Table 2, the RMSE value of the algorithm UPCF is always smaller than the other three recommendation algorithms (under the same number of neighbors). As the number of neighbors increases, the RMSE value becomes smaller first and then bigger. When the number of neighbors is about 40, the value of RMSE tends to be the smallest, that is, the error of the prediction score is the smallest.
According to the data in Table 2, the RMSE values of the four contrast algorithms are plotted as histograms as shown in Fig. 5.
As can be seen from Fig. 5, with the increase in the number of nearest neighbors, the RMSE values of the four algorithms gradually decrease and then flatten. For the same number of neighbors, the RMSE value of the UPCF algorithm is consistently lower than that of the other three recommendation algorithms; that is, the error between the real score and the predicted score of UPCF is the smallest, and its prediction is more accurate.
To exclude the impact of the data set on the results, the following experiments are performed on a second pair of randomly generated data sets, train2 and test2.
On data sets train2 and test2, with the number of nearest neighbors as the variable, MAE and RMSE are the evaluation criteria used to analyze and compare the four recommendation algorithms.
Figure 6 is the comparison diagram of the MAE values. With the increase in the number of nearest neighbors, the MAE values of the four algorithms first decrease and then increase. When the number of nearest neighbors is about 40, the MAE value is the smallest, which shows that the collaborative filtering algorithm is affected by the number of nearest neighbors. For the same number of nearest neighbors, the MAE of the UPCF algorithm is the smallest; that is, its predicted score is closest to the user's real value and it provides the best recommendation result. This fully illustrates the importance of the user's subjective scoring behavior.
Figure 7 is the comparison diagram of the RMSE values. With the increase in the number of nearest neighbors, the RMSE value first decreases and then increases, reaching its minimum when the number of nearest neighbors is about 40. For the same number of nearest neighbors, the RMSE value of the UPCF algorithm is the smallest; that is, the error between the real score and the predicted score is the smallest, and the performance is the best.

### 4.4 Comparative analysis of algorithms

The recommendation algorithm GSCF based on graph structure and item type builds on the conventional algorithm and makes full use of indirect neighbors and the commodity type when computing similarity.
The algorithm UPCF, however, holds that scoring a product is the user's subjective behavior and a kind of implicit expression of user preference. Making full use of the user's scoring behavior reveals the root cause of data sparsity, namely that users leave most products unscored. To better analyze and compare the two improved algorithms, we compare their MAE and RMSE values.
Figure 8 is a line chart comparing the MAE (mean absolute error) of the UPCF and GSCF algorithms on the data set train1, with the number of nearest neighbors as the variable. The MAE of both algorithms first decreases, then increases, and finally levels off. When the number of nearest neighbors is about 40, the MAE of both algorithms, and thus the error, is smallest. For the same neighborhood size, the MAE of the UPCF algorithm is lower than that of the GSCF algorithm; a smaller MAE means a smaller scoring error, so the recommendations of UPCF are more accurate than those of GSCF. This is because UPCF analyzes the root cause of data sparsity and makes full use of the implicit information in user rating behavior, namely the user's implicit preference.
Figure 9 is a histogram comparing the RMSE (root mean square error) of the GSCF algorithm (left) and the UPCF algorithm (right). With the number of nearest neighbors as the variable, the RMSE of both algorithms first decreases, then increases, and finally levels off. When the number of nearest neighbors reaches about 40, the RMSE of both algorithms is smallest; that is, the error is minimal and the precision is highest. For the same number of nearest neighbors, the RMSE of the UPCF algorithm is lower than that of the GSCF algorithm; that is, its error is smaller.
The experimental results show that both the MAE and RMSE of the UPCF algorithm are lower than those of the GSCF algorithm; that is, the error between the real and predicted scores is smaller, so the UPCF recommendation algorithm outperforms the GSCF algorithm.

## 5 Conclusions

The primary cause of sparse data is that users do not take the initiative to score commodities. Because commodity information is continuously updated, no user has the ability or energy to purchase and rate every commodity. The subjective behavior of selecting a product and scoring it is an implicit embodiment of the user's interest preferences: users only choose products they are interested in, and give high scores to commodities they are satisfied with. On the basis of the conventional algorithm, this paper integrates the user's subjective scoring behavior and makes full use of the implicit information it carries. The improved algorithm first calculates the probability that a user will score a product and then incorporates the commodity type and this scoring probability into the traditional similarity calculation. Compared with the traditional neighbor-based recommendation algorithm, the GSCF algorithm, and the UBCF algorithm, the UPCF algorithm achieves the lowest MAE and RMSE for the same number of neighbors, which fully illustrates the usefulness of the user preference information implied by the subjective behavior of scoring.
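The idea of folding scoring probability into similarity can be sketched as follows. The paper's exact formula is not reproduced here; this is a minimal illustration assuming the score probability is estimated as the fraction of a type's items the user has rated, and that it modulates a conventional Pearson similarity (all function names are hypothetical):

```python
from math import sqrt

def score_probability(user_ratings, item_types, t):
    """Hypothetical estimate of a user's willingness to rate items of type t:
    the fraction of that type's items the user has actually rated."""
    items_of_t = [i for i, ty in item_types.items() if ty == t]
    if not items_of_t:
        return 0.0
    return sum(1 for i in items_of_t if i in user_ratings) / len(items_of_t)

def pearson(ra, rb):
    """Traditional Pearson similarity over co-rated items."""
    common = set(ra) & set(rb)
    if len(common) < 2:
        return 0.0
    ma = sum(ra[i] for i in common) / len(common)
    mb = sum(rb[i] for i in common) / len(common)
    num = sum((ra[i] - ma) * (rb[i] - mb) for i in common)
    den = sqrt(sum((ra[i] - ma) ** 2 for i in common)) \
        * sqrt(sum((rb[i] - mb) ** 2 for i in common))
    return num / den if den else 0.0

def upcf_similarity(ra, rb, item_types):
    """Sketch: damp Pearson similarity by how closely the two users'
    per-type scoring probabilities agree (1.0 = identical rating habits)."""
    types = set(item_types.values())
    agree = 1.0 - sum(
        abs(score_probability(ra, item_types, t) - score_probability(rb, item_types, t))
        for t in types) / len(types)
    return pearson(ra, rb) * agree
```

Two users who rate the same proportion of each item type keep their full rating-based similarity, while users with very different scoring habits are treated as less similar even when their co-rated scores agree.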
This paper presents the UPCF algorithm, which adds the project type to the traditional collaborative filtering recommendation algorithm to alleviate the cold-start and score-sparsity problems. The primary cause of the data sparsity problem is that users choose not to score commodities, which is a subjective behavior. Building on the GSCF algorithm, this paper incorporates the user's willingness to score commodities and proposes an algorithm based on user score probability and project type. The difference between UPCF and the traditional algorithm is analyzed, and a recommendation-system data set is used for experimental verification and data analysis. The experimental results show that the algorithm based on user score probability and project type alleviates the data sparsity problem and performs better than the conventional algorithm.

## 6 Future works

Current collaborative filtering recommendation technology is relatively mature, but there is still room for improvement in recommendation accuracy and user experience. The improved algorithm proposed in this paper addresses only the data sparsity situation. To address other shortcomings of the traditional recommendation algorithm, future work will mainly focus on the following:
(1)
Using social networks to solve the cold-start problem: the social and displayed information of social network users (circle of friends, QQ space) can supplement the user behavior information available to the recommendation algorithm, yielding better predictions of user preference and enhancing recommendation performance.

(2)
Using time sequences to solve the problem of user interest drift: because users' interests change over time, time can be added to the recommendation algorithm to study the impact of this objective factor on recommendation accuracy.
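One common way to add time to a recommendation algorithm, sketched here purely as an illustration of the idea (the half-life value is an assumed tuning parameter, not from the paper), is to give older ratings exponentially decaying weight:

```python
def time_weight(rating_age_days, half_life_days=90.0):
    """Exponential time decay: a rating loses half its influence
    every half_life_days."""
    return 0.5 ** (rating_age_days / half_life_days)

# A fresh rating has full weight; a 90-day-old one counts half as much.
print(time_weight(0))    # 1.0
print(time_weight(90))   # 0.5
```

Such a weight could multiply each rating's contribution in the similarity or prediction step, so recent behavior dominates when a user's interests drift.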

## Acknowledgements

The authors would like to thank all anonymous reviewers for their insightful comments and constructive suggestions to polish this paper to high quality. This research was supported by the National Key Research and Development Program of China (No. 2018YFC0810204), the National Natural Science Foundation of China (No. 61502220), the Shanghai Science and Technology Innovation Action Plan Project (16111107502, 17511107203), and the Shanghai Key Lab of Modern Optical System. Fan Yang is the corresponding author of this paper.

CHUNXUE WU received the Ph.D. degree in Control Theory and Control Engineering from the China University of Mining and Technology, Beijing, China, in 2006. He is a Professor with the Computer Science and Engineering and Software Engineering Division, School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, China. His research interests include wireless sensor networks, distributed and embedded systems, wireless and mobile systems, and networked control systems.
JING WU is a graduate student in computer technology at the School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, China. His research interests include the Internet of Things, embedded system development, and deep learning.
CHONG LUO (1991-) is a postgraduate student in computer technology at the School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, China. His research interests include network communication, big data, and machine learning.
QUNHUI WU is a system integration engineer at Shanghai Haolong Environmental Technology Co., Ltd. He graduated from Xi'an Jiaotong University in computer science and technology in 2013. His main research directions are computer system integration and computer control systems.
CONG LIU received the Ph.D. degree in computer application from the East China Normal University, Shanghai, China, in 2013. He is currently a Lecturer with the Department of Computer Science and Engineering, School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, China. His research interests include Evolutionary Computation, Machine Learning, and Image Processing.
YAN WU is currently a postdoctoral associate at the School of Public and Environmental Affairs, Indiana University Bloomington. He obtained his Ph.D. degree from Southern Illinois University Carbondale, with concentrations in environmental chemistry and ecotoxicology. His research involves elucidating the environmental fate of contaminants using chemical and computational techniques, predicting their associated effects on wildlife and public health, and data processing and analysis in environment-related fields.
FAN YANG is an Associate Professor in the School of Information, Zhongnan University of Economics and Law. She received her Ph.D. degree from the School of Computer Science, Wuhan University, China, in 2007, and her M.S. degree from the Department of Computer Engineering, Hubei University, China, in 2004. Dr. Yang currently researches wireless communication security; her interests include security analysis and improvements for blockchain-related technology.

### Funding

This research was supported by the National Key Research and Development Program of China (No. 2018YFC0810204), National Natural Science Foundation of China (No.61502220), Shanghai Science and Technology Innovation Action Plan Project (16111107502, 17511107203) and Shanghai key lab of modern optical system.

### Competing interests

The authors declare that they have no competing interests.

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
