Introduction
Related work
The proposed system
Text pre-processing
- Noise removal: we remove stop-words, dates, numeric characters, Web pages, and words with special characters;
- Tokenization, stemming, and lemmatization: we split the text into tokens and extract the stem and lemma of each token. Whether the stem or the lemma is used depends on the feature extraction technique applied in the feature extraction module;
- Part-of-speech tagging (POS tagging): we obtain the POS tag of each token. This feature is used by all feature extraction techniques;
- Sentence splitting: we split the text into sentences and divide compound sentences so that they can be processed by the sentiment analysis algorithm;
- Parsing: we apply a parser to the texts to obtain finer-grained information, such as the structure of sentences and the relationships between words. This information is used by some of the techniques in the feature extraction module.
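The noise removal, tokenization, and sentence splitting steps listed above can be sketched as follows. This is a minimal illustration only, not the system's actual implementation: the stop-word list, the regular expressions, and the function names (`split_sentences`, `filter_tokens`, `preprocess`) are assumptions made for the example. Stemming, lemmatization, POS tagging, and parsing are omitted because they normally rely on an NLP toolkit.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real system would use a full one.
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "it", "in", "on", "to",
              "was", "for", "has"}

# Assumed patterns for Web addresses and numeric dates such as 12/05/1999.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
DATE_RE = re.compile(r"\b\d{1,4}[/-]\d{1,2}[/-]\d{1,4}\b")
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in SENTENCE_END_RE.split(text) if s.strip()]


def filter_tokens(sentence):
    """Tokenize on whitespace and drop stop-words, numbers, and tokens
    that still contain non-alphabetic characters after trimming punctuation."""
    tokens = []
    for raw in sentence.split():
        token = raw.strip(".,;:!?\"'()[]").lower()
        if not token.isalpha():
            continue
        if token in STOP_WORDS:
            continue
        tokens.append(token)
    return tokens


def preprocess(text):
    """Remove Web addresses and dates, split into sentences, then filter tokens."""
    text = URL_RE.sub(" ", text)
    text = DATE_RE.sub(" ", text)
    return [filter_tokens(s) for s in split_sentences(text)]


review = ("The film was great! See www.example.com for the 12/05/1999 review. "
          "It has 2 plots & a twist.")
tokens_per_sentence = preprocess(review)
```

The output is one token list per sentence, which matches the idea that later modules (feature extraction and sentiment analysis) operate at sentence level.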
Feature extraction
Extracting terms through heuristics
Extracting aspects through heuristics
Extracting terms through transductive learning
Extracting aspects through hierarchy clustering
Sentiment analysis and item representations construction
Sentiment analysis algorithm
Item’s vector construction
Recommendation
Empirical evaluation
Databases
Database | Metadata | Total | Average occurrence |
---|---|---|---|
ML-100k | Actors | 44,178 | 44.15 |
ML-100k | Directors | 1016 | 1.06 |
ML-100k | Genres | 18 | 1.72 |
HetRec ML | Actors | 95,321 | 22.78 |
HetRec ML | Directors | 4060 | 1.0 |
HetRec ML | Genres | 20 | 2.04 |
Evaluation metrics
Feature extraction techniques comparison
- Heuristic terms: the term extraction technique described in the “Extracting terms through heuristics” section, using IF=30;
- Classification terms: the term extraction technique described in the “Extracting terms through transductive learning” section, using the filter_DF_N in TLATE’s filtering step and the k-NN network with k=57 in the transductive step;
- Heuristic aspects: the aspect extraction technique described in the “Extracting aspects through heuristics” section, using the binarized sentiment approach in the items’ representation creation module;
- Hierarchy aspects: the aspect extraction technique described in the “Extracting aspects through hierarchy clustering” section, using the topic granularity of [2,7].
Database | Technique | Total | Average occurrence |
---|---|---|---|
ML-100k | Heuristic terms | 3085 | 223.32 |
ML-100k | Classification terms | 8433 | 401.07 |
ML-100k | Heuristic aspects | 78 | 22.59 |
ML-100k | Hierarchy aspects | 933 | 236.65 |
HetRec ML | Heuristic terms | 33,618 | 840.89 |
HetRec ML | Classification terms | 17,864 | 1469.43 |
HetRec ML | Heuristic aspects | 55 | 41.59 |
HetRec ML | Hierarchy aspects | 5428 | 2097.93 |
Rating prediction
Database | Metadata | k = 20 | k = 40 | k = 60 | k = 80 | k = 100 |
---|---|---|---|---|---|---|
ML-100k | Actors | 0.9384 | 0.9383 | 0.9382 | 0.9382 | 0.9381 |
ML-100k | Directors | 0.9438 | 0.9438 | 0.9438 | 0.9438 | 0.9438 |
ML-100k | Genres | 0.9404 | 0.9401 | 0.9402 | 0.9401 | 0.9401 |
HetRec ML | Actors | 0.8229 | 0.8226 | 0.8225 | 0.8225 | 0.8224 |
HetRec ML | Directors | 0.8311 | 0.8311 | 0.8311 | 0.8311 | 0.8311 |
HetRec ML | Genres | 0.8294 | 0.8283 | 0.8280 | 0.8280 | 0.8280 |
Database | Technique | k = 20 | k = 40 | k = 60 | k = 80 | k = 100 |
---|---|---|---|---|---|---|
ML-100k | Heuristic terms | 0.9310 | 0.9314 | 0.9330 | 0.9347 | 0.9361 |
ML-100k | Classification terms | 0.9302 | 0.9306 | 0.9328 | 0.9345 | 0.9360 |
ML-100k | Heuristic aspects | 0.9424 | 0.9409 | 0.9403 | 0.9403 | 0.9404 |
ML-100k | Hierarchy aspects | 0.9406 | 0.9383 | 0.9381 | 0.9384 | 0.9388 |
HetRec ML | Heuristic terms | 0.8025 | 0.8030 | 0.8050 | 0.8071 | 0.8090 |
HetRec ML | Classification terms | 0.7964 | 0.7978 | 0.8005 | 0.8030 | 0.8052 |
HetRec ML | Heuristic aspects | 0.8289 | 0.8267 | 0.8270 | 0.8277 | 0.8285 |
HetRec ML | Hierarchy aspects | 0.8173 | 0.8168 | 0.8184 | 0.8200 | 0.8214 |
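The rating prediction results are reported for several values of k. The excerpt does not state what k controls, so the sketch below rests on an assumption: that k is the neighbourhood size of an item-based k-NN predictor that compares item vectors with cosine similarity. The function names and the toy vectors are hypothetical, and the actual recommender may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_rating(target, user_ratings, item_vectors, k=20):
    """Similarity-weighted mean of the user's ratings on the k rated items
    most similar to `target` (assumed item-based k-NN scheme)."""
    sims = [
        (cosine(item_vectors[target], item_vectors[item]), rating)
        for item, rating in user_ratings.items()
        if item != target
    ]
    neighbours = sorted(sims, key=lambda pair: pair[0], reverse=True)[:k]
    weight = sum(sim for sim, _ in neighbours if sim > 0)
    if weight == 0:
        return None  # no neighbour with positive similarity
    return sum(sim * rating for sim, rating in neighbours if sim > 0) / weight


# Toy item vectors (e.g. one dimension per extracted feature) and one user's ratings.
item_vectors = {"t": [1, 1], "i1": [1, 0], "i2": [1, 1], "i3": [0, 1]}
user_ratings = {"i1": 4, "i2": 2, "i3": 5}

pred_k1 = predict_rating("t", user_ratings, item_vectors, k=1)
pred_k3 = predict_rating("t", user_ratings, item_vectors, k=3)
```

Under this assumption, a larger k lets less similar items contribute to the prediction, which is one way the choice of k could affect the reported values.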
Item recommendation
Database | Metadata | prec@10 (k = 20) | MAP (k = 20) | prec@10 (k = 40) | MAP (k = 40) | prec@10 (k = 60) | MAP (k = 60) | prec@10 (k = 80) | MAP (k = 80) | prec@10 (k = 100) | MAP (k = 100) |
---|---|---|---|---|---|---|---|---|---|---|---|
ML-100k | Actors | 0.0892 | 0.0579 | 0.0893 | 0.0579 | 0.0894 | 0.0579 | 0.0892 | 0.0580 | 0.0892 | 0.0580 |
ML-100k | Directors | 0.0872 | 0.0571 | 0.0872 | 0.0571 | 0.0872 | 0.0571 | 0.0872 | 0.0571 | 0.0872 | 0.0571 |
ML-100k | Genres | 0.0843 | 0.0568 | 0.0849 | 0.0570 | 0.0849 | 0.0570 | 0.0849 | 0.0570 | 0.0849 | 0.0570 |
HetRec ML | Actors | 0.1081 | 0.0238 | 0.1073 | 0.0255 | 0.1073 | 0.0255 | 0.1073 | 0.0255 | 0.1090 | 0.0255 |
HetRec ML | Directors | 0.1047 | 0.0270 | 0.1047 | 0.0270 | 0.1047 | 0.0270 | 0.1047 | 0.0270 | 0.1047 | 0.0270 |
HetRec ML | Genres | 0.1021 | 0.0280 | 0.1030 | 0.0280 | 0.1038 | 0.0281 | 0.1038 | 0.0281 | 0.1038 | 0.0281 |
Database | Technique | prec@10 (k = 20) | MAP (k = 20) | prec@10 (k = 40) | MAP (k = 40) | prec@10 (k = 60) | MAP (k = 60) | prec@10 (k = 80) | MAP (k = 80) | prec@10 (k = 100) | MAP (k = 100) |
---|---|---|---|---|---|---|---|---|---|---|---|
ML-100k | Heuristic terms | 0.1041 | 0.0656 | 0.1059 | 0.0671 | 0.1021 | 0.0673 | 0.1039 | 0.0675 | 0.1024 | 0.0676 |
ML-100k | Classification terms | 0.1043 | 0.0658 | 0.1051 | 0.0671 | 0.1044 | 0.0675 | 0.1048 | 0.0676 | 0.1042 | 0.0677 |
ML-100k | Heuristic aspects | 0.0951 | 0.0597 | 0.0956 | 0.0604 | 0.0947 | 0.0601 | 0.0950 | 0.0597 | 0.0946 | 0.0594 |
ML-100k | Hierarchy aspects | 0.0997 | 0.0643 | 0.0977 | 0.0647 | 0.0993 | 0.0642 | 0.0979 | 0.0637 | 0.0979 | 0.0633 |
HetRec ML | Heuristic terms | 0.1057 | 0.0256 | 0.1105 | 0.0270 | 0.1144 | 0.0277 | 0.1160 | 0.0271 | 0.1174 | 0.0273 |
HetRec ML | Classification terms | 0.1047 | 0.0258 | 0.1125 | 0.0274 | 0.1159 | 0.0280 | 0.1169 | 0.0270 | 0.1180 | 0.0273 |
HetRec ML | Heuristic aspects | 0.0910 | 0.0219 | 0.0991 | 0.0237 | 0.1038 | 0.0246 | 0.1053 | 0.0250 | 0.1054 | 0.0252 |
HetRec ML | Hierarchy aspects | 0.1062 | 0.0242 | 0.1060 | 0.0262 | 0.1105 | 0.0272 | 0.1143 | 0.0277 | 0.1159 | 0.0260 |
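The item recommendation results in this section are reported as prec@10 and MAP. As a reading aid, the sketch below computes both using their common textbook definitions. The function names and toy data are hypothetical, and the exact variants used in the evaluation (for example, how users without relevant items are handled) are assumptions, since they are not given in this excerpt.

```python
def precision_at_k(ranked_items, relevant_items, k=10):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    return hits / k


def average_precision(ranked_items, relevant_items):
    """Mean of the precision values at each rank where a relevant item appears,
    normalized by the number of relevant items."""
    if not relevant_items:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked_items, start=1):
        if item in relevant_items:
            hits += 1
            total += hits / rank
    return total / len(relevant_items)


def mean_average_precision(rankings, relevance):
    """Average of per-user average precision; both arguments are dicts keyed by user."""
    users = list(rankings)
    return sum(
        average_precision(rankings[u], relevance.get(u, set())) for u in users
    ) / len(users)


# Toy rankings and ground-truth relevant items for two users.
rankings = {"u1": ["a", "b", "c", "d"], "u2": ["x", "y"]}
relevance = {"u1": {"a", "c"}, "u2": {"y"}}

p_at_2 = precision_at_k(rankings["u1"], relevance["u1"], k=2)
map_score = mean_average_precision(rankings, relevance)
```

Because prec@10 looks only at the first ten positions while MAP rewards placing every relevant item early in the list, the two metrics need not rank the techniques in the same order.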
Discussion
Heuristic terms | Classification terms | Heuristic aspects | Hierarchy aspects |
---|---|---|---|
ML-100k | |||
film | film | cinematography | [discov,homeless,filmmak,maggi,thought] |
movie | movi | critics | [paradis,cinema,camera,screen,past] |
time | time | horror | [juror,juri,sayl,chang,reason] |
character | stori | time | [nichol,listen,happen,experi,mike] |
story | charact | scene | [marci,stai,store,leav,slow] |
way | scene | audio | [mail,postman,deliv,put,post] |
thing | plot | description | [art,form,artist,rylanc,interest] |
people | soundtrack | cast | [rylanc,form,atmospher,artist,mind] |
scene | plai | footage | [materi,falk,spiritu,chanc,grandfath] |
plot | view | script | [carri,spacek,palma,lauri,barn] |
HetRec ML | |||
film | film | cinematography | [comic,final,western,enjoi,fan] |
movie | movi | cast | [petti,tank,comic,person,enjoi] |
time | time | watch | [saramago,blind,viewer,turn,point] |
story | watch | critics | [symbol,imag,view,experi,place] |
way | stori | time | [scari,sound,base,run,minut] |
people | end | audio | [lucia,reason,kind,artist,art] |
character | work | distribution | [declin,spheeri,troop,scout,exploit] |
thing | charact | fantasy | [wake,linklat,present,view,understand] |
scene | plot | direction | [mckellar,sandra,hour,bit,special] |
life | plai | romance | [makoto,chang,travel,van,moment] |