1 Introduction
- SSAM extends the LDA model by adding a sentiment layer that incorporates both review-level and term-level sentiment labels.
- SSAM serves as a complete framework for classifying unlabeled reviews and clustering related words with high accuracy.
- The proposed model extracts implicit aspects, negation sentiments, and intensified sentiments, and considers sentence structure and term order rather than a bag-of-words representation.
- The model is implemented on the Spark big data framework to cope with the explosive growth of opinions on the web.
- A thorough comparison of SSAM with other sentiment and topic models (e.g., JST and ASUM) and with supervised methods (e.g., SVM and NB) is presented.
2 Terminology
- Multiword aspect or sentiment: an n-gram phrase that conveys an aspect or a sentiment, for example, “portable DVD player”, “well designed”.
- Negation sentiment: a multiword containing at least one sentiment word preceded by a negation word such as no, not, none, or cannot, for example, “not bad”, “not clear”.
- Intensified sentiment: a multiword containing at least one sentiment word preceded by an intensifier such as so, very, or extremely, for example, “very well”, “so expensive”.
- Aspect: a topic in topic modeling methods.
- Explicit aspect: an aspect expression in a sentence that is a noun or noun phrase, for example, “camera”, “battery”.
- Implicit aspect: an aspect expression in a sentence of another type, such as an adjective or adverb, for example, “not fit”, “expensive”.
- Sentiment lexicons: words with positive (+1) or negative (−1) sentiment, such as good (+1) or bad (−1), used in the scoring levels of the preprocessing phase.
3 Related works
4 Research methodology
4.1 Preprocessing phase
| N-grams | Type |
|---|---|
| Digital camera | Explicit aspect |
| Very good | Intensified sentiment |
| Not good | Negation sentiment |
| High quality | Intensified sentiment |
| Very nice | Intensified sentiment |
| Battery life | Implicit aspect |
| Not waste money | Implicit aspect |
| External hard drive | Explicit aspect |
| Windows media player | Explicit aspect |
| Portable DVD player | Explicit aspect |
| Not very good | Negation sentiment |
| Work very well | Intensified sentiment |
| Not fit | Implicit aspect |
| Not clear | Negation sentiment |
| Not expensive | Implicit aspect |
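The categories in the table above follow the definitions of Sect. 2: a sentiment word preceded by a negation word yields a negation sentiment, and one preceded by an intensifier yields an intensified sentiment. A minimal sketch of that labeling rule is shown below; the word lists and the tiny ±1 sentiment lexicon are illustrative placeholders, not the paper's actual resources, and the POS-based split of aspects into explicit/implicit is omitted.

```python
# Illustrative n-gram labeling per the Sect. 2 definitions (placeholder lexicons).
NEGATIONS = {"no", "not", "none", "cannot"}
INTENSIFIERS = {"so", "very", "extremely"}
# Sentiment lexicon: +1 positive, -1 negative (see Sect. 2, "Sentiment Lexicons").
LEXICON = {"good": 1, "bad": -1, "well": 1, "nice": 1, "expensive": -1, "clear": 1}

def label_ngram(ngram: str) -> str:
    """Assign one of the Sect. 2 categories to a multiword n-gram."""
    words = ngram.lower().split()
    for i, w in enumerate(words):
        if w in LEXICON:
            preceding = set(words[:i])
            # Negation takes precedence, so "not very good" is a negation sentiment.
            if preceding & NEGATIONS:
                return "negation sentiment"
            if preceding & INTENSIFIERS:
                return "intensified sentiment"
    # No sentiment word found: treat as an aspect candidate (the explicit vs.
    # implicit distinction requires POS tags, omitted in this sketch).
    return "aspect"

print(label_ngram("not very good"))   # negation sentiment
print(label_ngram("very well"))       # intensified sentiment
print(label_ngram("digital camera"))  # aspect
```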
4.2 Supervised Sentiment and Aspect Model
4.2.1 Incorporating document’s sentiment labels
4.2.2 Incorporating terms or sentences label
| Symbol | Description |
|---|---|
| D | The number of all reviews |
| V | The vocabulary size |
| Z | Number of aspects |
| S | Number of sentiments |
| z | Aspect |
| s | Sentiment |
| \(\theta\) | Per-review sentiment–aspect distribution |
| \(\varphi\) | Sentiment–aspect word distribution |
| \(\pi\) | Per-review sentiment distribution |
| \(\alpha\) | Dirichlet prior vector for \(\theta\) |
| \(\beta\) | Dirichlet prior vector for \(\varphi\) |
| \(\gamma\) | Dirichlet prior vector for \(\pi\) |
| \(s_i\) | The sentiment of word i |
| \(z_i\) | The aspect of word i |
| \(s_{-i}\) | The sentiment assignments for all words except word i |
| \(z_{-i}\) | The aspect assignments for all words except word i |
| w | The word list representation of review d |
| \(N_{k,j,w}\) | The number of times word w occurred in aspect j with sentiment label k |
| \(N_{k,j}\) | The number of words that are assigned sentiment k and aspect j |
| \(N_{d,k,j}\) | The number of words that are assigned sentiment label k and aspect j in review d |
| \(N_d\) | The total number of words in review d |
4.3 Learning and inference
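The count variables in the notation table are exactly those needed for collapsed Gibbs sampling. SSAM's exact conditional is not reproduced in this excerpt; as a sketch, JST-style sentiment–topic models sample the sentiment–aspect pair of each word from an update of the following form, writing \(N_{d,k} = \sum_j N_{d,k,j}\) and using the superscript \(-i\) to mean that word \(i\) is excluded from the count:

```latex
P(s_i = k, z_i = j \mid s_{-i}, z_{-i}, \mathbf{w}) \propto
\frac{N_{k,j,w_i}^{-i} + \beta}{N_{k,j}^{-i} + V\beta} \times
\frac{N_{d,k,j}^{-i} + \alpha}{N_{d,k}^{-i} + Z\alpha} \times
\frac{N_{d,k}^{-i} + \gamma}{N_{d}^{-i} + S\gamma}
```

The three factors correspond to the word distribution \(\varphi\), the per-review sentiment–aspect distribution \(\theta\), and the per-review sentiment distribution \(\pi\), respectively.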
| Datasets | Amazon electronics | Amazon books |
|---|---|---|
| Number of reviews | 143,828 | 38,473 |
| Number of reviews with 3, 4, and 5 stars | 73% | 77% |
| Average number of words/review+ | 102 | 67 |
| Average number of words/review* | 42 | 33 |
| Corpus size+ | 15,822,742 | 3,064,464 |
| Corpus size* | 6,493,136 | 1,272,683 |
| Vocabulary size+ | 470,779 | 172,669 |
| Vocabulary size* | 224,725 | 87,836 |
4.4 Implementing SSAM on Spark framework
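The Spark implementation itself is not detailed in this excerpt. Distributed collapsed Gibbs samplers typically partition the reviews, sample assignments locally within each partition, and merge the per-partition count matrices at the end of each iteration (a map–reduce pattern). The sketch below is a hypothetical plain-Python stand-in for that pattern, not the authors' Spark code; in Spark the two steps would correspond to `mapPartitions` and a reduce/aggregate over the count tables.

```python
from collections import Counter

def local_counts(partition):
    """'Map' step: count (sentiment, aspect, word) assignments in one partition."""
    counts = Counter()
    for review in partition:
        for word, sentiment, aspect in review:
            counts[(sentiment, aspect, word)] += 1
    return counts

def merge_counts(partition_counts):
    """'Reduce' step: sum per-partition tables into the global N_{k,j,w} counts."""
    total = Counter()
    for c in partition_counts:
        total.update(c)  # Counter.update adds counts elementwise
    return total

# Toy data: each review is a list of (word, sentiment, aspect) assignments.
partitions = [
    [[("camera", "pos", 0), ("good", "pos", 0)]],
    [[("camera", "pos", 0), ("bad", "neg", 1)]],
]
global_counts = merge_counts(local_counts(p) for p in partitions)
print(global_counts[("pos", 0, "camera")])  # 2
```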
| Electronics | | | | Books | | | |
|---|---|---|---|---|---|---|---|
| Picture quality (n) | Camera size (n) | Computer network (n) | Computer screen (p) | Romantic (p) | Politic (n) | Education (p) | War novels (n) |
| Noise | Camera | Internet | Computer | Feel | Politic | Book | War |
| Picture quality | Battery | Network | Monitor | Heart | Culture | Course | Fear |
| Camera | Size | Issue | Bright | Love story | Middle east | Young | Soldier |
| Pixel | Kit | Wireless access point | Display | Classic | Democratic | High school student | Army |
| Resolution | Bulky | Plug | Screen | Friendship | Bad | Recommend | Force |
| Low quality | Camera size | Not work | Size | Leave | History | Collage | American |
| Contrast | Heavy | Bad | LCD screen | Romance | Inconsistent | Well write | Sadness |
| Amateur | Battery | Connect | Great | Love | Influence | Educate | Dark |
| Not clear | Camera bag | Slow | View | Interesting | Government | Child | Kill |
| Lens | Compact | DSL router | Inch | Life | People | Kid | Country |
| Distortion | Side | Home | Color | Emily | State | School | Happen |
| Color | Inch | Port | Sharp | Emotion | Republic | Parent | Human |
| Not good | Very small | File | New | Wonderful | Dissatisfaction | Teach | Action |
| Point | Not fit | Less | Video | Special | Foreign | Children | Critic |
| Low light | Pocket | Support | Show | Man | Policy | Think | Bad |
5 Experimental setup
5.1 Dataset
5.2 Evaluation metrics
5.2.1 Sentiment classification accuracy
5.2.2 Precision, recall and F-score
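Precision, recall, and F-score follow their standard definitions over the confusion-matrix counts; the helper below is an illustrative binary-classification sketch (variable names are ours, not the paper's).

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Standard metrics from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp)          # fraction of predicted positives that are correct
    recall = tp / (tp + fn)             # fraction of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

p, r, f = precision_recall_f1(tp=90, fp=10, fn=10)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.9 0.9 0.9
```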
6 Experiments
| | Electronics | | | Books | | |
|---|---|---|---|---|---|---|
| | Linear SVM | Naive Bayes | SSAM | Linear SVM | Naive Bayes | SSAM |
| Recall | 85.02 | **99.60** | 90.45 | 74.12 | **98.00** | 81.17 |
| Precision | 84.11 | 77.84 | **90.74** | 92.03 | 68.35 | **92.14** |
| F1-score | 84.06 | 87.39 | **90.61** | 84.16 | 80.53 | **86.31** |
| Accuracy | **85.17** | 77.61 | 83.90 | 73.74 | 69.07 | **77.61** |
6.1 Aspects discovery evaluation
6.2 Performance comparison of SSAM with two existing supervised methods
| Accuracy (%) | Term scoring | Sentence scoring | Document scoring | SSAM |
|---|---|---|---|---|
| Electronics | 66.93 | 75.58 | 73.41 | **83.90** |
| Books | 62.02 | 72.43 | 70.44 | **77.61** |
| Accuracy (%) | ASUM | JST | SSAM |
|---|---|---|---|
| Electronics | 78.83 | 69.94 | **83.90** |
| Books | 73.23 | 65.28 | **77.61** |