Introduction
- In this paper, we propose a two-stage framework for target-based sentiment analysis, namely O\(^2\)-Bert, which consists of two modules: OTE-Bert for Opinion Target Extraction and OSC-Bert for Opinion Sentiment Classification. Compared with end-to-end approaches, the proposed framework makes fuller use of the supervision signals and yields better-trained models.
- Second, the designed standalone modules for entity number prediction, starting position annotation, and entity length prediction effectively handle unusual samples, for example, samples with no entities or with multi-word entities. Moreover, in the entity starting position module, we introduce an innovative model that combines BERT and a GCN to learn contextual relationships among words (a minimal sketch of the resulting pipeline follows this list).
- The proposed approach achieves competitive performance on open benchmarks (the SemEval datasets); various comparison experiments demonstrate its effectiveness and robustness.
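To make the two-stage design concrete, here is a minimal PyTorch sketch of the pipeline described above. The head names and dimensions (`OTEHeads`, `OSCHead`, `max_entities`, `max_span_len`) are illustrative placeholders rather than the paper's implementation, and the GCN used in the starting-position module is omitted for brevity.

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class OTEHeads(nn.Module):
    """Illustrative OTE-Bert heads: entity number, start position, length."""
    def __init__(self, hidden=768, max_entities=8, max_span_len=8):
        super().__init__()
        self.num_head = nn.Linear(hidden, max_entities + 1)  # targets per sentence
        self.start_head = nn.Linear(hidden, 2)               # token starts a target?
        self.len_head = nn.Linear(hidden, max_span_len)      # span length at a start

    def forward(self, token_states, pooled):
        return (self.num_head(pooled),          # [B, max_entities+1]
                self.start_head(token_states),  # [B, T, 2]
                self.len_head(token_states))    # [B, T, max_span_len]

class OSCHead(nn.Module):
    """Illustrative OSC-Bert head: 3-way sentiment for one extracted span."""
    def __init__(self, hidden=768, n_classes=3):
        super().__init__()
        self.cls = nn.Linear(hidden, n_classes)  # positive / neutral / negative

    def forward(self, span_repr):
        return self.cls(span_repr)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
ote, osc = OTEHeads(), OSCHead()

enc = tokenizer("The battery life is great but the screen is dim",
                return_tensors="pt")
out = encoder(**enc)
n_logits, start_logits, len_logits = ote(out.last_hidden_state,
                                         out.pooler_output)
# Stage 2: pool a predicted target span (tokens 2-3 here, purely as an
# example) and classify its sentiment.
span_repr = out.last_hidden_state[:, 2:4].mean(dim=1)
sentiment_logits = osc(span_repr)
```

Because the two stages are trained with their own supervision, each module receives a direct training signal rather than a signal diluted through a single end-to-end objective.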
Related Work
| Task | Preprocessing | Network | Attention mechanism | Other |
|---|---|---|---|---|
| OTE | Word2Vec [4], GPT [14] | RNN [5], SPR [15], Capsule network [12] | Self-attention [6] | MTL [1], Post-processing [9], RL [23] |
| OSC | | | | MTL [1], Dictionary-based [31], Distance-rule [29] |
| One-stage | BERT [38] | | | MTL [1], Span-based [18] |
Opinion Target Extraction
Opinion Sentiment Classification
Recent Advances Promoted by Pretrained Language Models
O\(^2\)-Bert Model
Challenges and Design Motivation
OTE-Bert Framework
Entity Number Prediction Module
Entity Starting Annotation Module
Entity Length Prediction Module
OSC-Bert Framework
Experiment
Dataset and Experiment Settings
The sentiment label distribution of the four datasets is as follows:

| Sentiment | Laptop2014 Train | Laptop2014 Test | Restaurant2014 Train | Restaurant2014 Test | Restaurant2015 Train | Restaurant2015 Test | Restaurant2016 Train | Restaurant2016 Test |
|---|---|---|---|---|---|---|---|---|
| Positive | 1002 | 348 | 2216 | 737 | 1178 | 341 | 1618 | 596 |
| Neutral | 471 | 167 | 643 | 197 | 48 | 34 | 88 | 36 |
| Negative | 885 | 134 | 834 | 200 | 380 | 328 | 708 | 189 |
The distribution of the number of opinion targets per sentence:

| Number of targets | Laptop2014 Train | Laptop2014 Test | Restaurant2014 Train | Restaurant2014 Test | Restaurant2015 Train | Restaurant2015 Test |
|---|---|---|---|---|---|---|
| 1 | 930 | 266 | 1023 | 298 | 801 | 311 |
| 2 | 354 | 105 | 572 | 186 | 212 | 127 |
| 3-6 | 199 | 50 | 417 | 128 | 67 | 15 |
| >6 | 5 | 0 | 9 | 2 | 25 | 3 |
Division | Laptop2014 | Restaurant2014 | Restaurant2015 | Restaurant2016 |
---|---|---|---|---|
Train | 1840 | 2585 | 1124 | 1690 |
Validation | 788 | 1108 | 482 | 724 |
Test | 649 | 1134 | 703 | 821 |
Comparative Methods
Baseline Models
- Bi-LSTM is short for “Bidirectional Long Short-Term Memory” and consists of a forward LSTM and a backward LSTM. A unidirectional LSTM alone cannot encode information from back to front, which is a problem when modeling sentences. For finer-grained classification, especially when a sentence contains several entities and opinion words, Bi-LSTM can identify target-opinion pairs better by capturing more contextual information.
- Bi-LSTM+CRF: The input sequence is first embedded into a vector sequence and processed by forward and backward LSTM units. Their outputs are concatenated and passed through a fully connected layer, yielding a vector whose dimension matches the tag set defined by the CRF feature functions; the output is then normalized to obtain label probabilities (a minimal sketch of this CRF tagging family appears after this list).
- BERT+CRF uses BERT as the pretrained model; the output hidden states of the input words are taken as features for the CRF.
- BERT+Bi-LSTM+CRF: BERT, a pretrained Transformer encoder, generates embedding vectors that capture the context and word information at each position. A Bi-LSTM extracts features from these embeddings and passes them to a CRF for the classification result.
- SpanMlt [18] is a span-based multi-task framework for pairwise aspect and opinion term extraction.
- ATSE [20] treats opinion target extraction as a question-answering machine reading comprehension task. It utilizes a span-based tagging scheme to handle cases where a token belongs to multiple entities.
- DE-CNN [9] is a post-processing method that controls the number of extracted opinion targets and corrects their boundaries. It proposes an aspect number determination module and an aspect boundary modification module to better address errors in the extracted opinion targets.
- TextCNN [19] is an enhanced CNN-based model for text tasks, comprising four layers: input, convolution, pooling, and a fully connected softmax layer. The model takes the word embedding of each word as input, and the fully connected softmax layer outputs the classification probabilities.
- Distance-rule [29] summarizes customers’ reviews in three steps. First, it mines product features that appear as nouns or noun phrases in comments via part-of-speech (POS) tagging and association mining. It then regards adjectives as opinion words and determines their polarities using the adjective synonym and antonym sets in WordNet [45]; for infrequent features, the nearest noun or noun phrase is taken as the opinion target of an opinion word. Finally, it predicts the polarity of a sentence by analyzing all the opinion words in it and generates the final summary over all product features (a toy sketch of the nearest-noun pairing follows this list).
- Dependency-rule [36] proposes dependency-tree-based templates to identify opinion pairs, making use of the POS tags of opinion targets and opinion words and the dependency path between them.
- ATAE-LSTM [19] has two primary features: aspect embedding and an attention mechanism. The former learns an embedding vector for each aspect, while the latter determines attention weights that indicate the significance of each word (see the attention sketch after this list).
- TransCap [46] proposes a transfer capsule network model to transfer document-level knowledge to aspect-level representations.
- IACapsNet [30] utilizes a capsule network to capture vector-based feature representations. Through an interactive-attention, EM-based capsule routing mechanism, it effectively learns the semantic correlation between opinion targets and opinion words.
- SGGCN+BERT [26] employs a gate vector to leverage the representations of the opinion words; using this opinion-word information, it modulates the hidden vectors of graph-based models.
- CapsNet+BERT [47]: CapsNet fuses a conventional CNN with a distinctive fully connected capsule layer. It comprises three layers: a standard CNN as the first layer, a PrimaryCaps layer as the second, and a DigitCaps layer as the third. Each capsule neuron connects to all capsule neurons in the subsequent layer.
- MHAGCN (BERT) [24] is a graph convolutional network with a hierarchical multi-head attention mechanism. It leverages the relationship between opinion targets and their context by incorporating semantic information and syntactic dependencies.
- GP-GCN (BERT) [25] simplifies the global features through orthogonal projection within the GCN. It captures the local dependency structure of sentences via the syntactic dependency structure and sentence sequence information. Moreover, it proposes a percentage-based multi-head attention mechanism to better represent the critical outputs of the GCN.
- BiGCN [35] constructs a global vocabulary graph from the training corpus, along with local syntactic and vocabulary graphs for each sentence, and employs a conceptual hierarchy to differentiate types of dependency and co-occurrence relations. To extract comprehensive sentence features, its HiarAgg module lets the vocabulary graph and the syntactic graph interact, and a mask-and-gate mechanism obtains contextual information, improving target polarity prediction.
- CGAT [32] proposes a contextual graph attention network that combines two graph attention networks with a contextual attention module to capture aspect-sensitive text features. Furthermore, a novel syntactic attention mechanism based on relative distance is introduced to strengthen the focus on opinion targets while reducing computational complexity.
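Several of the sequence-labeling baselines above (Bi-LSTM+CRF, BERT+CRF, BERT+Bi-LSTM+CRF) share the same encoder-plus-CRF shape. The following is a minimal PyTorch sketch of that family; it is our own illustrative reimplementation, not the cited authors' code, and it assumes the third-party `pytorch-crf` package and BIO-style target tags.

```python
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF  # pip install pytorch-crf

class BertBiLstmCrfTagger(nn.Module):
    """Illustrative BERT+Bi-LSTM+CRF baseline for BIO target tagging."""
    def __init__(self, num_tags=3, lstm_hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * lstm_hidden, num_tags)  # emission scores
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        states = self.bert(input_ids,
                           attention_mask=attention_mask).last_hidden_state
        emissions = self.emit(self.lstm(states)[0])
        mask = attention_mask.bool()
        if tags is not None:  # training: negative log-likelihood loss
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)  # best tag sequences
```

Dropping the LSTM recovers BERT+CRF, and swapping BERT for static word embeddings recovers Bi-LSTM+CRF.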
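The Distance-rule baseline is simple enough to sketch directly. This toy version, with a hypothetical two-word seed lexicon, pairs each opinion adjective with the nearest noun; the real method additionally uses association mining and WordNet-based polarity propagation.

```python
# Toy sketch of the Distance-rule idea: adjectives act as opinion words,
# and each is paired with the nearest noun as its opinion target.
ADJ_POLARITY = {"great": 1, "dim": -1}  # assumed seed lexicon, not from [29]

def distance_rule(tagged):  # tagged: list of (word, POS) pairs
    pairs = []
    for i, (word, pos) in enumerate(tagged):
        if pos == "ADJ" and word in ADJ_POLARITY:
            nouns = [j for j, (_, p) in enumerate(tagged) if p == "NOUN"]
            if nouns:
                target = min(nouns, key=lambda j: abs(j - i))  # nearest noun
                pairs.append((tagged[target][0], word, ADJ_POLARITY[word]))
    return pairs

print(distance_rule([("the", "DET"), ("screen", "NOUN"),
                     ("is", "VERB"), ("dim", "ADJ")]))
# [('screen', 'dim', -1)]
```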
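Among the attention-based classifiers, ATAE-LSTM is representative. Below is a minimal PyTorch sketch of its two ideas, aspect embedding and aspect-conditioned attention; the dimensions and single-layer attention scorer are our own simplifications of the original model.

```python
import torch
import torch.nn as nn

class ATAELstm(nn.Module):
    """Illustrative ATAE-LSTM: aspect embedding + aspect-aware attention."""
    def __init__(self, vocab, n_aspects, emb=300, hidden=300, n_classes=3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, emb)
        self.aspect_emb = nn.Embedding(n_aspects, emb)
        self.lstm = nn.LSTM(2 * emb, hidden, batch_first=True)
        self.attn = nn.Linear(hidden + emb, 1)  # aspect-aware attention score
        self.cls = nn.Linear(hidden, n_classes)

    def forward(self, tokens, aspect):
        w = self.word_emb(tokens)                         # [B, T, E]
        a = self.aspect_emb(aspect).unsqueeze(1)          # [B, 1, E]
        a_rep = a.expand(-1, w.size(1), -1)               # repeat per token
        h, _ = self.lstm(torch.cat([w, a_rep], dim=-1))   # [B, T, H]
        scores = self.attn(torch.cat([h, a_rep], dim=-1)) # [B, T, 1]
        alpha = torch.softmax(scores, dim=1)              # attention weights
        sent = (alpha * h).sum(dim=1)                     # weighted sentence repr
        return self.cls(sent)
```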
Ablation Models
Results and Discussion
Experimental Results of Opinion Target Extraction in OTE-Bert
Method | Lap2014 F1 (%) | Rest2014 F1 (%) | Rest2015 F1 (%) | Rest2016 F1 (%)
---|---|---|---|---|
Bi-LSTM [18] | 55.25 | 51.90 | 53.28 | 51.83 |
Bi-LSTM+CRF [18] | 69.80 | 78.03 | 66.27 | 70.43 |
BERT+CRF [18] | 56.38 | 54.37 | 57.01 | 55.83 |
BERT+Bi-LSTM+CRF [18] | 56.99 | 54.08 | 55.85 | 55.18 |
SpanMlt [18] | 84.51 | 87.42 | 81.76 | 85.62 |
ATSE [20] | 82.47 | 87.85 | 77.72 | 83.34 |
DE-CNN [9] | 84.89 | 88.41 | 73.47 | 78.83 |
O\(^2\)-Bert (ours) | 84.63 | 89.20 | 83.16 | 86.88
w/o n | 81.53 | 85.31 | 78.03 | 80.02 |
w/o s | 75.83 | 79.38 | 77.26 | 78.34 |
w/o n+s | 72.14 | 80.27 | 73.72 | 76.69 |
w/o se | 75.02 | 78.14 | 77.56 | 76.30 |
Experimental Results of Opinion Sentiment Classification in OSC-Bert
| Category | Model | Lap2014 Acc (%) | Lap2014 F1 (%) | Rest2014 Acc (%) | Rest2014 F1 (%) | Rest2015 Acc (%) | Rest2015 F1 (%) | Rest2016 Acc (%) | Rest2016 F1 (%) |
|---|---|---|---|---|---|---|---|---|---|
| Network | TextCNN [19] | 55.16 | 48.81 | 47.69 | 42.58 | - | - | - | - |
| | Distance-rule [29] | 58.39 | 49.92 | 50.13 | 40.42 | 54.12 | 45.97 | 61.90 | 51.83 |
| | Dependency-rule [36] | 64.57 | 58.04 | 45.09 | 37.14 | 65.49 | 55.98 | 76.03 | 64.62 |
| | LSTM [19] | 52.64 | 58.34 | 55.71 | 56.52 | 57.27 | 58.93 | 62.46 | 65.33 |
| | ATAE-LSTM [19] | 77.32 | 66.57 | 69.14 | 63.14 | 75.43 | 56.34 | 83.25 | 63.85 |
| | TransCap [46] | 79.29 | 70.85 | 73.87 | 70.10 | - | - | - | - |
| | IACapsNet [30] | 81.79 | 73.40 | 76.80 | 73.29 | - | - | - | - |
| BERT | BERT [27] | 84.11 | 76.68 | 77.59 | 73.28 | 83.48 | 66.18 | 90.10 | 74.16 |
| | SGGCN+BERT [26] | 87.20 | 82.50 | 82.80 | 80.20 | 82.72 | 65.86 | 90.52 | 74.53 |
| | CapsNet+BERT [47] | - | 76.37 | - | 73.58 | - | 70.56 | - | 76.36 |
| | MHAGCN+BERT [24] | 79.06 | 75.70 | 82.57 | 75.83 | - | - | - | - |
| | GP-GCN+BERT [25] | 83.89 | 75.09 | 83.90 | 66.89 | 87.78 | 72.89 | 75.90 | 73.90 |
| Graph | ASGCN-DT [34] | 80.86 | 72.19 | 74.14 | 69.24 | 79.34 | 60.78 | 88.69 | 66.64 |
| | ASGCN-DG [34] | 80.77 | 72.02 | 75.55 | 71.05 | 79.89 | 61.89 | 88.99 | 67.48 |
| | BiGCN [35] | 81.97 | 73.48 | 74.59 | 71.84 | 81.16 | 64.79 | 88.96 | 70.84 |
| | CGAT [32] | 86.25 | 80.38 | 81.41 | 76.48 | - | - | - | - |
| | O\(^2\)-Bert (ours) | 88.43 | 82.90 | 86.81 | 80.73 | 86.94 | 76.94 | 89.83 | 83.58 |