Preprocessing techniques
Most recent studies in sentiment analysis focus on user-generated texts, which are informal and shaped by writing habits; hence it is necessary to clean and normalize the language and remove noisy information before classification.
Vietnamese segmentation is always a required step when working with Vietnamese; for example, “…” (this is a wonderful phone) is tokenized as “…” (using the pyvi\(^{1}\) library). Unlike English, where words are separated by whitespace and punctuation, a Vietnamese word may consist of several tokens, and these must be segmented; otherwise the meaning of the sentence can differ greatly from the original.
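As a minimal sketch (the paper's example sentence was lost in extraction, so the Vietnamese below is our own illustration), segmentation with pyvi looks like this:

```python
from pyvi import ViTokenizer

# Illustrative sentence: "this is a wonderful phone" (our own rendering,
# since the paper's original example is elided).
text = "đây là một chiếc điện thoại tuyệt vời"
print(ViTokenizer.tokenize(text))
# Multi-syllable words come back joined by underscores,
# e.g. "điện_thoại" (phone), "tuyệt_vời" (wonderful).
```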
Lowercase is a classic preprocessing technique that converts all text to lowercase. Identical words are thereby merged, reducing the dimensionality of the problem; for example, “…” (good) and “…” (Good) map to the same dimension. This technique has been widely used by many researchers [19-21].
Stop words removal Stop words are function words: they usually carry little meaning and no sentiment, yet appear with high frequency in texts. Removing them reduces dimensionality and computational cost and improves performance. The set of stop words is not fully predefined and depends on the application; in our experiments, the stop-word list is determined from term-frequency and inverse-document-frequency weights over the collected datasets.
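One way to realize such a frequency-based list, sketched with scikit-learn (the cut-off n is illustrative, not the paper's):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def build_stopword_list(corpus, n=50):
    # Terms with the lowest IDF occur in the most documents; taking the
    # n most widespread terms is one frequency/IDF-based stop-word list.
    vec = TfidfVectorizer()
    vec.fit(corpus)
    terms = np.array(vec.get_feature_names_out())
    order = np.argsort(vec.idf_)          # ascending IDF
    return set(terms[order[:n]])
```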
Elongated characters removal Some characters in a lexicon are repeated one or more times to emphasize sentiment. This adds unnecessary dimensionality, because classifiers treat the elongated variants as different words, and the variants may even be dropped for low frequency. Elongated-character removal therefore transforms such words back to their source word so that the variants share the same dimension; for example, “quuuuáaaa” is replaced by “quá” (so), and “…” is replaced by “…” (wonderful). The experiments of Symeonidis et al. [4] demonstrated the benefit of this technique.
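A minimal regex sketch of this normalization:

```python
import re

def collapse_elongated(word):
    # Collapse any run of a repeated character to one occurrence,
    # e.g. "quuuuáaaa" -> "quá". A few Vietnamese words such as "xoong"
    # legitimately double a letter, so a dictionary check can guard
    # against over-collapsing.
    return re.sub(r"(.)\1+", r"\1", word)
```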
Abbreviations or wrong-spelling lexicons replacement Abbreviations and misspelled words have become habitual and are common in reviews on social media and e-commerce systems; they should also be merged into their source words, for example dth -> dễ_thương (cute); iu, êu -> yêu (lovely); omg -> oh my god; k -> không (not); sd -> sử_dụng (use); ote, okay, oki, uki, oke -> ok; … -> tuyệt_vời (wonderful); …, xs -> xuất_sắc (excellent); wá, qá -> quá (so). For our experiments, a list of abbreviations and wrong-spelling lexicons was prepared manually by observing reviews in social media and in our collected datasets. Kim [19] corrected common spelling mistakes using AutoMap, and Symeonidis et al. [4] also examined this technique in a comparison of preprocessing techniques.
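A dictionary-based sketch of this replacement (entries taken from the list above; expansions restored only where the gloss makes them unambiguous):

```python
# Multi-syllable expansions use the underscore convention of
# segmented Vietnamese.
ABBREVIATIONS = {
    "dth": "dễ_thương",              # cute
    "iu": "yêu", "êu": "yêu",        # lovely
    "k": "không",                    # not
    "sd": "sử_dụng",                 # use
    "ote": "ok", "okay": "ok", "oki": "ok", "uki": "ok", "oke": "ok",
    "xs": "xuất_sắc",                # excellent
    "wá": "quá", "qá": "quá",        # so
}

def normalize_abbreviations(text):
    # Replace whole-word abbreviations by their source lexicons.
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in text.split())
```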
Emotional icons replacement Emotional icons are widely used in reviews and denote the user's sentiment. Wang and Castanon [11] analyzed and compared the sentiment of tweets with and without emotional icons, providing evidence of their importance for expressing sentiment in social media. In our case, positive and negative icons are replaced by the “pos” and “neg” lexicons, respectively; for example, :) is replaced by the “pos” lexicon and :( by the “neg” lexicon.
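A sketch with an illustrative polarity map (the paper's full icon list is not shown):

```python
POSITIVE_ICONS = [":)", ":D", ";)", "<3"]   # illustrative subset
NEGATIVE_ICONS = [":(", ":-(", ":'("]

def replace_emoticons(text):
    # Substitute icons with the "pos"/"neg" marker lexicons before
    # punctuation removal destroys them.
    for icon in POSITIVE_ICONS:
        text = text.replace(icon, " pos ")
    for icon in NEGATIVE_ICONS:
        text = text.replace(icon, " neg ")
    return text
```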
Punctuation removal Most punctuation (excluding the underscore _, which is used for Vietnamese segmentation) does not affect sentiment and should be removed to reduce noise; for example, “…” (so beautiful!, love this phone!) becomes “…”. However, some punctuation does carry sentiment, and removing it could decrease classification accuracy: :), :D, ;), \(<3\) are positive icons that affect the sentiment of a review. In our work, this step is therefore applied after emotional-icon replacement. Kim [19] likewise removed punctuation, URLs, and stop words carrying no sentiment to improve performance.
Numbers removal Numbers normally carry no sentiment, so it is necessary to remove them; however, this should be done after emoticon and wrong-spelling replacement, because some icons contain digits, such as :3, \(<3\), 8|, 8-), etc.
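A sketch of these two steps in the required order:

```python
import re

def strip_punct_and_numbers(text):
    # Run AFTER emoticon and abbreviation replacement, otherwise icons
    # such as :3 or 8-) would be destroyed. \w matches the underscore,
    # so segmentation markers like "điện_thoại" survive.
    text = re.sub(r"[^\w\s]", " ", text)    # punctuation
    text = re.sub(r"\d+", " ", text)        # digits
    return re.sub(r"\s+", " ", text).strip()
```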
Part of Speech (POS) handling POS tagging is an essential task in natural language processing: it assigns a part of speech, such as noun, verb, adjective, or pronoun, to each word in a sentence, which helps enrich the semantics of the text. In our work, POS tagging is used to retain the words that can carry sentiment, namely nouns, adjectives, verbs, and adverbs. Symeonidis et al. [4] also applied POS tagging, keeping nouns, verbs, and adverbs in their experiments. In an example from our case, “…” (that phone is so beautiful, I am so pleased) is tagged as “…” (N: noun, P: pronoun, A: adjective, R: adverb), and the sentence becomes “…”. We again used the pyvi\(^{1}\) library for POS tagging.
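A sketch of this filter with pyvi (the coarse membership test is ours):

```python
from pyvi import ViTokenizer, ViPosTagger

KEEP = {"N", "A", "V", "R"}   # noun, adjective, verb, adverb

def keep_sentiment_pos(sentence):
    # pyvi returns a (words, tags) pair for the segmented sentence.
    words, tags = ViPosTagger.postagging(ViTokenizer.tokenize(sentence))
    # Coarse filter; subtype tags such as "Np" could be matched by prefix.
    return " ".join(w for w, t in zip(words, tags) if t in KEEP)
```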
Negation handling Negation is a challenge in sentiment analysis. For example, when “…” (the product is not good) is vectorized by its terms, ignoring the term “không” (not) could lead to evaluating the sentence as positive instead of negative. Normally, when a negation lexicon (không (not), … (not), … (not yet), etc.) is detected before a positive or negative lexicon, the phrase is replaced by an antonym of that lexicon; for example, the phrase “…” (not good) is replaced by “…” (bad), an antonym of “…” (good), based on a suitable wordnet. However, in the experiments of Symeonidis et al. [4], replacing negations with antonyms beat the baseline only for logistic regression on the SS-Twitter dataset and for Convolutional Neural Networks on the SemEval dataset. Moreover, Xia et al. [22] showed that many machine learning algorithms fail when negations are replaced with antonyms.
Our work detects negation from the negation terms (không (not), … (not), … (not yet)); Fernández-Gavilanes et al. [6] similarly estimated negation scope from a set of negator forms (not, no, never, neither). Since no Vietnamese wordnet is strong enough for antonym replacement, when a negation is detected before a positive lexicon the phrase is replaced by the “not_pos” lexicon, and when it is detected before a negative lexicon the phrase is replaced by the “not_neg” lexicon. Additionally, to reflect the influence of sentiment lexicons, we append a “pos” or “neg” lexicon whenever a positive or negative lexicon appears, respectively. In our experiments, this significantly improves classifier accuracy.
For example, in the sentence “…” (this phone design is not nice, but its performance is good), “không đẹp” (not nice) is a negation phrase in which “đẹp” (nice) is a positive lexicon, so it is replaced by the “not_pos” lexicon; “…” (good) is also a positive lexicon, so the sentence becomes “… not_pos, … pos”.
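A sketch of this marker-based scheme (the negator and sentiment lists below are illustrative stand-ins, since the paper's Vietnamese lists are partly elided):

```python
NEGATORS = {"không", "chưa", "chẳng"}    # not / not yet / not (assumed forms)
POSITIVE = {"đẹp", "tốt", "hài_lòng"}    # illustrative positive lexicons
NEGATIVE = {"xấu", "tệ", "chậm"}         # illustrative negative lexicons

def handle_negation(tokens):
    out, i = [], 0
    while i < len(tokens):
        # negator directly before a sentiment lexicon -> single marker
        if tokens[i] in NEGATORS and i + 1 < len(tokens):
            if tokens[i + 1] in POSITIVE:
                out.append("not_pos"); i += 2; continue
            if tokens[i + 1] in NEGATIVE:
                out.append("not_neg"); i += 2; continue
        out.append(tokens[i])
        # plain sentiment lexicons get an extra influence marker
        if tokens[i] in POSITIVE:
            out.append("pos")
        elif tokens[i] in NEGATIVE:
            out.append("neg")
        i += 1
    return out
```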
Intensification handling Intensifier lexicons such as “…” (very), “quá” (too), “…” (a bit), and “khá” (pretty) emphasize, increase, or decrease the semantic strength of the lexicons that precede or follow them. Handling them is necessary to detect the degree of customer satisfaction. Fernández-Gavilanes et al. [6] applied intensification treatment as a preprocessing technique, using parsing to determine which semantic orientations were altered.
In our work, if the program detects an increasing intensifier lexicon preceding or following a positive or negative lexicon, it appends a “strong_pos” or “strong_neg” lexicon, respectively; if it detects a decreasing intensifier lexicon preceding or following a positive or negative lexicon, it appends a “pos” or “neg” lexicon, respectively. This step is applied after negation handling.
For example, in the sentence “…” (nice design, strong configuration, I'm very pleased), “…” is an increasing intensifier lexicon followed by the positive lexicon “…” (pleased), so the sentence becomes “… strong_pos”. For intensification handling we also prepared a list of intensifier lexicons frequently used in Vietnamese, grouped into increasing (…) and decreasing (…) semantics.
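A sketch under the assumption that “rất”/“quá” (very/too) increase and “hơi”/“khá” (a bit/pretty) decrease strength; the paper's actual grouping is elided:

```python
INCREASING = {"rất", "quá"}     # grouping assumed
DECREASING = {"hơi", "khá"}     # grouping assumed
POSITIVE = {"đẹp", "tốt", "hài_lòng"}
NEGATIVE = {"xấu", "tệ"}

def handle_intensifiers(tokens):
    out = list(tokens)
    for i, tok in enumerate(tokens):
        # look at the immediate neighbours on both sides
        around = tokens[max(i - 1, 0):i] + tokens[i + 1:i + 2]
        if tok in POSITIVE and any(t in INCREASING for t in around):
            out.append("strong_pos")
        elif tok in NEGATIVE and any(t in INCREASING for t in around):
            out.append("strong_neg")
        elif tok in POSITIVE and any(t in DECREASING for t in around):
            out.append("pos")
        elif tok in NEGATIVE and any(t in DECREASING for t in around):
            out.append("neg")
    return out
```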
Other techniques concerning word morphology, such as stemming and lemmatization, are not used, since Vietnamese is an inflectionless language: each word has only one form.
Data augmentation
Data augmentation was originally used in image classification to increase the amount of image data by rotating, translating, scaling, adding noise, etc. Similarly, data augmentation has been applied to text classification to increase textual data with various techniques. For text, augmentation is more complex, and many studies have investigated how to obtain new data, and improve its quality, without user intervention. Augmentation enriches the original training data and so increases model accuracy; note, however, that it is mainly useful for small datasets.
First, we present the EDA (easy data augmentation) techniques introduced by Wei and Zou [18] and apply them to Vietnamese.
Synonym replacement randomly selects words that are not stop words, obtains synonyms for them at random, and replaces them to form a new sentence.
Random swap randomly swaps two non-stop words, repeated n times.
Random insert finds a random synonym of a non-stop word and inserts it at a random position in the sentence, repeated n times.
Random delete removes each word of the sentence with probability p. These processes are repeated until the expected amount of training data is obtained. For example, applying these techniques to the sentence “…” (great! I'm so pleased) may generate new sentences such as the following (a code sketch of the four operations appears after the examples):
- Synonym Replacement: “…” (“…” is a synonym of “hài_lòng”).
- Random Swap: “… hài_lòng! …” (swapping “…” and “hài_lòng”).
- Random Insert: “…” (“…” is also a synonym of “hài_lòng”).
- Random Delete: “…” (the word “tôi” is deleted).
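A minimal sketch of the four EDA operations (the stop-word and synonym tables are illustrative stand-ins; the paper's resources are not shown):

```python
import random

STOPWORDS = {"là", "và", "thì"}                    # illustrative list
SYNONYMS = {"hài_lòng": ["ưng_ý", "vừa_lòng"]}     # assumed synonym table

def synonym_replacement(tokens, n=1):
    candidates = [i for i, t in enumerate(tokens) if t in SYNONYMS]
    for i in random.sample(candidates, min(n, len(candidates))):
        tokens[i] = random.choice(SYNONYMS[tokens[i]])
    return tokens

def random_swap(tokens, n=1):
    idx = [i for i, t in enumerate(tokens) if t not in STOPWORDS]
    for _ in range(n):
        if len(idx) < 2:
            break
        i, j = random.sample(idx, 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_insert(tokens, n=1):
    known = [t for t in tokens if t in SYNONYMS]
    for _ in range(n):
        if not known:
            break
        syn = random.choice(SYNONYMS[random.choice(known)])
        tokens.insert(random.randrange(len(tokens) + 1), syn)
    return tokens

def random_delete(tokens, p=0.1):
    kept = [t for t in tokens if random.random() > p]
    return kept or [random.choice(tokens)]   # never return an empty sentence
```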
Although these methods may produce meaningless sentences, they can increase classifier accuracy in experiments. The number of repetitions should be chosen according to the dataset size, because too much augmented data can lead to overfitting. The biggest disadvantage of these methods is that they do not preserve the contextual meaning of the sentences, so we next present more complex approaches that retain the meaning of the original sentence.
Back translation aims to obtain more training samples with the help of translators; many research teams have used it to improve translation models [12-15, 23]. The technique translates the original data into some chosen language and then feeds the result into an independent translator to translate it back into the original language. The back-translated data is normally never exactly the same as the original. English has many training datasets for translation while other languages lack them, so English was used as the intermediate language to obtain more data.
For example, Google Translate renders the sentence “…” into English as “I really like to buy the device at this store”; translating this sentence back into Vietnamese with Google Translate gives “…”.
This approach is simple to understand and useful for augmenting data while retaining the meaning of the original, but it requires effective translators. In our experiments we used the Google Translation API to translate the original Vietnamese data into English and back into Vietnamese to obtain the augmented data.
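A minimal sketch with the Google Cloud Translation v2 client (credential setup omitted; the round-trip helper is ours):

```python
from google.cloud import translate_v2 as translate

client = translate.Client()   # assumes application default credentials

def back_translate(text, source="vi", pivot="en"):
    # Vietnamese -> English -> Vietnamese; the round trip usually
    # paraphrases the input, giving an augmented sample.
    pivoted = client.translate(text, source_language=source,
                               target_language=pivot)["translatedText"]
    return client.translate(pivoted, source_language=pivot,
                            target_language=source)["translatedText"]
```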
Syntax-tree transformations are a rule-based approach to generating new data. From the original sentence, a syntactic parser builds a syntax tree; syntactic grammar rules then transform this tree into a new tree, from which a new sentence form is generated. Many syntactic transformations exist, such as converting active voice to passive voice.
For example, the sentence “…” (I like this phone) is parsed as “…” (P: pronoun, N: noun, V: verb) and transformed into “…” (this phone is liked by me). The generated data still retains the meaning of the original, but this approach is computationally expensive, especially for Vietnamese, whose sentence structures are complex.
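As a toy illustration only (not the paper's grammar; it assumes the common Vietnamese passive marker “được” and works on flat POS-tagged tokens rather than a full syntax tree):

```python
def active_to_passive(tagged):
    # tagged: list of (word, tag) pairs, e.g. from pyvi's POS tagger.
    # Rule: "P V N..." -> "N... được P V", using the passive marker "được".
    if len(tagged) >= 3 and tagged[0][1] == "P" and tagged[1][1] == "V":
        pron, verb = tagged[0][0], tagged[1][0]
        obj = " ".join(w for w, _ in tagged[2:])
        return f"{obj} được {pron} {verb}"
    return None

# e.g. "tôi thích điện_thoại này" (I like this phone) ->
#      "điện_thoại này được tôi thích" (this phone is liked by me)
```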
Classifiers
Based on the experiments in [24], we choose the best-performing classifiers for our experiments, namely logistic regression, SVM, and the OVO and OVR ensembles of classifiers.
Logistic regression (LR) is a statistical approach that models the relationship between a dependent variable y and a set of independent variables x. To predict the label of a data point, the probability produced by the logistic function is compared against a predefined threshold in [0, 1]. The logistic function is usually the sigmoid function \(\sigma (z)=1/(1+e^{-z})\).
Support vector machine (SVM) is a strong classifier that finds the hyperplane dividing the dataset into groups in a multi-dimensional space; this hyperplane must be equidistant from the two parallel hyperplanes passing through the nearest data points of the respective groups. For non-linearly separable datasets, SVM uses kernel functions to map the data points from a non-linearly separable space into a linearly separable one. Our experiments use the RBF (radial basis function) kernel:
$$\begin{aligned} k(x,x')=\exp \left( -\gamma \left\| x-x' \right\| _{2}^{2}\right) ,\quad \gamma > 0, \end{aligned}$$
(1)
where \(\gamma\) controls how far the influence of a single data point reaches in the computation of the separating hyperplane: with low \(\gamma\) values even points far from the hyperplane are taken into account, whereas with high \(\gamma\) values only points close to it are.
One-vs-one (OVO) and one-vs-rest (OVR, also called one-vs-all) are ensembles of binary classifiers for multi-class problems. Each OVO iteration takes a pair of classes and applies a binary classifier to label a data point; the final label is determined by majority voting over the iterations. OVR has a lower computational cost: with c classes, OVO must run \(c(c-1)/2\) binary classifiers, whereas OVR requires only c. In each OVR iteration, the binary classifier decides whether a data point belongs to the given class, and the final label is the class with the highest probability.
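A sketch of this classifier setup with scikit-learn (hyperparameters and the toy data are illustrative, not the paper's tuned values):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

texts = ["pin tốt pos", "máy chậm neg"]    # toy preprocessed reviews
labels = ["positive", "negative"]

for clf in (LogisticRegression(max_iter=1000),
            SVC(kernel="rbf", gamma="scale"),        # RBF kernel of Eq. (1)
            OneVsOneClassifier(SVC(kernel="rbf")),
            OneVsRestClassifier(LogisticRegression(max_iter=1000))):
    model = make_pipeline(TfidfVectorizer(), clf)    # TF-IDF features
    model.fit(texts, labels)
```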