In this section and the next, we discuss state-of-the-art approaches to sentiment analysis, classified into corpus-based, lexicon-based, and hybrid approaches, for both English and other languages. In particular, this section presents corpus-based techniques, whose development focuses on feature engineering and model selection. Most of the techniques presented here use an annotated corpus and machine learning models to train a suitable sentiment analysis classifier.
English
Shi and Li [
47] developed a supervised machine learning technique for sentiment analysis of English online hotel reviews using unigram features. They weighted the features by term frequency and by TF-IDF to identify document polarity as positive or negative. The data were split into training and testing sets; the instances in the training set carried the target (polarity) labels. A support vector machine (SVM) was used to build a model able to predict the target values of unseen instances [
47]. The SVM classifier was chosen because it has been reported to perform better than other classifiers [
38], though Tong and Koller [
55] consider Naïve Bayes and SVM the most effective classifiers among machine learning techniques [
61]. The hotel-review corpus contained 4000 reviews, which were pre-processed and tagged as positive or negative. The resulting sentiment classification model was then used to classify a live information flow into positive and negative documents. TF-IDF weighting performed better than simple term frequency [
47].
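As an illustration of the general pipeline described above, the following sketch trains a TF-IDF-weighted unigram SVM with scikit-learn; the toy reviews and labels are invented and merely stand in for the 4000-review hotel corpus:

```python
# Minimal sketch of a TF-IDF + linear SVM review classifier.
# The tiny training set below is invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_reviews = [
    "clean room and friendly staff, great stay",
    "lovely view, excellent breakfast, would return",
    "dirty bathroom and rude staff, terrible stay",
    "noisy room, awful service, would not return",
]
train_labels = ["positive", "positive", "negative", "negative"]

# Unigram features weighted by TF-IDF, fed to a linear SVM.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 1)), LinearSVC())
model.fit(train_reviews, train_labels)

print(model.predict(["friendly staff and a great view"])[0])
```

Swapping `TfidfVectorizer` for a plain term-frequency `CountVectorizer` reproduces the simpler weighting scheme that the study found to perform worse.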
Another study [
10] used supervised classification to identify sentiment in documents. The authors applied their method to sentences found on the Internet, in particular in blogs, forums, and reviews. Sentence features were extracted using a state-of-the-art algorithm, and sentence parsing provided a deeper level of analysis. Finally, active learning was used to reduce the annotation workload [
15]. After the pre-processing stage, several features were selected, such as unigrams, stems, negation, and discourse features. SVM, Maximum Entropy, and multinomial Naïve Bayes classifiers were employed as machine learning algorithms. For linearly separable data, SVM gives classification results with minimal error. The multinomial Naïve Bayes classifier is very simple, classifies efficiently, and supports incremental learning [
31]. The Maximum Entropy classifier is efficient in extracting information that leads to good results [
7]. English-language corpora were collected from blogs, reviews, and forum sites such as
www.livejournal.com or
www.skyrock.com.
The Maximum Entropy classifier achieved 83 % accuracy, better than the other classifiers used in this study, namely SVM and multinomial Naïve Bayes; however, other approaches [
47] used SVM to evaluate their datasets, and other machine learning techniques have been reported to yield lower accuracy than SVM.
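The simplicity and incremental-learning property attributed to the multinomial Naïve Bayes classifier above can be made concrete with a minimal from-scratch sketch; the Laplace smoothing and the toy documents are illustrative choices:

```python
# Minimal multinomial Naive Bayes with incremental updates, illustrating
# why the classifier is simple and amenable to incremental learning.
# All data below is invented; Laplace (add-one) smoothing is used.
import math
from collections import Counter, defaultdict

class MultinomialNB:
    def __init__(self):
        self.class_docs = Counter()              # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocab = set()

    def update(self, tokens, label):
        """Incremental learning: fold in one labelled document."""
        self.class_docs[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def predict(self, tokens):
        total_docs = sum(self.class_docs.values())
        best, best_lp = None, float("-inf")
        for label, n_docs in self.class_docs.items():
            lp = math.log(n_docs / total_docs)           # class prior
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            for w in tokens:
                lp += math.log((counts[w] + 1) / denom)  # smoothed likelihood
            if lp > best_lp:
                best, best_lp = label, lp
        return best

nb = MultinomialNB()
nb.update("great phone loved it".split(), "positive")
nb.update("terrible phone hated it".split(), "negative")
print(nb.predict("loved this great product".split()))
```

Because `update` only increments counters, new labelled documents can be folded in at any time without retraining from scratch.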
The main advantage of this approach is that it is simple to develop and requires little building effort. A disadvantage is the lack of high-quality training data: data collected from blogs contain many grammatical errors, which negatively affect classification performance [
10].
Other Languages
Habernal et al. [
23] proposed an approach for supervised sentiment analysis of social media in the Czech language. Three different datasets were employed. The first was collected from Facebook, based on the top comments on popular Czech Facebook pages, and contained positive, negative, neutral, and bipolar posts. The second was a movie review dataset downloaded from a Czech movie database. The third contained product reviews collected from large Czech online shops. After the data pre-processing step, n-gram features were extracted: unigrams and bigrams were used as binary features, and a minimum number of occurrences was required for character n-grams. Part-of-speech (POS) tagging provided characteristics of specific posts; various POS features were used, such as adjectives, verbs, and nouns. Two emoticon lists were used: one for positive and one for negative sentiment. Another feature was Delta TF-IDF, a word-level feature that showed good performance: it weights words by TF-IDF but distinguishes their occurrence in positive and negative documents.
To evaluate the datasets, two different classifiers were trained: SVM and Maximum Entropy. The F-measure for the combination of unigram, bigram, and emoticon features was 0.69. The emphasis of this approach was on feature selection; the selected features were unigrams, bigrams, POS, and character n-grams. The approach is useful for sentiment analysis of Czech social media, although it cannot be directly applied to other languages and its results for Czech social media are still modest; nevertheless, it can help researchers extend sentiment analysis methods to the Czech language [
23].
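The Delta TF-IDF weighting mentioned above can be sketched as follows: a term's weight is its TF-IDF computed against one class's documents minus its TF-IDF against the other's, so the sign marks the class a term is associated with. The subtraction order (chosen here so that positive-class terms receive positive weights), the add-one smoothing, and the toy Czech-like tokens are illustrative choices:

```python
# Sketch of Delta TF-IDF term weighting: the difference between a term's
# IDF in the negative corpus and in the positive corpus, scaled by its
# term frequency. Positive weight = positive cue under this convention.
import math

def delta_tfidf(term, tf, pos_docs, neg_docs):
    p_t = sum(term in d for d in pos_docs) + 1  # smoothed doc freq (positive)
    n_t = sum(term in d for d in neg_docs) + 1  # smoothed doc freq (negative)
    return tf * (math.log2((len(neg_docs) + 1) / n_t)
                 - math.log2((len(pos_docs) + 1) / p_t))

pos = [{"skvely", "film"}, {"skvely", "herec"}]  # toy positive documents
neg = [{"nudny", "film"}, {"nudny", "spatny"}]   # toy negative documents

print(delta_tfidf("skvely", 1, pos, neg))  # > 0: positive cue
print(delta_tfidf("nudny", 1, pos, neg))   # < 0: negative cue
print(delta_tfidf("film", 1, pos, neg))    # = 0: evenly distributed
```

Terms occurring evenly across both classes cancel out, which is what makes the feature discriminative compared with plain TF-IDF.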
Tan and Zhang [
54] introduced an approach to sentiment classification for the Chinese language. First, POS tagging was used to parse and tag the Chinese text. Then, feature selection was applied to determine discriminative terms for classification, and finally a machine learning approach performed the sentiment classification. Feature selection relied on four measures: document frequency, the Chi-square statistic, mutual information, and information gain. Thresholds were defined for the document frequency of words and phrases in the training corpus, and words with document frequency below one predefined threshold or above another were removed. The Chi-square statistic (CHI) was used to measure the association between terms and categories, and mutual information was used as in statistical language modelling. Information gain measures the amount of information useful for predicting the category that is contributed by the presence or absence of a given term in the document.
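The information-gain criterion just described can be illustrated with a small sketch; the toy documents and labels are invented:

```python
# Sketch of information-gain feature scoring: how much knowing whether a
# term occurs in a document reduces uncertainty about the document's
# category. Toy documents are represented as sets of terms.
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(term, docs, labels):
    n = len(docs)
    classes = set(labels)
    # Entropy of the class distribution before observing the term.
    h_c = entropy([labels.count(c) / n for c in classes])
    # Expected entropy after splitting on term presence/absence.
    h_cond = 0.0
    for present in (True, False):
        subset = [l for d, l in zip(docs, labels) if (term in d) == present]
        if subset:
            h_cond += (len(subset) / n) * entropy(
                [subset.count(c) / len(subset) for c in classes])
    return h_c - h_cond

docs = [{"good", "plot"}, {"good", "cast"}, {"bad", "plot"}, {"bad", "end"}]
labels = ["pos", "pos", "neg", "neg"]

print(information_gain("good", docs, labels))  # perfectly predictive term
print(information_gain("plot", docs, labels))  # uninformative term
```

A term whose presence perfectly separates the classes scores the full class entropy (here 1 bit), while a term distributed independently of the class scores zero, which is why ranking terms by this score selects discriminative features.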
Various datasets are available online for Chinese sentiment classification. The Chinese sentiment corpus ChnSentiCorp, collected from online documents, is a benchmark sentiment analysis database. It includes 1021 documents in three domains: education, movies, and house. For each of these domains, there are positive and negative documents. The centroid, SVM, Naïve Bayes, k-nearest-neighbour, and Winnow classifiers were compared; the overall accuracy of SVM was better than that of the other classifiers.
This approach differs from others in its feature selection scheme: it selects features by document frequency, mutual information, the Chi-square statistic, and information gain, whereas other approaches usually employ features such as unigrams and bigrams. The results show that, among these measures, information gain performs best and can be recommended for future applications. The main disadvantage of this approach is its reliance on traditional measures such as the Chi-square statistic, document frequency, and mutual information [
54].
Ghorbel and Jacot [
21] proposed an approach for sentiment analysis of French movie reviews. Their method relies on three types of features: lexical, morpho-syntactic, and semantic. Unigrams were selected as lexical features, and the goal of the system was to find the polarity of words. Part-of-speech tags were employed to augment unigrams with morpho-syntactic information, in order to reduce word-sense ambiguity and to handle negation before polarity extraction. SentiWordNet was used to determine word polarity, and this information was aggregated into an overall polarity score for the review [
52]. Since SentiWordNet is an English-language resource, French reviews were translated into English before polarity extraction: words were lemmatized before being looked up in a bilingual dictionary, and part-of-speech tags were then used for sense selection, to remove uncertain senses and to predict the correct synset. The dataset of French movie reviews contained 2000 documents: 1000 positive and 1000 negative reviews of ten movies.
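The review-level polarity scoring just described can be sketched as follows; the miniature lexicon is an invented stand-in for SentiWordNet synset scores (keyed by translated English lemmas), and the one-word negation scope is a simplifying assumption:

```python
# Hedged sketch of lexicon-based review scoring: per-word positive and
# negative scores (invented values standing in for SentiWordNet) are
# summed into an overall review score; a negator flips the next word.

# lemma -> (positive score, negative score), invented for illustration
lexicon = {"excellent": (0.8, 0.0), "boring": (0.0, 0.7),
           "good": (0.6, 0.1), "bad": (0.1, 0.6)}
NEGATORS = {"not", "never"}

def review_polarity(tokens):
    score, negate = 0.0, False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True
            continue
        if tok in lexicon:
            pos, neg = lexicon[tok]
            word_score = pos - neg
            score += -word_score if negate else word_score
        negate = False  # simplified negation scope: the next word only
    return score

print(review_polarity("an excellent film".split()))    # positive score
print(review_polarity("not good and boring".split()))  # negative score
```

The sign of the aggregate score gives the review's overall polarity; in the actual system the per-word scores would come from the predicted SentiWordNet synsets rather than a fixed dictionary.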
The SVM classifier was used for classification. Using unigrams, lemmatization, and negation handling, the overall performance on French movie reviews was 92.50 % for positive reviews and 94 % for negative reviews. The approach combined the lexical, morpho-syntactic, and semantic orientation of words to improve the results, raising accuracy by 0.25 %; the semantic orientation of words, extracted from SentiWordNet, further improved the result by 1.75 %.
A disadvantage of this approach is that words need to be translated into English prior to using SentiWordNet, which is an English-language resource. The quality of translation negatively affected classifier performance, since word-by-word translation does not preserve semantic orientation due to differences between languages [
21].
Balahur and Turchi [
5] introduced a hybrid technique for sentiment analysis of Twitter texts. To minimize the effort of producing linguistic resources for each language, they developed sentiment analysis tools for various languages and investigated, in the context of Twitter texts, the use of machine translation systems to produce multilingual data.
Pre-processing was employed to normalize the texts, taking the linguistic peculiarities of tweets into consideration: spelling variants, slang, special punctuation, and sentiment-bearing words in the training data were substituted by unique labels. For example, the sentence “I love car” was changed to “I like car”, since according to the General Inquirer dictionary both love and like carry positive sentiment.
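The normalization phase can be illustrated with a minimal sketch; the slang map, the sentiment-word list, and the shared label names below are invented stand-ins for the resources used in the study (such as the General Inquirer dictionary):

```python
# Illustrative tweet normalization: elongated words are squashed, slang
# and spelling variants are mapped to canonical forms, and known
# sentiment-bearing words are replaced by a shared label.
import re

SLANG = {"u": "you", "gr8": "great", "luv": "love"}       # invented map
SENTIMENT_LABEL = {"love": "positive_word", "great": "positive_word",
                   "hate": "negative_word", "awful": "negative_word"}

def normalize_tweet(text):
    tokens = []
    for tok in text.lower().split():
        tok = re.sub(r"(.)\1{2,}", r"\1\1", tok)  # "sooooo" -> "soo"
        tok = SLANG.get(tok, tok)                 # canonical spelling
        tok = SENTIMENT_LABEL.get(tok, tok)       # shared sentiment label
        tokens.append(tok)
    return " ".join(tokens)

print(normalize_tweet("u will luv this gr8 car"))
```

Collapsing lexical variants onto shared labels in this way shrinks the feature space, which is what allows the subsequent classifier to generalize across noisy tweets.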
This approach can be used for various languages with minimal linguistic processing: only tokenization is required, with no further processing, so the final system should work similarly for all languages.
A standard news translation system was used to obtain data in various languages, such as Italian, German, Spanish, and French. The original dictionary was created by translating English and Spanish texts into a third language, and dictionaries were built for fifteen different languages. The approach includes two main stages: the pre-processing step and the application of a supervised machine learning technique. A support vector machine trained with sequential minimal optimization (SVM SMO) was employed, using n-gram features, such as bigrams, extracted from the training data [
5].
The main novelty of this approach was the pre-processing step: pre-processing of Twitter texts is very important for sentiment analysis, significantly affects classifier accuracy, and normalization of tweets at this step can improve it. The main disadvantage is that accuracy was higher for English than for other languages; on languages such as Spanish and Italian, the approach did not perform well [
5].
Duwairi and Qarqaz [
19] introduced a supervised technique for sentiment analysis of Arabic tweets. The authors generated a dataset of 10,000 tweets and 500 Facebook reviews in various domains, such as news and sport. Several pre-processing techniques were used in this study, including the removal of duplicated tweets, empty tweets, and emoticon-only reviews. To determine the sentiment of the collected tweets and Facebook reviews, volunteers were asked to label each tweet or comment as positive, negative, neutral, or other.
A number of pre-processing steps, such as tokenization, stemming, bigram formation, and negation detection, were then applied to the tweets and Facebook comments. Finally, three supervised machine learning techniques were applied to the prepared dataset: the k-nearest-neighbour, Naïve Bayes, and SVM classifiers. Tenfold cross-validation was used for evaluation and showed that SVM outperformed both the k-nearest-neighbour and Naïve Bayes classifiers. A limitation of this study was that the amount of training data was rather small.
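The evaluation protocol described above can be sketched with scikit-learn; the synthetic numeric features stand in for the Arabic text features, and GaussianNB substitutes for the text-oriented Naïve Bayes variant:

```python
# Sketch of tenfold cross-validation comparing k-NN, Naive Bayes, and
# SVM classifiers, as in the evaluation described above. The synthetic
# dataset is generated for illustration only.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic binary-labelled data standing in for the tweet features.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

classifiers = {
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(kernel="linear"),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)  # tenfold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

Each classifier is trained and tested ten times on disjoint folds, so the reported mean accuracy is less sensitive to one lucky split, an important property given the small dataset noted above.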