Knowledge-Based Systems

Volume 108, 15 September 2016, Pages 50-64

A new hybrid semi-supervised algorithm for text classification with class-based semantics

https://doi.org/10.1016/j.knosys.2016.06.021

Abstract

Vector Space Models (VSM) are commonly used in language processing to represent certain aspects of natural language semantics. The semantics of VSMs comes from the distributional hypothesis, which states that words that occur in similar contexts usually have similar meanings. In our previous work, we proposed novel semantic smoothing kernels based on class-specific transformations. These kernels use class-term matrices, which can be considered a new type of VSM. By using the class as the context, these methods can extract class-specific semantics by making use of word distributions both in documents and in different classes. In this study, we adapt two of these semantic classification approaches to build a novel and high-performance semi-supervised text classification algorithm. These approaches include a Helmholtz principle based calculation of term meanings in the context of classes for the initial classification, and a supervised term weighting based semantic kernel with Support Vector Machines (SVM) for the final classification model. The approach used in the first phase is especially good at learning from very small datasets, while the approach in the second phase is particularly good at eliminating noise in relatively large and noisy training sets when building a classification model. Overall, as a semantic semi-supervised learning algorithm, our approach can effectively utilize an abundant source of unlabeled instances to improve classification accuracy significantly, especially when the amount of labeled instances is limited.

Introduction

Vector Space Models (VSM) are commonly used in language processing to represent some aspects of natural language semantics. In this model, documents are simply represented as points in a vector space, and the closeness of two points is proportional to their semantic similarity. There are several categories of VSM, including word-context, pair-pattern and term-document matrices [5]. As pointed out in [5], the semantics of VSMs comes from the distributional hypothesis, which states that words that occur in similar contexts usually have similar meanings [1]. One of our main motivations is to use a class of documents as the context. In our previous work, we proposed novel semantic smoothing kernels based on class-specific transformations [2,3]. Since these kernels use class labels explicitly, they are strictly supervised. Furthermore, they can be considered a new type of VSM consisting of term-class matrices built from the class-labeled documents. By using the class as the context, these matrices can extract class-specific semantics by making use of word distributions both in documents and in different classes. These semantic kernels can be integrated into supervised classifiers such as SVM for text classification, and they are able to outperform baseline classifiers that use the traditional document-based VSM. In this study, by combining one of these supervised semantic kernel based classification algorithms [3] with a class-meaning-based methodology for classifying unlabeled documents, inspired by [4], we propose a novel and high-performance semi-supervised text classification algorithm.
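
As a concrete illustration of such a class-term VSM, the following minimal sketch builds a term-by-class frequency matrix from labeled documents. It only reconstructs the general idea; the matrices actually used in [2,3] hold class-based meaning or weight scores rather than raw counts, and all names below are illustrative.

```python
from collections import Counter, defaultdict

def build_class_term_matrix(docs, labels):
    """Build a term-by-class frequency matrix: one row per term, one column per class.

    `docs` is a list of token lists, `labels` holds the class of each document.
    Sketch of the class-term VSM idea only; the kernels in [2,3] store
    class-based meaning/weight scores instead of raw counts.
    """
    class_term = defaultdict(Counter)              # class -> term -> frequency
    for tokens, label in zip(docs, labels):
        class_term[label].update(tokens)

    classes = sorted(class_term)
    vocab = sorted({t for counts in class_term.values() for t in counts})
    matrix = [[class_term[c][t] for c in classes] for t in vocab]
    return vocab, classes, matrix

# Example: classes, not individual documents, act as the context.
docs = [["match", "team", "goal"], ["election", "vote"], ["team", "vote"]]
labels = ["sports", "politics", "sports"]
vocab, classes, M = build_class_term_matrix(docs, labels)
```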

In machine learning applications there are two conventional strategies: supervised learning and unsupervised learning. Traditional supervised learning algorithms need a sufficiently large set of labeled data as a training set to build the classification model, which is then used to predict the class memberships of unlabeled examples. On the other hand, unsupervised learning, based solely on unlabeled samples, does not need any labeled data to learn a model; in order to train a classifier, it attempts to discover the underlying structure of the unlabeled data [6]. Massive amounts of data have accumulated on the web, particularly on blogs, forums and social networks, and this amount continues to increase day by day. Unfortunately, most of the available data does not have pre-assigned labels, which limits its use in several practical machine learning fields such as text classification, sentiment recognition and speech recognition. Moreover, it is generally time-consuming, tedious and expensive to assign labels manually. Most importantly, learning a classifier with only a few labeled training examples may not yield sufficient performance. In situations where labeled data is inadequate, many algorithms have been proposed that exploit the unlabeled data to support the learning process for better classification. Semi-supervised learning (SSL) approaches utilize not only labeled data but also unlabeled data to increase classification accuracy. Recently, SSL has become popular and gained increased attention on both academic and commercial platforms as a machine learning strategy. SSL differs from the two ordinary learning approaches in its usage of unlabeled data to mitigate the effect of insufficient labeled data on classifier accuracy. Many SSL systems have been proposed in past years, such as co-training [7], self-training [8,9], graph-based methods [10], semi-supervised support vector machines [6], EM with generative mixture models [11], and transductive support vector machines [12].

Classification of texts requires special techniques to transform unstructured text into the structured format required by classification algorithms, usually a vector of features. In this domain, documents are usually represented by terms and their corresponding frequencies. This representation is the most popular one in the literature and is named the Bag of Words (BOW) feature representation. BOW is the simplest feature representation method: each term constitutes a dimension in a vector space, independent of the other terms in the same document [13]. A bag is mathematically similar to a set, with the difference that there can be duplicate values. As in sets, the order of words is lost in the bag representation. A bag can be represented as a vector, and a group of bags can be represented as a matrix whose rows are documents and whose columns are term frequencies. This is also called the Vector Space Model (VSM). Although the VSM and BOW approach is very simple and commonly used, it has several limitations, as discussed in [2]. One of the restrictions is the assumption of independence between terms. Documents are represented only by their term frequencies, disregarding the positions of terms in the document and their semantic or syntactic links with other words. This is a significant problem since it clearly ignores multi-word expressions by separating them. Moreover, it cannot handle polysemous words (i.e., words with multiple meanings) since it treats them as a single entity. Furthermore, it maps synonymous words into different components, as discussed in [14]. In principle, as Steinbach et al. [15] mention, each class has two kinds of vocabulary: one is the “core” vocabulary, which is closely correlated to the theme of that class; the other is the “general” vocabulary, which may have similar distributions over different classes. Consequently, two documents from different classes may have many general words in common and can be categorized as similar in the BOW representation.
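
The short sketch below shows the BOW mapping described above; the vocabulary and documents are illustrative only.

```python
from collections import Counter

def bow_vector(tokens, vocab):
    """Map a token list (a 'bag') to a term-frequency vector over a fixed vocabulary.
    Word order is discarded and duplicates are counted, which is all the
    information the BOW/VSM representation keeps."""
    counts = Counter(tokens)
    return [counts[t] for t in vocab]

vocab = ["bank", "river", "money", "loan"]
d1 = bow_vector(["bank", "river", "bank"], vocab)   # [2, 1, 0, 0]
d2 = bow_vector(["bank", "money", "loan"], vocab)   # [1, 0, 1, 1]
# Stacking such vectors row by row gives the document-term matrix; note that
# the polysemous "bank" occupies a single dimension in both documents.
```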

In our recent study [2], we proposed a novel method for constructing a supervised semantic smoothing kernel for SVM, which we name the Class Meaning Kernel (CMK). The proposed method smooths the words of a document in the BOW representation by the class-based meaning values of terms. The meaning scores of words are calculated based on the Helmholtz principle from Gestalt theory [16], [17], [18], [19] in the scope of classes. According to our experimental results, CMK is superior to traditional kernels such as the linear kernel, the polynomial kernel and the RBF kernel. In another of our previous studies, we proposed a new classifier for textual data, named the Supervised Meaning Classifier (SMC) [4]. The SMC classifier uses meaning calculations based on the Helmholtz principle from Gestalt theory. In SMC, the meaningfulness of terms in the scope of classes is calculated and used for the classification of a document. According to the experimental results, this new SMC classifier outperforms Multinomial Naïve Bayes (MNB) and SVM, especially when training data is insufficient.
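
The following sketch shows one way such a Helmholtz-principle meaning score can be computed, based on the number-of-false-alarms (NFA) definition of [16], [17], [18], [19]; the exact normalizations used in CMK [2] and SMC [4] may differ, and the function name is illustrative.

```python
import math

def meaning_score(m, K, N):
    """Helmholtz-principle meaning of a term in a context (here: a class).

    m : occurrences of the term inside the class,
    K : occurrences of the term in the whole training corpus,
    N : number of context units the corpus is divided into (e.g. number of classes).

    Sketch of the NFA-based definition from [16-19]:
    NFA = C(K, m) / N^(m - 1) and meaning = -(1/m) * log(NFA).
    """
    if m == 0:
        return 0.0
    nfa = math.comb(K, m) / (N ** (m - 1))
    return -math.log(nfa) / m

# A term seen 8 times in one class out of 10 corpus occurrences (4 classes)
# is far more "meaningful" there than a term spread evenly over the classes.
print(meaning_score(m=8, K=10, N=4))   # positive: unexpectedly concentrated
print(meaning_score(m=3, K=12, N=4))   # negative: roughly what chance predicts
```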

In another recent study of ours [3], we proposed a novel approach for constructing a supervised semantic kernel for SVM, which we name the Class Weighting Kernel (CWK). The proposed method smooths a document in the BOW representation by using class-based weights of words. The weights of terms are calculated with a term weighting approach that was designed as part of the feature extraction algorithm presented in [20,21]. According to our experimental results, the classification performance of CWK is higher than that of the other commonly used kernels (i.e., the linear kernel, the polynomial kernel and the RBF kernel).
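
For orientation only, the toy sketch below produces a term-by-class weight table of the general kind that CWK relies on; it is not the weighting scheme of [20,21] (which is derived from an abstract feature extraction algorithm), and the function name and weighting rule are assumptions made purely for illustration.

```python
def class_term_weights(class_term_counts):
    """Illustrative supervised term weighting: for each term, the share of its
    total corpus frequency that falls into each class.

    `class_term_counts` maps class -> {term: frequency}. NOT the exact
    weighting of [20,21]; it only illustrates the kind of term-by-class
    weight matrix that CWK uses for smoothing.
    """
    totals = {}
    for counts in class_term_counts.values():
        for term, freq in counts.items():
            totals[term] = totals.get(term, 0) + freq
    return {c: {t: f / totals[t] for t, f in counts.items()}
            for c, counts in class_term_counts.items()}

weights = class_term_weights({"sports":   {"match": 9, "team": 6, "vote": 1},
                              "politics": {"vote": 9, "election": 5, "team": 1}})
# weights["sports"]["match"] == 1.0, weights["sports"]["vote"] == 0.1, ...
```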

Inspired by the advantages of CMK, CWK and SMC, and motivated by the fact that labeled data is often insufficient in real-life practical applications, we build a new hybrid semi-supervised combination of CWK and SMC, called the Hybrid Class Semantics Classifier (HCSC). More precisely, in this article we propose a novel non-iterative semi-supervised methodology that uses class-based meaning values and weights of terms. The suggested approach utilizes both labeled and unlabeled data in order to build a classifier. First, it smooths the terms of the labeled documents in the BOW representation (document vectors of term frequencies) by the class-based meanings of words, in the same way as CMK [2] and SMC [4]. Then, it attempts to find suitable labels for the unlabeled instances. It accomplishes this labeling by weighting the terms of the unlabeled documents in the BOW representation with the help of meaning calculations [4]. Following this, HCSC combines the original labeled data with the newly classified unlabeled data. Finally, CWK is applied to this enlarged labeled dataset in order to predict the labels of the samples in the test dataset. The smoothing process in HCSC increases the significance of important (i.e., meaningful) words specific to a particular class while reducing the importance of general terms that have a similar distribution in all classes. Since this method is used in the transformation phase of a kernel function from the input space into a feature space, it considerably reduces the effects of the above-mentioned disadvantages of BOW. We note that HCSC improves the accuracy of SVM compared to the linear kernel by giving more significance to class-specific concepts, which may be synonymous or very closely associated in the scope of a class. HCSC uses a semantic smoothing matrix in the transformation of the original space into the feature space. This semantic smoothing mechanism maps similar documents to close positions in the feature space of SVM if they are written using semantically close sets of terms on the same subject. The main novelty of our approach is the use of this meaning information both in the labeling of unlabeled data and in the smoothing process of the semantic kernel. The meanings of words are calculated based on the Helmholtz principle from Gestalt theory [16], [17], [18], [19] in the scope of both classes and documents, as in [2]. HCSC directly incorporates class information into the semantic kernel for labeled instances. Additionally, it incorporates unlabeled instances in the training process. Therefore, it can be considered a semi-supervised approach.
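
A high-level sketch of this pipeline is given below. It is a simplified stand-in, not the published implementation: the meaning computation is condensed from the earlier sketch, the SMC-style labeling is reduced to a meaning-sum vote, and the final CWK+SVM training step is only indicated in a comment. All function names are illustrative.

```python
import math
from collections import Counter, defaultdict

def class_meanings(docs, labels):
    """Class-based, Helmholtz-style meaning scores (compact form of the
    earlier sketch) so that the pipeline below is self-contained."""
    class_counts, corpus = defaultdict(Counter), Counter()
    for tokens, y in zip(docs, labels):
        class_counts[y].update(tokens)
        corpus.update(tokens)
    n = len(class_counts)
    return {c: {t: -math.log(math.comb(corpus[t], m) / n ** (m - 1)) / m
                for t, m in counts.items()}
            for c, counts in class_counts.items()}

def smc_label(tokens, scores):
    """Assign the class whose accumulated meaning scores over the document's
    terms is highest (a simplified stand-in for the SMC step of [4])."""
    return max(scores, key=lambda c: sum(scores[c].get(t, 0.0) for t in tokens))

def hcsc_enlarge(labeled, labels, unlabeled):
    """Non-iterative HCSC sketch: label the unlabeled pool once with SMC-style
    scores and return the enlarged training set. In the paper, the final model
    (SVM with the CWK semantic kernel) is then trained on this enlarged set."""
    scores = class_meanings(labeled, labels)
    pseudo = [smc_label(doc, scores) for doc in unlabeled]
    return labeled + unlabeled, labels + pseudo
```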

We performed a number of experiments on several document datasets with numerous different labeled/unlabeled/test splits of the corpus. According to our experimental results, HCSC broadly outperforms the baseline algorithms.

The first gain of our proposed solution is the classification capability of HCSC. To show the performance difference and robustness of HCSC, we perform experiments on various textual datasets with different labeled/unlabeled portions of the dataset. Experimental results demonstrate that HCSC exceeds the performance of the baselines, including the semi-supervised form of the linear kernel. In the linear kernel, the inner product between two document vectors is used as the kernel function, which exploits information about terms shared by the two documents. This approach can be categorized as a first-order method since its scope comprises just a single document. In contrast, HCSC can take advantage of the meaning values of terms in the context of classes. In this way, the semantic relation between two terms is composed of the corresponding class-based and instance-based meaning scores of these words over all classes. Consequently, if two words are significant in the same class, then the resulting semantic relatedness value between them will be greater.
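
The toy example below illustrates this contrast, assuming a term-by-class score matrix S of class-based meaning/weight values and BOW vectors d1 and d2; it shows the general form of a class-based semantic kernel, not the exact HCSC kernel.

```python
import numpy as np

def semantic_kernel(d1, d2, S):
    """K(d1, d2) = d1 · S · Sᵀ · d2: S is a |vocabulary| x |classes| matrix of
    class-based scores, so the relatedness of two terms is the inner product of
    their rows in S. Illustrative only, not the exact HCSC kernel."""
    return float(d1 @ S @ S.T @ d2)

S = np.array([[2.0, 0.1],    # "goal":  meaningful in class 0
              [1.8, 0.0],    # "match": meaningful in class 0
              [0.0, 2.5]])   # "vote":  meaningful in class 1
d1 = np.array([1.0, 0.0, 0.0])   # document containing only "goal"
d2 = np.array([0.0, 1.0, 0.0])   # document containing only "match"

print(d1 @ d2)                     # 0.0 : the linear kernel sees no similarity
print(semantic_kernel(d1, d2, S))  # 3.6 : terms significant in the same class relate
```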

The second benefit of the proposed approach is its relatively low complexity, since HCSC does not need to process a large external knowledge base such as WordNet or Wikipedia. Besides, as HCSC is built on corpus-based statistics, it is always up to date. As a result, it can be applied to any field without adjustments or parameter optimizations.

Another advantage of HCSC is that it forms a foundation that can easily be combined with other term-based similarity measures. It is also easy to incorporate similarities between terms derived from a semantic resource such as Wikipedia or WordNet.

According to the experimental results, there is an important improvement in classification accuracy when class-dependent meaning values of terms are used in SVM, compared to the performance of the baseline kernels. To the best of our knowledge, this is the first time in the literature that class-dependent meaning values of words are used both to assign labels to unlabeled instances and in the transformation phase of SVM; this also provides important insights into the semantic smoothing of words for text classification.

The rest of the paper is organized as follows: Section 2 gives a brief introduction to supervised classification algorithms, mainly SVM, and to semi-supervised algorithms for text classification in general, together with methods for building class-term matrix based VSMs, including meaningfulness calculations in the class context and class-based term weighting. Section 3 explains and analyzes the proposed kernel-based text classification algorithm. The experimental setup and the corresponding results, with some discussion points, are presented in Sections 4 and 5. Finally, we conclude the article in Section 6 and discuss probable future extensions of the current work.

Section snippets

Support vector machines for classification

SVM is an effective machine learning algorithm stemming from Statistical Learning Theory. SVM is based on the structural risk minimization principle from computational learning theory [22], [23], [24], whose basic goal is to find the optimal separating hyperplane with the maximal margin between two classes, which guarantees the lowest error for an unseen test instance. If the problem is not linearly separable, the SVM algorithm may be altered by adding a softening variable that allows some samples
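
For reference, the standard soft-margin formulation referred to here, in its usual textbook form (cf. [22], [23], [24]); the slack variables ξ_i are the "softening" variables mentioned above:

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;\; \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} \xi_{i}
\quad \text{subject to} \quad y_{i}\bigl(\mathbf{w} \cdot \phi(\mathbf{x}_{i}) + b\bigr) \ge 1 - \xi_{i}, \qquad \xi_{i} \ge 0 .
```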

Hybrid class semantics classifier (HCSC)

The linear kernel is commonly used as the kernel of choice in SVM for text classification. This is partly due to the fact that the BOW representation of documents is quite high-dimensional, usually consisting of thousands of features. The high dimensionality of text makes the other kernels, which map the input space into a higher-dimensional space in the hope of finding a better separating hyperplane, generally useless. As the simplest kernel that can be used in SVM, the linear kernel is
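
A minimal baseline of the kind HCSC is compared against is a linear-kernel SVM over raw BOW counts; the snippet below uses scikit-learn purely for illustration and is not the implementation or configuration reported in the paper.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

train_docs = ["the team won the match", "the parliament passed the vote",
              "a late goal decided the match", "the election vote was close"]
train_labels = ["sports", "politics", "sports", "politics"]

vec = CountVectorizer()
X = vec.fit_transform(train_docs)                  # sparse document-term (BOW) matrix
clf = SVC(kernel="linear").fit(X, train_labels)    # K(x, z) = x . z

print(clf.predict(vec.transform(["a tight match for the team"])))
```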

Datasets

In order to evaluate HCSC on text classification, we used several textual datasets. IMDB1 is our first dataset, which is a collection of movie reviews and contains 2000 reviews about several movies on IMDB. It has two types of labels, positive and negative; these labels are balanced in both the training and test sets in our experiments. Our other four datasets are variations of the popular 20 Newsgroup2 dataset. The 20 Newsgroup dataset is a

Experimental results and discussion

HCSC clearly outperforms all of the baseline kernels we used (i.e., linear kernel, SSL-Linear, CMK and CWK) in most of the labeled/unlabeled/test splits on SCIENCE, except at a labeled set percentage of 30%, as can be observed from Table 3. According to Table 3, at labeled set percentages of 1%, 2%, 3%, 5%, 7%, 10% and 15%, the accuracies of HCSC are 56.6%, 65.97%, 67.3%, 85.93%, 89.33%, 90.93% and 92.28%, while the accuracies of SSL-Linear are 50.58%, 57.23%, 65.28%, 68.1%, 72.65%, 75.22% and 80.8%;

Conclusions and future work

Vector Space Models (VSM) are commonly used in language processing to represent some aspects of natural language semantics. In this model, documents are simply represented as points in a space, and the proximity of two points is proportional to their semantic similarity [5]. The semantics of VSMs comes from the distributional hypothesis, which states that words that occur in similar contexts usually have similar meanings [1]. From this point of view, the hybrid semi-supervised text classification algorithm

Acknowledgment

This work is supported in part by The Scientific and Technological Research Council of Turkey (TÜBİTAK) under grant number 111E239. Points of view in this document are those of the authors and do not necessarily represent the official position or policies of TÜBİTAK.

References (76)

  • D. Zhou, Semi-supervised learning on directed graphs, Adv. Neural Inf. Process. Syst. (2004)
  • K. Nigam, Text classification from labeled and unlabeled documents using EM, Mach. Learn. (2000)
  • O. Chapelle et al., Semi-supervised classification by low density separation
  • G. Salton et al., On the specification of term values in automatic indexing, J. Doc. (1973)
  • P. Wang et al., Building semantic kernels for text classification using Wikipedia
  • M. Steinbach et al., A comparison of document clustering techniques
  • A. Balinsky et al., On the Helmholtz principle for documents processing
  • A. Balinsky et al., On the Helmholtz principle for data mining
  • A. Balinsky et al., Rapid change detection and text mining
  • H. Balinsky et al., Document sentences as a small world
  • G. Biricik et al., A new method for attribute extraction with application on text classification
  • G. Biricik et al., Abstract feature extraction for text classification, Turkish J. Elect. Eng. Comput. Sci. (2012)
  • B.E. Boser et al., A training algorithm for optimal margin classifier
  • V.N. Vapnik, The Nature of Statistical Learning Theory (1995)
  • N. Cristianini et al., An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods (2000)
  • E. Alpaydın, Introduction to Machine Learning (2004)
  • G. Siolas et al., Support vector machines based on a semantic kernel for text categorization
  • S. Goldman et al., Enhancing supervised learning with unlabeled data
  • Y. Jin et al., A semi-supervised learning algorithm based on modified self-training SVM, J. Comput. (2011)
  • N. Grira et al., Unsupervised and semi-supervised clustering: a brief survey
  • K. Nigam et al., Analyzing the effectiveness and applicability of co-training
  • I. Muslea et al., Active semi-supervised learning in robust multi-view learning
  • A. Liu et al., A self-training approach to cost sensitive uncertainty sampling, Mach. Learn. (2009)
  • B. Wang et al., Semi-supervised self-training for sentence subjectivity classification
  • M. Li et al., SETRED: self-training with editing, Advances in Knowledge Discovery and Data Mining (2005)
  • Y. Guo et al., Instance selection in semi-supervised learning
  • K. Li et al., A novel semi-supervised SVM based on tri-training
