Knowledge-Based Systems

Volume 108, 15 September 2016, Pages 50-64

A new hybrid semi-supervised algorithm for text classification with class-based semantics

https://doi.org/10.1016/j.knosys.2016.06.021

Abstract

Vector Space Models (VSM) are commonly used in language processing to represent certain aspects of natural language semantics. The semantics of VSMs comes from the distributional hypothesis, which states that words that occur in similar contexts usually have similar meanings. In our previous work, we proposed novel semantic smoothing kernels based on class-specific transformations. These kernels use class-term matrices, which can be considered a new type of VSM. By using the class as the context, these methods can extract class-specific semantics by making use of word distributions both in documents and in different classes. In this study, we adapt two of these semantic classification approaches to build a novel and high-performance semi-supervised text classification algorithm. These approaches include a Helmholtz principle based calculation of term meanings in the context of classes for the initial classification, and a supervised term weighting based semantic kernel with Support Vector Machines (SVM) for the final classification model. The approach used in the first phase is especially good at learning from very small datasets, while the approach in the second phase is particularly good at eliminating noise in relatively large and noisy training sets when building a classification model. Overall, as a semantic semi-supervised learning algorithm, our approach can effectively utilize an abundant source of unlabeled instances to improve classification accuracy significantly, especially when the amount of labeled instances is limited.

Introduction

Vector Space Models (VSM) are commonly used in language processing to represent some aspects of natural language semantics. In this model, documents are simply represented as points in a vector space, and the closeness of two points is proportional to their semantic similarity. There are several categories of VSM, including word-context, pair-pattern and term-document matrices [5]. As pointed out in [5], the semantics of VSMs comes from the distributional hypothesis, which states that words that occur in similar contexts usually have similar meanings [1]. One of our main motivations is to use a class of documents as the context. In our previous work, we proposed novel semantic smoothing kernels based on class-specific transformations [2,3]. Since these kernels use class labels explicitly, they are strictly supervised. Furthermore, they can be considered a new type of VSM consisting of term-class matrices built from the class-labeled documents. By using the class as the context, these matrices can extract class-specific semantics by making use of word distributions both in documents and in different classes. These semantic kernels can be integrated into supervised classifiers such as SVM for text classification, and they are able to outperform baseline classifiers that use the traditional document-based VSM. In this study, by combining one of these supervised semantic kernel based classification algorithms [3] with a class-meaning-based methodology for classifying unlabeled documents, inspired by [4], we propose a novel and high-performance semi-supervised text classification algorithm.
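
As a concrete illustration of such a class-term VSM, the following minimal sketch builds a term-by-class frequency matrix from labeled documents. It only reconstructs the general idea; the matrices actually used in [2,3] hold class-based meaning or weight scores rather than raw counts, and all names below are illustrative.

```python
from collections import Counter, defaultdict

def build_class_term_matrix(docs, labels):
    """Build a term-by-class frequency matrix: one row per term, one column per class.

    `docs` is a list of token lists, `labels` holds the class of each document.
    Sketch of the class-term VSM idea only; the kernels in [2,3] store
    class-based meaning/weight scores instead of raw counts.
    """
    class_term = defaultdict(Counter)              # class -> term -> frequency
    for tokens, label in zip(docs, labels):
        class_term[label].update(tokens)

    classes = sorted(class_term)
    vocab = sorted({t for counts in class_term.values() for t in counts})
    matrix = [[class_term[c][t] for c in classes] for t in vocab]
    return vocab, classes, matrix

# Example: classes, not individual documents, act as the context.
docs = [["match", "team", "goal"], ["election", "vote"], ["team", "vote"]]
labels = ["sports", "politics", "sports"]
vocab, classes, M = build_class_term_matrix(docs, labels)
```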

In machine learning applications there are two conventional strategies: supervised learning and unsupervised learning. Traditional supervised learning algorithms need a sufficiently large set of labeled data as a training set to build the classification model, which is then used to predict the class memberships of unlabeled examples. On the other hand, unsupervised learning, based solely on unlabeled samples, does not need any labeled data to learn a model; in order to train a classifier, it attempts to discover the underlying structure of the unlabeled data [6]. Massive amounts of data have accumulated on the web, particularly on blogs, forums and social networks, and this amount continues to increase day by day. Unfortunately, most of the available data does not have pre-assigned labels, which limits its use in several practical machine learning fields such as text classification, sentiment recognition and speech recognition. Moreover, it is generally time-consuming, tedious and expensive to assign labels manually. Most importantly, learning a classifier with only a few labeled training examples may not yield sufficient performance. In situations where labeled data is inadequate, many algorithms have been proposed that exploit the unlabeled data to support the learning process for better classification. Semi-supervised learning (SSL) approaches utilize not only labeled data but also unlabeled data to increase classification accuracy. Recently, SSL has become popular and gained increased attention on both academic and commercial platforms as a machine learning strategy. SSL differs from the two ordinary learning approaches in its usage of unlabeled data to mitigate the effect of insufficient labeled data on classifier accuracy. Many SSL systems have been proposed in past years, such as co-training [7], self-training [8,9], graph-based methods [10], semi-supervised support vector machines [6], EM with generative mixture models [11], and transductive support vector machines [12].

Classification of texts requires special techniques to transform unstructured text into the structured format required by classification algorithms, usually a vector of features. In this domain, documents are usually represented by terms and their corresponding frequencies. This representation is the most popular one in the literature and is named the Bag of Words (BOW) feature representation. BOW is the simplest feature representation method: each term constitutes a dimension in a vector space, independent of the other terms in the same document [13]. A bag is mathematically similar to a set, with the difference that there can be duplicate values. As in sets, the order of words is lost in the bag representation. A bag can be represented as a vector, and a group of bags can be represented as a matrix whose rows are documents and whose columns are term frequencies. This is also called the Vector Space Model (VSM). Although the VSM and BOW approach is very simple and commonly used, it has several limitations, as discussed in [2]. One of the restrictions is the assumption of independence between terms. Documents are represented only by their term frequencies, disregarding the positions of terms in the document and their semantic or syntactic links with other words. This is a significant problem since it clearly ignores multi-word expressions by separating them. Moreover, it cannot handle polysemous words (i.e., words with multiple meanings) since it treats them as a single entity. Furthermore, it maps synonymous words into different components, as discussed in [14]. In principle, as Steinbach et al. [15] mention, each class has two kinds of vocabulary: one is the “core” vocabulary, which is closely correlated to the theme of that class; the other is the “general” vocabulary, which may have similar distributions over different classes. Consequently, two documents from different classes may have many general words in common and can be categorized as similar in the BOW representation.
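
The short sketch below shows the BOW mapping described above; the vocabulary and documents are illustrative only.

```python
from collections import Counter

def bow_vector(tokens, vocab):
    """Map a token list (a 'bag') to a term-frequency vector over a fixed vocabulary.
    Word order is discarded and duplicates are counted, which is all the
    information the BOW/VSM representation keeps."""
    counts = Counter(tokens)
    return [counts[t] for t in vocab]

vocab = ["bank", "river", "money", "loan"]
d1 = bow_vector(["bank", "river", "bank"], vocab)   # [2, 1, 0, 0]
d2 = bow_vector(["bank", "money", "loan"], vocab)   # [1, 0, 1, 1]
# Stacking such vectors row by row gives the document-term matrix; note that
# the polysemous "bank" occupies a single dimension in both documents.
```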

In our recent study [2], we proposed a novel method for constructing a supervised semantic smoothing kernel for SVM, which we name the Class Meaning Kernel (CMK). The proposed method smooths the words of a document in the BOW representation by the class-based meaning values of terms. The meaning scores of words are calculated based on the Helmholtz principle from Gestalt theory [16], [17], [18], [19] in the scope of classes. According to our experimental results, CMK is superior to traditional kernels such as the linear kernel, the polynomial kernel and the RBF kernel. In another of our previous studies, we proposed a new classifier for textual data, named the Supervised Meaning Classifier (SMC) [4]. The SMC classifier uses meaning calculations based on the Helmholtz principle from Gestalt theory. In SMC, the meaningfulness of terms in the scope of classes is calculated and used for the classification of a document. According to the experimental results, this new SMC classifier outperforms Multinomial Naïve Bayes (MNB) and SVM, especially when training data is insufficient.
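
The following sketch shows one way such a Helmholtz-principle meaning score can be computed, based on the number-of-false-alarms (NFA) definition of [16], [17], [18], [19]; the exact normalizations used in CMK [2] and SMC [4] may differ, and the function name is illustrative.

```python
import math

def meaning_score(m, K, N):
    """Helmholtz-principle meaning of a term in a context (here: a class).

    m : occurrences of the term inside the class,
    K : occurrences of the term in the whole training corpus,
    N : number of context units the corpus is divided into (e.g. number of classes).

    Sketch of the NFA-based definition from [16-19]:
    NFA = C(K, m) / N^(m - 1) and meaning = -(1/m) * log(NFA).
    """
    if m == 0:
        return 0.0
    nfa = math.comb(K, m) / (N ** (m - 1))
    return -math.log(nfa) / m

# A term seen 8 times in one class out of 10 corpus occurrences (4 classes)
# is far more "meaningful" there than a term spread evenly over the classes.
print(meaning_score(m=8, K=10, N=4))   # positive: unexpectedly concentrated
print(meaning_score(m=3, K=12, N=4))   # negative: roughly what chance predicts
```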

In another recent study of ours [3], we proposed a novel approach for constructing a supervised semantic kernel for SVM, which we name the Class Weighting Kernel (CWK). The proposed method smooths a document in the BOW representation by using class-based weights of words. The weights of terms are calculated with a term weighting approach that was designed as part of the feature extraction algorithm presented in [20,21]. According to our experimental results, the classification performance of CWK is higher than that of the other commonly used kernels (i.e., the linear kernel, the polynomial kernel and the RBF kernel).
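
For orientation only, the toy sketch below produces a term-by-class weight table of the general kind that CWK relies on; it is not the weighting scheme of [20,21] (which is derived from an abstract feature extraction algorithm), and the function name and weighting rule are assumptions made purely for illustration.

```python
def class_term_weights(class_term_counts):
    """Illustrative supervised term weighting: for each term, the share of its
    total corpus frequency that falls into each class.

    `class_term_counts` maps class -> {term: frequency}. NOT the exact
    weighting of [20,21]; it only illustrates the kind of term-by-class
    weight matrix that CWK uses for smoothing.
    """
    totals = {}
    for counts in class_term_counts.values():
        for term, freq in counts.items():
            totals[term] = totals.get(term, 0) + freq
    return {c: {t: f / totals[t] for t, f in counts.items()}
            for c, counts in class_term_counts.items()}

weights = class_term_weights({"sports":   {"match": 9, "team": 6, "vote": 1},
                              "politics": {"vote": 9, "election": 5, "team": 1}})
# weights["sports"]["match"] == 1.0, weights["sports"]["vote"] == 0.1, ...
```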

Inspired by the advantages of CMK, CWK and SMC, and motivated by the fact that labeled data is often insufficient in real-life practical applications, we build a new hybrid semi-supervised combination of CWK and SMC, called the Hybrid Class Semantics Classifier (HCSC). More precisely, in this article we propose a novel non-iterative semi-supervised methodology that uses class-based meaning values and weights of terms. The suggested approach utilizes both labeled and unlabeled data in order to build a classifier. First, it smooths the terms of the labeled documents in the BOW representation (document vectors of term frequencies) by the class-based meanings of words, in the same way as CMK [2] and SMC [4]. Then, it attempts to find suitable labels for the unlabeled instances. It accomplishes this labeling by weighting the terms of the unlabeled documents in the BOW representation with the help of meaning calculations [4]. Following this, HCSC combines the original labeled data with the newly classified unlabeled data. Finally, CWK is applied to this enlarged labeled dataset in order to predict the labels of the samples in the test dataset. The smoothing process in HCSC increases the significance of important (i.e., meaningful) words specific to a particular class while reducing the importance of general terms that have a similar distribution in all classes. Since this method is used in the transformation phase of a kernel function from the input space into a feature space, it considerably reduces the effects of the above-mentioned disadvantages of BOW. We note that HCSC improves the accuracy of SVM compared to the linear kernel by giving more significance to class-specific concepts, which may be synonymous or very closely associated in the scope of a class. HCSC uses a semantic smoothing matrix in the transformation of the original space into the feature space. This semantic smoothing mechanism maps similar documents to close positions in the feature space of SVM if they are written using semantically close sets of terms on the same subject. The main novelty of our approach is the use of this meaning information both in the labeling of unlabeled data and in the smoothing process of the semantic kernel. The meanings of words are calculated based on the Helmholtz principle from Gestalt theory [16], [17], [18], [19] in the scope of both classes and documents, as in [2]. HCSC directly incorporates class information into the semantic kernel for labeled instances. Additionally, it incorporates unlabeled instances in the training process. Therefore, it can be considered a semi-supervised approach.
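
A high-level sketch of this pipeline is given below. It is a simplified stand-in, not the published implementation: the meaning computation is condensed from the earlier sketch, the SMC-style labeling is reduced to a meaning-sum vote, and the final CWK+SVM training step is only indicated in a comment. All function names are illustrative.

```python
import math
from collections import Counter, defaultdict

def class_meanings(docs, labels):
    """Class-based, Helmholtz-style meaning scores (compact form of the
    earlier sketch) so that the pipeline below is self-contained."""
    class_counts, corpus = defaultdict(Counter), Counter()
    for tokens, y in zip(docs, labels):
        class_counts[y].update(tokens)
        corpus.update(tokens)
    n = len(class_counts)
    return {c: {t: -math.log(math.comb(corpus[t], m) / n ** (m - 1)) / m
                for t, m in counts.items()}
            for c, counts in class_counts.items()}

def smc_label(tokens, scores):
    """Assign the class whose accumulated meaning scores over the document's
    terms is highest (a simplified stand-in for the SMC step of [4])."""
    return max(scores, key=lambda c: sum(scores[c].get(t, 0.0) for t in tokens))

def hcsc_enlarge(labeled, labels, unlabeled):
    """Non-iterative HCSC sketch: label the unlabeled pool once with SMC-style
    scores and return the enlarged training set. In the paper, the final model
    (SVM with the CWK semantic kernel) is then trained on this enlarged set."""
    scores = class_meanings(labeled, labels)
    pseudo = [smc_label(doc, scores) for doc in unlabeled]
    return labeled + unlabeled, labels + pseudo
```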

We performed a number of experiments on several document datasets with numerous different labeled/unlabeled/test splits of the corpus. According to our experimental results, HCSC broadly outperforms the baseline algorithms.

The first gain of our proposed solution is the classification capability of HCSC. To show the performance difference and robustness of HCSC, we perform experiments on various textual datasets with different labeled/unlabeled portions of the dataset. Experimental results demonstrate that HCSC exceeds the performance of the baselines, including the semi-supervised form of the linear kernel. In the linear kernel, the inner product between two document vectors is used as the kernel function, which exploits information about terms shared by the two documents. This approach can be categorized as a first-order method since its scope comprises just a single document. In contrast, HCSC can take advantage of the meaning values of terms in the context of classes. In this way, the semantic relation between two terms is composed of the corresponding class-based and instance-based meaning scores of these words over all classes. Consequently, if two words are significant in the same class, then the resulting semantic relatedness value between them will be greater.
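
The toy example below illustrates this contrast, assuming a term-by-class score matrix S of class-based meaning/weight values and BOW vectors d1 and d2; it shows the general form of a class-based semantic kernel, not the exact HCSC kernel.

```python
import numpy as np

def semantic_kernel(d1, d2, S):
    """K(d1, d2) = d1 · S · Sᵀ · d2: S is a |vocabulary| x |classes| matrix of
    class-based scores, so the relatedness of two terms is the inner product of
    their rows in S. Illustrative only, not the exact HCSC kernel."""
    return float(d1 @ S @ S.T @ d2)

S = np.array([[2.0, 0.1],    # "goal":  meaningful in class 0
              [1.8, 0.0],    # "match": meaningful in class 0
              [0.0, 2.5]])   # "vote":  meaningful in class 1
d1 = np.array([1.0, 0.0, 0.0])   # document containing only "goal"
d2 = np.array([0.0, 1.0, 0.0])   # document containing only "match"

print(d1 @ d2)                     # 0.0 : the linear kernel sees no similarity
print(semantic_kernel(d1, d2, S))  # 3.6 : terms significant in the same class relate
```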

The second benefit of the proposed approach is its relatively low complexity, since HCSC does not need to process a large external knowledge base such as WordNet or Wikipedia. Besides, as HCSC is built on corpus-based statistics, it is always up to date. As a result, it can be applied to any field without adjustments or parameter optimizations.

Another advantage of HCSC is that it forms a foundation that can easily be combined with other term-based similarity measures. It is also easy to incorporate similarities between terms derived from a semantic resource such as Wikipedia or WordNet.

According to the experimental results, there is an important improvement in classification accuracy when class-dependent meaning values of terms are used in SVM, compared to the performance of the baseline kernels. To the best of our knowledge, this is the first time in the literature that class-dependent meaning values of words are used both to assign labels to unlabeled instances and in the transformation phase of SVM; this also provides important insights into the semantic smoothing of words for text classification.

The rest of the paper is organized as follows: Section 2 gives a brief introduction to supervised classification algorithms, mainly SVM, and to semi-supervised algorithms for text classification in general, together with methods for building class-term matrix based VSMs, including meaningfulness calculations in the class context and class-based term weighting. Section 3 explains and analyzes the proposed kernel-based text classification algorithm. The experimental setup and the corresponding results, with some discussion points, are presented in Sections 4 and 5. Finally, we conclude the article in Section 6 and discuss probable future extensions of the current work.

Section snippets

Support vector machines for classification

SVM is an effective machine learning algorithm stemming from Statistical Learning Theory. SVM is based on the structural risk minimization principle from computational learning theory [22], [23], [24], whose basic goal is to find the optimal separating hyperplane with the maximal margin between two classes, which guarantees the lowest error for an unseen test instance. If the problem is not linearly separable, the SVM algorithm may be altered by adding a softening variable that allows some samples
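
For reference, the standard soft-margin formulation referred to here, in its usual textbook form (cf. [22], [23], [24]); the slack variables ξ_i are the "softening" variables mentioned above:

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;\; \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} \xi_{i}
\quad \text{subject to} \quad y_{i}\bigl(\mathbf{w} \cdot \phi(\mathbf{x}_{i}) + b\bigr) \ge 1 - \xi_{i}, \qquad \xi_{i} \ge 0 .
```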

Hybrid class semantics classifier (HCSC)

The linear kernel is commonly used as the kernel of choice in SVM for text classification. This is partly due to the fact that the BOW representation of documents is quite high-dimensional, usually consisting of thousands of features. The high dimensionality of text makes the other kernels, which map the input space into a higher-dimensional space in the hope of finding a better separating hyperplane, generally useless. As the simplest kernel that can be used in SVM, the linear kernel is
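
A minimal baseline of the kind HCSC is compared against is a linear-kernel SVM over raw BOW counts; the snippet below uses scikit-learn purely for illustration and is not the implementation or configuration reported in the paper.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

train_docs = ["the team won the match", "the parliament passed the vote",
              "a late goal decided the match", "the election vote was close"]
train_labels = ["sports", "politics", "sports", "politics"]

vec = CountVectorizer()
X = vec.fit_transform(train_docs)                  # sparse document-term (BOW) matrix
clf = SVC(kernel="linear").fit(X, train_labels)    # K(x, z) = x . z

print(clf.predict(vec.transform(["a tight match for the team"])))
```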

Datasets

In order to evaluate HCSC on text classification, we used several textual datasets. IMDB1 is our first dataset, which is a collection of movie reviews and contains 2000 reviews about several movies on IMDB. It has two types of labels, positive and negative; these labels are balanced in both the training and test sets in our experiments. Our other four datasets are variations of the popular 20 Newsgroup2 dataset. The 20 Newsgroup dataset is a

Experimental results and discussion

HCSC clearly outperforms all of the baseline kernels we used (i.e., linear kernel, SSL-Linear, CMK and CWK) in most of the labeled/unlabeled/test splits on SCIENCE, except at a labeled set percentage of 30%, as can be observed from Table 3. According to Table 3, at labeled set percentages of 1%, 2%, 3%, 5%, 7%, 10% and 15%, the accuracies of HCSC are 56.6%, 65.97%, 67.3%, 85.93%, 89.33%, 90.93% and 92.28%, while the accuracies of SSL-Linear are 50.58%, 57.23%, 65.28%, 68.1%, 72.65%, 75.22% and 80.8%;

Conclusions and future work

Vector Space Models (VSM) are commonly used in language processing to represent some aspects of natural language semantics. In this model, documents are simply represented as points in a space, and the proximity of two points is proportional to their semantic similarity [5]. The semantics of VSMs comes from the distributional hypothesis, which states that words that occur in similar contexts usually have similar meanings [1]. From this point of view, the hybrid semi-supervised text classification algorithm

Acknowledgment

This work is supported in part by The Scientific and Technological Research Council of Turkey (TÜBİTAK) under grant number 111E239. Points of view in this document are those of the authors and do not necessarily represent the official position or policies of TÜBİTAK.

References (76)

  • D. Zhou, Semi-supervised learning on directed graphs, Adv. Neural Inf. Process. Syst. (2004)
  • K. Nigam, Text classification from labeled and unlabeled documents using EM, Mach. Learn. (2000)
  • O. Chapelle et al., Semi-supervised classification by low density separation
  • G. Salton et al., On the specification of term values in automatic indexing, J. Doc. (1973)
  • P. Wang et al., Building semantic kernels for text classification using Wikipedia
  • M. Steinbach et al., A comparison of document clustering techniques
  • A. Balinsky et al., On the Helmholtz principle for documents processing
  • A. Balinsky et al., On the Helmholtz principle for data mining
  • A. Balinsky et al., Rapid change detection and text mining
  • H. Balinsky et al., Document sentences as a small world
  • G. Biricik et al., A new method for attribute extraction with application on text classification
  • G. Biricik et al., Abstract feature extraction for text classification, Turkish J. Elect. Eng. Comput. Sci. (2012)
  • B.E. Boser et al., A training algorithm for optimal margin classifier
  • V.N. Vapnik, The Nature of Statistical Learning Theory (1995)
  • N. Cristianini et al., An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods (2000)
  • E. Alpaydın, Introduction to Machine Learning (2004)
  • G. Siolas et al., Support vector machines based on a semantic kernel for text categorization
  • S. Goldman et al., Enhancing supervised learning with unlabeled data
  • Y. Jin et al., A semi-supervised learning algorithm based on modified self-training SVM, J. Comput. (2011)
  • N. Grira et al., Unsupervised and semi-supervised clustering: a brief survey
  • K. Nigam et al., Analyzing the effectiveness and applicability of co-training
  • I. Muslea et al., Active semi-supervised learning in robust multi-view learning
  • A. Liu et al., A self-training approach to cost sensitive uncertainty sampling, Mach. Learn. (2009)
  • B. Wang et al., Semi-supervised self-training for sentence subjectivity classification
  • M. Li et al., SETRED: self-training with editing, Advances in Knowledge Discovery and Data Mining (2005)
  • Y. Guo et al., Instance selection in semi-supervised learning
  • K. Li et al., A novel semi-supervised SVM based on tri-training
