Impressive progress has been made on the problem of text classification, but few studies have tackled sentence classification (Kozlowski and Rybinski 2019; Zelikovitz and Hirsh 2000; Cheng et al. 2014; Yin et al. 2017; Khoo et al. 2006; Kim 2014; Jurafsky and Martin 2019; Aggarwal 2018). Unlike the traditional text classification problem, sentence classification poses two main challenges. First, patterns of word co-occurrence are sparse in the feature space, because a sentence contains only a few to a dozen words. Second, text classification requires a large-scale manual labeling effort, and for sentences this task is even more burdensome because the samples are very small, which increases noise and reduces classification accuracy.
Several techniques have been proposed to tackle the challenges posed by sentence classification, including dimension reduction (Zelikovitz and Hirsh 2000; Sriram et al. 2010; Khoo et al. 2006; Bollegala et al. 2018), topic modeling (Cheng et al. 2014; Chen et al. 2011; Yang et al. 2015), clustering (Kozlowski and Rybinski 2019; Yin et al. 2017; Bollegala et al. 2018; Dai et al. 2013; Kozlowski and Rybinski 2017; Yang et al. 2019), and word embedding (Kozlowski and Rybinski 2019; Kim 2014; Lee and Dernoncourt 2016; Hill et al. 2016). Kim (2014) proposed a single-layer CNN for sentence classification and concluded that, despite little tuning of hyperparameters, unsupervised pre-training of word vectors is an important ingredient in deep learning for sentence classification.
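In outline, such a model convolves filters of a few widths over pre-trained word embeddings and applies max-over-time pooling. A minimal sketch in PyTorch follows; the vocabulary size, embedding dimension, filter widths, and filter count are illustrative placeholders, not the exact settings of Kim (2014):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceCNN(nn.Module):
    """Single convolutional layer over word embeddings, followed by
    max-over-time pooling and a linear classifier."""

    def __init__(self, vocab_size=10000, embed_dim=300, num_classes=2,
                 filter_widths=(3, 4, 5), num_filters=100):
        super().__init__()
        # In Kim-style models this embedding is initialized from
        # pre-trained word vectors (e.g., word2vec) rather than randomly.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, w) for w in filter_widths
        )
        self.fc = nn.Linear(num_filters * len(filter_widths), num_classes)

    def forward(self, token_ids):          # token_ids: (batch, seq_len)
        x = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)              # (batch, embed_dim, seq_len)
        # One feature per filter via max-over-time pooling.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))  # class logits

# Example: classify a batch of two 12-token sentences.
model = SentenceCNN()
logits = model(torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 2])
```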
Zelikovitz and Hirsh (2000) developed a method to reduce error rates in short text classification by combining labeled training data with a large body of “uncoordinated background knowledge”, a secondary corpus of unlabeled but related longer documents. They used the WHIRL method (Cohen 1998) for text classification, an information integration tool designed to query and integrate varied sources of text from the Web. Sriram et al. (2010) proposed an intuitive approach to classifying short texts in tweets using author information and text features. Yin et al. (2017) proposed a short text classification technique based on a combination of K-nearest neighbors (KNN) and hierarchical SVM classification. They used KNN to initially group the labels of the samples into subclasses and then applied an SVM as a hierarchical multi-class classifier within each group to assign labels.
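A minimal sketch of this two-stage idea follows, assuming scikit-learn with TF-IDF features; the toy data, the label-to-group mapping, and the choice of classifiers are illustrative assumptions, not the exact configuration of Yin et al. (2017):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Toy data: sentences with fine-grained labels and a hand-made mapping
# from fine labels to coarse groups (in Yin et al. the grouping itself
# is derived with KNN over the samples; here it is given for brevity).
sentences = ["the drug reduced fever", "surgery repaired the valve",
             "the scan showed a fracture", "blood test showed anemia"]
labels = ["medication", "procedure", "imaging", "lab"]
group_of = {"medication": "treatment", "procedure": "treatment",
            "imaging": "diagnostics", "lab": "diagnostics"}

vec = TfidfVectorizer()
X = vec.fit_transform(sentences)

# Stage 1: KNN predicts the coarse group of a new sentence.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, [group_of[y] for y in labels])

# Stage 2: one SVM per group assigns the fine-grained label.
svms = {}
for g in set(group_of.values()):
    idx = [i for i, y in enumerate(labels) if group_of[y] == g]
    svms[g] = LinearSVC().fit(X[idx], [labels[i] for i in idx])

x_new = vec.transform(["ct scan found a lesion"])
group = knn.predict(x_new)[0]
print(group, svms[group].predict(x_new)[0])
```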
Cheng et al. (2014) proposed a biterm topic model that captures topics in short texts from the biterms aggregated over the entire corpus, tackling the sparsity of word co-occurrence patterns. They defined a biterm as an unordered word pair co-occurring in a short text and modeled the corpus as a mixture of topics, with each biterm drawn independently from a specific topic.
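To make the definition concrete, the biterms of a short text are simply all unordered word pairs it contains; a minimal sketch of this extraction step follows (the full model then draws each biterm from a corpus-level topic mixture):

```python
from itertools import combinations

def biterms(text):
    """All unordered word pairs co-occurring in a short text."""
    words = text.lower().split()
    return {frozenset(pair) for pair in combinations(words, 2)
            if pair[0] != pair[1]}

for b in biterms("patient reports chest pain"):
    print(sorted(b))
# e.g. ['chest', 'pain'], ['patient', 'reports'], ... (6 pairs in total)
```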
Yang et al. (2015) proposed a topic model that extracts key phrases for short text classification, based on the idea that knowledge incorporation can solve the sparsity problem. Their approach extracts topics from texts by focusing on phrases in the generative process of documents. Bollegala et al. (2018) developed ClassiNet, a network of binary classifiers trained to predict missing features of a given short text. ClassiNets solve the feature sparseness problem by generalizing word co-occurrence graphs to account for implicit co-occurrences between features. Dai et al. (2013) proposed Crest, which generates topic clusters from training data with a clustering method. Crest uses this topic information to extend the representation of short texts and define a new feature space: the cosine similarities between a document and the clusters serve as augmented features of the document for classification.
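A minimal sketch of this kind of cluster-similarity augmentation follows, assuming TF-IDF features and k-means clusters; the clustering method and number of clusters are illustrative assumptions, not the exact Crest algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["fever and chills", "persistent dry cough", "chest x-ray ordered",
        "mri scheduled tomorrow", "prescribed antibiotics", "cough with fever"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# Build topic clusters from the training data.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Cosine similarity of each document to each cluster centroid
# becomes a dense block of augmented features.
sims = cosine_similarity(X, km.cluster_centers_)
X_aug = np.hstack([X.toarray(), sims])  # original + augmented features
print(X_aug.shape)  # (6, vocab_size + 2)
```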
Lee and Dernoncourt (2016) presented a model based on recurrent and convolutional neural networks that incorporates preceding short texts for sequential short text classification. The model has two parts: the first generates a vector representation for each text, and the second classifies the current text together with a few preceding short texts by feeding their vector representations into a two-layer feed-forward neural network.
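A minimal sketch of this two-part structure follows, using an LSTM encoder for the first part; the encoder choice, history length, and dimensions are illustrative assumptions rather than the architecture of Lee and Dernoncourt (2016):

```python
import torch
import torch.nn as nn

class SequentialSentenceClassifier(nn.Module):
    """Part 1 encodes each short text into a vector (here, an LSTM);
    part 2 classifies the current text from its representation
    concatenated with those of the preceding texts, via a two-layer
    feed-forward network."""

    def __init__(self, vocab_size=5000, embed_dim=100, hidden=64,
                 history=2, num_classes=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.ffnn = nn.Sequential(
            nn.Linear(hidden * (history + 1), hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes))

    def encode(self, token_ids):                  # (batch, seq_len)
        _, (h, _) = self.encoder(self.embed(token_ids))
        return h[-1]                              # (batch, hidden)

    def forward(self, texts):                     # oldest ... current
        vecs = [self.encode(t) for t in texts]
        return self.ffnn(torch.cat(vecs, dim=1))  # class logits

model = SequentialSentenceClassifier()
texts = [torch.randint(0, 5000, (1, 8)) for _ in range(3)]  # 2 prior + current
print(model(texts).shape)  # torch.Size([1, 4])
```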
Kozlowski and Rybinski (2019, 2017) used a neural network-based distributional model to enrich the semantic meaning of short texts for clustering. They proposed the SnSRC clustering algorithm, which builds on SnS (Kozlowski and Rybinski 2017), a knowledge-poor, language-independent text mining algorithm for sense induction. They trained their model using continuous bag of words with negative sampling, and computed the cosine similarity between the mean vector of the embeddings of a text and the vector of each word in the distributional model. The retrieved words with the highest semantic similarity were added as additional term features to the initial BOW text representation. In their study, especially in cases involving a specific domain language, this neural semantic enrichment of texts improved the quality of clustering.
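A minimal sketch of this enrichment step follows, assuming a gensim word2vec model (CBOW with negative sampling) trained on a toy corpus; the corpus, dimensions, and number of added terms are illustrative assumptions:

```python
import numpy as np
from gensim.models import Word2Vec

# Toy corpus; a real distributional model would be trained on a large
# domain corpus. sg=0 selects CBOW; negative=5 enables negative sampling.
corpus = [["patient", "reports", "chest", "pain"],
          ["chest", "pain", "radiates", "to", "arm"],
          ["patient", "denies", "shortness", "of", "breath"]]
w2v = Word2Vec(corpus, vector_size=50, window=2, min_count=1,
               sg=0, negative=5, seed=0)

def enrich(tokens, model, top_n=3):
    """Append the words most similar to the mean embedding of the text
    to its original bag-of-words representation."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    mean = np.mean(vecs, axis=0)
    candidates = model.wv.similar_by_vector(mean, topn=top_n + len(tokens))
    similar = [w for w, _ in candidates if w not in tokens][:top_n]
    return tokens + similar

print(enrich(["chest", "pain"], w2v))
```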
Hill et al. (2016) overcame feature sparseness in sentence representations by embedding sentences into a low-dimensional, dense space. They compared deep neural language models that compute sentence representations from unlabeled data with prevalent word representation methods, and concluded that the unsupervised BOW models delivered the best sentence representations compared with supervised ones.
Current methods for sentence and short text classification either represent texts in a lower-dimensional space to reduce feature sparseness or add data to the text to enhance the quality of the feature space. The main outstanding challenge is the construction of external knowledge repositories, a labor-intensive task in applications of domain-specific clinical text mining. We propose an approach that tackles this challenge in clinical sentence classification by deploying an unsupervised scheme that enriches the original data set through internal knowledge acquisition, where the length of each document is accounted for by a dynamic weighting mechanism. The proposed approach uses the output of the unsupervised scheme as an internal source of enrichment and does not employ any external dictionary.