Ensemble of keyword extraction methods and classifiers in text classification
Introduction
Automatic keyword extraction is the process of identifying key terms, key phrases, key segments or keywords from a document that can appropriately represent its subject (Beliga, Mestrovic, & Martincic-Ipsic, 2015). The Web is a rich and steadily growing source of information. As the number of available digital documents grows, manual keyword extraction becomes infeasible. Keyword extraction is an important research direction in text mining, natural language processing and information retrieval. Since it provides a compact representation of a document, many applications, such as automatic indexing, summarization, classification, clustering and filtering, can benefit from it (Zhang et al., 2008).
Automatic keyword generation can be broadly divided into two categories: keyword assignment and keyword extraction (Siddiqi & Sharan, 2015). In keyword assignment, a set of possible keywords is selected from a controlled vocabulary of words, whereas keyword extraction identifies the most relevant words available in the examined document (Beliga et al., 2015). Keyword extraction methods can be broadly grouped into four categories: statistical approaches, linguistic approaches, machine learning approaches and other approaches (Han & Kamber, 2006).
Text classification is an important subfield of text mining which assigns a text document to one or more predefined classes or categories. Several forms of text collections, such as news articles, digital libraries and Web pages, are important sources of information (Han & Kamber, 2006). Hence, text classification is an important research direction in library science, information science and computer science (Jain, Raghuvanshi, & Shrivastava, 2012). Many applications of text mining can be modelled as text classification problems. These applications include news filtering and organization, document organization and retrieval, opinion mining (sentiment analysis), and spam filtering (Aggarwal & Zhai, 2012).
A high-dimensional feature space is a typical challenge of text classification applications (Joachims, 2002). When all the words of the training documents are used as features, text classification becomes a computationally intensive task (Onan & Korukoğlu, 2015). Hence, the keywords of a text collection, which are the words most relevant to the content of the documents, are good candidate features for constructing the classification model (Liu & Wang, 2007; Rossi et al., 2014). Machine learning algorithms, such as Naïve Bayes, the k-nearest neighbour algorithm, support vector machines and artificial neural networks, have been successfully applied in classifying text documents (Sebastiani, 2002). Ensemble methods combine the decisions of several such learning algorithms so that a more robust classification model with higher predictive performance can be built (Dietterich, 2000).
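The use of keywords as a reduced feature set can be illustrated with a minimal sketch. It assumes a most-frequent-term criterion applied per document and a vocabulary formed as the union of the per-document keywords. The stopword list, the function names and the toy documents are hypothetical and are not taken from the evaluated framework.

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on", "are", "with"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def most_frequent_keywords(text, k):
    """Return the k most frequent tokens of a single document."""
    return [term for term, _ in Counter(tokenize(text)).most_common(k)]

def build_vocabulary(documents, k):
    """Union of the per-document keywords, sorted for a stable feature order."""
    vocab = set()
    for doc in documents:
        vocab.update(most_frequent_keywords(doc, k))
    return sorted(vocab)

def vectorize(text, vocabulary):
    """Term-frequency vector of a document restricted to the keyword vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

docs = [
    "Ensemble methods combine classifiers; ensemble methods improve accuracy.",
    "Keyword extraction selects keyword candidates from a document.",
]
vocab = build_vocabulary(docs, k=2)
vectors = [vectorize(d, vocab) for d in docs]
```

With k keywords per document, the feature space is bounded by k times the number of documents rather than by the full vocabulary of the collection, which is the dimensionality reduction the paragraph above refers to.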
Considering these issues, this paper examines the effectiveness of statistical keyword extraction methods, base learning algorithms and ensemble methods in scientific text document classification. To the best of our knowledge, this is the first attempt to empirically evaluate the effectiveness of statistical keyword extraction methods in conjunction with ensemble learning algorithms. In the comparative evaluation, five popular ensemble methods (Boosting, Bagging, Dagging, Random Subspace and Voting) are utilized. The Naïve Bayes algorithm, support vector machines, logistic regression and the Random Forest algorithm are utilized as the base learning algorithms. In the experimental analysis, the domain-independent statistical keyword extraction framework proposed by Rossi et al. (2014) is utilized. In summary, the experimental study aims to answer the following research questions:
(1) Which configuration of statistical keyword extraction, classification and ensemble learning algorithms yields the highest performance in scientific text document classification?
(2) Is there an optimal number of keywords to represent the text documents and which number of keywords obtains promising results?
The presented classification scheme, which integrates a Bagging ensemble of Random Forest with the most-frequent based keyword extraction method, yields very promising results on scientific text classification. The rest of this paper is organized as follows. Section 2 briefly reviews the literature on keyword extraction and ensemble methods. Section 3 presents the statistical keyword extraction methods utilized in the experimental evaluations. Section 4 briefly describes the classification algorithms and Section 5 describes the ensemble learning methods. Section 6 presents the experimental results, discussion and statistical analysis of the empirical results on the ACM document collection. Section 7 presents the results of the ensemble classification schemes on a larger text document collection. Finally, Section 8 presents the concluding remarks.
Literature review
This section briefly reviews the literature on keyword extraction methods and the ensemble methods.
Keyword extraction methods
Keyword extraction methods can be broadly divided into two categories: domain-dependent and domain-independent keyword extraction methods. Domain-dependent keyword extraction methods require keeping track of all the words within the text collection, whereas domain-independent keyword extraction methods do not require analysis of the entire text collection (Rossi et al., 2014). Domain-independent keyword extraction methods can have comparably high performance and do not require using
Classification algorithms
Machine learning algorithms have been successfully utilized in text classification. Machine learning classifiers can be broadly classified as decision trees (such as C4.5, ID3 and Random Forest), rule-based methods (such as RIPPER, PART and genetic algorithms), perceptron-based methods (such as artificial neural networks, radial basis function networks), statistical learning methods (such as Bayesian Networks and Naïve Bayes classifier), instance-based classifiers (such as k-nearest neighbour
Ensemble methods
Ensemble methods are a popular research direction in machine learning and pattern recognition (Onan, 2016; Ranawana & Palade, 2006). Ensemble methods aim to combine decisions from a set of weak learning algorithms (base learners) so that the accuracy and robustness of the built classification model can be enhanced. The generalization ability of ensemble methods is better than that of the single base learners. There are statistical, computational and representational reasons to build multiple
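Combining the decisions of base learners can be sketched with Bagging and majority voting. The example below is a minimal illustration under stated assumptions: the base learner is a nearest-centroid classifier chosen only for brevity (it is not one of the base learners evaluated in this paper), and the function names, data and parameter values are hypothetical.

```python
import random
from collections import Counter

def train_centroids(X, y):
    """Base learner: compute one mean vector (centroid) per class."""
    sums, counts = {}, Counter(y)
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def predict_centroids(model, x):
    """Predict the class whose centroid is nearest in squared Euclidean distance."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(model[c], x))
    return min(sorted(model), key=dist)

def bagging_fit(X, y, n_estimators=11, seed=0):
    """Train each base learner on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(train_centroids([X[i] for i in idx], [y[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Combine the base learners' decisions by majority vote."""
    votes = Counter(predict_centroids(m, x) for m in models)
    return votes.most_common(1)[0][0]

# Toy keyword-count vectors for two hypothetical classes.
X = [[3, 0], [4, 1], [5, 0], [0, 3], [1, 4], [0, 5]]
y = ["ml", "ml", "ml", "ir", "ir", "ir"]
models = bagging_fit(X, y)
```

Each base learner sees a different resample of the training data, so individual errors are less likely to coincide. The vote in `bagging_predict` is what turns the separate decisions into a single, more stable prediction.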
ACM document collection
To comprehensively evaluate the performance of statistical keyword extraction methods in scientific text classification (categorization), eight collections from the ACM Digital Library are used. In the empirical analysis, the statistical keyword extraction framework presented in Rossi et al. (2014) is adopted. All eight datasets have documents in five classes. In Table 1, the basic descriptive information (the number of classes, the
Experimental results on Reuters-21578 document collection
To better understand the performance of the ensemble learning methods in keyword-based text classification, we divide the experimental analysis into two sections. In the first section (Section 6), the predictive performance of five statistical keyword extraction methods, classification algorithms and ensemble learning methods is extensively analysed on the ACM document collection. Text classification is characterized by high dimensionality of the feature space. In the second section (Section 7),
Conclusion
This paper presents an empirical analysis of five statistical keyword extraction methods (the most frequent measure based keyword extraction, the term frequency-inverse sentence frequency based keyword extraction, the co-occurrence statistical information based keyword extraction, the eccentricity-based keyword extraction and the TextRank algorithm) in conjunction with classification algorithms and ensemble learning methods.
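As an illustration of one of these methods, the following is a minimal sketch of term frequency-inverse sentence frequency (TF-ISF) scoring. It assumes the common formulation score(t) = tf(t) · log(N / sf(t)), where N is the number of sentences in the document and sf(t) is the number of sentences containing the term t. The exact weighting used in the evaluated framework may differ, and the sentence splitter and function names are hypothetical simplifications.

```python
import math
import re
from collections import Counter

def split_sentences(text):
    """Naive sentence splitter on '.', '!' and '?' (an illustrative simplification)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]

def tf_isf_scores(text):
    """Score each term by tf(t) * log(N / sf(t)) within a single document."""
    sentences = [re.findall(r"[a-z]+", s.lower()) for s in split_sentences(text)]
    n = len(sentences)
    tf = Counter(t for s in sentences for t in s)        # occurrences in the document
    sf = Counter(t for s in sentences for t in set(s))   # sentences containing the term
    return {t: tf[t] * math.log(n / sf[t]) for t in tf}

def top_keywords(text, k):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    scores = tf_isf_scores(text)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

text = (
    "Bagging trains models on samples. "
    "Bagging averages the models. "
    "Keywords keywords keywords reduce the features."
)
keywords = top_keywords(text, 1)
```

Because the statistics are computed over the sentences of one document, the score needs no information from the rest of the collection. A term that appears in every sentence receives a score of zero, which penalizes words that do not discriminate between parts of the document.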
The main contributions of this study can be summarized as follows.
References (65)
- et al. Improving experimental studies about ensembles of classifiers for bankruptcy prediction and credit scoring. Expert Systems with Applications (2014)
- et al. A novel SVM-kNN-PSO ensemble method for intrusion detection system. Applied Soft Computing (2016)
- et al. On the use of ensemble of classifiers for accelerometer-based activity recognition. Applied Soft Computing (2015)
- et al. Sentiment analysis: Bayesian ensemble learning. Decision Support Systems (2014)
- et al. Classifier selection in ensemble using genetic algorithms for bankruptcy prediction. Expert Systems with Applications (2012)
- et al. Sentiment analysis: A combined approach. Journal of Informetrics (2009)
- et al. A novel ensemble of classifiers that use biological relevant gene sets for microarray classification. Applied Soft Computing (2014)
- et al. Predicting stock returns by classifier ensemble. Applied Soft Computing (2011)
- An improved global feature selection scheme for text classification. Expert Systems with Applications (2016)
- et al. Sentiment classification: The contribution of ensemble learning. Decision Support Systems (2014)
- Ensemble of feature sets and classification algorithms for sentiment classification. Information Sciences
- Classifying text streams by keywords using classifier ensemble. Data & Knowledge Engineering
- A survey of text classification algorithms
- A systematic comparison of supervised classifiers. PLoS One
- UCI machine learning repository
- An overview of graph-based keyword extraction methods and approaches. Journal of Information and Organizational Sciences
- Bagging predictors. Machine Learning
- Random forests. Machine Learning
- Tweet sentiment analysis with classifier ensembles. Decision Support Systems
- Ensemble methods in machine learning
- Innovative document summarization techniques: Revolutionizing knowledge understanding
- Extracting key terms from noisy and multi-theme documents
- Automatic extraction of keyword from abstracts. Lecture Notes in Computer Science
- Automatic extraction and learning of keyphrases from scientific articles. Lecture Notes in Computer Science
- Data mining: Concepts and techniques
- The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence
- Keyphrase extraction using semantic network structure analysis
- Improved automatic keyword extraction given more linguistic knowledge
- Text classification using machine learning techniques. WSEAS Transactions on Computers
- Analysis of query based text classification approach. International Journal of Advanced Research in Computer Science and Software Engineering
- Text categorization with support vector machines: Learning with many relevant features
- Learning to classify text using support vector machines