Feature selection algorithms for the generation of multiple classifier systems and their application to handwritten word recognition

https://doi.org/10.1016/j.patrec.2004.05.002

Abstract

The study of multiple classifier systems has recently become an area of intensive research in pattern recognition. In handwriting recognition, too, systems combining several classifiers have been investigated. In this paper, new methods for the creation of classifier ensembles based on feature selection algorithms are introduced. These new methods are evaluated and compared to existing approaches in the context of handwritten word recognition, using a hidden Markov model recognizer as the base classifier.

Introduction

The field of off-line handwriting recognition has been a topic of intensive research for many years. At first, only the recognition of isolated handwritten characters was investigated (Suen et al., 1992); later, whole words (Simon, 1992) were addressed. Most of the systems reported in the literature to date consider constrained recognition problems based on vocabularies from specific domains, e.g. postal addresses (Kaltenmeier et al., 1993; Kim et al., 1999) or amounts on bank checks (Impedovo et al., 1997). Free handwriting recognition, without domain-specific constraints and with large vocabularies, has been addressed only recently in a few papers (Kim et al., 1999; Marti and Bunke, 2001). The recognition rate of such systems is still low, and there is a need to improve it.

The combination of multiple classifiers has been shown to improve recognition performance on difficult classification problems (Kittler and Roli, 2000). In handwriting recognition in particular, classifier combination has often been applied; examples are given in (Huang and Suen, 1993; Lee and Srihari, 1993; Lee and Gomes, 1997; Maruyama and Nakano, 2000; Xu et al., 1992).

Recently a number of classifier combination methods, called ensemble methods, have been proposed in the field of machine learning (Bauer and Kohavi, 1999, Opitz and Maclin, 1999). Given a single classifier, called the base classifier, a set of classifiers can be automatically generated by changing the training set (Breiman, 1996), the input features (Ho, 1998), the input data by injecting randomness, or the parameters and architecture of the classifier. A summary of such methods is given in (Dietterich, 2000).

In this paper we propose a new ensemble creation method that is based on the selection of particular subsets of features. The problem of feature selection has been a focus of research in pattern recognition for a long time (Jain and Zongker, 1997). In statistical pattern recognition (Jain et al., 2000) features are typically vectors of real numbers, whereas in structural pattern recognition (Fu, 1982) symbolic data structures, such as strings, trees, and graphs, are used. In this paper we follow the statistical approach.

In general, the choice of features is application dependent and often very difficult. A way to address this problem is to start with a set of candidate features and either modify or reduce them. The modification of a set of features includes transformations of the feature space, for example, Principal Component Analysis (Jolliffe, 1986). There exist many ways to reduce a given set of features (Kittler et al., 2001), and they all need an objective function to assess the quality of the considered feature subset. Often a validation set is used on which the recognition rate is measured. With such an objective function, the selection of the best subset of features becomes a search problem. An optimal approach is to calculate the objective function for all possible subsets, but this is only feasible for a rather small number of features. Branch and bound algorithms (Narendra and Fukunaga, 1977) use a priori knowledge to reduce the search space. However, often no such a priori knowledge is available, and a suboptimal approach, such as sequential forward/backward search (Devijver and Kittler, 1982), floating search (Pudil et al., 1994) or oscillating search (Somol and Pudil, 2000), is used. An ensemble method using feature selection was introduced in (Alkoot and Kittler, 2000). In this method, a classifier always selects the single feature that leads to the best recognition rate of the whole classifier ensemble. In contrast to the algorithm proposed in the current paper, a feature may be used by only one classifier, and the selection of features is not reversible, i.e. features that have been selected cannot be abandoned at a later time. In (Oza and Tumer, 2001) a method was presented that creates one classifier per class; each such classifier uses the set of features best able to discriminate its class from all the others.
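To make the search problem concrete, the following is a minimal sketch of sequential forward search, one of the suboptimal strategies mentioned above. The objective function here is a toy stand-in for the validation-set recognition rate used in the paper:

```python
def sequential_forward_search(features, objective, k):
    """Greedy forward selection: grow a subset one feature at a time,
    always adding the feature that maximizes the objective."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: objective(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy objective: reward "useful" features, lightly penalize subset size.
target = {0, 2, 5}
score = lambda subset: len(target & set(subset)) - 0.01 * len(subset)
print(sequential_forward_search(range(8), score, 3))  # → [0, 2, 5]
```

Note that this greedy scheme never revisits a choice; the floating and oscillating variants cited above exist precisely to undo earlier additions when that helps.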

The contribution of the present paper is twofold. First, to the knowledge of the authors, it is the first study that applies classifier ensemble creation methods using feature subsets in the field of handwriting recognition. Second, we propose a new method for classifier ensemble creation based on feature subsets. The new algorithm applies feature subset search algorithms that allow individual features to be added and removed, in order to find useful subsets of features on which individual classifiers are trained. Multiple search algorithms are incorporated to increase the diversity of the resulting classifiers. Four different objective functions for assessing the quality of a feature set are proposed. In a series of experiments, the new classifier ensemble creation method is compared to single classifiers and to multiple classifier systems generated by previously proposed methods.

The rest of the paper is organized as follows. Section 2 provides a review of some classical ensemble methods. In Section 3 the basic idea underlying the new ensemble creation method is described. The following section introduces some refinements to the basic algorithm. In Section 5 a description of the feature selection algorithm used in the experiments is given. Section 6 contains information about the handwritten word recognizer, which is used as the base classifier in the experiments. Those experiments are then discussed in Section 7 and, finally, conclusions are drawn in Section 8.

Section snippets

Classical ensemble methods

In this section the classical ensemble methods used in the multiple classifier systems introduced in this paper are described. Those methods were adopted, some with minor modifications, from the literature. Each method takes a base classifier and a training set as input and returns a number of trained instances of the base classifier as a result. More information about ensemble methods may be found in (Bauer and Kohavi, 1999, Opitz and Maclin, 1999).
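As an illustration of the interface just described (base classifier and training set in, trained instances out), bagging (Breiman, 1996), one of the classical methods reviewed here, can be sketched in a few lines. The majority-vote base learner is a toy stand-in, not the paper's HMM recognizer:

```python
import random

def bagging(train, base_learner, n_classifiers, seed=0):
    """Bagging: train each ensemble member on a bootstrap replicate
    of the training set (sampled with replacement, same size)."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(n_classifiers):
        replicate = [rng.choice(train) for _ in train]
        ensemble.append(base_learner(replicate))
    return ensemble

# Toy base learner: always predicts the majority label of its training data.
def majority_learner(data):
    labels = [y for _, y in data]
    vote = max(set(labels), key=labels.count)
    return lambda x: vote

train = [((i,), i % 2) for i in range(10)]
ensemble = bagging(train, majority_learner, n_classifiers=5)
print(len(ensemble))  # 5 trained instances of the base classifier
```

Diversity comes purely from the resampled training sets; the base learner and its input features are identical across all members.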

New ensemble creation method

One of the most successful ensemble creation algorithms is the random subspace method (Ho, 1998). As described in the last section, each individual classifier uses its own subset of the given features for both training and testing. The subsets contain elements randomly chosen out of the whole feature set, where the size of the feature subset is usually the same for each classifier in the ensemble. The main goal of the algorithm is not to select good subsets of features, but to create diverse
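The subset-generation step of the random subspace method can be sketched as follows; the subsets are purely random and of fixed size, which is exactly the property the new method replaces with guided search:

```python
import random

def random_subspace(features, n_classifiers, subset_size, seed=0):
    """Random subspace method (Ho, 1998): each classifier receives a
    random, fixed-size subset of the full feature set; both training
    and testing of that classifier use only these features."""
    rng = random.Random(seed)
    return [sorted(rng.sample(features, subset_size))
            for _ in range(n_classifiers)]

subsets = random_subspace(list(range(20)), n_classifiers=4, subset_size=10)
for s in subsets:
    print(s)
```

Each printed subset would then define the feature view of one ensemble member.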

Enhancements to the proposed ensemble creation method

As mentioned before, a good ensemble consists of classifiers that are as diverse as possible. One way to create diversity among the classifiers produced by the algorithm of Table 1 is to use one of the objective functions fi introduced in the last section. Another way is to use different feature subset search algorithms. The aim of the ensemble creation method proposed in this paper is to create several classifiers which use well-performing subsets of the original features. How the individual

Using the new ensemble creation method in conjunction with modified “plus 1–take away 1” algorithm

In the previous sections, the particular feature selection algorithm to be used for classifier ensemble generation was left unspecified. In this section we describe the selection algorithm actually adopted in our system. It is a modified version of the “plus 1–take away 1” algorithm (Pudil et al., 1994). The algorithm is changed so that a feature is only added or removed from the working feature set if the objective function is increased by that action. Like in the floating search method (Pudil
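The modified search rule (a step is committed only if the objective strictly increases) can be sketched as follows. The objective function is again a toy stand-in for the validation-set recognition rate; the structure of the add/remove alternation follows the description above:

```python
def plus1_takeaway1(features, objective, max_iter=50):
    """Modified 'plus 1 - take away 1' search: alternately try to add
    the best single feature and remove the least useful one, but commit
    a step only if it strictly increases the objective (unlike the
    classical variant, which always performs both steps)."""
    current, best = [], objective([])
    for _ in range(max_iter):
        improved = False
        # Plus-1 step: tentatively add the candidate with the highest gain.
        candidates = [f for f in features if f not in current]
        if candidates:
            f = max(candidates, key=lambda f: objective(current + [f]))
            if objective(current + [f]) > best:
                current.append(f)
                best = objective(current)
                improved = True
        # Take-away-1 step: tentatively drop the feature whose removal helps most.
        if current:
            f = max(current,
                    key=lambda f: objective([g for g in current if g != f]))
            reduced = [g for g in current if g != f]
            if objective(reduced) > best:
                current, best = reduced, objective(reduced)
                improved = True
        if not improved:       # neither step helped: search has converged
            break
    return current, best

# Toy objective: reward features 1 and 3, lightly penalize subset size.
score = lambda s: len({1, 3} & set(s)) - 0.1 * len(s)
subset, value = plus1_takeaway1(list(range(5)), score)
print(sorted(subset), round(value, 2))  # → [1, 3] 1.8
```

Because a removal is only accepted when it improves the objective, the working set can shrink again after growing, which is what distinguishes this family of methods from plain forward search.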

Handwritten word recognizer

The application considered in this paper is the off-line recognition of cursively handwritten words. An HMM-based recognizer, similar to the one described in (Marti and Bunke, 2001), is used as the base classifier. We assume that each handwritten word input to the recognizer has been normalized with respect to slant, skew, baseline location and height (for details of the normalization procedures see (Marti and Bunke, 2001)). A sliding window of one pixel width is moved from left
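The sliding-window scheme turns a word image into a left-to-right sequence of feature vectors, which is what the HMM consumes. The sketch below illustrates the mechanism on a binary image; the three per-column features shown are illustrative stand-ins, not the recognizer's actual feature set:

```python
def column_features(img):
    """Per-column features from a binary word image (list of rows,
    1 = ink). The window is one pixel wide and moves left to right;
    each position yields one feature vector, so the word becomes an
    observation sequence for the HMM."""
    height, width = len(img), len(img[0])
    seq = []
    for x in range(width):
        ink = [y for y in range(height) if img[y][x]]
        n = len(ink)                         # number of ink pixels
        top = ink[0] if ink else height      # upper contour position
        bottom = ink[-1] if ink else 0       # lower contour position
        seq.append((n, top, bottom))
    return seq

img = [[0, 1, 0],
       [1, 1, 0],
       [1, 0, 0]]
print(column_features(img))  # → [(2, 1, 2), (2, 0, 1), (0, 3, 0)]
```

Feature selection in the ensemble method then operates on which of these per-window features each classifier is allowed to see.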

Experiments

The new classifier ensemble generation method proposed in this paper was implemented and experimentally evaluated. For the experiments, words from a part of the handwritten sentence database described in (Marti and Bunke, 2002; Zimmermann and Bunke, 2002) were used. The training set contains 8795 words, while the validation and test sets consist of 1066 words each. The vocabulary underlying the data set comprises 2296 words. A few sample words used in the experiments are shown in Appendix A. The

Conclusions

A new ensemble creation method for the construction of multiple classifier systems was introduced. The new method applies feature subset search algorithms in order to find different subsets of the given features. Each of these subsets is then used by a classifier. By selecting well-performing feature subsets rather than just random ones, as is done in the random subspace method (Ho, 1998), an improved performance of the ensemble is achieved. Four new objective functions were introduced, which

Acknowledgements

The research was supported by the Swiss National Science Foundation (No. 20-52087.97). The authors thank Dr. Urs-Victor Marti for providing the handwritten word recognizer and Matthias Zimmermann for the segmentation of a part of the IAM database. Additional funding was provided by the Swiss National Science Foundation NCCR program “Interactive Multimodal Information Management (IM)2” in the Individual Project “Scene Analysis”. A preliminary version of the paper appeared in (Günter and Bunke,

References (39)

  • Y. Freund

    Boosting a weak learning algorithm by majority

    Inform. Comput.

    (1995)
  • Y. Freund et al.

    A decision-theoretic generalization of on-line learning and an application to boosting

    J. Comput. Syst. Sci.

    (1997)
  • P. Pudil et al.

    Floating search methods in feature selection

    Pattern Recognition Lett.

    (1994)
  • Alkoot, F., Kittler, J., 2000. Feature selection for an ensemble of classifiers, in: Proc. 4th World Multiconf....
  • E. Bauer et al.

    An empirical comparison of voting classification algorithms: Bagging, boosting, and variants

    Mach. Learning J.

    (1999)
  • L. Breiman

    Bagging predictors

    Mach. Learning

    (1996)
  • P.A. Devijver et al.

    Pattern Recognition: A Statistical Approach

    (1982)
  • Dietterich, T.G., 2000. Ensemble methods in machine learning. in: (Kittler and Roli, 2000). pp....
  • K. Fu

    Syntactic Pattern Recognition and Applications

    (1982)
  • Günter, S., Bunke, H., 2002. Creation of classifier ensembles for handwritten word recognition using feature selection...
  • T.K. Ho

    The random subspace method for constructing decision forests

    IEEE Trans. Pattern Anal. Mach. Intell.

    (1998)
  • Huang, Y., Suen, C., 1993. Combination of multiple classifiers with measurement values, in: Proc. of 2nd Int. Conf....
  • Hull, D., 1993. Using statistical testing in the evaluation of retrieval experiments. in: Research and Development in...
  • A. Jain et al.

    Feature selection: Evaluation, application, and small sample performance

    IEEE Trans. Pattern Anal. Mach. Intell.

    (1997)
  • A. Jain et al.

    Statistical pattern recognition: A review

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2000)
  • I.T. Jolliffe

    Principal Component Analysis

    (1986)
  • Kaltenmeier, A., Caesar, T., Gloger, J., Mandler, E., 1993. Sophisticated topology of hidden Markov models for cursive...
  • G. Kim et al.

    Architecture for handwritten text recognition systems
