A novel ACO–GA hybrid algorithm for feature selection in protein function prediction
Introduction
Protein function prediction is an important problem in functional genomics. Proteins are large molecules that perform nearly all of the functions of a cell in a living organism (Alberts et al., 2002). The primary sequence of a protein consists of a long sequence of amino acids. Proteins are the most essential and versatile macromolecules of life, and the knowledge of their functions is a crucial link in the development of new drugs, better crops, and even the development of synthetic biochemical such as biofuels.
Over the past few decades, major advances in the field of molecular biology, coupled with advances in genomic technologies, have led to an explosive growth in the biological information generated by the scientific community. Although the number of proteins with known sequence has grown exponentially in the last few years, due to rapid advances in genome sequencing technology, the number of proteins with known structure and function has grown at a substantially lower rate (Freitas & de Carvalho, 2007).
Searching for similar sequences in protein databases is a common approach used in the prediction of a protein function. The objective of this search is to find a similar protein whose function is known and assigning its function to the new protein. Despite the simplicity and usefulness this method in a large number of situations, it has also some limitations (Freitas & de Carvalho, 2007). For instance, two proteins might have very similar sequences and perform different functions, or have very different sequences and perform a similar function. Additionally, the proteins being compared may be similar in regions of the sequence that are not determinants of their function.
Another approach that may be used alternatively or in complement to the similarity-based approach is to build a model for predictive classification. The goal of such a model is to classify data instances into one of a predefined set of classes or categories. In this approach a feature vector represents each protein, a learning algorithm captures the most important relationships between the features, and the classes present in the dataset. A major problem in protein datasets is the high dimensionality of the feature space. Most of these dimensions are not relative to protein function; even some noise data hurt the performance of the classifier. Hence, we need to select some representative features from the original feature space to reduce the dimensionality of feature space and improve the efficiency and performance of classifier.
Feature selection (FS) is of considerable importance in bioinformatics (Basiri et al., 2008, Saeys et al., 2007), signal processing (Nemati, Boostani, & Jazi, 2008), text categorization (Aghdam, Ghasem-Aghaee, & Basiri, 2008), data mining and pattern recognition (Jensen, 2005). Among too many methods that are proposed for FS, population-based optimization algorithms such as genetic algorithm (GA), ant colony optimization (ACO) and particle swarm optimization (PSO) have attracted a lot of attention (Basiri et al., 2008, Liu et al., 2004, Punch et al., 1993). These methods are stochastic optimization techniques attempt to achieve better solutions by application of knowledge from previous iterations.
Genetic algorithms are optimization techniques based on the mechanism of natural selection. They used operations found in natural genetics to guide itself through the paths in the search space (Siedlecki & Sklansky, 1989). Because of their advantages, recently, GAs have been widely used as a tool for feature selection in data mining (Srinivas & Patnik, 1994).
ACO is a branch of newly developed form of artificial intelligence called Swarm Intelligence (SI). ACO algorithm is inspired by social behavior of ant colonies. Although they have no sight, ants are capable of finding the shortest route between a food source and their nest by chemical materials called pheromone that they leave when moving (Liu, Abbass, & McKay, 2004).
ACO algorithm was firstly used for solving Traveling Salesman Problem (TSP) (Dorigo, Maniezzo, & Colorni, 1996) and then has been successfully applied to a large number of difficult problems like the Quadratic Assignment Problem (QAP) (Maniezzo & Colorni, 1999), routing in telecommunication networks, graph coloring problems, scheduling, etc. This method is particularly attractive for feature selection, as there seems to be no heuristic that can guide search to the optimal minimal subset every time (Aghdam et al., 2008).
In our previous work (Basiri et al., 2008), we have proposed an ACO algorithm for feature selection in prediction postsynaptic activity of proteins. In this paper, we intend to hybridize ACO and genetic algorithms to obtain their excellent features by synthesizing them. More specifically, ACO offers a critical advantage of local searching, not found in GA. On the other hand, GA considers a global perspective by operating on the complete population from the very beginning. Therefore, ACO and GA can nullify each others drawbacks when hybridized.
The rest of this paper is organized as follows. Section 2 presents a brief overview of protein function prediction. Feature selection methods are shortly discussed in Section 3. Ant colony optimization is described in Section 4. Genetic algorithms are addressed in Section 5. Section 6 explains the proposed hybrid feature selection algorithm. Section 7 reports computational experiments. It also includes a brief discussion of the results obtained and finally the conclusion and future works are offered in the last section.
Section snippets
Protein function prediction
Proteins are the most essential and versatile macromolecules of life and serve as building blocks and functional components of a cell, and account for the second largest fraction of the cellular weight after water. They are polypeptides, formed within cells as a linear chain of amino acids (Setubal & Meidanis, 1999). Twenty different amino acids are available, which are denoted by 20 different letters of the alphabet and a linear sequence of these amino acids is known as the primary structure.
Feature selection approaches
During the last decade, application of feature selection (FS) techniques in bioinformatics has become a real prerequisite for model building. In particular, the high dimensional nature of many modeling tasks in bioinformatics, going from sequence analysis over microarray analysis to spectral analysis and literature mining has given rise to a wealth of feature selection techniques being presented in the field (Blum & Dorigo, 2004).
Feature selection is a problem of global combinatorial
Ant colony optimization
Ant colony optimization (ACO) algorithms were introduced by Marco Dorigo (Dorigo et al., 1992) in the early 1990s. While moving, ants leave a chemical pheromone trail on the ground. Ants are guided by pheromone smell. Ants tend to choose the paths marked by the strongest pheromone concentration. The indirect communication between the ants via pheromone trails enables them to find shortest paths between their nest and food sources (Dorigo, Bonaneau, & Theraulaz, 2000).
The basic idea of ACO is to
Genetic algorithm (GA)
Genetic algorithms belong to a class of population-based stochastic search algorithms that are inspired from principles of natural evolution known as evolutionary algorithms (EA) (Choenauer & Michalewicz, 1997). GA is based on the principle of “survival of fittest”, as in the natural phenomena of genetic inheritance and Darwinian strife for survival. These algorithms are general-purpose optimization algorithms with a probabilistic component that provide a means to search poorly understood,
Proposed ACO–GA algorithm
As mentioned earlier, in this paper we intend to hybridize ACO and GA in such a manner that they complement each other for feature selection in protein function prediction. The main steps of proposed feature selection algorithm are shown in Fig. 3.
ACO and GA are used to explore the space of all subsets of given feature set. The performance of selected feature subsets is measured by invoking an evaluation function with the corresponding reduced feature space and measuring the specified
Experimental results
A series of experiments was conducted to show the utility of proposed feature selection algorithm. All experiments have been run on a machine with 3.0 GHz CPU and 512 MB of RAM. We implement proposed ACO–GA, ACO-based and GA-based feature selection algorithms in Matlab R2006a. The operating system was Windows XP Professional. For experimental studies, we have considered two datasets; GPCR-PROSITE and ENZYME-PROSITE. The following sections describe these two datasets and implementation results.
Conclusion and future research
In this paper, we present a hybrid ACO–GA feature selection algorithm and adapt it for hierarchical classification of biological data in a Top-Down manner. This algorithm, ACO–GA, was compared with an ordinary ACO-based algorithm and a classical genetic algorithm for hierarchical classification of proteins. Proposed algorithm has the ability to converge quickly; it has a strong search capability in the problem space and can efficiently find minimal feature subset. Experimental results
References (36)
- et al.
Ant algorithms and stigmergy
Future Generation Computer Systems
(2000) - et al.
Ant colony optimization theory: A survey
Theoretical Computer Science
(2005) - et al.
A note on genetic algorithms for large-scale feature selection
Pattern Recognition Letters
(1989) - et al.
Feature selection based on rough sets and particle swarm optimization
Pattern Recognition Letters
(2007) - Aghdam, M. H., Ghasem-Aghaee, N., & Basiri, M. E. (2008). Application of ant colony optimization for feature selection...
- et al.
The molecular biology of the cell
(2002) - Basiri, M. E., Ghasem-Aghaee, N., & Aghdam, M. H. (2008). Using ant colony optimization-based selected features for...
- et al.
The hyper-cube framework for ant colony optimization
IEEE Transaction on Systems, Man, and Cybernetics – Part B
(2004) - et al.
Evolutionary computation control and cybernetics
Proceedings of the IEEE
(1997) - Correa, S., Freitas, A. A., & Johnson, C. G. (2007). Particle Swarm and Bayesian networks applied to attribute...
Ant system: Optimization by a colony of cooperating agents
IEEE Transaction on Systems, Man, and Cybernetics – Part B
Pattern classification and scene analysis
A tutorial on hierarchical classification with applications in bioinformatics
Research and Trends in Data Mining Technologies and Applications
Cited by (193)
Feature selection in high dimensional data: A specific preordonnances-based memetic algorithm
2023, Knowledge-Based SystemsA robust feature selection method based on meta-heuristic optimization for speech emotion recognition
2024, Evolutionary IntelligenceHybrid ACO-CI Algorithm for Beam Design Problems
2024, SN Computer ScienceEvolutionary feature selection based on hybrid bald eagle search and particle swarm optimization
2024, Intelligent Data AnalysisEmotion recognition from speech signals using digital features optimization by diversity measure fusion
2024, Journal of Intelligent and Fuzzy Systems