A novel ACO–GA hybrid algorithm for feature selection in protein function prediction

doi:10.1016/j.eswa.2009.04.023

Expert Systems with Applications

Volume 36, Issue 10, December 2009, Pages 12086-12094

https://doi.org/10.1016/j.eswa.2009.04.023 Get rights and content

Abstract

Protein function prediction is an important problem in functional genomics. Typically, protein sequences are represented by feature vectors. A major problem of protein datasets that increase the complexity of classification models is their large number of features. Feature selection (FS) techniques are used to deal with this high dimensional space of features. In this paper, we propose a novel feature selection algorithm that combines genetic algorithms (GA) and ant colony optimization (ACO) for faster and better search capability. The hybrid algorithm makes use of advantages of both ACO and GA methods. Proposed algorithm is easily implemented and because of use of a simple classifier in that, its computational complexity is very low. The performance of proposed algorithm is compared to the performance of two prominent population-based algorithms, ACO and genetic algorithms. Experimentation is carried out using two challenging biological datasets, involving the hierarchical functional classification of GPCRs and enzymes. The criteria used for comparison are maximizing predictive accuracy, and finding the smallest subset of features. The results of experiments indicate the superiority of proposed algorithm.

Introduction

Protein function prediction is an important problem in functional genomics. Proteins are large molecules that perform nearly all of the functions of a cell in a living organism (Alberts et al., 2002). The primary sequence of a protein consists of a long sequence of amino acids. Proteins are the most essential and versatile macromolecules of life, and the knowledge of their functions is a crucial link in the development of new drugs, better crops, and even the development of synthetic biochemical such as biofuels.

Over the past few decades, major advances in the field of molecular biology, coupled with advances in genomic technologies, have led to an explosive growth in the biological information generated by the scientific community. Although the number of proteins with known sequence has grown exponentially in the last few years, due to rapid advances in genome sequencing technology, the number of proteins with known structure and function has grown at a substantially lower rate (Freitas & de Carvalho, 2007).

Searching for similar sequences in protein databases is a common approach used in the prediction of a protein function. The objective of this search is to find a similar protein whose function is known and assigning its function to the new protein. Despite the simplicity and usefulness this method in a large number of situations, it has also some limitations (Freitas & de Carvalho, 2007). For instance, two proteins might have very similar sequences and perform different functions, or have very different sequences and perform a similar function. Additionally, the proteins being compared may be similar in regions of the sequence that are not determinants of their function.

Another approach that may be used alternatively or in complement to the similarity-based approach is to build a model for predictive classification. The goal of such a model is to classify data instances into one of a predefined set of classes or categories. In this approach a feature vector represents each protein, a learning algorithm captures the most important relationships between the features, and the classes present in the dataset. A major problem in protein datasets is the high dimensionality of the feature space. Most of these dimensions are not relative to protein function; even some noise data hurt the performance of the classifier. Hence, we need to select some representative features from the original feature space to reduce the dimensionality of feature space and improve the efficiency and performance of classifier.

Feature selection (FS) is of considerable importance in bioinformatics (Basiri et al., 2008, Saeys et al., 2007), signal processing (Nemati, Boostani, & Jazi, 2008), text categorization (Aghdam, Ghasem-Aghaee, & Basiri, 2008), data mining and pattern recognition (Jensen, 2005). Among too many methods that are proposed for FS, population-based optimization algorithms such as genetic algorithm (GA), ant colony optimization (ACO) and particle swarm optimization (PSO) have attracted a lot of attention (Basiri et al., 2008, Liu et al., 2004, Punch et al., 1993). These methods are stochastic optimization techniques attempt to achieve better solutions by application of knowledge from previous iterations.

Genetic algorithms are optimization techniques based on the mechanism of natural selection. They used operations found in natural genetics to guide itself through the paths in the search space (Siedlecki & Sklansky, 1989). Because of their advantages, recently, GAs have been widely used as a tool for feature selection in data mining (Srinivas & Patnik, 1994).

ACO is a branch of newly developed form of artificial intelligence called Swarm Intelligence (SI). ACO algorithm is inspired by social behavior of ant colonies. Although they have no sight, ants are capable of finding the shortest route between a food source and their nest by chemical materials called pheromone that they leave when moving (Liu, Abbass, & McKay, 2004).

ACO algorithm was firstly used for solving Traveling Salesman Problem (TSP) (Dorigo, Maniezzo, & Colorni, 1996) and then has been successfully applied to a large number of difficult problems like the Quadratic Assignment Problem (QAP) (Maniezzo & Colorni, 1999), routing in telecommunication networks, graph coloring problems, scheduling, etc. This method is particularly attractive for feature selection, as there seems to be no heuristic that can guide search to the optimal minimal subset every time (Aghdam et al., 2008).

In our previous work (Basiri et al., 2008), we have proposed an ACO algorithm for feature selection in prediction postsynaptic activity of proteins. In this paper, we intend to hybridize ACO and genetic algorithms to obtain their excellent features by synthesizing them. More specifically, ACO offers a critical advantage of local searching, not found in GA. On the other hand, GA considers a global perspective by operating on the complete population from the very beginning. Therefore, ACO and GA can nullify each others drawbacks when hybridized.

The rest of this paper is organized as follows. Section 2 presents a brief overview of protein function prediction. Feature selection methods are shortly discussed in Section 3. Ant colony optimization is described in Section 4. Genetic algorithms are addressed in Section 5. Section 6 explains the proposed hybrid feature selection algorithm. Section 7 reports computational experiments. It also includes a brief discussion of the results obtained and finally the conclusion and future works are offered in the last section.

Section snippets

Protein function prediction

Proteins are the most essential and versatile macromolecules of life and serve as building blocks and functional components of a cell, and account for the second largest fraction of the cellular weight after water. They are polypeptides, formed within cells as a linear chain of amino acids (Setubal & Meidanis, 1999). Twenty different amino acids are available, which are denoted by 20 different letters of the alphabet and a linear sequence of these amino acids is known as the primary structure.

Feature selection approaches

During the last decade, application of feature selection (FS) techniques in bioinformatics has become a real prerequisite for model building. In particular, the high dimensional nature of many modeling tasks in bioinformatics, going from sequence analysis over microarray analysis to spectral analysis and literature mining has given rise to a wealth of feature selection techniques being presented in the field (Blum & Dorigo, 2004).

Feature selection is a problem of global combinatorial

Ant colony optimization

Ant colony optimization (ACO) algorithms were introduced by Marco Dorigo (Dorigo et al., 1992) in the early 1990s. While moving, ants leave a chemical pheromone trail on the ground. Ants are guided by pheromone smell. Ants tend to choose the paths marked by the strongest pheromone concentration. The indirect communication between the ants via pheromone trails enables them to find shortest paths between their nest and food sources (Dorigo, Bonaneau, & Theraulaz, 2000).

The basic idea of ACO is to

Genetic algorithm (GA)

Genetic algorithms belong to a class of population-based stochastic search algorithms that are inspired from principles of natural evolution known as evolutionary algorithms (EA) (Choenauer & Michalewicz, 1997). GA is based on the principle of “survival of fittest”, as in the natural phenomena of genetic inheritance and Darwinian strife for survival. These algorithms are general-purpose optimization algorithms with a probabilistic component that provide a means to search poorly understood,

Proposed ACO–GA algorithm

As mentioned earlier, in this paper we intend to hybridize ACO and GA in such a manner that they complement each other for feature selection in protein function prediction. The main steps of proposed feature selection algorithm are shown in Fig. 3.

ACO and GA are used to explore the space of all subsets of given feature set. The performance of selected feature subsets is measured by invoking an evaluation function with the corresponding reduced feature space and measuring the specified

Experimental results

A series of experiments was conducted to show the utility of proposed feature selection algorithm. All experiments have been run on a machine with 3.0 GHz CPU and 512 MB of RAM. We implement proposed ACO–GA, ACO-based and GA-based feature selection algorithms in Matlab R2006a. The operating system was Windows XP Professional. For experimental studies, we have considered two datasets; GPCR-PROSITE and ENZYME-PROSITE. The following sections describe these two datasets and implementation results.

Conclusion and future research

In this paper, we present a hybrid ACO–GA feature selection algorithm and adapt it for hierarchical classification of biological data in a Top-Down manner. This algorithm, ACO–GA, was compared with an ordinary ACO-based algorithm and a classical genetic algorithm for hierarchical classification of proteins. Proposed algorithm has the ability to converge quickly; it has a strong search capability in the problem space and can efficiently find minimal feature subset. Experimental results

References (36)

M. Dorigo et al.
Ant algorithms and stigmergy
Future Generation Computer Systems
(2000)
M. Dorigo et al.
Ant colony optimization theory: A survey
Theoretical Computer Science
(2005)
W. Siedlecki et al.
A note on genetic algorithms for large-scale feature selection
Pattern Recognition Letters
(1989)
X. Wang et al.
Feature selection based on rough sets and particle swarm optimization
Pattern Recognition Letters
(2007)
Aghdam, M. H., Ghasem-Aghaee, N., & Basiri, M. E. (2008). Application of ant colony optimization for feature selection...
B. Alberts et al.
The molecular biology of the cell
(2002)
Basiri, M. E., Ghasem-Aghaee, N., & Aghdam, M. H. (2008). Using ant colony optimization-based selected features for...
C. Blum et al.
The hyper-cube framework for ant colony optimization
IEEE Transaction on Systems, Man, and Cybernetics – Part B
(2004)
M. Choenauer et al.
Evolutionary computation control and cybernetics
Proceedings of the IEEE
(1997)
Correa, S., Freitas, A. A., & Johnson, C. G. (2007). Particle Swarm and Bayesian networks applied to attribute...

Dorigo, M. (1992). Optimization, learning and natural algorithms. PhD thesis, Dipartimento di Elettronica, Politecnico...

M. Dorigo et al.

Ant system: Optimization by a colony of cooperating agents

IEEE Transaction on Systems, Man, and Cybernetics – Part B

(1996)

R.O. Duda et al.

Pattern classification and scene analysis

(1973)

A.A. Freitas et al.

A tutorial on hierarchical classification with applications in bioinformatics

Research and Trends in Data Mining Technologies and Applications

(2007)

Holden, N., & Freitas, A. A. (2006). Hierarchical classification of g-protein-coupled receptors with a pso/aco...

Hulo, N., et al. (2006). The PROSITE database. Nucleic acids research (Vol. 34) (pp. D227–D230). Database...

Jensen, R. (2005). Combining rough and fuzzy sets for feature selection. PhD thesis, University of...

Leguizamon, G., & Michalewicz, Z. (1999). A new version of ant system for subset problems. In Proceedings of IEEE...

Cited by (193)

Integration of deep learning and improved multi-objective algorithm to optimize reservoir operation for balancing human and downstream ecological needs
2024, Water Research
Dam (reservoir)-induced alterations of flow and water temperature regimes can threaten downstream fish habitats and native aquatic ecosystems. Alleviating the negative environmental impacts of dam-reservoir and balancing the multiple purposes of reservoir operation have attracted wide attention. While previous studies have incorporated ecological flow requirements in reservoir operation strategies, a comprehensive analysis of trade-offs among hydropower benefits, ecological flow, and ecological water temperature demands is lacking. Hence, this study develops a multi-objective ecological scheduling model, considering total power generation, ecological flow guarantee index, and ecological water temperature guarantee index simultaneously. The model is based on an integrated multi-objective simulation-optimization (MOSO) framework which is applied to Three Gorges Reservoir. To that end, first, a hybrid long short-term memory and one-dimensional convolutional neural network (LSTM_1DCNN) model is utilized to simulate the dam discharge temperature. Then, an improved epsilon multi-objective ant colony optimization for continuous domain algorithm (ε-MOACOR) is proposed to investigate the trade-offs among the competing objectives. Results show that LSTM _1DCNN outperforms other competing models in predicting dam discharge temperature. The conflicts among economic and ecological objectives are often prominent. The proposed ε-MOACOR has potential in resolving such conflicts and has high efficiency in solving multi-objective benchmark tests as well as reservoir optimization problem. More realistic and pragmatic Pareto-optimal solutions for typical dry, normal and wet years can be generated by the MOSO framework. The ecological water temperature guarantee index objective, which should be considered in reservoir operation, can be improved as inflow discharge increases or the temporal distribution of dam discharge volume becomes more uneven.
Feature selection in high dimensional data: A specific preordonnances-based memetic algorithm
2023, Knowledge-Based Systems
In supervised learning scenarios, feature selection has been largely investigated in the literature because only a few features carry valuable information. This study introduces an algorithm for heterogeneous variable selection in the discrimination problem. The proposed algorithm, Specific Memetic Algorithm Preordonnances-based (SMAP), uses association techniques based on preordonnances theory and is a hybrid filter-wrapper algorithm to make full use of the benefits of each: The filter phase measures the relevance of features by their agreement with the target variable and discards those that disagree ; this leads to a reduction of the search space, while the wrapper phase measures the usefulness of subsets of features to identify the best one using a memetic algorithm based on preordonnances theory. The association is quantified by a coefficient measuring the concordance between two variables, even if one is numeric and the other is categorical (mixed). We propose a generalization of this coefficient measuring the concordance between several (more than two) heterogeneous variables. In this study, a new feature discrimination power measure combining the two coefficients is introduced to intensify and diversify the search taking a minimum amount of time. SMAP is empirically analyzed by comparing its performance to that of recently referred state-of-art approaches on seven datasets and on simulated ones using three different classifiers. The experimental results show the superiority of SMAP over comparative methods.
A robust feature selection method based on meta-heuristic optimization for speech emotion recognition
2024, Evolutionary Intelligence
Hybrid ACO-CI Algorithm for Beam Design Problems
2024, SN Computer Science
Evolutionary feature selection based on hybrid bald eagle search and particle swarm optimization
2024, Intelligent Data Analysis
Emotion recognition from speech signals using digital features optimization by diversity measure fusion
2024, Journal of Intelligent and Fuzzy Systems

View all citing articles on Scopus

View full text

A novel ACO–GA hybrid algorithm for feature selection in protein function prediction

Abstract

Introduction

Section snippets

Protein function prediction

Feature selection approaches

Ant colony optimization

Genetic algorithm (GA)

Proposed ACO–GA algorithm

Experimental results

Conclusion and future research

Future Generation Computer Systems

Theoretical Computer Science

Pattern Recognition Letters

Pattern Recognition Letters

The molecular biology of the cell

The hyper-cube framework for ant colony optimization

IEEE Transaction on Systems, Man, and Cybernetics – Part B

Evolutionary computation control and cybernetics

Proceedings of the IEEE

Ant system: Optimization by a colony of cooperating agents

IEEE Transaction on Systems, Man, and Cybernetics – Part B

Pattern classification and scene analysis

A tutorial on hierarchical classification with applications in bioinformatics

Research and Trends in Data Mining Technologies and Applications