Elsevier

Journal of Theoretical Biology

Volume 443, 14 April 2018, Pages 125-137

PREvaIL, an integrative approach for inferring catalytic residues using sequence, structural, and network features in a machine-learning framework

https://doi.org/10.1016/j.jtbi.2018.01.023

Highlights

  • PREvaIL is a new method for inferring catalytic residues based on a comprehensive set of features.

  • Benchmarking experiments showed that PREvaIL achieved competitive performance.

  • It was able to capture useful signals to improve catalytic residue prediction.

  • PREvaIL can facilitate characterization and functional annotation of proteins.

Abstract

Determining the catalytic residues in an enzyme is critical to understanding the relationships among protein sequence, structure, and function, and to enhancing our ability to design novel enzymes and their inhibitors. Although many enzymes have been sequenced, and their primary and tertiary structures determined, experimental methods for enzyme functional characterization lag behind. Because experimental methods for identifying catalytic residues are resource- and labor-intensive, computational approaches are highly desirable for their ability to complement experimental studies in identifying catalytic residues and to help bridge the sequence–structure–function gap. In this study, we describe a new computational method called PREvaIL for predicting enzyme catalytic residues. This method was developed by leveraging a comprehensive set of informative features extracted from multiple levels, including sequence, structure, and residue-contact network, in a random forest machine-learning framework. Extensive benchmarking experiments on eight different datasets based on 10-fold cross-validation and independent tests, as well as side-by-side performance comparisons with seven modern sequence- and structure-based methods, showed that PREvaIL achieved competitive predictive performance, with an area under the receiver operating characteristic curve ranging from 0.896 to 0.973 and an area under the precision-recall curve ranging from 0.294 to 0.523. We demonstrated that this method was able to capture useful signals arising from different levels, and that combining these distinct but complementary types of features significantly improved the performance of catalytic residue prediction. We believe that this new method can be utilized as a valuable tool for both understanding the complex sequence–structure–function relationships of proteins and facilitating the characterization of novel enzymes lacking functional annotations.

Graphical abstract

The flowchart of PREvaIL for inferring catalytic residues based on the integration of sequence, structural, and residue-contact-network features using the random forest machine-learning framework.

Introduction

As powerful biological catalysts, enzymes can effectively catalyze biochemical reactions at extremely high rates and are thus indispensable for many biological processes and pathways (Khosla and Harbury, 2001). Many important findings acquired from enzyme fast reaction systems (Chou and Zhou, 1982, Kuo-chen and Shou-ping, 1974, Zhou and Zhong, 1982) have significantly impacted basic research (Gardner et al., 2015) and driven changes in medicinal chemistry (Chou, 2017). However, the residues comprising an enzyme differ greatly in functional significance, with only a small number directly involved in catalytic activity (Furnham et al., 2014). Accordingly, identifying which residues are catalytic is critical for determining the relationships among protein sequence, structure, and function, and for enhancing our ability to design novel inhibitors and enzymes. This has important implications in the post-genomic era, with its challenge of bridging the widening protein sequence–structure gap. Although sequence information for many enzymes is known, relatively few enzymes have been functionally characterized. Therefore, detailed information regarding catalytic residues and enzyme active sites explicitly involved in catalysis remains lacking. Because experimental methods for identifying catalytic residues are resource- and labor-intensive, high-throughput in silico approaches are highly desirable for complementing experimental efforts in identifying catalytic residues and helping to bridge the sequence–structure–function gap.

In recent years, a variety of computational methods have been developed for predicting catalytic residues or functional residues involved in catalytic reactions (Chou and Cai, 2004). These methods differ in several ways, including the machine-learning or statistical-scoring technique used, the types of sequence features used, whether structural features are used in addition to sequence features, and the sources of training and testing datasets. According to the types of features used for constructing prediction models, existing methods can be generally categorized into four major groups.

The first group of methods was primarily developed based on protein sequence and typically relied upon extracting useful sequence features for inputs used to train the prediction models. Commonly used sequence features include evolutionary information in the form of position-specific scoring matrices (PSSMs) or sequence conservation inferred from multiple sequence alignments (Capra and Singh, 2007, Fischer et al., 2008, La et al., 2005, Pai et al., 2015, Youn et al., 2007, Zhang et al., 2008) or other sequence-derived features, such as Jensen-Shannon divergence scores, relative entropies (Dou et al., 2012, Dou et al., 2010, Fischer et al., 2008), and predicted structural information inferred from sequences, including secondary structure and solvent accessibility (Dou et al., 2012, Kauffman and Karypis, 2009, Shen et al., 2009).
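To make the conservation-based features mentioned above more concrete, the following minimal sketch scores a single alignment column by its Jensen-Shannon divergence from a background distribution. It is an illustration only: the uniform background, the pseudocount value, and the function names are assumptions, and published methods typically rely on empirically derived background frequencies and additional corrections that are not reproduced here.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_distribution(column, pseudocount=1e-6):
    """Amino acid frequencies of one alignment column (gaps are ignored)."""
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return [(counts[aa] + pseudocount) / total for aa in AMINO_ACIDS]

def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(column, background=None):
    """Jensen-Shannon divergence between a column and a background.

    The uniform background used by default is an assumption made
    for this sketch, not a choice taken from the paper.
    """
    p = column_distribution(column)
    q = background or [1.0 / len(AMINO_ACIDS)] * len(AMINO_ACIDS)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * relative_entropy(p, m) + 0.5 * relative_entropy(q, m)

# A fully conserved column versus a column containing every residue once
conserved = js_divergence("HHHHHHHH")
variable = js_divergence("ACDEFGHIKLMNPQRSTVWY")
```

With base-2 logarithms the score lies between 0 and 1, so a strongly conserved column (such as an invariant histidine) scores well above a column that matches the background.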

Recently, many research groups have exploited the increasing quantity of structural data deposited in the Protein Data Bank (PDB) (Rose et al., 2017), prompting the proliferation of the second group of methods, which leverage structural information to build the prediction models (Alterovitz et al., 2009, Chea and Livesay, 2007, Cilia and Passerini, 2010, Gutteridge et al., 2003, Han et al., 2012, Kirshner et al., 2013, Panchenko et al., 2004, Petrova and Wu, 2006, Sun et al., 2016, Xin et al., 2010, Youn et al., 2007). Xin et al. (2010) proposed a structure-based kernel algorithm for the prediction of catalytic residues by explicitly modeling the similarity between residue-centered neighborhoods in protein structures. They showed that geometry, physicochemical properties, and evolutionary conservation play important roles in determining catalytic residue activity. In a recent study, Sun et al. (2016) developed the CRHunter method, which combines sequence and structural information in an SVM framework and achieved stable performance when compared with other template-based predictors. Chien and Huang (2012) proposed an approach, EXIA, based on residue side chain orientation and backbone flexibility of protein structure, which achieved performance comparable to that of evolutionary sequence conservation. In another study, Kirshner et al. (2013) developed the Catsid (Catalytic site identification) search engine, which enables rapid searches for structural matches to a user-specified catalytic site among all PDB structures. Its capacity to rapidly search all known protein structures in the PDB is enabled by a logistic regression-based model that allows for systematic identification of true positives based on a set of feature descriptors.

The third group of methods (Chea and Livesay, 2007, del Sol et al., 2006, del Sol and O'Meara, 2005, Li et al., 2011) involves graph-theoretical approaches that represent protein three-dimensional (3D) structures as small-world networks (Watts and Strogatz, 1998), in which amino acid residues are the vertices of a graph and edges connect pairs of residues that are spatially close. Zhou et al. (2016) provided a comprehensive review of recent progress in this area. Previous studies showed that representing protein structure as a topological residue-contact network can provide novel insights into protein folding mechanisms, stability, and function (del Sol et al., 2006, del Sol and O'Meara, 2005, Jiao and Ranganathan, 2017, Song et al., 2010, Tang et al., 2008, Wang et al., 2012, Zheng et al., 2012). Chea and Livesay (2007) benchmarked the performance of one particular network measure, closeness centrality, and showed that it provided statistically significant predictive power for catalytic residue prediction. They also demonstrated that solvent accessibility or residue identity could be used as an efficient filter alongside this network feature to further improve its predictive performance.
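The residue-contact-network idea can be sketched in a few lines. The example below builds a contact graph from one representative coordinate per residue and computes closeness centrality by breadth-first search. It is a minimal sketch under stated assumptions: the 7.0 Å cutoff, the use of a single atom per residue, the toy coordinates, and the function names are all hypothetical choices made for illustration, not parameters reported in the text above.

```python
import math
from collections import deque

def contact_network(coords, cutoff=7.0):
    """Adjacency sets: residues i and j are linked when their
    representative atoms lie within `cutoff` angstroms (assumed value)."""
    n = len(coords)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def closeness_centrality(adj, source):
    """(number of reachable residues) / (sum of shortest-path lengths)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:  # breadth-first search over the unweighted graph
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

# Toy coordinates: five residues on a line, 5 angstroms apart
coords = [(5.0 * i, 0.0, 0.0) for i in range(5)]
network = contact_network(coords)
scores = [closeness_centrality(network, i) for i in range(5)]
```

In this toy chain only sequential neighbors are in contact, so the middle residue has the highest closeness centrality, which mirrors the intuition that residues close to all others in the network tend to be centrally located in the structure.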

The fourth group of methods uses heterogeneous features through the integration or fusion of sequence, structure, and other types of features (Li et al., 2011, Sankararaman et al., 2010, Tang et al., 2008). Because the extracted features are heterogeneous, redundant, and noisy, feature-selection and dimensionality-reduction algorithms are often used in combination with the learning algorithms to remove irrelevant features, improve model training, and increase prediction accuracy. The machine-learning and statistical-scoring approaches used for training these prediction models include neural networks (Gutteridge et al., 2003), information-theoretic algorithms (Capra and Singh, 2007, Fischer et al., 2008), genetic algorithms (Izidoro et al., 2015), support vector machines (SVMs) (Chea and Livesay, 2007, Li et al., 2011, Pai et al., 2015, Petrova and Wu, 2006, Sun et al., 2016, Youn et al., 2007), kernel-based algorithms (Xin et al., 2010), AdaBoost (Alterovitz et al., 2009), and logistic regression (Dou et al., 2012, Kirshner et al., 2013, Sankararaman et al., 2010). The consensus of these studies has been that evolutionary information, sequence conservation, and the structural neighborhood of catalytic residues are important predictive features, and that machine learning-based approaches often provide competitive performance and are particularly suitable for dealing with high-dimensional heterogeneous feature spaces.

Despite the development and increasing availability of such a wide range of methods, three main challenges need to be overcome to predict catalytic residues by machine learning-based approaches: (1) Sequence and structural features are still not sufficient to predict the catalytic residues of certain proteins. Accordingly, it is necessary to find and exploit other novel and complementary groups or types of features that can be used to further improve prediction performance. (2) Methods for quantifying and characterizing the relative importance and contribution of each group of features to model performance are needed. (3) It is necessary to determine which machine learning algorithm provides the overall highest and most reliable prediction performance.

To address these challenges, in this study, we present a new machine learning-based approach called PREvaIL (PRotEin various Information-based cataLytic site predictor) for predicting catalytic residues based on a random forest (RF) algorithm. In terms of input features, this approach combines a variety of sequence and structural features, as well as residue-contact-network properties, and uses an efficient feature-selection technique to select a subset of more useful features for catalytic residue prediction. We performed extensive benchmark experiments using eight different test datasets to evaluate the performance of this approach and compared it with other competing methods. The results showed that this new approach performed favorably compared with other methods, thereby illustrating its effectiveness.

Section snippets

Materials and methods

According to the 5-step rule (Chou, 2011), the first important step in developing a new predictor involves construction or selection of an effective benchmark dataset. In this study, we addressed this problem as follows.

Feature ranking by the MDGI Z-score

We calculated and ranked the MDGI Z-scores of all initial 3424 features (see Table 1 for a summary of these features) using the randomForest R package in order to assess the relative importance and contribution of each feature type. As a result, we identified a total of 127 feature-vector elements with MDGI Z-score > 1.0, of which 41 had an MDGI Z-score > 2.0. The relative importance and ranking of these feature vectors are plotted in Fig. 2. A detailed list of these feature vectors according
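The ranking step described in this snippet can be illustrated with a short sketch. We assume here that MDGI refers to the mean decrease in Gini index reported by random forest implementations and that the Z-score is the usual standardization against the mean and standard deviation over all features; the excerpt does not state the exact formula, so both points are assumptions. The feature names and importance values are hypothetical, and the sketch covers only the thresholding step, not the randomForest R workflow used in the study.

```python
import statistics

def importance_z_scores(importances):
    """Standardize raw importance values: z = (x - mean) / std."""
    values = list(importances.values())
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return {name: 0.0 for name in importances}
    return {name: (x - mean) / std for name, x in importances.items()}

def select_features(z_scores, threshold=1.0):
    """Feature names with a Z-score above the threshold, highest first."""
    kept = [name for name, z in z_scores.items() if z > threshold]
    return sorted(kept, key=lambda name: z_scores[name], reverse=True)

# Hypothetical mean-decrease-Gini values, not taken from the paper
mdgi = {"pssm_H": 9.0, "closeness": 6.0, "rsa": 2.0, "aa_index_1": 1.0,
        "aa_index_2": 1.0, "aa_index_3": 1.0, "aa_index_4": 1.0}
z = importance_z_scores(mdgi)
selected = select_features(z, threshold=1.0)
top_ranked = select_features(z, threshold=2.0)
```

Applying the two thresholds in turn reproduces the kind of nested selection reported above, where the features passing the stricter cutoff form a subset of those passing the looser one.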

Conclusions

In this study, we demonstrated that the combinatorial application of machine learning techniques on multi-level protein features involving sequence-derived, structural, and residue-contact-network features allowed the development of a powerful bioinformatics predictor, PREvaIL. Previous methods explored these different levels of features separately; however, we illustrated their effective integration into a machine-learning framework to provide complementary information to collectively help

Acknowledgments

We would like to thank Drs. Tuo Zhang and Lukasz Kurgan, who graciously made publicly available the datasets and extracted features used in the benchmark testing of CRpred.

References (104)

  • R. Alterovitz

    ResBoost: characterizing and predicting catalytic residues in enzymes

    BMC Bioinf.

    (2009)
  • S.F. Altschul

    Gapped BLAST and PSI-BLAST: a new generation of protein database search programs

    Nucleic Acids Res.

    (1997)
  • E. Amanzadeh et al.

    Classification of DNA minor and major grooves binding proteins according to the NLSS by data analysis methods

    Appl. Biochem. Biotechnol.

    (2014)
  • G. Amitai

    Network analysis of protein structures identifies functional residues

    J. Mol. Biol.

    (2004)
  • M. Behbahani et al.

    Analysis and comparison of lignin peroxidases between fungi and bacteria using three different modes of Chou's general pseudo amino acid composition

    J. Theor. Biol.

    (2016)
  • M.M. Beigi et al.

    Prediction of metalloproteinase family based on the concept of Chou's pseudo amino acid composition using a machine learning approach

    J. Struct. Funct. Genomics

    (2011)
  • L. Breiman

    Random forests

    Mach. Learn.

    (2001)
  • D.S. Cao et al.

    propy: a tool to generate various modes of Chou's PseAAC

    Bioinformatics

    (2013)
  • J.A. Capra et al.

    Predicting functionally important residues from sequence conservation

    Bioinformatics

    (2007)
  • P. Carter et al.

    Dissecting the catalytic triad of a serine protease

    Nature

    (1988)
  • E. Chea et al.

    How accurate and statistically robust are catalytic site predictions based on closeness centrality?

    BMC Bioinf.

    (2007)
  • Z. Chen

    ZincExplorer: an accurate hybrid method to improve the prediction of zinc-binding sites from protein sequences

    Mol. Biosyst.

    (2013)
  • Y.-T. Chien et al.

    Accurate prediction of protein catalytic residues by side chain orientation and residue contact density

    PLoS One

    (2012)
  • K.C. Chou

    Prediction of protein cellular attributes using pseudo-amino acid composition

    Proteins Struct. Funct. Genet.

    (2001)
  • K.C. Chou

    Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes

    Bioinformatics

    (2005)
  • K.C. Chou

    Some remarks on protein attribute prediction and pseudo amino acid composition

    J. Theor. Biol.

    (2011)
  • K.C. Chou

    Impacts of Bioinformatics to Medicinal Chemistry

    Med. Chem.

    (2015)
  • K.C. Chou

    An Unprecedented Revolution in Medicinal Chemistry Driven by the Progress of Biological Science

    Curr. Top. Med. Chem.

    (2017)
  • K.C. Chou et al.

    A novel approach to predict active sites of enzyme molecules

    Proteins Struct. Funct. Bioinf.

    (2004)
  • K.C. Chou et al.

    Role of the protein outside active-site on the diffusion-controlled reaction of enzyme

    J. Am. Chem. Soc.

    (1982)
  • E. Cilia et al.

    Automatic prediction of catalytic residues by modeling residue structural neighborhood

    BMC Bioinf.

    (2010)
  • P.J. Cock

    Biopython: freely available Python tools for computational molecular biology and bioinformatics

    Bioinformatics

    (2009)
  • G. Csardi et al.

    The igraph software package for complex network research

    Int. J. Complex Syst.

    (2006)
  • A. del Sol

    Residue centrality, functionally important residues, and active site shape: analysis of enzyme and non-enzyme families

    Protein Sci.

    (2006)
  • A. del Sol et al.

    Small-world network approach to identify key residues in protein-protein interaction

    Proteins

    (2005)
  • F.M. Disfani

    MoRFpred, a computational tool for sequence-based prediction and characterization of short disorder-to-order transitioning binding regions in proteins

    Bioinformatics

    (2012)
  • Y. Dou

    L1pred: a sequence-based prediction tool for catalytic residues in enzymes with the L1-logreg classifier

    PLoS One

    (2012)
  • Y. Dou

    Prediction of catalytic residues based on an overlapping amino acid classification

    Amino Acids

    (2010)
  • P.F. Du et al.

    PseAAC-General: fast building various modes of general form of Chou's pseudo-amino acid composition for large-scale protein datasets

    Int. J. Mol. Sci.

    (2014)
  • P.F. Du

    PseAAC-Builder: A cross-platform stand-alone program for generating various special Chou's pseudo-amino acid compositions

    Anal. Biochem.

    (2012)
  • M. Esmaeili et al.

    Using the concept of Chou's pseudo amino acid composition for risk type prediction of human papillomaviruses

    J. Theor. Biol.

    (2010)
  • J.D. Fischer et al.

    Prediction of protein functional residues from sequence by probability density estimation

    Bioinformatics

    (2008)
  • K. Fritz-Wolf

    Structure of mitochondrial creatine kinase

    Nature

    (1996)
  • L. Fu

    CD-HIT: accelerated for clustering the next-generation sequencing data

    Bioinformatics

    (2012)
  • N. Furnham

    The Catalytic Site Atlas 2.0: cataloging catalytic sites and residues identified in enzymes

    Nucleic Acids Res.

    (2014)
  • P.R. Gardner et al.

    Globins Scavenge Sulfur Trioxide Anion Radical

    J. Biol. Chem.

    (2015)
  • A. Gutteridge et al.

    Using a neural network and spatial clustering to predict the location of active sites in enzymes

    J. Mol. Biol.

    (2003)
  • Z. Hajisharifi

    Predicting anticancer peptides with Chou's pseudo amino acid composition and investigating their mutagenicity via Ames test

    J. Theor. Biol.

    (2014)
  • T. Hamelryck

    An amino acid has two sides: a new 2D measure provides a different view of solvent exposure

    Proteins

    (2005)
  • L. Han

    Identification of catalytic residues using a novel feature that integrates the microenvironment and geometrical location properties of residues

    PLoS One

    (2012)
  • S.J. Hubbard et al.
    (1993)
  • S.C. Izidoro et al.

    GASS: identifying enzyme active sites with genetic algorithms

    Bioinformatics

    (2015)
  • J.H. Jia

    iPPI-Esml: an ensemble classifier for identifying the interactions of proteins by incorporating their physicochemical properties and wavelet transforms into PseAAC

    J. Theor. Biol.

    (2015)
  • J.H. Jia

    pSuc-Lys: predict lysine succinylation sites in proteins with PseAAC and ensemble random forest approach

    J. Theor. Biol.

    (2016)
  • X. Jiao et al.

    Prediction of interface residue based on the features of residue interaction network

    J. Theor. Biol.

    (2017)
  • L. Jin et al.

    Crystal structure at 2.8 Å resolution of anabolic ornithine transcarbamylase from Escherichia coli

    Nat. Struct. Biol.

    (1997)
  • D.T. Jones

    Protein secondary structure prediction based on position-specific scoring matrices

    J. Mol. Biol.

    (1999)
  • D.T. Jones et al.

    DISOPRED3: precise disordered region predictions with annotated protein-binding activity

    Bioinformatics

    (2015)
  • W. Kabsch et al.

    Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features

    Biopolymers

    (1983)
  • C. Kauffman et al.

    LIBRUS: combined machine learning and homology information for sequence-based ligand-binding residue prediction

    Bioinformatics

    (2009)