Protein topology recognition from secondary structure sequences: application of the hidden Markov models to the alpha class proteins

doi:10.1006/jmbi.1996.0874

Journal of Molecular Biology

Volume 267, Issue 2, 28 March 1997, Pages 446-463

https://doi.org/10.1006/jmbi.1996.0874

Abstract

The three-dimensional fold of a protein is described by the organization of its secondary structure elements in 3D space, i.e. its “topology”. We find that the protein topology can be recognized from the 1D sequence of secondary structure states of the residues alone. Automated recognition is facilitated by the use of hidden Markov models (HMMs) to represent topology families of proteins. Such models can be trained on the experimentally observed secondary structure sequences of family members using well-established algorithms. Here, we model various topology groups in the alpha class of proteins and identify, from a large database, those proteins having the topology described by each model. The correct topology family for protein secondary structure sequences could be recognized 12 out of 14 times. When the observed secondary structure sequences are replaced with predicted sequences, recognition is still achievable 8 out of 14 times. The success rate for observed sequences indicates that our approach will become increasingly useful as the accuracy of secondary structure prediction algorithms improves. Our study indicates that HMMs are useful for protein topology recognition even when no detectable primary amino acid sequence similarity is present. To illustrate the potential utility of our method, protein topology recognition is attempted on leptin, the obese gene product, and the human interleukin-6 sequence, for which fold predictions have been previously published.

Introduction

The biological function of a protein can be better understood, and in some cases predicted, when its tertiary structure is known. Despite the remarkable increase in the number of experimentally determined structures, the number of available sequences exceeds the number of determined structures by about two orders of magnitude and is growing exponentially with time. Moreover, the de novo design of proteins requires a precise knowledge of the relationship between the amino acid sequences and the folded (tertiary) structure. This relationship has not yet been determined and its precise formulation remains an important and challenging problem. However, two useful methodologies are of emerging importance: homology modeling and fold recognition techniques.

Homology modeling and fold recognition are based on the observation that, during evolution, protein folds vary much less than amino acid sequences. The folding patterns that we observe today can be understood as the result of the evolution of a set of ancestral protein folds (Doolittle, 1992), limited to about 1000 in number (Chothia, 1992). If two proteins of sufficient length share more than 25% sequence identity, then they are likely to have similar structures (Schneider & Sander, 1991). Thus, if one protein has a known structure, it is possible to obtain a reasonable 3D model for the structure of the other by homology modeling (Browne et al., 1969; Greer, 1991). The accuracy of such methods has recently been analyzed (Mossiman et al., 1995).

Analysis of the relationship between sequence and structural similarity has shown that around the “twilight zone” of sequence identity of 20 to 30%, the relationship between primary and tertiary structure becomes problematic (Doolittle, 1986; Jones & Thornton, 1993; Schneider & Sander, 1991). Much recent research has been devoted to the problem of recognizing when two sequences of low or undetectable sequence similarity adopt similar folds. The problem was approached by performing “optimal” sequence threading through a known 3D structure and evaluating the resulting alignment by means of profiles or empirical contact potential functions (Bowie et al., 1991; Bryant & Lawrence, 1993; Godzik & Skolnick, 1992; Hendlich et al., 1990; Jones et al., 1992; Luthy et al., 1992; Maiorov & Crippen, 1992; Nishikawa & Matsuo, 1994; Wilmanns & Eisenberg, 1993). Several of these threading methods have been analyzed (Lemer et al., 1995) in the context of a structure prediction contest. Lemer’s analysis showed, on a limited number of cases, that the fold of every target protein studied was correctly identified by at least one group of investigators. This result suggests there is room for further progress.

Richardson (1981) and Robson & Garnier (1988) summarized the three-dimensional folds of proteins by describing the arrangement of secondary structure elements in space. Russell & Barton (1994) showed that secondary structure is conserved more often than residue pair interactions in protein pairs with similar folds but less than 20% sequence identity. This suggests that secondary structure propensity may play an important role in the recognition of distantly related proteins.

Attempts have been made to identify the fold of a protein using explicit secondary structures (Hubbard & Park, 1995; Rost, 1995; Russell et al., 1996; Sheridan et al., 1985). Yet the information about protein structural topology that can be derived from the sequence of secondary structure states of residues alone has not been fully exploited. Our goal is to show that protein topology can indeed be recognized from the sequence of secondary structures.

Because of their rigorous but flexible mathematical structure, hidden Markov models (HMMs; Rabiner, 1989) have been used in a variety of computational biology applications such as sequence motif recognition (Fujiwara et al., 1994), gene finding (Krogh et al., 1994b), and protein secondary structure prediction (Asai et al., 1993). HMMs have also been shown to provide good multiple sequence alignments (Eddy, 1995; Krogh et al., 1994a). In the context of the protein topology recognition problem, we use the HMM framework (Rabiner, 1989) to build models of topology groups of alpha class proteins by aligning the sequences of secondary structure states of the group members. We defined topology groups according to the CATH database (Orengo et al., 1993, 1994), which has provided a nearly exhaustive partition of all the known protein structures into fold families. We test the ability of these structural models to recognize both observed and predicted secondary structure sequences of group members. Using HMM procedures, a sequence of predicted secondary structure can be aligned to the sequences of a topology group, thus providing a more complete description of the unknown structure than a fold recognition procedure that only gives a yes-or-no answer, as in Craven et al. (1995) and Dubchak et al. (1995). Finally, two topology recognition experiments are presented to compare our predictions with previously published fold predictions for leptin, the obese gene product (Madej et al., 1995), and the human interleukin-6 sequence (Bazan, 1990a).
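The idea of scoring a secondary structure sequence against an HMM can be illustrated with a minimal sketch. It is not the authors' model: the two-state architecture, the one-letter codes H/E/C (helix, strand, coil), all probabilities, and the function names are hypothetical, and the length-normalized score against a uniform background is an assumed convention chosen only for illustration. The forward algorithm itself is the standard one described by Rabiner (1989).

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def forward_loglik(seq, start, trans, emit):
    """Return log P(seq | model), summed over all state paths (forward algorithm)."""
    n = len(start)
    # alpha[j] = log P(symbols seen so far, current hidden state = j)
    alpha = [math.log(start[j]) + math.log(emit[j][seq[0]]) for j in range(n)]
    for sym in seq[1:]:
        alpha = [
            logsumexp([alpha[i] + math.log(trans[i][j]) for i in range(n)])
            + math.log(emit[j][sym])
            for j in range(n)
        ]
    return logsumexp(alpha)


def neg_log_odds_per_residue(seq, start, trans, emit):
    """Assumed score: negative log-odds versus a uniform H/E/C background,
    divided by sequence length. Lower values mean a better fit to the model."""
    null_loglik = len(seq) * math.log(1.0 / 3.0)
    return -(forward_loglik(seq, start, trans, emit) - null_loglik) / len(seq)


# Hypothetical toy model: state 0 emits mostly helix, state 1 mostly coil.
START = [0.5, 0.5]
TRANS = [[0.9, 0.1], [0.2, 0.8]]
EMIT = [
    {"H": 0.90, "E": 0.05, "C": 0.05},
    {"H": 0.10, "E": 0.10, "C": 0.80},
]

helical = "HHHHHHHHCCCCHHHHHHHH"
strand = "EEEEEEEECCCCEEEEEEEE"
for name, seq in (("helical", helical), ("strand", strand)):
    print(name, round(neg_log_odds_per_residue(seq, START, TRANS, EMIT), 3))
```

In this toy setting the helix-rich sequence receives a negative score and the strand-rich one a positive score, which is the kind of separation a topology model is meant to produce between members and non-members. A real model would be trained on family members rather than set by hand.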

Terminology

We will use the terminology of the CATH database (Orengo et al., 1993, 1994), release January 1995, where proteins have been classified with the following hierarchical levels: Class, the highest level of the classification, derived from secondary structure content (e.g. alpha class, alpha/beta class); Architecture, the gross arrangement of secondary structures (e.g. aligned helices; singly wound alpha/beta proteins); Topology, topological description of well known folds and taxonomies

HMMs recognize secondary structure sequences

Figure 2 shows histograms of the NLO scores assigned to the proteins in Database II by the HMM trained on EF-Hand topology proteins. The model clearly recognizes EF-Hand proteins as topology group members, since the distribution of scores assigned to observed secondary structure sequences of EF-Hand proteins (mean=−0.39, std=0.16) is well separated from the distribution of scores of non-members of the topology group (mean=0.07, std=0.11) and occupies the left tail of the histograms with the
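As a rough arithmetic check on how far apart the two score distributions quoted above lie, the sketch below computes a simple separation index: the difference of the means divided by the root-mean-square of the two standard deviations. The means and standard deviations are the ones quoted in the text; the index and the function name are illustrative choices and do not appear in the paper.

```python
import math


def separation_index(mean_a, std_a, mean_b, std_b):
    """Distance between two means in units of their RMS standard deviation."""
    pooled = math.sqrt((std_a ** 2 + std_b ** 2) / 2.0)
    return abs(mean_a - mean_b) / pooled


# EF-Hand members: mean = -0.39, std = 0.16; non-members: mean = 0.07, std = 0.11
d = separation_index(-0.39, 0.16, 0.07, 0.11)
print(f"separation index: {d:.2f}")  # prints "separation index: 3.35"
```

A value above 3 means the two means are more than three pooled standard deviations apart, consistent with the description of the histograms as well separated.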

Discussion

The success rate obtained when using experimentally derived secondary structure state sequences (Table 6) suggests that the length, the type and the relative position in the sequence of secondary structure elements (helix, strand and coil) convey sufficient information to identify and distinguish different protein topologies. Although protein topologies are themselves defined by the overall organization of secondary structure elements in the 3D space, this is quite an interesting finding when
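To make concrete what the length, type and relative position of secondary structure elements look like as data, the following minimal sketch collapses a per-residue state string into a list of elements. The one-letter codes H, E and C for helix, strand and coil, and the function name, are assumptions for illustration; the paper's own encoding is not shown in this excerpt.

```python
from itertools import groupby


def ss_elements(ss):
    """Collapse a per-residue state string into (type, start, length) tuples.

    `start` is the 0-based index of the first residue of the element.
    """
    elements = []
    position = 0
    for state, run in groupby(ss):
        length = len(list(run))
        elements.append((state, position, length))
        position += length
    return elements


print(ss_elements("CCHHHHCCEEEC"))
# [('C', 0, 2), ('H', 2, 4), ('C', 6, 2), ('E', 8, 3), ('C', 11, 1)]
```

An HMM over the residue-level string sees this same information implicitly: run lengths through self-transitions, element types through emissions, and order through the sequence of states.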

Conclusions

With the use of the hidden Markov models, we have shown that experimentally derived secondary structure sequences alone embody enough information to correctly identify the topology of a protein. Even when secondary structure sequences are predicted with commonly available prediction algorithms, enough information is retained for topology recognition. Clearly, the rate of correct protein topology recognition will improve as the secondary structure prediction algorithms improve and more 3D

Acknowledgements

We thank Dr Peter J. Steinbach for critically reading the manuscript and offering useful suggestions; Dr Jonathan M. Levin for providing us with predicted secondary structure sequences of leptin and interleukin-6; and the various participants in the 1996 “Identifying features in biological sequences” workshop at the Aspen Center for Physics.

References (62)

S.F. Altschul. Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. (1991)
S.F. Altschul et al. Basic local alignment search tool. J. Mol. Biol. (1990)
J.F. Bazan. Haemopoietic receptors and helical cytokines. Immunol. Today (1990)
F.C. Bernstein et al. The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. (1977)
W.J. Browne et al. A possible three-dimensional structure of bovine α-lactalbumin based on that of hen’s egg-white lysozyme. J. Mol. Biol. (1969)
J. Greer. Comparative modeling in homologous proteins. Methods Enzymol. (1991)
M. Hendlich et al. Identification of native protein folds amongst a large number of incorrect models. The calculation of low energy conformations from potentials of mean force. J. Mol. Biol. (1990)
A. Krogh et al. Hidden Markov models in computational biology: applications to protein modeling. J. Mol. Biol. (1994)
J.M. Levin et al. Improvements in secondary structure prediction method based on search for local sequence homologies and its use as a model building tool. Biochim. Biophys. Acta (1988)
T. Madej et al. Threading analysis suggests that the obese gene product may be a helical cytokine. FEBS Letters (1995)
V.N. Maiorov et al. Contact potential that recognizes the correct folding of globular proteins. J. Mol. Biol. (1992)
A.G. Murzin et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. (1995)
J.S. Richardson. The anatomy and taxonomy of protein structure. Advan. Protein Chem. (1981)
B. Rost et al. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. (1993)
R.B. Russell et al. Structural features can be unconserved in proteins with similar folds. J. Mol. Biol. (1994)
R.B. Russell et al. Protein fold recognition by mapping predicted secondary structures. J. Mol. Biol. (1996)
A.A. Salamov et al. Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence alignments. J. Mol. Biol. (1995)
L.A. Tartaglia et al. Identification and expression cloning of a leptin receptor, OB-R. Cell (1995)
K. Asai et al. Prediction of protein secondary structure by the hidden Markov model. CABIOS (1993)
J.F. Bazan. Structural design and molecular evolution of a cytokine receptor superfamily. Proc. Natl Acad. Sci. USA (1990)
J.U. Bowie et al. A method to identify protein sequences that fold into a known three-dimensional structure. Science (1991)
M. Brown et al. Using Dirichlet mixture priors to derive hidden Markov models for protein families.
S.H. Bryant et al. An empirical energy function for threading protein sequence through folding motif. Proteins: Struct. Funct. Genet. (1993)
C. Chothia. One thousand families for the molecular biologist. Nature (1992)
M.W. Craven et al. Predicting protein folding classes without overly relying on homology.
V. Di Francesco et al. Improving protein secondary structure prediction with aligned homologous sequences. Protein Sci. (1996)
R.F. Doolittle. Stein and Moore Award address. Reconstructing history with amino acid sequences. Protein Sci. (1992)
I. Dubchak et al. Prediction of protein folding class using global description of amino acid sequence. Proc. Natl Acad. Sci. USA (1995)
S. Eddy. Multiple alignment using hidden Markov models.
S.R. Eddy et al. Maximum Discrimination hidden Markov models of sequence consensus. J. Comput. Biol. (1995)


¹: Edited by F. E. Cohen
