Journal of Molecular Biology
Regular article
Protein topology recognition from secondary structure sequences: application of the hidden Markov models to the alpha class proteins
Introduction
The biological function of a protein can be better understood, and in some cases predicted, when its tertiary structure is known. Despite the remarkable increase in the number of experimentally determined structures, the number of available sequences exceeds the number of determined structures by about two orders of magnitude and is growing exponentially with time. Moreover, the de novo design of proteins requires a precise knowledge of the relationship between the amino acid sequences and the folded (tertiary) structure. This relationship has not yet been determined and its precise formulation remains an important and challenging problem. However, two useful methodologies are of emerging importance: homology modeling and fold recognition techniques.
Homology modeling and fold recognition are based on the observation that during evolution protein folds vary much less than amino acid sequences. The folding patterns that we observe today can be understood as the result of the evolution of a set of ancestral protein folds (Doolittle, 1992) limited to about 1000 or so in number (Chothia, 1992). If two proteins of sufficient length share more than 25% sequence identity, then they are likely to have similar structures (Schneider & Sander, 1991). Thus, if one protein has a known structure, it is possible to obtain a reasonable 3D model for the structure of the other by homology modeling (Browne et al., 1969; Greer, 1991). The accuracy of such methods has recently been analyzed (Mossiman et al., 1995).
Analysis of the relationship between sequence and structural similarity has shown that around the “twilight zone” of sequence identity of 20 to 30%, the relationship between primary and tertiary structure becomes problematic (Doolittle, 1986; Jones & Thornton, 1993; Schneider & Sander, 1991). Much recent research has been devoted to the problem of recognizing when two sequences of low or undetectable sequence similarity adopt similar folds. The problem was approached by performing “optimal” sequence threading through a known 3D structure and evaluating the resulting alignment by means of profiles or empirical contact potential functions (Bowie et al., 1991; Bryant & Lawrence, 1993; Godzik & Skolnick, 1992; Hendlich et al., 1990; Jones et al., 1992; Luthy et al., 1992; Maiorov & Crippen, 1992; Nishikawa & Matsuo, 1994; Wilmanns & Eisenberg, 1993). Several of these threading methods have been analyzed (Lemer et al., 1995) in the context of a structure prediction contest. Lemer’s analysis showed, on a limited number of cases, that the fold of every target protein studied was correctly identified by at least one group of investigators. This result suggests there is room for further progress.
Richardson (1981) and Robson & Garnier (1988) summarized the three-dimensional folds of proteins by describing the arrangement of secondary structure elements in space. Russell & Barton (1994) showed that secondary structure is conserved more often than residue pair interactions in protein pairs with similar folds but less than 20% sequence identity. This suggests that secondary structure propensity may play an important role in the recognition of distantly related proteins.
Attempts have been made to identify the fold of a protein using explicit secondary structures (Hubbard & Park, 1995; Rost, 1995; Russell et al., 1996; Sheridan et al., 1985). Yet, the information about protein structural topology derived from only the sequence of secondary structure states of residues has not been fully exploited. Our goal is to show that protein topology can indeed be recognized from the sequence of secondary structures.
Because of their rigorous but flexible mathematical structure, hidden Markov models (HMMs; Rabiner, 1989) have been used in a variety of computational biology applications such as sequence motif recognition (Fujiwara et al., 1994), gene finding (Krogh et al., 1994b), and protein secondary structure prediction (Asai et al., 1993). HMMs have also been shown to provide good multiple sequence alignments (Eddy, 1995; Krogh et al., 1994a). In the context of the protein topology recognition problem, we use the HMM framework (Rabiner, 1989) to build models of topology groups of alpha class proteins by aligning the sequences of secondary structure states of the group members. We defined topology groups according to the CATH database (Orengo et al., 1993, 1994), which has provided a nearly exhaustive partition of all the known protein structures into fold families. We test the ability of these structural models to recognize both observed and predicted secondary structure sequences of group members. Using HMM procedures, a sequence of predicted secondary structure can also be aligned to the sequences of a topology group, thus providing a more complete description of the unknown structure than a fold recognition procedure that only gives a yes-or-no answer, as in Craven et al. (1995) and Dubchak et al. (1995). Finally, two topology recognition experiments are presented, to compare our predictions with previously published fold predictions for leptin, the obese gene product (Madej et al., 1995), and the human interleukin-6 sequence (Bazan, 1990a).
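To illustrate how an HMM can assign a score to a sequence of secondary structure states, the following minimal sketch applies the forward algorithm (Rabiner, 1989) to strings over a three-letter alphabet (H, helix; E, strand; C, coil). It is not the model architecture or software used in this work: the two-state toy models, all of their probabilities, the query string, and the names `forward_loglik` and `rank_models` are assumptions introduced purely for illustration.

```python
import math

ALPHABET = "HEC"  # H = helix, E = strand, C = coil


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_loglik(seq, start, trans, emit):
    """Return log P(seq | model) computed with the forward algorithm."""
    if not seq:
        raise ValueError("sequence must not be empty")
    n = len(start)
    # Initialization: start in state s and emit the first symbol.
    alpha = [math.log(start[s]) + math.log(emit[s][seq[0]]) for s in range(n)]
    # Recursion: sum over all predecessor states, then emit the next symbol.
    for sym in seq[1:]:
        alpha = [
            logsumexp([alpha[p] + math.log(trans[p][s]) for p in range(n)])
            + math.log(emit[s][sym])
            for s in range(n)
        ]
    # Termination: sum over all possible final states.
    return logsumexp(alpha)


def rank_models(seq, models):
    """Rank models by per-residue log-likelihood, best first."""
    scores = [
        (name, forward_loglik(seq, *params) / len(seq))
        for name, params in models.items()
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical toy models: state 0 is an "element" state, state 1 a "loop" state.
LOOP = {"H": 0.10, "E": 0.10, "C": 0.80}
TRANS = [[0.9, 0.1], [0.2, 0.8]]
START = [0.5, 0.5]
HELICAL = (START, TRANS, [{"H": 0.90, "E": 0.02, "C": 0.08}, LOOP])
STRAND = (START, TRANS, [{"H": 0.02, "E": 0.90, "C": 0.08}, LOOP])
MODELS = {"helical": HELICAL, "strand": STRAND}

# Hypothetical query: two helices separated by short coil segments.
QUERY = "CCHHHHHHHHCCCHHHHHHHCC"

if __name__ == "__main__":
    for name, score in rank_models(QUERY, MODELS):
        print(f"{name}: {score:.3f} log-likelihood per residue")
```

Dividing the log-likelihood by the sequence length is one simple way to compare queries of different lengths; it is an illustrative choice here and is not claimed to match the scoring scheme used in the experiments reported below.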
Terminology
We will use the terminology of the CATH database (Orengo et al., 1993, 1994), release January 1995, where proteins have been classified with the following hierarchical levels: Class, the highest level of the classification, derived from secondary structure content (e.g. alpha class, alpha/beta class); Architecture, the gross arrangement of secondary structures (e.g. aligned helices; singly wound alpha/beta proteins); Topology, topological description of well known folds and taxonomies
HMMs recognize secondary structure sequences
Figure 2 shows histograms of the NLO scores assigned to the proteins in Database II by the HMM trained on EF-Hand topology proteins. The model clearly recognizes EF-Hand proteins as topology group members, since the distribution of scores assigned to observed secondary structure sequences of EF-Hand proteins (mean=−0.39, std=0.16) is well separated from the distribution of scores of non-members of the topology group (mean=0.07, std=0.11) and occupies the left tail of the histograms with the
Discussion
The success rate obtained when using experimentally derived secondary structure state sequences (Table 6) suggests that the length, the type and the relative position in the sequence of secondary structure elements (helix, strand and coil) convey sufficient information to identify and distinguish different protein topologies. Although protein topologies are themselves defined by the overall organization of secondary structure elements in the 3D space, this is quite an interesting finding when
Conclusions
With the use of the hidden Markov models, we have shown that experimentally derived secondary structure sequences alone embody enough information to correctly identify the topology of a protein. Even when secondary structure sequences are predicted with commonly available prediction algorithms, enough information is retained for topology recognition. Clearly, the rate of correct protein topology recognition will improve as the secondary structure prediction algorithms improve and more 3D
Acknowledgements
We thank Dr Peter J. Steinbach for critically reading the manuscript and offering useful suggestions; Dr Jonathan M. Levin for providing us with predicted secondary structure sequences of leptin and interleukin-6; and the various participants in the 1996 “Identifying features in biological sequences” workshop at the Aspen Center for Physics.
References (62)
Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. (1991)
Basic local alignment search tool. J. Mol. Biol. (1990)
Haemopoietic receptors and helical cytokines. Immunol. Today (1990)
The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. (1977)
A possible three-dimensional structure of bovine α-lactalbumin based on that of hen’s egg-white lysozyme. J. Mol. Biol. (1969)
Comparative modeling in homologous proteins. Methods Enzymol. (1991)
Identification of native protein folds amongst a large number of incorrect models. The calculation of low energy conformations from potentials of mean force. J. Mol. Biol. (1990)
Hidden Markov models in computational biology: applications to protein modeling. J. Mol. Biol. (1994)
Improvements in secondary structure prediction method based on search for local sequence homologies and its use as a model building tool. Biochim. Biophys. Acta (1988)
Threading analysis suggests that the obese gene product may be a helical cytokine. FEBS Letters (1995)
Contact potential that recognizes the correct folding of globular proteins. J. Mol. Biol.
SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol.
The anatomy and taxonomy of protein structure. Advan. Protein Chem.
Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol.
Structural features can be unconserved in proteins with similar folds. J. Mol. Biol.
Protein fold recognition by mapping predicted secondary structures. J. Mol. Biol.
Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence alignments. J. Mol. Biol.
Identification and expression cloning of a leptin receptor, OB-R. Cell
Prediction of protein secondary structure by the hidden Markov model. CABIOS
Structural design and molecular evolution of a cytokine receptor superfamily. Proc. Natl Acad. Sci. USA
A method to identify protein sequences that fold into a known three-dimensional structure. Science
Using Dirichlet mixture priors to derive hidden Markov models for protein families
An empirical energy function for threading protein sequence through folding motif. Proteins: Struct. Funct. Genet.
One thousand families for the molecular biologist. Nature
Predicting protein folding classes without overly relying on homology
Improving protein secondary structure prediction with aligned homologous sequences. Protein Sci.
Stein and Moore Award address. Reconstructing history with amino acid sequences. Protein Sci.
Prediction of protein folding class using global description of amino acid sequence. Proc. Natl Acad. Sci. USA
Multiple alignment using hidden Markov models
Maximum Discrimination hidden Markov models of sequence consensus. J. Comput. Biol.
Edited by F. E. Cohen