Regular article
Protein topology recognition from secondary structure sequences: application of the hidden Markov models to the alpha class proteins

https://doi.org/10.1006/jmbi.1996.0874

Abstract

The three-dimensional fold of a protein is described by the organization of its secondary structure elements in 3D space, i.e. its “topology”. We find that the protein topology can be recognized from the 1D sequence of secondary structure states of the residues alone. Automated recognition is facilitated by use of hidden Markov models (HMMs) to represent topology families of proteins. Such models can be trained on the experimentally observed secondary structure sequences of family members using well established algorithms. Here, we model various topology groups in the alpha class of proteins and identify, from a large database, those proteins having the topology described by each model. The correct topology family for protein secondary structure sequences could be recognized 12 out of 14 times. When the observed secondary structure sequences are replaced with predicted sequences, recognition is still achievable 8 out of 14 times. The success rate for observed sequences indicates that our approach will become increasingly useful as the accuracy of secondary structure prediction algorithms improves. Our study indicates that HMMs are useful for protein topology recognition even when no detectable primary amino acid sequence similarity is present. To illustrate the potential utility of our method, protein topology recognition is attempted on leptin, the obese gene product, and the human interleukin-6 sequence, for which fold predictions have been previously published.

Introduction

The biological function of a protein can be better understood, and in some cases predicted, when its tertiary structure is known. Despite the remarkable increase in the number of experimentally determined structures, the number of available sequences exceeds the number of determined structures by about two orders of magnitude and is growing exponentially with time. Moreover, the de novo design of proteins requires a precise knowledge of the relationship between the amino acid sequences and the folded (tertiary) structure. This relationship has not yet been determined and its precise formulation remains an important and challenging problem. However, two useful methodologies are of emerging importance: homology modeling and fold recognition techniques.

Homology modeling and fold recognition are based on the observation that during evolution protein folds vary much less than amino acid sequences. The folding patterns that we observe today can be understood as the result of the evolution of a set of ancestral protein folds (Doolittle, 1992), limited to about 1000 in number (Chothia, 1992). If two proteins of sufficient length share more than 25% sequence identity, then they are likely to have similar structures (Schneider & Sander, 1991). Thus, if one protein has a known structure, it is possible to obtain a reasonable 3D model for the structure of the other by homology modeling (Browne et al., 1969; Greer, 1991). The accuracy of such methods has recently been analyzed (Mossiman et al., 1995).
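The 25% identity threshold mentioned above presupposes a way of measuring sequence identity over an alignment. The following is a minimal sketch, not taken from the paper: the function name `percent_identity`, the gap character, and the choice of denominator (columns where neither sequence has a gap) are all assumptions made for illustration, and other conventions for the denominator are in common use.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns where neither sequence has a gap.

    Assumption: both strings are rows of the same pairwise alignment,
    so they must have equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns in which both sequences carry a residue.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


# Hypothetical toy alignment (not real protein data).
identity = percent_identity("MKT-AYIAK", "MRTLAY-AR")
above_threshold = identity > 25.0  # threshold quoted in the text
```

Because the result depends on which columns are counted, any comparison against a fixed threshold should state the convention used.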

Analysis of the relationship between sequence and structural similarity has shown that around the “twilight zone” of sequence identity of 20 to 30%, the relationship between primary and tertiary structure becomes problematic (Doolittle, 1986; Jones & Thornton, 1993; Schneider & Sander, 1991). Much recent research has been devoted to the problem of recognizing when two sequences of low or undetectable sequence similarity adopt similar folds. The problem was approached by performing “optimal” sequence threading through a known 3D structure and evaluating the resulting alignment by means of profiles or empirical contact potential functions (Bowie et al., 1991; Bryant & Lawrence, 1993; Godzik & Skolnick, 1992; Hendlich et al., 1990; Jones et al., 1992; Luthy et al., 1992; Maiorov & Crippen, 1992; Nishikawa & Matsuo, 1994; Wilmanns & Eisenberg, 1993). Several of these threading methods have been analyzed (Lemer et al., 1995) in the context of a structure prediction contest. Lemer’s analysis showed, on a limited number of cases, that the fold of every target protein studied was correctly identified by at least one group of investigators. This result suggests there is room for further progress.

Richardson (1981) and Robson & Garnier (1988) summarized the three-dimensional folds of proteins by describing the arrangement of secondary structure elements in space. Russell & Barton (1994) showed that secondary structure is conserved more often than residue pair interactions in protein pairs with similar folds but less than 20% sequence identity. This suggests that secondary structure propensity may play an important role in the recognition of distantly related proteins.

Attempts have been made to identify the fold of a protein using explicit secondary structures (Hubbard & Park, 1995; Rost, 1995; Russell et al., 1996; Sheridan et al., 1985). Yet, the information about protein structural topology derived from only the sequence of secondary structure states of residues has not been fully exploited. Our goal is to show that protein topology can indeed be recognized from the sequence of secondary structures.

Because of their rigorous but flexible mathematical structure, hidden Markov models (HMMs; Rabiner, 1989) have been used in a variety of computational biology applications such as sequence motif recognition (Fujiwara et al., 1994), gene finding (Krogh et al., 1994b), and protein secondary structure prediction (Asai et al., 1993). HMMs have also been shown to provide good multiple sequence alignments (Eddy, 1995; Krogh et al., 1994a). In the context of the protein topology recognition problem, we use the HMM framework (Rabiner, 1989) to build models of topology groups of alpha class proteins by aligning the sequences of secondary structure states of the group members. We defined topology groups according to the CATH database (Orengo et al., 1993, 1994), which has provided a nearly exhaustive partition of all the known protein structures into fold families. We test the ability of these structural models to recognize both observed and predicted secondary structure sequences of group members. Using HMM procedures, a sequence of predicted secondary structure can be aligned to the sequences of a topology group, thus providing a more complete description of the unknown structure than a fold recognition procedure that gives only a yes-or-no answer, as in Craven et al. (1995) and Dubchak et al. (1995). Finally, two topology recognition experiments are presented to compare our predictions with previously published fold predictions for leptin, the obese gene product (Madej et al., 1995), and the human interleukin-6 sequence (Bazan, 1990a).
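To make the idea of scoring a secondary structure sequence against a model concrete, here is a minimal sketch of the forward algorithm (Rabiner, 1989) applied to strings over the three-letter alphabet H (helix), E (strand), C (coil). It is not the authors' implementation: the two models `alpha_toy` and `beta_toy`, their two-state architecture, and every probability in them are hypothetical values chosen for illustration, and no training step is shown. The paper's own scoring scheme is not reproduced here; the sketch simply compares log-likelihoods.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_loglik(seq, start, trans, emit):
    """Return log P(seq | model) using the forward algorithm in log space.

    start[j]    : probability of starting in state j
    trans[i][j] : probability of moving from state i to state j
    emit[j][s]  : probability that state j emits symbol s
    Assumption: all probabilities are strictly positive.
    """
    if not seq:
        raise ValueError("sequence must be non-empty")
    n = len(start)
    # alpha[j] = log P(first symbol, state at position 1 is j)
    alpha = [math.log(start[j]) + math.log(emit[j][seq[0]]) for j in range(n)]
    for sym in seq[1:]:
        # Sum over all predecessor states i, then emit sym from state j.
        alpha = [
            logsumexp([alpha[i] + math.log(trans[i][j]) for i in range(n)])
            + math.log(emit[j][sym])
            for j in range(n)
        ]
    return logsumexp(alpha)


# Hypothetical two-state models: state 0 = regular element, state 1 = loop.
START = [0.5, 0.5]
TRANS = [[0.9, 0.1], [0.3, 0.7]]
MODELS = {
    "alpha_toy": (START, TRANS, [{"H": 0.90, "E": 0.02, "C": 0.08},
                                 {"H": 0.10, "E": 0.05, "C": 0.85}]),
    "beta_toy": (START, TRANS, [{"H": 0.02, "E": 0.90, "C": 0.08},
                                {"H": 0.05, "E": 0.10, "C": 0.85}]),
}


def classify(seq):
    """Return the name of the toy model with the highest log-likelihood."""
    return max(MODELS, key=lambda name: forward_loglik(seq, *MODELS[name]))


best = classify("CCHHHHHHHHHHCCCHHHHHHHHHCC")
```

Because both toy models are scored on the same query, raw log-likelihoods can be compared directly. Comparing one model's scores across database sequences of different lengths would need some form of normalization, which this sketch leaves out.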


Terminology

We will use the terminology of the CATH database (Orengo et al., 1993, 1994), release January 1995, where proteins have been classified with the following hierarchical levels: Class, the highest level of the classification, derived from secondary structure content (e.g. alpha class, alpha/beta class); Architecture, the gross arrangement of secondary structures (e.g. aligned helices; singly wound alpha/beta proteins); Topology, topological description of well known folds and taxonomies

HMMs recognize secondary structure sequences

Figure 2 shows histograms of the NLO scores assigned to the proteins in Database II by the HMM trained on EF-Hand topology proteins. The model clearly recognizes EF-Hand proteins as topology group members, since the distribution of scores assigned to observed secondary structure sequences of EF-Hand proteins (mean=−0.39, std=0.16) is well separated from the distribution of scores of non-members of the topology group (mean=0.07, std=0.11) and occupies the left tail of the histograms with the

Discussion

The success rate obtained when using experimentally derived secondary structure state sequences (Table 6) suggests that the length, the type and the relative position in the sequence of secondary structure elements (helix, strand and coil) convey sufficient information to identify and distinguish different protein topologies. Although protein topologies are themselves defined by the overall organization of secondary structure elements in the 3D space, this is quite an interesting finding when

Conclusions

With the use of the hidden Markov models, we have shown that experimentally derived secondary structure sequences alone embody enough information to correctly identify the topology of a protein. Even when secondary structure sequences are predicted with commonly available prediction algorithms, enough information is retained for topology recognition. Clearly, the rate of correct protein topology recognition will improve as the secondary structure prediction algorithms improve and more 3D

Acknowledgements

We thank Dr Peter J. Steinbach for critically reading the manuscript and offering useful suggestions; Dr Jonathan M. Levin for providing us with predicted secondary structure sequences of leptin and interleukin-6; and the various participants in the 1996 “Identifying features in biological sequences” workshop at the Aspen Center for Physics.

References (62)

  • V.N Maiorov et al. Contact potential that recognizes the correct folding of globular proteins. J. Mol. Biol. (1992)
  • A.G Murzin et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. (1995)
  • J.S Richardson. The anatomy and taxonomy of protein structure. Advan. Protein Chem. (1981)
  • B Rost et al. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. (1993)
  • R.B Russell et al. Structural features can be unconserved in proteins with similar folds. J. Mol. Biol. (1994)
  • R.B Russell et al. Protein fold recognition by mapping predicted secondary structures. J. Mol. Biol. (1996)
  • A.A Salamov et al. Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence alignments. J. Mol. Biol. (1995)
  • L.A Tartaglia et al. Identification and expression cloning of a leptin receptor, OB-R. Cell (1995)
  • K Asai et al. Prediction of protein secondary structure by the hidden Markov model. CABIOS (1993)
  • J.F Bazan. Structural design and molecular evolution of a cytokine receptor superfamily. Proc. Natl Acad. Sci. USA (1990)
  • J.U Bowie et al. A method to identify protein sequences that fold into a known three-dimensional structure. Science (1991)
  • M Brown et al. Using Dirichlet mixture priors to derive hidden Markov models for protein families
  • S.H Bryant et al. An empirical energy function for threading protein sequence through folding motif. Proteins: Struct. Funct. Genet. (1993)
  • C Chothia. One thousand families for the molecular biologist. Nature (1992)
  • M.W Craven et al. Predicting protein folding classes without overly relying on homology
  • V Di Francesco et al. Improving protein secondary structure prediction with aligned homologous sequences. Protein Sci. (1996)
  • R.F Doolittle
  • R.F Doolittle. Stein and Moore Award address. Reconstructing history with amino acid sequences. Protein Sci. (1992)
  • I Dubchak et al. Prediction of protein folding class using global description of amino acid sequence. Proc. Natl Acad. Sci. USA (1995)
  • S Eddy. Multiple alignment using hidden Markov models
  • S.R Eddy et al. Maximum Discrimination hidden Markov models of sequence consensus. J. Comput. Biol. (1995)

Edited by F. E. Cohen