Stochastic protein folding simulation in the three-dimensional HP-model

https://doi.org/10.1016/j.compbiolchem.2008.03.004Get rights and content

Abstract

We present results from three-dimensional protein folding simulations in the HP-model on ten benchmark problems. The simulations are executed by a simulated annealing-based algorithm with a time-dependent cooling schedule. The neighbourhood relation is determined by the pull-move set. The results provide experimental evidence that the maximum depth D of local minima of the underlying energy landscape can be upper bounded by D<n2/3. The local search procedure employs the stopping criterion (m/δ)D/γ, where m is an estimation of the average number of neighbouring conformations, γ relates to the mean of non-zero differences of the objective function for neighbouring conformations, and 1δ is the confidence that a minimum conformation has been found. The bound complies with the results obtained for the ten benchmark problems.

Introduction

Anfinsen’s thermodynamic hypothesis motivates the attempt to predict protein folding by solving certain optimization problems (Anfinsen, 1973). However, there are two main difficulties with this approach: the precise definition of the energy function that has to be minimised, and the extremely difficult optimization problems arising from energy functions commonly used in folding simulations Neumaier, 1997, Greenberg et al., 2004. To overcome at least some of these difficulties, various lattice models and simplified energy functions have been introduced. One of the most popular lattice models of protein folding is the hydrophobic–hydrophilic (HP) model Dill et al., 1995, Aichholzer et al., 2003, Greenberg et al., 2004, Leite et al., 2004, Schiemann et al., 2005. In the HP-model, proteins are modelled as chains whose vertices are marked either as H (hydrophobic) or P (hydrophilic); the resulting chain is embedded into some lattice; H nodes are considered to attract each other while P nodes are neutral. An optimal embedding is one that maximizes the number of H–H contacts. The rationale for this objective is that hydrophobic interactions contribute a significant portion of the total energy function Greenberg et al., 2004, Brylinski et al., 2006. Unlike more sophisticated models of protein folding, the main goal of the HP-model is to explore broad qualitative questions about protein folding such as whether the dominant interactions are local or global with respect to the chain (Greenberg et al., 2004). A major goal of our research on stochastic protein folding simulations in the HP-model is to investigate whether optimum conformations computed within this model (and some refinements of this model) can serve as appropriate starting conformations for folding simulations of real protein sequences and realistic energy functions.

In the present paper we focus on complexity aspects of protein folding simulations. For a variety of lattice models, protein folding simulation has been proven to be NP-complete Paterson and Przytycka, 1996, Berger and Leighton, 1998, Nayak et al., 1999, which a priori implies an upper time bound exp(n) for deterministic folding simulations of sequences of length n (unless P=NP, which is generally considered to be unlikely). To the best of our knowledge, Clote (1999) was the first to present a rigorous complexity analysis of protein folding simulations to optimum conformations. The underlying model is Monte Carlo simulations, and the aim is to provide a theoretical underpinning for an earlier work by Sali et al. (1994) where the authors conjecture that the folding time is inverse proportional to the energy gap between the ground state and the lowest energy of a misfolded state. By using the notion of rapidly mixing Markov chains (Sinclair, 1993), Clote (1999) prove an upper bound for the mean first passage time through the ground state that decreases if the energy gap increases. The upper bound does not explicitly relate protein sequence parameters to the first passage time. For protein sequences of length n, Fu and Wang (2004) proved an exp(O(n11/dlnn)) deterministic algorithm for d-dimensional protein folding simulation to optimum conformations in the HP-model. Linear time algorithms for constant-factor approximations of optimum conformations are reported by Hart and Istrail (1997) and Heun (2003); however, protein folding is one of the problems where indeed the knowledge of optimum solutions is essential.

The importance of the HP-model is highlighted by the fact that the algorithmic complexity bound by Fu and Wang (2004) comes close to the (worst case) folding time approximation of exp(λn2/3±χn1/2/2) ns by Finkelstein and Badretdinov (1997), where λ and χ are constants close to unity. We note that both time bounds depend on the sequence length n only. In our paper, we try to link the time estimation to crucial parameters of the underlying energy landscape induced by a particular sequence. This is motivated by the ongoing attempts to improve folding time approximations for specific (subsets of) protein sequences. For example, Bakk et al. (2002) suggest folding time approximations of O(n2) and exp(O(n)) for some small globular proteins; Galzitskaya et al. (2003) stress the dependence of folding time on sequence length and analyse folding time estimations of type exp(O(np)), 0<p1, for 54 proteins, where 41n396; a comprehensive experimental study of folding time as a function of temperature is presented by Zhu et al. (2004) for a single sequence (and its mutants); similar problems are discussed by Lee et al. (2003), Leite et al. (2004), Wang (2004), and Li et al. (2004), based on analytical approximations and numeric simulations. We note that the model used by Li et al. (2004) is close to the above-mentioned HP-model and the model employed by Clote (1999), including a combinatorial neighbourhood similar to the one we utilize in the present paper. The coefficient λ in the folding time approximation by Finkelstein and Badretdinov (1997) has been further analysed by Ivankov et al. (2003).

We analyse a sequence-dependent time bound for protein folding simulations in the HP-model. The simulations utilize a stochastic local search procedure that is based on a specific type of simulated annealing. Simulated annealing-based search has been discussed and applied in the context of protein folding simulation for about twenty years Li and Scheraga, 1987, Wilson et al., 1988, Kawai et al., 1989 as an extension of Monte Carlo methods (Paine and Scheraga, 1985), shortly after the independent introduction of simulated annealing as a new optimization tool by Kirkpatrick et al. (1983) and Černy (1985). For on overview on early applications of simulated annealing to protein folding simulations, we refer the reader to Hansmann and Okamoto (1996). Apart from simulated annealing, search-based methods like genetic algorithms Unger and Moult, 1993, Rabow and Scheraga, 1996, Cooper et al., 2003 and tabu search (Pardalos et al., 1997) have been applied to protein folding simulation (see also Greenberg et al., 2004). More recent simulated annealing-related applications to protein folding simulation can be found in Eastman et al. (2001) and Klepeis et al. (2003).

In order to be able to relate the time bound of simulations to landscape parameters, we utilize the cooling schedule T=T(k)=D/log(k+2) in simulated annealing procedures, where D is the maximum value of the minimum escape height from local minima of the underlying conformation space, i.e. the transition probabilities are time-dependent. Under some natural assumptions about the conformation space, the probability distribution over the conformation space tends to the distribution over optimum solutions (Hajek, 1988). We note that Hajek’s result has been discussed before in the wider context of protein folding by Reidys and Stadler (2002) and Apaydin et al. (2003).

We focus on the three-dimensional rectangular lattice model, but extensions to other lattice models, e.g. as described by Backofen (2004), are actually depending upon the definition of a suitable neighbourhood relation only. We employ the pull-move set for the definition of the neighbourhood relation in the conformation space, which was independently introduced by Lesh et al. (2003) and Milostan et al. (2003) (see also Blazewicz et al., 2005). We further analyse experimentally the time complexity bound (m/δ)aD for protein folding simulations as presented by Albrecht and Steinhöfel (2006) and Steinhöfel et al. (2007a), where a is a small constant and D the above-mentioned maximum escape height from local minima; the parameter m=O(n) depends on the lattice structure and the pull-move set; the value p=1δ indicates the confidence that a minimum energy conformation has been found after the number of search steps provided by the complexity bound.

In the present paper, the complexity bound is improved to (m/δ)D/γ, where γ relates to the mean of non-zero differences between the values of the objective function for neighbouring conformations. Unfortunately, to calculate or approximate the parameter D by analytical methods is, in general, a difficult problem. Therefore, we try to provide experimental evidence for a worst case upper bound that depends on the sequence length n only. Based on Fu and Wang (2004) and Finkelstein and Badretdinov (1997), we conjecture D<n2/3. The conjecture is verified on ten benchmark problems defined by Beutler and Dill (1996) (see also Blazewicz et al., 2005). For m=n/2, 1δ=0.5 and D=n2/3/κ<n2/3 with κ>1, the upper bound complies with the results of our computational experiments and we were able to find ground states for all ten benchmark problems.

We note that the simulation experiments in the present paper are not optimized with respect to the run-time in order to have a strict comparison to the theoretical convergence bound (i.e. the impact of the parameter D). A preliminary version of the present paper has been presented at the International Conference on Bioinformatics Research and Development in March 2007 (Steinhöfel et al., 2007b).

Section snippets

Previous Work on Run-time Analysis of Folding Simulations

Apart from a run-time analysis related to the NP-completeness, there are only a few papers dealing with a rigorous run-time estimation (upper bounds) of algorithms for protein folding simulation.

Clote (1999) approached the problem by Metropolis’ algorithm (Metropolis et al., 1953), which is basically simulated annealing at a fixed temperature. Under natural assumptions about the underlying Markov chain, the distribution over all feasible conformations tends to the stationary Boltzmann

Run-time Analysis for a Simulated Annealing-based Approach

As mentioned in Section 1 already, simulated annealing-based search has been utilized in protein folding simulations for about twenty years now. The applications of simulated annealing to protein folding have in common that they rely on a combination of Metropolis’ algorithm (Metropolis et al., 1953) with some heuristic cooling schedule. The justification for this approach is based upon the following observation: at a fixed “temperature” T, the probability distribution over the conformation

Numeric Results on 3D Benchmark Problems

The algorithm as described in Section 3.1 until Section 3.3 was implemented in C++ and running under Gentoo Linux on a 2.4 GHz Intel Pentium IV processor.

The run-time estimation (15) is problem-specific, i.e. depends on the parameter D of the landscape induced by an individual protein sequence and the pull-move set. For a problem-independent, worst case upper bound we conjecture D<n2/3, which is close to the energy barrier calculated by Finkelstein and Badretdinov (1997) and complies with the

Concluding Remarks

Our computational experiments on ten prominent 3D benchmark problems for protein folding simulations in the HP-model provide evidence that the stopping criterion (n/2δ)n2/3/γ complies with the number of Markov chain transitions that lead to minimum conformations. Moreover, the experimental results imply for the given benchmark set and neighbourhood relation that the maximum escape height from local minima can be upper bounded by n2/3. Currently under way is a comparison of foldings in the

Acknowledgement

The research has been partially supported by EPSRC Grant EP/D062012/1.

References (64)

  • J.L. Klepeis et al.

    A new class of hybrid global optimization algorithms for peptide structure prediction: integrated hybrids

    Comput. Phys. Commun.

    (2003)
  • V.B.P. Leite et al.

    Probing the kinetics of single molecule protein folding

    Biophys. J.

    (2004)
  • M.S. Li et al.

    Thermal denaturation and folding rates of single domain proteins: size matters

    Polymer

    (2004)
  • M. Löwe

    Simulated annealing with time dependent energy function via Sobolev inequalities

    Stoch. Proc. Appl.

    (1996)
  • M. Paterson et al.

    On the complexity of string folding

    Discrete Appl. Math.

    (1996)
  • R. Unger et al.

    Genetic algorithms for protein folding simulations

    J. Mol. Biol.

    (1993)
  • J. Wang

    The complex kinetics of protein folding in wide temperature ranges

    Biophys. J.

    (2004)
  • S.R. Wilson et al.

    Conformational analysis of flexible molecules: location of the global minimum energy conformation by the simulated annealing method

    Tetrahedron Lett.

    (1988)
  • Y. Zhu et al.

    Guiding the search for a protein’s maximum rate of folding

    Chem. Phys.

    (2004)
  • E.H.L. Aarts

    Local Search in Combinatorial Optimization

    (1998)
  • A.A. Albrecht

    A stopping criterion for logarithmic simulated annealing

    Computing

    (2006)
  • Albrecht, A.A., Steinhöfel, K., 2006. Run-time estimates for protein folding simulation in the H–P model. In: Online...
  • C.B. Anfinsen

    Principles that govern the folding of protein chains

    Science

    (1973)
  • M.S. Apaydin et al.

    Stochastic roadmap simulation: an efficient representation and algorithm for analyzing molecular motion

    J. Comp. Biol.

    (2003)
  • B. Berger et al.

    Protein folding in the hydrophobic–hydrophilic (HP) model is NP-complete

    J. Comput. Biol.

    (1998)
  • T.C. Beutler et al.

    A fast conformational search strategy for finding low energy structures of model proteins

    Protein Sci.

    (1996)
  • O. Catoni

    Rough large deviation estimates for simulated annealing: applications to exponential schedules

    Ann. Prob.

    (1992)
  • V. Černy

    A thermodynamical approach to the travelling salesman problem: an efficient simulation algorithm

    J. Optim. Theor. Appl.

    (1985)
  • T.S. Chiang et al.

    A limit theorem for a class of inhomogeneous Markov processes

    Ann. Prob.

    (1989)
  • P. Clote
  • K.A. Dill et al.

    Principles of protein folding—a perspective from simple exact models

    Protein Sci.

    (1995)
  • P. Eastman et al.

    Simulation of protein folding by reaction path annealing

    J. Chem. Phys.

    (2001)
  • Cited by (22)

    • Heuristic energy landscape paving for protein folding problem in the three-dimensional HP lattice model

      2012, Computational Biology and Chemistry
      Citation Excerpt :

      Despite its simplification, the corresponding protein folding problem based on the HP model has proven to be NP-complete (Berger and Leighton, 1998). In recent years, a wide variety of approximate computational methods have been employed to simulate and analyze this model, including genetic algorithm (GA) (Unger and Moult, 1993; Custódio et al., 2004; Huang et al., 2010), self-organizing map (SOM) (Zhang et al., 2005), the pruned-enriched Rosenbluth method (PERM) (Grassberger, 1997) and its variations (nPERMis (Hsu et al., 2003) and nPERMh (Huang et al., 2005)), elastic net algorithm with local search (EN) (Guo and Feng, 2006), contact interactions method (CI) (Toma and Toma, 1996), core-directed chain growth method (CG) (Beutler and Dill, 1996), constraint-based hydrophobic core construction (CHCC) (Yue et al., 1995), ant colony optimization (ACO) (Shmygelska and Hoos, 2005), simulated annealing algorithm (SA) (Albrecht et al., 2008), replica exchange Monte Carlo method (REMC) (Thachuk et al., 2007), particle swarm optimization (PSO) (Kanj et al., 2009), and others. In this paper, the modified energy landscape paving (ELP) (Liu et al., 2009; Liu and Li, 2010) is introduced for the protein folding problem in 3D HP lattice model.

    • A Novel Branch-and-Bound Algorithm for the Protein Folding Problem in the 3D HP Model

      2021, IEEE/ACM Transactions on Computational Biology and Bioinformatics
    • A new effective algorithm for protein chain lattice fit problem

      2016, 2015 IEEE International WIE Conference on Electrical and Computer Engineering, WIECON-ECE 2015
    View all citing articles on Scopus
    View full text