Stochastic protein folding simulation in the three-dimensional HP-model
Introduction
Anfinsen’s thermodynamic hypothesis motivates the attempt to predict protein folding by solving certain optimization problems (Anfinsen, 1973). However, there are two main difficulties with this approach: the precise definition of the energy function that has to be minimised, and the extremely difficult optimization problems arising from energy functions commonly used in folding simulations Neumaier, 1997, Greenberg et al., 2004. To overcome at least some of these difficulties, various lattice models and simplified energy functions have been introduced. One of the most popular lattice models of protein folding is the hydrophobic–hydrophilic (HP) model Dill et al., 1995, Aichholzer et al., 2003, Greenberg et al., 2004, Leite et al., 2004, Schiemann et al., 2005. In the HP-model, proteins are modelled as chains whose vertices are marked either as H (hydrophobic) or P (hydrophilic); the resulting chain is embedded into some lattice; H nodes are considered to attract each other while P nodes are neutral. An optimal embedding is one that maximizes the number of H–H contacts. The rationale for this objective is that hydrophobic interactions contribute a significant portion of the total energy function Greenberg et al., 2004, Brylinski et al., 2006. Unlike more sophisticated models of protein folding, the main goal of the HP-model is to explore broad qualitative questions about protein folding such as whether the dominant interactions are local or global with respect to the chain (Greenberg et al., 2004). A major goal of our research on stochastic protein folding simulations in the HP-model is to investigate whether optimum conformations computed within this model (and some refinements of this model) can serve as appropriate starting conformations for folding simulations of real protein sequences and realistic energy functions.
In the present paper we focus on complexity aspects of protein folding simulations. For a variety of lattice models, protein folding simulation has been proven to be -complete Paterson and Przytycka, 1996, Berger and Leighton, 1998, Nayak et al., 1999, which a priori implies an upper time bound for deterministic folding simulations of sequences of length n (unless , which is generally considered to be unlikely). To the best of our knowledge, Clote (1999) was the first to present a rigorous complexity analysis of protein folding simulations to optimum conformations. The underlying model is Monte Carlo simulations, and the aim is to provide a theoretical underpinning for an earlier work by Sali et al. (1994) where the authors conjecture that the folding time is inverse proportional to the energy gap between the ground state and the lowest energy of a misfolded state. By using the notion of rapidly mixing Markov chains (Sinclair, 1993), Clote (1999) prove an upper bound for the mean first passage time through the ground state that decreases if the energy gap increases. The upper bound does not explicitly relate protein sequence parameters to the first passage time. For protein sequences of length n, Fu and Wang (2004) proved an deterministic algorithm for d-dimensional protein folding simulation to optimum conformations in the HP-model. Linear time algorithms for constant-factor approximations of optimum conformations are reported by Hart and Istrail (1997) and Heun (2003); however, protein folding is one of the problems where indeed the knowledge of optimum solutions is essential.
The importance of the HP-model is highlighted by the fact that the algorithmic complexity bound by Fu and Wang (2004) comes close to the (worst case) folding time approximation of ns by Finkelstein and Badretdinov (1997), where and are constants close to unity. We note that both time bounds depend on the sequence length n only. In our paper, we try to link the time estimation to crucial parameters of the underlying energy landscape induced by a particular sequence. This is motivated by the ongoing attempts to improve folding time approximations for specific (subsets of) protein sequences. For example, Bakk et al. (2002) suggest folding time approximations of and for some small globular proteins; Galzitskaya et al. (2003) stress the dependence of folding time on sequence length and analyse folding time estimations of type , , for 54 proteins, where ; a comprehensive experimental study of folding time as a function of temperature is presented by Zhu et al. (2004) for a single sequence (and its mutants); similar problems are discussed by Lee et al. (2003), Leite et al. (2004), Wang (2004), and Li et al. (2004), based on analytical approximations and numeric simulations. We note that the model used by Li et al. (2004) is close to the above-mentioned HP-model and the model employed by Clote (1999), including a combinatorial neighbourhood similar to the one we utilize in the present paper. The coefficient in the folding time approximation by Finkelstein and Badretdinov (1997) has been further analysed by Ivankov et al. (2003).
We analyse a sequence-dependent time bound for protein folding simulations in the HP-model. The simulations utilize a stochastic local search procedure that is based on a specific type of simulated annealing. Simulated annealing-based search has been discussed and applied in the context of protein folding simulation for about twenty years Li and Scheraga, 1987, Wilson et al., 1988, Kawai et al., 1989 as an extension of Monte Carlo methods (Paine and Scheraga, 1985), shortly after the independent introduction of simulated annealing as a new optimization tool by Kirkpatrick et al. (1983) and Černy (1985). For on overview on early applications of simulated annealing to protein folding simulations, we refer the reader to Hansmann and Okamoto (1996). Apart from simulated annealing, search-based methods like genetic algorithms Unger and Moult, 1993, Rabow and Scheraga, 1996, Cooper et al., 2003 and tabu search (Pardalos et al., 1997) have been applied to protein folding simulation (see also Greenberg et al., 2004). More recent simulated annealing-related applications to protein folding simulation can be found in Eastman et al. (2001) and Klepeis et al. (2003).
In order to be able to relate the time bound of simulations to landscape parameters, we utilize the cooling schedule in simulated annealing procedures, where D is the maximum value of the minimum escape height from local minima of the underlying conformation space, i.e. the transition probabilities are time-dependent. Under some natural assumptions about the conformation space, the probability distribution over the conformation space tends to the distribution over optimum solutions (Hajek, 1988). We note that Hajek’s result has been discussed before in the wider context of protein folding by Reidys and Stadler (2002) and Apaydin et al. (2003).
We focus on the three-dimensional rectangular lattice model, but extensions to other lattice models, e.g. as described by Backofen (2004), are actually depending upon the definition of a suitable neighbourhood relation only. We employ the pull-move set for the definition of the neighbourhood relation in the conformation space, which was independently introduced by Lesh et al. (2003) and Milostan et al. (2003) (see also Blazewicz et al., 2005). We further analyse experimentally the time complexity bound for protein folding simulations as presented by Albrecht and Steinhöfel (2006) and Steinhöfel et al. (2007a), where a is a small constant and D the above-mentioned maximum escape height from local minima; the parameter depends on the lattice structure and the pull-move set; the value indicates the confidence that a minimum energy conformation has been found after the number of search steps provided by the complexity bound.
In the present paper, the complexity bound is improved to , where relates to the mean of non-zero differences between the values of the objective function for neighbouring conformations. Unfortunately, to calculate or approximate the parameter D by analytical methods is, in general, a difficult problem. Therefore, we try to provide experimental evidence for a worst case upper bound that depends on the sequence length n only. Based on Fu and Wang (2004) and Finkelstein and Badretdinov (1997), we conjecture . The conjecture is verified on ten benchmark problems defined by Beutler and Dill (1996) (see also Blazewicz et al., 2005). For , and with , the upper bound complies with the results of our computational experiments and we were able to find ground states for all ten benchmark problems.
We note that the simulation experiments in the present paper are not optimized with respect to the run-time in order to have a strict comparison to the theoretical convergence bound (i.e. the impact of the parameter D). A preliminary version of the present paper has been presented at the International Conference on Bioinformatics Research and Development in March 2007 (Steinhöfel et al., 2007b).
Section snippets
Previous Work on Run-time Analysis of Folding Simulations
Apart from a run-time analysis related to the NP-completeness, there are only a few papers dealing with a rigorous run-time estimation (upper bounds) of algorithms for protein folding simulation.
Clote (1999) approached the problem by Metropolis’ algorithm (Metropolis et al., 1953), which is basically simulated annealing at a fixed temperature. Under natural assumptions about the underlying Markov chain, the distribution over all feasible conformations tends to the stationary Boltzmann
Run-time Analysis for a Simulated Annealing-based Approach
As mentioned in Section 1 already, simulated annealing-based search has been utilized in protein folding simulations for about twenty years now. The applications of simulated annealing to protein folding have in common that they rely on a combination of Metropolis’ algorithm (Metropolis et al., 1953) with some heuristic cooling schedule. The justification for this approach is based upon the following observation: at a fixed “temperature” T, the probability distribution over the conformation
Numeric Results on 3D Benchmark Problems
The algorithm as described in Section 3.1 until Section 3.3 was implemented in and running under Gentoo Linux on a 2.4 GHz Intel Pentium IV processor.
The run-time estimation (15) is problem-specific, i.e. depends on the parameter D of the landscape induced by an individual protein sequence and the pull-move set. For a problem-independent, worst case upper bound we conjecture , which is close to the energy barrier calculated by Finkelstein and Badretdinov (1997) and complies with the
Concluding Remarks
Our computational experiments on ten prominent 3D benchmark problems for protein folding simulations in the HP-model provide evidence that the stopping criterion complies with the number of Markov chain transitions that lead to minimum conformations. Moreover, the experimental results imply for the given benchmark set and neighbourhood relation that the maximum escape height from local minima can be upper bounded by . Currently under way is a comparison of foldings in the
Acknowledgement
The research has been partially supported by EPSRC Grant EP/D062012/1.
References (64)
- et al.
Long proteins with unique optimal foldings in the H-P model
Comp. Geom. -Theor. Appl.
(2003) A polynomial time upper bound for the number of contacts in the HP-model on the face-centred-cubic lattice (FCC)
J. Discrete Alg.
(2004)- et al.
Thermodynamics of proteins: fast folders and sharp transitions
Comput. Phys. Commun.
(2002) - et al.
Application of tabu search strategy for finding low energy structure of protein
Artif. Intell. Med.
(2005) - et al.
Hydrophobic collapse in (in silico) protein folding
Comput. Biol. Chem.
(2006) - et al.
Sufficient and necessary condition for the convergence of stochastic approximation algorithms
Stat. Prob. Lett.
(2006) - et al.
Use of a novel hill-climbing genetic algorithm in protein folding simulations
Comput. Biol. Chem.
(2003) - et al.
Rate of protein folding near the point of thermodynamic equilibrium between the coil and the most stable chain fold
Fold. Des.
(1997) Approximate protein folding in the HP side chain model on extended cubic lattices
Discrete Appl. Math.
(2003)- et al.
On the convergence of generalized hill climbing algorithms
Discrete Appl. Math.
(2002)
A new class of hybrid global optimization algorithms for peptide structure prediction: integrated hybrids
Comput. Phys. Commun.
Probing the kinetics of single molecule protein folding
Biophys. J.
Thermal denaturation and folding rates of single domain proteins: size matters
Polymer
Simulated annealing with time dependent energy function via Sobolev inequalities
Stoch. Proc. Appl.
On the complexity of string folding
Discrete Appl. Math.
Genetic algorithms for protein folding simulations
J. Mol. Biol.
The complex kinetics of protein folding in wide temperature ranges
Biophys. J.
Conformational analysis of flexible molecules: location of the global minimum energy conformation by the simulated annealing method
Tetrahedron Lett.
Guiding the search for a protein’s maximum rate of folding
Chem. Phys.
Local Search in Combinatorial Optimization
A stopping criterion for logarithmic simulated annealing
Computing
Principles that govern the folding of protein chains
Science
Stochastic roadmap simulation: an efficient representation and algorithm for analyzing molecular motion
J. Comp. Biol.
Protein folding in the hydrophobic–hydrophilic (HP) model is NP-complete
J. Comput. Biol.
A fast conformational search strategy for finding low energy structures of model proteins
Protein Sci.
Rough large deviation estimates for simulated annealing: applications to exponential schedules
Ann. Prob.
A thermodynamical approach to the travelling salesman problem: an efficient simulation algorithm
J. Optim. Theor. Appl.
A limit theorem for a class of inhomogeneous Markov processes
Ann. Prob.
Principles of protein folding—a perspective from simple exact models
Protein Sci.
Simulation of protein folding by reaction path annealing
J. Chem. Phys.
Cited by (22)
Heuristic energy landscape paving for protein folding problem in the three-dimensional HP lattice model
2012, Computational Biology and ChemistryCitation Excerpt :Despite its simplification, the corresponding protein folding problem based on the HP model has proven to be NP-complete (Berger and Leighton, 1998). In recent years, a wide variety of approximate computational methods have been employed to simulate and analyze this model, including genetic algorithm (GA) (Unger and Moult, 1993; Custódio et al., 2004; Huang et al., 2010), self-organizing map (SOM) (Zhang et al., 2005), the pruned-enriched Rosenbluth method (PERM) (Grassberger, 1997) and its variations (nPERMis (Hsu et al., 2003) and nPERMh (Huang et al., 2005)), elastic net algorithm with local search (EN) (Guo and Feng, 2006), contact interactions method (CI) (Toma and Toma, 1996), core-directed chain growth method (CG) (Beutler and Dill, 1996), constraint-based hydrophobic core construction (CHCC) (Yue et al., 1995), ant colony optimization (ACO) (Shmygelska and Hoos, 2005), simulated annealing algorithm (SA) (Albrecht et al., 2008), replica exchange Monte Carlo method (REMC) (Thachuk et al., 2007), particle swarm optimization (PSO) (Kanj et al., 2009), and others. In this paper, the modified energy landscape paving (ELP) (Liu et al., 2009; Liu and Li, 2010) is introduced for the protein folding problem in 3D HP lattice model.
Population-based local search for protein folding simulation in the MJ energy model and cubic lattices
2009, Computational Biology and ChemistryA Novel Branch-and-Bound Algorithm for the Protein Folding Problem in the 3D HP Model
2021, IEEE/ACM Transactions on Computational Biology and BioinformaticsProtein Tertiary Structure Prediction with Hybrid Clonal Selection and Differential Evolution Algorithms
2020, Advances in Intelligent Systems and ComputingA new effective algorithm for protein chain lattice fit problem
2016, 2015 IEEE International WIE Conference on Electrical and Computer Engineering, WIECON-ECE 2015Extended particle swarm optimisation method for folding protein on triangular lattice
2016, IET Systems Biology