Stochastic protein folding simulation in the three-dimensional HP-model

doi:10.1016/j.compbiolchem.2008.03.004

Computational Biology and Chemistry

Volume 32, Issue 4, August 2008, Pages 248-255

https://doi.org/10.1016/j.compbiolchem.2008.03.004 Get rights and content

Abstract

We present results from three-dimensional protein folding simulations in the HP-model on ten benchmark problems. The simulations are executed by a simulated annealing-based algorithm with a time-dependent cooling schedule. The neighbourhood relation is determined by the pull-move set. The results provide experimental evidence that the maximum depth D of local minima of the underlying energy landscape can be upper bounded by $D < n^{2 / 3}$ . The local search procedure employs the stopping criterion ${(m / δ)}^{D / γ}$ , where m is an estimation of the average number of neighbouring conformations, $γ$ relates to the mean of non-zero differences of the objective function for neighbouring conformations, and $1 - δ$ is the confidence that a minimum conformation has been found. The bound complies with the results obtained for the ten benchmark problems.

Introduction

Anfinsen’s thermodynamic hypothesis motivates the attempt to predict protein folding by solving certain optimization problems (Anfinsen, 1973). However, there are two main difficulties with this approach: the precise definition of the energy function that has to be minimised, and the extremely difficult optimization problems arising from energy functions commonly used in folding simulations Neumaier, 1997, Greenberg et al., 2004. To overcome at least some of these difficulties, various lattice models and simplified energy functions have been introduced. One of the most popular lattice models of protein folding is the hydrophobic–hydrophilic (HP) model Dill et al., 1995, Aichholzer et al., 2003, Greenberg et al., 2004, Leite et al., 2004, Schiemann et al., 2005. In the HP-model, proteins are modelled as chains whose vertices are marked either as H (hydrophobic) or P (hydrophilic); the resulting chain is embedded into some lattice; H nodes are considered to attract each other while P nodes are neutral. An optimal embedding is one that maximizes the number of H–H contacts. The rationale for this objective is that hydrophobic interactions contribute a significant portion of the total energy function Greenberg et al., 2004, Brylinski et al., 2006. Unlike more sophisticated models of protein folding, the main goal of the HP-model is to explore broad qualitative questions about protein folding such as whether the dominant interactions are local or global with respect to the chain (Greenberg et al., 2004). A major goal of our research on stochastic protein folding simulations in the HP-model is to investigate whether optimum conformations computed within this model (and some refinements of this model) can serve as appropriate starting conformations for folding simulations of real protein sequences and realistic energy functions.

In the present paper we focus on complexity aspects of protein folding simulations. For a variety of lattice models, protein folding simulation has been proven to be $NP$ -complete Paterson and Przytycka, 1996, Berger and Leighton, 1998, Nayak et al., 1999, which a priori implies an upper time bound $\exp (n)$ for deterministic folding simulations of sequences of length n (unless $P = NP$ , which is generally considered to be unlikely). To the best of our knowledge, Clote (1999) was the first to present a rigorous complexity analysis of protein folding simulations to optimum conformations. The underlying model is Monte Carlo simulations, and the aim is to provide a theoretical underpinning for an earlier work by Sali et al. (1994) where the authors conjecture that the folding time is inverse proportional to the energy gap between the ground state and the lowest energy of a misfolded state. By using the notion of rapidly mixing Markov chains (Sinclair, 1993), Clote (1999) prove an upper bound for the mean first passage time through the ground state that decreases if the energy gap increases. The upper bound does not explicitly relate protein sequence parameters to the first passage time. For protein sequences of length n, Fu and Wang (2004) proved an $\exp (O (n^{1 - 1 / d} \cdot \ln n))$ deterministic algorithm for d-dimensional protein folding simulation to optimum conformations in the HP-model. Linear time algorithms for constant-factor approximations of optimum conformations are reported by Hart and Istrail (1997) and Heun (2003); however, protein folding is one of the problems where indeed the knowledge of optimum solutions is essential.

The importance of the HP-model is highlighted by the fact that the algorithmic complexity bound by Fu and Wang (2004) comes close to the (worst case) folding time approximation of $\exp (λ \cdot n^{2 / 3} \pm χ \cdot n^{1 / 2} / 2)$ ns by Finkelstein and Badretdinov (1997), where $λ$ and $χ$ are constants close to unity. We note that both time bounds depend on the sequence length n only. In our paper, we try to link the time estimation to crucial parameters of the underlying energy landscape induced by a particular sequence. This is motivated by the ongoing attempts to improve folding time approximations for specific (subsets of) protein sequences. For example, Bakk et al. (2002) suggest folding time approximations of $O (n^{2})$ and $\exp (O (\sqrt{n}))$ for some small globular proteins; Galzitskaya et al. (2003) stress the dependence of folding time on sequence length and analyse folding time estimations of type $\exp (O (n^{p}))$ , $0 < p \leq 1$ , for 54 proteins, where $41 \leq n \leq 396$ ; a comprehensive experimental study of folding time as a function of temperature is presented by Zhu et al. (2004) for a single sequence (and its mutants); similar problems are discussed by Lee et al. (2003), Leite et al. (2004), Wang (2004), and Li et al. (2004), based on analytical approximations and numeric simulations. We note that the model used by Li et al. (2004) is close to the above-mentioned HP-model and the model employed by Clote (1999), including a combinatorial neighbourhood similar to the one we utilize in the present paper. The coefficient $λ$ in the folding time approximation by Finkelstein and Badretdinov (1997) has been further analysed by Ivankov et al. (2003).

We analyse a sequence-dependent time bound for protein folding simulations in the HP-model. The simulations utilize a stochastic local search procedure that is based on a specific type of simulated annealing. Simulated annealing-based search has been discussed and applied in the context of protein folding simulation for about twenty years Li and Scheraga, 1987, Wilson et al., 1988, Kawai et al., 1989 as an extension of Monte Carlo methods (Paine and Scheraga, 1985), shortly after the independent introduction of simulated annealing as a new optimization tool by Kirkpatrick et al. (1983) and Černy (1985). For on overview on early applications of simulated annealing to protein folding simulations, we refer the reader to Hansmann and Okamoto (1996). Apart from simulated annealing, search-based methods like genetic algorithms Unger and Moult, 1993, Rabow and Scheraga, 1996, Cooper et al., 2003 and tabu search (Pardalos et al., 1997) have been applied to protein folding simulation (see also Greenberg et al., 2004). More recent simulated annealing-related applications to protein folding simulation can be found in Eastman et al. (2001) and Klepeis et al. (2003).

In order to be able to relate the time bound of simulations to landscape parameters, we utilize the cooling schedule $T = T (k) = D / \log (k + 2)$ in simulated annealing procedures, where D is the maximum value of the minimum escape height from local minima of the underlying conformation space, i.e. the transition probabilities are time-dependent. Under some natural assumptions about the conformation space, the probability distribution over the conformation space tends to the distribution over optimum solutions (Hajek, 1988). We note that Hajek’s result has been discussed before in the wider context of protein folding by Reidys and Stadler (2002) and Apaydin et al. (2003).

We focus on the three-dimensional rectangular lattice model, but extensions to other lattice models, e.g. as described by Backofen (2004), are actually depending upon the definition of a suitable neighbourhood relation only. We employ the pull-move set for the definition of the neighbourhood relation in the conformation space, which was independently introduced by Lesh et al. (2003) and Milostan et al. (2003) (see also Blazewicz et al., 2005). We further analyse experimentally the time complexity bound ${(m / δ)}^{a \cdot D}$ for protein folding simulations as presented by Albrecht and Steinhöfel (2006) and Steinhöfel et al. (2007a), where a is a small constant and D the above-mentioned maximum escape height from local minima; the parameter $m = O (n)$ depends on the lattice structure and the pull-move set; the value $p = 1 - δ$ indicates the confidence that a minimum energy conformation has been found after the number of search steps provided by the complexity bound.

In the present paper, the complexity bound is improved to ${(m / δ)}^{D / γ}$ , where $γ$ relates to the mean of non-zero differences between the values of the objective function for neighbouring conformations. Unfortunately, to calculate or approximate the parameter D by analytical methods is, in general, a difficult problem. Therefore, we try to provide experimental evidence for a worst case upper bound that depends on the sequence length n only. Based on Fu and Wang (2004) and Finkelstein and Badretdinov (1997), we conjecture $D < n^{2 / 3}$ . The conjecture is verified on ten benchmark problems defined by Beutler and Dill (1996) (see also Blazewicz et al., 2005). For $m = n / 2$ , $1 - δ = 0.5$ and $D = n^{2 / 3} / κ < n^{2 / 3}$ with $κ > 1$ , the upper bound complies with the results of our computational experiments and we were able to find ground states for all ten benchmark problems.

We note that the simulation experiments in the present paper are not optimized with respect to the run-time in order to have a strict comparison to the theoretical convergence bound (i.e. the impact of the parameter D). A preliminary version of the present paper has been presented at the International Conference on Bioinformatics Research and Development in March 2007 (Steinhöfel et al., 2007b).

Section snippets

Previous Work on Run-time Analysis of Folding Simulations

Apart from a run-time analysis related to the NP-completeness, there are only a few papers dealing with a rigorous run-time estimation (upper bounds) of algorithms for protein folding simulation.

Clote (1999) approached the problem by Metropolis’ algorithm (Metropolis et al., 1953), which is basically simulated annealing at a fixed temperature. Under natural assumptions about the underlying Markov chain, the distribution over all feasible conformations tends to the stationary Boltzmann

Run-time Analysis for a Simulated Annealing-based Approach

As mentioned in Section 1 already, simulated annealing-based search has been utilized in protein folding simulations for about twenty years now. The applications of simulated annealing to protein folding have in common that they rely on a combination of Metropolis’ algorithm (Metropolis et al., 1953) with some heuristic cooling schedule. The justification for this approach is based upon the following observation: at a fixed “temperature” T, the probability distribution over the conformation

Numeric Results on 3D Benchmark Problems

The algorithm as described in Section 3.1 until Section 3.3 was implemented in $C^{+ +}$ and running under Gentoo Linux on a 2.4 GHz Intel Pentium IV processor.

The run-time estimation (15) is problem-specific, i.e. depends on the parameter D of the landscape induced by an individual protein sequence and the pull-move set. For a problem-independent, worst case upper bound we conjecture $D < n^{2 / 3}$ , which is close to the energy barrier calculated by Finkelstein and Badretdinov (1997) and complies with the

Concluding Remarks

Our computational experiments on ten prominent 3D benchmark problems for protein folding simulations in the HP-model provide evidence that the stopping criterion ${(n / 2 \cdot δ)}^{n^{2 / 3} / γ}$ complies with the number of Markov chain transitions that lead to minimum conformations. Moreover, the experimental results imply for the given benchmark set and neighbourhood relation that the maximum escape height from local minima can be upper bounded by $n^{2 / 3}$ . Currently under way is a comparison of foldings in the

Acknowledgement

The research has been partially supported by EPSRC Grant EP/D062012/1.

References (64)

O. Aichholzer et al.
Long proteins with unique optimal foldings in the H-P model
Comp. Geom. -Theor. Appl.
(2003)
R. Backofen
A polynomial time upper bound for the number of contacts in the HP-model on the face-centred-cubic lattice (FCC)
J. Discrete Alg.
(2004)
A. Bakk et al.
Thermodynamics of proteins: fast folders and sharp transitions
Comput. Phys. Commun.
(2002)
J. Blazewicz et al.
Application of tabu search strategy for finding low energy structure of protein
Artif. Intell. Med.
(2005)
M. Brylinski et al.
Hydrophobic collapse in (in silico) protein folding
Comput. Biol. Chem.
(2006)
N. Chen et al.
Sufficient and necessary condition for the convergence of stochastic approximation algorithms
Stat. Prob. Lett.
(2006)
L.R. Cooper et al.
Use of a novel hill-climbing genetic algorithm in protein folding simulations
Comput. Biol. Chem.
(2003)
A.V. Finkelstein et al.
Rate of protein folding near the point of thermodynamic equilibrium between the coil and the most stable chain fold
Fold. Des.
(1997)
V. Heun
Approximate protein folding in the HP side chain model on extended cubic lattices
Discrete Appl. Math.
(2003)
A.W. Johnson et al.
On the convergence of generalized hill climbing algorithms
Discrete Appl. Math.
(2002)

J.L. Klepeis et al.

A new class of hybrid global optimization algorithms for peptide structure prediction: integrated hybrids

Comput. Phys. Commun.

(2003)

V.B.P. Leite et al.

Probing the kinetics of single molecule protein folding

Biophys. J.

(2004)

M.S. Li et al.

Thermal denaturation and folding rates of single domain proteins: size matters

Polymer

(2004)

M. Löwe

Simulated annealing with time dependent energy function via Sobolev inequalities

Stoch. Proc. Appl.

(1996)

M. Paterson et al.

On the complexity of string folding

Discrete Appl. Math.

(1996)

R. Unger et al.

Genetic algorithms for protein folding simulations

J. Mol. Biol.

(1993)

J. Wang

The complex kinetics of protein folding in wide temperature ranges

Biophys. J.

(2004)

S.R. Wilson et al.

Conformational analysis of flexible molecules: location of the global minimum energy conformation by the simulated annealing method

Tetrahedron Lett.

(1988)

Y. Zhu et al.

Guiding the search for a protein’s maximum rate of folding

Chem. Phys.

(2004)

E.H.L. Aarts

Local Search in Combinatorial Optimization

(1998)

A.A. Albrecht

A stopping criterion for logarithmic simulated annealing

Computing

(2006)

Albrecht, A.A., Steinhöfel, K., 2006. Run-time estimates for protein folding simulation in the H–P model. In: Online...

C.B. Anfinsen

Principles that govern the folding of protein chains

Science

(1973)

M.S. Apaydin et al.

Stochastic roadmap simulation: an efficient representation and algorithm for analyzing molecular motion

J. Comp. Biol.

(2003)

B. Berger et al.

Protein folding in the hydrophobic–hydrophilic (HP) model is NP-complete

J. Comput. Biol.

(1998)

T.C. Beutler et al.

A fast conformational search strategy for finding low energy structures of model proteins

Protein Sci.

(1996)

O. Catoni

Rough large deviation estimates for simulated annealing: applications to exponential schedules

Ann. Prob.

(1992)

V. Černy

A thermodynamical approach to the travelling salesman problem: an efficient simulation algorithm

J. Optim. Theor. Appl.

(1985)

T.S. Chiang et al.

A limit theorem for a class of inhomogeneous Markov processes

Ann. Prob.

(1989)

P. Clote

K.A. Dill et al.

Principles of protein folding—a perspective from simple exact models

Protein Sci.

(1995)

P. Eastman et al.

Simulation of protein folding by reaction path annealing

J. Chem. Phys.

(2001)

Cited by (22)

Heuristic energy landscape paving for protein folding problem in the three-dimensional HP lattice model
2012, Computational Biology and Chemistry
Citation Excerpt :
Despite its simplification, the corresponding protein folding problem based on the HP model has proven to be NP-complete (Berger and Leighton, 1998). In recent years, a wide variety of approximate computational methods have been employed to simulate and analyze this model, including genetic algorithm (GA) (Unger and Moult, 1993; Custódio et al., 2004; Huang et al., 2010), self-organizing map (SOM) (Zhang et al., 2005), the pruned-enriched Rosenbluth method (PERM) (Grassberger, 1997) and its variations (nPERMis (Hsu et al., 2003) and nPERMh (Huang et al., 2005)), elastic net algorithm with local search (EN) (Guo and Feng, 2006), contact interactions method (CI) (Toma and Toma, 1996), core-directed chain growth method (CG) (Beutler and Dill, 1996), constraint-based hydrophobic core construction (CHCC) (Yue et al., 1995), ant colony optimization (ACO) (Shmygelska and Hoos, 2005), simulated annealing algorithm (SA) (Albrecht et al., 2008), replica exchange Monte Carlo method (REMC) (Thachuk et al., 2007), particle swarm optimization (PSO) (Kanj et al., 2009), and others. In this paper, the modified energy landscape paving (ELP) (Liu et al., 2009; Liu and Li, 2010) is introduced for the protein folding problem in 3D HP lattice model.
The protein folding problem, i.e., the prediction of the tertiary structures of protein molecules from their amino acid sequences is one of the most important problems in computational biology and biochemistry. However, the extremely difficult optimization problem arising from energy function is a key challenge in protein folding simulation. The energy landscape paving (ELP) method has already been applied very successfully to off-lattice protein models and other optimization problems with complex energy landscape in continuous space. By improving the ELP method, and subsequently incorporating the neighborhood strategy with the pull-move set into the improved ELP method, a heuristic ELP algorithm is proposed to find low-energy conformations of 3D HP lattice model proteins in the discrete space. The algorithm is tested on three sets of 3D HP benchmark instances consisting 31 sequences. For eleven sequences with 27 monomers, the proposed method explores the conformation surfaces more efficiently than other methods, and finds new lower energies in several cases. For ten 48-monomer sequences, we find the lowest energies so far. With the achieved results, the algorithm converges rapidly and efficiently. For all ten 64-monomer sequences, the algorithm finds lower energies within comparable computation times than previous methods. Numeric results show that the heuristic ELP method is a competitive tool for protein folding simulation in 3D lattice model. To the best of our knowledge, this is the first application of ELP to the 3D discrete space.
Population-based local search for protein folding simulation in the MJ energy model and cubic lattices
2009, Computational Biology and Chemistry
We present experimental results on benchmark problems in 3D cubic lattice structures with the Miyazawa–Jernigan energy function for two local search procedures that utilise the pull-move set: (i) population-based local search (PLS) that traverses the energy landscape with greedy steps towards (potential) local minima followed by upward steps up to a certain level of the objective function; (ii) simulated annealing with a logarithmic cooling schedule (LSA). The parameter settings for PLS are derived from short LSA-runs executed in pre-processing and the procedure utilises tabu lists generated for each member of the population. In terms of the total number of energy function evaluations both methods perform equally well, however, PLS has the potential of being parallelised with an expected speed-up in the region of the population size. Furthermore, both methods require a significant smaller number of function evaluations when compared to Monte Carlo simulations with kink-jump moves.
A Novel Branch-and-Bound Algorithm for the Protein Folding Problem in the 3D HP Model
2021, IEEE/ACM Transactions on Computational Biology and Bioinformatics
Protein Tertiary Structure Prediction with Hybrid Clonal Selection and Differential Evolution Algorithms
2020, Advances in Intelligent Systems and Computing
A new effective algorithm for protein chain lattice fit problem
2016, 2015 IEEE International WIE Conference on Electrical and Computer Engineering, WIECON-ECE 2015
Extended particle swarm optimisation method for folding protein on triangular lattice
2016, IET Systems Biology

View all citing articles on Scopus

View full text

Stochastic protein folding simulation in the three-dimensional HP-model

Abstract

Introduction

Section snippets

Previous Work on Run-time Analysis of Folding Simulations

Run-time Analysis for a Simulated Annealing-based Approach

Numeric Results on 3D Benchmark Problems

Concluding Remarks

Acknowledgement

Comp. Geom. -Theor. Appl.

J. Discrete Alg.

Comput. Phys. Commun.

Artif. Intell. Med.

Comput. Biol. Chem.

Stat. Prob. Lett.

Comput. Biol. Chem.

Fold. Des.

Discrete Appl. Math.

Discrete Appl. Math.

Comput. Phys. Commun.

Biophys. J.

Polymer

Stoch. Proc. Appl.

Discrete Appl. Math.

J. Mol. Biol.

Biophys. J.

Tetrahedron Lett.

Chem. Phys.

Local Search in Combinatorial Optimization

A stopping criterion for logarithmic simulated annealing

Computing

Principles that govern the folding of protein chains

Science

Stochastic roadmap simulation: an efficient representation and algorithm for analyzing molecular motion

J. Comp. Biol.

Protein folding in the hydrophobic–hydrophilic (HP) model is NP-complete

J. Comput. Biol.

A fast conformational search strategy for finding low energy structures of model proteins

Protein Sci.

Rough large deviation estimates for simulated annealing: applications to exponential schedules

Ann. Prob.

A thermodynamical approach to the travelling salesman problem: an efficient simulation algorithm

J. Optim. Theor. Appl.

A limit theorem for a class of inhomogeneous Markov processes

Ann. Prob.

Principles of protein folding—a perspective from simple exact models

Protein Sci.

Simulation of protein folding by reaction path annealing

J. Chem. Phys.