ABSTRACT
Finding motifs — patterns of conserved residues — within nucleotide and protein sequences is a key part of understanding function and regulation within biological systems. This paper presents a review of current approaches to motif discovery, both evolutionary computation based and otherwise, and a speculative look at the advantages of the evolutionary computation approach and where it might lead us in the future. Particular attention is given to the problem of characterising regulatory DNA motifs and the value of expressive representations for providing accurate classification.
- L. A. Anbarasu, V. Sundararajan, and N. Narayanasamy. Parallel genetic algorithm for performance-driven sequence alignment. In L. Spector, E. D. Goodman, A. Wu, et al., eds., Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001), p. 1015. Morgan Kaufmann, San Francisco, California, USA, 7--11 Jul. 2001.Google Scholar
- T. K. Attwood. The PRINTS database: a resource for identification of protein families. Brief Bioinform, 3(3):252--63, Sep 2002.Google ScholarCross Ref
- R. M. A. Azad and C. Ryan. Structural emergence with order independent representations. In E. C.-P. et al., ed., Proceedings of the 2003 Genetic and Evolutionary Computation Conference, GECCO 2003, vol. 2724 of Lecture Notes in Computer Science, pp. 1626--1638. Springer-Verlag, 2003. Google ScholarDigital Library
- T. L. Bailey and W. S. Noble. Searching for statistically significant regulatory modules. Bioinformatics, 19 Suppl 2:II16-II25, Oct 2003.Google Scholar
- P. Baldi and S. Brunak. Bioinformatics: the machine learning approach. MIT Press, 2 edn., 2001. Baldi. Google ScholarDigital Library
- A. Bateman, L. Coin, R. Durbin, et al. The Pfam protein families database. Nucleic Acids Res, 32(Database issue):D138--41, Jan 2004.Google Scholar
- H. M. Berman, J. Westbrook, Z. Feng, et al. The protein data bank. Nucleic Acids Research, 28:235--242, 2000.Google ScholarCross Ref
- K. Blekas, D. I. Fotiadis, and A. Likas. Motif-based protein sequence classification using neural networks. Journal of Computational Biology, 12(1):64--82, February 2005.Google ScholarCross Ref
- B. Boeckmann, A. Bairoch, R. Apweiler, et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res, 31(1):365--70, Jan 2003.Google ScholarCross Ref
- A. Brazma, I. Jonassen, I. Eidhammer, et al. Approaches to the automatic discovery of patterns in biosequences. Journal of Computational Biology, 5(2):277--304, 1998.Google ScholarCross Ref
- C. Bru, E. Courcelle, S. Carrre, et al. The ProDom database of protein domain families: more emphasis on 3D. Nucleic Acids Res, 33 Database Issue:D212--5, Jan 2005.Google Scholar
- L. Cai, D. Juedes, and E. Liakhovitch. Evolutionary computation techniques for multiple sequence alignment. In Proceedings of the 2000 Congress on Evolutionary Computation CEC00, pp. 829--835. IEEE Press, La Jolla Marriott Hotel La Jolla, California, USA, 6--9 Jul. 2000.Google ScholarCross Ref
- R. Carr, W. Hart, N. Krasnogor, et al. Alignment of protein structures with a memetic evolutionary algorithm. In W. B. Langdon, E. Cantú-Paz, K. Mathias, et al., eds., GECCO 2002: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 1027--1034. Morgan Kaufmann Publishers, New York, 9--13 Jul. 2002. Google ScholarDigital Library
- B. C. H. Chang, A. Ratnaweera, H. Halgamuge, et al. Particle swarm optimisation for protein motif discovery. Genetic Programming and Evolvable Machines, 5(2):203--214, 2004. Google ScholarDigital Library
- K. E. Chapman and S. J. Higgins. Regulation of Gene Expression, vol. 37 of Essays in Biochemistry. Portland Press, 2001.Google Scholar
- C. Chau, S. Kwong, C. Diu, et al. Optimization of HMM by a genetic algorithm. In Acoustics, Speech, and Signal Processing, 1997. ICASSP-97., 1997 IEEE International Conference on, vol. 3, pp. 1727 -- 1730. 1997. Google ScholarDigital Library
- K. Chellapilla and G. B. Fogel. Multiple sequence alignment using evolutionary programming. In P. J. Angeline, Z. Michalewicz, M. Schoenauer, et al., eds., Proceedings of the Congress on Evolutionary Computation, vol. 1, pp. 445--452. IEEE Press, Mayflower Hotel, Washington D.C., USA, 6--9 Jul. 1999.Google Scholar
- R. Choudhury. Application of evolutionary computation for multiple sequence alignment. Project report, Stanford University, 2003.Google Scholar
- M. Conrad. The geometry of evolution. BioSystems, 24:61--81, 1990.Google ScholarCross Ref
- D. Corne, A. Meade, and R. Sibly. Evolving core promoter signal motifs. In Proceedings of the 2001 Congress on Evolutionary Computation CEC2001, pp. 1162--1169. IEEE Press, COEX, World Trade Center, 159 Samseong-dong, Gangnam-gu, Seoul, Korea, 27--30 May 2001.Google ScholarCross Ref
- R. Durbin, S. R. Eddy, A. Krogh, et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1999.Google Scholar
- L. Falquet, M. Pagni, P. Bucher, et al. The PROSITE database, its status in 2002. Nucleic Acids Research, 30:235--238, 2002.Google ScholarCross Ref
- M. Frith, U. Hansen, and Z. Weng. Detection of cis-element clusters in higher eukaryotic DNA. Bioinformatics, 17(10):878--89, Oct 2001.Google ScholarCross Ref
- D. E. Goldberg, K. Deb, H. Kargupta, et al. Rapid accurate optimization of difficult problems using fast messy genetic algorithms. In S. Forrest, ed., Proceedings of The Fifth International Conference On Genetic Algorithms. Morgan Kaufmann, 1993. Google ScholarDigital Library
- R. R. Gonzalez, C. M. Izquierdo, and J. Seijas. Multiple protein sequence comparison by genetic algorithms. In S. K. Rogers, D. B. Fogel, J. C. Bezdek, et al., eds., Proceedings of SPIE, vol. 3390 of Applications and Science of Computational Intelligence, pp. 99--102. 1998.Google Scholar
- R. Goodacre and D. B. Kell. Evolutionary computation for the interpretation of metabolome data. In G. G. Harrigan and R. Goodacre, eds., Metabolic profiling: its role in biomarker discovery and gene function analysis, pp. 239--256. Kluwer, Boston, 2003.Google Scholar
- K. Hanada, T. Yokoyama, and T. Shimizu. Multiple sequence alignment by genetic algorithm. Genome Informatics, 11:317--318, 2000.Google Scholar
- S. Handley. Predicting whether or not a nucleic acid sequence is an E. coli promoter region using genetic programming. In Proceedings of the First International Symposium on Intelligence in Neural and Biological Systems INBS-95, pp. 122--127. IEEE Computer Society Press, Herndon, Virginia, USA, 29--31 May 1995. Google ScholarDigital Library
- A. Heddad, M. Brameier, and M. MacCallum. Evolving regular expression-based sequence classifiers for protein nuclear localisation. In G. R. Raidl, S. Cagnoni, J. Branke, et al., eds., Applications of Evolutionary Computing, EvoWorkshops2004, vol. 3005 of LNCS, pp. 31--40. Springer Verlag, Coimbra, Portugal, 5--7 Apr. 2004.Google Scholar
- S. Henikoff, J. Henikoff, and S. Pietrokovski. Blocks+: a non-redundant database of protein alignment blocks derived from multiple compilations. Bioinformatics, 15(6):471--9, Jun 1999.Google ScholarCross Ref
- L. S. Ho and J. Rajapakse. High sensitivity technique for translation initiation site detection. In Proceedings of the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, pp. 153--159. 2004.Google ScholarCross Ref
- J.-T. Horng, C.-M. Lin, B.-J. Liu, et al. Using genetic algorithms to solve multiple sequence alignments. In D. Whitley, D. Goldberg, E. Cantu-Paz, et al., eds., Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2000), pp. 883--890. Morgan Kaufmann, Las Vegas, Nevada, USA, 10--12 Jul. 2000.Google Scholar
- D. Howard and K. Benson. Evolutionary computation method for pattern recognition of cis-acting sites. Biosystems, 72(1--2):19--27, Nov. 2003. Special Issue on Computational Intelligence in Bioinformatics.Google Scholar
- Y.-J. Hu. Biopattern discovery by genetic programming. In J. K. et al, ed., Proceedings Genetic Programming 1998, pp. 152--157. Morgan Kaufmann, 1998.Google Scholar
- K. Huang, J. Louis, L. Donaldson, et al. Solution structure of the MEF2A-DNA complex: structural basis for the modulation of DNA bending and specificity by MADS-box transcription factors. EMBO J., 19:2615--2628, 2000.Google ScholarCross Ref
- M. Ishikawa, T. Toya, Y. Totoki, et al. Parallel iterative aligner with genetic algorithm. In Genome Informatics Workshop IV, pp. 13--22. 1993.Google Scholar
- M. Isokawa, M. Wayama, and T. Shimizu. Multiple sequence alignment using a genetic algorithm. Genome Informatics, 7:176--177, 1996.Google Scholar
- S. A. Kauffman. Metabolic stability and epigenesis in randomly constructed genetic nets. Journal of Theoretical Biology, 22:437--467, 1969.Google ScholarCross Ref
- J. Koza, F. Bennett, and D. Andre. Using programmatic motifs and genetic programming to classify protein sequences as to extracellular and membrane cellular location. In V. W. Porto, N. Saravanan, D. Waagen, et al., eds., Evolutionary Programming VII: Proceedings of the Seventh Annual Conference on Evolutionary Programming, vol. 1447 of LNCS. Springer-Verlag, Mission Valley Marriott, San Diego, California, USA, 25--27 Mar. 1998. Google ScholarDigital Library
- J. R. Koza and D. Andre. Automatic discovery of protein motifs using genetic programming. In X. Yao, ed., Evolutionary Computation: Theory and Applications. World Scientific, Singapore, 1999.Google ScholarCross Ref
- S. Kwong, C. W. Chaua, K. F. Manb, et al. Optimisation of HMM topology and its model parameters by genetic algorithms. Pattern Recognition, 34(2):509--522, 2001.Google ScholarCross Ref
- C.-C. Lai and S.-W. Chung. Multiple DNA sequences alignment by means of genetic algorithm. In Design and application of hybrid intelligent systems, pp. 224--232. IOS Press, 2003. Google ScholarDigital Library
- D. Latchman. Eukaryotic Transcription Factors. Academic Press, 3 edn., 1999.Google Scholar
- S. Leopold. An alignment graph based evolutionary algorithm for the multiple sequence alignment problem. Master's thesis, Vienna University of Technology, Institute of Computer Graphics and Algorithms, 2004.Google Scholar
- A. M. Lesk. Introduction to Protein Architecture. Oxford University Press, 2000.Google Scholar
- O. Littlefield, Y. Korkhin, and P. Sigler. The structural basis for the oriented assembly of a TBP/TFB/promoter complex. Proc. Natl. Acad. Sci. USA, 96:13668--13673, 1999.Google ScholarCross Ref
- M. A. Lones. Enzyme Genetic Programming: Modelling Biological Evolvability in Genetic Programming. Ph.D. thesis, Department of Electronics, University of York, 2003.Google Scholar
- N. J. Mulder, R. Apweiler, T. K. Attwood, et al. InterPro, progress and status in 2005. Nucleic Acids Res, 33 Database Issue:D201--5, Jan 2005.Google Scholar
- C. L. Nehaniv. Editorial for special issue on evolvability. BioSystems, 69:77--81, 2003.Google ScholarCross Ref
- H. D. Nguyen, I. Yoshihara, K. Yamamori, et al. A parallel hybrid genetic algorithm for multiple protein sequence alignment. In D. B. Fogel, M. A. El-Sharkawi, X. Yao, et al., eds., Proceedings of the 2002 Congress on Evolutionary Computation CEC2002, pp. 309--314. IEEE Press, 2002.Google Scholar
- H. D. Nguyen, I. Yoshihara, Y. Yamamori, et al. Improved GA-based method for multiple protein sequence alignment. In R. Sarker, R. Reynolds, H. Abbass, et al., eds., Proceedings of the 2003 Congress on Evolutionary Computation CEC2003, pp. 1826--1832. IEEE Press, Canberra, 8--12 Dec. 2003.Google ScholarCross Ref
- C. Notredame and D. G. Higgins. SAGA: sequence alignment by genetic algorithm. Nucleic Acids Res., 24(8):1515--1524, 1996.Google ScholarCross Ref
- B. Olsson. Using evolutionary algorithms in the design of protein fingerprints. In W. Banzhaf, J. Daida, A. E. Eiben, et al., eds., Proceedings of the Genetic and Evolutionary Computation Conference, vol. 2, pp. 1636--1642. Morgan Kaufmann, Orlando, Florida, USA, 13--17 Jul. 1999.Google Scholar
- A. Pedersen, P. Baldi, Y. Chauvin, et al. The biology of eukaryotic promoter prediction---a review. Comput Chem, 23(3--4):191--207, Jun 1999.Google Scholar
- D. Pribnow. Nucleotide sequence of an RNA polymerase binding site at an early T7 promoter. Proc. Natl. Acad. Sci. USA, 72:784--788, 1975.Google ScholarCross Ref
- B. J. Ross. The evolution of stochastic regular motifs for protein sequences. New Generation Computing, 20(2):187--213, Feb. 2002. Google ScholarDigital Library
- L. Sheneman and J. A. Foster. Evolving better multiple sequence alignments. In K. Deb, R. Poli, W. Banzhaf, et al., eds., Genetic and Evolutionary Computation -- GECCO-2004, Part I, vol. 3102 of Lecture Notes in Computer Science, pp. 449--460. Springer-Verlag, Seattle, WA, USA, 26--30 Jun. 2004.Google Scholar
- L. J. Sheneman and J. A. Foster. Evolving guide trees in progressive multiple sequence alignment. In A. M. Barry, ed., GECCO 2003: Proceedings of the Bird of a Feather Workshops, Genetic and Evolutionary Computation Conference, pp. 89--91. AAAI, Chigaco, 11 Jul. 2003.Google Scholar
- C. Shyu, L. Sheneman, and J. A. Foster. Multiple sequence alignment with evolutionary computation. Genetic Programming and Evolvable Machines, 5(2):121--144, 2004. Prepublication Date: 03/23/2004. Google ScholarDigital Library
- M. Slimane, G. Venturini, J. P. A. de Beauville, et al. Optimizing hidden Markov models with a genetic algorithm. In J.-M. Alliot, E. Lutton, E. M. A. Ronald, et al., eds., Artificial Evolution, vol. 1063 of Lecture Notes in Computer Science, pp. 384--396. Springer, 1995. Google ScholarDigital Library
- G. Stormo. DNA binding sites: representation and discovery. Bioinformatics, 16(1):16--23, January 2000.Google ScholarCross Ref
- J. Thompson, D. Higgins, and T. Gibson. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 22(22):4673--80, Nov 1994.Google ScholarCross Ref
- R. Thomsen. Evolving the topology of hidden markov models using evolutionary algorithms. In J. J. M. Guervós, P. Adamidis, H.-G. Beyer, et al., eds., Parallel Problem Solving from Nature - PPSN VII, vol. 2439 of Lecture Notes in Computer Science, pp. 861--870. Springer, 2002. Google ScholarDigital Library
- R. Thomsen, G. B. Fogel, and T. Krink. A clustal alignment improver using evolutionary algorithms. In D. B. Fogel, X. Yao, G. Greenwood, et al., eds., Proceedings of the Fourth Congress on Evolutionary Computation (CEC-2002), vol. 1, pp. 121--126. 2002.Google Scholar
- E. E. Vallejo and F. Ramos. Evolving Turing machines for biosequences recognition and analysis. In J. F. Miller, M. Tomassini, P. L. Lanzi, et al., eds., Genetic Programming, Proceedings of EuroGP'2001, vol. 2038 of LNCS, pp. 192--203. Springer-Verlag, Lake Como, Italy, 18--20 Apr. 2001. Google ScholarDigital Library
- E. E. Vallejo and F. Ramos. Evolutionary two-dimensional DNA sequence alignment. In E. Cantú-Paz, J. A. Foster, K. Deb, et al., eds., Genetic and Evolutionary Computation -- GECCO-2003, vol. 2723 of LNCS, pp. 429--430. Springer-Verlag, 12-16 Jul. 2003. Google ScholarDigital Library
- A. Vanet, L. Marsan, and M.-F. Sagot. Promoter sequences and algorithmical methods for identifying them. Res. Microbiol., 150:779--799, 1999.Google ScholarCross Ref
- M. Wayama, K. Takahashi, and T. Shimizu. An approach to amino acid sequence alignment using a genetic algorithm. Genome Informatics, 6:122--123, 1995.Google Scholar
- T. Werner. Models for prediction and recognition of eukaryotic promoters. Mamm Genome, 10(2):168--75, Feb 1999.Google ScholarCross Ref
- T. Werner. The state of the art of mammalian promoter recognition. Brief Bioinform, 4(1):22--30, Mar 2003.Google ScholarCross Ref
- K.-J. Won, A. Prügel-Bennett, and A. Krogh. Training HMM structure with genetic algorithm for biological sequence analysis. Bioinformatics, 20(18):3613--9, Dec 2004. Google ScholarDigital Library
- T. Yada, M. Ishikawa, H. Tanaka, et al. Extraction of hidden Markov model representations of signal patterns in DNA sequences. In Pacific Symposium on Biocomputing, pp. 686--96. 1996.Google Scholar
- C. Zhang and A. K. Wong. A genetic algorithm for multiple molecular sequence alignment. Comput Appl Biosci., 13(6):565--581, 1997.Google Scholar
Index Terms
- The evolutionary computation approach to motif discovery in biological sequences
Recommendations
Efficient automatic exact motif discovery algorithms for biological sequences
ObjectiveThis paper presents an algorithm for the solution of the motif discovery problem (MDP). Methods and materialsMotif discovery problem can be considered in two cases: motifs with insertions/deletions, and motifs without insertions/deletions. The ...
Regulatory Motif Discovery Using a Population Clustering Evolutionary Algorithm
This paper describes a novel evolutionary algorithm for regulatory motif discovery in DNA promoter sequences. The algorithm uses data clustering to logically distribute the evolving population across the search space. Mating then takes place within ...
A Monte Carlo EM Algorithm for De Novo Motif Discovery in Biomolecular Sequences
Motif discovery methods play pivotal roles in deciphering the genetic regulatory codes (i.e., motifs) in genomes as well as in locating conserved domains in protein sequences. The Expectation Maximization (EM) algorithm is one of the most popular ...
Comments