Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Alberts, B., Bray, D., Lewis, J., Raff, M., Roberts, K. and Watson, J.D., (1989) Molecular Biology of the Cell. Garland Publishing, New York and London.
Altschul, S.F., Gish, W., Miller, W., Myers E.W. and Lipman, D.J., (1990) Basic local alignment search tool. J. Mol. Bio. 215: 403–410.
Altschul, S.F. and Gish, G., (1996) Local alignment statistics. Methods Enzymol. (Series title: Computer methods for macromolecular sequence analysis) 266: 460–480.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J., (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res. 25: 3389–3402.
Andreeva, A., Howorth, D., Brenner, S.E., Hubbard, T.J.P., Chothia, C., and Murzin, A.G. (2004). SCOP database in 2004: refinements integrate structure and sequence family data. Nucl. Acid Res. 32: D226–D229.
Attwood, T.K. (2002) The PRINTS database: a resource for identification of protein families.
Ball, C.A., Sherlock, G., Parkinson, H., Rocca-Sera, P., Brooksbank, C., Causton, H.C., Cavalieri, D., Gaasterland, T., Hingamp, P., Holstege, F., Ringwald, M., Spellman, P., Stoeckert, C.J. Jr, Stewart, J.E., Taylor, R., Brazma, A. and Quackenbush, J. (2002) An open letter to the scientific journals. Published in Science 298(5593): 539 and Bioinformatics 18(11):1409.
Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J. and Wheeler, D.L. (2003) GenBank. Nucl. Acids. Res. 31: 23–27.
Bowie, J.U., Luthy, R. and Eisenberg, D. (1991) A method to identify protein sequences that fold into a known three-dimensional structure. Science 253: 164–170.
Bowie, J.U., Zhang, K., Wilmanns, M. and Eisenberg D (1996) Three-dimensional profiles for measuring compatibility of amino acid sequence with threedimensional structure. Methods Enzymol. (Series title: Computer methods for macromolecular sequence analysis) 266: 598–616.
Branden, C. and Tooze, J. (1999) Introduction to Protein Structure. 2nd Ed., Garland Science Publishing, New York
Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C, Aach J, Ansorge, W., Ball, C.A., Causton, H.C., Gaasterland, T., Glenisson, P., Holstege, F.C.P., Kim, I.F., Markowitz, V., Matese, J.C., Parkinson H., Robinson, A., Sarkans, U., Schulze-Kremer, S., Stewart, J., Taylor, R., Vilo, J. and Vingron, M. (2001) Minimum information about a microarray experiment (MIAME)—toward standards for microarray data. Nature Genetics 29: 365–371
Bryant, S.H. and Lawrence, C.E. (1993) An empirical energy function for threading protein sequence through the fold motif. Proteins Struct. Funct. Genet. 16: 92–112.
Burset, M. and Guigo, R. (1996) Evaluation of Gene Structure Prediction Programs. Genomics 34: 353–367
Chou, P.Y. and Fasman, G.D. (1978) Prediction of the secondary structure of proteins from their amino acid sequence. Adv. Enzymol. Relat. Areas Mol. Biol. 47: 45–147.
Dayhoff, M.O., Schwartz, R.M. and Orcutt BC (1978) A model of evolutionary change in proteins. Atlas of Protein Science and Structure, vol. 5,supplement 3, National Biomedical Research Foundation, Washington, DC, pp. 345–351
Dovichi, N.J. and Zhang, J.Z. (2001) DNA sequencing by capillary array electrophoresis. Methods Mol. Bio. 167: 225–239.
Eddy, S.R. (1998) Profile Hidden Markov models. Bioinformatics 14: 755–763.
Felsenstein, J. (1993) PHYLIP 3.5 (phylogeny inference package). Department of Genetics, University of Washington, Seattle.
Felsenstein, J. (1996) Inferring phylogeny from protein sequences by parsimony, distance and likelihood methods. Methods Enzymol. (Series title: Computer methods for macromolecular sequence analysis) 266: 418–427.
Feng, D.F. and Doolittle, R.F. (1996) Progressive alignment of amino acid sequences and construction of phylogenetic trees from them. Methods Enzymol. (Series title: Computer methods for macromolecular sequence analysis) 266: 368–382.
Fickett, J.W. (1982) Recognition of protein coding regions in DNA sequences. Nucl. Acids Res. 10: 5303–5318.
Fickett, J.W. (1996) Finding genes by computer: the state of the art. Trends Genet. 12: 316–320.
Fickett, J.W. and Tung, C.S. (1992). Assessment of protein coding measures. Nucl. Acids Res. 20: 6641–6450.
Frishman, D. and Argos, P. (1996) Incorporation of long-distance interactions into a secondary structure prediction algorithm. Protein Engineering 9: 133–142.
Frishman D and Argos, P. (1997) 75% accuracy in protein secondary structure prediction. Proteins 27: 329–335.
Galperin, M.Y. (2004) The Molecular Biology Database Collection: 2004 update. Nucl. Acids Res. 32: D3–D22.
Garnier, J., Osguthorpe, D.J. and Robson, B. (1978) Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol. 120: 97–120.
Garnier, J., Gilbrat, J.F. and Robson, B. (1996) GOR method for predicting protein secondary structure from amino acid sequence. Methods Enzymol. (Series title: Computer methods for macromolecular sequence analysis) 266: 540–553.
Geer, R.C. and Sayers, E.W. (2003) Entrez: Making use of its power. Briefings in Bioinformatics 4: 1779–184
Gibbs, A.J., McIntyre, G.A. (1970) The diagram, a method for comparing sequences. Its use with amino acid and nucleotide sequences. Eur. J. Biochem. 16: 1–11.
Graur, D., Li, W.H. (2000) Fundamentals of molecular evolution. (2nd ed.) Sinauer Associates, Sunderland, Massachusetts.
Grosse, I., Buldyrev, S.V., Stanley, H.E., Holste, D. and Herzel, H. (2000) Average mutual information of coding and noncoding DNA. Pacific Symposium on Biocomputing 5: 611–620.
Guigo, R. (1999) DNA Composition, Codon Usage and Exon Prediction. In: Genetic Databases, (ed. M.J. Bishop), chap. 4, pp. 53–80, Academic Press.
Henikoff, S., Henikoff, J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89: 10915–10919.
Henikoff, S. and Henikoff, J.G. (1994) Protein family classification based on searching a database of blocks. Genomics 19: 97–107.
Henikoff, J.G., Greene, E.A., Pietrokovski, S. and Henikoff S (2000) Increased coverage of protein families with the blocks database servers. Nucl. Acids Res. 28: 228–230.
Herzel, H. and Grosse, I. (1995) Measuring correlations in symbol sequences. Physica A 216: 518–542.
Hawkins, J.D. (1988) A survey on intron and exon lengths. Nucl. Acids Res. 16: 9893–9908.
Helt, G.A., Lewis, S., Loraine, A.E. and Rubin, G.M. (1998) BioViews: Java-based tools for genomic data visualization. Genome Res. 8: 291–305.
Hoersch, S., Leroy, C., Brown, N.P., Andrade, M.A., and Sander, C. (2000) The GeneQuiz Web server: protein functional analysis through the Web. Trends in Biochem. Sci. 25: 33–35.
Holm, L. and Sander, C. (1993) Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233: 23–138
Holm, L. and Sander, C. (1996a) Mapping the protein universe. Science 273: 595–602.
Holm, L. and Sander, C. (1996b) The FSSP database: fold classification based on structure-structure alignment of proteins. Nucl. Acids Res. 24: 206–209
Hughey, R. and Krogh, A. (1996) Hidden Markov models for sequence analysis: Extension and the analysis of the basic method. Comput. Appl. Biosci. 12: 95–107.
Huang, J.Y. and Brutlag, D.L. (2001). The eMOTIF database. Nucl. Acids Res. 29: 202–204.
Johnson, M.S., May, A.C. and Ridionov, M.A., Overington JP (1996) Discrimination of common protein folds: Application of protein structure to sequence/structure comparisons. Methods Enzymol. (Series title: Computer methods for macromolecular sequence analysis) 266: 575–598.
Jones, D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292: 195–202.
Jones, D.T., Taylor, W.R. and Thornton, J.M. (1992a) A new approach to protein fold recognition. Nature 358: 86–89.
Jones, D.T., Taylor, W.R. and Thornton, J.M. (1992b) The rapid generation of mutation data matrices from protein sequences. Comp. Appl. Biosci. 8: 275–282.
Kabsch, W. and Sander, C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22: 2577–2637.
Kim, J., Pramanik. S. and Chung, M.J. (1994). Multiple sequence alignment by simulated annealing. Comput. Appl. Biosci. 10: 419–426.
Konopka, A.K. (1994) Structure and Methods: VI. Human Genome Initiative and DNA Recombination, chapter Towards Mapping Functional Domains in Indiscriminantly Sequenced Nucleic Acids: A Computational Approach. Adenine Press, Guilderland, New York.
Kulikova, T., Aldebert, P., Althorpe, N., Baker, W., Bates, K. and Browne, P., van den Broek A, Cochrane G, Duggan K, Eberhardt R, Faruque N, Garcia-Pastor M, Harte N, Kanz C, Leinonen R, Lin Q, Lombard V, Lopez R, Mancuso R, McHale M, Nardone F, Silventoinen V, Stoehr P, Stoesser G, Tuli MA, Tzouvara K, Vaughan R, Wu D and Zhu W, Apweiler R (2004) The EMBL Nucleotide Sequence Database. Nucl. Acids Res. 32: D27–D30.
Lathrop, R.H., Rogers R.G. Jr., Bienkowska J., Bryant B.K.M, Buturovic L.J., Gaitatzes C., Nambudripad R., White J.V., and Smith T.F. (1998). Analysis and algorithms for protein sequence-structure alignment. Computational methods in molecular biology. S. Salzberg, D. Searls, and S. Kasif Eds. Elsevier Press. Amsterdam, Chapter 12, pp. 227–283.
Lathrop, R.H., Rogers, R.G. Jr., Bienkowska, J., Bryant, B.K.M., Buturovic, L.J., Gaitatzes, C., Nambudripad, R., White, J.V., Smith, T.F. (1988) Analysis and algorithms for protein sequence-structure alignment. New Compr. Biochem. (Series title: Computational methods in molecular biology) 32: 337–355.
Lemer, C.M., Rooman, M.J. and Wodak, S.J. (1995) Protein structure prediction by threading methods: evaluation of current techniques. Proteins 23(3): 337–55.
Li, W. (1997) The study of correlation structures of DNA sequences: a critical review. Computer and Chemistry 21: 257–271.
Li, W.H. (1997) Molecular evolution. Sinauer Associates, Sunderland, Massachusetts.
Liew, A.W.C., Wu, Y., Yan, H. and Yang, M. (2004) A Study on the Effective Statistical Coding Features for Coding/Non-coding DNA Sequence Classification for Yeast, C. elegans and Human. Submitted.
Lippmann, R.P. (1987) An introduction to computing with neural nets. IEEE ASSP Magazine. 4(2): 4–22.
Lo Conte, L., Brenner, S.E., Hubbard, T.J.P., Chothia, C. and Murzin, A. (2002) SCOP database in 2002: refinements accommodate structural genomics. Nucl. Acid Res. 30: 264–267.
Lukashin, A.V., Borodovsky, M. (1998) GeneMark.hmm: new solutions for gene finding. Nucl. Acids Res. 26: 1107–1115.
Madej, T., Gibrat, J.F. and Bryant, S.H. (1995) Threading a database of protein cores. Proteins, 23: 356–369.
Markel, S. and Leon, D. (2003) Sequence Analysis in a nutshell: a guide to common tools and databases. O’Reilly and Associates, Inc., USA
Martz, E. (2003) 3D molecular visualization with Protein Explorer. In: Introduction to Bioinformatics: A Theoretical and Practical Approach, (S.A. Krawetz, D.D. Womble eds.), Humana Press, Totowa, New Jersey
Maizel, J.V. Jr. and Lenk, R.P. (1981) Enhanced Graphic Matrix Analysis of Nucleic Acid and Protein Sequences. Proc. Natl. Acad. Sci. USA. 78; 7665–7669
Mathe, C., Sagot, M.F., Schiex, T. and Rouze, P. (2002) Current methods of gene prediction, their strengths and weakness — survey and summary. Nucl. Acids Res. 30: 4103–4117
McGuffin, L.J., Bryson, K. and Jones D.T. (2000) The PSIPRED protein structure prediction server. Bioinformatics 16: 404–405.
Mirny, L.A. and Shakhnovich, E.I. (1998) Protein structure prediction by threading-Why it works and why it does not. J. Mol. Biol. 283(2): 507–526.
Mirny, L.A., Finkelstein, A.V. and Shakhnovich, E.I. (2000) Statistical significance of protein structure prediction by threading. Proc. Natl. Acad. Sci. USA. 97(18): 9978–9983.
Miyazaki, S., Sugawara, H., Gojobori, T. and Tateno, Y. (2003) DNA Data Bank of Japan (DDBJ) in XML. Nucl. Acids. Res. 31: 13–16.
Miyazaki, S., Sugawara, H., Ikeo, K., Gojobori, T. and Tateno, Y. (2004). DDBJ in the stream of various biological data. Nucl. Acids. Res. 32: D31–D34.
Mizuguchi, K., Blundell, T.L. (2000) Analysis of conservation and substitutions of secondary structure elements within protein superfamilies. Bioinformatics 16: 1111–1119.
Mount, D.W. (2001) Bioinformatics — Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, New York.
Murzin, A.G., Brenner, S.E., Hubbard, T. and Chothia, C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247: 536–540.
Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Bio. 48: 443–453.
Notredame, C. and Higgins, D.G. (1996) SAGA: Sequence alignment by genetic algorithm. Nucl. Acids Res. 24: 1515–1524.
Orengo, C.A., Michie, A.D., Jones S., Jones D.T., Swindells M.B., and Thornton J.M. (1997). CATH-A Hierarchic Classification of Protein Domain Structures. Structure 5(8): 1093–1108.
Panchenko, A.R., Marchler-Bauer, A., Bryant, S.H. (2000) Combination of threading potentials and sequence profiles improves fold recognition. J. Mol. Biol. 296(5): 1319–1331.
Pearl, F.M.G., Lee, D., Bray, J.E,, Sillitoe, I., Todd A.E. and Harrison A.P., Thornton J.M., and Orengo C.A. (2000). Assigning genomic sequences to CATH. Nucl. Acids Res. 28(1): 277–282. Andrade MA, Brown NP, Leroy C, Hoersch S, de Daruvar A, Reich C, Franchini A, Tamames J, Valencia A, Ouzounis C, Sander C (1999) Automated genome sequence analysis and annotation. Bioinformatics 15: 391–412.
Pearson, W.R. (1990) Rapid and sensitive comparison with FASTP and FASTA. Methods Enzymol. (Series tile: Molecular evolution: computer analysis of protein and nucleic acid sequences) 183: 63–98.
Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. U.S.A. 85: 2444–8.
Rost, B. (1996) PHD: predicting one-dimensional protein structure by profile based neural networks. Methods Enzymol. (Series title: Computer methods for macromolecular sequence analysis) 266: 525–539.
Rost, B. and Sander, C. (1993) Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. 232: 584–599.
Rost, B. and Sander, C. (1994) Combining evolutionary information and neural networks to predict protein secondary structure. Proteins 19: 55–77.
Rumelhart, D.E., Hinton, G.E. and Williams, R.J. (1986) Learning internal representations by error propagation. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations. D.E. Rumelhart and J.L. McClelland Eds. MIT Press, pp 318–362.
Salamov, A.A. and Solovyev, V.V. (1995) Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiply sequence alignments. J. Mol. Biol. 247: 11–15.
Salamov, A.A. and Solovyev, V.V. (1997) Protein secondary structure prediction using local alignments. J. Mol. Biol. 268: 31–36.
Salzberg, S.L., Delcher, A.L., Kasif, S. and White, O. (1998a) Microbial gene identification using interpolated Markov models. Nucl. Acids Res. 26: 544–548.
Salzberg, S.L., Delcher, A.L., Fasman, K.H. and Henderson, J. (1998b) A decision tree system for finding genes in DNA. J. of Comp. Biol. 5: 667–680.
Sanger, F., Nicklen, S. and Coulson, A.R. (1977) DNA sequencing with chain terminating inhibitors. Proc. Natl. Acad. Sci. USA. 74: 5463–5467.
Schwartz, R.M. and Dayhoff, M.O. (1978) Matrices for detecting distant relationships. Atlas of Protein Science and Structure, vol. 5,supplement 3, National Biomedical Research Foundation, Washington, DC, pp 353–358.
Serov, V.N. and Spirov, A.V., Samsonova MG (1998) Graphical interface to the genetic network database GeNet. Bioinformatics 14: 546–547.
Shapiro, L. and Harris, T. (2000) Finding function through structural genomics. Current Opinion in Biotechnology 11: 31–35.
Siddiqui, A.S., Dengler, U. and Barton, G.J. (2001) 3Dee: A database of protein structural domains. Bioinformatics 17: 200–201.
Sigrist, C.J., Cerutti, L., Hulo, N., Gattiker, A., Falquet, L., Pagni, M., Bairoch, A., Bucher, P.. (2002) PROSITE: a documented database using patterns and profiles as motif descriptors.
Smith, T.F. and Waterman, M.S. (1981a) Identification of common molecular subsequences. J. Mol. Bio. 147: 195–197.
Smith, T.F. and Waterman, M.S. (1981b). Comparison of biosequences. Adv. Appl. Math. 2: 482–489.
Sonnhammer, E.L.L., Eddy, S.R., Birney, E., Bateman, A. and Durbin, R. (1998) Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucl. Acids Res. 26: 320–322.
Staden, R. (1990) Finding protein coding regions in genomic sequences. Methods Enzymol. (Series title: Molecular evolution: computer analysis of protein and nucleic acid sequences) 183: 163–80.
Staden R, McLachlan AD (1982) Codon preference and its use in identifying protein
States, D.J., Boguski, M.S. (1991) Similarity and homology. In: Sequence Analysis Primer, (ed. M. Gribskov and J. Devereux), pp. 92–124, Stockton Press, New York.
Swofford, D.L., Olsen, G.J., Waddell, P.J., Hillis, D.M. (1996) Phylogenetic inference. In Molecular Systematics 2nd ed., (ed. D.M. Hillis et al.), chap. 5, pp 407–514, Sinauer Associates, Sunderland, Massachusetts.
Tamura, K. and Nei, M. (1993) Estimation of the number of nucleotide substitutions in the control region of mitochandrail DNA in humans and chimpanzees. Mol. Bio. Evol. 10: 512–526.
Tateno, Y., Imanishi, T., Miyazaki, S., Fukami-Kobayashi, K. and Saitou, N., Sugawara H, Gojobori T (2002) DNA Data Bank of Japan (DDBJ) for genome scale research in life science. Nucl. Acids. Res. 30: 27–30.
Thanaraj, T.A. (2000) Positional characterisation of false positives from computational prediction of human splice sites. Nucl. Acids Res. 28: 744–754.
Thiele, R., Zimmer, R. and Lengauer, T. (1999) Protein threading by recursive dynamic programming. J. Mol. Biol. 290(3): 757–779.
Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl. Acids Res. 22: 4673–4680.
Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F. and Higgins, D.G. (1997) The CLUSTAL X windows interface: Flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucl. Acids Res. 25: 4876–4882.
Tiwari, S., Ramachandran, S., Bhattacharya, A., Bhattacharya, S., Ramaswamy, R. (1997) Prediction of probable genes by fourier analysis of genomic sequences. Computer Applications in the Biosciences 13: 263–270.
Uberbacher, E.C., Xu, Y. and Mural, R.J. (1996) Discovering and understanding genes in human DNA sequence using GRAIL. Methods Enzymol. (Series title: Computer methods for macromolecular sequence analysis) 266: 259–281.
Vilo, J., Kapushesky, M., Kemmeren, P., Sarkans, U. and Brazma, A. (2003) Expression Profiler. In: The analysis of gene expression data: methods and software (Parmigiani G; Garrett E; Irizarry R; Zeger S L, eds.), Springer, NY.
Wang Y., Zhang C.T., and Dong P. (2002). Recognizing shorter coding regions of human genes based on the statistics of stop codons. Biopolymers 63(3): 207–216.
Williams, G. (1999) Nucleic acid and protein sequence databases. In: Genetic Databases, (ed. M.J. Bishop), chap.2, pp. 11–37, Academic Press.
Wu, S., Liew, A.W.C. and Yan, H. (2003) Cluster Analysis of Gene Expression Data Based on Self-Splitting and Merging Competitive Learning. To appear in IEEE Transactions on Information Technology in Biomedicine.
Wu, Y., Liew, A.W.C., Yan, H. and Yang, M. (2003a) DB-Curve: A Novel 2D Method of DNA Sequence Visualization and Representation. Chem. Phys. Lett. 367: 170–176.
Wu, Y., Liew, A.W.C., Yan, H. and Yang, M. (2003b) Classification of short human exons and introns based on statistical features. Phys. Rev. E. 67(6): Art. No. 061916.
Zhang, M.Q. (1997) Identification of protein coding regions in the human genome by quadratic discriminant analysis. Proc. Natl. Acad. Sci. USA. 94: 565–568.
Zhang, C.T., Wang, J. (2000) Recognition of protein coding genes in the Yeast genome at better than 95% accuracy based on the Z curve. Nucl. Acids Res. 28: 2804–2814.
Zhang, R. and Zhang, C.T. (1994). Z Curves, an Intuitive Tool for Visualizing and Analyzing DNA sequences. Journal Biomolecular Structure Dynamics 11: 767–782.
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2005 Springer-Verlag Berlin Hiedelberg
About this chapter
Cite this chapter
Liew, A.W.C., Yan, H., Yang, M. (2005). Data Mining for Bioinformatics. In: Chen, YP.P. (eds) Bioinformatics Technologies. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-26888-X_4
Download citation
DOI: https://doi.org/10.1007/3-540-26888-X_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20873-0
Online ISBN: 978-3-540-26888-8
eBook Packages: Computer ScienceComputer Science (R0)