Abstract
The information capacity of a nucleotide sequence is defined through the specific entropy of its frequency dictionary, determined with respect to another dictionary that contains the most probable continuations of shorter strings. This measure distinguishes a sequence both from a random one and from a fully ordered one. The comparison of sequences by their information capacity is studied. Order within genetic entities is found at length scales ranging from 3 to 8. Further applications of the methodology to genetics, bioinformatics, and molecular biology are discussed.
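The definition above can be illustrated with a minimal Python sketch. The abstract does not state the formulas, so the following are assumptions made for illustration only: the frequency dictionary of thickness q is taken over a circularly closed sequence; the "most probable continuation" of strings of length q-1 is taken to be the maximum-entropy extension f(w[:-1]) * f(w[1:]) / f(w[1:-1]); and the specific entropy is the relative entropy of the real dictionary with respect to that extension. The names `frequency_dictionary` and `information_capacity` are hypothetical.

```python
from collections import Counter
from math import log


def frequency_dictionary(seq, q):
    """Frequencies of all overlapping strings of length q.

    Assumption: the sequence is closed into a ring, so each of the
    len(seq) positions starts exactly one string of length q.
    """
    n = len(seq)
    ring = seq + seq[:q - 1]
    counts = Counter(ring[i:i + q] for i in range(n))
    return {word: count / n for word, count in counts.items()}


def information_capacity(seq, q):
    """Relative entropy of the real q-dictionary with respect to the
    dictionary extended from strings of length q - 1.

    Assumption: the extension assigns to a word w the frequency
    f(w[:-1]) * f(w[1:]) / f(w[1:-1]).
    """
    if q < 2:
        raise ValueError("q must be at least 2")
    f_q = frequency_dictionary(seq, q)
    f_prev = frequency_dictionary(seq, q - 1)
    # For q == 2 the shared core of a word is the empty string.
    f_core = frequency_dictionary(seq, q - 2) if q > 2 else {"": 1.0}
    total = 0.0
    for word, freq in f_q.items():
        expected = f_prev[word[:-1]] * f_prev[word[1:]] / f_core[word[1:-1]]
        total += freq * log(freq / expected)
    return total
```

Under these assumptions the value is zero whenever the strings of length q are fully predicted by the shorter ones. For the periodic sequence `"ACGT" * 10`, for example, the value at q = 3 vanishes, because each pair of nucleotides has only one possible continuation.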
Cite this article
Sadovsky, M.G. Information capacity of nucleotide sequences and its applications. Bull. Math. Biol. 68, 785–806 (2006). https://doi.org/10.1007/s11538-005-9017-0