Information capacity of nucleotide sequences and its applications

Sadovsky, M. G.

doi:10.1007/s11538-005-9017-0

Original Paper
Published: 07 April 2006

Volume 68, pages 785–806, (2006)

Bulletin of Mathematical Biology

Abstract

The information capacity of a nucleotide sequence is defined through the specific entropy of the frequency dictionary of the sequence, determined with respect to a reference dictionary that contains the most probable continuations of shorter strings. This measure distinguishes a sequence both from a random one and from an ordered entity. The comparison of sequences based on their information capacity is studied. Order within genetic entities is found at length scales ranging from 3 to 8. Some other applications of the developed methodology to genetics, bioinformatics, and molecular biology are discussed.
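
The abstract does not state the formula, so the following Python sketch is only an illustration of the idea under stated assumptions. It assumes that the frequency dictionary is the set of frequencies of all overlapping words of length q read from the sequence closed into a ring, and that the "most probable continuation" dictionary is the maximum-entropy reconstruction from shorter words, with expected frequency f(w1..w(q-1)) * f(w2..wq) / f(w2..w(q-1)). The function names frequency_dictionary and information_capacity are hypothetical and do not come from the article.

```python
from collections import Counter
from math import log


def frequency_dictionary(seq, q):
    """Frequencies of all overlapping words of length q.

    Assumption: the sequence is closed into a ring, so every position
    starts exactly one word and shorter dictionaries are exact marginals.
    """
    n = len(seq)
    ring = seq + seq[:q - 1]
    counts = Counter(ring[i:i + q] for i in range(n))
    return {word: count / n for word, count in counts.items()}


def information_capacity(seq, q):
    """Relative entropy (in nats) of the real dictionary of length-q words
    with respect to the dictionary reconstructed from words of length q-1.

    Assumption: the reconstruction is the maximum-entropy extension
    f(left) * f(right) / f(core), with f(core) = 1 when q == 2.
    """
    if q < 2:
        raise ValueError("q must be at least 2")
    real = frequency_dictionary(seq, q)
    shorter = frequency_dictionary(seq, q - 1)
    core = frequency_dictionary(seq, q - 2) if q > 2 else None
    total = 0.0
    for word, f in real.items():
        denom = core[word[1:-1]] if core is not None else 1.0
        expected = shorter[word[:-1]] * shorter[word[1:]] / denom
        total += f * log(f / expected)
    return total


if __name__ == "__main__":
    periodic = "ACGT" * 50
    for q in (2, 3, 4):
        print(q, information_capacity(periodic, q))
```

Under these assumptions, a value near zero at length q means that the words of length q are fully predicted by the shorter words, while a larger value signals structure that first appears at that length. Scanning q over a range such as 2 to 8 is one way to look for the length scale at which a sequence carries order.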

Author information

Authors and Affiliations

Institute of Biophysics of Siberian Division of Russian Academy of Sciences, Akademgorodok, Krasnoyarsk, 660036, Russia
M. G. Sadovsky

About this article

Cite this article

Sadovsky, M.G. Information capacity of nucleotide sequences and its applications. Bull. Math. Biol. 68, 785–806 (2006). https://doi.org/10.1007/s11538-005-9017-0

Received: 21 April 2004
Accepted: 10 March 2005
Published: 07 April 2006
Issue Date: May 2006
DOI: https://doi.org/10.1007/s11538-005-9017-0
