Introduction

Biological and digital communication systems have similarities with respect to the corresponding procedures used to convey the biological and digital information from one point to another, as well as in the data storage of digital media in a redundant array of independent disks (RAID)1 and the storage of genetic information in chromosomes. These similarities enable the use of algorithms in the modeling and analyses of biological systems and data. For instance, in eukaryotic cells, the information contained in the DNA is transmitted through RNA to produce the proteins needed at a precise moment and in specific compartments in the cell. Many enzymes and complex molecules coordinate their transport and are often assisted by protein intermediates in the cytosol and organellar membranes, thus identifying the correct location of a protein. In the same way, the transmission of flawless data through noisy channels in digital communication systems can be reliably achieved if, in addition to using an error-correcting code (ECC), extensive signal processing techniques are also employed2.

For quite some time there have been attempts to confirm the existence of an error-control mechanism in biological sequences similar to the ECC employed in digital sequences3 and although relevant, such studies have yet to provide a definitive answer. Recently our group developed an algorithm, known as DNA Sequence Generator Algorithm, which verifies whether a given DNA sequence can be identified as a codeword of an ECC. This goal was achieved when many distinct DNA sequences were identified as code words of G-linear codes (consisting of specific mappings and the underlying BCH codes)4,5,6,7 an important subclass of cyclic codes.

BCH codes were first proposed by Hocquenghem8 and independently rediscovered by Bose and Chaudhuri9; therefore, the acronym is made up of the initials of Bose, Chaudhuri and Hocquenghem. When an underlying BCH code over Galois ring extension and/or Galois field extension identifies a given DNA sequence, two things may occur: 1) the given DNA sequence is a codeword of a G-linear code; or 2) it is a sequence belonging to the set of neighboring sequences differing by at least one nucleotide from the corresponding codeword of a G-linear code. This set of neighboring sequences is referred to as the “cloud” of a codeword.

When the DNA sequence generation algorithm identifies a DNA sequence belonging to the cloud of a codeword, it differs in a single nucleotide from the original sequence. Similar to biological DNA, this generated codeword may represent a silent mutation causing no effect on the translated amino acid or it may cause a residue change affecting for instance the protein structure and activity and consequently impairing its interactions with other proteins. Furthermore, the single nucleotide alteration can be restored, or equivalently, the codeword can be reverse engineered, returning it to its original sequence by applying one of the following algorithms: the Berlekamp-Massey decoding algorithm for codes over Galois field extensions10,11 or the Modified Berlekamp-Massey decoding algorithm for codes over Galois ring extensions12,13, together with the corresponding labeling associated with each analyzed sequence.

Recently, Ivanova and colleagues14 used a metagenomics approach to survey the prevalence of stop codon reassignment in naturally occurring microbial populations and proposed that the canonical genetic code may contain some deviations. Similarly, studies of the evolution of the genetic code have developed a hypothesis that differs from a frozen universal code15,16,17,18,19 and even the universality of the code20,21. It has been observed that each deviant genetic code contains codons that are associated with different amino acids and also with the canonical genetic code. Consequently, one may infer that such a process may have evolved from a standard code16. Such deviant genetic codes can be found in nuclear and mitochondrial genomes, in which mechanisms of codon reassignment have led to the differential reading of certain codons22,23,24. The evolution of the genetic code plays an important role in understanding the differences between the response of the DNA sequence identification process and the given DNA sequence because these differences can be related to either the canonical genetic code or to the several deviant genetic code4,5,6,7,22. In another example, Inomata and colleagues25 using multiple sequence alignment and test of neutrality, have demonstrated that a single replacement of guanine with adenine (position 926 of the gene) in Drosophila melanogaster, resulting on threonine at the 218 - amino acid position, was the ancestral form of the Gr5a gene in D. melanogaster25 and this single amino acid polymorphism (ALA218THR) represents a key impact on the trehalose sensitiveness.

The proposal of mathematical models describing such biological systems provides the needed tools for the development of systematic approaches for studies of mutations and polymorphisms and has applications in genetic engineering.

Thus, if an ECC can identify differences in a DNA sequence with a one-nucleotide resolution, the questions that should be addressed are as follows: if there is an error-correcting code underlying the DNA sequences, what are the biological implications regarding the single nucleotide (SNP) difference? And, is there a biological reasoning for such a difference? In the present study, we used the ECC approach proposed in references4,5,6,7 to evaluate whether the nucleotide difference between the original DNA sequence and the sequence identified as the codeword of the ECC is biologically significant in terms of evolution of this identified polymorphism.

Results and Discussion

In this study the DNA sequence generation algorithm was applied as a computational tool to provide strong evidence of the evolution of the genetic code, in special on nucleotide and amino acid site specific polymorphism, by showing the existence of a mathematical structure underlying the actual DNA sequences and by investigating the real biological meaning of the difference in the specific position pointed out by the code-generated sequences.

The code-generated sequences that had a single nucleotide alteration, causing a residue change in the translated protein, were used in a Blastx analysis to verify if the alteration suggested by the ECC could be found in other sequences.

Analyses were run for the code generated sequences of the Saccharomyces YMR193 gene (GI 45269853), the Triticum aestivum wPR4 gene (GI 78096542), the Nicotiana tabacum antifungal CBP 20 gene (GI 632733), the Citrus sinensis chlorophyllase gene (GI 7328566), the Arabidopsis thaliana hevein-like protein PR4 gene (GI 186509758), the Saccharomyces cerevisiae OXA gene (GI 832917) and the Homo sapiens F1F0 ATP-synthase gene (GI 12587). These sequences are shown in Table S01.

As a result of this search approach, a number of different genes were found to contain the same nucleotide at the same altered position suggested by the error-correcting code, see Tables 1 and S02.

Table 1 Polynomial-based DNA sequences generated by BCH codes over Galois ring and field extensions.

In some of the results, the suggested polymorphism could be found in DNA sequences of taxa that were closely related to the query sequence. For example, for the code-generated sequence of the YMR 193 gene from Saccharomyces cerevisiae (Tables S02a and S02b), a mitochondrial protein involved in the large ribosomal subunit, the same residue was also found in other Ascomycota sequences. The results for the code-generated antifungal CPB 20 gene from Nicotiana tabacum were similar to those from other eudicots. The results for the chlorophyllase gene in Citrus sinensis were also found in sequences in Populus spp. And the code-generated sequence for the OXA gene, which is involved in cytochrome oxidase biogenesis in Saccharomyces cerevisiae, showed the same residue in the altered position in other ascomycete sequences. However, different cases were found as well, such as the F1F0 ATP-synthase gene from Homo sapiens, in which the code-generated polymorphism of His to Gln could only be found at the same position in certain fungi sequences, a very distantly related taxa to H. sapiens, which may instead provide evidence that this algorithm may be describing ancient site specific sequences in which evolution acted to influence the current appearance of the gene. This was also observed in the wPR4 gene, which is involved in vacuolar defense in Triticum aestivum, in which the residue alterations in the positions suggested by the algorithm were found in eudicots, monocots and other more distant taxa. The code-generated nucleotide sequence of the hevein-like protein PR4 of Arabidopsis thaliana showed an alteration that was also found in eudicots and monocots. Based on these results, one may infer the possibility that the ECC generated sequence might represent a plesiomorphic state of the SNP on DNA sequences of interest, and, may be viewed as an alternative generator of the canonical genetic code.

This SNP plesiomorphic state evidence is supported by a Bayesian analysis and divergence time calculation for the code-generated and homologous sequences of the Arabidopsis thaliana malate dehydrogenase gene. The analyses showed that the A. thaliana malate dehydrogenase sequences form a monophyletic group rooted in the sequence generated by the ECC (Fig. 1). This sequence was generated by the Klein-linear code ((1023, 1013, 3) BCH code over Z4 with the generator polynomial g(x) = x10 + x9 + x8 + 3x7 + x6 + x4 + x3 + 3x+1 and labeling C, Tables 2 and 3) and was recovered as an external group for A. thaliana clade (Fig. 1). The divergence time analysis indicates that the MDH code-generated sequence node has diverged approximately 1 MYA before the development of any A. thaliana paralogous sequences. These observations suggest that the sequence generated by the code might be more closely related to the ancestor of A. thaliana malate dehydrogenase rather than to other paralogous genes, evidencing that the ECC code generated sequence has a SNP that may be indicating the ancient state of this sequence. The application of an ECC does not aim to reconstruct full ancestral sequences from a given phylogenetic tree and aligned gene sequences of some current species; here we describe an ancestral site specific reconstruction based solely on DNA primary structure recovered from coding and decoding gene sequences.

Table 2 A. thaliana - Mitochondrial - Malate dehydrogenase 1 – GI number 30695458.
Table 3 A. thaliana - Mitochondrial - Malate dehydrogenase 1 – GI number 30695458.
Figure 1
figure 1

The malate dehydrogenase (MDH) phylogenetic proposal and date inference used to estimate the divergence time among malate dehydrogenase (MDH) sequences.

The number on the nodes indicates the posterior probability and the number along the length of the branch indicates the age in millions of years ago (MYA). The asterisk indicates the sequence analyzed by the G-linear code.

Among the analyzed sequences, we identified several single nucleotide polymorphisms that were pointed out by the ECC as leading to a codon alteration (and also an amino acid alteration in the translated sequence), but in the deviant genetic codes, these altered codons correspond to the same amino acids that were found in the original sequence15,26,27.

In noncanonical genetic codes, alterations in the components of the translation mechanism confer different meanings to specific codons. For example, TGA is read as Trp17,27,28,29,30,31,32,33,34,35,36,37,38, AGA as Ser33,34,35,36,37, ATA as Met17,31,35,37,38,39,40 and TGA as Cys17,23.

When the F1 ATPase gene from Ipomoea batatas (GI 217937) was applied to the DNA Sequence Generator Algorithm, the output sequence presented an alteration in the sense codon TGG (encoding Trp) to become the stop codon TGA (Table 4a). A similar example is observed in the BRCA1 gene sequence in H. sapiens (GI 25140446), which is altered by the code from Cys to a stop codon (Table 4b). Interestingly, in the mitochondrial genetic code of most organisms, aside from green plants, the codon TGA is associated with tryptophan and studies have shown that in the primary structure of mitochondrial and nuclear genomes, the TGA codon does not signal for the release of the transcription factors but instead codes for Trp17,28,30,38.

Table 4 DNA sequences generated by BCH code.

The code generated sequence for the Allergen Pol d5 gene of Polistes dominulus (GI 51093376) showed an alteration from AGT (Ser) to AGA (Arg), the same happening with the code generated sequence for the anti-epilepsy peptide precursor of Mesobuthus martensii (GI 16740522) showed an alteration from ATG (Met) to ATA (Ile) (Table 4b,c). In noncanonical genetic codes, AGA codes for Ser33,34,35,36,37 and ATA for Met17,35,37,39,40,41; therefore, these alterations could modify the folding and activity of the subsequent protein due to a change in the charge and hydropathy of the residues.

Often times, the same codon reassignment may independently occur multiple times in different taxa. The mechanisms leading to codon reassignment have yet to be fully elucidated and may be due to factors such as codon disappearance, an ambiguous intermediate, or unassigned codons16,17,42,43,44. Deviant genetic codes are an example of how populations cross over maladaptive valleys from one adaptive peak to another, in respect to error minimization, via adaptive bridges45. Therefore, the algorithm may underlie any of the stages of information transmission, representing an earlier stage of the evolution of the universal/canonical code.

The characters, character states and the evolution of ancient genes or proteins can hardly be directly studied, because such molecule are rarely preserved over the evolutionary time or from any ancestral, living or preserved, has not been gathered from the nature. Pauling and Zuckerkandl once proposed that ancestral molecules could one day be “resurrected” by digging out from the evolution their ancient form46. Since then, different methods of ancestral sequence reconstruction (ASR) have emerged based on parsimony47, Bayesian inference48 or maximum likelihood49. Independently of the methodology used all these approaches rely on multiple sequence alignment with the aim of elucidating the complete and distant sequences (Supplementary material 1 presents a maximum likelihood analysis for the Arabidopsis thaliana Malate Dehydrogenase). Here, we hypothesize that the G-linear code may identify the original molecular primary structure of the sequence using only the intrinsic nucleotide composition. The DNA sequence generation algorithm can describe the plesiomorphic state of certain DNA character state sequences, as the suggested single nucleotide alteration often occurs in distant taxa and is maintained by alternative codon patterns in noncanonical genetic codes.

In summary, the G-linear code, commonly associated with reliable digital transmission, even with all the constraints inherent to the construction of the ECC4,5,6,7, unwraps the molecular component of every living cell when it is applied to the primary structure of DNA, thus revealing ancient information that may have been silenced by assorted evolutionary pressures that have shaped the present forms of life. This code generates point mutations that can be found in actual (real) sequences and the DNA sequence generation algorithm can be used in computer simulations for the analysis of polymorphisms and mutations.

Methods

Identification of the DNA sequences

Although several DNA-encoding sequences (organelle-targeting sequences, introns, protein motifs and full proteins) were identified by the corresponding G-linear codes over finite Galois rings and fields, as shown in Table 1, the majority of these DNA sequences were identified by the G-linear codes over rings. One possible explanation is that the latter algebraic structure may be more flexible than the algebraic structure of fields. As a consequence, the sequences identified by the corresponding G-linear codes over fields exhibit less adaptability than those offered by G-linear codes over rings. This observation suggests that it is possible to classify the proteins according to their stability in the mutation index, allowing a new approach for the classification of DNA sequences from a mathematical point of view.

All of the DNA sequences analyzed by the DNA sequence generation algorithm were identified as belonging to the “cloud” of the corresponding code words of the ECC. In other words, the actual DNA sequences differ from the corresponding code words of the ECC by a single nucleotide. The code-generated sequences in which the single nucleotide alteration led to an amino acid change in the translated protein were further analyzed.

These code-generated sequences were used as queries in a Blastx search, with the results filtered for green plants, fungi, bacteria, Archaea, algae and monocots from the NCBI non-redundant protein sequence database. The Blastx results were then aligned with Muscle50,51 (CLC Bio Genomics workbench plugin) and the position of the altered amino acid was compared with these results.

Several codons with the same meaning have been reassigned in independent lineages, which could mean that there is an underlying predisposition towards certain reassignments43. As an example of how the DNA sequence generation algorithm could be determining the ancient codon patterns in the analyzed species, we searched for codons in the code-generated sequences that were related to meaningful biological parts of nuclear or mitochondrial genomes16,27 and had a codon reassignment in other species.

Evolutionary proposal and estimation of the divergence time based on Bayes approach - Estimates of divergence time among malate dehydrogenase (MDH) sequences

The divergence time between fungi and green plants52, mosses and vascular plants52,53 and eudicot rosids and asterids54 was used to estimate a divergence time for the Arabidopsis thaliana group. Species-level phylogenies were generated using a Bayesian uncorrelated lognormal relaxed clock model in Beast version 1.4.855. The dataset followed the GTR + Γ model of substitution implemented in Beast and two Monte Carlo Markov chains were run for 90,000,000 generations, using the Yule speciation model, using a 10% burn-in with sampling trees generated every 10,000 generations.

Additional Information

How to cite this article: Brandão, M. M. et al. Ancient DNA sequence revealed by error-correcting codes. Sci. Rep. 5, 12051; doi: 10.1038/srep12051 (2015).