Evolution of the genetic code

What do the observed patterns tell us about the evolution of the genetic code? The so-called biosynthetic theory assumes that the genetic code evolved from a simpler form that encoded fewer amino acids (Crick 1968). A special version of this theory was given by Wong (1975), who proposed that the genetic code coevolved with the invention of biosynthetic pathways for new amino acids. Although it has been shown that his analyses rest on wrong assumptions (Ronneberg et al. 2000), it is generally accepted that one can distinguish evolutionarily old from new amino acids (Alberts et al. 2002). Of course, it could be that the binding assignment between nucleic acid molecules (RNAs or even PNAs (Knight and Landweber 2000b)) and amino acids did not start until all 20 amino acids were available; but it seems simpler to assume that as soon as amino acids and nucleic acids were available (produced abiotically), they began to bind to each other. It now seems clear that “the code probably underwent a process of expansion from relatively few amino acids to the modern complement of 20” (Knight and Landweber 2000b).

Does our scheme yield some hints as to the evolution of the code? We already noted that the third nucleotide is nearly always (with two exceptions in the standard code) analyzed in only a binary manner. Taking this for granted, we can reduce our original 8x8 scheme to an 8x4 scheme. Looking at this scheme, we observe high redundancy in every second row. Therefore, it is tempting to speculate that there was a period during code evolution when the third position was not needed at all. Assuming this, we can delete every second row and are left with a pure doublet code that encodes 4x4=16 amino acids (or 15 plus a termination codon). Perhaps, then, a doublet code preceded the triplet code, as has already been speculated (Jukes 1973, Hayes 1998).
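The binary reading of the third position can be checked directly against the standard code table. The following Python sketch is only an illustration and not part of the original analysis: the names `CODE`, `binary_third_exceptions` and `fully_degenerate_doublets` are hypothetical, and the table is assumed to be the standard code written in UCAG order with `*` marking termination.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order; '*' = termination.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[i]
        for i, (a, b, c) in enumerate(product(BASES, repeat=3))}


def binary_third_exceptions(code):
    """Doublets for which a purine/pyrimidine reading of the third base
    is not enough, i.e. U vs C or A vs G changes the meaning."""
    exceptions = []
    for a, b in product(BASES, repeat=2):
        d = a + b
        if code[d + "U"] != code[d + "C"] or code[d + "A"] != code[d + "G"]:
            exceptions.append(d)
    return exceptions


def fully_degenerate_doublets(code):
    """Doublets whose meaning does not depend on the third base at all."""
    return [a + b for a, b in product(BASES, repeat=2)
            if len({code[a + b + c] for c in BASES}) == 1]


if __name__ == "__main__":
    print("exceptions to the binary third position:", binary_third_exceptions(CODE))
    print("doublets needing no third base:", fully_degenerate_doublets(CODE))
```

Under these assumptions the check reports the doublets UG and AU (UGA/UGG and AUA/AUG) as the two exceptions, and eight of the sixteen doublets as fully independent of the third base.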

Conceivably, codon expansion from doublet to triplet could have arisen earlier, or possibly not until all 16 amino acids were encoded. If one assumes the latter, then it is interesting to postulate the corresponding old amino acid for each doublet. Met (Wong 1975), Trp, Gln, Asn (Knight and Landweber 2000b), and Tyr (Alberts et al. 2002) seem to be newer amino acids. As mentioned above, Szathmary (1999) proposed an evolutionary mechanism of tRNA formation. In principle, this mechanism could also work starting with doublets instead of triplets. It should be possible to gain experimental evidence for a doublet code by studying amino acid – nucleic acid doublet binding in the same way as has been done for triplets. Knight and Landweber (2000a) showed that Arg triplet codons alone significantly associate with arginine binding sites. Perhaps the doublets show a higher specificity.

Binary doublet code representation (00*, 10*, 01*, 11*): a) * is a pyrimidine; b) * is a purine.


However, proposing a doublet code raises the frameshifting problem. It seems unthinkable that a sudden transition from a two-letter to a three-letter frame ever occurred. Instead, one can imagine a gradual evolution with an ancient three-letter reading frame in which only the first two letters were analyzed by an ancient translation machinery. However, one then wonders about such an inefficient use of coding space. Perhaps the ancient translation machinery simply could not, for stereochemical reasons, analyze a two-letter frame. In this context it is also interesting to note that even our contemporary code is somewhat ‘inefficient’: a quaternary doublet code can already encode 16 amino acids (or 15 plus a termination codon). A third letter is necessary for just four (or five) further amino acids. Of course, this inefficiency has the advantage of robustness-enhancing redundancy.
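The coding-space arithmetic behind this remark can be written out as a short Python sketch. It is purely illustrative; the function names `capacity` and `shortfall` are hypothetical, and the requirement of 20 amino acids plus one termination signal is taken from the contemporary code.

```python
def capacity(alphabet_size, codon_length):
    """Number of distinct codons for a given alphabet and codon length."""
    return alphabet_size ** codon_length


def shortfall(alphabet_size, codon_length, amino_acids=20, stop_signals=1):
    """How many required meanings cannot be given a codon of their own."""
    needed = amino_acids + stop_signals
    return max(0, needed - capacity(alphabet_size, codon_length))


if __name__ == "__main__":
    for length in (2, 3):
        print(f"quaternary codons of length {length}: "
              f"{capacity(4, length)} codons, "
              f"shortfall {shortfall(4, length)}")
```

With a four-letter alphabet, doublets give 16 codons and fall short of the 21 required meanings by five, whereas triplets give 64 codons, about three times as many as are needed.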

Szathmary (1992, 2003) proposed a model in which two different base pairs represent an optimal compromise between the overall copying fidelity and the overall reproduction rate (metabolic efficiency). He assumed that the genetic code developed before evolution invented proofreading. For higher copying fidelity (due to proofreading, etc.), the model predicts that three different base pairs are better than just two. It is tempting to speculate that in the earliest phases of biological evolution, with the lowest copying fidelity, just one base pair could have worked as well. (The copying fidelity is always highest for just one base pair. Nevertheless, Szathmary’s simple model gives no one-base-pair optimum, but a more detailed model for the metabolic efficiency could do so.) So, perhaps, nucleic acid – amino acid mapping started with a binary code. This is in accordance with earlier speculations that the first genetic material contained only a single base-pairing unit (Crick 1968, Orgel 1968). An important argument in this context is the chemical instability of cytosine, which may make it difficult to establish a genetic system with G-C base pairing (Levy and Miller 1998).

Wächtershäuser (1988) proposed an all-purine precursor of nucleic acids. However, for the sake of self-replication it is more natural to assume a two-letter code that can give rise to complementary base pairing. Jimenez-Sanchez (1995) argued for an early (binary) A-U coding. Recently, a ribozyme composed of only two different nucleotides, the pyrimidine uracil and the purine 2,6-diaminopurine, was obtained by in vitro evolution (Reader and Joyce 2002). Note that uracil is the biosynthetic precursor of the pyrimidines cytosine and thymine (the corresponding precursor of the purines adenine and guanine is hypoxanthine).

Of course, a binary encoding would also be the most aesthetic version from a purely mathematical point of view. A binary triplet code would represent just one column in our scheme (Fig. 2). Given the high redundancy between the rows, it is unlikely that this ever happened. However, an even simpler coding, a binary doublet code, seems conceivable. It is tempting to speculate which four amino acids, one per two consecutive rows, were the first to be encoded. In the first two rows (two pyrimidines, i.e. 00) Ser seems to be the oldest amino acid, and in the third and fourth rows (10) Ala (Wong 1975). On the other hand, the 01 rows obviously contain no really old amino acid, while the 11 rows contain more than one: Gly, Asp, Glu (Wong 1975).
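The grouping of amino acids under a binary doublet code can be reproduced with a small Python sketch. It is meant as an illustration only: the names `binary_doublet` and `group_by_binary_doublet` are hypothetical, pyrimidines (U, C) are mapped to 0 and purines (A, G) to 1 as in the text, and the table is assumed to be the standard code in UCAG order with `*` for termination.

```python
from collections import defaultdict
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order; '*' = termination.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[i]
        for i, (a, b, c) in enumerate(product(BASES, repeat=3))}
PURINES = set("AG")


def binary_doublet(codon):
    """Map the first two bases of a codon to 0 (pyrimidine) or 1 (purine)."""
    return "".join("1" if base in PURINES else "0" for base in codon[:2])


def group_by_binary_doublet(code):
    """Collect the one-letter amino acid symbols found under each binary doublet."""
    groups = defaultdict(set)
    for codon, amino_acid in code.items():
        groups[binary_doublet(codon)].add(amino_acid)
    return {key: sorted(members) for key, members in groups.items()}


if __name__ == "__main__":
    for key, members in sorted(group_by_binary_doublet(CODE).items()):
        print(key, " ".join(members))
```

In this sketch Ser (S) falls under 00, Ala (A) under 10, and Gly, Asp and Glu (G, D, E) under 11, while the termination symbol appears only under 01.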

One could speculate that the termination marker was important from the very beginning and was therefore coded by the 01 binary doublet. It has been noted that the five amino acids coded by G** (Ala, Val, Gly, Asp, Glu) are all at or near the head of the amino acid synthesis pathways (Taylor and Coates 1989) and are also the most abundantly formed ones in abiotic synthesis experiments (Miller 1953, 1987). Furthermore, it has been shown recently by extensive statistical analyses that the frequencies of all five G** amino acids are significantly greater in evolutionarily conserved residues, and it has been concluded that “these amino acids may have been the first introduced into the genetic code” (Brooks and Fresco 2002, 2003, Brooks et al. 2002). This is also consistent with physicochemical arguments proposing that the first sense codons had the form G** (Eigen and Schuster 1978). However, Gly is biochemically built from Ser, so Ser can be assumed to be the older of the two. It could be that at the beginning of nucleic acid – amino acid assignment Asp and Glu competed for the 11 doublet. Of course, code transfer from one amino acid to another might also have occurred (Wong 1975).

Another scenario consistent with a binary doublet code is given by Fitch’s “ambiguity reduction” hypothesis (Fitch and Upper 1987). It states that early in evolution there was an ambiguity in the charging of amino acids to anticodon acceptors: in a first step only *pyrimidine* codons (*0*), coding for hydrophobic amino acids, and *purine* codons (*1*), coding for hydrophilic amino acids, were distinguished (binary singlet code). In a second step the more refined binary doublet code (00*, 01*, 10*, 11*) evolved.

The idea that the doublet code was only the second stage in the evolution of the genetic code, and that this evolution started with just the middle base carrying coding information, has been worked out by others, who termed it the “simplet” code (McClendon 1986, Schwemmler 1994). However, under this hypothesis neither the two old amino acids Ser (UC*) and Ala (GC*) nor Asp (GA*) and Glu (GA*) can be discriminated. We therefore suggest that the first two positions were equally important from the very beginning. Although our suggestion also does not allow discrimination between the related amino acids Asp and Glu, it nevertheless allows discrimination between the functionally divergent amino acids Ser and Ala. A further argument for the evolutionary importance of the first two nucleotides is the strong correlation observed between codon strength and the amino acid properties.
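The comparison between the simplet and the binary doublet reading can be made explicit in a few lines of Python. This is a minimal sketch under stated assumptions: the helper names `simplet_key`, `binary_doublet_key` and `distinguishable` are hypothetical, each amino acid is represented only by the doublet quoted in the text (UC for Ser, GC for Ala, GA for Asp and Glu), and pyrimidines are mapped to 0 and purines to 1.

```python
PURINES = set("AG")

# First two codon bases as quoted in the text for each amino acid.
DOUBLETS = {"Ser": "UC", "Ala": "GC", "Asp": "GA", "Glu": "GA"}


def simplet_key(doublet):
    """Simplet reading: only the middle base (second position) is informative."""
    return doublet[1]


def binary_doublet_key(doublet):
    """Binary doublet reading: purine (1) or pyrimidine (0) at both positions."""
    return "".join("1" if base in PURINES else "0" for base in doublet)


def distinguishable(aa1, aa2, key):
    """True if the two amino acids get different keys under the given reading."""
    return key(DOUBLETS[aa1]) != key(DOUBLETS[aa2])


if __name__ == "__main__":
    for pair in (("Ser", "Ala"), ("Asp", "Glu")):
        print(pair,
              "simplet:", distinguishable(*pair, simplet_key),
              "binary doublet:", distinguishable(*pair, binary_doublet_key))
```

Under these assumptions Ser and Ala share the middle base C and are separated only by the binary doublet reading (00 versus 10), whereas Asp and Glu remain indistinguishable under both readings.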