Building a Programming Language for Genetic Engineering — Part 2 — Introduction to the Grammar of DNA

May 26, 2021


Cells are like 3D printers

Cells are just like 3D printers: they read a set of instructions written in DNA and assemble molecules whose shapes and properties define their function.

In genetic engineering we can use cells as bio-factories to produce these molecules (known as recombinant proteins) in a process called fermentation. The molecules can then be extracted (or purified) by different methods.

How molecular 3D printing works inside the cell

The genetic instructions are stored in the cell as double-stranded DNA sequences. Parts of these sequences are then copied into single-stranded pre-mRNA molecules, in a process called “Transcription”.

The non-coding regions (introns) of the pre-mRNA sequences are then removed, and the exons (coding regions) are joined together to form a mature messenger RNA (mRNA), in a process called “RNA Splicing”.
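Transcription and splicing can be approximated with plain string operations. This is only a rough sketch under simplifying assumptions: the input is taken to be the coding strand (so the RNA carries the same letters, with ‘u’ in place of ‘t’), the intron positions are supplied by hand rather than detected, and the toy gene and the function names are made up for illustration.

```python
def transcribe(coding_strand: str) -> str:
    """Return the pre-mRNA for a DNA coding strand (every t becomes u)."""
    return coding_strand.lower().replace("t", "u")


def splice(pre_mrna: str, introns: list[tuple[int, int]]) -> str:
    """Remove introns given as (start, end) index pairs, end exclusive,
    and join the remaining exons in their original order."""
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])
        position = end
    exons.append(pre_mrna[position:])
    return "".join(exons)


# Made-up toy gene: exon + intron + exon (not a real sequence).
dna = "atggcc" + "gtaagtttag" + "aaataa"
pre_mrna = transcribe(dna)
mrna = splice(pre_mrna, [(6, 16)])  # the intron occupies indices 6..15

print(pre_mrna)
print(mrna)
```

In a real cell the intron boundaries are recognized from signals in the sequence itself; here they are simply passed in as coordinates.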

The copies of mRNA then bind to ribosomes, which are like small 3D printers inside the cell. The ribosomes read the copies of the instructions and translate them into amino acid sequences, in a process called “Translation”. These amino acid chains then fold, spontaneously but not randomly, into a specific 3D structure. The folded chain is a protein whose electrical and mechanical properties define its function inside or outside the cell.

DNA is a low level machine language

DNA is a double-stranded molecule. It’s a long chain of instructions embedded in sequences that have no known function (often called junk DNA).

In English we have 26 letters, A-Z. DNA is made of sequences of 4 letters (nucleotides):

Adenine (a)
Cytosine (c)
Guanine (g)
Thymine (t)

Or just ‘a’, ‘c’, ‘g’, and ‘t’ in short.

(It is common to use lower case letters for DNA sequences)
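Since the alphabet has only four letters, checking whether a string is a plausible DNA sequence is straightforward. A minimal sketch follows; the name `invalid_letters` is an arbitrary choice for this example.

```python
DNA_ALPHABET = frozenset("acgt")


def invalid_letters(sequence: str) -> set[str]:
    """Return the characters in `sequence` that are not a, c, g or t.

    Upper-case input is accepted and treated as lower case.
    """
    return set(sequence.lower()) - DNA_ALPHABET


print(invalid_letters("atgcat"))  # nothing invalid: an empty set
print(invalid_letters("atgxcu"))  # 'x' and 'u' are not DNA letters
```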

DNA sequences also have a direction. In English we write from left to right; in DNA we write from 5' (5 prime) to 3' (3 prime) (more on this later).

DNA has a grammar. In genetic engineering, instructions are made of genetic LEGO parts called biobricks that can be combined one after the other to form a genetic program.

One of the most important features of DNA for life in general and for Genetic Engineering in particular is that the double stranded sequences are complementary.

‘a’ binds (pairs) with ‘t’
‘t’ binds (pairs) with ‘a’
‘c’ binds (pairs) with ‘g’
‘g’ binds (pairs) with ‘c’

A pair of letters (nucleotides), one on each strand of the double helix, is called a ‘base-pair’ (or ‘bp’ in short).
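The pairing rules translate directly into a lookup table. The sketch below computes the partner strand of a sequence; because the two strands run in opposite directions, the partner is usually written reversed so that it also reads 5' to 3' (the ‘reverse complement’). The function names are chosen for this example.

```python
# a <-> t and c <-> g, expressed as a character translation table.
PAIRING = str.maketrans("acgt", "tgca")


def complement(sequence: str) -> str:
    """Return the paired letter for every position, in the same order."""
    return sequence.lower().translate(PAIRING)


def reverse_complement(sequence: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    return complement(sequence)[::-1]


print(complement("atgc"))          # tacg
print(reverse_complement("atgc"))  # gcat
```

Applying `reverse_complement` twice returns the original sequence, which is a handy sanity check.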

Thanks to this remarkable feature we can do “cut”, “copy” and “paste”, chemically, inside or outside the cells (more on this in later chapters).

DNA is made of 4 letters: ‘a’, ‘c’, ‘g’, and ‘t’

This is what a raw sequence of DNA letters looks like:

tgaaggtgatacccttgttaatagaatcgagttaaaaggtattgacttcaaggaagatggcaacattctgggacacaaattggaatacaactataactcacacaatgtatacatcatggcagacaaacaaaagaatggaatcaaagtgaacttcaagacccgccacaacattgaagatggaagcgttcaactagcagaccattatcaacaaaatactccaattggcgatggccctgtccttttaccagacaaccattacctgtccacacaatctgccctttcgaaagatcccaacgaaaagagagaccacatggtccttcttgagtttgtaacagctgctgggattacacatg

A ‘Hello World’: basic genetic program structure

In C or Java we have syntax constructs like “if else”, “while”, “print”, “function declaration”, and “end of line”.

In genetic engineering, instead of looking at raw DNA sequences, we can combine genetic parts one after the other, to form a genetic program, where each genetic part encapsulates a raw DNA sequence and a grammatical meaning.

For example:

5' promoter(…) rbs(…) cds([…]) terminator(…) 3'

The sequence starts at the 5 prime end and ends at the 3 prime end (in our case, from left to right).
A Promoter is a genetic part that determines how much of a protein to manufacture, or when. It’s like a while(condition).
A Ribosome Binding Site (RBS) is a sequence that determines where a ribosome attaches to the mRNA sequence. It also controls how many copies of the protein are produced, but by a different mechanism. It’s like a sleep(milliseconds).
A Protein Coding Sequence (CDS) is the instruction to the ribosome: “which protein to manufacture”, or the “protein recipe”, for example insulin or a green pigment. It’s like a print() or G-code.
A Terminator sequence indicates where the gene ends (the point where transcription to mRNA stops). It’s like a semicolon ‘;’.

Compared to computer code, it’s something like:

while(condition)
{
sleep(milliseconds)
print(green pigment)
}
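The four-part structure can also be sketched as data. The following is a minimal Python illustration, not an existing tool: the `Part` class, the `compile_program` function, the part names, and the short sequences are all invented for this example and are not real biological parts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    kind: str      # "promoter", "rbs", "cds" or "terminator"
    name: str      # human-readable label
    sequence: str  # raw DNA letters, written 5' to 3'


EXPECTED_ORDER = ["promoter", "rbs", "cds", "terminator"]


def compile_program(parts: list[Part]) -> str:
    """Check the part order, then concatenate the parts 5' to 3'."""
    kinds = [part.kind for part in parts]
    if kinds != EXPECTED_ORDER:
        raise ValueError(f"expected {EXPECTED_ORDER}, got {kinds}")
    return "".join(part.sequence for part in parts)


# Placeholder sequences, invented for illustration only.
program = [
    Part("promoter", "demo_promoter", "ttgacaat"),
    Part("rbs", "demo_rbs", "aggagg"),
    Part("cds", "demo_cds", "atgaaataa"),
    Part("terminator", "demo_terminator", "gcgcttttttt"),
]

print(compile_program(program))
```

Each part hides its raw letters behind a grammatical role, so the program can be read and checked at the level of parts rather than nucleotides.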

The compiled result is a raw DNA sequence:

The raw DNA sequence of a sample gene in FASTA format. The first line is a comment and the rest is the raw sequence
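As described in the caption, a FASTA record is a header line followed by the raw letters. Below is a small sketch of writing and reading a single record, assuming the usual convention that the header line starts with ‘>’. The function names, the record name, and the 60-letter line width are choices made for this example.

```python
def to_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Format one record: a '>' header line, then the sequence in fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f">{name}", *lines]) + "\n"


def from_fasta(text: str) -> tuple[str, str]:
    """Parse a single-record FASTA string into (name, sequence)."""
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with a '>' header line")
    return lines[0][1:], "".join(line.strip() for line in lines[1:])


record = to_fasta("sample_gene", "tgaaggtgatacccttgttaatagaatcgag", width=10)
print(record)
print(from_fasta(record))
```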

Genetic constructs can also be visualized in multiple graphical representations:

1. SBOL notation:

2. iGEM BioBricks notation:

3. Circular (‘Plasmid’) notation:

The ‘Genetic Code’ — the grammar of protein coding sequences

In addition to the grammar of DNA, there is a more specific sub-grammar (called the ‘Genetic Code’). Living cells use it to translate the information encoded within a specific, delimited stretch of genetic material, which describes to the cell what a protein’s structure should look like.

The ribosomes work like interpreters: a ribosome reads an mRNA sequence and writes (prints) a matching chain of amino acids, in a process called ‘Translation’.

The amino acid chains (polypeptides) then fold spontaneously, but not randomly, into a specific shape. A short folded amino acid chain (2–50 amino acids) is called a peptide; a long folded chain (50 or more amino acids) is called a protein (a protein can also be part of a larger, more complex protein).

There are 22 types of amino acids that living cells use to form proteins (20 in the standard genetic code, plus selenocysteine and pyrrolysine):

List of available amino acids in living cells, used to form a protein — source: FAO

A protein coding sequence (or ‘CDS’ in short) is a DNA sequence that is transcribed into mRNA copies, which are in turn translated into amino acid chains that fold into peptides or proteins.

In other words, a coding sequence (CDS) is a DNA sequence that encodes a specific peptide or protein.

Each triplet of 3 adjacent letters (nucleotides), termed a ‘codon’, encodes a single amino acid. The mapping of a specific codon to a specific amino acid can depend on the chassis (the host organism).

A ribosome reads a sequence of nucleotide triplets (codons) and translates them to a sequence of amino acids

Cells of different chassis can map a codon to a different amino acid, or use codons at different frequencies (multiple codons can encode the same amino acid in the same chassis). Thus, to enable or improve protein expression, codon sequences can be altered (‘optimized’) for a specific organism.
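A very simplified picture of codon optimization is to swap each codon for a synonymous one that the target organism is assumed to prefer, leaving the encoded amino acids unchanged. The sketch below uses the standard codon table; the `PREFERRED` mapping is hypothetical and only stands in for a real, organism-specific codon usage table.

```python
from itertools import product

# Standard codon table, built from the conventional t, c, a, g ordering.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Hypothetical preferences for an imaginary chassis (not measured data).
PREFERRED = {"K": "aaa", "N": "aac", "R": "cgt"}


def optimize(cds: str) -> str:
    """Replace codons with the preferred synonymous codon, where one is listed."""
    cds = cds.lower()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return "".join(PREFERRED.get(CODON_TABLE[codon], codon) for codon in codons)


original = "atgaagaacagataa"  # M K N R STOP
print(optimize(original))
```

Real optimization tools weigh many more factors than a single preferred codon per amino acid; this only shows that the DNA can change while the protein stays the same.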

Standard DNA codon table — source: Wikipedia

Standard DNA codon table in a circular presentation— source: Wikipedia

The coding sequence has a beginning, a middle and an end:

Every coding sequence starts with an ‘atg’ codon, which encodes methionine (Met, or just ‘M’ in short).
In the middle we have the codons that encode the amino acid chain.
At the end of the coding sequence we have stop codons that tell the ribosome to stop translation. There are 3 stop codons: ‘tga’, ‘taa’, and ‘tag’ (these codons do not encode an amino acid).
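These rules can be checked mechanically. A minimal sketch, assuming the sequence is given 5' to 3' and already in the right reading frame; the function name and the wording of the messages are invented for this example.

```python
STOP_CODONS = {"taa", "tag", "tga"}


def check_cds(cds: str) -> list[str]:
    """Return a list of problems found; an empty list means the basic shape is fine."""
    cds = cds.lower()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of 3")
    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    if not codons or codons[0] != "atg":
        problems.append("does not start with 'atg'")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    return problems


print(check_cds("atgaaaaacagataa"))  # well-formed: no problems
print(check_cds("aaaaacaga"))        # missing both start and stop
```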

To summarize — a coding sequence might look as follows:

5' (atg)(aaa)(aac)(aga)(taa)(tag) 3'

or, in amino acid terminology:

5' (M)(K)(N)(R)(STOP)(STOP) 3'

The sequence starts at the 5 prime end and ends at the 3 prime end (in our case, from left to right).
The first codon is a start codon: an ‘atg’ sequence that encodes ‘M’.
In the middle we have a sequence of codons that encode the amino acid chain.
At the end we have, in this example, two stop codons, ‘taa’ and ‘tag’, that tell the ribosome to stop translation.
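The example above can be reproduced with a small translation routine. This sketch reads the DNA coding sequence directly (skipping the mRNA step for brevity), uses the standard codon table, and stops at the first stop codon; the function name is chosen for this example.

```python
from itertools import product

# Standard codon table, built from the conventional t, c, a, g ordering.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate codon by codon, 5' to 3', until the first stop codon ('*')."""
    cds = cds.lower()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("atgaaaaacagataatag"))  # MKNR
```

The second stop codon is never reached: translation ends as soon as the first one is read.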

Registering a genetic App in the cell’s App Store

Chromosomes are like the operating system source code of the cell. They incorporate most of the genetic instructions required for the cell’s lifecycle.

The chromosome of E. coli (a type of bacteria that normally lives in the intestines) is a circular chain of roughly 4 million letters (nucleotides).

Usually, because of the complementary double-helix structure of DNA, we use the term ‘base-pair’ (or ‘bp’ in short) to describe how many letters there are. So in the case of E. coli, the chromosome size is roughly 4 million bp.

In addition to the cell’s operating system, small apps can coexist in a circular double-helix DNA structure called a ‘Plasmid’.
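Circularity matters when a plasmid is stored as an ordinary string: the starting point of the string is arbitrary, so a motif can span the place where the string ends and begins again. The sketch below searches a circular sequence with wrap-around; the function name and the toy sequences are made up for illustration.

```python
def find_on_circle(plasmid: str, motif: str) -> list[int]:
    """Return every start index of `motif` in a circular sequence,
    including matches that wrap past the end of the string."""
    if not motif or len(motif) > len(plasmid):
        return []
    # Append the first len(motif) - 1 letters so wrapped matches become visible.
    unrolled = plasmid + plasmid[:len(motif) - 1]
    return [i for i in range(len(plasmid)) if unrolled.startswith(motif, i)]


toy_plasmid = "ttcaaaagaa"  # invented 10-letter circle
print(find_on_circle(toy_plasmid, "gaattc"))  # [7]: the match wraps around
print(toy_plasmid.find("gaattc"))             # -1: a linear search misses it
```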

In genetic engineering, existing off-the-shelf plasmids can be modified, or new plasmids can be synthesized chemically from scratch. These plasmids can serve as packages that carry our designed genetic program.

Plasmids can be deployed to cells in a chemical process called ‘Transformation’. When the deployment succeeds, the cell starts to serve as a bio-factory: it interprets and executes our genetic program to produce our designed proteins.

Usually, only a few different types of plasmid (1–3) can be successfully deployed into the same cell. More than that can exhaust the cell’s resources and reduce, or completely eliminate, the execution of our code.

Other methods to deploy or to alter genetic code will be discussed in future articles.

pUC19 is a common plasmid that can serve as a package for our genetic program. Source: Addgene
