drosopyla files: Dodging the central dogma of life

In previous article we explored the Central Dogma of Life. To recapitulate, simply it says the following:

DNA makes RNA, and RNA makes protein.

It presents the natural order of things, however very often it is required to go backwards, for example to find the region in the RNA or DNA that is encode a specific peptide. To do that methods that are able to convert Protein to RNA and RNA to DNA are required. The drosopyla package is implementing two methods of doing this, the back_translation of the RNASequence class and ProteinToRNA class. This new path backwards is illustrated in the Figure 1 below:

Additionally this article will present and additional module for directly finding a protein encoding region in a sequence of DNA or RNA. Let start by following a bottom-up approach.

Generating all RNA sequences encoding a protein.

As we studied in the last article because of the genetic code redundancy some amino acid can be encoded by multiple RNA codons. This means that the same peptide can be encoded by a vast number of RNA sequences, and this number is growing is exponentially with the number of amino acids in the peptide sequence. Without a reference genome or DNA/RNA region it is impossible to select the correct encoding RNA sequence that would really encode a peptide from all variations of RNA sequence. Because of this the ProteinToRNA class from the protein module generates all possible versions of RNA that encodes a peptide.

For this article will be used a small regulatory peptide baxL and the genome of E.coli bacteria. Escherichia coli is one of the powerhouses of bacterial and genetics information. It is a bacteria of the genus Escherichia that is commonly found in the lower intestine of warm-blooded organisms. It’s genome is openly available, and more than that it is available for multiple strands. In this article we are going to used the strand K-12, whose genome can be found here.

baxL is one of the presumably transcription controlling factors in this organisms. It together with other controlling factors were published this article. Information about this peptide can be found on uniprot website, and it is MIKSDQET. Let’s start with defining the protein.

# Importing all needed modules.
from drosopyla.sequences import Protein
baxL = Protein("MIKSDQET")

To generate all RNA sequences able to encode the baxL we will import and create a ProteinToRNA instance, which will be used for this purpose as shown bellow:

# Importing and creatin the protein to RNA module.
from drosopyla.protein import ProteinToRNA

protein_to_rna = ProteinToRNA()

# Getting all versions of RNA encoding the baxL
baxL_encoding_rnas = protein_to_rna.create(baxL)

# Showing the number of RNA sequences able to encode baxL
print(f"The baxL ({baxL.string()}) can be encoded by {len(baxL_encoding_rnas)} RNA sequences.")

The baxL (MIKSDQET) can be encoded by 1152 RNA sequences.

In this way it is possible to get all RNA sequences encoding a peptide. The next step backward is the transformation of the RNA to DNA.

Also take into account that it is possible to use the 2 non-standard amino acids that we learned previously in the last article. For this it is required to update the amino acid to codon relationships in the standard AMINO_ACIDS2RNA dictionary from the utils module in the following manner.

# Importing the Amino Acid to RNA codons mapper.
from drospyla.utils import AMINO_ACIDS2RNA
new_aa2rna = AMINO_ACIDS2RNA.copy()
new_aa2rna.update({"U" : "UGA", "O" : "UAG"})

# Getting all versions of RNA encoding the baxL
baxL_encoding_rnas = protein_to_rna.create(
    baxL,
    aa_codon_mapper = new_aa2rna
)

Inferring the DNA that was translated to a RNA sequence.

In case we are working with an RNA genome this step might be useless for finding protein encoding region. However when this region should be found in a DNA genome or region, it is required to translate back the RNA to DNA. This functionality is implemented directly in the RNASequence class and it’s application is shown below:

# Getting the DNA region that is getting translated in the first version 
# of RNA encoding the baxL
baxL_dna = baxL_encoding_rnas[0].back_translation()

print(baxL_dna)
print(baxL_dna.string())

<DNA: ATGATCAAGT...TCAGGAAACC>
ATGATCAAGTCCGATCAGGAAACC

Finding the location of ball encoding regions in the E.coli genome.

All these steps are grouped together in the ProteinEncodingRegionFinder class from the protein module of the package. This call can be used to find regions encoding a protein in both the DNA and RNA by just providing the DNA or RNA sequence and the Protein of interest. From the provided data type of search region it will understand in what type of molecule to search. Also if a DNASequence is provided it will search for the encoding region in both strands.

The output of calling find() function of the object is a list of tuples. Every tuple consisting of a DNASequence or RNASequence object and a string showing the strand on which it was found - “→” or “←”. The application of the code and it’s output for the baxL peptide and E.coli genome are listed below:

# Importing all needed modules.
from drosopyla.sequences import DNASequence
from drosopyla.protein import ProteinEncodingRegionFinder

# Loading the E.coli-K12 genome.
genome = open(r"E.coli_K-12.txt", "r").readlines()
genome = "".join(
    [genome_line.replace("\n", "") for genome_line in genome]
)
genome = DNASequence(genome)

# Creating the P.E.R.F. object and searching for the region encoding baxL
perf = ProteinEncodingRegionFinder()
regions = perf.find(genome, baxL)

print("The output of the P.E.R.F.:")
print(regions)
print()

# Getting the strand and location of the encoding region.
strand = "5' -> 3'" if regions[0][1] == "->" else "3' -> 5'"
if strand == "5' -> 3'":
    start_index = genome.get_locations(regions[0][0].string())
    end_index = start_index + len(regions[0][0])
else:
    start_index = genome.get_reverse_complement().get_locations(regions[0][0].string())
    end_index = len(genome) - start_index[0]
    start_index = end_index - len(regions[0][0])
 
# Formating the output.
print("E.coli K-12 strand peptides encoding regions:")
print(f"strand = {strand}, start = {start_index}, end = {start_index + len(regions[0][0])}")
print(f"baxL - {baxL.string()}")
print(f"DNA - {regions[0][0].string()}")

The output of the P.E.R.F.:
[(<drosopyla.sequences.DNASequence object at 0x0000021B3B109780>, '<-')]

E.coli K-12 strand peptides encoding regions:
strand = 3' -> 5', start = 3737179, end = 3737203
baxL - MIKSDQET
DNA - ATGATTAAATCCGACCAGGAGACC

As it can be seen in the formatting step we have to post-process the output of P.E.R.F. mainly because of region coordinates that are depending on the location within strand. If the region is located on the main strand (5’ → 3’) than we are just returning the starting and ending indexes. If the region is located on the reversed strand we need convert the indexes of the region from the reversed complement to the forward complement as shown above.

Also in the same manner as with ProteinToRNA instance you can provide your new genetic code mapping amino acids to codon to the P.E.R.F.’s find function to find the encoding regions for peptides containing non-standard amino acids as shown below:

# Importing the Amino Acid to RNA codons mapper.
from drospyla.utils import AMINO_ACIDS2RNA
new_aa2rna = AMINO_ACIDS2RNA.copy()
new_aa2rna.update({"U" : "UGA", "O" : "UAG"})

# Getting all versions of RNA encoding the baxL
regions = perf.find(genome, baxL, aa_codon_mapper = new_aa2rna)

Conclusion

As shown above the modules from the drosopyla package is allowing to find regions encoding peptides in a genome that is DNA or RNA based with ease. Also it is possible to build on top of the package’s output beautiful formatted results with little additional work. You can integrate this modules in different pipelines and focus on the esthetic part of results. In one of the future articles we will see how is possible to create a pipeline mapping peptides identified from spectrums on a defined genome.

Dodging the central dogma of life.

Generating all RNA sequences encoding a protein.

Inferring the DNA that was translated to a RNA sequence.

Finding the location of ball encoding regions in the E.coli genome.

Conclusion

Comments

The allele dominance law

Frequency Tables of k-mers

More from this blog

Frequency Tables of k-mers

Command Palette

Generating all RNA sequences encoding a protein.

Inferring the DNA that was translated to a RNA sequence.

Finding the location of ball encoding regions in the E.coli genome.

Conclusion

Comments

The allele dominance law

Frequency Tables of k-mers

More from this blog