
The genome of a newly discovered bacterial species ( Bacillus sanfranciscus ) wa


Question

The genome of a newly discovered bacterial species (Bacillus sanfranciscus) was sequenced and found to have a circular genome of 4 x 10^6 base pairs (bp). Open reading frame (ORF) analysis indicated the presence of 3,190 ORFs that encode proteins with a median length of 270 amino acids (aa) and an average length of 360 aa.

a. What is the information content of this genome (i.e., how much information can be encoded in this length of DNA)? Since the genetic code is digital in nature, let’s convert the information content of base pairs into bytes and compare this value with the information content of a digital device many of us routinely use. The iPhone operating system, iOS8, requires approximately 5 GB of information to perform its functions. To compare the information content of B. sanfranciscus and the iPhone OS, the following assumptions about the digital content of DNA-encoded information might be helpful. The double helix can potentially encode information in both strands but this is not usually the case; most stretches of DNA encode information in only one strand, although for any given gene it can be either of the two strands. So it is therefore reasonable to assume that each base pair of DNA encodes 2 bits of information (since there are 4 possible nucleotides). Keeping in mind that 1 byte = 8 bits and 1 GB = 10^9 bytes, the calculation is pretty straightforward from there. Express your answer as iPhone iOS units.

b. Now calculate the percentage of the bacterial genome that encodes the cell’s complete proteome. Assume that: a) all of the predicted ORFs actually encode proteins, b) each gene is encoded by only one of the two strands of DNA, and c) there are no overlapping genes (i.e., no region of DNA encodes more than a single gene).

Explanation / Answer

(a) Comparing the genome to computer data storage

To represent a DNA sequence on a computer, we need to be able to represent all 4 possible bases in a binary format (0s and 1s). These bits are usually grouped together to form larger units, the smallest being a "byte" of 8 bits. Each base pair needs a minimum of 2 bits, since 2 bits yield 4 different combinations (00, 01, 10, 11), one for each possible base. A single byte (8 bits) can therefore represent 4 DNA base pairs. To express the entire diploid human genome in bytes, we can perform the following calculation:

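6*10^9 base pairs/diploid genome * 1 byte/4 base pairs = 1.5*10^9 bytes, or 1.5 gigabytes. That is about 2 CDs' worth of space, or small enough to fit 3 separate genomes on a standard DVD!

Applying the same conversion to the B. sanfranciscus genome from the question: 4*10^6 base pairs * 1 byte/4 base pairs = 10^6 bytes = 0.001 GB, which is 0.001 GB / 5 GB = 0.0002 iPhone iOS units.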
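The conversion from base pairs to bytes can also be sketched in a few lines of Python. This is only an illustration: the function and constant names (`genome_bytes`, `ios_units`, and so on) are made up for this example, while the numbers (2 bits per base pair, 8 bits per byte, 1 GB = 10^9 bytes, 5 GB for iOS8) come from the question.

```python
# Values taken from the question; the names are illustrative only.
BITS_PER_BP = 2        # 4 possible nucleotides -> 2 bits per base pair
BITS_PER_BYTE = 8
BYTES_PER_GB = 10**9
IOS8_GB = 5            # approximate size of iOS8 given in the question


def genome_bytes(base_pairs):
    """Bytes needed to store a genome at 2 bits per base pair."""
    return base_pairs * BITS_PER_BP / BITS_PER_BYTE


def ios_units(base_pairs):
    """Genome size expressed as a fraction of the iOS8 size."""
    return genome_bytes(base_pairs) / BYTES_PER_GB / IOS8_GB


human_diploid_bp = 6 * 10**9
b_sanfranciscus_bp = 4 * 10**6

print(f"Human diploid genome: {genome_bytes(human_diploid_bp):.1e} bytes")
print(f"B. sanfranciscus:     {genome_bytes(b_sanfranciscus_bp):.1e} bytes")
print(f"B. sanfranciscus:     {ios_units(b_sanfranciscus_bp):.1e} iOS units")
```

Under these assumptions the bacterial genome comes to about 10^6 bytes (1 MB), or roughly 2*10^-4 of the iOS8 size.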

Data storage across the whole organism

Some interesting questions could follow. For example, how much genetic data is stored in the human body? For simplicity's sake, let's ignore the microbiome (all non-human cells that live in our body) and focus only on the cells that make up our body. Estimates for the number of cells in the human body range between 10 trillion and 100 trillion. Let us take 100 trillion cells, the upper end of that range, as our estimate. So, given that each diploid cell contains 1.5 GB of data (this is very approximate, as I am only accounting for the diploid cells and ignoring the haploid sperm and egg cells in our body), the approximate amount of data stored in the human body is:

1.5 Gbytes * 100 trillion cells = 150 trillion Gbytes, or 150*10^12*10^9 bytes = 150*10^21 bytes = 150 zettabytes (1 zettabyte = 10^21 bytes)!
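As a quick check of that arithmetic, here is a minimal Python sketch. The helper name `body_zettabytes` is hypothetical, and the inputs are the rough figures used above (1.5 GB per diploid cell, 10 to 100 trillion cells), so the result is an order-of-magnitude estimate at best.

```python
BYTES_PER_DIPLOID_CELL = 1.5e9   # 1.5 GB per diploid cell, from the estimate above
BYTES_PER_ZETTABYTE = 1e21


def body_zettabytes(cell_count):
    """Total genomic data in the body, in zettabytes, for a given cell count."""
    return BYTES_PER_DIPLOID_CELL * cell_count / BYTES_PER_ZETTABYTE


# Lower and upper ends of the cell-count range quoted in the text.
for cells in (10e12, 100e12):
    print(f"{cells:.0e} cells -> about {body_zettabytes(cells):.0f} ZB")
```

The upper estimate of 100 trillion cells gives about 150 zettabytes; the lower estimate of 10 trillion cells would give about 15.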

(b) An open reading frame is the part of a gene that is given over to making protein, from the start codon to the stop codon. To work out the percentage, you need the sum total length of all the ORFs in the genome, divided by the total length of the genome (and multiplied by 100 to get a percentage value).

You've been given the total number of ORFs, the median length of the ORFs, and the average (arithmetic mean) length of the ORFs. The median (the value in the middle of the dataset when ranked in order of size) tells you about the distribution of ORF lengths, but is not relevant here. The mean is the total cumulative length of all ORFs divided by the total number of ORFs. Since we're looking for the total length of all ORFs, you get that by multiplying the number of ORFs by the mean. Note that the lengths are given in amino acids, and each amino acid is encoded by one codon of 3 bp, so multiply by 3 to convert to base pairs before comparing with the genome length.

So in summary, you take the mean ORF length, multiply it by the total number of ORFs to get the total length of all ORFs in the genome, convert that from amino acids to base pairs (3 bp per amino acid), divide by the total length of the genome, and multiply by 100.
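The recipe above can be sketched in Python using the numbers from the question (3,190 ORFs, mean length 360 aa, genome of 4 x 10^6 bp). The function name `coding_percentage` and the `include_stop` option are assumptions made for this illustration. The question does not say whether the stop codon should be counted, so both variants are shown.

```python
GENOME_BP = 4 * 10**6    # genome length from the question
NUM_ORFS = 3190          # number of predicted ORFs
MEAN_ORF_AA = 360        # mean protein length in amino acids
BP_PER_CODON = 3         # one amino acid is encoded by 3 bp


def coding_percentage(num_orfs, mean_aa, genome_bp, include_stop=False):
    """Percentage of the genome covered by ORFs.

    include_stop adds one codon per ORF for the stop codon; this is an
    assumption, since the question gives lengths in amino acids only.
    """
    codons_per_orf = mean_aa + (1 if include_stop else 0)
    coding_bp = num_orfs * codons_per_orf * BP_PER_CODON
    return 100 * coding_bp / genome_bp


without_stop = coding_percentage(NUM_ORFS, MEAN_ORF_AA, GENOME_BP)
with_stop = coding_percentage(NUM_ORFS, MEAN_ORF_AA, GENOME_BP, include_stop=True)

print(f"Coding fraction (no stop codons):   {without_stop:.2f}%")
print(f"Coding fraction (with stop codons): {with_stop:.2f}%")
```

Either way, roughly 86% of the genome encodes protein under the stated assumptions. The median of 270 aa is not used, as explained above.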
