Question
For each biological sequences task, (A) write the type of sequence analysis algorithm that would best accomplish the task, and (B) explain why:
i. Visualize a sequence inversion that occurred between two closely related viral strains.
ii. Given highly conserved ribosomal RNA sequences from five insect species, determine quantitatively which two species are the most closely related.
iii. Evaluate a new sequencing technology by comparing the results of a 300 bp read to a previous result based on Sanger sequencing. The beginning of the Sanger sequence is known to contain some “junk”; do not penalize for this.
iv. Identify a shared DNA binding domain in two otherwise unrelated proteins.
Explanation / Answer
i.
Detecting genes in viral genomes is a complex task. Because their genomes are constrained in length, viruses, and RNA viruses in particular, tend to code in overlapping reading frames. Since each amino acid is encoded by a triplet of nucleotides, up to three genes may be encoded simultaneously in one direction. Conventional hidden Markov model (HMM)-based gene-finding algorithms often have difficulty identifying multiple coding regions, since in general their topologies do not allow for overlapping or nested genes.
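The point about three forward reading frames can be illustrated with a minimal Python sketch. It is not an HMM gene finder and not how GeneMark works; it only scans each forward frame for ATG-to-stop open reading frames and reports pairs from different frames whose coordinates overlap. The function names and the toy sequence are hypothetical, chosen for illustration only.

```python
# Minimal sketch (assumption: standard start codon ATG and stop codons TAA/TAG/TGA).
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def forward_orfs(seq, min_codons=2):
    """Return (frame, start, end) for each ATG...stop ORF on the forward strand.

    Coordinates are 0-based; end is exclusive and includes the stop codon.
    min_codons counts the stop codon as well.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence codon by codon in this frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None
    return orfs

def overlapping_pairs(orfs):
    """Return pairs of ORFs that lie in different frames and share positions."""
    pairs = []
    for a in range(len(orfs)):
        for b in range(a + 1, len(orfs)):
            fa, sa, ea = orfs[a]
            fb, sb, eb = orfs[b]
            if fa != fb and sa < eb and sb < ea:
                pairs.append((orfs[a], orfs[b]))
    return pairs

if __name__ == "__main__":
    toy = "ATGGCATGCCCTAAATGA"  # hypothetical toy sequence, not real viral data
    found = forward_orfs(toy)
    print(found)
    print(overlapping_pairs(found))
```

In the toy sequence, an ORF in frame 0 spans the whole string while a shorter ORF in frame 2 sits inside it, which is the kind of nested arrangement the paragraph describes as hard for single-coding models.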
GeneMark relies on organism-specific recognition parameters to partition the DNA sequence into coding and non-coding regions and thus requires a sufficiently large training set of known genes from a given organism for best performance. The program has been repeatedly updated and modified and now exists in separate variants for gene prediction in prokaryotic, eukaryotic, and viral DNA sequences.
Approaching comparative gene finding in multiple-coding viral genomes from an HMM point of view has been deemed difficult and computationally expensive. Prior comparative HMM methodologies for viruses have used conventional single-coding methods to search through the genome in different reading frames.
ii.
RNA polymerase I synthesizes ribosomal RNA. It is found in the nucleolus, an organelle in which ribosomal RNA is synthesized. Further confirmation of this conclusion is the finding that only purified RNA polymerase I is capable of correctly initiating transcription of ribosomal RNA in vitro. RNA polymerase II is the polymerase responsible for most synthesis of messenger RNA. RNA polymerase III is less sensitive to α-amanitin than is RNA polymerase II, but it is sufficiently sensitive that its in vivo function can be probed.
The ribosomal RNA gene complexes were a convenient system for these measurements. Each of these seven nearly identical gene complexes consists of two closely spaced promoters, a gene for the 16S ribosomal RNA, a spacer region, a tRNA gene, the gene for the 23S ribosomal RNA, and the gene for the 5S ribosomal RNA.
iii.
Current second-generation sequencing (SGS) technologies produce read lengths ranging from 35 to 400 bp, at far greater speed and much lower cost than Sanger sequencing. However, as reads get shorter, coverage needs to increase to compensate for the decreased connectivity and produce a comparable assembly. Certain problems cannot be overcome by deeper coverage: If a repetitive sequence is longer than a read, then coverage alone will never compensate, and all copies of that sequence will produce gaps in the assembly.
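The relationship between read length and coverage can be made concrete with a little arithmetic. The sketch below assumes the usual definition of average coverage as (number of reads × read length) / genome length. The 5 Mb genome size and 30× target are hypothetical values chosen for illustration; the 35 bp and 400 bp read lengths come from the range quoted above.

```python
import math

def coverage(num_reads, read_len, genome_len):
    """Average depth of coverage: total sequenced bases divided by genome length."""
    return num_reads * read_len / genome_len

def reads_needed(target_cov, read_len, genome_len):
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_cov * genome_len / read_len)

def read_can_span(repeat_len, read_len, flank=1):
    """True if one read can cover the repeat plus `flank` unique bases on each side.

    If this is False, adding more reads of the same length cannot resolve the
    repeat, which is the situation the paragraph describes.
    """
    return read_len >= repeat_len + 2 * flank

if __name__ == "__main__":
    genome = 5_000_000  # hypothetical 5 Mb genome
    for read_len in (35, 400):
        print(read_len, reads_needed(30, read_len, genome))
    print(read_can_span(repeat_len=1000, read_len=400))
```

Equal average coverage does not imply an equally good assembly: the paragraph notes that shorter reads need more coverage to make up for lost connectivity, so the read counts here are a lower bound and not a recipe.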
In conventional Sanger sequencing, a “long” paired-end protocol starts with DNA templates ranging from 5000 to 35,000 bp. These fragments are cloned into a vector, which is then amplified in Escherichia coli prior to sequencing. The vectors are subsequently extracted and then both ends of the vector inserts are sequenced. One drawback to this traditional method is that the E. coli cloning step introduces a bias, making it difficult to capture some regions of a genome.
Conventional Sanger sequencing uses cloning steps that amplify the genome in E. coli, which does not amplify all sequences equally well. SGS technologies avoid cloning in E. coli, but they too seem to have biases. Therefore any genome sequenced with just one technology, regardless of the depth of coverage, is liable to contain gaps due to bias. One way to overcome these biases and to close many gaps is to generate deep coverage with two or more sequencing technologies.
iv.
They contain a large number of deoxyribonucleic acid (DNA) and protein sequences with their associated biological and bibliographical information. The DNA sequences consist of a string of nucleotides identified by the base they contain. DNA contains the complete genetic information that defines the structure, function, development, and reproduction of an organism. Genes are regions of DNA and proteins are the products of genes. DNA is usually found in the form of a double helix with two chains wound in opposite directions around a central axis.