
Bioinformatics is the application of computer science techniques to problems in


Question

Bioinformatics is the application of computer science techniques to problems in biology. Many bioinformatics applications deal with the processing of DNA. DNA, which contains the genetic codes for living things, can be represented by an incredibly long string of single characters (billions). This is possible because DNA is made up of a long chain of four basic chemicals called nucleotides: adenine, guanine, cytosine and thymine. For computational purposes, these are abbreviated A, G, C, and T. For example, a very short fragment of DNA could be represented by the string: ATGGCTATTGCTATTGATCGG.
Cells use this code to, among other things, create chemicals (proteins) that perform many tasks within your body. Therefore, certain sub-sequences of DNA are essentially a code for certain proteins. When a certain protein is needed by your body (say to help digest food or to make part of a new cell), a copy of a certain part of the DNA is used to create the correct protein.
Proteins are made from a sequence of amino acids, of which there are 20. Each three-nucleotide sequence in DNA (called a triplet) is a code for a specific amino acid. The table below shows the amino acid for each triplet (followed by its abbreviation).
Notice that three of the triplets (TAA, TAG and TGA) are not associated with any amino acids, but are labeled STOP in the table.
The amino acid sequence for a particular protein corresponds to a sequence of nucleotides in DNA. A DNA sequence for a particular protein always starts with the triplet ATG (the initiation), and always ends with either TAA, TAG or TGA (stops). Stops are not codes for any amino acids; they simply mark the end of the sequence for a particular protein. Here's an example of a DNA nucleotide sequence for a simple protein:
ATGGCAACGTGA
The protein code starts with ATG, then has triplets GCA and ACG, and is terminated by the stop triplet TGA. The amino acid sequence is (using the abbreviations): Met-Ala-Thr
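Since the question asks about substrings, here is a minimal sketch of how the example above could be cut into triplets with Java's String.substring(beginIndex, endIndex). The class name TripletDemo and the method triplets are made up for illustration and are not part of the assignment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, used only to illustrate substring().
class TripletDemo {
    // Splits a DNA string into consecutive three-character triplets,
    // starting at index 0. Leftover characters (fewer than 3) are ignored.
    static List<String> triplets(String dna) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 3 <= dna.length(); i += 3) {
            // substring(begin, end): begin is inclusive, end is exclusive
            result.add(dna.substring(i, i + 3));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(triplets("ATGGCAACGTGA")); // prints [ATG, GCA, ACG, TGA]
    }
}
```

Note that the end index is exclusive, so substring(0, 3) returns the characters at indexes 0, 1 and 2.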
DNA has lots of nucleotides that aren't used for anything, so finding a protein sequence means searching for the initiation triplet (ATG), then scanning triplets for one of the stops. For example, the sequence in the example above might be buried in the middle of several useless nucleotides, such as:
TGACCTAAGCATGGCAACGTGACTTTATCTCTGGATC
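The search described above (find ATG, then step through triplets until a stop) could be sketched roughly as follows. This is only one possible approach; the names CodingRegionDemo, isStop and firstCodingRegion are hypothetical, and the sketch simply gives up (returns null) if the first ATG it finds is never followed by a stop triplet.

```java
// Hypothetical helper class illustrating indexOf() plus a triplet scan.
class CodingRegionDemo {
    // True for the three stop triplets named in the problem statement.
    static boolean isStop(String triplet) {
        return triplet.equals("TAA") || triplet.equals("TAG") || triplet.equals("TGA");
    }

    // Returns the nucleotides from the first ATG at or after 'from'
    // through the first stop triplet in the same reading frame,
    // or null if no ATG or no stop is found.
    static String firstCodingRegion(String dna, int from) {
        int start = dna.indexOf("ATG", from); // -1 when ATG is absent
        if (start < 0) {
            return null;
        }
        for (int i = start + 3; i + 3 <= dna.length(); i += 3) {
            if (isStop(dna.substring(i, i + 3))) {
                return dna.substring(start, i + 3); // include the stop triplet
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String dna = "TGACCTAAGCATGGCAACGTGACTTTATCTCTGGATC";
        System.out.println(firstCodingRegion(dna, 0)); // prints ATGGCAACGTGA
    }
}
```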
A single piece of DNA may contain several protein sequences too. For example,
TACGTCTAGATTTAACAGATATAACATGAACTTCAGTTCTTAATGAATCAATGCCCTTGGGGCTACGTGAATAGCGGCTGAATTGAC
contains the two amino acid sequences (which probably aren't real proteins):
Met-Asn-Phe-Ser-Ser
Met-Pro-Leu-Gly-Leu-Arg-Glu
Notice that the initiation triplet does not necessarily start at a multiple of three nucleotides from the beginning of the sequence or from the end of a previous sequence.


TTT Phenylalanine (Phe)  TCT Serine (Ser)     TAT Tyrosine (Tyr)       TGT Cysteine (Cys)
TTC Phenylalanine (Phe)  TCC Serine (Ser)     TAC Tyrosine (Tyr)       TGC Cysteine (Cys)
TTA Leucine (Leu)        TCA Serine (Ser)     TAA (STOP)               TGA (STOP)
TTG Leucine (Leu)        TCG Serine (Ser)     TAG (STOP)               TGG Tryptophan (Trp)
CTT Leucine (Leu)        CCT Proline (Pro)    CAT Histidine (His)      CGT Arginine (Arg)
CTC Leucine (Leu)        CCC Proline (Pro)    CAC Histidine (His)      CGC Arginine (Arg)
CTA Leucine (Leu)        CCA Proline (Pro)    CAA Glutamine (Gln)      CGA Arginine (Arg)
CTG Leucine (Leu)        CCG Proline (Pro)    CAG Glutamine (Gln)      CGG Arginine (Arg)
ATT Isoleucine (Ile)     ACT Threonine (Thr)  AAT Asparagine (Asn)     AGT Serine (Ser)
ATC Isoleucine (Ile)     ACC Threonine (Thr)  AAC Asparagine (Asn)     AGC Serine (Ser)
ATA Isoleucine (Ile)     ACA Threonine (Thr)  AAA Lysine (Lys)         AGA Arginine (Arg)
ATG Methionine (Met)     ACG Threonine (Thr)  AAG Lysine (Lys)         AGG Arginine (Arg)
GTT Valine (Val)         GCT Alanine (Ala)    GAT Aspartic Acid (Asp)  GGT Glycine (Gly)
GTC Valine (Val)         GCC Alanine (Ala)    GAC Aspartic Acid (Asp)  GGC Glycine (Gly)
GTA Valine (Val)         GCA Alanine (Ala)    GAA Glutamic Acid (Glu)  GGA Glycine (Gly)
GTG Valine (Val)         GCG Alanine (Ala)    GAG Glutamic Acid (Glu)  GGG Glycine (Gly)
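One way to hold this table in a Java program is a Map from triplet to abbreviation. The sketch below fills the map in a loop rather than with 64 separate put calls; it relies on the abbreviations being listed in T, C, A, G order for the first, second and third nucleotide. The class name CodonTable and the method aminoAcid are assumptions made for this example only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical lookup table: triplet -> three-letter abbreviation or "STOP".
class CodonTable {
    private static final String BASES = "TCAG";

    // Abbreviations ordered by first, then second, then third nucleotide,
    // each running through T, C, A, G (so the list starts TTT, TTC, TTA, TTG, TCT, ...).
    private static final String[] NAMES = (
          "Phe Phe Leu Leu Ser Ser Ser Ser Tyr Tyr STOP STOP Cys Cys STOP Trp "
        + "Leu Leu Leu Leu Pro Pro Pro Pro His His Gln Gln Arg Arg Arg Arg "
        + "Ile Ile Ile Met Thr Thr Thr Thr Asn Asn Lys Lys Ser Ser Arg Arg "
        + "Val Val Val Val Ala Ala Ala Ala Asp Asp Glu Glu Gly Gly Gly Gly").split(" ");

    static final Map<String, String> TABLE = new HashMap<>();

    static {
        int n = 0;
        for (char first : BASES.toCharArray()) {
            for (char second : BASES.toCharArray()) {
                for (char third : BASES.toCharArray()) {
                    TABLE.put("" + first + second + third, NAMES[n++]);
                }
            }
        }
    }

    // Returns the abbreviation for a triplet, "STOP" for a stop triplet,
    // or null if the triplet contains anything other than A, C, G, T.
    static String aminoAcid(String triplet) {
        return TABLE.get(triplet);
    }
}
```

For example, CodonTable.aminoAcid("GCA") returns "Ala" and CodonTable.aminoAcid("TGA") returns "STOP".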

For this lab you are to:
1. Write a class definition called ProteinFinder as outlined below:

Method Name      Description                                                                Returns
ProteinFinder()  Constructor                                                                ProteinFinder
nextProtein()    Finds and returns the next protein in the DNA sequence                     java.lang.String
position()       Returns the position in the DNA sequence where the last protein was found  int
setDNA(String)   Places a DNA sequence into the object                                      void
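To show how the four methods could fit together, here is a rough sketch of such a class. It is not a definitive solution and makes several assumptions the assignment does not state: position() is a 0-based index of the ATG that started the last protein (-1 before any is found), nextProtein() returns null when no further protein exists, the search resumes right after the previous stop triplet, and the protein is returned as abbreviations joined by hyphens.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only; the class is left non-public so it can sit in any source file.
class ProteinFinder {
    private static final Map<String, String> CODONS = new HashMap<>();

    static {
        // Abbreviations in T, C, A, G order for each nucleotide position.
        String bases = "TCAG";
        String[] names = (
              "Phe Phe Leu Leu Ser Ser Ser Ser Tyr Tyr STOP STOP Cys Cys STOP Trp "
            + "Leu Leu Leu Leu Pro Pro Pro Pro His His Gln Gln Arg Arg Arg Arg "
            + "Ile Ile Ile Met Thr Thr Thr Thr Asn Asn Lys Lys Ser Ser Arg Arg "
            + "Val Val Val Val Ala Ala Ala Ala Asp Asp Glu Glu Gly Gly Gly Gly").split(" ");
        int n = 0;
        for (char a : bases.toCharArray())
            for (char b : bases.toCharArray())
                for (char c : bases.toCharArray())
                    CODONS.put("" + a + b + c, names[n++]);
    }

    private String dna = "";
    private int searchFrom = 0;    // index where the next search begins
    private int lastPosition = -1; // assumed meaning: index of the last ATG used

    public ProteinFinder() { }

    // Stores a new sequence and restarts the search from the beginning.
    public void setDNA(String dna) {
        this.dna = dna.toUpperCase();
        searchFrom = 0;
        lastPosition = -1;
    }

    // Returns the next protein such as "Met-Ala-Thr", or null if none is left.
    public String nextProtein() {
        int start = dna.indexOf("ATG", searchFrom);
        while (start >= 0) {
            StringBuilder protein = new StringBuilder();
            for (int i = start; i + 3 <= dna.length(); i += 3) {
                String name = CODONS.get(dna.substring(i, i + 3));
                if (name == null) {
                    break; // not a valid triplet: give up on this ATG
                }
                if (name.equals("STOP")) {
                    lastPosition = start;
                    searchFrom = i + 3; // resume after the stop triplet
                    return protein.toString();
                }
                if (protein.length() > 0) {
                    protein.append('-');
                }
                protein.append(name);
            }
            // No stop found in this reading frame; try the next ATG.
            start = dna.indexOf("ATG", start + 1);
        }
        searchFrom = dna.length();
        return null;
    }

    public int position() {
        return lastPosition;
    }
}
```

A typical use would be to call setDNA once with the user's sequence and then call nextProtein() in a loop until it returns null, reading position() after each successful call. With the two-protein example from the question, this sketch yields Met-Asn-Phe-Ser-Ser at position 25 and then Met-Pro-Leu-Gly-Leu-Arg-Glu at position 50.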

2. Write a class (program) with a GUI that lets a user enter DNA sequences and then finds and displays all of the amino acid sequences they contain.

I'm confused and could use an expert's help; it's in Java.
My first question is: how do I take a substring of a string?
How do I use all of the methods in the ProteinFinder class?

Explanation / Answer

Sequence comparison is possibly the most useful computational tool to emerge for molecular biologists. The World Wide Web has made it possible for a single public database of genome sequence data to provide services through a uniform interface to a worldwide community of users. With a commonly used computer program called BLAST, a molecular biologist can compare an uncharacterized DNA sequence to the entire publicly held collection of DNA sequences. In the next section, we present an example of how sequence comparison using the BLAST program can help you gain insight into a real disease.