Academic Integrity: tutoring, explanations, and feedback — we don’t complete graded work or submit on a student’s behalf.

The purpose of this project is to write a program to compute the optimal sequenc

ID: 3727918 • Letter: T

Question

The purpose of this project is to write a program to compute the optimal sequence alignment of two DNA strings. This program will introduce you to the emerging eld of computational biology in which computers are used to do research on biological systems. Further, you will be introduced to a powerful algorithmic design paradigm known as dynamic programming.
Biology Review A genetic sequence is a string formed from a four-letter alphabet Adenine (A), Thymine (T), Guanine (G), Cytosine (C) of biological macromolecules referred to together as the DNA bases. A gene is a genetic sequence that contains the information needed to construct a protein. All of your genes taken together are referred to as the human genome, a blueprint for the parts needed to construct the proteins that form your cells. Each new cell produced by your body receives a copy of the genome. This copying process, as well as natural wear and tear, introduces a small number of changes into the sequences of many genes. Among the most common changes are the substitution of one base for another and the deletion of a substring of bases; such changes are generally referred to as point mutations. As a result of these point mutations, the same gene sequenced from closely related organisms will have slight dierences.
The Problem Through your research you have found the following sequence of a gene in a previously unstudied organism.
A A C A G T T A C C
What is the function of the protein that this gene encodes? You could begin a series of uninformed experiments in the lab to determine what role this gene plays. However, there is a good chance that it is a variant of a known gene in a previously studied organism. Since biologists and computer scientists have laboriously determined (and published) the genetic sequence of many organisms (including humans), you would like to leverage this information to your advantage. We’ll compare the above genetic sequence with one which has already been sequenced and whose function is well understood.
T A A G G T C A
If the two genetic sequences are similar enough, we might expect them to have similar functions. We would like a way to quantify “similar enough.”
Edit Distance In this assignment we will measure the similarity of two genetic sequences by their edit distance, a concept rst introduced in the context of coding theory, but which is now widely used in spell checking, speech recognition, plagiarism detection, le revisioning, and computational linguistics. We align the two sequences, but we are permitted to insert gaps in either sequence (eg, to make them have the same length). We pay a penalty for each gap that we insert and also for each pair of characters that mismatch in the nal alignment. Intuitively, these penalties model the relative likeliness of point mutations arising from deletion/insertion and substitution. We produce a numerical score according to the following table, which is widely used in biological applications:
operation cost insert a gap 2 align two characters that mismatch 1 align two characters that match 0
Here are two possible alignments of the strings x = ’AACAGTTACC’ and y = ’TAAGGTCA’:
x y cost x y cost ------------ -----------A T 1 A T 1 A A 0 A A 0 C A 1 C - 2 A G 1 A A 0 G G 0 G G 0 T T 0 T G 1 T C 1 T T 0 A A 0 A - 2 C - 2 C C 0 C - 2 C A 1 --- --8 7

The rst alignment has a score of 8, while the second one has a score of 7. The edit distance is the score of the best possible alignment between the two genetic sequences over all possible alignments. In this example, the second alignment is in fact optimal, so the edit distance between the two strings is 7. Computing the edit distance is a nontrivial computational problem because we must nd the best alignment among exponentially many possibilities. For example, if both strings are 100 characters long, then there are more than 1075 possible alignments.
We will explain a recursive solution which is an elegant approach. However it is far too inecient because it recalculates each subproblem over and over. Once we have dened the recursive denition we can redene the solution using a dynamic programming approach which calculates each subproblem once.
A Recursive Solution We will calculate the edit distance between the two original strings x and y by solving many editdistance problems on smaller suxes of the two strings. We use the notation x[i] to refer to character i of the string. We also use the notation x[i..M] to refer to the sux of x consisting of the characters x[i], x[i + 1], ..., x[M - 1]. Finally, we use the notation opt[i][j] to denote the edit distance of x[i..M] and y[j..N]. For example, consider the two strings x = ’AACAGTTACC’ and y = ’TAAGGTCA’ of length M = 10 and N = 8, respectively. Then, x[2] is ’C’, x[2..M] is ’CAGTTACC’, and y[8..N] is the empty string. The edit distance of x and y is opt[0][0].
Now we describe a recursive scheme for computing the edit distance of x[i..M] and y[j..N]. Consider the rst pair of characters in an optimal alignment of x[i..M] with y[j..N]. There are three possibilities:
1. The optimal alignment matches x[i] up with y[j]. In this case, we pay a penalty of either 0 or 1, depending on whether x[i] equals y[j], plus we still need to align x[i + 1..M] with y[j + 1..N]. What is the best way to do this? This subproblem is exactly the same as the original sequence alignment problem, except that the two inputs are each suxes of the original inputs. Using our notation, this quantity is opt[i + 1][j + 1].
2. The optimal alignment matches the x[i] up with a gap. In this case, we pay a penalty of 2 for a gap and still need to align x[i + 1..M] with y[j..N]. This subproblem is identical to the original sequence alignment problem, except that the rst input is a proper sux of the original input.
3. The optimal alignment matches the y[j] up with a gap. In this case, we pay a penalty of 2 for a gap and still need to align x[i..M] with y[j + 1..N]. This subproblem is identical to the original sequence alignment problem, except that the second input is a proper sux of the original input.
The key observation is that all of the resulting subproblems are sequence alignment problems on suxes of the original inputs. To summarize, we can compute opt[i][j] by taking the minimum of three quantities:
opt[i][j] = min{opt[i + 1][j + 1] + 0 or 1, opt[i + 1][j] + 2, opt[i][j + 1] + 2
This equation works assuming i < M and j < N. Aligning an empty string with another string of length k requires inserting k gaps, for a total cost of 2k. Thus, in general we should set opt[M][j] = 2(N - j) and opt[i][N] = 2(M - i). For our example, the nal matrix is:
| 0 1 2 3 4 5 6 7 8 xy | T A A G G T C A ----------------------------------0 A | 7 8 10 12 13 15 16 18 20 1 A | 6 6 8 10 11 13 14 16 18 2 C | 6 5 6 8 9 11 12 14 16 3 A | 7 5 4 6 7 9 11 12 14 4 G | 9 7 5 4 5 7 9 10 12 5 T | 8 8 6 4 4 5 7 8 10 6 T | 9 8 7 5 3 3 5 6 8 7 A | 11 9 7 6 4 2 3 4 6 8 C | 13 11 9 7 5 3 1 3 4 9 C | 14 12 10 8 6 4 2 1 2 10 - | 16 14 12 10 8 6 4 2 0
By examining opt[0][0], we conclude that the edit distance of x and y is 7.

The purpose of this project is to write a program to comuute the optimal meewo alissa program will introduce you to the emerging field of compulational biolsey in which cotes a biological systems. Further, you will be introduced to a powerful algorithule dslg This Biology Review A genctic sequence is a string formed from a four letiter alphabet Adeve (A) Cytosine (C) of biological macromolecules referred to together as the DNA baes A en the information needed to construct a protein. All of your genes taken togethes are solbers blueprint for the parts needed to construct the proteins that form your cells. Eash wew to as the n goneea ceoll jproduced by your beolty w sequences of many genes. Among the most comoon changes are the substitutih of a substring of bases; such changes are generally referred to as point mulations same gene sequenced from closely related organistms will have slight diffe copying process, as well as natural woar and tear·introduce.. small numher of chane" thu of one bse for another and the deletion As a resalt of these polet wistathons, the The Problem What is the function of the protein that this gene encodes? You could begin a series of usinformed experiments tn the lab to determine what role this gene plays. However, there is a good chance that it is a variant of a knwwn gne tn studied organism. Since biologists and computer scientists have laborionsly determined (and pablished) the genetie sequence of many organisms (including humans), you would like to leverage this information to your advantage We 1l compare the above genetic sequence with one which has already been sequenced and whowe function is well understos If the two genetic sequences are similar enough, we might expect then to have snuilar functions We woul ke a ww to quantify "similar enough. Edit Distance In this assignment we will measure the similarity of two genetic sequenos by uhrir edu 1anm a ereeqt first introduced in the context of coding theory, but which is now widely used in spell checking, speech recognition plagiaris detection, file revisioning, and computational linguistics. We align the two sequences, but we are permitted to imsert gops in either sequence (eg, to make them have the same length). We pay a penalty for each gap that we insert and also fo each pair of characters that mismatch in the final alignment. Intuitively, these penalties model the relative ikeliness of point mutations arising from deletion/insertion and subetitution. We produce a munmerical score accurndling to the Rdlowing table which is widely used in biological applications: n gap align two characters that mismatch align two characters that match Here are two possible alignments of the strings x-Aarrace, and y·maorar: Ycost Xycost

Explanation / Answer

Alignment.py
-------------------------------------------------
import stdarray
import stdio

# Read x, y, and opt from standard input.
x = stdio.readLine()
y = stdio.readLine()
waste = stdio.readLine()
M, N = len(x), len(y)

# Initializes opt:
opt = stdarray.create2D(M + 1, N + 1, 0)
for i in range(0, M + 1):
    for j in range(0, N + 1):
        opt[i][j] = stdio.readInt()

# Write edit distance between x and y.
stdio.writef('Edit distance = %d ', opt[0][0])

# Recover and write an optimal alignment.
i, j = 0, 0
while i < M or j < N:
    if i == M or j == N:
        if i > j:
            stdio.writef('- %s 2 ', y[j])
        elif j > i:
            stdio.writef('%s - 2 ', x[i])
        break
    if opt[i][j] == opt[i + 1][j] + 2:
        stdio.writef('%s - 2 ', x[i])
        i += 1
    elif opt[i][j] == opt[i][j + 1] + 2:
        stdio.writef('- %s 2 ', y[j])
        j += 1
    elif x[i] == y[j]:
        stdio.writef('%s %s 0 ', x[i], y[j])
        i += 1
        j += 1
    else:
        stdio.writef('%s %s 1 ', x[i], y[j])
        i += 1
j += 1
--------------------------------------------------------------------------------------------------
edit_distance.py
---------------------------------
import stdarray
import stdio

# Read x and y.
x = stdio.readLine()
y = stdio.readLine()

# Create (M + 1) x (N + 1) matrix opt with elements initialized to 0, where
# M and N are lengths of x and y respectively.
M, N = len(x), len(y)
opt = stdarray.create2D(M + 1, N + 1, 0)

# Initialize opt[M][j] (j < N) and opt[i][N] (i < M) to appropriate values.
for i in range(0, M + 1):
    opt[i][N] = 2 * (M - i)
for j in range(0, N + 1):
    opt[M][j] = 2 * (N - j)

# Compute the rest of opt.
for i in range(M - 1, -1, -1):
    for j in range(N - 1, -1, -1):
        if x[i] == y[j]:
            opt[i][j] = min(opt[i + 1][j + 1], opt[i + 1][j] + 2,
                            opt[i][j + 1] + 2)
        else:
            opt[i][j] = min(opt[i + 1][j + 1] + 1, opt[i + 1][j] + 2,
                            opt[i][j + 1] + 2)
# Write x, y, dimensions of opt, and opt.
stdio.writeln(x)
stdio.writeln(y)
stdio.writeln(str(M + 1) + ' ' + str(N + 1))
for i in range(0, M + 1):
    for j in range(0, N + 1):
        if j == N:
            stdio.writef('%3d ', opt[i][j])
        else:
stdio.writef('%3d ', opt[i][j])
-------------------------------------------------------------------------------------------
run_tests.py
----------------------------------
mport datetime, os, signal, subprocess, sys, time, unittest

def run(command, stdin = None, timeout = 30):
    """
    Runs the specified command using specified standard input (if any) and
    returns the output on success.
    """

    start = datetime.datetime.now()
    process = subprocess.Popen(command.split(),
                               stdin = subprocess.PIPE,
                               stdout = subprocess.PIPE,
                               stderr = subprocess.STDOUT)
    if not stdin is None:
        process.stdin.write(stdin)
        process.stdin.close()
    while process.poll() is None:
        time.sleep(0.1)
        now = datetime.datetime.now()
        if (now - start).seconds > timeout:
            os.kill(process.pid, signal.SIGKILL)
            os.waitpid(-1, os.WNOHANG)
            return "__TIMEOUT__"
    return process.stdout.read().strip()

class Problem1(unittest.TestCase):

    def test1(self):
        command = """python edit_distance.py"""
        sought = """AACAGTTACC
TAAGGTCA
11 9
7   8 10 12 13 15 16 18 20
6   6   8 10 11 13 14 16 18
6   5   6   8   9 11 12 14 16
7   5   4   6   7   9 11 12 14
9   7   5   4   5   7   9 10 12
8   8   6   4   4   5   7   8 10
9   8   7   5   3   3   5   6   8
11   9   7   6   4   2   3   4   6
13 11   9   7   5   3   1   3   4
14 12 10   8   6   4   2   1   2
16 14 12 10   8   6   4   2   0"""
        got = run(command, open("input.txt", "r").read())
        self.assertNotEquals(got, "__TIMEOUT__")
        self.assertEquals(got, sought)

class Problem2(unittest.TestCase):

    def test1(self):
        command1 = "python edit_distance.py"
        got = run(command1, open("input.txt", "r").read())
        self.assertNotEquals(got, "__TIMEOUT__")
        command2 = "python alignment.py"
        sought = """Edit distance = 7
A T 1
A A 0
C - 2
A A 0
G G 0
T G 1
T T 0
A - 2
C C 0
C A 1"""
        got = run(command2, got)
        self.assertNotEquals(got, "__TIMEOUT__")
        self.assertEquals(got, sought)

if __name__ == "__main__":
    unittest.main()

Hire Me For All Your Tutoring Needs
Integrity-first tutoring: clear explanations, guidance, and feedback.
Drop an Email at
drjack9650@gmail.com
Chat Now And Get Quote