
python 3.6 The file dna.txt contains 200 different strings, each on a different


Question

Python 3.6

The file dna.txt contains 200 different strings, each on a different line, each with 100 letters.

Define the dissimilarity between two such strings as the number of positions in which they differ. For example, the dissimilarity between ACTCAAG and CATCGAA is 4, since the two strings agree only in their third, fourth and sixth letters, and differ in the remaining 4 letters.
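
The definition above can be checked with a short sketch. The function name dissimilarity is an assumption made for illustration and is not part of the problem statement; the two strings are the ones from the example.

```python
def dissimilarity(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    # zip pairs up the letters position by position
    return sum(1 for x, y in zip(a, b) if x != y)

print(dissimilarity("ACTCAAG", "CATCGAA"))  # prints 4
```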

Out of the 200 strings in dna.txt, find the two strings that have the least dissimilarity between them. You may take for granted the fact that there is a unique minimal dissimilarity pair.

Report both the strings themselves, which lines they are on in the file, and the dissimilarity between them. Note that they will probably not be consecutive lines in the file, so you’ll have to compare each of the 200 strings to every other string.

AAGAGACACCCAACTTATGGGAGTTTCGGAAGTCTTCCCCGTCACACCATTGAACTCGTAGATATACTTCATTTATCCCCCCGATTGTTAGCGTTACACG
TGATGTAAATGTTTTCTAGGCAAGGTTGGCTAGTGATTACACCAGTCCATCTCTCCTTTCCTAACGTAGACCGTGCAAAACCCTTCCTCATGGACAACCG
TGGAGCCCATTTAACACCCACTTTGATACTTCCATATATGAGTAACCGAAGCCTATTTTTGTAATCGTCTTTCGGTCCGGCGAGTTAAATCATTACTGTT
TTATAGGGACTGGTTTGAAGGACCGTAGTACGCACTCTGCACGCAATGAGAAGTAGATCAACACGCAACACCTTGCGTGCGTACTAATTTGAGTACGTGT
TGCTCATTCCACTGGTGGCGCTTTTCCAAGTGAATGAACATCCACTCTACGTGGCAAAGTACGCTGCTTTTTCGCGATCGAAATTTGCACTGCGAAACCA
ATCGAAAGATTAAGCGTGTTGCCGCATGGCTGCCGGTAGATTCGCGAAATAGCGCATTCGCTATCCTGAAGCAGTGTTCAATGGTGTTGGAACGTGGGAC
CTTCTGAAGGTTCCATATCCTACTAACTCTCCCGCCCTACATCCATCATATTCGCCGACCTGGCTAGCTAGTTTTGTCTCCCAGCTAAATGCATGGACGT
AGTCCGATCTATTAAAGTAGGCAAGCGCAATGTGCCATATACTCCTGACGAATAAAAAGGTCGGGGGTACATGACTTGGTAGCCTCACCACCCTAGGGAA
GAGTCTAAGTTCGCGGACATGCGATTCGGGGCGCAAAACACCTTACATATATATTAGAGAGCGCTATGCAATGCTGGGGACTGAGCCTACTCTGTGGCGT
ATCTAGACAGTTGATCTCTACGACATCTTCAATAGTGTCTAATGGTCAACGGTTAGATTACCCAGTCTGACTTCCTTGCGTCCACCTCCCTAAAGAAATA
AGGACCAGTACCTGGCAACCACCTCCCGAGTTTACAGCATCAGACTCCTGCATGAAAAAGCGCTCGCACAAACCGATACTAGGCTAATAGGTATGGTGTT
CGGCGCAGAACCGAAGAAGCTGTTGATAACTGGATTCGTCCAGTCTTTAATTGTTGTTTTGTAATACTCTCTAAACACAATAATTACCCTTTCAGTATGT
ACTTATGCTTGCAGTCGTCCCATAGGGAGACTTACAACGTTGTACTGGGGTAAATCAGCTGGCATATCCGCGATTAAAATGGGCGACGGCAGTTTACTAG
CCACTCAACACAGCCCCACTGGTAGAACCCACGGGGTAACAGAAGGTTGTTACCATACCGTGACTGCGATAAGAAGGTCTATGAAATCCATCCCCGATAT
ACAGTCATAAGAGCTCTGGTCCAGGGAGCTGTTTCGGACCAGCTCCTCATATGACTCACTCGCTAGAAACTTTCCATTTACGATGTCCATCTCGGTTTGA
ACCTGCATAGTAAGGGTTGCGCTACTTTCTACCACTCGGCCCAACTGTGACTAGCGACGCTGAGGCATCAGTCACTGGACTGCCGAGTGGTATATCTTTA
TGATTCCTATTGATGGGAAAAATTACGGGGAGACACGGACCCAGTATGACTCGCACTGTCGAAGGCTTGCCAGTGATAAGACTCATAAACATGAGCCGAA
CGGAGTTCGTAAGTACTTTAATTGCCACAATATTGGCACCGTGGTTCGTAGAGTCCGTATTGAATTGGGCCCTTGTCGGGATCTACTAGGGTGCGATTAC
GATAGGGCCACCGTTATTGGTGTTCCCGAACGTAGGCTGGTGCATGTAGCCGCCTGACCGAGCCAACTAAACGGACATAATGTCTTAACTGTAGTTGTAG
TGTACAAGGTCCTGGCGGAACTGCAGCGCGGTGTATGAAACCAACTGGACTTCCCACTGTTGGAGAGGTACTCTGATAGAACCTAGATAGCACTGCTGAC
GACGCAGTTCTTGGGCTTAAGCTACGCAAGTAGGCCAACGGGGGGTCATAGAGAAGCGTAGGACTAACGCTACTCAGGATCATCTCGATAACGTACTTTT
CATCGAAATGTTTAGTGATTATTGAACATTACCCTGCTTTTTTCGTGGCTCTGCAATTAGTATCTTTAAGCTATCTCTTATGCGTAACTGCAGACGTATG
ACGGTCGAGCGGAGTTGGGGTCCCGCTGCCGGCTAACCCTTGTACGAGGTATTCGGCATAACCCTGCAATGATATCATACATCGAGCGTCCCCAGATTTC
CACTCTCTATGAAAATCCTGTGCCAGCGCGCCGAGTTTGCCGTAAATGCCACCTGAAACGCAGCTTCTTCTAAGAAAATGCTGACGAAACGCGTTCCAGA
ATATGCGGAGCTGCCCTATTGGCGAACGGATAACAGTGCCGCGTATGTCACAGGTATTAGCCCCTCCCCCTGCTGAAACATGTAATTATATACTCGCAGT
GTGCGGTAAATGAAACCGTACCTCGGTTACCCTCGGAGAAAAGTGTGCCAGGTCTACGAATCCGAACTGTGGTCTAATTCTCTTTTGTTTCAGTACTAAA
CGTTATTATAGGGTCGGCGTACGTGCTATAACAAGCTATGGTATGCCACGATTTATGATTATGAACCAATCCAACAATGCCCGACTTGGTATCGCGCTAC
TGTCTTTGTTGCCCGCCGACGCAAGTAATATGACTCACAACACGGATCTATGCTCGTCCCGTTGGACGCTTTGCTATCGCTTTGTTATAAGTCCATCACC
ACTTTGGGTGACGTAGGGATACTTGAGGTGATGCTACAGATGGATCTTTTATTGTTATCGTTGACACCTTTTAACGTCTAGATCTGTTAAAGTGCAGCGC
CCTACCCTTAGGTGGGTCGCACACAATTATGGATTGTCATGATAACGCCATGACTCCTCGAGCAGGAGTACGATGAGTTTTAGAACGCGCGAGGACCATG
GAATGCAACGGGTACCATGCAGGTGCTTTTGGGGGCCTCGGGGTAACCTAGGTGTCATTATCCCCGCCTGCTACCTATCGAACGAGGACGACGGAGTTGC
TTTCCAAAATACTTATCCCCGTCAGAATAATGCGAACGGATCACATTTCACTTGACAGTCTGCTCTTCTTCTTGCGTGTACGTATGATCTCCCGGATAGA
TATACACATTGGCCATCCTTCGACCTTACTTTCGCTCAGCGAATCCGTGTCATACAAATACCCCCGCCGTGCCTGGTCTTCTTCGAGCGAATTCGGTAGC
ATGACTCGCAATGTGCGTTGCACCAGTGGTATCCGTTATTATTCCATGCTGAGTGCAACGCAGGGTGATTGCCAACCCACCAACGAGCTCAAGGAGTCGT
GTGCGAAACACAACGTCGCGCCGGTCGCGTGTATAAACACTGCGCAGGTGAGCGATCACACCCAGCTAATCCACGTCGAAAGAACGCCGTTACGATAACA
GTTTATACGAACTCTCGGGTAATCGTCTAGACGGCGGCTTGTAGAATTGAATTATGTATTTAACCGTCCTAGAGCATGGCCGACGGTACACAAGGTTGAT
TTGTAAACTTCATTGCCGGATCGGAATTGGAAACACTAAACTTTAATTAGGGCCTCTGAACGCGTGAACGTCGCATGGCGGGATAGTGGCCGTATGAAGG
GAGACGATGAAGTATGTAAGTCACATTGCTTTTGATGGTAGGACAACTCGCGCACTAAGAACTAATATTACGCAATTGCTTTGTACGCACGTATGCTTGG
CACCCGGGCCTAGGTAAAATGATGGGAGGTCGGTATGAGTTTAGGTTCGACTCTAGCGGCCCTGTCGGCGGCCCCTTCCACGGTGGGGCACGGAGGCATG
TGTGGTTTAAGAGCGTAGTTGTTTCCGCGCTTTCCTAACGTACCGATCATGATACCGGAGTGCTCTGTGCCCACAGGGAAACGTCGAGGCACGTCTCCTG
AAATCCCCGACAAGACCGTTCGATCGACTGTTTCCCTAGCAGACGATTTATCGTGAACACCCCAAAGATTGTCGTGGGGAATTCACGTTAGCCCCAGATT
AAGGTGGTTCTGCACTCCACGTTATAGCTCGCGCCTGCAATAGTGCCATACACGCTACCACTGAGCACTAGACCAGGCCACGTGCCGAGTGATAGTGGTG
TTTTGAACGAGACACTTCTGTCGGACGGCGATGCCGGTGCGCAAGTGGGAGAATCCGGTTCGATGCAACGTAACCGGCTCAGCTGCGACTAGAAGGTTCC
CAACCTCCCGGTTAGTGCTATTGTCCAGATTATATCTTGCATCGGATCTTACAGACCTAAAGAGTCGAGACAGGAGACTTCTGCTCAGGCGTCAATAAGC
AAGTGGCCTAGGTTCTGCATCAGGAAGCGCCGTGTAGACTTAGGCATCGGCACGATGCAACCGGTGGTTGCCGCAGTTTCGGATTGATGATCTTTTCCGT
GTTGTAGACGAAAATTGGTGGATACCCGCGCTTAGTTGTGCTGTACACGTGCTGACAGGATTACTACTGACCTGTTGCTTCACTCCTGCTTTCGAAAAGC
CGCCGTGATAAGTATACACGCCGCGACTGCGGATCAGTGGGTATTGACCATCTATACTCAAAGTTTATCCATGGTCGTGCTTGCTTCCGGTTGACACTAG
GCGGAATATTATTGGCCCAGCAGGCATGGCCGAATTCCCGATCTACCTATCTACCTTATACCTCTATGGCCGGATTCGCCATCGCTCATCCCGAAATGTC
ACAAAAAGGATCCGTTCTTGTGACTCGCGGCTAAGGGATAACCCTGACTTCACCCGGGATCGTCTCTAACAGTATAACGTGTGGGTCGACTTGTCCTCGT
TGCGGTATTGGACGATATCCTAGATACAGATACGGTGGCAGGTCCCTCGCCAACTTAGAGGTGACTATAAGAGCTCGAGAGCTCTACGGTTTAGGTAATA

The file has 200 lines of 100 letters each; only the first 50 lines are shown above.

Explanation / Answer

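# Save the 200 strings as dna.txt in the same folder as this script, then run the code.

with open('dna.txt') as f:
    content = f.readlines()
# Remove the whitespace (including the newline) at the end of each line
content = [x.strip() for x in content]
total_strings = len(content)
string1 = ''
string2 = ''
line1 = 0
line2 = 0

# Start above any possible dissimilarity so the first pair always replaces it
min_diff = float('inf')
for i in range(total_strings):
    first_string = content[i]
    # Start j at i+1 so each pair is compared exactly once
    for j in range(i + 1, total_strings):
        second_string = content[j]
        # Count the positions at which the two strings differ
        diff = 0
        for k in range(len(first_string)):
            if first_string[k] != second_string[k]:
                diff = diff + 1
        # Remember the closest pair so far, with 1-based line numbers
        if diff < min_diff:
            min_diff = diff
            string1 = first_string
            string2 = second_string
            line1 = i + 1
            line2 = j + 1
print("The first string is: " + string1 + " (line " + str(line1) + ")")
print("The second string is: " + string2 + " (line " + str(line2) + ")")
print("The difference is: " + str(min_diff))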
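
For comparison, the same all-pairs search can be written more compactly with itertools.combinations. This is only a sketch: the names dissimilarity and closest_pair and the four-string sample list are assumptions made for illustration, not data from dna.txt. With 200 strings there are 200*199/2 = 19,900 pairs to check, which runs quickly.

```python
from itertools import combinations

def dissimilarity(a, b):
    """Number of positions at which equal-length strings a and b differ."""
    return sum(x != y for x, y in zip(a, b))

def closest_pair(strings):
    """Return (dissimilarity, line_a, line_b) for the least dissimilar pair.

    Line numbers are 1-based, matching how lines are counted in a file.
    """
    numbered = enumerate(strings, start=1)
    return min(
        (dissimilarity(a, b), i, j)
        for (i, a), (j, b) in combinations(numbered, 2)
    )

# Made-up stand-in for the file contents. For the real data, something like:
#   with open('dna.txt') as f:
#       strings = [line.strip() for line in f]
sample = ["ACTCAAG", "CATCGAA", "ACTCAAA", "GGGGGGG"]
diff, line_a, line_b = closest_pair(sample)
print(diff, line_a, line_b)  # prints: 1 1 3
```

Because min compares tuples element by element, the dissimilarity decides the result first; the line numbers only break ties, which the problem says will not occur.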