Hamming distance and similarity between two strings Hamming distance is one of t
ID: 3872824 • Letter: H
Question
Hamming distance and similarity between two strings
Hamming distance is one of the most common ways to measure the similarity between two strings of the same length. Hamming distance is a position-by-position comparison that counts the number of positions in which the corresponding characters in the string are different. Two strings with a small Hamming distance are more similar than two strings with a larger Hamming distance.
Example:
first string = “ACCT”
second string = “ACCG”
A C C T
| | | *
A C C G
In this example, there are three matching characters and one mismatch, so the
Hamming distance is one.
The similarity score for two sequences is then calculated as follows: similarity_score = (string length - hamming distance) / string length similarity_score = (4-1)/4 = 3/4 = 0.75
Two sequences with a high similarity score are more similar than two sequences with a lower similarity score. The two strings are always the same length when calculating a Hamming distance.
Assignment Details:
In this assignment, you will calculate the similarity of two sequences using Hamming distance between two strings, search a genomic string looking for matches to a subsequence, and calculate the similarity scores for a sample DNA sequences compared to known DNA sequences.
Here we have provided a small portion of a DNA sequence from a human, a mouse, and an unknown species. Smaller DNA sequences will be compared to each of these larger DNA sequences to determine which has the best match. Each of the DNA sequences can be copied from this write-up and stored as a constant or variable in your program.
Part 1 – Compare two sequences to each other
float similarityScore (string sequence1, string sequence2)
Define the similarityScore function. This function takes two string arguments, calculates their Hamming distance, and returns their similarity score as a floating-point number. This function should only calculate the similarity if the two strings are the same length, otherwise return 0.
You should implement your main function to repeatedly ask to the user for 2 strings (sequence1 and sequence2) such that you can pass them as parameters to your similarityScore function. Writing your own main function will allow you to test and debug every function you write.
Example:
sequence 1 = CCGCCGCCGA
I I * I I * I I * I
sequence 2 = CCTCCTCCTA
Matches = 7, Mismatches = 3
Hamming distance = 3, Length = 10
Similarity = (10-3) / 10 = .7
result = similarityScore("CCGCCGCCGA", "CCTCCTCCTA");
cout<<result<<endl;
The variable "result" should equal 0.7
COG will grade Part 1 by generating random sequences to be compared. The results of your function will be compared to our results for the same sequences.
Part 2 – Count the matches of a sequence to a genome
int countMatches(string genome, string sequence1 , float min_score)
The countMatches function takes three parameters, a string containing the genome to search, a string containing the sequence to find, and a floating point value containing the minimum similarity score that will be considered a match.
The function does the following:
The function should find all the positions of the genome where the genome’s substring matches sequence1 with a similarity greater than or equal to the min_score.
The function should return the count of the number of such positions it found. Example:
n_matches = countMatches("CCGCCGCCGA", "CGC", 0.60);
Matching genome to CGC at 0.60 similarity.
The sequece CGC matches 100% at position 1 and 4.
Position 7 matches at 67%, still above the threshold.
Therefore, the value returned by the function would be 3.
COG will grade Part 2 by generating random sequences to be found in a given genome. The results of your function will be compared to our results for the same sequences.
Part 3 – Find best matching genome for a given sequence
We have a random DNA sequence, and we want to find the closest species to it. Is the DNA sequence more similar to human, mouse, or unknown?
When could this kind of comparison be useful? Suppose that the emergency room of some hospital sees a sudden and drastic increase in patients presenting with a particular set of symptoms. Doctors determine the cause to be bacterial, but without knowing the specific species involved they are unable to treat patients effectively. One way of identifying the cause is to obtain a DNA sample and compare it against known bacterial genomes. With a set of similarity scores, doctors can then make more informed decisions regarding treatment, prevention, and tracking of the disease.
The goal of this part of the assignment is to write functions that can be useful to determine the identity of different species of bacteria, animals, etc … . By simply using the similarity score routine you implemented you can compare an unknown sequence to different genomes and figure out the identity of the unknown sample.
float findBestMatch(string genome, string seq)
The findBestMatch function should take two string arguments and return a floating point value of the highest similarity score found for the given sequence at any position within the genome. In other words, this function should traverse the entire genome and find the highest similarity score by using similarityScore() for the comparisons between seq and each sequential substring of genome.
<hint: this function is very similar in structure to the countMatches function>
int findBestGenome(string genome1, string genome2, string genome3, string seq)
The findBestGenome function should take four string arguments(unknown sequence, mouse_genome, human_genome and unknown_genome).
Return an integer indicating which genome string, out of the 3 given, had the highest similarity score with the given sequence.
For each genome, the function will find the highest similarity score of the sequence (at any position) within that genome (call function findBestMatch described above).
The return value from this function will indicate which genome had the best match, 1, 2, or 3. In the case that two or more of the sequences have the same best similarity score, return 0.
COG will grade Part 3 based on both the value returned from findBestGenome and
findBestMatch.
Note: DNA sequences for human, mouse and unknown genomes will be uploaded as a file on Moodle with this assignment for testing purposes.
Getting started
Let’s talk about implementation and testing your solutions. We should approach the design of algorithms or programs from the top down. Begin writing your algorithms as very abstract descriptions and then repeatedly writing more detailed descriptions of the complex abstractions. Each of your abstractions could be its own function. Once you have reached a level of detail in your descriptions that provides well-understood solutions to the problem, you can begin to implement (write the code) for those functions. Implementation is best done from the bottom up, meaning you implement the base functions (those functions that don’t call other functions) first and test them until you are satisfied they behave correctly and therefore have the correct code for implementing that function.
Once you have your base functions implemented, you can implement the next layer of abstractions that use those base functions in their algorithms. Again, test the new functions until you are satisfied they perform as required. This method of bottom up will progressively add in functionality until the complete algorithm is implemented.
This assignment is written in a manner that will allow you to implement functions each part and use COG to verify that it works before implementing the next part.
Write your similarityScore() function and test it by calling it from main() and passing it two known strings. For example:
cout << “Test1: ” << similarityScore(“ACGT”,“ACGG”) << endl;
cout << “Test2: ” << similarityScore(“ATAT”,“ACAC”) << endl;
. . .
Once you know that your function works for the specific tests, modify your main to get the input from the user and pass the strings to the function. Once the user input and output are implemented as required, submit to COG to verify your compliance with all the requirements. Now you can implement the code for the next part of the assignment. Repeat the implementation, local testing, and COG testing stages for each of the required functions.
When testing the functions, it may be easier to use smaller genome sequences when debugging your code. Use simple examples to verify that the functions are working before testing it on longer strings. Once you’re confident the function works, call your functions from main() using the following strings as the first input parameter to the function:
HumanDNA = "CGCAAATTTGCCGGATTTCCTTTGCTGTTCCTGCATGTAGTTTAAACGAGATTGCC AGCACCGGGTATCATTCACCATTTTTCTTTTCGTTAACTTGCCGTCAGCCTTTTCTT TGACCTCTTCTTTCTGTTCATGTGTATTTGCTGTCTCTTAGCCCAGACTTCCCGTGT CCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGATGCTGTGGTCTTCATCTG CAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAGCTGAGCACTG GAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGGGCCATTGTT CATG"
mouseDNA = "CGCAATTTTTACTTAATTCTTTTTCTTTTAATTCATATATTTTTAATATGTTTACTATT AATGGTTATCATTCACCATTTAACTATTTGTTATTTTGACGTCATTTTTTTCTATTTCC TCTTTTTTCAATTCATGTTTATTTTCTGTATTTTTGTTAAGTTTTCACAAGTCTAATAT AATTGTCCTTTGAGAGGTTATTTGGTCTATATTTTTTTTTCTTCATCTGTATTTTTATG ATTTCATTTAATTGATTTTCATTGACAGGGTTCTGCTGTGTTCTGGATTGTATTTTTC TTGTGGAGAGGAACTATTTCTTGAGTGGGATGTACCTTTGTTCTTG"
unknownDNA = "CGCATTTTTGCCGGTTTTCCTTTGCTGTTTATTCATTTATTTTAAACGATATTTATAT CATCGGGTTTCATTCACTATTTTTCTTTTCGATAAATTTTTGTCAGCATTTTCTTTTAC CTCTTCTTTCTGTTTATGTTAATTTTCTGTTTCTTAACCCAGTCTTCTCGATTCTTATC TACCGGACCTATTATAGGTCACAGGGTCTTGATGCTTTGGTTTTCATCTGCAAGAG TCTGACTTCCTGCTAATGCTGTTCTGTGTCAGGGTGCATCTGAGCACTGATGTGGA
Example:
first string = “ACCT”
second string = “ACCG”
A C C T
| | | *
A C C G
Explanation / Answer
main.cpp
#include <iostream>
#include "assignment4.cpp"
using namespace std;
int main() {
string humanDNA = "CGCAAATTTGCCGGATTTCCTTTGCTGTTCCTGCATGTAGTTTAAACGAGATTGCCAGCACCGGGTATCATTCACCATTTTTCTTTTCGTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGTGTATTTGCTGTCTCTTAGCCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGATGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGGGCCATTGTTCATG";
string mouseDNA = "CGCAATTTTTACTTAATTCTTTTTCTTTTAATTCATATATTTTTAATATGTTTACTATTAATGGTTATCATTCACCATTTAACTATTTGTTATTTTGACGTCATTTTTTTCTATTTCCTCTTTTTTCAATTCATGTTTATTTTCTGTATTTTTGTTAAGTTTTCACAAGTCTAATATAATTGTCCTTTGAGAGGTTATTTGGTCTATATTTTTTTTTCTTCATCTGTATTTTTATGATTTCATTTAATTGATTTTCATTGACAGGGTTCTGCTGTGTTCTGGATTGTATTTTTCTTGTGGAGAGGAACTATTTCTTGAGTGGGATGTACCTTTGTTCTTG";
string unkownDNA = "CGCATTTTTGCCGGTTTTCCTTTGCTGTTTATTCATTTATTTTAAACGATATTTATATCATCGGGTTTCATTCACTATTTTTCTTTTCGATAAATTTTTGTCAGCATTTTCTTTTACCTCTTCTTTCTGTTTATGTTAATTTTCTGTTTCTTAACCCAGTCTTCTCGATTCTTATCTACCGGACCTATTATAGGTCACAGGGTCTTGATGCTTTGGTTTTCATCTGCAAGAGTCTGACTTCCTGCTAATGCTGTTCTGTGTCAGGGTGCATCTGAGCACTGATGTGGAGTTTTCTTGTGGATATGAGCCATTCATAGTGTGGGATGTGCCATAGTTCATG";
cout << similarityScore(humanDNA, mouseDNA) << endl;
cout << countMatches(humanDNA, "GCC", 0.5) << endl;
cout << findBestMatch(mouseDNA, "ACT") << endl;
cout << findBestGenome("GGAACACA", "CGATATGA", "GGAGTA", "CAATC") << endl;
cout << complementSequence("GGAACACA") << endl;
cout << reverseComplementSequence("GGAACACA") << endl;
}
assignment4.cpp
#include <iostream>
#include <string>
using namespace std;
/*
Compares two strings together and calculates their similarity
Parameters: first sequence, second sequence
Returns: similarity score for the two sequences
*/
float similarityScore(string sequence1, string sequence2) {
float sequence1Length = sequence1.length();
float sequence2Length = sequence2.length();
if (sequence1Length != sequence2Length) {
return 0;
}
float mismatches = 0.0;
for (int i = 0; i < sequence1Length; i++) {
if (!(sequence1[i] == sequence2[i])) {
mismatches++;
}
}
return (sequence1Length - mismatches) / sequence1Length;
}
/*
Counts all matches of a sequence in a genome with a minimun score consideration
Parameters: genome, sequence, minScore
Returns: all matches within the min score boundary
*/
int countMatches(string genome, string sequence, float minScore) {
int matches = 0;
int sequenceLength = sequence.length();
for (int pos = 0; pos < genome.length() - sequenceLength + 1; pos++) {
float score = similarityScore(genome.substr(pos, sequenceLength), sequence);
if (score >= minScore) {
matches++;
}
}
return matches;
}
/*
Finds the best match in terms of similarity for a genome and a sequence
Parameters: genome, sequence
Returns: best match
*/
float findBestMatch(string genome, string seq) {
float bestMatch = 0.0;
for (int pos = 0; pos < genome.length(); pos++) {
float score = similarityScore(genome.substr(pos, seq.length()), seq);
if (score > bestMatch) {
bestMatch = score;
}
}
return bestMatch;
}
/*
Compares three genomes and returns the index number of the one with the best match for a sequence
Parameters: first genome, second genome, third genome, sequence
Returns: best matching genome for given sequence
*/
int findBestGenome(string genome1, string genome2, string genome3, string seq) {
float genome1Score = findBestMatch(genome1, seq);
float genome2Score = findBestMatch(genome2, seq);
float genome3Score = findBestMatch(genome3, seq);
if (genome1Score > genome2Score && genome1Score > genome3Score) {
return 1;
}
else if (genome2Score > genome1Score && genome2Score > genome3Score) {
return 2;
}
else if (genome3Score > genome1Score && genome3Score > genome2Score) {
return 3;
}
else {
return 0;
}
}
/*
Calculates the complement sequence given a sequence of DNA
Parameters: sequence
Returns: complement sequence
*/
string complementSequence(string seq) {
string complementSeq = "";
for (int i = 0; i < seq.length(); i++) {
if (seq[i] == 65) {
complementSeq += "T";
}
else if (seq[i] == 67) {
complementSeq += "G";
}
}
return complementSeq;
}
/*
Calculates the reverse complement sequence given a sequence of DNA
Parameters: sequence
Returns: reverse complement sequence
*/
string reverseComplementSequence(string seq) {
string complementSeq = complementSequence(seq);
string reverseComponentSeq = "";
for (int i = complementSeq.length(); i > 0; i--) {
reverseComponentSeq += complementSeq[i];
}
return reverseComponentSeq;
}
Related Questions
Navigate
Integrity-first tutoring: explanations and feedback only — we do not complete graded work. Learn more.