A DNA sequence consists of a string with one of 4 bases (A, T, C, and G) at each

ID: 3122752 • Letter: A

Question

A DNA sequence consists of a string with one of 4 bases (A, T, C, and G) at each position in the string (e.g. ATACCAGGCT). You are studying the evolution of two species of Hawaiian honeycreeper birds, and you want to see how closely a DNA sequence from each of them matches the other. Each sequence is 1000 bases long. You line them up side by side and want to find the length of the longest contiguous stretch of bases in which the bases match between the two sequences (i.e. there are no mismatches within this stretch). (a) Write an algorithm that returns the length of the longest contiguous stretch of exactly matching bases between the pair of 1000 base sequences.

Explanation / Answer

This problem is based on suffix tree .

The algorithm is as follows:

I am giving out a sample program for better understanding.

#include <list>

#include <map>

#include <string>

#include <iostream>

#include <algorithm>

using namespace std;

// Get a non-const iterator to the shortest string in a list

list<string>::iterator shortest_of(list<string> & strings)

{

auto where = strings.end();

size_t min_len = size_t(-1);

for (auto i = strings.begin(); i != strings.end(); ++i) {

if (i->size() < min_len) {

where = i;

min_len = i->size();

}

return where;

}

// Say whether a string is a common substring of a list of strings

bool

is_common_substring_of(

string const & candidate, list<string> const & strings)

{

for (string const & s : strings) {

if (s.find(candidate) == string::npos) {

return false;

}

return true;

}

/* Get a multimap whose keys are the at-most `quota` greatest

lengths of common substrings of the list of strings `strings`, each key

multi-mapped to the set of common substrings of that length.

multimap<size_t,string>

n_longest_common_substring_sets(list<string> & strings, unsigned quota)

{

size_t nlengths = 0;

multimap<size_t,string> results;

if (quota == 0) {

return results;

}

auto shortest_i = shortest_of(strings);

if (shortest_i == strings.end()) {

return results;

}

string shortest = *shortest_i;

strings.erase(shortest_i);

for ( size_t start = 0; start < shortest.size();) {

size_t skip = 1;

for (size_t len = shortest.size(); len > 0; --len) {

string subs = shortest.substr(start,len);

if (!is_common_substring_of(subs,strings)) {

continue;

}

auto i = results.lower_bound(subs.size());

for ( ;i != results.end() &&

i->second.find(subs) == string::npos; ++i) {}

if (i != results.end()) {

continue;

}

for (i = results.begin();

i != results.end() && i->first < subs.size(); ) {

if (subs.find(i->second) != string::npos) {

i = results.erase(i);

} else {

++i;

}

auto hint = results.lower_bound(subs.size());

bool new_len = hint == results.end() || hint->first != subs.size();

if (new_len && nlengths == quota) {

size_t min_len = results.begin()->first;

if (min_len > subs.size()) {

continue;

}

results.erase(min_len);

--nlengths;

}

nlengths += new_len;

results.emplace_hint(hint,subs.size(),subs);

len = 1;

skip = subs.size();

}

start += skip;

}

return results;

}

// Testing ...

int main()

{

list<string> strings{

"OfBitWordFirstFileWordZ.xls",

"SecondZWordBitWordOfFileBlue.xls",

"ThirdFileZBitWordWhiteOfWord.xls",

"WordFourthWordFileBitGreenZOf.xls"};

auto results = n_longest_common_substring_sets(strings,4);

for (auto const & val : results) {

cout << "length: " << val.first

<< ", substring: " << val.second << endl;

}

return 0;

}

Navigate

A DNA sequence composed of repeating identical sequences Jack and Jill are under

A DNA sequence is a sequence of some combination of the characters A (adenine),

Integrity-first tutoring: explanations and feedback only — we do not complete graded work. Learn more.

A DNA sequence consists of a string with one of 4 bases (A, T, C, and G) at each

Question

Explanation / Answer

Related Questions

Navigate