This is a c program. Thanks ----- Superstring Assembly Suppose that multiple cop
ID: 3746916 • Letter: T
Question
This is a c program.
Thanks
-----
Superstring Assembly
Suppose that multiple copies of a long string are cut up into much smaller pieces. If you are given the available fragments, is it possible to rebuild the larger string? For example, if the original string was "algorithmsarefunfunfun", and you got given a list of fragments like "refun" and "unfun" "rith" and "lgo" and so on, plus lots more small fragments, could you rebuild the original string? Probably you could, right? But what about if the original string was millions of characters long, like War and Peace? Or billions of characters long, like your genetic sequence?
Let’s ask a slightly different question now: if you got given a set of small string fragments each from a larger string that you didn’t know anything about, could you at least find the shortest string in which every supplied fragment occurs at least once? This problem is known as the shortest superstring problem, and can be thought of as having parameters n, the number of string fragments provided, and m, the total length of those string fragments. The shortest superstring problem is very important in bioinformatics, where genome reassembly from short fragments called reads helps contribute to our understanding of the behavior of organisms.
Unfortunately, finding optimal solutions to the shortest superstring problem has been shown to be very challenging, and falls into a family of intractable problems that we’ll talk about later in the semester. In this project you will implement some approximation techniques that generate non-optimal superstrings from a collection of string fragments.
Input Data
Input to your program will (always) come from stdin, and will (always) consist of strictly lower-case alphabetic strings, one per input line, each line terminated by a single newline ' ' character. Be sure to read the message on the FAQ page about newlines. For example, the following file test0.txt containing six fragments is a valid input:
accgtcgatg
gcctag
gtacctacta
cgatgcc
tcgatgccgca
atgagaccgtc
As you develop your program according to the stages listed below, the output will evolve. Full output examples for this and two other test files are linked from the FAQ page. You should also check your program against other inputs too; and can easily generate your own tests using a text editor. Testing and debugging is your responsibility.
1
In terms of developing your program for this project, you may assume that no string fragment will be longer than 20 characters, and that there will be at most 1;000 such fragments supplied. You may also use brute-force programming rather than try and use more elegant data structures and algorithms – this activity is primarily about C programming rather than algorithm design.
Stage 0 – Reading Strings (0/15 marks)
Before tackling the more interesting parts of the project, your very first program should read the set of input fragments into an array of strings, and then print them out again so that you can be sure you have read them correctly. You can do that as part of your DEBUG mode code, see the suggestion below. Each of the following stages will then require a function that carries out the required processing, in each case working through the strings that were read, and then building (and selectively writing) the required superstring.
Stage 1 – Overlapping Suffixes (8/15 marks)
In this first stage you are to build a superstring according to the following process: start with the first of the input fragments and use it to initialize a partial superstring; then process every other fragment in input order, checking first to see if it is already wholly present in the partially built superstring, and if it is not, checking to see if there is any overlap between the head of the new fragment and the tail of the superstring. If the fragment appears already within the superstring, no further action is needed. Or, if there are overlapping characters, the required part of the fragment is appended to the end of the partial superstring. If there is no tail overlap, the whole of the fragment is appended to the partial superstring. For example, on the test0.txt file, the six input fragments result in this sequence of superstrings being formed:
Stage 0 Output
--------------
6 fragments read, 55 characters in total
Stage 1 Output
--------------
frg= 0, slen= 10 Accgtcgatg
frg= 1, slen= 15 AccgtcgatGcctag
frg= 2, slen= 24 AccgtcgatGcctaGtacctacta
frg= 3, slen= 24 AccgtCgatGcctaGtacctacta
frg= 4, slen= 35 AccgtCgatGcctaGtacctactaTcgatgccgca
frg= 5, slen= 45 AccgtCgatGcctaGtacctactaTcgatgccgcAtgagaccgtc
where the three numbers shown are the operation number, the number of the fragment that is being added, and the new length of that partial superstring. Note how the letters that are at the start of each fragment have been capitalized in the output so that they can be identified in the superstring, and how one of the fragments was found within the partial superstring. At the end of the process there are 45 characters in the superstring, a saving of 10 compared to the original input size of m = 55 characters.
Explanation / Answer
#include <string.h>
#include <stdio.h>
#include<stdbool.h>
// Check if str1 is substring of str2 or vice versa
bool findSubstring(char *first, char *second){
int l1 = strlen(first);
int l2 = strlen(second);
if ((l1 < l2) && (strstr(second, first) != NULL)){
return 0;
}
if ((l1 >= l2) && (strstr(first, second) != NULL)){
return 0;
}
return 1;
}
int comparePrefixSuffix(char *first, char *second) {
char *first1 = first;
char *second1 = second;
char *res;
int l1 = strlen(first1);
int l2 = strlen(second1);
strrev(second1);
int count = 0;
while (*first1 == *second1) {
count = count + 1;
if (*first1 == '' || *second1 == '')
break;
first1++;
second1++;
}
return count;
}
int findOverlaps(char *str1, char *str2, char *str)
{
int max = 0;
int l1 = strlen(str1);
int l2 = strlen(str2);
// check str1 is substring of str2 or vice versa
bool substring = findSubstring(str1, str2);
if(substring && (l1<l2)){
str = str2;
int length = strlen(str);
return length;
}
if(substring && (l2 <=l1)){
str = str1;
int length = strlen(str);
return length;
}
// check prefix of str1 matches with suffix of str2
int count1 = comparePrefixSuffix(str1, str2);
// check prefix of str2 matches with suffix of str1
int count2 = comparePrefixSuffix(str2, str1);
// check which overlapping part is max length
if(count1 > count2){
//Append string 1 in string 2 (otherthan common part) and return
str= str2;
char *sub = str1 + count1;
strcat(str, sub);
return strlen(str);
}
else{
//Append string 2 in string 1 and return
str = str1;
char *sub = str2 + count2;
strcat(str, sub);
return strlen(str);
}
}
char *findFinalstring(char *arr[], int len)
{
// run len-1 times to consider every pair
while(len != 1)
{
int max = 0;
int l, r;
char *lastString;
for (int i = 0; i < len; i++)
{
for (int j = i + 1; j < len; j++)
{
char *str;
int res = findOverlaps(arr[i], arr[j], str);
// check for maximum overlaps
if (max < res)
{
max = res;
lastString = str;
l = i, r = j;
}
}
}
len--;
// if no overlap, append arr[len] to arr[0]
if (max == 0)
strcat(arr[0], arr[len]);
else
{
arr[l] = lastString;
arr[r] = arr[len];
}
}
return arr[0];
}
// Driver program
int main()
{
char *arr[] = {"catgc", "ctaagt", "gcta", "ttca", "atgcatc"};
int len = sizeof(arr)/sizeof(arr[0]);
char *final = findFinalstring(arr, len);
return 0;
}
Related Questions
drjack9650@gmail.com
Navigate
Integrity-first tutoring: explanations and feedback only — we do not complete graded work. Learn more.