
Your job is to write a script named countmatches that expects at least two argum


Question

Your job is to write a script named countmatches that expects at least two arguments on the command line. The first argument is the pathname of a file containing a valid DNA string with no newline characters or white space characters of any kind within it. (It will be terminated with a newline character.) This file contains nothing but a sequence of the letters a, c, g, and t. The remaining arguments are strings containing only the bases a, c, g, and t in any order. For each valid argument string, it will search the DNA string in the file and count how many non-overlapping occurrences of that argument string are in the DNA string. To make sure you understand what non-overlapping means, the string ata occurs just once in the string atata, not twice, because the two occurrences overlap. If your script is called correctly, it will output for each argument a line containing the argument string followed by how many times it occurs in the string. If it finds no occurrences, it should output 0 as a count. For example, if the string aaccgtttgtaaccggaac is in a file named dnafile, then your script should work like this:

Explanation / Answer

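** Since the desired language is not specified, this solution is written in C# (it can be readily ported to any modern object-oriented language). Note that it reads the DNA string and the subsequences interactively from the console rather than from a file and command-line arguments.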

Please find the detailed comments inline.

using System;
using System.Collections.Generic;
using System.Linq;

namespace DNASequencer
{
    class DNAMatch
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Please enter the valid DNA sequence [containing only 'a', 'c', 'g', 't'] to be matched:");
            // ReadLine returns null at end of input, so fall back to an empty string.
            string input = Console.ReadLine() ?? "";

            // Dictionary (hash table) mapping each subsequence to its count in the input string.
            Dictionary<string, int> countList = new Dictionary<string, int>();

            // Validity criteria:
            // 1. The input is not empty.
            // 2. The input is a valid DNA sequence, i.e. contains only a, c, g, t.
            if (input.Length > 0 && IsValidDNA(input))
            {
                Console.WriteLine("Please enter the subsequences to match, separated by spaces; press Enter to finish:");
                // RemoveEmptyEntries drops the empty strings produced by repeated spaces.
                List<string> subSequences = (Console.ReadLine() ?? "")
                    .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                    .ToList();
                foreach (string item in subSequences)
                {
                    if (IsValidDNA(item))
                    {
                        // Use the indexer rather than Add so a repeated subsequence does not throw.
                        countList[item] = CountDNAStrings(input, item);
                    }
                    else
                    {
                        Console.WriteLine("Invalid subsequence: " + item + " passed!");
                        countList[item] = 0;
                    }
                }

                // Output each subsequence and its count.
                // (Dictionary does not formally guarantee enumeration order.)
                foreach (var item in countList)
                    Console.WriteLine("Sequence: " + item.Key + " Occurred: " + item.Value + " times.");
            }
            else
                Console.WriteLine("The input string is invalid: it is either empty or contains a character other than a, c, g, t. Please try again.");
            Console.ReadLine();
        }

        // Checks validity by removing every valid character from the string.
        // If anything remains, the string contains at least one invalid character.
        private static bool IsValidDNA(string input)
        {
            string temp = input.Replace("a", "").Replace("c", "").Replace("g", "").Replace("t", "");
            return temp.Length == 0;
        }

        // Counts the non-overlapping occurrences of testSubstring in input,
        // scanning left to right in a single pass. Worst case is O(n * m) character
        // comparisons for an input of length n and a pattern of length m.
        public static int CountDNAStrings(string input, string testSubstring)
        {
            // An empty pattern would never advance the scan, so treat it as no match.
            if (string.IsNullOrEmpty(testSubstring))
                return 0;

            int count = 0;
            for (int i = 0; i + testSubstring.Length <= input.Length; i++)
            {
                // Compare in place instead of allocating a new substring at every position.
                if (string.CompareOrdinal(input, i, testSubstring, 0, testSubstring.Length) == 0)
                {
                    count++;
                    // Skip past this match (the loop's i++ supplies the last step)
                    // so that overlapping matches are not counted.
                    i += testSubstring.Length - 1;
                }
            }
            return count;
        }
    }
}
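
The question asks for a script that takes the file pathname and the search strings as command-line arguments, which the interactive program above does not do. Below is a minimal sketch of that interface, written in Python only because the question does not name a scripting language. The function names (is_valid_dna, count_matches, main), the usage message, and the choice to report invalid arguments on standard error are all assumptions for illustration, not part of the assignment. It uses the same left-to-right scan, resuming each search just past the previous match.

```python
#!/usr/bin/env python3
"""countmatches: count non-overlapping occurrences of DNA substrings (sketch)."""
import sys

BASES = set("acgt")


def is_valid_dna(s):
    """Return True if s is non-empty and contains only a, c, g, t."""
    return len(s) > 0 and set(s) <= BASES


def count_matches(dna, pattern):
    """Count non-overlapping occurrences of pattern in dna, left to right."""
    if not pattern:
        # An empty pattern would never advance the search position.
        return 0
    count = 0
    pos = dna.find(pattern)
    while pos != -1:
        count += 1
        # Resume just past this match so overlapping occurrences are skipped.
        pos = dna.find(pattern, pos + len(pattern))
    return count


def main(argv):
    """argv[1] is the DNA file pathname; argv[2:] are the strings to count."""
    if len(argv) < 3:
        print("usage: countmatches <dnafile> <string> [<string> ...]", file=sys.stderr)
        return
    with open(argv[1]) as f:
        # The file holds one DNA string terminated by a newline.
        dna = f.read().rstrip("\n")
    if not is_valid_dna(dna):
        print("countmatches: file does not contain a valid DNA string", file=sys.stderr)
        return
    for pattern in argv[2:]:
        if is_valid_dna(pattern):
            print(pattern, count_matches(dna, pattern))
        else:
            # Assumption: invalid arguments are reported and skipped.
            print("countmatches: invalid argument: " + pattern, file=sys.stderr)


if __name__ == "__main__":
    main(sys.argv)
```

If this were saved as countmatches and made executable, then with a file holding the string atata from the question, running it with the argument ata would print "ata 1", since the two possible occurrences overlap.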