Hi, I need help with a UNIX assignment. I'm trying to figure out the following.
Question
I need to:
1. Create an executable shell script called ~/assignments/UNIXchallenge/UNIXchallenge.sh
2. The shell script should read a specifically named file my instructor gave me
3. and produce three output files (in the current working directory):
Hs_gene.txt
Hs_CDS.txt
Hs_transcript.txt
The first file (Hs_gene.txt) should contain only those entries with a "feature" type of "gene"; the second file (Hs_CDS.txt) should contain only those entries with the feature type of "CDS", and the third file (Hs_transcript.txt) should contain only those entries with feature type = "transcript".
The output files do not contain all the entries in the original .gtf file; I only need the entries corresponding to "seqname," "feature," "start," "end," and "attribute." You'll also probably want to use the UNIX "cut" command to extract specific entries. I also have to figure out how to separate different entries into different files, based on the "feature" field. I need a way to use "grep" without getting spurious matches.
I know I need to use concepts related to GFF/GTF files, the UNIX "cut" command, UNIX "pipes", and advanced "grep" regular expression syntax (in particular, how to grep for tab characters).
I need lots of help with this assignment! Thanks!
Explanation / Answer
#!/bin/sh
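# grep prints the lines of the input file that match a pattern.
# The grep command syntax is:
# grep [options] PATTERN INPUTFILE > OUTPUTFILE
# GTF columns are separated by tabs, so wrapping the feature name in tab
# characters matches it only when it fills a whole column. Plain
# grep -w "gene" can also match a line where the word appears in another
# column, such as the attribute text.
TAB=$(printf '\t')
grep "${TAB}gene${TAB}" filegivenbyinstructor >tempfile1gene
grep "${TAB}CDS${TAB}" filegivenbyinstructor >tempfile2cds
grep "${TAB}transcript${TAB}" filegivenbyinstructor >tempfile3transcript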
# A standard GTF line has nine tab-separated fields:
# 1 seqname  2 source  3 feature  4 start  5 end  6 score  7 strand  8 frame  9 attribute
# The required fields are therefore 1, 3, 4, 5 and 9.
# The cut command extracts the listed fields from the input file,
# and the '>' operator writes the result to a new file.
# The cut command syntax is:
# cut [-d DELIMITER] -f FIELD1,FIELD2,...,FIELDn INPUTFILE > OUTPUTFILE
# cut splits on tabs by default, so no -d option is needed for a GTF file.
cut -f1,3,4,5,9 tempfile1gene >Hs_gene.txt
cut -f1,3,4,5,9 tempfile2cds >Hs_CDS.txt
cut -f1,3,4,5,9 tempfile3transcript >Hs_transcript.txt
# Optional: to remove duplicate records, use the sort and uniq commands.
# The '|' operator is a pipe: the output of sort is passed directly to uniq as its input,
# so several commands run together as one pipeline.
# These commands are commented out because they print to the terminal;
# redirect the output to a new file to keep it.
# sort Hs_gene.txt | uniq
# sort Hs_CDS.txt | uniq
# sort Hs_transcript.txt | uniq
# This method may change the order of the records.
# To preserve the original order, use awk instead: the expression below
# prints each line only the first time it is seen.
# awk '!a[$0]++' Hs_gene.txt
# awk '!a[$0]++' Hs_CDS.txt
# awk '!a[$0]++' Hs_transcript.txt
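The same steps can also be chained with a pipe so that no temporary files are needed. The following is only a sketch, assuming the instructor's file follows the standard nine-column, tab-separated GTF layout. The function name extract_feature and the INPUT variable are made up for illustration, and filegivenbyinstructor is still a placeholder for the real file name. The grep pattern is anchored at the start of the line and skips exactly two tab-delimited columns, so the feature name has to fill the third column.

```shell
#!/bin/sh
# Placeholder input name; pass the real file as the first argument.
INPUT="${1:-filegivenbyinstructor}"

# A literal tab character, for use inside grep patterns.
TAB=$(printf '\t')

# extract_feature FEATURE INFILE OUTFILE  (hypothetical helper)
# Keeps lines whose third column is exactly FEATURE, then cuts out
# seqname (1), feature (3), start (4), end (5) and attribute (9).
extract_feature() {
    grep "^[^${TAB}]*${TAB}[^${TAB}]*${TAB}$1${TAB}" "$2" |
        cut -f1,3,4,5,9 >"$3"
}

if [ -r "$INPUT" ]; then
    for feature in gene CDS transcript; do
        extract_feature "$feature" "$INPUT" "Hs_${feature}.txt"
    done
else
    echo "Cannot read input file: $INPUT" >&2
fi
```

Saved as UNIXchallenge.sh and made executable with chmod +x, it would be run as ./UNIXchallenge.sh followed by the name of the .gtf file. The three output files are written to the current working directory.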