Hi, I need help with a UNIX assignment. I'm trying to figure out the following.

Question

Hi, I need help with a UNIX assignment. I'm trying to figure out the following.

I need to:

1. Create an executable shell script called ~/assignments/UNIXchallenge/UNIXchallenge.sh

2. The shell script should read a specifically named file my instructor gave me

3. and produce three output files (in the current working directory):

Hs_gene.txt

Hs_CDS.txt

Hs_transcript.txt

The first file (Hs_gene.txt) should contain only those entries with a "feature" type of "gene"; the second file (Hs_CDS.txt) should contain only those entries with the feature type of "CDS", and the third file (Hs_transcript.txt) should contain only those entries with feature type = "transcript".

The output files do not contain all the entries in the original .gtf file; I only need the entries corresponding to "seqname," "feature," "start," "end," and "attribute." You'll also probably want to use the UNIX "cut" command to extract specific entries. I also have to figure out how to separate different entries into different files, based on the "feature" field. I need a way to use "grep" without getting spurious matches.

I know I need to utilize concepts related to GFF/GTF Files, the UNIX "cut" command, UNIX "pipes", and advanced "grep" regular expression syntax (in particular, how to grep tab characters)

I need lots of help with this assignment! Thanks!

Explanation / Answer

#!/bin/sh

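# grep prints every line of the input file that matches a pattern.

# The general syntax is:

# grep [options] pattern inputfile > outputfile

# In a GTF file the feature type is the third tab-separated column, so the pattern is anchored with a tab on each side.

# grep -w "gene" alone could also match lines where the word appears only in the attribute column.

# Store a literal tab character; printf '\t' is portable across sh implementations.

tab=$(printf '\t')

grep "${tab}gene${tab}" filegivenbyinstructor >tempfile1gene

grep "${tab}CDS${tab}" filegivenbyinstructor >tempfile2cds

grep "${tab}transcript${tab}" filegivenbyinstructor >tempfile3transcript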

# A GTF file has nine tab-separated columns:

# seqname: 1 source: 2 feature: 3 start: 4 end: 5 score: 6 strand: 7 frame: 8 attribute: 9

# Extract the required fields (seqname, feature, start, end, attribute).

# The cut command extracts the listed fields from the

# input file, and the '>' operator writes the result to a new file.

# The general syntax is:

# cut -d delimiter -f field1,field2,...,fieldN inputfile > outputfile

# Tab is the default delimiter for cut, so -d is omitted here.

cut -f1,3,4,5,9 tempfile1gene >Hs_gene.txt

cut -f1,3,4,5,9 tempfile2cds >Hs_CDS.txt

cut -f1,3,4,5,9 tempfile3transcript >Hs_transcript.txt

# Optional: to remove duplicate records, use sort and uniq.

# The '|' operator (a pipe) sends the output of sort directly to uniq as its input,

# so the two commands run together as one pipeline.

# These commands print the de-duplicated records to standard output; redirect with '>' to save them.

sort Hs_gene.txt | uniq

sort Hs_CDS.txt | uniq

sort Hs_transcript.txt | uniq

# Sorting changes the order of the records.

# To preserve the original order, awk can print each line only the first time it appears.

# Like the commands above, these print to standard output.

awk '!a[$0]++' Hs_gene.txt

awk '!a[$0]++' Hs_CDS.txt

awk '!a[$0]++' Hs_transcript.txt
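
The grep and cut steps can also be joined with a pipe, so that no temporary files are needed. The following is a minimal sketch, not the instructor's reference solution. The file name sample.gtf and its four records are made up for illustration, so substitute the real input file name. It assumes the standard nine-column GTF layout, with the feature type in column 3.

```shell
#!/bin/sh
# Hypothetical sample input; replace with the file the instructor provided.
input=sample.gtf

# A literal tab character, used to anchor the feature name in the grep pattern.
tab=$(printf '\t')

# Four made-up records in GTF layout (nine tab-separated columns).
# The exon record mentions "CDS" in its attribute column only, so it must
# not end up in Hs_CDS.txt.
printf '1\thavana\tgene\t11869\t14409\t.\t+\t.\tgene_id "G1";\n' > "$input"
printf '1\thavana\ttranscript\t11869\t14409\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' >> "$input"
printf '1\thavana\tCDS\t12010\t12057\t.\t+\t0\tgene_id "G1"; transcript_id "T1";\n' >> "$input"
printf '1\thavana\texon\t12010\t12057\t.\t+\t.\tgene_id "G1"; note "CDS fragment";\n' >> "$input"

# For each feature type: keep lines where the name sits between two tabs,
# then keep columns 1, 3, 4, 5 and 9 (cut splits on tabs by default).
for feature in gene CDS transcript; do
    grep "${tab}${feature}${tab}" "$input" | cut -f1,3,4,5,9 > "Hs_${feature}.txt"
done
```

The tab-anchored pattern would still match a record whose source column (column 2) happened to equal the feature name, so it is a narrower filter than grep -w rather than a strict column test. To meet step 1 of the assignment, the script would be saved as ~/assignments/UNIXchallenge/UNIXchallenge.sh and made executable with chmod +x.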
