Academic Integrity: tutoring, explanations, and feedback — we don’t complete graded work or submit on a student’s behalf.

LAB TASK: BLAST is a multipurpose bioinformatics algorithm for searching a query

ID: 672163 • Letter: L

Question

LAB TASK: BLAST is a multipurpose bioinformatics algorithm for searching a query sequence against a database of other sequences. The algorithm returns local alignments of varying significance. Today you will try both the on- and off-line versions of BLAST. Your goal will be to identify a protein sequence and then scan its host genome for duplicated genes (or other interesting alignments).

mRNA transcript for following translated protein: >Unknown putative shoe-interacting protein 1 MPKKSIEEWEEDAIESVPYLASDEKGSNYKEATQIPLNLKQSEIENHPTVKPWVHFVAGGIGGMAGAVVTCPFDLVKTRLQSDIFLKAYKSQAVNISKGSTRPKSINYVIQAGTHFKETLGIIGNVYKQEGFRSLFKGLGPNLVGVIPARSINFFTYGTTKDMYAKAFNNGQETPMIHLMAAATAGWATATATNPIWLIKTRVQLDKAGKTSVRQYKNSWDCLKSVIRNEGFTGLYKGLSASYLGSVEGILQWLLYEQMKRLIKERSIEKFGYQAEGTKSTSEKVKEWCQRSGSAGLAKFVASIATYPHEVVRTRLRQTPKENGKRKYTGLVQSFKVIIKEEGLFSMYSGLTPHLMRTVPNSIIMFGTWEIVIRLLS*

Part A: Online BLAST
1) Determine the identity of this protein using the BLAST website (http://blast.ncbi.nlm.nih.gov/). From which genome is it most likely to have originated?
2) Perform a second web BLAST, this time restricting the search database to the genome identified in Step A. Are there any duplicated copies of this protein in its host genome? If so, are the reported functions of the duplicates similar to those of the original protein? How do you explain the other significant alignments observed (if any)?

Part B: Local BLAST UNIX TIP: When working on the command line, you can redirect information normally printed to the screen into a file by following your command with a “>” and a file name: ./program_name argument1 argument2 etc. > output_file.txt You can also redirect information normally printed to the screen into another program by following your command with a “|” and a program name. For example, to scroll through your program’s output: ./program_name argument1 argument2 etc. | less To avoid the time delays of sending information over the internet and waiting in line for access to NCBI’s servers, it is often convenient to run BLAST completely offline. Typically this is done when one wishes to search a reduced database. 1) First you must format your database (originally in FASTA format; http://en.wikipedia.org/wiki/FASTA_format) for fast searching by BLAST. This is done with the program makeblastdb. In order to learn more about makeblastdb, type the following in the command line: makeblastdb –h In order to format the database you need to search, type: makeblastdb –in genome.fasta –dbtype prot –parse_seqids The –parse_seqids flag parses the IDs of the input FASTA file, and makes the output database searchable by those IDss

1.) What does –dbtype prot indicate about our input file?

2) You can now search against this database using a variety of BLAST programs. Here we’ll use psiblast. To learn more about psiblast, type the following in the command line: psiblast –h Which arguments are needed to search the query sequence (query.fasta) against your newly formatted database (genome.fasta)?

Run the program in the command line.

Explanation / Answer

The best way to identify an unknown sequence is to see if that sequence already exists in a public database. If the database sequence is a well-characterized sequence, then you may have access to a wealth of biological information. All of the nucleotide-nucleotide BLAST programs can be used to accomplish this goal. However, MEGABLAST is specifically designed to efficiently find long alignments between very similar sequences and thus is the best tool to use to find the identical match to your query sequence. In addition to the expect value significance cut-off, MEGABLAST also provides an adjustable percent identity cut-off that overrides the significance threshold.

>Sequence_1
AGACAGATCACTTCAGTCGCCACAATTAGCCATGGATAAGATACACCATTGCCATC
>Sequence_2
AGACAACTTCAGTCGCCGATCACTCGCCACAATTTCAGTCGCCATAAGGCAATTAT

If the query sequences are already present in Entrez, their GI or Accession numbers can be pasted in the search box, one identifier per line.

U12345
F12564
BH023812

Bioinformatics

BLAST was used to identify contigs from the Atlantic salmon and rainbow trout assemblies containing sdY and other verified male-specific markers (OmyY1, OtY2, OtY3, GH-) in BAC assemblies. This resulted in a single contig from each assembly, each of which was used to scan the assemblies a second time for additional significant comparative sequence identity. We used MULAN (Ovcharenko et al. 2005) to perform threaded blockset aligner (TBA) analysis with a minimum sequence identity cutoff of 50% using the rainbow trout, Atlantic salmon, and extended Chinook salmon (OtY3) contigs, all containing sdY. For all candidate evolutionary conserved region (ECR) blocks identified by MULAN, we performed phylogenetic analysis using ClustalW (Larkin et al. 2007) to determine whether sequences in SDR contigs were more related to each other than to other similar sequences found throughout the rainbow trout (Miller M, unpublished data) and Atlantic salmon draft genome assemblies (GB: AGKD02000000). Using only ECR blocks exhibiting direct phylogenetic relatedness, we quantified the total length of alignment across each contig, the total length of alignment not including species-specific insertions, and the percent identity of the entire aligned region.

We performed multiple database searches to look for signatures of TEs. First, BLASTx was run for each sdY-containing contig to check for TE protein alignments in the NCBI nr protein database using default parameters. We also searched in repetitive element-specific databases GyDB and Dfam using BLAST and hidden Markov model algorithms (Lloréns et al. 2008; Wheeler et al. 2013). Next, we checked each sdY-containing contig for transcript alignments within the cGRASP EST (expressed sequence tag) cluster database using BLASTn default parameters to identify additional regions of the SDR contigs that are potentially transcribed but may not code for proteins. The resulting hits are reported as candidate TEs responsible for relocation of the SDR cassette. Each of these candidate TEs and flanking sequences were analyzed for functional motifs using GenomeNet Motif (http://www.genome.jp/tools/motif/), directionality of transcription, signature of transposition events (i.e., target or insertion sites), and amino acid sequence identity within the SDR contigs (local BLASTx).

We also calculated phylogenetic relatedness of elements between species (ClustalW) using related sequences in the rainbow trout and Atlantic salmon genomes. Neighbor-joining trees were generated using candidate elements within each SDR contig and the top ten alignments in the rainbow trout and Atlantic salmon genomes to each SDR-related sequence. We attempted to identify potential outgroup sequences in a number of different ways including NCBI nr database searches for sequences similar to candidate TEs in nonsalmonid species and searches for nonsalmonid outgroup sequences using translated amino acid sequence intermediaries. The resulting sequences were either too similar to candidate TEs to properly root the tree or too dissimilar to properly align. This is most likely due to the nature of TEs which have a complex evolutionary history within a single species’ genome, where related sequences within salmonid genomes can be more dissimilar to candidate TEs than similar sequences from other species. For these reasons, we generated unrooted phylogenetic trees to observe potential clustering of candidate TEs.

//Another Solution

Step 1.BLAST File_1 and File_2, outputting to a new file "hits1"

Step 2. I then take the output file, which contains only the IDs and convert it to fasta format.

Step 3. Then I can take this file and compare it to file_3. Repeat steps 2 and 3 until all the files are compared...

The problem I am having is that with each round of BLAST, duplicates of each ID are created. For example in file_1 (and in the database) the gene ID "003375649" is only present 1x. But the hits1 file (after running the first round of BLAST) contains this ID 2x. Then on the second round of BLAST (hits2) the output contains this ID 4x. This means that although the original file contains only 3000 protein sequences, the hits 2 file already contains >42,000 sequences and the BLAST takes for.ev.er to run! I want to know how to solve this problem but more importantly I am interested in knowing why the program is creating duplicates for my general knowledge.

Here's what I have tried: a. making more stringent e-values (1e-30), this does not help b. creating local databases from the .faa files themselves rather than one larger one and running blast like so:

this gives the exact same results as this first method. Soooo why the duplicates