Difference between revisions of "BLAST"

 
{{TutorialsTopNav}}
 
{{TutorialsTopNav}}
  
 +
=Overview=
 +
 +
"The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families" (quoted from the [http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastHome NCBI BLAST homepage]).
 +
 +
geWorkbench submits BLAST jobs to the NCBI server.  NCBI-supported sequence databases and search algorithms can be selected in the user interface.  Since release 2.1.0, geWorkbench supports almost all BLAST setting options available through the NCBI web interface. 
 +
 +
Please note that although, in geWorkbench, we have adopted the default settings for each BLAST algorithm as seen on the NCBI website, those settings are subject to change at any time by NCBI.  Before submitting a BLAST job from geWorkbench, the user should verify that the parameter settings are appropriate for their query.
 +
 +
 +
The BLAST analysis is available when a protein or DNA sequence is loaded and selected in the [[Workspace|Workspace]].  The BLAST Analysis and BLAST Results Viewer must be loaded in the [[Component Configuration Manager]].
  
  
  
=Overview=
+
[[Image:BLAST_Parameters_Main-blastn-full.png|{{ImageMaxWidth}}]]
The BLAST algorithm is found in the '''Sequence Alignment''' tab, located in the command area of geWorkbench (lower right quadrant).  The Sequence Alignment tab appears when a protein or DNA sequence is loaded and selected in the Project Folders component.  BLAST is currently the only alignment option supported.
 
  
The BLAST algorithm is used to find similarities between nucleotide or amino acid query sequences and sequences held in a database. It is often used to give clues to the function of a sequence based on its similarity to already characterized sequences.
+
'''Figure legend: BLAST Main Parameters'''. A nucleotide sequence has been loaded into the [[Workspace|Workspace]].
  
geWorkbench runs BLAST by submitting jobs to the NCBI server.   NCBI-supported sequence databases and search algorithms can be selected in the user interface (arrows). There is no provision at this time for running a local BLAST job on the client desktop machine.  
+
=NCBI Documentation=
 +
geWorkbench serves as an interface to the NCBI BLAST server and implements the same options as the NCBI BLAST website. Detailed information about each option can be found on NCBI webpages, including:
 +
* [http://blast.ncbi.nlm.nih.gov/Blast.cgi NCBI BLAST Home Page]
 +
* [http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs BLAST Docs - Main help page]
 +
** [http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=ProgSelectionGuide Blast Program Selection Guide]
  
 +
Older help pages...
 +
* [http://www.ncbi.nlm.nih.gov/blast/html/blastcgihelp.html NCBI BLAST Help page 1] and
 +
* [http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml NCBI BLAST help page 2 ].
  
[[Image:T_SequenceAlignment_BLAST_Main.png]]
 
  
  
  
 
==Prerequisites==
 
==Prerequisites==
* The Sequence Alignment component must be loaded in the geWorkbench [[Tutorial_-_Component_Configuration_Manager |Component Configuration Manager]].   
+
* The BLAST Analysis and Viewer components must be loaded in the [[Component_Configuration_Manager |Component Configuration Manager]].   
* A protein or nucleotide sequence must be loaded in the Project Folders component.
+
* A protein or nucleotide sequence must be loaded in the [[Workspace|Workspace]].
  
 
==Query sequences==
 
==Query sequences==
BLAST accepts nucleotide or amino-acid query sequences in the FASTA format. A query file can contain one or multiple sequences.  The file can be loaded from disk using the '''File Open''' command, or may have been placed into the Project Folders component by another component such as the Sequence Retriever, or as a result of a previous BLAST run.
+
BLAST accepts nucleotide or amino-acid query sequences in the FASTA format. A query file can contain one or multiple sequences.  The file can be loaded from disk using the '''File Open''' command, or may have been placed into the [[Workspace|Workspace]] by another component such as the Sequence Retriever, or added from the results of a previous BLAST run.
 +
 
 +
==Invoking BLAST==
 +
 
 +
BLAST is a normal geWorkbench analysis component and can be invoked either by right-clicking on a sequence node in the [[Workspace|Workspace]], or through the Commands entry in the Menu Bar at the top of the geWorkbench GUI.
 +
 
 +
 
 +
Invoking BLAST via right-clicking on a sequence node in the [[Workspace|Workspace]]:
 +
 
 +
 
 +
[[Image:BLAST_Analysis_invocation.png]]
 +
 
 +
 
 +
 
 +
Invoking BLAST via the Commands menu.  The desired sequence node must be selected first in the [[Workspace|Workspace]]:
 +
 
 +
 
 +
[[Image:BLAST_Analysis_Command_invocation.png]]
  
 
==Parameters - Main==
 
==Parameters - Main==
 +
 +
 +
In addition to the parameters shown in the previous image, BLASTX and TBLASTX add the Genetic Code option:
 +
 +
[[Image:BLAST_Parameters_Main-blastx.png]]
 +
 
===Algorithms===
 
===Algorithms===
  
 
====For nucleotide query sequences:====
 
====For nucleotide query sequences:====
  
* '''blastn''' - Compares a nucleotide query sequence against a nucleotide sequence database.
+
* '''blastn''' - three algorithms are available under the blastn choice, listed here in order of decreasing similarity of the query to the target sequences. Each compares a nucleotide query sequence against a nucleotide sequence database:
 +
** '''megablast''' - optimize for highly similar sequences.
 +
** '''discontiguous megablast''' - optimize for more dissimilar sequences.
 +
** '''blastn''' - optimize for somewhat similar sequences.
 
* '''blastx''' - Compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
 
* '''blastx''' - Compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
* '''tblastx''' - Compares the 6 frame translations of a nucleotide query sequence against the six frame translations of a nucleotide sequence database.
+
* '''tblastx''' - Compares the 6-frame translations of a nucleotide query sequence against the 6-frame translations of a nucleotide sequence database.
  
 
===Databases===
 
===Databases===
Standard protein and nucleic acid databases maintained at NCBI are supported.  The appropriate databases for the search algorithm chosen will be displayed.
+
Standard protein and nucleic acid databases maintained at NCBI are supported.  The appropriate databases for the search algorithm chosen will be displayed.  A window to the right, "Database Details", displays a summary of the database currently selected in the list.
  
 
====For nucleic acids:====
 
====For nucleic acids:====
* '''ncbi/nt''' - all non-redundant DNA sequences.
+
* '''Nucleotide Collection (nr/nt)'''
* '''ncbi/pdbnt''' - nucleotide sequences derived from the PDB database (protein 3D structure database).
+
* '''Reference RNA sequences (refseq_rna)'''
* '''ncbi/yeast.nt''' - yeast genonic sequences.
+
* '''Reference genomic sequences (refseq_genomic)'''
 +
* '''NCBI Genomes (chromosome)'''
 +
* '''Expressed Sequence Tags (est)'''
 +
* '''Human subset of EST (est_human)'''
 +
* '''Mouse subset of EST (est_mouse)'''
 +
* '''Non-human, non-mouse ESTs (est_others)'''
 +
* '''Genomic survey sequences (gss)'''
 +
* '''High throughput genomic sequences (htgs)'''
 +
* '''Patent sequences (pat)'''
 +
* '''Protein Data Bank (pdb)'''
 +
* '''Human ALU repeat elements (alu)'''
 +
* '''Sequence tagged sites (dbsts)'''
 +
* '''Whole genome shotgun contigs (wgs)'''
 +
* '''Metagenomic Samples (env_nt)'''
 +
* '''Transcriptome Shotgun Assembly (tsa_nt)'''
  
 
====For proteins:====
 
====For proteins:====
* '''ncbi/nr''' - all non-redundant protein sequences
+
* '''Non-redundant protein sequences (nr)'''
* '''ncbi/pdbaa''' - protein sequences from the PDB database (protein 3D structure database).
+
* '''Reference proteins (refseq_protein)'''
* '''ncbi/swissprot''' - sequences from Swiss-Prot, a primary reference database.
+
* '''UniProtKB/Swiss-Prot (swissprot)'''
* '''ncbi/yeast.aa''' - translations of yeast genomic coding regions.
+
* '''Patented protein sequences (pat)'''
 +
* '''Protein Data Bank sequences (pdb)'''
 +
* '''Metagenomic proteins (env_nr)'''
 +
* '''Transcriptome Shotgun Assembly (tsa_nr)'''
  
==Parameters - Advanced Options==
+
===Search Choice===
 +
These settings restrict the search in the following ways:
  
The following options to NCBI BLAST can be chosen
+
* '''Exclude''': exclude certain specialized database entries from the search.
 +
** '''Models''' (XM/XP)
 +
** '''Uncultured/environmental sequences'''
 +
 
 +
* '''Entrez Query''' - an Entrez query can be entered directly to restrict a search e.g. to a particular species.  Please see [http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml NCBI BLAST help].  An example from that page is "Mus musculus[organism] AND biomol_mrna[properties]".  This limits the search to mouse mRNA entries in the database.
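
The example restriction quoted above combines field-tagged terms with a Boolean operator. The following is a minimal Python sketch of assembling such a string. The helper name <code>entrez_filter</code> is hypothetical, and the <code>ENTREZ_QUERY</code> parameter name is assumed from NCBI's public BLAST URL API; this page does not describe how geWorkbench itself transmits the restriction.

```python
from urllib.parse import urlencode


def entrez_filter(**fields):
    """Join field-tagged terms with AND (hypothetical helper).

    Each keyword becomes an Entrez field tag, e.g.
    organism="Mus musculus" -> "Mus musculus[organism]".
    """
    return " AND ".join(f"{value}[{field}]" for field, value in fields.items())


query = entrez_filter(organism="Mus musculus", properties="biomol_mrna")
print(query)

# Assumption: ENTREZ_QUERY is the request parameter that carries the restriction.
print(urlencode({"ENTREZ_QUERY": query}))
```

Keeping the terms as keyword arguments makes it easy to add or drop a restriction without editing the query string by hand.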
 +
 
 +
===Genetic Code===
 +
Genetic code to be used in blastx (and tblastx) translation of the query.
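
To illustrate what the genetic code setting controls, here is a small Python sketch that translates a nucleotide query in all six reading frames using the standard genetic code. It is an illustration only: the function names are hypothetical, only the standard code is included, and the translation is performed by the NCBI server rather than by this code.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Translate the three forward and three reverse reading frames."""
    seq = seq.upper()
    rev = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCCTAA").items():
    print(frame, protein)
```

Choosing a different genetic code would amount to swapping the <code>AMINO_ACIDS</code> string for the one belonging to that code.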
 +
 
 +
==Algorithm Parameters==
 +
===General Parameters===
 +
* '''Max target sequences''' - maximum number of hits to return.
 +
* '''Automatically adjust parameters for short input sequences'''.
 +
* '''Expect threshold''' - Expected number of chance matches in a random model.
 +
* '''Word size''' - The length of the seed that initiates an alignment.
 +
* '''Max matches in a query range''' - Limit the number of matches to a query range. This option is useful if many strong matches to one part of a query may prevent BLAST from presenting weaker matches to another part of the query.
 +
** Note - NCBI reports that this feature may not work if there are a large number of full-length sequence matches in the chosen database.
 +
** The function of this feature is described in [http://www.ncbi.nlm.nih.gov/pubmed/10890403 Berman et al., 2000].
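
As background for the Expect threshold: in the standard BLAST statistics, the expected number of chance matches for an alignment with bit score S' is approximately E = m * n * 2^(-S'), where m is the query length and n is the total database length. The Python sketch below applies that relation. The function names are hypothetical, and the calculation ignores the effective-length corrections that BLAST itself applies, so its values only approximate those reported by NCBI.

```python
def expected_chance_hits(bit_score, query_length, database_length):
    """Approximate E-value: E = m * n * 2**(-S') for bit score S'."""
    return query_length * database_length * 2.0 ** (-bit_score)


def passes_threshold(bit_score, query_length, database_length, expect_threshold):
    """True if the hit would be kept at the given Expect threshold."""
    e_value = expected_chance_hits(bit_score, query_length, database_length)
    return e_value <= expect_threshold


# Illustrative numbers only: a 1,000-letter query against 1,000,000 database letters.
for score in (30, 60):
    e = expected_chance_hits(score, 1000, 1_000_000)
    print(score, e, passes_threshold(score, 1000, 1_000_000, 0.01))
```

Lowering the threshold keeps only hits that are less likely to have arisen by chance, at the cost of discarding weak but possibly real matches.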
 +
 
 +
===Scoring Parameters===
 +
* '''Match/mismatch scores''' (blastn, megablast, discontiguous megablast) - scores to use for a match or mismatch.
 +
* '''Matrix''' - Various scoring matrices (BLOSUM, PAM) are available for protein and translated queries. 
 +
* '''Gap Costs''' - The pull-down menu shows the available choices of gap costs for the current scoring matrix.
 +
* '''Compositional adjustments''' -  "...takes into account the amino acid composition of the individual database sequences involved in reported alignments. This improves E-value accuracy, thereby reducing the number of false positive results."
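
The effect of the match/mismatch setting can be seen by scoring a short ungapped nucleotide alignment by hand. The following Python sketch does this; the function name is hypothetical, the reward and penalty values are illustrative rather than the defaults of any particular algorithm, and gap costs are not modeled.

```python
def ungapped_score(query, target, match, mismatch):
    """Sum the match reward or mismatch penalty over aligned positions."""
    if len(query) != len(target):
        raise ValueError("an ungapped alignment needs equal-length sequences")
    return sum(match if q == t else mismatch for q, t in zip(query, target))


# Three matches and one mismatch under two illustrative scoring schemes.
print(ungapped_score("ACGT", "ACGA", match=1, mismatch=-2))
print(ungapped_score("ACGT", "ACGA", match=2, mismatch=-3))
```

A harsher mismatch penalty relative to the match reward favors alignments between highly similar sequences.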
 +
 
 +
===Filters and Masking===
 
* '''Low Complexity''' - filter out low compositional complexity sequence.
 
* '''Low Complexity''' - filter out low compositional complexity sequence.
* '''Mask lower case''' - filter out sequence which is in lower case.
+
* '''Species-specific repeats filter''' - masks species-specific repeats (e.g. human LINEs and SINEs).  This option can speed searches involving long query sequences or databases containing sequences with many repeats.
* '''Mask for lookup table only''' - masks low-complexity sequence only while constructing the lookup table used by the intial hit-find phase of BLAST. The second phase, hit extension, is not not affected and hits can be extended through low-complexity sequence.  NCBI notes that this option is experimental and subject to change.
+
* '''Mask for lookup table only''' - masks low-complexity sequence only while constructing the lookup table used by the initial hit-finding phase of BLAST. The second phase, hit extension, is not affected, and hits can be extended through low-complexity sequence.  NCBI notes that this option is experimental and subject to change.
* '''Human repeats filter''' - masks human repeats (LINE's and SINE's).  This option can speed searches involving long query sequences or databases containing sequences with many repeats.
+
* '''Mask lower case letter''' - filter out sequence which is in lower case in the FASTA query sequence.
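
The lower-case mask relies on a convention in the query file: residues typed in lower case mark the regions to be filtered. The Python sketch below locates those regions and shows one way to mask them. The function names and the choice of <code>N</code> as the mask character are assumptions for illustration; how the NCBI server handles the masked regions internally is not shown.

```python
import re


def lowercase_regions(sequence):
    """Return (start, end) half-open intervals of lower-case runs."""
    return [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", sequence)]


def mask_lowercase(sequence, mask_char="N"):
    """Replace every lower-case residue with the mask character."""
    return re.sub(r"[a-z]", mask_char, sequence)


query = "ACGTacgtacgtGGCC"
print(lowercase_regions(query))
print(mask_lowercase(query))
```

For a protein query the same idea applies with a different mask character, for example <code>mask_lowercase(seq, "X")</code>.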
 +
 
 +
===Discontiguous Word Options===
 +
 
 +
Please see the [http://www.ncbi.nlm.nih.gov/blast/discontiguous.shtml NCBI page on discontiguous megablast] for a detailed explanation of these options.
 +
 
 +
* '''Template Length'''  
 +
* '''Template Type'''
 +
 
 +
===Other===
 
* '''Display result in your web browser''' - geWorkbench will display the HTML page returned by NCBI BLAST in your web browser as well as within its own display.
 
* '''Display result in your web browser''' - geWorkbench will display the HTML page returned by NCBI BLAST in your web browser as well as within its own display.
 +
* '''Restore defaults''' - restore all settings for the currently selected algorithm to their default values.
 +
 +
 +
===Algorithm Parameter Setting Defaults===
 +
The default settings for each query type were taken from the NCBI BLAST website.
 +
 +
Please note that options and default settings on the NCBI BLAST website are subject to change at any time.
 +
 +
The user should verify all settings are appropriate for his or her particular BLAST query.
 +
 +
====blastn - megablast====
 +
 +
[[Image:BLAST_Parameters_Main-blastn.png]]
  
Please see the [http://www.ncbi.nlm.nih.gov/blast/html/blastcgihelp.html NCBI BLAST Help page] for further details on these options.
 
  
 +
[[Image:BLAST_Parameters_blastn-megablast.png|{{ImageMaxWidth}}]]
  
[[image:T_SequenceAlignment_BLAST_AdvancedParams.png]]
+
====blastn - discontiguous megablast====
  
==General controls==
+
[[Image:BLAST_Parameters_blastn-discontinuous-megablast.png|{{ImageMaxWidth}}]]
  
* '''All Markers''' - if selected, use all sequences loaded, overriding any activated sets in the Marker Sets component.
+
====blastn - blastn====
* '''Total Sequence Number''' - indicates how many sequences have been selected for query.
 
* '''Curling arrow''' - start BLAST search
 
* '''Stop sign''' - stop BLAST search (if pushed, geWorkbench will not wait for or retrieve the BLAST results).
 
  
 +
[[Image:BLAST_Parameters_blastn.png|{{ImageMaxWidth}}]]
  
=BLAST results viewer=
+
====blastp====
  
When the results are returned they are placed in the Project Folders component as a child of the sequence they correspond to.  You can mouse over the result set to see how many sequences are in it.
+
[[Image:BLAST_Parameters_Main-blastp.png]]
  
  
[[Image:T_SequenceAlignment_BlastInProject.png]]
 
  
 +
[[Image:BLAST_Parameters_blastp.png|{{ImageMaxWidth}}]]
  
The result file will be opened in your web browser, if this option was selected.  The alignment results are also displayed directly in the geWorkbench BLAST results viewer and can be further manipulated.  Each different target hit is listed on a line in the results table.  Note that a query sequence can hit a database target sequence in more than one place, resulting in multiple alignments displayed per target hit.
+
====blastx====
  
In the Blast results viewer you can select sequences to add back to the main project by checking the include box and then the '''Add Selected Sequences To Your Project''' button.
+
[[Image:BLAST_Parameters_Main-blastx.png]]
  
You can also add just the aligned parts by clicking on the button '''Only Add Aligned Parts'''.
 
  
The results viewer also shows statisics for each hit, including the E-value, start and length of the hit, and the percent identity.
+
[[Image:BLAST_Parameters_blastx.png]]
  
In the pane at left in the picture below, the name of the input query sequence is shown, e.g. the gi number of a Genbank sequence.  '''Note that the "gi" designator is not shown''', though it should be.  If there had been more than one, then this would show a list of names, and you could select for which input query you wish to view the alignment results.
 
  
 +
====tblastx====
  
[[Image:T_SequenceAlignment_BLAST_results.png]]
+
[[Image:BLAST_Parameters_Main-tblastx.png]]
  
  
 +
[[Image:BLAST_Parameters_tblastx.png]]
  
In the Blast results viewer, the '''Load''' button allows one to load an external Blast file in HTML format into the viewer.
+
====tblastn====
  
 +
[[Image:BLAST_Parameters_Main-tblastn.png]]
  
  
==Submitting a BLAST job==
+
[[Image:BLAST_Parameters_tblastn.png|{{ImageMaxWidth}}]]
  
* Press the curved arrow submit button.  The adjacent Stop button will terminate the search (geWorkbench will not wait for or retrieve the BLAST results).
+
==Analyze==
  
[[Image:T_SequenceAlignment_BLAST_start_stop.png]]
+
The BLAST analysis is launched by pushing the '''Analyze''' button.  A dialog with a progress bar will appear.  The analysis can be canceled by pushing the '''Cancel''' button on this dialog.
  
* Once the search has been submitted, a progress bar in the "Main" tab will indicate first that the sequence is being uploaded and then that the job is running.
 
  
 +
[[Image:BLAST_Analysis_Progress_Dialog.png]]
  
[[Image:T_SequenceAlignment_BLAST_running.png]]
+
=BLAST Results Viewer=
  
=Example: Running a BLAST search=
+
When the BLAST search results are returned they are placed in a new node in the [[Workspace|Workspace]] as a child of the query sequence used.  Mousing over the result set will show how many sequences are in it.
Two Genbank Fasta sequence files are provided in the tutorial dataset, a nucleotide sequence, "NM _024426-Wilms.Fasta", and its protein sequence, "NP_077744-Wilms.fasta".
+
 
 +
 
 +
[[Image:Blast_result_projects_folder.png]]
 +
 
 +
Each distinct hit is listed on its own line in the results table, shown below.  Note that a query sequence can hit a database target sequence in more than one place, resulting in multiple alignments displayed per target hit.  The results viewer also shows statistics for each hit, including the E-value, start position and length of the hit, and the percent identity.
 +
 
 +
If the "Display result in your web browser" option was chosen, then the browser will open with the HTML-formatted results.
 +
 
 +
In the pane at left in the picture below, the name of the input query sequence is shown, e.g. the gi number of a Genbank sequence.  If there had been more than one query sequence, then this pane would show a list of query sequence names, allowing you to select the results to be viewed.
 +
 
 +
[[Image:Blast_Wilms_NM_result.png|{{ImageMaxWidth}}]]
 +
 
 +
 
 +
==Controls==
 +
===Within the list of returned hits===
 +
* '''Include''' check boxes - when checked, selects these sequences for import into the [[Workspace|Workspace]].
 +
** '''''Note''''' - Starting with geWorkbench 2.4.0, sequence hits from more than one query sequence can be included in a single sequence set imported back into the [[Workspace|Workspace]].  Move between different result sets by selecting the desired query sequences one at a time at left in the results window, and then within each result set select the desired sequences.  A warning will appear when this option is used, to ensure that the user really intended to include results from multiple hits.
 +
 
 +
[[Image:BLAST_multiple_selected_warning.png]]
 +
 
 +
===At the bottom of the pane===
 +
* '''Reset''' - uncheck all "Include" boxes.
 +
* '''Select All''' - mark as checked all the "Include" boxes.
 +
* '''Add Complete Sequences to Workspace''' - for each hit whose "Include" box is checked, add its full sequence to a sequence node in the [[Workspace|Workspace]].  This option performs a retrieval query against the NCBI database to fetch the full sequence corresponding to each selected hit, not just the aligned portions. 
 +
** Technical note - The query uses the integer Entrez database id (e.g. GI number) for a sequence, and this id will be reflected in the FASTA-format sequence entry returned as a sequence node to the Workspace.
 +
* '''Only Add Aligned Parts''' - for each hit whose "Include" box is checked, add to the [[Workspace|Workspace]] only the portion(s) of its sequence which aligned with the query sequence.  The new sequence node will contain one sequence for each aligned region.
 +
** Technical note - these sequences will be displayed with the accession number, as was the original hit. 
 +
*** The tag "---PARTIALLY INCLUDED" will also be appended to the accession number in the sequence node.  
 +
*** If there are sub-sequences in the hit, they will be indicated with (''n'') appended just after the accession number, where ''n'' is the number of the sub-sequence, e.g. "(1)---PARTIALLY INCLUDED".
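
The naming convention for aligned parts described above can be sketched in a few lines of Python. This is an illustration of the label format only, not geWorkbench code; the function names are hypothetical, and the accession used is the example hit discussed later on this page.

```python
import re

TAG = "---PARTIALLY INCLUDED"


def aligned_part_label(accession, part=None):
    """Build the label for an aligned part, with an optional sub-sequence number."""
    suffix = f"({part})" if part is not None else ""
    return f"{accession}{suffix}{TAG}"


def parse_label(label):
    """Split a label back into (accession, sub-sequence number or None)."""
    pattern = r"(?P<acc>.+?)(?:\((?P<part>\d+)\))?" + re.escape(TAG)
    match = re.fullmatch(pattern, label)
    if match is None:
        raise ValueError(f"not an aligned-part label: {label!r}")
    part = match.group("part")
    return match.group("acc"), int(part) if part is not None else None


label = aligned_part_label("DB442323.1", 1)
print(label)
print(parse_label(label))
```

Parsing the label back into its parts is useful when the original accession is needed for a later lookup.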
 +
 
 +
===In the query list===
 +
* '''Search''' - input a text search to find entries in the list of queries.
 +
* '''Find Next''' - search for the next occurrence of the entered text.
 +
 
 +
==Adding selected sequence hits to the Workspace==
 +
The sequences corresponding to individual hits in the BLAST search can be retrieved from NCBI and added to the [[Workspace|Workspace]].
 +
 
 +
[[Image:BLAST_Wilms_NM_select_mus.png|{{ImageMaxWidth}}]]
 +
 
 +
Here we select a particular hit, by checking the "include" box next to it, and press the button "Add Selected Sequences to Workspace". The sequence is retrieved and placed into the [[Workspace|Workspace]] as shown below:
 +
 
 +
[[Image:BLAST_Wilms_NM_select_mus_added.png]]
 +
 
 +
 
 +
==Complete vs Aligned Sequences in the Workspace==
  
For a simple search using the nucleotide query file, one can select the '''blastn''' program and search against the '''ncbi/nt''' non-redundant database of nucleotide sequencesFor an even quicker example search, one could run the protein query sequence against a small protein database derived from those sequences found in the PDB database of proteins having known structures.
+
The Wilms tumor sequence was queried against the Human EST database.  The final hit, with accession DB442323.1, shows the kind of difference that can occur between retrieving a complete sequence (first picture) and only the aligned parts of a sequence (second picture).
 
 
Here we will illustrate a search using the nucleotide file "NM _024426-Wilms.Fasta".
 
  
* Read the "NM _024426-Wilms.Fasta" data file into the Project component using the '''File Open''' command and file type '''FASTA'''.
 
  
* In the Project component, make sure the sequence file just read in is selected.  This will activate those components that can work with sequence data.
+
Complete sequence for hit:
  
* In the Commands Area click on the '''Sequence Alignment''' tab.
 
  
* Select the '''BLAST''' tab.
+
[[Image:BLAST_humEST_complete_seq.png]]
  
The length of the sequence is shown, and if desired a subset of the input sequence can be specified for use in the search.  If more than one sequence was read in, the length of the longest is displayed.
 
  
* Click on the drop down arrow and select a program. Since this is a nucelotide query, here we select a nucleotide query program '''blastn'''.
 
  
* Select the desired nucleotide database.  Here select  '''ncbi/nt''' - the complete non-redundant nucleotide database.  For a faster search, one could select the ncbi/pdbnt database instead, which is much smaller.
+
Aligned part of hit only:
  
'''Note:''' The text field at the bottom of the Sequence Alignment component shows the number of sequences that have been selected.  If you have a Fasta file that has multiple sequences, you can select the ones you want in the Markers component and activate this selection, letting you search on a subset. You can search on all sequences in a file by clicking the '''All Markers''' checkbox.
 
  
 +
[[Image:BLAST_humEST_aligned_only.png]]
  
[[Image:T_SequenceAlignment_BLAST_Main.png]]
+
=Example: Running a BLAST search=
 +
Two Genbank sequence files in FASTA format are included in the geWorkbench data/public_data folder: a nucleotide sequence, "NM _024426-Wilms.Fasta", and its protein sequence, "NP_077744-Wilms.fasta".
  
 +
For a simple search using the nucleotide query file, one can select the '''blastn/megablast''' program and search against the '''nr/nt''' database of nucleotide sequences. 
  
* Click on the '''Advanced Options''' Tab
+
* In the [[Workspace|Workspace]],
 +
** right-click on the [[Workspace|Workspace]] icon and select "Open File(s)", or
 +
** from the top-level "File" menu, select "Open->File". 
 +
* Select a file type of FASTA. 
 +
* Navigate to data/public_data within the geWorkbench distribution and select the file "NM _024426-Wilms.Fasta".
 +
* Press "Open".
 +
* Right-click on the new sequence data node.
 +
* Select Analysis->BLAST Analysis.
  
* Change the '''Expect Value''' to 0.01.  This sets the cutoff for which BLAST hits will be displayed.
 
  
* Make sure "dna mat" is selected for the '''Matrix'''.
+
* For program select '''blastn''', and leave the default algorithm choice set to megablast.
 +
* For database, select '''Nucleotide Collection (nr/nt)''' - the complete nucleotide database.
  
 +
'''Note:''' If you have a FASTA file that contains multiple sequences, you can create a set in the Markers component and use only this set for the search by activating the set (check the box by its name in the Markers component). You can override an activated Marker Set and search on all sequences in a file by clicking the '''All Markers''' checkbox.
 +
 +
* Click on the '''Algorithm Parameters''' Tab
 +
* Change the '''Expect threshold''' to 0.01.  This sets the cutoff for which BLAST hits will be displayed.
 
* Leave the '''Display result in your web browser''' checked.
 
* Leave the '''Display result in your web browser''' checked.
  
 +
* Hit the "Analyze" button.  The job will be submitted and the results returned as shown in the sections above.
 +
 +
=Technical Notes=
 +
* The NCBI BLAST server may occasionally return an error when sequences are searched from geWorkbench. The problem appears to depend on the load on the NCBI BLAST server.
 +
* When geWorkbench is asked to submit multiple sequences to the NCBI BLAST server, it will submit them one at a time and wait for the results before submitting the next sequence.  This is done to simplify the subsequent parsing and display of the results.
 +
* '''Metagenome databases''' - The Whole Genome Shotgun (WGS) databases at NCBI include the metagenome samples. The metagenomic, or "environmental", sample sequences are currently also available in the env_nt and env_nr (metagenomic) databases, but this may not always be true.  The NCBI BLAST website now supports searching metagenomic projects only through the WGS database. geWorkbench, however, does not directly support searching the metagenomic projects via the WGS database and instead continues to provide direct queries of the metagenomes via env_nt and env_nr. (Additional details in Mantis #3198, #3801).
 +
* '''Only first 100 hits displayed''' - geWorkbench only displays the first 100 hits returned by BLAST.  The complete set of hits is shown in the web browser if this option is checked on the Algorithm Parameters tab.
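
Two of the notes above, one-at-a-time submission and the 100-hit display limit, can be summarized in a short Python sketch. It is a schematic of the described behavior, not geWorkbench's implementation: <code>run_sequentially</code> is a hypothetical name, and <code>submit</code> stands in for whatever call sends one sequence to the server and blocks until its hits are returned.

```python
DISPLAY_LIMIT = 100  # only the first 100 hits are displayed per query


def run_sequentially(queries, submit):
    """Submit queries one at a time and keep the displayable hits for each.

    queries: iterable of (name, sequence) pairs.
    submit:  callable that takes a sequence and returns its list of hits,
             returning only once the results are available.
    """
    displayed = {}
    for name, sequence in queries:
        hits = submit(sequence)  # the next query is not sent until this returns
        displayed[name] = hits[:DISPLAY_LIMIT]
    return displayed


def fake_submit(sequence):
    """Stand-in for a real server call: returns 150 placeholder hits."""
    return [f"hit{i}" for i in range(150)]


results = run_sequentially([("q1", "ACGT"), ("q2", "GGCC")], fake_submit)
print({name: len(hits) for name, hits in results.items()})
```

Because each result set is keyed by its own query, parsing and display stay simple, which is the motivation given in the note above.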
 +
 +
=References=
 +
* Berman P, Zhang Z, Wolf YI, Koonin EV, Miller W. (2000) Winnowing sequences from a database search. J Comput Biol. 7(1-2):293-302.
  
[[Image:T_SequenceAlignment_BLAST_AdvancedParams.png]]
+
* The NCBI BLAST site provides a comprehensive [http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=References list of references].

Latest revision as of 23:41, 26 January 2015



Overview

"The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families" (quoted from the NCBI BLAST homepage).

geWorkbench submits BLAST jobs to the NCBI server. NCBI-supported sequence databases and search algorithms can be selected in the user interface. Since release 2.1.0, geWorkbench supports almost all BLAST setting options available through the NCBI web interface.

Please note that although, in geWorkbench, we have adopted the default settings for each BLAST algorithm as seen on the NCBI website, those settings are subject to change at any time by NCBI. Before submitting a BLAST job from geWorkbench, the user should verify that the parameter settings are appropriate for their query.


The BLAST analysis is available when a protein or DNA sequence is loaded and selected in the Workspace. The BLAST Analysis and BLAST Results Viewer must be loaded in the Component Configuration Manager.


BLAST Parameters Main-blastn-full.png

Figure legend: BLAST Main Parameters. A nucleotide sequence has been loaded into the Workspace.

NCBI Documentation

geWorkbench serves as an interface to the NCBI BLAST server and implements the same options as the NCBI BLAST website. Detailed information about each option can be found on NCBI webpages, including:

Older help pages...


BLAST job setup

Prerequisites

Query sequences

BLAST accepts nucleotide or amino-acid query sequences in the FASTA format. A query file can contain one or multiple sequences. The file can be loaded from disk using the File Open command, or may have been placed into the Workspace by another component such as the Sequence Retriever, or added from the results of a previous BLAST run.
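
As an illustration of the FASTA input described above, the following Python sketch reads one or more records from FASTA text and makes a rough guess at whether each is a nucleotide or a protein sequence. The function names are hypothetical and the type check is a simple alphabet heuristic; geWorkbench's own loader is not shown here.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def guess_type(sequence):
    """Heuristic: only A, C, G, T, U or N means nucleotide, otherwise protein."""
    return "nucleotide" if set(sequence.upper()) <= set("ACGTUN") else "protein"


example = ">seq1 example nucleotide\nACGTAC\nGGTA\n>seq2 example protein\nMKVLAA\n"
for header, sequence in read_fasta(example):
    print(header, len(sequence), guess_type(sequence))
```

Checking the sequence type before choosing an algorithm helps avoid submitting, for example, a protein query to blastn.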

Invoking BLAST

BLAST is a normal geWorkbench analysis component and can be invoked either by right-clicking on a sequence node in the Workspace, or through the Commands entry in the Menu Bar at the top of the geWorkbench GUI.


Invoking BLAST via right-clicking on a sequence node in the Workspace:


BLAST Analysis invocation.png


Invoking BLAST via the Commands menu. The desired sequence node must be selected first in the Workspace:


BLAST Analysis Command invocation.png

Parameters - Main

In addition to the parameters shown in the previous image, BLASTX and TBLASTX add the Genetic Code option:

BLAST Parameters Main-blastx.png

Algorithms

The user must make sure that the algorithm chosen matches the type of query sequence (protein or nucleotide) that has been loaded. Some of the algorithms translate a nucleotide query, a nucleotide database, or both into amino acid sequence before executing the query. Searching in the amino-acid space is more sensitive for certain types of query, as it ignores synonymous, non-functional changes in nucleotide sequence.

For protein query sequences:

  • blastp - Compares an amino acid query sequence against a protein sequence database.
  • tblastn - Compares an amino acid query sequence against a nucleotide database translated in all reading frames.

For nucleotide query sequences:

  • blastn - Three algorithms are available under the blastn choice, each tuned to a different degree of similarity between the query and the target sequences. Each compares a nucleotide query sequence against a nucleotide sequence database:
    • megablast - optimized for highly similar sequences.
    • discontiguous megablast - optimized for more dissimilar sequences.
    • blastn - optimized for somewhat similar sequences.
  • blastx - Compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
  • tblastx - Compares the 6-frame translations of a nucleotide query sequence against the 6-frame translations of a nucleotide sequence database.
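The pairing of query type and algorithm described above can be summarized as a small lookup table. The sketch below is only an illustration of the rules in this section, not geWorkbench's implementation; the names `PROGRAMS`, `programs_for` and `check_program` are hypothetical.

```python
# Programs listed in this section. Each entry records the expected query
# type, the database type, and which side(s) are translated to amino acids.
PROGRAMS = {
    "blastp": {"query": "protein", "database": "protein", "translates": ()},
    "tblastn": {"query": "protein", "database": "nucleotide", "translates": ("database",)},
    "blastn": {"query": "nucleotide", "database": "nucleotide", "translates": ()},
    "blastx": {"query": "nucleotide", "database": "protein", "translates": ("query",)},
    "tblastx": {"query": "nucleotide", "database": "nucleotide", "translates": ("query", "database")},
}


def programs_for(query_type):
    """Return the program names that accept the given query type."""
    return sorted(name for name, info in PROGRAMS.items() if info["query"] == query_type)


def check_program(program, query_type):
    """Raise ValueError if the program does not match the loaded query type."""
    info = PROGRAMS.get(program)
    if info is None:
        raise ValueError(f"unknown program: {program}")
    if info["query"] != query_type:
        raise ValueError(f"{program} expects a {info['query']} query, got {query_type}")
    return info
```

For example, `programs_for("protein")` lists blastp and tblastn, while `check_program("blastp", "nucleotide")` raises an error.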

Databases

Standard protein and nucleic acid databases maintained at NCBI are supported. The appropriate databases for the search algorithm chosen will be displayed. A window to the right, "Database Details", displays a summary of the database currently selected in the list.

For nucleic acids:

  • Nucleotide Collection (nr/nt)
  • Reference RNA sequences (refseq_rna)
  • Reference genomic sequences (refseq_genomic)
  • NCBI Genomes (chromosome)
  • Expressed Sequence Tags (est)
  • Human subset of EST (est_human)
  • Mouse subset of EST (est_mouse)
  • Non-human, non-mouse ESTs (est_others)
  • Genomic survey sequences (gss)
  • High throughput genomic sequences (htgs)
  • Patent sequences (pat)
  • Protein Data Bank (pdb)
  • Human ALU repeat elements (alu)
  • Sequence tagged sites (dbsts)
  • Whole genome shotgun contigs (wgs)
  • Metagenomic Samples (env_nt)
  • Transcriptome Shotgun Assembly (tsa_nt)

For proteins:

  • Non-redundant protein sequences (nr)
  • Reference proteins (refseq_protein)
  • UniProtKB/Swiss-prot (swissprot)
  • Patented protein sequences (pat)
  • Protein Data Bank sequences (pdb)
  • Metagenomic proteins (env_nr)
  • Transcriptome Shotgun Assembly (tsa_nr)

Search Choice

These settings restrict the search in the following ways:

  • Exclude: exclude certain specialized database entries from the search.
    • Models (XM/XP)
    • Uncultured/environmental sequences
  • Entrez Query - an Entrez query can be entered directly to restrict a search, e.g. to a particular species. Please see NCBI BLAST help. An example from that page is "Mus musculus[organism] AND biomol_mrna[properties]", which limits the search to mouse mRNA entries in the database.

Genetic Code

Genetic code to be used in blastx (and tblastx) translation of the query.
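To make "translated in all reading frames" concrete, here is a minimal Python sketch of a six-frame translation using the standard genetic code (NCBI translation table 1). It is an illustration only: BLAST performs the translation on the server, other genetic codes use different tables, and the function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code (NCBI translation table 1). Codons are ordered
# TTT, TTC, TTA, TTG, TCT, ... with each position cycling through T, C, A, G.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Return translations of the three forward and three reverse frames."""
    forward = seq.upper()
    reverse = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


frames = six_frames("ATGGCCTAA")
```

Here frame +1 reads ATG GCC TAA, a methionine, an alanine and a stop codon (shown as "*").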

Algorithm Parameters

General Parameters

  • Max target sequences - maximum number of hits to return.
  • Automatically adjust parameters for short input sequences.
  • Expect threshold - Expected number of chance matches in a random model.
  • Word size - The length of the seed that initiates an alignment.
  • Max matches in a query range - Limit the number of matches to a query range. This option is useful if many strong matches to one part of a query may prevent BLAST from presenting weaker matches to another part of the query.
    • Note - NCBI reports that this feature may not work if there are a large number of full-length sequence matches in the chosen database.
    • The function of this feature is described in Berman et al., 2000.
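The effect of the Expect threshold and Max target sequences settings can be pictured with a simplified model. In the Python sketch below, the `Hit` class, the accession strings and the E-values are all hypothetical, and the filtering is applied after the fact; the NCBI server applies these limits during the search itself, so this is not a description of its implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    accession: str  # hypothetical identifier of a database sequence
    evalue: float   # expected number of chance matches this good or better


def apply_general_limits(hits: List[Hit], expect_threshold: float,
                         max_target_sequences: int) -> List[Hit]:
    """Keep hits at or below the E-value cutoff, best first, up to the cap."""
    kept = [h for h in hits if h.evalue <= expect_threshold]
    kept.sort(key=lambda h: h.evalue)
    return kept[:max_target_sequences]


hits = [Hit("A1", 1e-50), Hit("B2", 0.5), Hit("C3", 1e-5), Hit("D4", 0.009)]
reported = apply_general_limits(hits, expect_threshold=0.01, max_target_sequences=2)
```

A lower Expect threshold is more stringent: with a cutoff of 0.01, the hit with E-value 0.5 is dropped before the cap on the number of targets is applied.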

Scoring Parameters

  • Match/mismatch scores (blastn, megablast, discontiguous megablast) - scores to use for a match or mismatch.
  • Matrix - Various scoring matrices (BLOSUM, PAM) are available for protein and translated queries.
  • Gap Costs - The pull-down menu shows the available choices of gap costs for the current scoring matrix.
  • Compositional adjustments - "...takes into account the amino acid composition of the individual database sequences involved in reported alignments. This improves E-value accuracy, thereby reducing the number of false positive results."

Filters and Masking

  • Low Complexity - filter out low compositional complexity sequence.
  • Species-specific repeats filter - masks species-specific repeats (e.g. human LINEs and SINEs). This option can speed searches involving long query sequences or databases containing sequences with many repeats.
  • Mask for lookup table only - masks low-complexity sequence only while constructing the lookup table used by the initial hit-finding phase of BLAST. The second phase, hit extension, is not affected, and hits can be extended through low-complexity sequence. NCBI notes that this option is experimental and subject to change.
  • Mask lower case letter - filter out sequence which is in lower case in the FASTA query sequence.

Discontiguous Word Options

Please see the NCBI page on discontiguous megablast for a detailed explanation of these options.

  • Template Length
  • Template Type

Other

  • Display result in your web browser - geWorkbench will display the HTML page returned by NCBI BLAST in your web browser as well as within its own display.
  • Restore defaults - restore all settings for the currently selected algorithm to their default values.


Algorithm Parameter Setting Defaults

The default settings for each query type were taken from the NCBI BLAST website.

Please note that options and default settings on the NCBI BLAST website are subject to change at any time.

The user should verify all settings are appropriate for his or her particular BLAST query.

blastn - megablast

BLAST Parameters Main-blastn.png


BLAST Parameters blastn-megablast.png

blastn - discontiguous megablast

BLAST Parameters blastn-discontinuous-megablast.png

blastn - blastn

BLAST Parameters blastn.png

blastp

BLAST Parameters Main-blastp.png


BLAST Parameters blastp.png

blastx

BLAST Parameters Main-blastx.png


BLAST Parameters blastx.png


tblastx

BLAST Parameters Main-tblastx.png


BLAST Parameters tblastx.png

tblastn

BLAST Parameters Main-tblastn.png


BLAST Parameters tblastn.png

Analyze

The BLAST analysis is launched by pushing the Analyze button. A dialog with a progress bar will appear. The analysis can be canceled by pushing the Cancel button on this dialog.


BLAST Analysis Progress Dialog.png

BLAST Results Viewer

When the BLAST search results are returned they are placed in a new node in the Workspace as a child of the query sequence used. Mousing over the result set will show how many sequences are in it.


Blast result projects folder.png

Each hit is listed on its own line in the results table, shown below. Note that a query sequence can hit a database target sequence in more than one place, resulting in multiple alignments displayed per target hit. The results viewer also shows statistics for each hit, including the E-value, start position and length of the hit, and the percent identity.
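As a rough illustration of the percent identity statistic, the Python sketch below counts identical positions in a pair of aligned strings and divides by the alignment length, gaps included. The function name and the example alignment are hypothetical; geWorkbench displays the value reported by NCBI rather than computing it this way.

```python
def percent_identity(aligned_query: str, aligned_subject: str) -> float:
    """Percentage of alignment columns in which both sequences agree.

    Both strings must be the same length; '-' marks a gap and never
    counts as a match.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    matches = sum(
        1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-"
    )
    return 100.0 * matches / len(aligned_query)


# Hypothetical alignment: 4 identical columns out of 6.
identity = percent_identity("ACGT-A", "ACCTTA")
```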

If the "Display result in your web browser" option was chosen, then the browser will open with the HTML-formatted results.

In the pane at left in the picture below, the name of the input query sequence is shown, e.g. the gi number of a GenBank sequence. If there had been more than one query sequence, then this pane would show a list of query sequence names, allowing you to select the results to be viewed.

Blast Wilms NM result.png


Controls

Within the list of returned hits

  • Include check boxes - when checked, selects these sequences for import into the Workspace.
    • Note - Starting with geWorkbench 2.4.0, sequence hits from more than one query sequence can be included in a single sequence set imported back into the Workspace. Move between different result sets by selecting the desired query sequences one at a time at left in the results window, and then within each result set select the desired sequences. A warning will appear when this option is used, to ensure that the user really intended to include results from multiple hits.

BLAST multiple selected warning.png

At the bottom of the pane

  • Reset - uncheck all "Include" boxes.
  • Select All - mark as checked all the "Include" boxes.
  • Add Complete Sequences to Workspace - for each hit whose "Include" box is checked, add its full sequence to a sequence node in the Workspace. This option performs a retrieval query against the NCBI database to fetch the full sequence corresponding to each selected hit, not just the aligned portions.
    • Technical note - The query uses the integer Entrez database id (e.g. GI number) for a sequence, and this id will be reflected in the FASTA-format sequence entry returned as a sequence node to the Workspace.
  • Only Add Aligned Parts - for each hit whose "Include" box is checked, add to the Workspace only the portion(s) of its sequence which aligned with the query sequence. The new sequence node will contain one sequence for each aligned region.
    • Technical note - these sequences will be displayed with the accession number, as was the original hit.
      • The tag "---PARTIALLY INCLUDED" will also be appended to the accession number in the sequence node.
      • If there are sub-sequences in the hit, they will be indicated with (n) appended just after the accession number, where n is the number of the sub-sequence, e.g. "(1)---PARTIALLY INCLUDED".
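The labeling convention for aligned parts can be sketched as follows. This Python snippet only illustrates the pattern described in the technical note; the function name is hypothetical, and the choice to omit "(n)" when a hit has a single aligned region is an assumption, not something the note states.

```python
SUFFIX = "---PARTIALLY INCLUDED"


def aligned_part_labels(accession: str, n_regions: int) -> list:
    """Build display labels for the aligned regions of one hit."""
    if n_regions < 1:
        raise ValueError("a hit must have at least one aligned region")
    if n_regions == 1:
        # Assumption: a single region carries no (n) counter.
        return [accession + SUFFIX]
    return [f"{accession}({n}){SUFFIX}" for n in range(1, n_regions + 1)]


labels = aligned_part_labels("DB442323.1", 2)
```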

In the query list

  • Search - input a text search to find entries in the list of queries.
  • Find Next - search for the next occurrence of the entered text.

Adding selected sequence hits to the Workspace

The sequences corresponding to individual hits in the BLAST search can be retrieved from NCBI and added to the Workspace.

BLAST Wilms NM select mus.png

Here we select a particular hit, by checking the "include" box next to it, and press the button "Add Selected Sequences to Workspace". The sequence is retrieved and placed into the Workspace as shown below:

BLAST Wilms NM select mus added.png


Complete vs Aligned Sequences in the Workspace

The Wilms tumor sequence was queried against the Human EST database. The final hit, with accession DB442323.1, shows the kind of difference that can occur between retrieving a complete sequence (first picture) and only the aligned parts of a sequence (second picture).


Complete sequence for hit:


BLAST humEST complete seq.png


Aligned part of hit only:


BLAST humEST aligned only.png

Example: Running a BLAST search

Two GenBank sequence files in FASTA format are included in the geWorkbench data/public_data folder: a nucleotide sequence, "NM _024426-Wilms.Fasta", and its protein sequence, "NP_077744-Wilms.fasta".

For a simple search using the nucleotide query file, one can select the blastn/megablast program and search against the nr/nt database of nucleotide sequences.

  • In the Workspace,
    • right-click on the Workspace icon and select "Open File(s)", or
    • from the top-level "File" menu, select "Open->File".
  • Select a file type of FASTA.
  • Navigate to data/public_data within the geWorkbench distribution and select the file "NM _024426-Wilms.Fasta".
  • Press "Open".
  • Right-click on the new sequence data node.
  • Select Analysis->BLAST Analysis.


  • For program select blastn, and leave the default algorithm choice set to megablast.
  • For database, select nr/nt - the Nucleotide Collection.

Note - If you have a FASTA file that contains multiple sequences, you can create a set in the Markers component and use only this set for the search by activating the set (check the box by its name in the Markers component). You can override an activated Marker Set and search on all sequences in a file by clicking the All Markers checkbox.

  • Click on the Algorithm Parameters Tab
  • Change the Expect threshold to 0.01. This sets the cutoff for which BLAST hits will be displayed.
  • Leave the Display result in your web browser checked.
  • Hit the "Analyze" button. The job will be submitted and the results returned as shown in the sections above.
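For readers curious how the settings chosen in this example might look as a request to NCBI, the Python sketch below builds a URL-encoded form body in the style of NCBI's Common URL API. This page does not describe how geWorkbench talks to the server, so treat everything here as an assumption: the parameter names (CMD, PROGRAM, DATABASE, QUERY, EXPECT, MEGABLAST) and the database value "nt" are given as recalled from NCBI's documentation and should be checked against it, and the function name is hypothetical. The sketch only builds the request body and does not contact NCBI.

```python
from urllib.parse import urlencode

# Public NCBI BLAST endpoint (requests would be POSTed here).
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_put_request(query_fasta, program="blastn", database="nt",
                      expect=0.01, megablast=True):
    """Return a URL-encoded body for submitting one query (not sent)."""
    params = {
        "CMD": "Put",           # submit a new search
        "PROGRAM": program,
        "DATABASE": database,   # assumed name for the nr/nt collection
        "QUERY": query_fasta,
        "EXPECT": str(expect),  # the Expect threshold used in this example
    }
    if megablast:
        params["MEGABLAST"] = "on"  # assumed flag selecting megablast
    return urlencode(params)


# Hypothetical short query, standing in for the Wilms tumor sequence.
body = build_put_request(">query\nACGT")
```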

Technical Notes

  • The NCBI BLAST server may occasionally return an error when sequences are searched from geWorkbench. The problem appears to depend on the load on the NCBI BLAST server.
  • When geWorkbench is asked to submit multiple sequences to the NCBI BLAST server, it will submit them one at a time and wait for the results before submitting the next sequence. This is done to simplify the subsequent parsing and display of the results.
  • Metagenome databases - The Whole Genome Shotgun (WGS) databases at NCBI include the metagenome samples. The metagenomic, or "environmental", sample sequences are currently also available in the env_nt and env_nr (metagenomic) databases, but this may not always be true. The NCBI BLAST website now only supports searching metagenomic projects through the WGS database. However, geWorkbench does not directly support searching the metagenomic projects via the WGS database, and instead continues to provide direct query of the metagenomes via env_nt and env_nr. (Additional details in Mantis #3198, #3801).
  • Only first 100 hits displayed - geWorkbench only displays the first 100 hits returned by BLAST. The complete set of hits is shown in the web browser if this option is checked on the Algorithm Parameters tab.
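Two of the notes above, one-at-a-time submission and the 100-hit display limit, can be pictured with the following Python sketch. It is not geWorkbench code: `run_queries_sequentially` is a hypothetical name, and `fake_submit` is a stand-in for the real submission step that simply fabricates placeholder hit names so the sketch can run without a network connection.

```python
from typing import Callable, Dict, List, Sequence, Tuple

DISPLAY_LIMIT = 100  # number of hits shown per query, per the note above


def run_queries_sequentially(
    queries: Sequence[Tuple[str, str]],
    submit_and_wait: Callable[[str, str], List[str]],
) -> Dict[str, List[str]]:
    """Submit each (name, sequence) in turn and keep the first hits."""
    results: Dict[str, List[str]] = {}
    for name, sequence in queries:
        # The call blocks until this query's results are back, so the
        # next query is not submitted before the current one finishes.
        hits = submit_and_wait(name, sequence)
        results[name] = hits[:DISPLAY_LIMIT]
    return results


def fake_submit(name: str, sequence: str) -> List[str]:
    """Stand-in for a real search: returns 150 placeholder hit names."""
    return [f"{name}-hit{i}" for i in range(150)]


results = run_queries_sequentially([("q1", "ACGT"), ("q2", "GGCC")], fake_submit)
```

Keeping the results keyed by query name mirrors the results viewer, where each query sequence has its own result set.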

References

  • Berman P, Zhang Z, Wolf YI, Koonin EV, Miller W. (2000) Winnowing sequences from a database search. J Comput Biol. 7(1-2):293-302.