"The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families" (quoted from the NCBI BLAST homepage).
geWorkbench submits BLAST jobs to the NCBI server. NCBI-supported sequence databases and search algorithms can be selected in the user interface. Since release 2.1.0, geWorkbench supports almost all BLAST setting options available through the NCBI web interface.
Please note that although, in geWorkbench, we have adopted the default settings for each BLAST algorithm as seen on the NCBI website, those settings are subject to change at any time by NCBI. Before submitting a BLAST job from geWorkbench, the user should verify that the parameter settings are appropriate for their query.
The BLAST analysis is available when a protein or DNA sequence is loaded and selected in the Project Folders component. The BLAST Analysis and BLAST Results Viewer must be loaded in the Component Configuration Manager.
Figure legend: BLAST Main Parameters. A nucleotide sequence has been loaded into the Project Folders component.
geWorkbench serves as an interface to the NCBI BLAST server and implements the same options as the NCBI BLAST website. Detailed information about each option can be found on NCBI webpages, including:
Older help pages...
BLAST job setup
- The BLAST Analysis and Viewer components must be loaded in the Component Configuration Manager.
- A protein or nucleotide sequence must be loaded in the Project Folders component.
BLAST accepts nucleotide or amino-acid query sequences in the FASTA format. A query file can contain one or multiple sequences. The file can be loaded from disk using the File Open command, or may have been placed into the Project Folders component by another component such as the Sequence Retriever, or added from the results of a previous BLAST run.
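As an illustration of the FASTA layout described above, here is a minimal Python sketch that splits FASTA text into records. It is not geWorkbench code; the function name `parse_fasta` and the two example records are made up for this illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-record query file.
example = """>seq1 hypothetical nucleotide record
ATGGCC
TTAGGA
>seq2 second hypothetical record
ATGAAA
"""
records = parse_fasta(example)
for name, seq in records:
    print(name, len(seq))
```

Each header line starts with ">" and every following line up to the next header belongs to that sequence, which is why a single file can hold several queries.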
BLAST is a normal geWorkbench analysis component and can be invoked either by right-clicking on a sequence node in the Project Folders component, or through the Commands entry in the Menu Bar at the top of the geWorkbench GUI.
Invoking BLAST via right-clicking on a sequence node in the Project Folders component:
Invoking BLAST via the Commands menu. The desired sequence node must be selected first in the Project Folders component:
Parameters - Main
In addition to the parameters shown in the previous image, BLASTX and TBLASTX add the Genetic Code option:
The user must make sure that the algorithm chosen matches the type of query sequence (protein or nucleotide) that has been loaded. Some of the algorithms translate a nucleotide query, a nucleotide database, or both into amino acid sequence before executing the query. Searching in the amino-acid space is more sensitive for certain types of query, as it ignores synonymous, non-functional changes in nucleotide sequence.
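The point about synonymous changes can be illustrated with a small Python sketch. The two DNA strings are made-up examples and the codon table is deliberately partial (only the standard-code codons used here); this is not how BLAST itself performs translation.

```python
# Partial codon table from the standard genetic code (only codons used below).
CODONS = {
    "ATG": "M",
    "GCT": "A", "GCC": "A",
    "CTG": "L", "TTA": "L",
    "AAA": "K", "AAG": "K",
}


def translate(dna):
    """Translate a DNA string in reading frame 1 using the partial table."""
    usable = len(dna) - len(dna) % 3  # drop any trailing partial codon
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, usable, 3))


# Two hypothetical sequences that differ only at synonymous positions.
seq_a = "ATGGCTCTGAAA"
seq_b = "ATGGCCTTAAAG"

nt_differences = sum(x != y for x, y in zip(seq_a, seq_b))
print(nt_differences, translate(seq_a), translate(seq_b))
```

The nucleotide sequences differ at several positions, yet both translate to the same peptide, so a comparison in amino-acid space treats them as identical.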
For protein query sequences:
- blastp - Compares an amino acid query sequence against a protein sequence database.
- tblastn - Compares an amino acid query sequence against a nucleotide database translated in all reading frames.
For nucleotide query sequences:
- blastn - three algorithms are available under the blastn choice, in order of decreasing similarity of the query to the target sequences. Each compares a nucleotide query sequence against a nucleotide sequence database:
- megablast - optimized for highly similar sequences.
- discontiguous megablast - optimized for more dissimilar sequences.
- blastn - optimized for somewhat similar sequences.
- blastx - Compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
- tblastx - Compares the 6-frame translations of a nucleotide query sequence against the 6-frame translations of a nucleotide sequence database.
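The pairing of query type and database type listed above can be summarized as a lookup table. The following Python sketch is only an illustration; the function name `candidate_programs` is hypothetical and is not part of geWorkbench or NCBI BLAST.

```python
# Program choices keyed by (query type, database type), following the list above.
PROGRAMS = {
    ("protein", "protein"): ["blastp"],
    ("protein", "nucleotide"): ["tblastn"],
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],
    ("nucleotide", "protein"): ["blastx"],
}


def candidate_programs(query_type, database_type):
    """Return the BLAST programs that fit the given query and database types."""
    try:
        return PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError("types must be 'protein' or 'nucleotide'") from None


print(candidate_programs("nucleotide", "protein"))
```

A nucleotide query against a nucleotide database has two candidates: blastn compares the sequences directly, while tblastx compares their translations.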
Standard protein and nucleic acid databases maintained at NCBI are supported. The appropriate databases for the search algorithm chosen will be displayed. A window to the right, "Database Details", displays a summary of the database currently selected in the list.
For nucleic acids:
- Nucleotide Collection (nr/nt)
- Reference mRNA sequences (refseq_rna)
- Reference genomic sequences (refseq_genomic)
- NCBI Genomes (chromosome)
- Expressed Sequence Tags (est)
- Human subset of EST (est_human)
- Mouse subset of EST (est_mouse)
- Non-human, non-mouse ESTs (est_others)
- Genomic survey sequences (gss)
- High throughput genomic sequences (htgs)
- Patent sequences (pat)
- Protein Data Bank (pdb)
- Human ALU repeat elements (alu)
- Sequence tagged sites (dbsts)
- Whole genome shotgun reads (wgs)
- Environmental Samples (env_nt)
For proteins:
- Non-redundant protein sequences (nr)
- Reference proteins (refseq_protein)
- Swissprot protein sequences (swissprot)
- Patented protein sequences (pat)
- Protein Data Bank sequences (pdb)
- Environmental samples (env_nr)
These settings provide the ability to restrict the search in certain ways:
- Exclude: exclude certain specialized database entries from the search.
- Models (XM/XP)
- Uncultured/environmental sequences
- Entrez Query - an Entrez format query can be entered directly to restrict a search e.g. to a particular species.
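As an example of an Entrez format restriction, a species limit is commonly written with the `[Organism]` field tag. The Python helper below merely assembles such a string; the function name `organism_filter` is hypothetical, and the resulting text should be checked against the NCBI Entrez help before use.

```python
def organism_filter(*species):
    """Build an Entrez query string limiting a search to the given species."""
    if not species:
        raise ValueError("at least one species name is required")
    # Quote each name and tag it with the Organism field; combine with OR.
    return " OR ".join('"{}"[Organism]'.format(name) for name in species)


query = organism_filter("Homo sapiens", "Mus musculus")
print(query)
```

The resulting string can be typed into the Entrez Query field.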
Genetic code to be used in blastx (and tblastx) translation of the query.
- Max target sequences - maximum number of hits to return.
- Automatically adjust parameters for short input sequences.
- Expect threshold - Expected number of chance matches in a random model.
- Word size - The length of the seed that initiates an alignment.
- Max matches in a query range - Limit the number of matches to a query range. This option is useful if many strong matches to one part of a query may prevent BLAST from presenting weaker matches to another part of the query.
- Note - NCBI reports that this feature may not work if there are a large number of full-length sequence matches in the chosen database.
- The function of this feature is described in Berman et al., 2000.
- Match/mismatch scores (blastn, megablast, discontiguous megablast) - scores to use for a match or mismatch.
- Matrix - Various scoring matrices (BLOSUM, PAM) are available for protein and translated queries.
- Gap Costs - The pull-down menu shows the available choices of gap costs for the current scoring matrix.
- Compositional adjustments - "...takes into account the amino acid composition of the individual database sequences involved in reported alignments. This improves E-value accuracy, thereby reducing the number of false positive results."
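To give a feel for the Expect threshold, the expected number of chance matches can be approximated from a bit score S' as E = m * n * 2^(-S'), where m is the query length and n is the total database length. The Python sketch below applies this simplified relation; it ignores the effective-length corrections that BLAST applies, and the lengths used are made-up round numbers.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


def passes_threshold(bit_score, query_length, database_length, threshold=10.0):
    """True if the approximate E-value is at or below the Expect threshold."""
    return expect_value(bit_score, query_length, database_length) <= threshold


# Hypothetical search space: 1,024-letter query, 1,048,576-letter database.
for score in (20, 30, 40):
    print(score, expect_value(score, 1024, 1048576))
```

Each additional bit of score halves the E-value, so lowering the Expect threshold (for example from 10 to 0.01) keeps only higher-scoring hits.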
Filters and Masking
- Low Complexity - filter out low compositional complexity sequence.
- Species-specific repeats filter - masks species-specific repeats (e.g. human LINEs and SINEs). This option can speed searches involving long query sequences or databases containing sequences with many repeats.
- Mask for lookup table only - masks low-complexity sequence only while constructing the lookup table used by the initial hit-find phase of BLAST. The second phase, hit extension, is not affected and hits can be extended through low-complexity sequence. NCBI notes that this option is experimental and subject to change.
- Mask lower case letter - filter out sequence which is in lower case in the FASTA query sequence.
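To see which parts of a query the lower-case option would affect, the lower-case runs in a sequence can be listed beforehand. This Python sketch only reports the regions; it does not reproduce how BLAST handles masked sequence internally, and the example sequence is made up.

```python
import re


def lowercase_regions(sequence):
    """Return (start, end) spans of lower-case runs, 0-based and end-exclusive."""
    return [match.span() for match in re.finditer(r"[a-z]+", sequence)]


# Hypothetical query in which two stretches were soft-masked in lower case.
query = "ACGTacgtACGTaa"
for start, end in lowercase_regions(query):
    print(start, end, query[start:end])
```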
Discontiguous Word Options
Please see the NCBI page on discontiguous megablast for a detailed explanation of these options.
- Template Length
- Template Type
- Display result in your web browser - geWorkbench will display the HTML page returned by NCBI BLAST in your web browser as well as within its own display.
- Restore defaults - restore all settings for the currently selected algorithm to their default values.
Algorithm Parameter Setting Defaults
The default settings for each query type were taken from the NCBI BLAST website.
Please note that options and default settings on the NCBI BLAST website are subject to change at any time.
The user should verify all settings are appropriate for his or her particular BLAST query.
blastn - megablast
blastn - discontiguous megablast
blastn - blastn
The BLAST analysis is launched by pushing the Analyze button. A dialog with a progress bar will appear. The analysis can be canceled by pushing the Cancel button on this dialog.
Analysis Controls - releases 2.2.2 and earlier
- All Markers - if selected, use all sequences loaded, overriding any activated sets in the Marker Sets component.
- Total Sequence Number - indicates how many sequences have been selected for query.
- Curling arrow - start BLAST search
- Stop sign - stop BLAST search (if pushed, geWorkbench will not wait for or retrieve the BLAST results).
BLAST Results Viewer
When the BLAST search results are returned they are placed in a new node in the Project Folders component as a child of the query sequence used. Mousing over the result set will show how many sequences are in it.
Each different hit is listed on a line in the results table, shown below. Note that a query sequence can hit a database target sequence in more than one place, resulting in multiple alignments displayed per target hit. The results viewer also shows statistics for each hit, including the E-value, start position and length of the hit, and the percent identity.
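Percent identity is conventionally the number of identical aligned positions divided by the alignment length, gaps included. The Python sketch below computes it for a pair of already-aligned strings; the function name and the example alignments are illustrative and are not taken from geWorkbench.

```python
def percent_identity(aligned_query, aligned_subject):
    """Percent of alignment columns where both sequences have the same residue."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    # A gap character ('-') never counts as a match.
    matches = sum(
        q == s and q != "-" for q, s in zip(aligned_query, aligned_subject)
    )
    return 100.0 * matches / len(aligned_query)


print(percent_identity("ACGTACGT", "ACGTACGA"))
print(percent_identity("AC-T", "ACGT"))
```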
If the "Display result in your web browser" option was chosen, then the browser will open with the HTML formatted results.
In the pane at left in the picture below, the name of the input query sequence is shown, e.g. the gi number of a Genbank sequence. If there had been more than one query sequence, then this pane would show a list of query sequence names, allowing you to select the results to be viewed.
Within the list of returned hits
- Include check boxes - when checked, selects these sequences for import into the Project Folders component.
- Note - Starting with geWorkbench 2.4.0, sequence hits from more than one query sequence can be included in a single sequence set imported back into the Project. Move between different result sets by selecting the desired query sequences one at a time at left in the results window, and then within each result set select the desired sequences. A warning will appear when this option is used, to ensure that the user really intended to include results from multiple hits.
- Reset - uncheck all "Include" boxes.
- Select All - mark as checked all the "Include" boxes.
- Add Selected Sequences to Project - for each hit whose "Include" box is checked, add its sequence to a sequence node in the Project Folders component.
- Only Add Aligned Parts - for each hit whose "Include" box is checked, add to the Project Folders component only the portion of its sequence which aligned with the query sequence.
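The difference between adding a whole hit sequence and adding only its aligned part can be pictured as a simple slice. The Python sketch below assumes a 1-based start position and a length, as BLAST reports usually give; it is an illustration rather than the geWorkbench implementation, and the sequence is made up.

```python
def aligned_part(sequence, start, length):
    """Return the sub-sequence beginning at 1-based `start` with `length` letters."""
    if start < 1 or length < 0 or start - 1 + length > len(sequence):
        raise ValueError("alignment coordinates fall outside the sequence")
    return sequence[start - 1:start - 1 + length]


# Hypothetical hit sequence with an alignment covering positions 3 to 6.
hit_sequence = "ACGTACGTAC"
print(aligned_part(hit_sequence, 3, 4))
```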
In the query list
- Search - input a text search to find entries in the list of queries.
- Find Next - search for the next occurrence of the entered text.
Adding selected sequence hits to the project
The sequences corresponding to individual hits in the BLAST search can be retrieved from NCBI and added to the Project Folders component.
Here we select a particular hit, by checking the "include" box next to it, and press the button "Add Selected Sequences to Project". The sequence is retrieved and placed into the Project Folders component as shown below:
Example: Running a BLAST search
Two Genbank sequence files in FASTA format are included in the geWorkbench data/public_data folder: a nucleotide sequence, "NM _024426-Wilms.Fasta", and its protein sequence, "NP_077744-Wilms.fasta".
For a simple search using the nucleotide query file, one can select the blastn/megablast program and search against the nr/nt database of nucleotide sequences.
- In the Project Folders component,
- right-click on a Project and select "Open File(s)", or
- from the top-level "File" menu, select "Open->File".
- Select a file type of FASTA.
- Navigate to data/public_data within the geWorkbench distribution and select the file "NM _024426-Wilms.Fasta".
- Press "Open".
- Right-click on the new sequence data node.
- Select Analysis->BLAST Analysis.
- For program select blastn, and leave the default algorithm choice set to megablast.
- For database, select nr/nt - the complete nucleotide collection.
Note - If you have a FASTA file that has multiple sequences, you can create a set in the Markers component and use only this set for the search, by activating the set (check the box by its name in the Markers component). You can override an activated Marker Set and search on all sequences in a file by clicking the All Markers checkbox.
- Click on the Algorithm Parameters Tab
- Change the Expect threshold to 0.01. This sets the cutoff for which BLAST hits will be displayed.
- Leave the Display result in your web browser checked.
- Hit the "Analyze" button. The job will be submitted and the results returned as shown in the sections above.
- The NCBI BLAST server may occasionally return an error when sequences are searched from geWorkbench. The problem appears to depend on the load on the NCBI BLAST server.
- When geWorkbench is asked to submit multiple sequences to the NCBI BLAST server, it will submit them one at a time and wait for the results before submitting the next sequence. This is done to simplify the subsequent parsing and display of the results.
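The one-at-a-time submission pattern described in the note above can be sketched in Python as follows. The `submit` and `wait_for_result` callables are hypothetical stand-ins for whatever talks to the BLAST server; the fake versions here only record the order of calls and do not contact NCBI.

```python
def run_sequentially(queries, submit, wait_for_result):
    """Submit each query only after the previous query's result has arrived."""
    results = {}
    for name, sequence in queries:
        job_id = submit(sequence)                   # send one query
        results[name] = wait_for_result(job_id)     # block until it finishes
    return results


# Fake stand-ins that record the call order (no network access).
log = []


def fake_submit(sequence):
    log.append("submit:" + sequence)
    return "job-" + sequence


def fake_wait(job_id):
    log.append("wait:" + job_id)
    return "hits for " + job_id


results = run_sequentially(
    [("q1", "ACGT"), ("q2", "TTGA")], fake_submit, fake_wait
)
print(log)
```

Because each result is collected before the next query is sent, results never need to be matched back to queries out of order, at the cost of a longer total run time for multi-sequence files.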
- Berman P, Zhang Z, Wolf YI, Koonin EV, Miller W. (2000) Winnowing sequences from a database search. J Comput Biol. 7(1-2):293-302.
- The NCBI BLAST site provides a comprehensive list of references.
- This page was last modified on 16 July 2013, at 18:51.