Difference between revisions of "BLAST"


Revision as of 13:44, 2 October 2009





Overview

The BLAST algorithm is found in the Sequence Alignment tab, located in the command area of geWorkbench (lower right quadrant). The Sequence Alignment tab appears when a protein or DNA sequence is loaded and selected in the Project Folders component. BLAST is currently the only alignment option supported.

The BLAST algorithm is used to find similarities between nucleotide or amino acid query sequences and sequences held in a database. It is often used to give clues to the function of a sequence based on its similarity to already characterized sequences.

geWorkbench runs BLAST by submitting jobs to the NCBI server. NCBI-supported sequence databases and search algorithms can be selected in the user interface (arrows). There is no provision at this time for running a local BLAST job on the client desktop machine.


T SequenceAlignment BLAST Main.png


BLAST job setup

Prerequisites

  • The Sequence Alignment component must be loaded in the geWorkbench Component Configuration Manager.
  • A protein or nucleotide sequence must be loaded in the Project Folders component.

Query sequences

BLAST accepts nucleotide or amino-acid query sequences in the FASTA format. A query file can contain one or multiple sequences. The file can be loaded from disk using the File Open command, or may have been placed into the Project Folders component by another component such as the Sequence Retriever, or as a result of a previous BLAST run.
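As a rough illustration of what a multi-sequence FASTA query file contains, the following Python sketch parses FASTA text and reports how many sequences it holds and the length of the longest one. The function name `read_fasta` and the example records are assumptions made for this illustration; they are not part of geWorkbench.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-record query file.
example = """>seq1 hypothetical example record
ATGGCGTACG
TTAGC
>seq2 hypothetical example record
ATGCC
"""

records = read_fasta(example)
sequence_count = len(records)
longest_length = max(len(seq) for _, seq in records)
```

Here `sequence_count` corresponds to the number of query sequences in the file, and `longest_length` to the longest sequence among them.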

Parameters - Main

Algorithms

The user must make sure that the algorithm chosen matches the type of query sequence (protein or nucleotide) that has been loaded. Some of the algorithms translate a nucleotide query, a nucleotide database, or both into amino acid sequence before executing the query. Searching in the amino-acid space is more sensitive for certain types of query, as it ignores synonymous, non-functional changes in nucleotide sequence.

For protein query sequences:

  • blastp - Compares an amino acid query sequence against a protein sequence database.
  • tblastn - Compares an amino acid query sequence against a nucleotide database translated in all reading frames.

For nucleotide query sequences:

  • blastn - Compares a nucleotide query sequence against a nucleotide sequence database.
  • blastx - Compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
  • tblastx - Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
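To make "all reading frames" concrete, the sketch below splits a nucleotide sequence into the codons of its six reading frames: three offsets on the given strand and three on the reverse complement. It is a simplified illustration only. The function names are assumptions, ambiguous bases are not handled, and the step that turns codons into amino acids (which needs a codon table) is left out.

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to lists of complete codons."""
    frames = {}
    for strand_label, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Step through the strand three bases at a time, dropping any
            # incomplete codon at the end.
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames[f"{strand_label}{offset + 1}"] = codons
    return frames


frames = six_frames("ATGGCCTAA")
# frames["+1"] is ["ATG", "GCC", "TAA"]; frames["-1"] is ["TTA", "GGC", "CAT"]
```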

Databases

Standard protein and nucleic acid databases maintained at NCBI are supported. The appropriate databases for the search algorithm chosen will be displayed.

For nucleic acids:

  • ncbi/nt - all non-redundant DNA sequences.
  • ncbi/pdbnt - nucleotide sequences derived from the PDB database (protein 3D structure database).
  • ncbi/yeast.nt - yeast genomic sequences.

For proteins:

  • ncbi/nr - all non-redundant protein sequences
  • ncbi/pdbaa - protein sequences from the PDB database (protein 3D structure database).
  • ncbi/swissprot - sequences from Swiss-Prot, a primary reference database.
  • ncbi/yeast.aa - translations of yeast genomic coding regions.
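The relationship between query type, program, and database type described above can be summarized in a small lookup table. The Python below is a sketch of that relationship, not geWorkbench code. The helper names are hypothetical, and it assumes that every database of the matching type is offered for every program.

```python
# (query type, database type) for each program, as described above.
PROGRAMS = {
    "blastp": ("protein", "protein"),
    "tblastn": ("protein", "nucleotide"),
    "blastn": ("nucleotide", "nucleotide"),
    "blastx": ("nucleotide", "protein"),
    "tblastx": ("nucleotide", "nucleotide"),
}

# Database names as listed in this section.
DATABASES = {
    "nucleotide": ["ncbi/nt", "ncbi/pdbnt", "ncbi/yeast.nt"],
    "protein": ["ncbi/nr", "ncbi/pdbaa", "ncbi/swissprot", "ncbi/yeast.aa"],
}


def programs_for(query_type):
    """List the programs that accept the given query type."""
    return [name for name, (query, _) in PROGRAMS.items() if query == query_type]


def databases_for(program):
    """List the databases whose type matches what the program searches."""
    _, database_type = PROGRAMS[program]
    return DATABASES[database_type]


protein_programs = programs_for("protein")   # ["blastp", "tblastn"]
blastx_databases = databases_for("blastx")   # the protein databases
```

Note that tblastn takes a protein query but searches a nucleotide database, while blastx does the reverse.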

Parameters - Advanced Options

The following NCBI BLAST options can be chosen:

  • Low Complexity - filter out low compositional complexity sequence.
  • Mask lower case - filter out sequence which is in lower case.
  • Mask for lookup table only - masks low-complexity sequence only while constructing the lookup table used by the initial hit-finding phase of BLAST. The second phase, hit extension, is not affected, and hits can be extended through low-complexity sequence. NCBI notes that this option is experimental and subject to change.
  • Human repeats filter - masks human repeats (LINEs and SINEs). This option can speed searches involving long query sequences or databases containing sequences with many repeats.
  • Display result in your web browser - geWorkbench will display the HTML page returned by NCBI BLAST in your web browser as well as within its own display.

Please see the NCBI BLAST Help page for further details on these options.
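The idea behind the masking options can be illustrated with a short Python sketch. It replaces lower-case residues with a mask character and contrasts that with the unmasked sequence, which is loosely analogous to masking for the lookup table only while leaving hit extension unaffected. This is a conceptual illustration under simplifying assumptions, not how NCBI BLAST implements masking; the function name and the choice of "N" as mask character are hypothetical.

```python
def mask_lower_case(seq, mask_char="N"):
    """Replace every lower-case residue with mask_char; keep the rest as-is."""
    return "".join(mask_char if ch.islower() else ch for ch in seq)


# Hypothetical query in which a repetitive stretch is written in lower case.
query = "ATGCatatatatGGCC"

masked = mask_lower_case(query)   # view with the lower-case region hidden
unmasked = query.upper()          # full sequence, as hit extension would see it
```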


T SequenceAlignment BLAST AdvancedParams.png

General controls

  • All Markers - if selected, use all sequences loaded, overriding any activated sets in the Marker Sets component.
  • Total Sequence Number - indicates how many sequences have been selected for query.
  • Curling arrow - start BLAST search
  • Stop sign - stop BLAST search (if pushed, geWorkbench will not wait for or retrieve the BLAST results).


Submitting a BLAST job

  • Press the curved arrow submit button. The adjacent Stop button will terminate the search (geWorkbench will not wait for or retrieve the BLAST results).

T SequenceAlignment BLAST start stop.png

  • Return to the Main tab. The progress bar there first shows the sequence being uploaded and then shows that the BLAST job is running.


T SequenceAlignment BLAST running.png
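For readers curious about what a remote submission involves, the sketch below builds a submit request in the style of NCBI's public BLAST URL API (a "Put" command that returns a request ID, which is later used to fetch results) and parses the request ID from a response. Whether geWorkbench uses exactly this interface is an assumption; the helper names and the sample response text are hypothetical, and the code deliberately makes no network call.

```python
import re
from urllib.parse import urlencode

# Public NCBI BLAST endpoint; a client would POST the body below to it.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_submit_body(program, database, query, expect=None):
    """Form-encode a 'Put' (submit) request for the NCBI BLAST URL API."""
    params = {"CMD": "Put", "PROGRAM": program, "DATABASE": database, "QUERY": query}
    if expect is not None:
        params["EXPECT"] = str(expect)
    return urlencode(params)


def parse_request_id(response_text):
    """Extract the request ID (RID) and the estimated wait in seconds (RTOE)."""
    rid = re.search(r"RID = (\S+)", response_text)
    rtoe = re.search(r"RTOE = (\d+)", response_text)
    if not rid or not rtoe:
        raise ValueError("no RID/RTOE found in response")
    return rid.group(1), int(rtoe.group(1))


body = build_submit_body("blastn", "nt", ">query\nATGGCC", expect=0.01)

# Hypothetical response fragment, shaped like the info block NCBI returns.
sample_response = "QBlastInfoBegin\n    RID = ABC123XYZ\n    RTOE = 18\nQBlastInfoEnd"
rid, wait_seconds = parse_request_id(sample_response)
```

The two stages mirror the progress bar described above: the query is uploaded first, and the client then waits while the job runs on the server.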


Setting up a search

Two GenBank FASTA sequence files are provided in the tutorial dataset: a nucleotide sequence, "NM _024426-Wilms.Fasta", and its protein sequence, "NP_077744-Wilms.fasta".

For a simple search using the nucleotide query file, one can select the blastn program and search against the ncbi/nt non-redundant database of nucleotide sequences. For an even quicker example search, one could run the protein query sequence against a small protein database derived from those sequences found in the PDB database of proteins having known structures.

Here we will illustrate a search using the nucleotide file "NM _024426-Wilms.Fasta".

  • Read the "NM _024426-Wilms.Fasta" data file into the Project component using the File Open command and file type FASTA.
  • In the Project component, make sure the sequence file just read in is selected. This will activate those components that can work with sequence data.
  • In the Commands Area click on the Sequence Alignment tab.
  • Select the BLAST tab.

The length of the sequence is shown, and if desired a subset of the input sequence can be specified for use in the search. If more than one sequence was read in, the length of the longest is displayed.

  • Click on the drop-down arrow and select a program. Since this is a nucleotide query, here we select the nucleotide query program blastn.
  • Select the desired nucleotide database. Here select ncbi/nt - the complete non-redundant nucleotide database. For a faster search, one could select the ncbi/pdbnt database instead, which is much smaller.

Note: The text field at the bottom of the Sequence Alignment component shows the number of sequences that have been selected. If you have a FASTA file that contains multiple sequences, you can select the ones you want in the Markers component and activate this selection, which lets you search on a subset. You can search on all sequences in a file by clicking the All Markers checkbox.



T SequenceAlignment BLAST Main.png


  • Click on the Advanced Options tab.
  • Change the Expect Value to 0.01. This sets the cutoff for which BLAST hits will be displayed.
  • Make sure "dna mat" is selected for the Matrix.
  • Leave the Display result in your web browser checked.
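The effect of the Expect Value cutoff can be pictured as a simple filter: hits whose E-value exceeds the cutoff are not reported. The sketch below applies such a filter to a few made-up hit records. In practice the cutoff is applied by BLAST itself; the record layout and the function name are assumptions for illustration.

```python
# Hypothetical hit records; real BLAST output carries many more fields.
hits = [
    {"target": "hit_A", "evalue": 1e-50},
    {"target": "hit_B", "evalue": 0.005},
    {"target": "hit_C", "evalue": 0.5},
]


def filter_by_expect(hits, expect=0.01):
    """Keep only hits whose E-value is at or below the cutoff."""
    return [hit for hit in hits if hit["evalue"] <= expect]


kept = filter_by_expect(hits)
kept_targets = [hit["target"] for hit in kept]  # hit_C is dropped
```

A smaller Expect Value is more stringent, so lowering it from the default shortens the list of hits to those least likely to have arisen by chance.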


T SequenceAlignment BLAST AdvancedParams.png

BLAST results viewer

When the results are returned they are placed in the Project Folders as a child of the sequence they correspond to. You can mouse over the result set to see how many sequences are in it.


T SequenceAlignment BlastInProject.png


The result file will be opened in your web browser, if this option was selected. The alignment results are also displayed directly in the geWorkbench Blast results viewer, where they can be further manipulated. Each distinct target hit is listed on its own line in the results table. Note that a query sequence can hit a database target sequence in more than one place, resulting in multiple alignments being displayed per target hit.

In the Blast results viewer you can select sequences to add back to the main project by checking the include box and then clicking the Add Selected Sequences To Your Project button.

You can also add just the aligned parts by clicking on the button Only Add Aligned Parts.

The results viewer also shows statistics for each hit, including the E-value, start and length of the hit, and the percent identity.
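As an illustration of the percent identity statistic, the sketch below computes it for a pair of aligned sequences as the number of identical positions divided by the alignment length, with gaps written as "-". The function name and the example alignment are hypothetical, and the viewer's own calculation may differ in detail.

```python
def percent_identity(aligned_query, aligned_target):
    """Percent of alignment columns where both sequences have the same residue."""
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    # A column counts as identical only if the residues match and are not gaps.
    matches = sum(
        1 for q, t in zip(aligned_query, aligned_target) if q == t and q != "-"
    )
    return 100.0 * matches / len(aligned_query)


# Hypothetical 8-column alignment with one mismatch and one gap.
pid = percent_identity("ATGGC-TA", "ATGACCTA")  # 6 identical columns out of 8
```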

In the pane at left in the picture below, the name of the input query sequence is shown, e.g. the gi number of a GenBank sequence. Note that the "gi" designator is not shown, though it should be. If there had been more than one query sequence, this pane would show a list of names, and you could select the input query for which you wish to view the alignment results.


T SequenceAlignment BLAST results.png


In the Blast results viewer, the Load button allows one to load an external Blast file in HTML format into the viewer.