Difference between revisions of "Sequence Retriever"

Line 20: Line 20:
 
Once a set of sequences has been obtained, it can be used for several types of analysis in geWorkbench, including searching with known promoter motifs ([[Tutorial_-_Promoter_Analysis | Promoter Analysis]]), running [[Tutorial_-_BLAST|BLAST]] searches, or looking for common motifs using [[Tutorial_-_Pattern_Discovery|Pattern Discovery]].
  
==Limitations==
+
Nucleotide sequences are obtained directly from the UC Santa Cruz Golden Path database.  Amino-acid sequences are retrieved from the European Bioinformatics Institute (EBI).
This section applies to geWorkbench 1.03, released April 5, 2006.  For DNA sequences, only the sequence within ±2000 bp of the transcription start site is available, and only for markers on the Affymetrix HG_U95 chip.  These sequences have been pre-cached on the geWorkbench server and are downloaded to the application the first time they are requested.  The DNA sequences have had their exons masked with the letter "E".
 
 
 
 
 
For the next version, we are developing methods to obtain sequences directly from the UC Santa Cruz Golden Path database where possible.  Amino-acid sequences are retrieved on-the-fly from the European Bioinformatics Institute (EBI).
 
  
 
==Example - retrieving sequences for a list of gene markers==
 
  
 
===Obtaining a set of markers===
 
We will start with a group of markers obtained in the tutorial [[Tutorial_-_Clustering#Hierarchical_Clustering_-_Example | Hierarchical Clustering]]. The list of markers from that tutorial can also be loaded from the file "cluster_tree_12markers.csv" found in the tutorial data file (see [[Download]]). To load a set of markers, press the "Load Set" button at the bottom of the component and browse to the desired file.
+
Sequences can be retrieved for any set of markers of interest. For this example we have loaded the tutorial data file BCell-100.exp and selected the last 10 markers into a new Marker Set:
  
===Retrieving the sequences===
+
[[Image:T_SequenceRetriever_MarkerSet.png]]
  
We will retrieve sequences spanning -1999 to +1 bp relative to the transcription start site of each gene.
 
  
Verify that the sequence type is set to DNA.  Press the '''Get Sequence''' button to download the sequences.
+
When the set is activated (by ticking its check box), the selected marker set appears in the Sequence Retriever component:
  
 +
[[Image:T_SequenceRetriever_Setup.png]]
  
[[Image:T_SequenceRetriever_84ClustSeqs.png]]
 
  
 +
We will retrieve DNA sequences from UC Santa Cruz and leave the default settings of ±10,000 bp relative to the start of transcription.  After pressing '''Get Sequence''', the sequences are downloaded:
  
By double-clicking on one of the lines representing a returned sequence, you can switch to a detailed view of the sequence:
+
[[Image:T_SequenceRetriever_AfterRetrieval.png]]
  
 +
Note that for several of the markers more than one sequence has been retrieved.  All sequences associated with a given gene symbol are retrieved.
  
[[Image:T_SequenceRetriever_84ClustSeqs_disp.png]]
 
  
==Adding the returned sequences to the project==
+
Double-clicking on one of the lines shows the sequence detail:
  
Retrieved FASTA format sequences can be added to a project by clicking the '''Add to Project''' button at lower right in the component (see interface picture above). Note that if the resulting sequence entry in the '''Project Folder''' is then selected, modules supporting sequence analysis and visualization will appear in the Analytical Tools and Visualization areas of the GUI.  However, the Sequence Retriever component will not be visible!  You must select the Project or the sequence's parent object to see the Sequence Retriever component again.
+
[[Image:T_SequenceRetriever_SequenceDetail.png]]
* Here we saved the returned sequences under the name "cluster sequences".
 
  
  
[[Image:T_SequenceRetriever_84ClustSeqs_ProjFold.png]]
+
The component provides check boxes which allow sequences of interest to be selected and added to the Project Folders component as a data node:
  
 +
[[Image:T_SequenceRetriever_SelectingForProject.png]]
  
  
==Generating a new list of markers for the returned sequences==
+
When Add to Project is pushed, the user is asked for a name for the new data node:
  
The Sequence Retriever does not necessarily return a sequence for every probe listed.  We can generate a new list of genes for just those present here.
+
[[Image:T_SequenceRetriever_NamingSet.png]]
  
# In the Project Folders component, select the "cluster sequences" object just created.  Its contents now appear below in the Markers component.
 
# In the Markers component, select all of the probes and right-click.
 
# Select Add to Set.
 
# Enter a name for the new set.  Here we have used "cluster tree seqs".
 
# Note that the Marker Sets component shows there are 64 sequences in the set, from the 84 markers we started with.
 
# This new list can also be saved to disk by right-clicking on it and selecting Save.  We have used the name "cluster_tree_total_pearsons_64of84_markers.csv".
 
  
 +
The resulting node is placed into the Project Folder as a child of the original dataset:
  
==Saving the sequences to an external FASTA file==
+
[[Image:T_SequenceRetriever_SequenceNode.png]]
  
# Right-click on the "cluster sequences" entry you made in the Project Folders component.
 
# Select Save.
 
# Enter a suitable name.  We have saved it as "640f84ClusterPearsonsSeqs.fasta"
 
  
 +
Note that when this node is added, the Viewing area of the geWorkbench GUI will now show components that support working with sequences.  However, the Sequence Retriever component will no longer be visible!  You must select the Project or the sequence's parent object to see the Sequence Retriever component again.
  
  
 +
==Saving the sequences to an external FASTA file==
  
[[Image:T_SequenceRetriever_SavingSets.png]]
+
# Right-click on the "selected sequences" entry you made in the Project Folders component.
 
+
# Select Save.
 
+
# Enter a suitable name and save the file.
==Aside - Marker sets for different objects==
 
 
 
Note that we have now created two Marker Sets during this exercise.
 
* One (above) belongs to the set of sequences returned and shows the sequence names of the 64 probes for which sequences were retrieved.
 
* The other (below) belongs to the microarray dataset and contains the list of 84 markers found by clustering.
 
 
 
[[Image:T_SequenceRetriever_SavingSets2.png]]
 

Revision as of 18:09, 13 August 2008



Outline

In this tutorial, we will

  • Discuss uses for retrieved sequences.
  • Review obtaining a set of markers for which we wish to retrieve sequences.
  • Retrieve DNA sequences from a remote resource.
  • View the sequences in a sequence browser.
  • Add the sequences to the Project Folders component.
  • Create a new list containing those genes for which sequences were returned.
  • Save the list of sequences to a file.
  • Save the sequences to a file.
  • Note that the different objects in the Project Folders component each have their own sets of Marker Sets.

Overview

geWorkbench contains a number of modules that allow DNA or protein sequences to be visualized and analyzed. Sequences can be loaded from a local disk as a FASTA format file, or can be retrieved from a remote resource. Here we discuss retrieval of sequences from the network.
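
For readers unfamiliar with the FASTA format mentioned above, the following is a minimal Python sketch of how such a file is structured and read. It is not geWorkbench code; the function name `parse_fasta` and the two example records are hypothetical and used only for illustration.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} dictionary."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A '>' line starts a new record; the rest of the line is its header.
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # Sequence data may span several lines; join them together.
            records[header] += line
    return records


# Hypothetical two-record example (not real tutorial data).
example = """>seq1 hypothetical record
ACGTAC
GTTA
>seq2
TTGACA
""".splitlines()

sequences = parse_fasta(example)
```

Each record consists of a header line beginning with `>` followed by one or more lines of sequence, which is why the sketch concatenates lines until the next header appears.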

Once a set of sequences has been obtained, it can be used for several types of analysis in geWorkbench, including searching with known promoter motifs (Promoter Analysis), running BLAST searches, or looking for common motifs using Pattern Discovery.

Nucleotide sequences are obtained directly from the UC Santa Cruz Golden Path database. Amino-acid sequences are retrieved from the European Bioinformatics Institute (EBI).

Example - retrieving sequences for a list of gene markers

Obtaining a set of markers

Sequences can be retrieved for any set of markers of interest. For this example we have loaded the tutorial data file BCell-100.exp and selected the last 10 markers into a new Marker Set:

T SequenceRetriever MarkerSet.png


When the set is activated (by ticking its check box), the selected marker set appears in the Sequence Retriever component:

T SequenceRetriever Setup.png


We will retrieve DNA sequences from UC Santa Cruz and leave the default settings of ±10,000 bp relative to the start of transcription. After pressing Get Sequence, the sequences are downloaded:

T SequenceRetriever AfterRetrieval.png

Note that for several of the markers more than one sequence has been retrieved. All sequences associated with a given gene symbol are retrieved.
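
To make the retrieval window concrete, here is a small Python sketch of how a window around a transcription start site (TSS) could be turned into genomic coordinates. This is an illustration under stated assumptions, not the way geWorkbench or the UCSC database computes it: the function `tss_window`, the 1-based inclusive coordinate convention, and the example TSS positions are all hypothetical.

```python
from typing import Tuple


def tss_window(tss: int, strand: str, upstream: int = 10_000,
               downstream: int = 10_000) -> Tuple[int, int]:
    """Return (start, end) genomic coordinates of a window around a TSS.

    Coordinates are assumed to be 1-based and inclusive (an assumption
    made for this sketch only).
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        # On the minus strand, "upstream" lies at higher genomic coordinates.
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clamp the start so the window never runs off the chromosome start.
    return max(1, start), end


# Hypothetical TSS positions, chosen only to exercise the function.
plus_window = tss_window(50_000, "+")
minus_window = tss_window(50_000, "-", upstream=2_000, downstream=0)
clamped_window = tss_window(5_000, "+")
```

The strand matters because "upstream" is defined relative to the direction of transcription, so the same offsets map to different genomic coordinates on the two strands.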


Double-clicking on one of the lines shows the sequence detail:

T SequenceRetriever SequenceDetail.png


The component provides check boxes which allow sequences of interest to be selected and added to the Project Folders component as a data node:

T SequenceRetriever SelectingForProject.png


When Add to Project is pushed, the user is asked for a name for the new data node:

T SequenceRetriever NamingSet.png


The resulting node is placed into the Project Folder as a child of the original dataset:

T SequenceRetriever SequenceNode.png


Note that when this node is added, the Viewing area of the geWorkbench GUI will now show components that support working with sequences. However, the Sequence Retriever component will no longer be visible! You must select the Project or the sequence's parent object to see the Sequence Retriever component again.


Saving the sequences to an external FASTA file

  1. Right-click on the "selected sequences" entry you made in the Project Folders component.
  2. Select Save.
  3. Enter a suitable name and save the file.
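
As a rough idea of what the saved file contains, the following Python sketch writes records in FASTA layout. It is not the geWorkbench export routine; the function `write_fasta`, the line width, and the sample record are assumptions made for illustration.

```python
import io
from typing import Mapping, TextIO


def write_fasta(records: Mapping[str, str], handle: TextIO, width: int = 60) -> None:
    """Write {name: sequence} records to a text handle in FASTA layout."""
    for name, seq in records.items():
        handle.write(f">{name}\n")
        # Break the sequence into fixed-width lines for readability.
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


# Hypothetical record written to an in-memory buffer instead of a file.
buffer = io.StringIO()
write_fasta({"seq1": "ACGTACGTTA"}, buffer, width=4)
output = buffer.getvalue()
```

To write to disk instead, the same function could be given a handle from `open("some_name.fasta", "w")`, where the file name is whatever you would enter in the Save dialog.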