Sequence Retriever

Revision as of 19:02, 17 August 2006 by Smith (talk | contribs) (Retrieving the sequences)



Outline

In this tutorial, we will

  • Discuss uses for retrieved sequences.
  • Review obtaining a set of markers for which we wish to retrieve sequences.
  • Retrieve DNA sequences from a remote resource.
  • View the sequences in a sequence browser.

Overview

geWorkbench contains a number of modules that allow DNA or protein sequences to be visualized and analyzed. Sequences can be loaded from a local disk as a FASTA format file, or can be retrieved from a remote resource. Here we discuss retrieval of sequences from the network.
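For readers unfamiliar with the FASTA format mentioned above, the following is a minimal Python sketch of how such a file is laid out and read: each record starts with a ">" header line, followed by one or more lines of sequence. The function name `parse_fasta` and the sample records are hypothetical illustrations, not part of geWorkbench.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record begins; drop the leading ">" from the header.
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            # Sequence lines are accumulated and joined at the end.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical sample data, for illustration only.
sample = """>probe_A hypothetical example
ACGTACGT
ACGT
>probe_B
GGCC
"""
parsed = parse_fasta(sample)
```

A sequence that spans several lines in the file is joined into a single string, so line wrapping in the file does not affect the parsed result.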

Once a set of sequences has been obtained, it can be used for several types of analysis in geWorkbench, including searching with known promoter motifs (Promoter Analysis), running BLAST searches, or looking for common motifs using Pattern Discovery.

Limitations

This section applies to geWorkbench 1.03, released April 5, 2006. For DNA sequences, only the region within ±2000 bp of the transcription start site is available, and only for markers on the Affymetrix HG_U95 chip. These sequences have been pre-cached on the geWorkbench server and are downloaded to the application the first time they are requested. The exons in the DNA sequences have been masked with the letter "E".
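Because exons are replaced by the letter "E", any tool that processes a saved sequence outside geWorkbench needs to treat "E" as a mask rather than as a nucleotide. The sketch below shows one way this might be handled in Python; the function names and the short example sequence are hypothetical, and this is not how geWorkbench itself processes the mask.

```python
import re


def unmasked_segments(seq, mask_char="E"):
    """Return (start_index, segment) pairs for runs that contain no mask character."""
    pattern = "[^" + re.escape(mask_char) + "]+"
    return [(m.start(), m.group()) for m in re.finditer(pattern, seq)]


def masked_fraction(seq, mask_char="E"):
    """Return the fraction of positions that are masked (0.0 for an empty sequence)."""
    return seq.count(mask_char) / len(seq) if seq else 0.0


# Hypothetical masked sequence, for illustration only.
example = "ACGTEEEEGGCATEETT"
segments = unmasked_segments(example)
fraction = masked_fraction(example)
```

The start indices are 0-based offsets into the masked sequence, so the position of each unmasked segment relative to the original sequence is preserved.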


For the next version, we are developing methods to obtain sequences directly from the UC Santa Cruz Golden Path database where possible. Amino-acid sequences are retrieved on-the-fly from the European Bioinformatics Institute (EBI).

Example - retrieving sequences for a list of gene markers

Obtaining a set of markers

We will start with a group of markers obtained in the tutorial Hierarchical Clustering. The list of markers from that tutorial can also be loaded from the file "cluster_tree_12markers.csv" found in the tutorial data file (see Download). To load a set of markers, press the "Load Set" button at the bottom of the component and browse to the desired file.

Retrieving the sequences

We will retrieve sequences from -1999 to +1 bp from the transcription start site of each gene.
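To make the coordinate range concrete, here is a small Python sketch that converts a window such as -1999 to +1 into genomic coordinates. It rests on assumptions that this tutorial does not confirm: that +1 denotes the transcription start site itself, that there is no position 0 (so the window spans 2000 bases), and that coordinates are 1-based and inclusive. The function name and the example position are hypothetical.

```python
def promoter_window(tss, strand, upstream=1999):
    """Return (start, end) genomic coordinates, 1-based inclusive.

    Assumes +1 is the transcription start site (TSS) and that -1 is the
    base immediately upstream, i.e. there is no position 0.
    """
    if strand == "+":
        # Upstream lies at smaller coordinates on the forward strand.
        return (tss - upstream, tss)
    if strand == "-":
        # Upstream lies at larger coordinates on the reverse strand.
        return (tss, tss + upstream)
    raise ValueError("strand must be '+' or '-'")


# Hypothetical TSS position, for illustration only.
forward = promoter_window(5000, "+")
reverse = promoter_window(5000, "-")
window_length = forward[1] - forward[0] + 1
```

For a gene on the reverse strand, the returned interval would still need to be reverse-complemented to read in the direction of transcription.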

Verify that the sequence type is set to DNA. Press the Get Sequence button to download the sequences.


T SequenceRetriever 84ClustSeqs.png


By double-clicking on one of the lines representing a returned sequence, you can switch to a detailed view of the sequence:


T SequenceRetriever 84ClustSeqs disp.png

Adding the returned sequences to the project

Retrieved FASTA format sequences can be added to a project by clicking the Add to Project button at the lower right of the component (see interface picture above). Note that if the resulting sequence entry in the Project Folder is then selected, modules supporting sequence analysis and visualization will appear in the Analytical Tools and Visualization areas of the GUI. However, the Sequence Retriever component will no longer be visible. You must select the Project or the sequence's parent object to see the Sequence Retriever component again.

  • Here we saved the returned sequences under the name "cluster sequences".


T SequenceRetriever 84ClustSeqs ProjFold.png


Generating a new list of markers for the returned sequences

The Sequence Retriever does not necessarily return a sequence for every probe listed. We can generate a new list of genes for just those present here.

  1. In the Project Folders component, select the "cluster sequences" object just created. Its contents now appear below in the Markers component.
  2. In the Markers component, select all of the probes and right-click.
  3. Select Add to Set.
  4. Enter a name for the new set. Here we have used "cluster tree seqs".
  5. Note that the Marker Sets component shows there are 64 sequences in the set, from the 84 markers we started with.
  6. This new list can also be saved to disk by right-clicking on it and selecting Save. We have used the name "cluster_tree_total_pearsons_64of84_markers.csv".
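The same comparison can be sketched outside geWorkbench, for example to list which markers did not return a sequence. The Python below is a minimal illustration under two assumptions that are not confirmed by this tutorial: that the saved marker list holds one marker ID in the first column of each row, and that each FASTA header begins with the marker ID. All names and sample values are hypothetical.

```python
import csv
import io


def read_marker_ids(csv_text):
    """Read marker IDs from the first column of CSV text (assumed layout)."""
    reader = csv.reader(io.StringIO(csv_text))
    return [row[0].strip() for row in reader if row and row[0].strip()]


def split_by_retrieval(marker_ids, fasta_headers):
    """Split markers into those with and without a returned sequence.

    Assumes the first whitespace-delimited token of each header is the marker ID.
    """
    found_ids = {h.split()[0] for h in fasta_headers if h.split()}
    retrieved = [m for m in marker_ids if m in found_ids]
    missing = [m for m in marker_ids if m not in found_ids]
    return retrieved, missing


# Hypothetical marker IDs and headers, for illustration only.
markers = read_marker_ids("probe_1\nprobe_2\nprobe_3\n")
headers = ["probe_1 some description", "probe_3"]
retrieved, missing = split_by_retrieval(markers, headers)
```

Both output lists keep the order of the original marker list, which makes the result easy to compare against the set shown in the Marker Sets component.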


Saving the sequences to an external FASTA file

  1. Right-click on the "cluster sequences" entry you made in the Project Folders component.
  2. Select Save.
  3. Enter a suitable name. We have saved it as "640f84ClusterPearsonsSeqs.fasta".
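For reference, the sketch below shows in Python what writing sequences in FASTA format involves: a ">" header line per record followed by the sequence, wrapped to a fixed line width. The function name, the 60-character default width, and the sample record are assumptions made for illustration; the exact layout of the file geWorkbench writes is not described here.

```python
def format_fasta(records, width=60):
    """Format a dict of header -> sequence as FASTA text with wrapped lines."""
    lines = []
    for name, seq in records.items():
        lines.append(">" + name)
        # Break the sequence into chunks of at most `width` characters.
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n" if lines else ""


# Hypothetical record, for illustration only.
fasta_text = format_fasta({"probe_A": "ACGTAC"}, width=4)
```

The resulting text can be written to disk with an ordinary file write, for example `open("out.fasta", "w").write(fasta_text)`, where the output filename is again only a placeholder.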