Difference between revisions of "User:Smith"

 
(31 intermediate revisions by the same user not shown)
Line 1: Line 1:
Design and outline of tutorials for geWorkbench
+
==Resources:==
  
Tutorial Design considerations -
+
http://geworkbench.org =
1. Probably best not to use detailed section numbers, since we cannot autoupdate them in this wiki. Instead, rely on links?
+
http://wiki.c2b2.columbia.edu/workbench
2. Each section should list example data files needed, and these should be part of distribution.
 
  
 +
http://wiki.c2b2.columbia.edu/workbook/index.php/Genomics_Workbook
  
Outline for tutorials
+
https://sharepoint.c2b2.columbia.edu/c2b2/default.aspx
        2.1 Before You Begin
 
        2.2 Getting Started
 
              Is caWorkbench downloaded and installed?  Link to download and installation
 
              Important concepts:
 
                  Use of activated phenotype and marker panels throughout application.
 
                      - if no panels are activated, the "Activated Arrays" and "Activated Markers" check boxes should have no effect.
 
                      - if gene or phenotype panels are activated, then these check boxes should control what is used or displayed:
 
                        -- if one of the boxes is checked, only activated markers or arrays will be used.
 
                        -- if the box is not checked, then ('''in most cases - are there any exceptions?''') the gene or phenotype panels will be ignored and all arrays or markers will be used.
 
                      Note that there is a new "plot" button that is available only when a gene panel is active.
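The check-box rule described above can be sketched in a few lines of Python. This is only an illustration of the intended behavior, not geWorkbench code: the function name `arrays_in_use` and its arguments are hypothetical, and the open question about exceptions noted above is not modeled. The same logic would apply to markers.

```python
def arrays_in_use(all_arrays, activated_panels, use_activated):
    """Return the arrays an analysis or viewer would work on.

    all_arrays       -- every array name in the dataset, in display order
    activated_panels -- list of activated panels, each a list of array names
    use_activated    -- state of the "Activated Arrays" check box
    (All three names are hypothetical, chosen for this sketch.)
    """
    active = {name for panel in activated_panels for name in panel}
    # No activated panels, or box unchecked: panels are ignored, use everything.
    if not active or not use_activated:
        return list(all_arrays)
    # Box checked and at least one panel active: restrict to activated arrays.
    return [name for name in all_arrays if name in active]
```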
 
              The  menu bar - point out that some commands are available both from the menu bar and by right-clicking on a dataset....
 
  
        2.3 Loading Data
+
http://wiki.c2b2.columbia.edu/mantis/
              2.3.1 File types supported
 
                    Expression
 
                        Affymetrix MAS5/GCOS (text files output by Affymetrix software)
 
                        Affymetrix File Matrix (.exp)(a geWorkbench defined format)
 
                        RMAExpress Processed File
 
                        GenePix
 
                        Note - the type "Normalized no-confidence expression matrix" has switched the phenotype and gene labels - don't use until fixed.
 
                    Genotypic
 
                        Genotypic data files - is this working?
 
                    Sequence
 
                        Fasta
 
                    Pattern Detection
 
                        Pattern Files
 
  
              2.3.2 Loading MAS5/GCOS type files
+
http://wiki.c2b2.columbia.edu/mantis/view_all_bug_page.php
                        Use the 10 cardiomyopathy files from Harvard.
 
                        What happens the first time a new chip-type is loaded - how long does it take, what is happening, what internal files are being built?
 
              2.3.3 Merging loaded data
 
 
            [  '''These examples not really needed.....'''
 
              2.3.4 Loading matrix format files
 
                        Include webmatrix2000G?, webmatrix4000G? and webmatrix.exp?
 
                    Note - explain matrix format in an appendix
 
              2.3.5 Other file types supported: Loading RMAExpress files
 
                    Must generate an example RMAExpress file, start with Harvard cardiomyopathy files?
 
            ]
 
        2.4 Working with Marker and Phenotype Panels
 
                    Use the cardiomyopathy dataset created in 2.3
 
                2.4.1 Creating Phenotype Panels
 
                2.4.2 Assigning Case/Control status
 
                2.4.3 Activating a phenotype panel
 
                2.4.4 Creating Gene/Marker Panels
 
                2.4.5 Activating a gene/marker panel
 
        2.5 Saving data files
 
                Use the cardiomyopathy dataset annotated in 2.4
 
                2.5.1 Save to matrix file
 
               
 
                 
 
  
          2.6 Visualize Gene Expression
+
http://wiki.c2b2.columbia.edu/mantis/login_page.php
                Microarray Panel
 
                  Point out intensity and array sliders, color key and array name.
 
                Color Mosaic
 
                  Point out only displays when "Display" button pushed.
 
                  Point out intensity, accession, gene height and width controls.
 
                  ??Explain whether remaining controls work or not: Pat,Abs,Ratio.???
 
                Expression Profiles
 
                  - displays expression level against array number. Each marker is a separate color line.
 
                Expression Value Distribution
 
                  - for a single array, plots expression value against marker number.
 
          2.6 Filter and Normalize Data
 
                2.6.1 Normalize
 
                2.6.2 Filter
 
          2.7 Clustering Gene Expression Data
 
          2.8 Differential Expression
 
                2.8.1 T Test
 
          2.9 Regulatory Network
 
          2.10 Integrated Annotation Information
 
          2.11 Enrichment Analysis
 
          2.12 Sequence Analysis
 
          2.13 Pattern Discovery
 
          2.14 Promoter Analysis
 
  
==Overview of the GUI and component interoperability==
+
http://wiki.c2b2.columbia.edu/isrce/index.php/MARINa,_IDEA,_CUPID_Grid_Service_Implementation
The graphical user interface for geWorkbench is divided into four major sections:
 
1. Data management
 
2. Marker and Phenotype management
 
3. Visualization tools (primarily)
 
4. Analytical tools
 
  
  
The data management area (1) is called the Project Panel. It can hold one workspace, and a workspace in turn can hold one or more projects. Any data file or analysis result is stored in a project.  A workspace and all the data it contains can be saved and returned to later.
+
http://gforge.nci.nih.gov
  
The most important design goal of geWorkbench is to allow data produced or altered in one module to be easily transferred to other modules for successive analysis steps. There are two places that hold shared data - the Project component (1), and the Panels component (2). While the Project component holds files and various types of analysis result sets, the Panels component groups markers or phenotypes into panels.  These panels can then be selected so that further analysis uses only that subset of the data.  For example, several analysis components produce lists of markers, and each such new list is placed into the Markers component as a new marker panel.  A phenotype panel can be used to group, for example, microarrays by their disease state.
+
http://gforge.nci.nih.gov/projects/geworkbench
  
 +
http://wiki.c2b2.columbia.edu/informatics/
 +
same as
 +
(http://helpdesk.cu-genome.org/informatics/)
  
In the series of tutorials below, we will demonstrate how a panel of markers is defined by selecting a cluster in the Hierarchical Clustering component, and how this panel of markers is then passed to the Sequence Retrieval component to begin sequence analysis.
 
  
 +
ICTVdb
  
A key feature of the GUI is that the modules displayed in the Visualization (3) and Analysis (4) areas depend on the type of data currently selected in the Project Panel.  Thus you will see a different set of choices (tabs) when a microarray data set is selected, as compared to when a DNA or protein sequence file is selected.  When a new data file is loaded, or an analysis produces a new data set, not only is it added to the Project Panel, but an appropriate viewer in the Visualization area is automatically selected.
 
  
  
 +
http://wiki.c2b2.columbia.edu/ictvdb/
  
A major feature of geWorkbench allowing different components to
+
nonpublic documents:
  
 
+
adcvs.cu-genome.org:/cvs/magnet
==Tutorial: Hierarchical Clustering==
 
 
 
===Preliminary Filtering and Normalization===
 
 
 
 
 
The file "webmatrix.exp" contains results from 100 Affymetrix HG-U95Av2 chips containing B-cell samples from numerous different disease states (phenotypes).  12600 markers are represented.  To prepare this dataset for clustering we will filter and normalize the data. The steps shown are just an example of how filtering and normalization can be used, and each dataset should be handled according to the type of analysis being undertaken and its goals.
 
 
 
For this dataset, we performed the following steps:
 
 
 
1. Applied '''Expression Threshold Filter''' to remove very low expression values in the range 0-20.
 
 
 
2. Applied the '''Missing Values Filter''' with a maximum number of missing values per marker of 2. (Deletes markers with more than 2 missing values).  This reduced the number of markers to 6327.
 
 
 
3. Performed '''Quantile Normalization''' using '''Averaging Method''' of '''Mean Marker Profile'''.
 
 
 
4. Applied the '''Deviation Filter''' with Deviation Bound of 20 and '''Missing Values''' set to '''Marker Average'''.
 
 
 
5. Applied the '''Missing Values Filter''' as in (2), which further reduced the number of markers to 6270.
 
 
 
The resulting dataset was named '''webmatrix_fn.exp'''.
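As a rough illustration of what the steps above do to the data, the following Python sketch implements simplified versions of each operation on a small in-memory table. It is not the geWorkbench implementation and will not reproduce the marker counts quoted above. All function names are hypothetical. The sketch assumes that the threshold filter marks in-range values as missing, that the deviation filter removes markers whose standard deviation is below the bound, and that missing values are filled with the marker average before normalization.

```python
from statistics import mean, pstdev

# Data layout assumed for this sketch: {marker_name: [value per array]},
# with None standing for a missing value.

def threshold_to_missing(rows, low=0.0, high=20.0):
    """Mark every value inside [low, high] as missing."""
    return {m: [None if v is not None and low <= v <= high else v for v in vals]
            for m, vals in rows.items()}

def drop_sparse_markers(rows, max_missing=2):
    """Delete markers with more than max_missing missing values."""
    return {m: vals for m, vals in rows.items()
            if sum(v is None for v in vals) <= max_missing}

def fill_with_marker_mean(rows):
    """Replace missing values with the mean of the marker's observed values."""
    filled = {}
    for m, vals in rows.items():
        observed = [v for v in vals if v is not None]
        if not observed:          # nothing to average; skip this marker
            continue
        avg = mean(observed)
        filled[m] = [avg if v is None else v for v in vals]
    return filled

def quantile_normalize(rows):
    """Give every array the same distribution of values.

    Expects complete data (no None). Ties are broken by marker order.
    """
    markers = list(rows)
    n_arrays = len(rows[markers[0]])
    columns = [[rows[m][j] for m in markers] for j in range(n_arrays)]
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean across arrays at each rank.
    reference = [mean(col[r] for col in sorted_cols) for r in range(len(markers))]
    out = {m: [0.0] * n_arrays for m in markers}
    for j, col in enumerate(columns):
        order = sorted(range(len(markers)), key=lambda i: col[i])
        for rank, i in enumerate(order):
            out[markers[i]][j] = reference[rank]
    return out

def drop_low_deviation(rows, bound=20.0):
    """Delete markers whose standard deviation across arrays is below bound."""
    return {m: vals for m, vals in rows.items() if pstdev(vals) >= bound}
```

For example, `quantile_normalize({"g1": [1.0, 4.0], "g2": [3.0, 2.0]})` returns `{"g1": [1.5, 3.5], "g2": [3.5, 1.5]}`: both arrays end up with the values 1.5 and 3.5, assigned according to each marker's rank within its array.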
 
 
 
 
 
===Fast Hierarchical Clustering===
 
 
 
'''Fast Hierarchical Clustering''' is found in the '''Analysis Panel'''.
 
 
 
In this example we show Hierarchical Clustering performed with the following options:
 
 
 
1. Clustering Method:  "Total Linkage"
 
 
 
2. Clustering Dimension: "Both"
 
 
 
3. Clustering Metric: "Euclidean"
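To make these options concrete, here is a minimal Python sketch of agglomerative clustering with a Euclidean metric. It assumes that "Total Linkage" corresponds to what is usually called complete linkage (the distance between two clusters is the distance between their two farthest members); that mapping is an assumption, and the function name `complete_linkage` is hypothetical. It is a naive implementation for illustration, not the fast algorithm used by the component.

```python
from math import dist

def complete_linkage(points):
    """Cluster labeled profiles bottom-up and return the merge history.

    points maps a label (marker or array name) to its list of values.
    Each entry of the result is (cluster_a, cluster_b, merge_distance).
    """
    clusters = {label: [label] for label in points}  # cluster id -> members
    merges = []
    while len(clusters) > 1:
        best = None
        ids = list(clusters)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                # Complete linkage: use the two farthest members.
                d = max(dist(points[x], points[y])
                        for x in clusters[a] for y in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Merge the closest pair into a new cluster named after its parts.
        clusters[f"({a},{b})"] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
    return merges
```

A Clustering Dimension of "Both" would then amount to running this once with markers as the labels and once with arrays as the labels (that is, on the transposed table).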
 
 
 
 
 
[[Image:T_Analysis_FHC.png]]
 
 
 
 
 
Hit '''Analyze''' to run the clustering.  The resulting dataset is inserted into the '''Project Panel'''
 
 
 
 
 
[[Image:T_ProjectFolder_HierarchClust.png]]
 
 
 
 
 
and can be viewed in the '''Dendrogram Panel'''.  Here we will pick a subtree near the top for further investigation.
 
 
 
1. Click '''Enable Zoom'''.
 
 
 
2. Position the mouse pointer over the cluster subtree of interest.  It will be highlighted in blue.
 
 
 
 
 
[[Image:T_Dendrogram_SelectCluster.png]]
 
 
 
 
 
3. Left click on the highlighted subtree to view it alone.



4. By right clicking on the image and selecting "Add to panel"
 
 
 
 
 
[[Image:T_Dendrogram_ClusterDetailAdd.png]]
 
 
 
 
 
the markers in this subtree can be added as a new marker panel to the '''Gene Panel.'''
 
 
 
 
 
[[Image:T_GenePanel_ClusterTree.png]]
 
 
 
 
 
==Tutorial: Marker Annotations==
 
 
 
For this tutorial, we will examine the group of markers selected in the '''Hierarchical Clustering''' tutorial.  geWorkbench can retrieve gene and pathway information from databases hosted at the NCI.
 
 
 
1. The desired marker panel is activated by checking its box in the '''Gene Panel'''.
 
 
 
2. In the '''Marker Annotations''' panel, select '''Retrieve Annotations'''.
 
 
 
 
 
[[Image:T_MarkerAnnotations_ClusterTree.png]]
 
 
 
 
 
The links under the heading '''Gene''' can be clicked to display information from the CGAP database at the NCI:
 
 
 
 
 
[[Image:T_CGAP_Page_for_NME1.png]]
 
 
 
 
 
The '''Pathway''' links can be clicked to display BioCarta pathway diagrams provided through the NCI's caCORE/caBIO resource.  The graphical components are themselves clickable to provide further information.
 
 
 
 
 
[[Image:T_caBIO_Pathways_h_ndkDynamin.png]]
 
 
 
 
 
==Tutorial: Sequence Retrieval==
 
 
 
 
 
geWorkbench contains a number of modules that allow DNA or protein sequences to be analyzed.  Sequences can be loaded from a local disk as a FASTA format file, or can be retrieved from a database.  Here we discuss retrieval of sequences from the network.
 
 
 
For this example, we will start with the group of markers selected in the '''Hierarchical Clustering''' tutorial.
 
 
 
We will download the sequence within ±2000 bp of the transcription start site of each gene.  This region may contain regulatory elements such as transcription factor binding sites.
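The coordinates of such a window can be worked out as in the following Python sketch. It assumes 1-based chromosome coordinates and a known strand for each gene; the function name `promoter_window` and its parameters are hypothetical and are not part of geWorkbench.

```python
def promoter_window(tss, strand="+", upstream=2000, downstream=2000):
    """Return (start, end) of the region around a transcription start site.

    tss is a 1-based chromosome position. On the minus strand, "upstream"
    lies at higher coordinates, so the two offsets swap sides.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    else:
        start, end = tss - downstream, tss + upstream
    # Clamp at the start of the chromosome.
    return max(1, start), end
```

With the symmetric ±2000 bp window used here the strand makes no difference, for example `promoter_window(10000)` gives `(8000, 12000)`; it only matters when the upstream and downstream lengths differ.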
 
 
 
Press the '''Retrieve Sequences''' button to download the sequences.
 
 
 
 
 
[[Image:T_SeqeunceRetriever_ClusterTree.png]]
 
 
 
 
 
The retrieved sequences are placed in the Project Folder.  Note that when this entry is selected, the modules supporting sequence analysis will appear.
 
 
 
 
 
[[Image:T_ProjectFolder_ClusterSeqs.png]]
 
 
 
 
 
 
 
==Tutorial: Pattern Discovery==
 
 
 
The geWorkbench '''Pattern Discovery''' module uses an algorithm called '''SPLASH''' to search for common patterns in sets of DNA or protein sequences. This type of search can be used, for example, to search for common regulatory elements in otherwise unrelated sequences.
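To illustrate the general idea of searching for patterns shared by several sequences, the following Python sketch finds exact substrings of a fixed length that occur in at least a minimum number of sequences. This is deliberately much simpler than SPLASH and is not the algorithm used by the module; the function name `shared_kmers` and the parameters `k` and `min_support` are hypothetical.

```python
from collections import defaultdict

def shared_kmers(sequences, k=6, min_support=2):
    """Find exact k-letter substrings present in at least min_support sequences.

    sequences maps a sequence name to its DNA or protein string.
    Returns {substring: sorted list of sequence names containing it}.
    """
    support = defaultdict(set)
    for name, seq in sequences.items():
        seq = seq.upper()
        # Slide a window of width k along the sequence.
        for i in range(len(seq) - k + 1):
            support[seq[i:i + k]].add(name)
    return {kmer: sorted(names) for kmer, names in support.items()
            if len(names) >= min_support}
```

Raising `min_support` or `k` makes the search stricter, which is loosely analogous to tuning the sensitivity of a pattern search.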
 
 
 
For this tutorial, we will begin with the set of sequences retrieved in the '''Sequence Retrieval''' tutorial.  These sequences derive from a cluster of genes showing a similar expression pattern across a number of different experiments.
 
 
 
The user can adjust a number of parameters, shown in the figure, to tune the sensitivity of the search.
 
 
 
 
 
[[Image:T_PatternDiscovery_Run.png]]
 
 
 
 
 
The result of the search can be viewed both in the '''Pattern Discovery''' module itself and in other sequence viewer modules.
 
 
 
 
 
[[Image:T_PatternDiscovery_Result.png]]
 
 
 
 
 
The results of a run of '''Pattern Discovery''' are placed in the Project Folder:
 
 
 
 
 
[[Image:T_ProjectFolder_PatternDiscovery.png]]
 

Latest revision as of 13:11, 6 August 2013

Resources:

http://geworkbench.org = http://wiki.c2b2.columbia.edu/workbench

http://wiki.c2b2.columbia.edu/workbook/index.php/Genomics_Workbook

https://sharepoint.c2b2.columbia.edu/c2b2/default.aspx

http://wiki.c2b2.columbia.edu/mantis/

http://wiki.c2b2.columbia.edu/mantis/view_all_bug_page.php

http://wiki.c2b2.columbia.edu/mantis/login_page.php

http://wiki.c2b2.columbia.edu/isrce/index.php/MARINa,_IDEA,_CUPID_Grid_Service_Implementation


http://gforge.nci.nih.gov

http://gforge.nci.nih.gov/projects/geworkbench

http://wiki.c2b2.columbia.edu/informatics/ same as (http://helpdesk.cu-genome.org/informatics/)


ICTVdb


http://wiki.c2b2.columbia.edu/ictvdb/

nonpublic documents:

adcvs.cu-genome.org:/cvs/magnet