From Informatics

Overview

Version 1.1 of geWorkbench addresses known shortcomings and adds new functionality in several areas. The most salient enhancements include:

An explicit GUI for loading Affy annotation files. The new approach will make it transparent how annotations are added to the system. As part of this enhancement we expect the annotations parser component to be re-engineered.
Explicit designation of the chip type (where needed) by the user, rather than through automatic inference (the current implementation of the latter creates problems when, e.g., a dataset is filtered).
Rectification of known issues in the GUI and functionality of the Sequence panel, Promoter panel and Sequence retriever panel.
Support for the loading of .CEL formatted Affy files. This functionality calls for piping .CEL files through R methods that transform the probe level data of the .CEL files into probeset data appropriate for loading into geWorkbench.
Component versioning with support for automatic downloading and installation of newer versions of existing components.
Full-fledged implementation of the Visual Builder to support (1) component sourcing from remote component repositories, (2) easy configuration of components with compatible interfaces.
Workflow support.
Addition of a significant number of new clustering/classification methods.

The list of requirements below is only a guide at this point. We may decide to drop some of them or reschedule them for a post-v1.1 release. Similarly, new requirements will be added as a result of the MAGNet timelines as well as the caBIG Y2 project.

Detailed Requirements Specification

This section is "live"; it will be undergoing modifications as needed.

Annotations Parser

Use Case document
Status: Released to production

The system shall recognize chip type when a dataset is loaded.

The system will attempt to auto-recognize the chip type if it is not included in the file. In that case, the system shall present the auto-recognition results to the user, who is offered the option to agree or disagree with them. If the user disagrees, the system will prompt the user to designate an 'other' chip type. This may result in email notification of the new data types (other) to the support staff. The system shall store the user-designated chip type in the data file.
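
The decision flow above can be sketched as follows. This is a minimal illustration in Python, not geWorkbench code: the function name resolve_chip_type and its parameters are hypothetical, the chip names in the usage lines are examples only, and falling back to 'other' when the user names no type is an assumption.

```python
from typing import Optional


def resolve_chip_type(declared: Optional[str],
                      inferred: Optional[str],
                      user_accepts: bool,
                      user_choice: Optional[str] = None) -> str:
    """Return the chip type that would be stored with the data file."""
    # A chip type already present in the file is used as-is.
    if declared:
        return declared
    # Otherwise the auto-recognized type is shown to the user, who may accept it.
    if inferred and user_accepts:
        return inferred
    # On disagreement (or, by assumption, when recognition yields nothing)
    # the user designates a type; "other" is the assumed fallback.
    return user_choice or "other"


print(resolve_chip_type(None, "HG_U133A", user_accepts=True))
print(resolve_chip_type(None, "HG_U133A", user_accepts=False, user_choice="CustomChip"))
```

The email notification and the dataset history entry mentioned in this section are left out of the sketch.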

The system shall support user modification of file chip types at any time. This will result in data node updates to reflect a change and a dataset history log entry to capture this change.

Related Mantis bug(s): 200, 452, 356

Sequence Panel

Use Case document
Status: Released to production

The sequence panel will include a line view (current format) and a full sequence view where, instead of a line, the full character string for a sequence is displayed. In the full sequence view a pattern match is indicated with a bold/colored font. The system shall support double-clicking on a sequence line in the line view to display the entire sequence within a separate pane (similar to how this is handled within the Promoter panel).

The system will include an "All sequences" checkbox that lets users switch between viewing all sequences and viewing only sequences that belong to activated groups.

When patterns are displayed in the sequence panel, the user can click on a pattern block (in the line view) and see the location of the pattern match over the target sequence at the bottom of the view with the matching location centered in the view. In addition, this sequence detail shall include the numeric position of the first and the last matching pattern character.
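
The position reporting and centering described above could look roughly like this. It is a sketch under stated assumptions: positions are taken to be 1-based, patterns are treated as regular expressions, and the helper names match_positions and centered_window are hypothetical.

```python
import re


def match_positions(sequence, pattern):
    """Return (first, last) 1-based positions for each match of pattern."""
    positions = []
    for m in re.finditer(pattern, sequence):
        # m.start() is 0-based; m.end() is exclusive, so it equals the
        # 1-based position of the last matched character.
        positions.append((m.start() + 1, m.end()))
    return positions


def centered_window(seq_len, first, last, width):
    """Return the 1-based (start, end) of a window of at most `width`
    characters, centered on the match where the sequence ends allow it."""
    center = (first + last) // 2
    start = max(1, center - width // 2)
    end = min(seq_len, start + width - 1)
    # Shift the window left if it ran past the end of the sequence.
    start = max(1, end - width + 1)
    return start, end


print(match_positions("ACGTTGACGT", "ACG"))  # [(1, 3), (7, 9)]
print(centered_window(10, 7, 9, 5))          # (6, 10)
```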

Related Mantis bug(s):

Sequence Retriever

Use Case document
Status: Released to production

The system shall include the following in the sequence retriever view:

A "Clear" button to remove results from the view (this can go at the bottom of the panel, just make the text boxes for the DNA upstream/downstream regions shorter).
A "Stop" button to allow stopping an ongoing retrieval attempt.
A drop-down list indicating the data source from which data are obtained. Options may include "Local", "caBIO", "EMBL", etc.
For each retrieved protein sequence the protein sequence name shall be displayed to the left of the sequence (what is displayed now seems to be a truncated version of the marker name corresponding to the sequence). Wherever possible (depending on the source), protein names will be hyperlinked to URLs providing sequence annotation.
For each retrieved DNA sequence the corresponding marker's name will be displayed to the left of the retrieved sequence (as is the case now). Wherever possible data will be collected from the sequence data source about the retrieved sequences, including: Organism, Chromosome, Chromosome location. Such data will be displayed next to the sequence header within the pane that appears when double-clicking on a sequence.
Each retrieved sequence shall have a selection check box (similar to the ones in the panel displaying BLAST results) which allows users to select which sequences will be added to the project when the "Add to project" button is selected.
A "select/unselect all" option allows users to select or unselect the selection boxes for all retrieved sequences.
The total number of sequences displayed in the view. This value is refreshed when the view is updated.
The limitation of 2k on upstream and downstream values shall be removed.
The system shall indicate visually that sequences have been retrieved for the selected marker(s).
The system will support selecting one or more specific markers and displaying only the sequences that correspond to the selected marker(s).

The online help should describe precisely what sequences are retrieved (e.g., when retrieving protein sequences, are these proteins for which the marker belongs to an exon? An intron? A promoter region?).

When a user changes the type selector ("DNA", "Protein"), the system shall redraw the sequence display to reflect the most recent selections for this type. If the user opts to add the retrieved sequences to the project panel, the system shall capture in a log entry (within the dataset history panel) the input markers and parameters used for the retrieval.

Similar to the Sequence Panel, the promoter will offer a way to toggle between a line view (the one currently used) and a full character sequence view where for each DNA/Protein retrieved its full character sequence is displayed. The double-click functionality (which displays the sequence of a selected sequence in a separate pane) will remain.

Related Mantis bug(s): 295, 406, 424, 425

Project Panel

Persisted workspaces will be parseable by subsequent application builds. At present this is not the case as workspaces are saved using Java serialization.

Related Mantis bug(s): 500

Promoter Panel

Use Case document
Status: Released to production

The component shall include a right scroll bar to allow for viewing the entire promoter panel window.

The promoter logo will come up on a separate pane (rather than consume real estate in the main component area).

The run parameters (lower left portion of the component) will be folded in a separate subtab. The online help shall be extended to provide detailed description of what each such parameter represents.

A promoter from the TF list can be added only once to the Selected TF list (in the current version, double-clicking on a promoter in the TF list that is already in the Selected TF list results in a duplicate being added to the Selected TF list). Search boxes will allow text searching for a TF in both the TF and Selected TF lists (similar to how search boxes are used to locate markers in the Marker Panel).

Support will be added for adding at once multiple transcription factors from the TF list to the Selected TF list. Support will be added for removing at once multiple transcription factors from the Selected TF list.
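
The list behavior described in the two paragraphs above (no duplicates, bulk add and remove, text search) can be sketched as below. The class name SelectedTFList is hypothetical, the TF names are examples only, and treating search as a case-insensitive substring match is an assumption.

```python
class SelectedTFList:
    """Ordered collection of transcription factor names without duplicates."""

    def __init__(self):
        self._items = []  # preserves the order in which TFs were added

    def add(self, names):
        """Add several TFs at once; names already selected are skipped."""
        for name in names:
            if name not in self._items:
                self._items.append(name)

    def remove(self, names):
        """Remove several TFs at once; unknown names are ignored."""
        drop = set(names)
        self._items = [n for n in self._items if n not in drop]

    def search(self, text):
        """Return the selected TFs whose name contains `text`."""
        needle = text.lower()
        return [n for n in self._items if needle in n.lower()]

    def items(self):
        return list(self._items)


selected = SelectedTFList()
selected.add(["SP1", "GATA1", "SP1"])  # the second "SP1" is ignored
selected.add(["MYC"])
selected.remove(["GATA1", "MYC"])
print(selected.items())  # ['SP1']
```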

The online help should explain in detail what the "Add TF" and "Retrieve" buttons do, as well as the format of the data files they process. Promoters added through the "Add TF" button should be persisted across application invocations.

"Image snapshot" and "Save results" options will be provided. The "Image snapshot" will create an image with legends for the various matching TFs / patterns. The "Save results" option will allow the user to save the displayed data in a comma or tab separated format with 3 columns:

The name of the TF or the pattern representation of a matching element.
The name of the underlying sequence.
The first matching position.

A "Prefs" subtab will designate the location of the source data files used to retrieve the JASPAR motifs. In addition to the default JASPAR distribution file, users shall be able to specify an alternative JASPAR-formatted data file from which to load transcription factors. They will also be able to indicate whether the data from the new file should replace or be added to the currently loaded set of factors.

Related Mantis bug(s):

Online Help

The system shall support context sensitive online help: after clicking on a component, hitting F1 should bring up the online help section for that component.

The topics within the online help system shall be listed alphabetically.

Related Mantis bug(s):

Data Export

The system shall support exporting gene expression data in:

the cluster format (see format description in http://www.stanford.edu/group/robinsonlab/microarrays/ClusterTreeView.pdf)
a spreadsheet format that can be read into R as a data frame using the read.table() function.
comma separated values (csv) and tab delimited format.
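
As an illustration of the spreadsheet and delimited export options, the sketch below writes an expression matrix with one header row and one row per marker. The function name export_expression_table, the "marker" header label and the sample values are hypothetical; the layout is one reasonable reading of the requirement, not a specified format.

```python
import csv
import io


def export_expression_table(marker_names, array_names, values, delimiter="\t"):
    """Serialize an expression matrix as delimited text.

    The first column holds marker names and the header row holds array names.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    writer.writerow(["marker"] + list(array_names))
    for marker, row in zip(marker_names, values):
        writer.writerow([marker] + list(row))
    return buf.getvalue()


text = export_expression_table(
    ["1000_at", "1001_at"],       # example marker names
    ["A1", "A2"],                 # example array names
    [[1.5, 2.0], [3.25, 4.0]],
)
print(text)
```

A tab-delimited file written this way should load in R with read.table(path, header = TRUE, sep = "\t", row.names = 1); passing delimiter="," produces the csv variant.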

The system shall support exporting sequence data in GCG format.

Related Mantis bug(s): 479

Affymetrix Data Format

The system will support loading native .CEL Affymetrix data formats.

The user shall be able to designate one or more data transformations mapping the probe level data in the .CEL file to probe set level which can then be loaded in geWorkbench. The project folder window will only include the post transformation results as the data node. The transformations will be detailed in the dataset history.

Users will be able to select among a set of transformations supported by the Bioconductor package (http://www.bioconductor.org/). Execution of the designated operations will be outsourced to a local or server instance of R (which must be equipped with the relevant Bioconductor packages).

The system shall support loading .chp files.

Dependent Requirement: Data Node Warning

Related Mantis bug(s): 407, 465

Component Versioning

Every component shall have a version number indicating its development history. Similarly, the geWorkbench core shall have a version number. The system shall maintain a mapping of every component version number to all the core version numbers under which the component can run. Both core and component versions must obey a total ordering; that is, given any two version numbers, the system can determine which one is more recent.
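
One way to satisfy the total-ordering and compatibility-mapping requirements is sketched below. Since the exact version format is still to be decided, the sketch assumes dotted numeric versions such as "1.10"; the component name "examplefilter" and the contents of the mapping are made up for illustration.

```python
def parse_version(text):
    """Turn a dotted numeric version such as '1.10' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_newer(candidate, current):
    """True if `candidate` is more recent than `current`.

    Tuples of integers compare element by element, which gives a total order
    and ranks 1.10 above 1.9 (plain string comparison would not).
    """
    return parse_version(candidate) > parse_version(current)


# Hypothetical mapping: (component, component version) -> compatible core versions.
COMPATIBILITY = {
    ("examplefilter", "1.0"): {"1.0", "1.1"},
    ("examplefilter", "1.2"): {"1.1"},
}


def can_run(component, component_version, core_version):
    """True if the given component version is registered for the core version."""
    return core_version in COMPATIBILITY.get((component, component_version), set())


print(is_newer("1.10", "1.9"))                 # True
print(can_run("examplefilter", "1.2", "1.0"))  # False
```

Under this scheme "1.1" and "1.1.0" compare as different versions, so a real format would need a rule for trailing zeros.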

On application startup, the application will check with the geWorkbench server whether any of the following is available:

A new version of the core: the system shall notify the user of the update and prompt the user to download it. Upgrading a local installation will respect previous user settings.
A new version of one or more components: the system shall notify the user and prompt the user to download them. Installation of new component versions shall respect user settings and preexisting component-level data and, if needed, migrate that data to the new component standard.

The system will maintain information about the version of all analysis, filtering and normalization components that were used in the calculation of results nodes and in the treatment of data nodes. This information will be accessible from the dataset history component.

The system shall support pinning down component version information when storing (and executing) caScript workflows.

Exact format of the version numbering - TBD

caSCRIPT workflows handling of component versioning - TBD

Related Mantis bug(s): none

Component Visual Builder

The system shall allow designating the location of the geWorkbench component repository from within the Visual Builder.

For each listed component the Visual Builder shall provide the following information:

Component description outlining the component functionality and intended use.
Provenance information, including authoring lab, license text, and third-party library information.
Partner components, i.e., all other components that can be subscribers of or providers for the selected component.

Using the Visual Builder users will be able to specify which visual area of the geWorkbench GUI a designated component will be loaded in. For components with special license needs, the application will support a license info window with "Accept" buttons before the users can proceed with the download (we may want to allow pointing users to an external URL for dealing with licensing issues; details TBD).

Every component will expose its @Subscribe, @Publish, @Script and @Module methods and their associated annotations.

Related Mantis bug(s): none

Data Transformation (filtering and normalization) Implications

Data filtering and normalization affect the structural integrity of the underlying dataset. This, in turn, affects the ability to correctly represent child result nodes that were obtained prior to filtering. The system shall manage the impact of data transformations on child result nodes.

Specifically:

Filtering/normalization will continue to operate as is (i.e., without prompting the user and without creating a new data node) as long as the dataset being modified has no child nodes.
When filtering/normalizing a dataset that has child nodes, the user will be prompted to specify how the resulting dataset should be handled:
1. Be represented with a new node in the project folders tree.
2. Replace the original dataset. In this case, all child nodes will be removed as well.
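
The rule above reduces to a small decision function. The following sketch is illustrative only: the Handling enum and plan_transformation are hypothetical names, and raising an error when no choice is supplied stands in for the user prompt.

```python
from enum import Enum


class Handling(Enum):
    IN_PLACE = "in_place"                     # no children: modify silently
    NEW_NODE = "new_node"                     # option 1: add a new node
    REPLACE_AND_DROP_CHILDREN = "replace"     # option 2: replace, remove children


def plan_transformation(child_count, user_choice=None):
    """Decide how the result of a filter/normalizer run is stored."""
    # Datasets without child nodes keep the current behavior.
    if child_count == 0:
        return Handling.IN_PLACE
    # Datasets with child nodes need an explicit choice from the user.
    if user_choice not in (Handling.NEW_NODE, Handling.REPLACE_AND_DROP_CHILDREN):
        raise ValueError("a dataset with child nodes requires an explicit user choice")
    return user_choice


print(plan_transformation(0))
print(plan_transformation(2, Handling.NEW_NODE))
```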

Related Mantis bug(s): 451, 482

Look and Feel

The system will include a selection of available component configuration set-ups (flavors). At any point in time, users can browse available flavors and select one to use. Example flavors could include all.xml, sequenceanalysis.xml, reverseengineering.xml, etc. When browsing through available flavors, the system shall show a description of the functionality each flavor offers.

Users can remove a component from the application layout by clicking on the close window control of the component. The system must include the option to reinstate all the components from the most recently selected flavor.

The application shall persist the component layout at the time of exit. Upon restart, the component layout that was in place at exit is reinstated.

Related Mantis bug(s): 115, 165

Workflow support

Presently the system provides rudimentary dataset history functionality: for every dataset, the history of its data transformations is captured (the set of filters and normalizers that have been applied to the dataset). This history does not include the exact parameter settings for each of those data transformations. The data history functionality shall be extended so that:

Parameter settings for the data transformations are recorded.
Component versions of the data transformations are recorded.
For result nodes, the analysis parameter settings are recorded.
From the history of a project node, the system will allow the automatic generation of a named workflow, which can subsequently be applied to new nodes.
The system shall support editing of named workflows to permit deletion of arbitrary steps, modification of parameter settings and addition of arbitrary steps.
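
A possible data model for the extended history and for named workflows is sketched below. All names (HistoryStep, Workflow, workflow_from_history) and the sample components are hypothetical; the sketch only shows that each step carries a component, its version and its parameters, and that editing a workflow leaves the recorded history unchanged.

```python
from dataclasses import dataclass, field


@dataclass
class HistoryStep:
    component: str      # name of the filter, normalizer or analysis
    version: str        # component version used for this step
    parameters: dict    # exact parameter settings used


@dataclass
class Workflow:
    name: str
    steps: list = field(default_factory=list)

    def delete_step(self, index):
        del self.steps[index]

    def insert_step(self, index, step):
        self.steps.insert(index, step)

    def set_parameter(self, index, key, value):
        self.steps[index].parameters[key] = value


def workflow_from_history(name, history):
    """Build a named workflow from a node's history.

    Steps and their parameter dicts are copied so that later edits to the
    workflow do not alter the recorded history.
    """
    copied = [HistoryStep(s.component, s.version, dict(s.parameters)) for s in history]
    return Workflow(name, copied)


history = [
    HistoryStep("examplefilter", "1.0", {"threshold": 2}),
    HistoryStep("examplenormalizer", "1.1", {"method": "log2"}),
]
workflow = workflow_from_history("my-workflow", history)
workflow.set_parameter(0, "threshold", 3)
print(history[0].parameters["threshold"], workflow.steps[0].parameters["threshold"])  # 2 3
```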

Related Mantis bug(s): 480

Result Nodes

If a result node is visible in the project folder before the associated analysis is complete, the system shall provide a visual cue indicating that the analysis is in progress. In such cases, mousing over the result node should reveal the percentage completion of the analysis.

The system will include a cancellation button in the analysis components allowing users to terminate analysis before it's complete.

Related Mantis bug(s):

Molecular Structures

Status: Partially completed; released.

Jmol will be wrapped as a geWorkbench component. The application will support loading and visualizing PDB data files in Jmol. Also, in the context of a FASTA file, the system shall support selecting a sequence, retrieving its structure from the PDB database and visualizing it in Jmol.

Related Mantis bug(s): none

Analysis Methods

Add support for (a subset of) the following analysis methods (listed in decreasing order of desirability):

SAM - Significance Analysis of Microarrays
ANOVA - One-way Analysis of Variance
SVM - Support Vector Machines
PCAE and PCAG - Principal components analysis
FOM - Figures of Merit
GSH - Gene shaving
ST - Support trees (Bootstrapping)

Related Mantis bug(s): none

Business Requirements v1.1
