GeWorkbench Roadmap

Overview

This page is for planning development of geWorkbench.


Planning for post-geWorkbench 2.5 releases

caArray 2.5+ feature list

Below is a list of target features, published May 3, 2010 in the caArray forum of the MAT-KC. We should determine which data types it would make sense to handle in geWorkbench, and what algorithms we could add to analyze these types.

  • Illumina BGX/TXT array designs
  • Illumina Sample Probe Profile TXT data files
  • Illumina genotyping processed matrix TXT files
  • Affymetrix AGCC/Calvin CHP and CEL files
  • Affymetrix CNCHP copy number files
  • Agilent GEML/xml array designs
  • Agilent raw TXT files for aCGH, gene expression and miRNA assays
  • Copy number data in MAGE-TAB data matrix format

  • Nimblegen NDF array designs (in collaboration with Yale University)
  • Nimblegen Pair Report (raw and normalized) TXT files (in collaboration with Yale University)

geWorkbench Business Requirements version 1.7+

R-based services

  • The initial implementation will target an R installation on the user’s own machine. Our group will not implement R services as caGrid services; perhaps we can make use of Martin Morgan’s caGrid R project in the future.
  • Need to investigate whether the local interface should be
    • Command line only –
      • would provide version independence.
      • geWorkbench would write out an R script and run it (see the sketch after this list).
    • R-Server API –
      • presumably would provide better control and interaction.
  • Candidates for priority implementation:
    • RMA
    • Bioconductor Affy Quality control measures.
      • Use external viewer depending on complexity
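
As a concrete illustration of the command-line option, below is a minimal sketch (in Java, since geWorkbench is a Java application) of writing an R script to a temporary file and running it with Rscript via ProcessBuilder. It assumes an R installation with Rscript on the user's PATH; the Bioconductor call shown (affy::justRMA) and all class and method names are illustrative only, not existing geWorkbench code.

  import java.io.File;
  import java.io.IOException;
  import java.io.PrintWriter;

  // Sketch of the "command line only" option: write an R script, run it with
  // Rscript, and read the results back from a file afterwards.
  public class LocalRRunner {

      public static void runRma(File celDirectory, File outputFile)
              throws IOException, InterruptedException {
          File script = File.createTempFile("geworkbench-rma", ".R");
          try (PrintWriter out = new PrintWriter(script)) {
              out.println("library(affy)");
              out.println("eset <- justRMA(celfile.path='"
                      + celDirectory.getAbsolutePath().replace('\\', '/') + "')");
              out.println("write.exprs(eset, file='"
                      + outputFile.getAbsolutePath().replace('\\', '/') + "')");
          }
          Process p = new ProcessBuilder("Rscript", script.getAbsolutePath())
                  .redirectErrorStream(true)
                  .start();
          if (p.waitFor() != 0) {
              throw new IOException("R script exited with a non-zero status");
          }
      }
  }

The version independence mentioned above comes from depending only on the Rscript executable; the cost is that all data and results must pass through files rather than a live connection.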

Additional Standard Methods

  • Provide methods, either native or via R, for dealing with more complex experiments and additional standard algorithms
  1. T-test with false discovery rate (FDR) (see the sketch after this list).
    1. MEV implements “False Discovery Control”, Korn et al., 2001, 2004.
  2. Paired-sample t-test (it is in MEV).
  3. 2-way ANOVA (it is in MEV).
  4. Time-course experiments
  5. Replicate experiments
  6. Need to look at what people are doing right now – NextGen sequencing etc.
  7. Others? – Look through BRB Array tools and GenePattern and see what is most interesting.
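
For item 1, below is a minimal, self-contained sketch of the standard Benjamini-Hochberg FDR adjustment applied to a vector of t-test p-values, independent of any geWorkbench data structures. Note this is the classic BH procedure, not the Korn et al. “False Discovery Control” method that MEV implements; class and method names are illustrative.

  import java.util.Arrays;

  // Benjamini-Hochberg adjustment: q_i = min over k >= rank(i) of p_(k) * m / k.
  public class FalseDiscoveryRate {

      /** Returns BH-adjusted p-values (q-values), in the original marker order. */
      public static double[] benjaminiHochberg(final double[] pValues) {
          int m = pValues.length;
          Integer[] order = new Integer[m];
          for (int i = 0; i < m; i++) order[i] = i;
          // sort indices by descending p-value so a running minimum can be taken
          Arrays.sort(order, (a, b) -> Double.compare(pValues[b], pValues[a]));

          double[] adjusted = new double[m];
          double runningMin = 1.0;
          for (int k = 0; k < m; k++) {
              int index = order[k];
              int rank = m - k;                        // rank in ascending order of p
              double q = pValues[index] * m / rank;    // raw BH value
              runningMin = Math.min(runningMin, Math.min(q, 1.0));
              adjusted[index] = runningMin;
          }
          return adjusted;
      }
  }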

Better integration with caArray - CEL file handling

  • Download CEL files for an experiment into a project directory. This is available in the existing APIs.
  • Could make a CEL file data node in Project Folders which represents files on disk.
  • Run R-based QC and RMA etc.
  • Import the resulting dataset into geWorkbench.

Additional platform support

  1. Illumina
  2. Agilent
  3. Other platforms in use in caArray where native files are required for typical analysis.
  • The specific types of data need to be investigated for each case. This would involve
    • Data files
    • Annotation files

geWorkbench native grid services (ARACNE, clustering etc)

  • We should look for existing parallel versions of our compute code for execution on the cluster.
    • Hierarchical Clustering - deHoon
    • SOM Clustering
    • Anova
    • ARACNE
    • MatrixREDUCE
    • NetBoost
  • Hardware - Need better hardware to support real calculations. Compute-intensive services should be run on the cluster, not on the grid node.
  • Code commonality - geWorkbench’s own grid services need to share the same code for the actual algorithm implementations in grid and local versions. The algorithms should be separated from the particular data structures of geWorkbench if possible.
    • Note that the wrappers for local vs grid versions differ.
    • Note that the grid and local hierarchical clustering implementations somehow started from completely different code bases; they have now converged.
    • It was suggested that the actual algorithm implementations be made BISON-independent and reusable. These could then be used both locally and in grid implementations. This is being done for hierarchical clustering as an example (a sketch of the idea follows this list).
  • Data transfer –
    • Data transfer from geWorkbench to a grid service is a two-step process. First the data is transferred to an intermediary dispatcher component. Then it is transferred from the dispatcher to the analytical grid service.
    • Currently, the full dataset is sent to the dispatcher, then only the selected markers sent to final grid service. The very slow performance of e.g. hierarchical clustering on the grid implies this is a big bottleneck. (The entire dataset is base64 encoded for transmission to the dispatcher). Clustering that takes a second locally takes several minutes on the grid.
    • Also under the current implementation, a full XML expansion of the selected data is performed to send it from the dispatcher to the actual analytic caGrid service. This prevents large datasets from being submitted to our grid services because the memory demands for such expansions are too large.
    • One possible method to improve file transfer from dispatcher to grid service is by using caTransfer (Done).
  • Grid service versions – how many versions of grid service interfaces to support? These change whenever e.g. Bison changes (why?).
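
As a sketch of the code-commonality idea above, an algorithm contract like the following would see only primitive arrays; the local component wrapper and the grid service wrapper would each convert to and from geWorkbench/BISON data structures on their own side. All names here are hypothetical, not existing geWorkbench classes.

  // Hypothetical BISON-independent algorithm contract, reusable locally and on the grid.
  public interface MatrixAlgorithm<R> {

      /**
       * @param expression   markers x arrays matrix of expression values, already
       *                     restricted to the user's marker/array selection
       * @param markerLabels row labels for the expression matrix
       * @param arrayLabels  column labels for the expression matrix
       * @return an algorithm-specific result built from primitives and strings,
       *         so either wrapper can serialize it without BISON
       */
      R execute(double[][] expression, String[] markerLabels, String[] arrayLabels);
  }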

Project Directory

  • There should be a clearly defined project directory that corresponds to the “Project” in geWorkbench. Once set, this should be the default location in which to look for new files or to which to save files.
  • It should be alterable by the user.
  • What to do if the user selects a different location from which to load or to which to save a file?
    • Remember most recent, or always go to Project?
    • Perhaps it is best to remember the most recent location but, as in Windows, to provide an icon for instant return to the Project folder (see the sketch below).
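
A minimal sketch of that behavior using the standard Swing and Preferences APIs is below; the class name, preference key, and accessory-button approach are illustrative assumptions, not a design decision.

  import java.io.File;
  import java.util.prefs.Preferences;
  import javax.swing.JButton;
  import javax.swing.JFileChooser;

  // Default to the last-used directory, with a one-click jump back to the
  // project directory, and remember wherever the user ends up.
  public class ProjectFileChooser {

      private static final Preferences PREFS =
              Preferences.userNodeForPackage(ProjectFileChooser.class);

      public static JFileChooser create(final File projectDirectory) {
          String lastUsed = PREFS.get("lastUsedDirectory",
                  projectDirectory.getAbsolutePath());
          final JFileChooser chooser = new JFileChooser(lastUsed);

          // accessory button that returns straight to the project folder
          JButton toProject = new JButton("Project Folder");
          toProject.addActionListener(e ->
                  chooser.setCurrentDirectory(projectDirectory));
          chooser.setAccessory(toProject);

          // remember whichever directory the user actually ended up in
          chooser.addPropertyChangeListener(
                  JFileChooser.DIRECTORY_CHANGED_PROPERTY,
                  e -> PREFS.put("lastUsedDirectory",
                          chooser.getCurrentDirectory().getAbsolutePath()));
          return chooser;
      }
  }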

Component Configuration Manager (aka "Visual Builder") (Done)

geWorkbench has a static configuration file (by default, conf/all.xml) which specifies which components will be loaded when the application starts. As the application does not allow the user to modify this file dynamically, the conservative approach is to include all available geWorkbench components in that file. That, in turn, leads to a long startup time (as all these components need to be instantiated) and also to slow performance (as the events generated by the framework are handled sequentially by all components that listen for such events, even though most of these components will not be utilized in any given session). One way to rectify this situation would be to offer a Component Configuration Manager (CCM) which would allow users to specify components for dynamic loading/unloading while the application is running. The application should persist user choices and restore them at the next invocation, starting up with the same component configuration as the one that was in place when the application last exited (see the sketch below). Preliminary requirements.
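
A minimal sketch of the persistence part, assuming choices are kept in a properties file under the user's home directory (the file name, key format, and default are illustrative, not the actual CCM implementation):

  import java.io.File;
  import java.io.FileReader;
  import java.io.FileWriter;
  import java.io.IOException;
  import java.util.Properties;

  // Persist the user's component on/off choices so the next session starts with
  // the same configuration that was in place when the application last exited.
  public class ComponentChoiceStore {

      private static final File STORE = new File(
              System.getProperty("user.home"), ".geworkbench-components.properties");

      public static void save(Properties enabledComponents) throws IOException {
          try (FileWriter out = new FileWriter(STORE)) {
              enabledComponents.store(out, "geWorkbench component choices");
          }
      }

      public static boolean isEnabled(String componentId) throws IOException {
          Properties props = new Properties();
          if (STORE.exists()) {
              try (FileReader in = new FileReader(STORE)) {
                  props.load(in);
              }
          }
          // components not mentioned in the store default to "enabled"
          return Boolean.parseBoolean(props.getProperty(componentId, "true"));
      }
  }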

Memory, performance and scalability issues

  • We must characterize what the limits are to the size of datasets we can currently handle:
    • How much data can we simply load?
    • What is the data-size multiplier in the current architecture – how much is a dataset expanded as it is loaded into objects in memory?
    • Note – little of the expansion is in the raw expression data; it is the annotation data that takes up most of the room.
    • What are per-algorithm limits, e.g. for hierarchical clustering?
  • If these limits are significant, what do we need to do to overcome them?
    • Support 64 bit architectures.
    • Implement data caching so that not everything is in memory? E.g. investigate disk-caching libraries (ehcache, etc.) to allow preserving the original datasets.
  • The Marker and Array set “checkboxes” can be so slow as to make the application unusable.
    • Review what has been done and why it is still so slow.
    • Review use of synchronous vs asynchronous events. Can changes between these two modes help further?
  • Thread safety and synchronization.
  • Bison –
    • A revised version is being discussed (Floratos/Califano lab).
    • Bison version information must be maintained.
  • Memory error handling (Mantis 693):
    • Is it possible to warn the user when no more memory is available (see the sketch after this list)? Especially on the Mac we are using for testing, where one sometimes has to wait quite a while, it would be helpful to know that geWorkbench has crashed rather than still waiting or working.
    • Notify user if a child process dies due to memory problems – can it be implemented?
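
One possible, best-effort answer to the out-of-memory warning question is to install a default uncaught-exception handler, sketched below. It only sees OutOfMemoryErrors that actually propagate out of a thread, the dialog itself may fail if the heap is truly exhausted, and it does not cover child processes; the class name and message text are illustrative.

  import javax.swing.JOptionPane;
  import javax.swing.SwingUtilities;

  // Best-effort warning when an OutOfMemoryError escapes any thread.
  public class MemoryErrorWarning {

      public static void install() {
          Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
              if (error instanceof OutOfMemoryError) {
                  SwingUtilities.invokeLater(() ->
                          JOptionPane.showMessageDialog(null,
                                  "geWorkbench has run out of memory.\n"
                                  + "Results of the last operation may be incomplete.",
                                  "Out of memory", JOptionPane.ERROR_MESSAGE));
              }
          });
      }
  }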

Java 1.6 support (transitioned in geWorkbench 2.0)

  • Two components known not to work
  1. caArray 2.1
  2. GenePattern PCA
  • Remove these two and see if the rest of the application works under Java 1.6; in the past we have seen random crashes under 1.6.

Persistence of user changes to interface (column arrangements in tables etc)

Re-implement GO Terms component - Notes for a new Use Case document

This has been moved to GeneOntology2_spec.

Pattern Discovery server rewrite.

Unit tests

Continue to implement...

Cytoscape 2.4 (Done)

  • What needs to be done to allow upgrade from Cytoscape 1 to Cytoscape 2.4?

Clean up / refactor for more coherent sources

  1. Under the current organization it can be hard for the uninitiated to understand where a component will be found.
  2. Sections of code or whole files become abandoned. Should they be removed?
  3. SVM: why is it under components-clustering and not components-analysis?
  4. caGrid - Where does it fit?
  5. Interactions – a previous component version is present in the interactions directory – dead code.
  6. gpmodule
  7. synteny
  8. (others?)
  9. Why is the file workbook-0.91.jar being used for the ARACNe and MINDy projects (e.g. components\aracne-java\lib\workbook-0.91.jar)?

Graphics

Graphics export/printing should be of publication quality; it should be standardized, and more than one format should be supported (JPEG, TIFF, PNG). We need a use case for this (a sketch of a shared export path follows).
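
A minimal sketch of a shared, component-independent export path: paint any Swing visualization component into a BufferedImage and hand it to ImageIO, which supports PNG and JPEG out of the box (TIFF requires an extra ImageIO plugin on older JREs). Resolution/DPI handling for true publication quality is not addressed here, and the class name is illustrative.

  import java.awt.Component;
  import java.awt.Graphics2D;
  import java.awt.image.BufferedImage;
  import java.io.File;
  import java.io.IOException;
  import javax.imageio.ImageIO;

  // Render a Swing component to a raster image file (e.g. "png" or "jpg").
  public class ImageExporter {

      public static void export(Component component, File file, String format)
              throws IOException {
          BufferedImage image = new BufferedImage(
                  component.getWidth(), component.getHeight(),
                  BufferedImage.TYPE_INT_RGB);
          Graphics2D g = image.createGraphics();
          try {
              component.paint(g);
          } finally {
              g.dispose();
          }
          if (!ImageIO.write(image, format, file)) {
              throw new IOException("No image writer available for format: " + format);
          }
      }
  }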

Framework – Eclipse Rich Client Platform

The Eclipse Rich Client Platform could provide a number of desirable services.

  1. Project Directory
  2. Multiple workspaces.
  3. Versioning of components?
  4. Workflows?

Sequence ambiguity code handling

Nucleotide sequence ambiguity codes may defeat protein/nucleotide detection in various components.

  • What is the real extent of the problem – that is, what percentage of a worst-case nucleotide sequence is non-GCAT?
  • It may be better to handle sequence type detection centrally rather than reimplement in each component.
    • A sequence type value would then be added to the main data structure.
  • Or a utility function could be written which could be called by any component (a sketch follows this list).
  • Note Zhou has done work with a protein component on a branch.
  • If automatic detection is used, at a minimum there should be a way to override the inferred sequence type.
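
Below is a sketch of what such a shared utility could look like. IUPAC nucleotide ambiguity codes (R, Y, S, W, K, M, B, D, H, V, N) are counted as nucleotide-like so that ambiguous sequences are not misclassified as protein; the 90% threshold and all names are illustrative assumptions, not requirements.

  // Central sequence-type detection that tolerates IUPAC ambiguity codes.
  public class SequenceTypeUtil {

      public enum SequenceType { NUCLEOTIDE, PROTEIN }

      // GCAT/U plus the IUPAC ambiguity codes and the gap character
      private static final String NUCLEOTIDE_CODES = "ACGTU" + "RYSWKMBDHVN" + "-";

      public static SequenceType detect(String sequence) {
          int nucleotideLike = 0;
          int total = 0;
          for (char c : sequence.toUpperCase().toCharArray()) {
              if (Character.isWhitespace(c)) continue;
              total++;
              if (NUCLEOTIDE_CODES.indexOf(c) >= 0) nucleotideLike++;
          }
          // assumed rule of thumb: mostly nucleotide-like characters => nucleotide
          return (total > 0 && nucleotideLike >= 0.9 * total)
                  ? SequenceType.NUCLEOTIDE : SequenceType.PROTEIN;
      }
  }

If automatic detection like this is used, the inferred type should still be overridable by the user, as noted above.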

Move bison to a separate project

The idea here is to manage bison separately from geworkbench. Bison is a data model and could be reused in multiple places (clients like geworkbench, command-line tools, server-side code like grid services). It should be free of both Swing code and algorithms.

With regard to algorithms, we may want to consider a separate project for them as well (e.g. compute-code). This would be independent of geworkbench and could be used in any client or app (rich clients, command-line tools, webapps, etc.).

This would give us two new projects: bison (the model) and compute-code (algorithms working on primitives).

I took a stab at moving bison into a separate project and found the following:

  1. AnnotationParser (AP)
    • this is in bison and depends on core: org.geworkbench.engine.preferences and org.geworkbench.engine.properties.PropertiesManager
  2. It makes sense to move AP out of bison, but other bison structures depend on it (CSExprMicroarraySet, CSExpressionMarker).
  3. SoapParamDataset
    • is in bison and depends on core: org.geworkbench.util.patterns.PatternDB. Shouldn't PatternDB be part of bison (it extends CSAncillaryDataset)? If this is a custom data type, it shouldn't be part of bison. SoapParamDataset itself should not be part of bison.
  4. Bison objects depend on Script, which is in engine. This should be put in a separate project.
  5. Algorithms directory. In general, I like the idea of keeping algorithms in a separate project. If we keep all algorithms in a separate project (compute-code, for example), we can then have a script that jars these algorithms and places them in both the correct geworkbench components and grid services.

Standardize File Reading API

The way files are read in is not standardized. We have about 17 classes that extend DataSetFileFormat.java.

Of these, some are responsible for reading the file themselves, while others delegate to the bison object itself to read the file (in many cases, the reader is not closed).

All of these Format classes implement FileFormat and are forced to satisfy the FileFormat contract. In some cases, this doesn't make sense. For example, the PDBFileFormat and SequenceFileFormat both must implement getMArraySet but end up returning null (since these classes don't work with microarrays).
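
One way to relax that contract would be to split it so each format class only implements the parsing operations that make sense for it. The sketch below is hypothetical: apart from the existing getMArraySet problem it is meant to solve, all names are illustrative.

  import java.io.File;
  import java.io.IOException;

  // A narrower, typed contract: a microarray format implements this with the
  // microarray set type as T, a sequence or PDB format with its own result type,
  // and no format is forced to provide a getMArraySet() that returns null.
  public interface TypedFileFormat<T> {

      String[] getFileExtensions();

      boolean checkFormat(File file) throws IOException;

      /** Parses the file; the implementation opens and closes its own reader. */
      T parse(File file) throws IOException;
  }

Making parse responsible for its own reader would also give a consistent place to ensure readers are always closed.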

Miscellaneous enhancements

  1. Check for new version availability on startup and by menu item.
  2. What is the update frequency of the underlying CNKB database?

Notes

MeV (looking at the most recent version, 4.2) uses Rserve. It implements two functions, RAMA and BRIDGE: “RAMA (Robust Analysis of MicroArrays) [1] uses a Bayesian hierarchical model for the robust estimation of cDNA microarray intensities. BRIDGE (Bayesian Robust Inference for Differential Gene Expression) [2] tests for differentially expressed genes for both one and two-color microarray data. BRIDGE uses a similar Bayesian model as RAMA, but they are two independent bioconductor packages.” (MeV 4.2 Manual, p. 297).
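
For reference, connecting to a local Rserve instance from Java (the "R-Server API" option discussed above) would look roughly like the sketch below, assuming the Rserve Java client (the org.rosuda REngine/Rserve libraries) is on the classpath and an Rserve daemon is running locally; the expression evaluated is only an example.

  import org.rosuda.REngine.REXP;
  import org.rosuda.REngine.REXPMismatchException;
  import org.rosuda.REngine.Rserve.RConnection;
  import org.rosuda.REngine.Rserve.RserveException;

  // Evaluate a single R expression over a live Rserve connection.
  public class RserveExample {

      public static void main(String[] args)
              throws RserveException, REXPMismatchException {
          RConnection connection = new RConnection();   // localhost, default port 6311
          try {
              REXP result = connection.eval("mean(rnorm(100))");
              System.out.println("R returned: " + result.asDouble());
          } finally {
              connection.close();
          }
      }
  }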
