From Informatics

1 GeWorkbench Roadmap page
2 Interesting possible further additions for planning purposes - circa release 1.7.0
3 general observations - circa release 1.7.0
4 FAQ candidates - circa release 1.7.0
5 Feature Requests - circa release 1.7.0
6 Potential Bugs and other changes - circa release 1.7.0
7 some other notes from 1.7.0
8 Tips for transferring Wiki-based tutorials to Online Help
9 Comments from the caBIG Annual Meeting 2009 in Washington DC
10 Notes on release process
11 Legacy directories needing cleanup - circa release 1.7.0
12 Other interesting facts about how geWorkbench works
- 12.1 Cytoscape interactions with geWorkbench
- 12.2 CNKB - Cytoscape

GeWorkbench Roadmap page

Interesting possible further additions for planning purposes - circa release 1.7.0

CEL file download from caArray to a local directory. (no)
R-Server API integration to provide Bioconductor routines such as Affy Quality control, RMA.
Plan for adding support for Illumina, Agilent data types both from raw files and via caArray. (no)

general observations - circa release 1.7.0

In scatter plot, why is "rank statistics" checkbox all the way at the top of the screen?
Image snapshot on scatter plot appeared under closed parent data node - looks like nothing happened.
SOM analysis can be started when a SOM data node is selected, result then appears nested under that first result.
The Cancer Gene Index use case spec is in Mantis bug 1757.
Richard would like a SPLASH website (for calculations).
Our local copy of caArray is at afapp1.c2b2.columbia.edu, port 31099
Our local copy of GenePattern seems to be at afdev2.c2b2.columbia.edu, port 9999
MindyAnalysis.java contains or contained the method public MindyDataSet publishMatrixReduceSet().
Check ARACNE and MINDY site redirects from amdec.cu-genome.org. What about a SPLASH site?
I have a note that GenePattern PCA does not work with Java 1.6.
Does T-profiler depend on the old component.ontology files (3 file format)?
Apache math library has gamma(), t-test() (including paired t-test), and chi-square test.
Colt math library also has relevant functions.
cellularnetwork has some hardcoded GO terms, see GeneOntologyUtil.java lines 22-30. It also defines KINASE = "K", TF = "TF", and PHOSPATASE = "P" (sic - note it has a spelling error).
Permutations box in ARACNE cannot fit 4-digit numbers.
What should the variance default be for t-test (currently unequal)?
File handles should be closed after read, e.g. check the annotation file loading.
Log2 Transform (ignores?) any values marked missing. t-test does not complain about missing values.
There is BLAT code in the sequence alignment component. The component should be refactored; it is hard to even tell which parts of the code are in use. (See bug #1635).
Need detailed descriptions of each file format.
MINDy2 - Does any multiple testing or other p-value correction depend on the number of modulators that passed the independence test and were used? Is there a need to set a p-value criterion on delta (MI)? (Don't think so, but had a note on it.)
The Nature Protocols paper uses a percentage DPI (15%). Is this different than what we offer in ARACNe?
If in Mantis we used a better designator than "development" we could easily get changes since last version using built-in change log. E.g. make a 1.7.0_dev version for pre-branch and a 1.7.0 version for post-branch/post-release. Or something like that.
Why is the MRA component embedded in analysis rather than being a separate component?
Cytoscape needs to be separated from Cutenet component.
Is it really a good idea that we warn users to log2 transform the data prior to ANOVA analysis? This is done because ANOVA expects normal (I think) distribution of the input data.
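
On the t-test variance question above, the following is a minimal sketch of how the unequal-variance (Welch) and equal-variance (pooled) statistics differ. It is computed by hand so it has no library dependency; the class and method names are made up for illustration and this is not geWorkbench code.

```java
// Minimal sketch, not geWorkbench code: two-sample t statistics computed by hand.
final class TTestSketch {
    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) sum += v;
        return sum / x.length;
    }

    // Unbiased sample variance (divides by n - 1).
    static double variance(double[] x) {
        double m = mean(x), ss = 0.0;
        for (double v : x) ss += (v - m) * (v - m);
        return ss / (x.length - 1);
    }

    // Welch's t: each group keeps its own variance estimate (the "unequal" setting).
    static double welchT(double[] a, double[] b) {
        double se2 = variance(a) / a.length + variance(b) / b.length;
        return (mean(a) - mean(b)) / Math.sqrt(se2);
    }

    // Welch-Satterthwaite approximation to the degrees of freedom.
    static double welchDf(double[] a, double[] b) {
        double qa = variance(a) / a.length, qb = variance(b) / b.length;
        return (qa + qb) * (qa + qb)
                / (qa * qa / (a.length - 1) + qb * qb / (b.length - 1));
    }

    // Student's t: the two variances are pooled, df = na + nb - 2.
    static double pooledT(double[] a, double[] b) {
        int na = a.length, nb = b.length;
        double sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2);
        return (mean(a) - mean(b)) / Math.sqrt(sp2 * (1.0 / na + 1.0 / nb));
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4, 5};
        double[] b = {2, 4, 6, 8, 10};
        System.out.println("Welch  t = " + welchT(a, b) + ", df = " + welchDf(a, b));
        System.out.println("Pooled t = " + pooledT(a, b) + ", df = " + (a.length + b.length - 2));
    }
}
```

With equal group sizes, as in this toy input, the two t values coincide and only the degrees of freedom differ (8 for pooled, about 5.88 for Welch). With unequal group sizes and unequal variances the statistics themselves diverge, which is the usual argument for keeping "unequal" as the default.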

FAQ candidates - circa release 1.7.0

If several arrays are loaded singly, with annotations, and then merged, what happens to the redundant annotation info?
If several arrays are loaded for later merge, does it matter with which the annotations are loaded?
If multiple grid jobs are started, do they all run simultaneously?
What happens to my remote grid job if I remove the pending data node? (I think: job is not canceled, results are not retrieved.)
When I run ARACNe preprocessing, is the entire dataset always used, or only selected markers/arrays?
How is the "relative" display calculated?
Which sequence IDs from the Affy annotation files are used for sequence retrieval? Is it the Entrez ID for protein and nucleotide?
How are multiple genes annotated to a single probe handled? We know the code retains these multiple genes, but are they used anywhere? (It was discussed but don't know outcome).
Are annotation files supported for array types other than Affy? - No.
I closed a window (x). How do I get it back?
Does saving a workspace save the parameters currently set in an analysis component? - Only if the parameters have been explicitly saved. Each analysis component has a "Save Settings" button with which this can be done.
Can I load non-microarray data using the tab-delimited format? Yes - need to document exact file description.
What internal identifiers are used in geWorkbench? (what is stored in second column of affy file matrix format?)
How can I screen out a particular subset of genes from further analysis, e.g. a list of MHC or immunoglobulin family genes?
What are the "default" genSpace security settings (which are used if one is not logged in)? (Is the default to send ... that is, you have to register/login to shut it down?)
What exactly happens to the data if I have e.g. done a hierarchical clustering and then remove markers from the dataset by filtering? Are the markers removed from the cluster, is the cluster somehow maintained, or does it now show the wrong marker names in the clusters?
How do I kill geWorkbench if necessary? Answer - geWorkbench runs within a Java virtual machine (JVM). It is this process that must be canceled. Under Windows, the JVM process is called "javaw.exe". HOWEVER - a typical reason that geWorkbench seems frozen is that a dialog box has become hidden behind geWorkbench. Under Windows, you can see available windows by pressing "Alt-Tab". If you see a geWorkbench dialog, bring it to the front and close it (push OK or similar).
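
For the "how do I kill geWorkbench" answer above, a rough sketch of locating running JVM processes from Java itself. It assumes Java 9 or later (for ProcessHandle), which is newer than what geWorkbench 1.7 ran on, and it lists every JVM on the machine, not only geWorkbench. The class name is made up for illustration.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Sketch only: finds processes whose executable looks like a JVM launcher.
final class JvmProcessFinder {
    // True if the executable path ends in java, javaw, java.exe or javaw.exe.
    static boolean isJvmCommand(String command) {
        String c = command.replace('\\', '/').toLowerCase(Locale.ROOT);
        String name = c.substring(c.lastIndexOf('/') + 1);
        return name.equals("java") || name.equals("javaw")
                || name.equals("java.exe") || name.equals("javaw.exe");
    }

    // Processes whose command is not visible to this user are skipped.
    static List<ProcessHandle> findJvmProcesses() {
        return ProcessHandle.allProcesses()
                .filter(p -> p.info().command()
                        .map(JvmProcessFinder::isJvmCommand)
                        .orElse(false))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        for (ProcessHandle p : findJvmProcesses()) {
            System.out.println(p.pid() + "  " + p.info().command().orElse("?"));
            // p.destroy() would request termination. It is deliberately not
            // called here, because this list may include unrelated JVMs.
        }
    }
}
```

The printed process IDs can then be matched against the Windows Task Manager entry for "javaw.exe" before anything is terminated.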

Feature Requests - circa release 1.7.0

Should be able to display all digits in Tabular Microarray Viewer.
Support gene lists in one-gene-per-line format (this is the typical exchange format).
Add multiple testing correction directly to ARACNe GUI.
Can we get affy detection calls for caArray CHP data at same time as signal channel?
Would binning the EVD by array for each marker be of interest, to see distribution of values for each marker (looking for normality)?
Add CHP file parser?
Should be able to display statistics on microarrays such as max and min values.
Need a way to handle analysis of technical replicates of arrays.
Need a fold-change filter?
Add a "test connection" button to the caArray interface, which would just issue a very simple query.
File and annotation loading for other array types.
Annotation file loader should first look in same directory from which microarray file was loaded.
A number of components need better/more descriptive names in the CCM. I thought I had recorded the new mapping already but cannot find it. In any case, if any names are changed, need to see if this affects genSpace - that is, what source is genSpace taking names from. See bug 1908 - http://wiki.c2b2.columbia.edu/mantis/view.php?id=1908
There is nothing joining a resultant marker set with the analysis that produced it. That is, if you have done a number of, say, ANOVA runs, each will produce a set of significant markers. Other than order, there is no way to see the connection, and if an analysis or marker set were to be deleted, the order becomes unreliable.
Marker sets that result e.g. from ANOVA as "significant genes" should contain the name of analysis algorithm that produced them.
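
As an illustration of the one-gene-per-line request above, a minimal reader sketch. The class name and the choices to trim whitespace, skip blank lines and drop duplicates are assumptions, not existing geWorkbench behavior. The try-with-resources block also closes the underlying reader once the read finishes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch only: reads a gene list with one gene symbol per line.
final class GeneListReader {
    static List<String> read(Reader source) {
        // LinkedHashSet drops duplicates while keeping first-seen order.
        Set<String> genes = new LinkedHashSet<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String gene = line.trim();
                if (!gene.isEmpty()) {
                    genes.add(gene);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new ArrayList<>(genes);
    }

    public static void main(String[] args) {
        // GENE_B is a placeholder name, not a real gene symbol.
        String text = "NME1\n\n  GENE_B  \nNME1\n";
        System.out.println(read(new StringReader(text)));
    }
}
```

For a file on disk, the same method can be called with a java.io.FileReader.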

Potential Bugs and other changes - circa release 1.7.0

If you try to log in to caArray using a username/password and get it wrong, it can be hard to reset the flow to where you can properly try to connect again.
"Normalization Panel" tab in GUI should just say "Normalization".
In the analysis routines list, Anova should be ANOVA.
The Dendrogram component allows click-selection of markers but not arrays. Color mosaic-based displays enable both.
In the ANOVA analysis component, the parameter "autohighlight" feature does not work when parameters match a saved set.
In the SOM component, the choice of "Function" (bubble or Gaussian) is not saved.

some other notes from 1.7.0

Cancer Gene Index

Mantis bug entry #1757 has a lengthy use case description for the CGI component.

Grid jobs

http://wiki.c2b2.columbia.edu/mantis/view.php?id=1771

Notes on control of grid jobs: (1) once a grid job is running, there is no way to cancel the actual execution on the back end, and (2) if you right-click on the pending node indication in Project folders, you can remove the job from geWorkbench. This effectively cancels the grid job from the user's point of view. The results will then not be posted into geWorkbench when the remote node completes.
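
The behavior described in these notes can be sketched as a small client-side registry. All names here are hypothetical and the actual geWorkbench grid code may be organized quite differently. The point is only that removing the pending node makes the client forget the job, so a result that arrives later is dropped, while nothing is sent to stop the remote execution.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of client-side bookkeeping for pending grid jobs.
final class PendingGridJobs {
    private final Set<String> pending = ConcurrentHashMap.newKeySet();
    private final Map<String, String> posted = new ConcurrentHashMap<>();

    // A job was dispatched and its pending node is shown in the project tree.
    void submit(String jobId) {
        pending.add(jobId);
    }

    // The user removed the pending node. The remote job keeps running.
    void removePendingNode(String jobId) {
        pending.remove(jobId);
    }

    // Called when the remote side reports completion.
    // Returns true only if the job was still pending and the result was posted.
    boolean onRemoteResult(String jobId, String result) {
        if (!pending.remove(jobId)) {
            return false; // node was removed earlier, so the result is discarded
        }
        posted.put(jobId, result);
        return true;
    }

    boolean hasResult(String jobId) {
        return posted.containsKey(jobId);
    }
}
```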

ANOVA

http://wiki.c2b2.columbia.edu/mantis/view.php?id=1732

TIGR's MeV code executes the permutations twice when Westfall-Young is selected: "The code that outputs the matrix numbers for two rounds is in org.tigr.microarray.mev.cluster.algorithm.impl.OneWayANOVA - not geWorkbench's source code."

Scatter Plot

http://wiki.c2b2.columbia.edu/mantis/view.php?id=1186

Extensive changes were made to scatter plot code to allow overlapping values to be displayed on mouse-over. Pretty neat but complicated code!

Tips for transferring Wiki-based tutorials to Online Help

1. View the page in "Printable" format.

2. Save the web page as type "web page, complete". Use the name of the component it represents as the file name, e.g. ccm.htm for the Component Configuration Manager, aracne.htm for ARACNe etc.

3. Run the Perl script to remove the TOC and Chapter link tables.

4. Open the *.htm file in Notepad++ or other HTML editor.

5. Remove additional unneeded header and footer information.

6. Place the htm file and its image directory in the component help directory. There are extra javascript files in the image directory that need to be removed and not checked in to CVS.

12. If this is new online help, copy an existing set of *.jhm, *.hs, and *toc.xml files to the new component's directory and rename them appropriately. You will also need to add the online help code to the *.ccm.xml in the component's top level directory.

Note - If you don't want to edit the page at the beginning to insert the NOEDITSECTION tag etc, instead you can use a global regular expression search in Dreamweaver such as:

left-angle-bracket div dot dot star div right-angle-bracket, with no spaces.
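
The pattern spelled out above can be tried outside Dreamweaver. The sketch below runs it through java.util.regex. Whether Dreamweaver's regular expression engine behaves identically is an assumption. The class name and the reluctant variant are additions for illustration and are not part of the original tip.

```java
import java.util.regex.Pattern;

// Sketch only: applies the div-stripping pattern with Java's regex engine.
final class DivStripper {
    // The pattern as described in the note: "<div", any character, ".*", "div>".
    static final Pattern GREEDY = Pattern.compile("<div..*div>");

    // Reluctant variant: stops at the first closing "div>" instead of the last.
    static final Pattern RELUCTANT = Pattern.compile("<div.+?div>");

    static String stripGreedy(String html) {
        return GREEDY.matcher(html).replaceAll("");
    }

    static String stripReluctant(String html) {
        return RELUCTANT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String oneDiv = "<p>keep</p><div class=\"editsection\">[edit]</div><p>more</p>";
        String twoDivs = "a<div x>1</div>b<div y>2</div>c";
        System.out.println(stripGreedy(oneDiv));
        System.out.println(stripGreedy(twoDivs));
        System.out.println(stripReluctant(twoDivs));
    }
}
```

In Java, the greedy form removes everything from the first "<div" to the last "div>" on the same line, so text sitting between two div blocks on one line is lost ("b" in the second input). The reluctant form removes each block separately. By default "." does not match line breaks in Java, so a div that spans several lines is not matched by either pattern.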

Comments from the caBIG Annual Meeting 2009 in Washington DC

The most frequent question was if geWorkbench supported data from Next-Gen sequencing projects.
Several people asked how geWorkbench compared with GenePattern, which they had already tried or were using.
Warren Kaplan of Australia noted that they like to write their own R-scripts, and that these can then easily be packaged up in GenePattern.
Warren also noted that the statisticians in his institution only want to use the "moderated t-test" as e.g. provided in the limma package.
Warren also mentioned a particular visualization in GeneSpring that people really find useful involving heat maps, arrays by groups, and expression profiles. ???
Charles Donnely of Jackson Labs mentioned using geWorkbench as part of their caBIG adoption program.

Notes on release process

Make sure version date is updated before release.
Make sure geWorkbench splash screen has been updated.
Try out actual installer release build versions well before the final release, to catch anything to do with file location or installer incompatibilities etc.
The Promoter component has two Jaspar data files in the source directory. The Jaspar website needs to be checked for new data file releases.
Update geWorkbench Tool Landing Page on the caBIG website:
1. Go to https://cabig.nci.nih.gov/login_form to log in.
2. Edit page. Versionize.
3. After editing notify ....
Update Knowledge Center wiki pages:
1. https://cabig-kc.nci.nih.gov/Molecular/KC/index.php/What%27s_new_on_geWorkbench
2. https://cabig-kc.nci.nih.gov/Molecular/KC/index.php/What_is_New

Legacy directories needing cleanup - circa release 1.7.0

These directories arose for specific projects that are no longer active or have been redesigned. But some files within these directories are still being used by unknown components. These dependencies should be worked out and any needed files moved to more appropriate locations.

\geworkbench\lib\Simulation_libs
\geworkbench\lib\caArrayMageom

Also, the "lib" directory has a number of files that are probably no longer needed.

In geWorkbench 1.7, the following files were removed from lib\caArrayMageom:

caarray-client.jar
commons-collections.jar
log4j.1.2.5.jar
mageom-client.jar

The following components were removed from lib but then replaced:

lib/mageom.jar - put back in, needed by Bison.
lib/arrayexpressWithoutDoc.jar - put back in, needed by GenSpace!
lib/mageom-client.jar - put back in.
lib/ArrayExpress.jar - put back in.
lib/caBIO.jar - put back in.

The following files could be removed from lib (probably belonged to old reverse engineering component):

GenesAtWork.dll
mutualinfo.dll
MutualInfoDLL.dll

Other interesting facts about how geWorkbench works

Cytoscape interactions with geWorkbench

http://wiki.c2b2.columbia.edu/mantis/view.php?id=1749

In reference to bug 1749, it was learned (from Mark) that the Node gene names in Cytoscape are mapped to a Swissprot ID using a facility that was originally implemented for GeneWays. The Swissprot ID is what is used to search for matching markers in the Markers component. No string search on actual gene names is done against the markers component. So a gene which does not for some reason have a Swissprot ID will not map back from Cytoscape to the Markers component.

NME1 returned from Cytoscape

The "tag for visualization" function however uses gene names, not Swissprot IDs, to select markers in Cytoscape.

(from the geWorkbench 1.7.0 release page).
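
The lookup path described above (node gene name, then Swissprot ID, then markers) can be sketched with two maps. The class, the method names and the IDs used in the usage lines are placeholders for illustration. The real mapping facility came from GeneWays and is not reproduced here.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the Cytoscape-node-to-marker lookup path.
final class CytoscapeNodeLookup {
    private final Map<String, String> geneToSwissprot;
    private final Map<String, List<String>> swissprotToMarkers;

    CytoscapeNodeLookup(Map<String, String> geneToSwissprot,
                        Map<String, List<String>> swissprotToMarkers) {
        this.geneToSwissprot = geneToSwissprot;
        this.swissprotToMarkers = swissprotToMarkers;
    }

    // Markers are found only through the Swissprot ID. There is no fallback
    // string search on the gene name, so a gene without an ID maps to nothing.
    List<String> markersForNode(String geneName) {
        String swissprotId = geneToSwissprot.get(geneName);
        if (swissprotId == null) {
            return Collections.emptyList();
        }
        return swissprotToMarkers.getOrDefault(swissprotId, Collections.emptyList());
    }
}
```

For example, with placeholder data (Map.of and List.of need Java 9 or later):

```java
CytoscapeNodeLookup lookup = new CytoscapeNodeLookup(
        Map.of("NME1", "SP_PLACEHOLDER_1"),
        Map.of("SP_PLACEHOLDER_1", List.of("probe_placeholder_1")));
lookup.markersForNode("NME1");        // one marker, found via the Swissprot ID
lookup.markersForNode("GENE_NO_ID");  // empty list, since the gene has no Swissprot ID
```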

CNKB - Cytoscape

Only genes that are present in the microarray dataset will be displayed in the network displayed in Cytoscape.

http://wiki.c2b2.columbia.edu/mantis/view.php?id=1724

Note that this was not found to be a requirement and is being/has been changed.

GeWorkbench General Notes