IDEA

Detailed Investigation of the Discrepancy between the Java Version and the Original MATLAB Version

The MATLAB version used for the comparison is the copy we have as of now. It is not in any code repository, so we simply assume that the copy we see, say, on 7/20/2002 in /ifs/data/c2b2/af_lab/cagrid/matlab/idea/scripts is static.

The test case described here is exactly the one described in mantis. The exact numbers are important for confirming the problem and any potential resolution.

I made a copy of all the MATLAB scripts to do the test. Because no MATLAB license is conveniently available, I ran the script just as it is invoked by the geWorkbench IDEA service, namely by submitting it to the cluster with the shell script idea_submit.sh. I modified that script so that it uses the test copy and does not delete the intermediate results (subdirectory phenotype_results) after the job has finished.

major steps in the algorithm

1. create a collection of 'edges'
2. calculate edgeCorr for each edge
3. calculate deltaCorr for each edge
4. calculate normCorr for each edge and score the edges by comparing normCorr and (Bonferroni corrected) p-value

I think that randomization is used only in step 4, when we create the artificial null distribution. Otherwise, the algorithm is deterministic.

Comparison of the Results

The intermediate results from the MATLAB script that are of interest to us are: edgeCorr.txt, deltaCorr.txt, normDeltaCorr.txt, zDeltaCorr.txt. They are under the directory phenotype_results.

The differences in step 2 and step 3 have not been analyzed yet. For now, two differences between the Java version and the MATLAB version have been observed: (a) the number of edges in step 1 is different; and (b) the algorithms of kernel density estimation used in step 4 are different. The details follow:

(a) edgeCorr.txt has 15779 rows. I think this is probably the easier difference to fix or reconcile, and it is not likely to be the main reason for the different results.

(b) This is probably the main reason for the different results, and it is probably a major task if we need to implement an algorithm similar to the one used in the MATLAB version. The MATLAB version (normalizeCorrChange2.m: line 272) uses ksdensity from the Statistics Toolbox. The Java version (NullDistribution line 180) uses KernelEstimator from the weka packages. First, kernel estimation leaves many practical choices open to each specific implementation (see http://en.wikipedia.org/wiki/Kernel_density_estimation); the one in MATLAB and the one in weka are really very different. Second, and more importantly, in the code as we currently use it, the MATLAB version and the Java version don't even calculate the same thing mathematically. MATLAB (normalizeCorrChange2.m: line 272) calculates the CDF; the Java version calculates the probability of an interval of fixed width around the given point (this is based on my reading of the weka code, and verified with some experiments). The two have the same tendency only to the extent that when one is small, the other is also small. IDEA's MATLAB script has a part that translates close-to-one values to close-to-zero values. Beyond that, the two are fundamentally different.
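To make the mathematical difference in (b) concrete, here is a minimal sketch of the two quantities for a plain Gaussian kernel estimate. It is neither the weka KernelEstimator code nor MATLAB's ksdensity: the class name KdeSketch, the sample values, the bandwidth and the interval width are all made up for illustration, and the normal CDF uses a textbook polynomial approximation (Abramowitz and Stegun 26.2.17) that is only accurate to roughly 1e-7.

```java
/** Illustration only: Gaussian kernel estimate, CDF versus fixed-width interval probability. */
class KdeSketch {

    /** Standard normal CDF, Abramowitz-Stegun 26.2.17 (absolute error below about 1e-7). */
    static double standardNormalCdf(double z) {
        double t = 1.0 / (1.0 + 0.2316419 * Math.abs(z));
        double poly = t * (0.319381530 + t * (-0.356563782
                + t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))));
        double tail = Math.exp(-0.5 * z * z) / Math.sqrt(2.0 * Math.PI) * poly;
        // 'tail' is the probability beyond |z|; mirror it for negative z.
        return z >= 0 ? 1.0 - tail : tail;
    }

    /** Estimated CDF at x: the average of the kernel CDFs centred on the samples. */
    static double cdf(double[] samples, double bandwidth, double x) {
        double sum = 0.0;
        for (double s : samples) {
            sum += standardNormalCdf((x - s) / bandwidth);
        }
        return sum / samples.length;
    }

    /** Estimated probability of the interval [x - width/2, x + width/2]. */
    static double intervalProbability(double[] samples, double bandwidth,
                                      double x, double width) {
        return cdf(samples, bandwidth, x + width / 2) - cdf(samples, bandwidth, x - width / 2);
    }

    public static void main(String[] args) {
        double[] samples = {-1.2, -0.4, 0.0, 0.3, 1.1}; // hypothetical null values
        double bandwidth = 0.5;                          // hypothetical
        double width = 0.01;                             // hypothetical interval width
        for (double x : new double[] {-10.0, 0.0, 10.0}) {
            System.out.printf("x=%6.1f  cdf=%.6f  interval=%.6e%n",
                    x, cdf(samples, bandwidth, x),
                    intervalProbability(samples, bandwidth, x, width));
        }
    }
}
```

Far out in the left tail both numbers are close to zero. In the right tail the interval probability is again close to zero while the CDF is close to one, and near the centre the two are unrelated in magnitude, which is the sense in which they only share a tendency.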

I tried a simplistic implementation of the CDF based on weka's probability function. See http://wiki.c2b2.columbia.edu/informatics/images/7/74/NullDistribution.java.txt The precision is too low to be useful. To be more specific, the difference between the calculated CDF(infinity) and 1 is about 1.e-4, whereas the threshold involved in the test case mentioned above is about 1.e-6.
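For orientation, the following sketch puts the two magnitudes side by side. It assumes a significance level of 0.05 and one test per edge, neither of which is taken from the IDEA code; the class name BonferroniSketch is likewise made up. With the 15779 rows of edgeCorr.txt as the number of tests, the corrected threshold comes out on the order of 1.e-6, so a CDF error of about 1.e-4 is tens of times larger than the threshold it would be compared against.

```java
/** Illustration only: Bonferroni-corrected threshold versus the observed CDF error. */
class BonferroniSketch {

    /** Per-test threshold: the overall level divided by the number of tests. */
    static double correctedThreshold(double alpha, int numberOfTests) {
        if (numberOfTests <= 0) {
            throw new IllegalArgumentException("numberOfTests must be positive");
        }
        return alpha / numberOfTests;
    }

    /** How many times larger the numerical error is than the threshold. */
    static double errorToThresholdRatio(double error, double alpha, int numberOfTests) {
        return error / correctedThreshold(alpha, numberOfTests);
    }

    public static void main(String[] args) {
        double alpha = 0.05;     // assumed significance level, not from the IDEA scripts
        int edges = 15779;       // row count of edgeCorr.txt in the MATLAB run
        double cdfError = 1e-4;  // observed |CDF(infinity) - 1| of the simplistic implementation
        System.out.printf("corrected threshold = %.3e%n", correctedThreshold(alpha, edges));
        System.out.printf("error / threshold   = %.1f%n",
                errorToThresholdRatio(cdfError, alpha, edges));
    }
}
```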

major points in MATLAB scripts

idea2.m: the main entry point. Line 549 calculates edgeCorr; line 609 calculates the null distribution; line 677 (normalizeCorrChange2.m) calculates deltaCorr and normDeltaCorr; line 697 calculates the scores.
normalizeCorrChange2.m: line 272, calculate normDeltaCorr

important information from Manju

Manju told Ken that the kernel estimation method used in ARACNe is 'the same' as that for IDEA (10/5/2012). We need to find out where the actual implementation is (assuming it is in Java and in the ARACNe Java code) and to what extent the two are 'the same' (considering that the one used in the IDEA code is in MATLAB).

Dependency on third-party library

weka.jar: only for calculating kernel density estimation

TODO

parameter panel issues (Min wrote some new code but eventually didn't check it in because too many other issues were being worked on)
the return type of the service was ignored before. Now Meng is using its side effect to fix other errors so that we don't have to make major changes before the release. This should be cleaned up later.
