EvidenceIntegration

Overview

  • Use case document: [1]
  • Source code is split into two pieces:
    • The actual computation code is stored in adcvs.cu-genome.org/cvs/acallab/project/evidenceintegration
    • The workbench component is in the regular workbench CVS under the "evidenceintegration" component and includes a jar built from the above project.

Workbench Notes

  1. The gold standard datasets store their interactions as Entrez IDs, so you must load annotations with Entrez IDs for any dataset you wish to use.

Computation Library

Basics

The main calculations are done by a ported Java library stored in the acallab CVS repository; the original Perl scripts it is based on are in that repository as well. There is also a test case which matches the results of the Java library against results generated with the original Perl scripts. Both the library and the test case require access to the database storing the gold standard datasets. That database is for now running on afdev and is accessible via the URL jdbc:mysql://afdev:3306/evidence_integration. The username and password are in the test case.
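
For reference, a minimal sketch of connecting to that database over JDBC; the credentials below are placeholders (the real ones are in the test case), and the MySQL driver is assumed to be on the classpath:

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class EvidenceDbConnectionSketch {
        public static void main(String[] args) throws Exception {
            // URL from the test case; USER and PASS are placeholders for the real credentials.
            String url = "jdbc:mysql://afdev:3306/evidence_integration";
            Connection conn = DriverManager.getConnection(url, "USER", "PASS");
            try {
                System.out.println("connected: " + !conn.isClosed());
            } finally {
                conn.close();
            }
        }
    }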

Details

Database

The ppipos and ppineg tables store the gold standard interactions. Each contains a link to the source table, which will allow multiple sources of gold standard data to be used in the future. For now there is only one source available: the data the original algorithm ran on.
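
As a sketch of how one source's positive pairs might be read, assuming column names id1, id2, and source_id (the actual schema is not reproduced here, so treat those names as placeholders):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.ArrayList;
    import java.util.List;

    public class GoldStandardQuerySketch {
        // Loads the positive gold standard pairs for a single source.
        // Column names id1, id2, and source_id are assumptions about the schema.
        public static List<int[]> loadPositivePairs(Connection conn, int sourceId) throws Exception {
            List<int[]> pairs = new ArrayList<int[]>();
            PreparedStatement ps = conn.prepareStatement(
                    "SELECT id1, id2 FROM ppipos WHERE source_id = ?");
            try {
                ps.setInt(1, sourceId); // restrict to one gold standard source
                ResultSet rs = ps.executeQuery();
                while (rs.next()) {
                    // Interactions are stored as Entrez gene ID pairs
                    pairs.add(new int[] { rs.getInt("id1"), rs.getInt("id2") });
                }
            } finally {
                ps.close();
            }
            return pairs;
        }
    }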

Building

Use the dist-jar task to build an updated jar to include with the workbench component.
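
If the project builds with Apache Ant (an assumption based on the task name), this amounts to running "ant dist-jar" from the project root and including the resulting jar with the workbench component.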

Calculation Methods

For speed, the calculations done by the SQL server in the original Perl script were moved into Java and performed in memory. This resulted in a ~10x speedup of the algorithm, with the notable downside of larger memory requirements. To conserve memory the data is stored in primitive collections such as TIntIntHashMap, but further optimizations are possible. Right now, the only thing loaded on creation of the main class is the list of data sources; the gold standard data is read in upon the first calculation. This allows the sources of data to be displayed without incurring the ~30 second delay of loading all the calculation data into memory.
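
A rough sketch of that lazy-loading arrangement, assuming the older gnu.trove package layout and invented class and method names (the real ones in the library will differ):

    import gnu.trove.TIntIntHashMap;

    import java.util.List;

    public class LazyLoadSketch {
        // Cheap to load: available immediately so the UI can list data sources.
        private final List<String> sources;

        // Expensive to load (~30 seconds): populated on the first calculation only.
        private TIntIntHashMap goldStandardData;

        public LazyLoadSketch(List<String> sources) {
            this.sources = sources; // only the sources are read at construction time
        }

        public List<String> getSources() {
            return sources; // no gold standard load needed just to list sources
        }

        public synchronized double calculate(int geneId1, int geneId2) {
            if (goldStandardData == null) {
                // First calculation pays the load cost; later calls reuse the cached data.
                goldStandardData = loadGoldStandardData();
            }
            // ... the actual likelihood ratio calculation would go here ...
            return 0.0;
        }

        private TIntIntHashMap loadGoldStandardData() {
            // Placeholder: the real code reads the ppipos/ppineg tables into memory.
            return new TIntIntHashMap();
        }
    }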

There are two main code blocks to be aware of in the calculation.

This block of code checks whether the evidence is found in the gold standard sets. There is a block like this for both the positive and negative sets (they could probably be refactored and combined; a possible combined form is sketched after the listing):

            // For each enabled gold standard source, count how many of its
            // positive interactions have both endpoints present in the evidence.
            for (Integer sourceID : enabledGoldStandards) {
                HashMap<Integer, TIntHashSet> goldStandardPos = goldStandardPosSets.get(sourceID);

                // Each entry maps a gene ID to the set of genes it interacts with.
                for (Map.Entry<Integer, TIntHashSet> entry : goldStandardPos.entrySet()) {
                    Integer id1 = entry.getKey();
                    if (evidenceIDs.contains(id1)) {
                        TIntHashSet genes = entry.getValue();
                        TIntIterator geneIterator = genes.iterator();
                        while (geneIterator.hasNext()) {
                            int id2 = geneIterator.next();
                            if (evidenceIDs.contains(id2)) {
                                // Both genes appear in the evidence: record them and
                                // count this pair toward the positive total.
                                idIntersection.add(id1);
                                idIntersection.add(id2);
                                pos++;
                            }
                        }
                    }
                }
            }
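
As a sketch of the refactoring suggested above, the positive and negative blocks could collapse into a single helper that takes whichever map of gold standard sets applies. This method would live alongside the existing code and reuse the Trove types it already imports; the method name and the nullable idIntersection parameter are illustrative only:

    /**
     * Counts gold standard pairs whose endpoints both appear in the evidence.
     * Works for either the positive or the negative sets, whichever map is passed in.
     * If idIntersection is non-null, matching gene IDs are added to it.
     */
    private int countGoldStandardMatches(Map<Integer, HashMap<Integer, TIntHashSet>> goldStandardSets,
                                         Collection<Integer> enabledGoldStandards,
                                         TIntHashSet evidenceIDs,
                                         TIntHashSet idIntersection) {
        int matches = 0;
        for (Integer sourceID : enabledGoldStandards) {
            HashMap<Integer, TIntHashSet> goldStandard = goldStandardSets.get(sourceID);
            for (Map.Entry<Integer, TIntHashSet> entry : goldStandard.entrySet()) {
                int id1 = entry.getKey();
                if (!evidenceIDs.contains(id1)) {
                    continue; // first gene not in the evidence, skip its partners
                }
                TIntIterator geneIterator = entry.getValue().iterator();
                while (geneIterator.hasNext()) {
                    int id2 = geneIterator.next();
                    if (evidenceIDs.contains(id2)) {
                        if (idIntersection != null) {
                            idIntersection.add(id1);
                            idIntersection.add(id2);
                        }
                        matches++;
                    }
                }
            }
        }
        return matches;
    }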

Once those counts are computed, the evidence is binned and likelihood ratios (LRs) are calculated like so:

            // Walk the edges in ascending weight order, assigning each to a bin.
            Iterator<Edge> sortedEdgesIter = sortedEvidence.iterator();
            int evidenceProcessed = 0;
            float[] lrBinValues = new float[numBins];
            for (int i = 0; i < numBins; i++) {
                currentBinMax += binSize;
                if (i == numBins - 1) {
                    // Include all edges which might have weight == maximum weight
                    currentBinMax = Float.MAX_VALUE;
                }

                List<Edge> bin = new ArrayList<Edge>();
                int binPos = 0;   // edges in this bin that hit the positive gold standard
                int binNeg = 0;   // edges in this bin that hit the negative gold standard
                if (sortedEdgesIter.hasNext()) {
                    Edge currentEdge = sortedEdgesIter.next();
                    // Consume edges until one falls past this bin's upper bound; note that an
                    // edge fetched here whose weight is already past the boundary is not
                    // carried over into the next bin.
                    while (currentEdge != null && currentEdge.getWeight() < currentBinMax) {
                        int id1 = currentEdge.getId1();
                        int id2 = currentEdge.getId2();

                        bin.add(currentEdge);
                        // Check the pair in both orientations against the positive sets...
                        if (isInGS(goldStandardPosSets, enabledGoldStandards, id1, id2, null)) {
                            binPos++;
                        } else if (isInGS(goldStandardPosSets, enabledGoldStandards, id2, id1, null)) {
                            binPos++;
                        }
                        // ...and against the negative sets.
                        if (isInGS(goldStandardNegSets, enabledGoldStandards, id1, id2, idIntersection)) {
                            binNeg++;
                        } else if (isInGS(goldStandardNegSets, enabledGoldStandards, id2, id1, idIntersection)) {
                            binNeg++;
                        }
                        evidenceProcessed++;
                        if (sortedEdgesIter.hasNext()) {
                            currentEdge = sortedEdgesIter.next();
                        } else {
                            currentEdge = null;
                        }
                    }
                    // Bin LR = (binPos / pos) / (binNeg / neg), computed here as
                    // (binPos * neg) / (binNeg * pos); when any count is zero,
                    // fall back to the previous bin's value.
                    if (binPos > 0 && neg > 0 && binNeg > 0 && pos > 0) {
                        lrBinValues[i] = (float) (((double) binPos * neg) / ((double) binNeg * pos));
                    } else {
                        if (i > 0) {
                            lrBinValues[i] = lrBinValues[i - 1];
                        } else {
                            lrBinValues[i] = 0;
                        }
                    }

                    // Apply likelihood ratio to edges in this bin
                    for (Edge edge : bin) {
                        if (edge.getLikelihoodRatio() == 0) {
                            edge.setLikelihoodRatio(lrBinValues[i]);
                        } else {
                            // Edges that already carry an LR get the bin LR scaled by the edge weight.
                            edge.setLikelihoodRatio(lrBinValues[i] * edge.getWeight());
                        }
                    }
                }

                System.out.println("\t" + i + "=" + lrBinValues[i] + "(" + bin.size() + ": " + binPos + "/" + binNeg + ") .. ");
            }
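
In other words, for each bin the likelihood ratio is LR_i = (binPos_i / pos) / (binNeg_i / neg): the bin's share of positive gold standard hits divided by its share of negative hits. The code computes the algebraically equivalent (binPos_i * neg) / (binNeg_i * pos), and reuses the previous bin's value whenever any of the counts is zero.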

Changes in the use case

  1. The binning method is changed.
  2. The filtering step is moved to the end of the process.

To Do

  1. Load additional gold standard sets to send to the server. Right now there aren't any additional gold standard sets, but supporting them would require loading the adjacency matrix as is done for evidence, then adding a parameter to the calculation component that accepts a list of additional gold standard data. The change to the calculations should be simple, as they already support multiple distinct gold standard sets from the database.
  2. Remove the popup for evidence; I recommend attaching it to a right-click event on the JTable that lists the evidence instead. The methods to remove individual evidence entries already exist.
  3. Add evidence and gold standard lists to the Performance Graph panel to control what is plotted. Right now the performance graphs show a graph for each gold standard and a line for each evidence.
  4. ROC curve plot