need to use "gold standard" knowledge to evaluate methodologies
Neil Clarke - Genome Institute of Singapore - Jan 25, 2008 - 1:17 am
First, thanks to Gustavo for setting this up. Right after this discussion forum was set up, I wrote a draft set of comments but never got around to finishing it. I'll have to do that.
In the meantime, I'm posting here my response to the email that was sent recently asking us to remember that information on the gold standards is the IP of the folks who provided the data for the Challenges. Let me say at the outset that I do appreciate the work involved - and the potential risk to publishing priority - that is incurred by those who put together the Challenges. Sincere thanks for that. I do think, though, that it is absolutely necessary for the continued success of DREAM that the gold standards be divulged to the predictors, preferably before the meeting but certainly before publication. The following is the main text of the email response I sent earlier today:
In order to figure out what worked - and, more importantly, what failed - we *have* to be able to use the list of "gold standards". Fortunately, I *do* know which of the 200 genes we were given were considered true positives. I was able to extract that information from the precision-recall and ROC curves provided to us by the organizers, based on our prediction.
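For readers wondering how a curve can give away the labels, here is a minimal sketch. It is not necessarily the procedure I used; it assumes the precision value is reported at every rank of the submitted list, and the gene names and function names are hypothetical. Since precision at rank k is TP_k / k, the cumulative true-positive count can be read back as round(precision_k * k), and a gene is a gold-standard positive exactly when that count goes up by one at its rank.

```python
def precision_at_each_rank(labels):
    """Build the precision-at-rank curve from known labels (stand-in for the organizers' plot)."""
    tp = 0
    curve = []
    for k, is_positive in enumerate(labels, start=1):
        tp += int(is_positive)
        curve.append(tp / k)
    return curve


def labels_from_precision(precisions):
    """Recover per-rank labels from precision values given at ranks 1..n."""
    labels = []
    prev_tp = 0
    for k, p in enumerate(precisions, start=1):
        tp = round(p * k)                 # cumulative true positives at rank k
        labels.append(tp - prev_tp == 1)  # count increased => this rank is a positive
        prev_tp = tp
    return labels


# Hypothetical ranked prediction list and hidden truth, for illustration only.
ranked_genes = ["g1", "g2", "g3", "g4", "g5"]
hidden_truth = [True, False, True, True, False]

curve = precision_at_each_rank(hidden_truth)
recovered = labels_from_precision(curve)
gold = [gene for gene, hit in zip(ranked_genes, recovered) if hit]
```

The same idea applies to a ROC curve, where each vertical step corresponds to a true positive and each horizontal step to a false positive.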
This knowledge of the "gold standard" set is absolutely essential to analyzing what worked and what didn't. Those of you who were in NY may remember that I used that knowledge to show that we would have been much better off if we had only used our expression data analysis. We hurt ourselves considerably trying to include gene ontologies, predicted binding sites, ARACNe, publicly available ChIP data, etc. Without knowing what the gold standard set was, I would not have been able to figure this out. I would have gotten up and said that we did all these different things, and you (and I) would probably have come to the conclusion that we did something smart by incorporating these different terms. In fact, that's the wrong conclusion, but the only way we know that it's wrong is because I was able to figure out what was considered the gold standard set.
The talk would have been almost meaningless without this - and the same goes for the paper that we are writing.
I have no intention of identifying the gold standard genes in my paper. There wouldn't be any point, anyway - it doesn't matter to the analysis what the gene names are. However, I *do* need to do analyses that rely on knowing which of the genes are in the gold standard set.
I honestly don't see how these analyses could possibly infringe upon the intellectual property of those providing the Challenge set, or affect in any way their ability to publish or patent. However, if anyone disagrees with this, I would welcome further discussion before we get much further in the publication process.