Difference between revisions of "Classification"
|  (→Example - Weighted Voting (WV)) | |||
| Line 53: | Line 53: | ||
| A microarray data set must be loaded.  For this example we will use the tutorial B-cell dataset. | A microarray data set must be loaded.  For this example we will use the tutorial B-cell dataset. | ||
| − | ===Setting up the  | + | ===Setting up the Arrays=== | 
| For the classification run, three sets of data must be provided - case, control, and a dataset on which the resulting classifier will be tested. | For the classification run, three sets of data must be provided - case, control, and a dataset on which the resulting classifier will be tested. | ||
| Line 102: | Line 102: | ||
| [[Image:Tutorial-GP_KNN-ClassPredictions.png]] | [[Image:Tutorial-GP_KNN-ClassPredictions.png]] | ||
| + | |||
| + | |||
| ==Example - Weighted Voting (WV)== | ==Example - Weighted Voting (WV)== | ||
| + | |||
| For the Weighted Voting example we will use the same data setup as described above for KNN.  The server settings are also the same. | For the Weighted Voting example we will use the same data setup as described above for KNN.  The server settings are also the same. | ||
Revision as of 16:59, 22 May 2007
| Home | Quick Start | Basics | Menu Bar | Preferences | Component Configuration Manager | Workspace | Information Panel | Local Data Files | File Formats | caArray | Array Sets | Marker Sets | Microarray Dataset Viewers | Filtering | Normalization | Tutorial Data | geWorkbench-web Tutorials | Analysis Framework | ANOVA | ARACNe | BLAST | Cellular Networks KnowledgeBase | CeRNA/Hermes Query | Classification (KNN, WV) | Color Mosaic | Consensus Clustering | Cytoscape | Cupid | DeMAND | Expression Value Distribution | Fold-Change | Gene Ontology Term Analysis | Gene Ontology Viewer | GenomeSpace | genSpace | Grid Services | GSEA | Hierarchical Clustering | IDEA | Jmol | K-Means Clustering | LINCS Query | Marker Annotations | MarkUs | Master Regulator Analysis | (MRA-FET Method) | (MRA-MARINa Method) | MatrixREDUCE | MINDy | Pattern Discovery | PCA | Promoter Analysis | Pudge | SAM | Sequence Retriever | SkyBase | SkyLine | SOM | SVM | T-Test | Viper Analysis | Volcano Plot | 
Contents
Outline
This tutorial contains
- . an overview of classification algorithms available through geWorkbench and
- . instructions....
Overview
At present, interfaces to two GenePattern classification algorithms have been implemented in geWorkbench. These are:
- WV Classifier (Weighted-Voting)
- KNN Classifier (K-Nearest Neighbors)
WV Classifier - weighted voting
The Weighted Voting algorithm makes a weighted linear combination of relevant “marker” or “informative” features obtained in the training set to provide a classification scheme for new samples. Target classes (classes 0 and 1) can be for example defined based on a phenotype such as morphological class or treatment outcome. The selection of classifier input features (marker features) is accomplished either by computing a signal-to-noise statistic Sx = (μ0 - μ1)/( σ0 + σ1) where μ0 is the mean of class 0 and σ0 is the standard deviation of class 0 or by reading in a list of user provided features. The class predictor is uniquely defined by the initial set of samples and markers. In addition to computing Sx, the algorithm also finds the decision boundaries (half way) between the class means: Bx = (μ0 + μ1)/2 for each feature x. To predict the class of a test sample y, each feature x in the feature set casts a vote: Vx = Sx (Gxy – Bx) and the final vote for class 0 or 1 is sign(Sx Vx). The strength or confidence in the prediction of the winning class is (Vwin - Vlose)/(Vwin + Vlose) (i.e., the relative margin of victory for the vote). Notice that this algorithm is quite similar to Naïve Bayes (see the appendix in Slonim et al. 2000).
References: Golub T.R., Slonim D.K., et al. “Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring,” Science, 531- 537 (1999). Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., Lander, E.S. (2000) Class prediction and discovery using gene expression data. In Proceedings of the Fourth Annual International Conference on Computational Molecular Biology (RECOMB) 2000. ACM Press, New York , pp. 263–272.
KNN Classifier - K-Nearest Neighbor
The K-Nearest Neighbor algorithm classifies a sample by assigning it the label most frequently represented among the k nearest samples. No explicit model for the probability density of the classes is formed; each point is estimated locally from the surrounding points. Target classes for prediction (classes 0 and 1) can be defined based on a phenotype such as morphological class or treatment outcome. The class predictor is uniquely defined by the initial set of samples and marker genes. The K-Nearest Neighbor algorithm stores the training instances and uses a distance function to determine which k members of the training set are closest to an unknown test instance. Once the k-nearest training instances have been found, their class assignments are used to predict the class for the test instance by a majority vote. Our implementation of the K-Nearest Neighbor algorithm allows the votes of the k neighbors to be un-weighted, weighted by the reciprocal of the rank of the neighbor's distance (e.g., the closest neighbor is given weight 1/1, next closest neighbor is given weight 1/2, etc.), or by the reciprocal of the distance. Either the Cosine or Euclidean distance measures can be used. The confidence is the proportion of votes for the winning class. There are many references for this type of classifier (with several of the early important papers listed below).
References: Golub T.R., Slonim D.K., et al. “Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring,” Science, 531-537 (1999). Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., Lander, E.S. (2000) Class prediction and discovery using gene expression data. In Proceedings of the Fourth Annual International Conference on Computational Molecular Biology (RECOMB) 2000. ACM Press, New York , pp. 263–272. Johns, M. V. (1961) An empirical Bayes approach to non-parametric two-way classification. In Solomon, H., editor, Studies in item analysis and prediction. Palo Alto , CA : Stanford University Press. Cover, T. M. and Hart, P. E. (1967) Nearest neighbor pattern classification, IEEE Trans. Info. Theory, IT-13, 21-27, January 1967.
Graphical User Interface
Example - K-Nearest Neighbors (KNN)
Prerequisites
A microarray data set must be loaded. For this example we will use the tutorial B-cell dataset.
Setting up the Arrays
For the classification run, three sets of data must be provided - case, control, and a dataset on which the resulting classifier will be tested.
The figure below shows the four broad classes present in the tutorial dataset. Note that we are viewing the Array/Phenotype group named "ultrashort". We will use the non-GC B-cell and non-GC tumor classes. Because the non-GC tumor class has 46 members,we will use part of this as the case and the remainder as the test dataset.
If we now change to the marker group named "megashort", we see a breakdown of the samples into more detailed cell line designations. By comparing with the members in "ultrashort", we see we can set up the test as follows:
- Control: non-GC (non-GC B-cell)
- Case: B-CLL (non-GC tumor)
- Test: HCL, PEL (non-GC tumor)
Thus the expected outcome of the classification would be that all members of the test set would be identified as members of the Case group.
Running the classification
You can connect to any running GenePattern server to run these tutorials. An example configuration of the "GenePattern Server Settings" tab is shown here:
On the parameters tab, leave the default values as shown in the figure below.
- The first step is generation of the classifier and cross-validation. Click the "Test via Cross-Validation" button. When the run has completed, the numbers of True and False positives and negatives will be shown under the heading "Cross Validation Results".
- For the second step, the test classification run, now just push the "Analyze" button. When the calculation is complete, the results will be stored in the Arrays/Phenotypes component.
- In this run, 7 of the test arrays were correctly identified as Case, and 5 were identified as control, as shown in the figure below of the Arrays/Phenotypes component.
Example - Weighted Voting (WV)
For the Weighted Voting example we will use the same data setup as described above for KNN. The server settings are also the same.
- Run the classification and cross-validation step as before by pushing "Test via Cross Validation".
- Now hit the "Analyze" button.
The result shows that in this case all test arrays were classified as control, not the expected outcome.
In this case, you would want to investigate the parameter settings further....









