Classification

Outline

This tutorial contains:

  1. an overview of the classification algorithms available through geWorkbench, and
  2. instructions....


Overview

At present, interfaces to two GenePattern classification algorithms have been implemented in geWorkbench. These are:

  • WV Classifier (Weighted-Voting)
  • KNN Classifier (K-Nearest Neighbors)

WV Classifier (Weighted Voting)

The Weighted Voting algorithm forms a weighted linear combination of relevant “marker” or “informative” features, selected on the training set, to classify new samples. The two target classes (class 0 and class 1) can be defined, for example, by a phenotype such as morphological class or treatment outcome. Classifier input features (marker features) are selected either by computing a signal-to-noise statistic Sx = (μ0 - μ1)/(σ0 + σ1), where μ0 and σ0 are the mean and standard deviation of feature x in class 0 (and likewise μ1 and σ1 for class 1), or by reading in a list of user-provided features. The class predictor is uniquely defined by the initial set of samples and markers. In addition to computing Sx, the algorithm finds the decision boundary halfway between the class means, Bx = (μ0 + μ1)/2, for each feature x. To predict the class of a test sample y, each feature x in the feature set casts a vote Vx = Sx(Gxy - Bx), where Gxy is the value of feature x in sample y, and the final prediction for class 0 or 1 is given by sign(Σx Vx). The strength, or confidence, of the prediction is (Vwin - Vlose)/(Vwin + Vlose), i.e., the relative margin of victory for the vote, where Vwin and Vlose are the vote totals for the winning and losing classes. Note that this algorithm is quite similar to Naïve Bayes (see the appendix in Slonim et al. 2000).
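To make the arithmetic concrete, the following minimal Python/NumPy sketch implements the training and prediction steps just described. It is an illustration only, not geWorkbench or GenePattern code; the function names train_wv and predict_wv are ours, and marker selection is assumed to have already been done (every feature passed in is used).

 import numpy as np
 
 def train_wv(X0, X1):
     # X0, X1: (samples x features) expression matrices for class 0 and class 1.
     mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
     sd0, sd1 = X0.std(axis=0), X1.std(axis=0)
     S = (mu0 - mu1) / (sd0 + sd1)   # signal-to-noise statistic Sx
     B = (mu0 + mu1) / 2.0           # decision boundary Bx, halfway between the means
     return S, B
 
 def predict_wv(y, S, B):
     # y: feature values Gxy of one test sample.
     V = S * (y - B)                 # per-feature votes Vx = Sx(Gxy - Bx)
     v0 = V[V > 0].sum()             # positive votes favor class 0
     v1 = -V[V < 0].sum()            # negative votes favor class 1
     label = 0 if v0 >= v1 else 1    # i.e., the sign of the summed vote
     conf = (max(v0, v1) - min(v0, v1)) / (v0 + v1)
     return label, conf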

References:

  • Golub, T.R., Slonim, D.K., et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-537.
  • Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., Lander, E.S. (2000) Class prediction and discovery using gene expression data. In Proceedings of the Fourth Annual International Conference on Computational Molecular Biology (RECOMB) 2000. ACM Press, New York, pp. 263-272.


KNN Classifier (K-Nearest Neighbors)

The K-Nearest Neighbor algorithm classifies a sample by assigning it the label most frequently represented among its k nearest samples. No explicit model of the class probability densities is formed; the density is instead estimated locally from the points surrounding each test instance. The two target classes for prediction (classes 0 and 1) can be defined based on a phenotype such as morphological class or treatment outcome. The class predictor is uniquely defined by the initial set of samples and marker genes. The algorithm stores the training instances and uses a distance function to determine which k members of the training set lie closest to an unknown test instance; once found, their class assignments are used to predict the class of the test instance by majority vote. Our implementation allows the votes of the k neighbors to be un-weighted, weighted by the reciprocal of the rank of each neighbor's distance (the closest neighbor is given weight 1/1, the next closest 1/2, etc.), or weighted by the reciprocal of the distance itself. Either the Cosine or the Euclidean distance measure can be used. The confidence is the proportion of votes cast for the winning class. There are many references for this type of classifier; several of the early important papers are listed below.
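A corresponding sketch of the prediction step, again illustrative Python/NumPy rather than the actual geWorkbench/GenePattern implementation (the function name knn_predict and its parameter names are ours), shows the two distance measures and the three vote-weighting schemes described above:

 import numpy as np
 
 def knn_predict(X, labels, y, k=3, metric="euclidean", weighting="none"):
     # X: (samples x features) training matrix; labels: 0/1 class of each row.
     if metric == "euclidean":
         d = np.linalg.norm(X - y, axis=1)
     else:                            # cosine distance = 1 - cosine similarity
         d = 1.0 - (X @ y) / (np.linalg.norm(X, axis=1) * np.linalg.norm(y))
     nearest = np.argsort(d)[:k]      # indices of the k closest training samples
     if weighting == "rank":          # weights 1/1, 1/2, 1/3, ... by distance rank
         w = 1.0 / np.arange(1.0, k + 1.0)
     elif weighting == "distance":    # reciprocal of the distance itself
         w = 1.0 / np.maximum(d[nearest], 1e-12)   # guard against zero distance
     else:                            # plain, un-weighted majority vote
         w = np.ones(k)
     v1 = w[labels[nearest] == 1].sum()
     v0 = w.sum() - v1
     label = 0 if v0 >= v1 else 1
     conf = max(v0, v1) / (v0 + v1)   # proportion of votes for the winning class
     return label, conf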

References:

  • Golub, T.R., Slonim, D.K., et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-537.
  • Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., Lander, E.S. (2000) Class prediction and discovery using gene expression data. In Proceedings of the Fourth Annual International Conference on Computational Molecular Biology (RECOMB) 2000. ACM Press, New York, pp. 263-272.
  • Johns, M.V. (1961) An empirical Bayes approach to non-parametric two-way classification. In Solomon, H. (ed.), Studies in Item Analysis and Prediction. Stanford University Press, Palo Alto, CA.
  • Cover, T.M. and Hart, P.E. (1967) Nearest neighbor pattern classification. IEEE Transactions on Information Theory IT-13, 21-27.