Difference between revisions of "MARINa"

(Inputs)
(Main Tab)
 
(130 intermediate revisions by 3 users not shown)
Line 1: Line 1:
 +
{{TutorialsTopNav}}
 +
 
==Overview==
 
==Overview==
MARINa will run the gene set enrichment analysis (GSEA) algorithm for each transcription factor (TF) on the chip to calculate if the regulon of a TF is enriched for genes that are differentially expressed among the 2 classes of microarrays. Following all GSEA runs MARINa will also perform an additional "shadow" analyses.  
+
This chapter details the MARINa (MAster Regulator INference algorithm) method of [[Master_Regulator_Analysis|Master Regulator Analysis]].  Please see the [[Master_Regulator_Analysis|Master Regulator Analysis]] chapter for a higher-level introduction.
 +
 
 +
MARINa uses gene set enrichment analysis (GSEA)(Subramanian et al, 2005) to calculate if the regulon of a TF is enriched for genes that are differentially expressed among 2 classes of microarrays.  Please note that this does not involve any of the special gene lists available for use with the GSEA algorithm, such as [http://www.broadinstitute.org/gsea/msigdb/index.jsp. MSigDB]. Following the GSEA runs, MARINa performs an additional "shadow" analyses.  The "Synergy" analysis option is not currently supported.
 +
 
 +
The MARINa algorithm runs on a computational cluster at Columbia and access is restricted.  It is run from geWorkbench using a grid service, selected using the "Services" tab.
  
 +
 +
The MARINa algorithm differs substantially from the "FET" method described in the [[MRA-FET|MRA-FET]] chapter.
 +
# The list of signature markers is determined using a t-test directly in the MARINa algorithm, rather than being loaded separately as for the FET method.
 +
# The list of candidate Master Regulators is taken from the hub markers in the network, rather than using a separate list as in the "FET" method.
 +
# MARINa uses "case" and "control" designations of arrays in the Arrays component if provided.  However, unlike the FET method, it will also run if only "case" arrays are designated.  This should be used only where the loaded expression dataset has been pre-processed to represent fold-change values.  geWorkbench itself cannot produce such a dataset.
 +
 +
A special 2-tailed version of GSEA is used, as described in Lim WK, Lyashenko E, Califano A (2009).
  
 
==Inputs==
 
==Inputs==
* '''Network''' - the network (e.g. from ARACNe) upon which MARINa will operate.  
+
===Main Tab===
 +
These inputs and the GUI are described in detail in the chapter [[Master_Regulator_Analysis|Master Regulator Analysis]].
 +
 
 +
* '''Network''' - the network (e.g. from ARACNe) upon which MARINa will operate.
 +
** If the network is loaded into MARINa as gene symbols or Entrez IDs, it will be transformed (expanded) to include all probesets annotated to each such gene if an annotation file has been loaded for the expression dataset.
 +
* '''GSEA P-Value''': The enrichment score p-value below which a regulon is considered enriched in differentially expressed genes.
 +
 
 +
===MARINa Parameters Tab===
 
* '''GSEA and shadow/synergy analysis run parameters''':
 
* '''GSEA and shadow/synergy analysis run parameters''':
 
** '''Minimum Number of Targets''': minimum number of genes in a regulon for a GSEA run (a positive integer).
 
** '''Minimum Number of Targets''': minimum number of genes in a regulon for a GSEA run (a positive integer).
** '''Minimum Number of Samples''': minimum number of arrays in the input microarray set in order to run sample-based permutations in GSEA (a positive integer).
+
** '''Minimum Number of Samples''': minimum number of arrays PER CLASS (e.g. case or control) in the input microarray set in order to run sample-based permutations in GSEA (a positive integer).  If sample-based permutations cannot be used, then probe shuffling will be used instead.  The default is set to 6, but having a least 7 is better and 8 per class will give much better significance.
 
** '''Number of GSEA permuations''': number of GSEA permutations to compute enrichment score distribution (a positive integer).
 
** '''Number of GSEA permuations''': number of GSEA permutations to compute enrichment score distribution (a positive integer).
 
** '''GSEA Tail''': whether 1-tailed (1) or 2-tailed GSEA (2) is to be performed.  If all correlations in the network file are greater than zero, then one-tail GSEA will be performed regardless of the value the user assigns to "tail".
 
** '''GSEA Tail''': whether 1-tailed (1) or 2-tailed GSEA (2) is to be performed.  If all correlations in the network file are greater than zero, then one-tail GSEA will be performed regardless of the value the user assigns to "tail".
 
** '''Shadow P-value''': significance threshold for shadow analysis.
 
** '''Shadow P-value''': significance threshold for shadow analysis.
 
** '''SynergyP-value''': significance threshold for synergy analysis.
 
** '''SynergyP-value''': significance threshold for synergy analysis.
* '''Retrieve prior result with ID''' (checkbox) - if checked, all parameter fields are disabled and the prior ID field is enabled. The user can retrieve the results of a previous run.
+
* '''Retrieve prior result with ID''' (checkbox) - if checked, all parameter fields are disabled and the prior ID field is enabled. The user can retrieve the results of a previous run.  In this case, the "case" and "control" array sets do not need to be activated - they are retrieved from the MARINa run directory on the server.
 +
 
 +
====GSEA Tail====
 +
If a two-tailed GSEA run is requested, the MARINa algorithm will divide a TF's regulon genes into positive and negative correlation groups, based on the Spearman's correlation between the expression of the TF and each gene in its regulon.  However, if all correlations are found to be positive, only a single GSEA run will be performed.  The correlation values are calculated by geWorkbench, or can also be provided directly if the "5-column network file" is used.
 +
 
 +
==Sample Shuffling==
 +
Sample shuffling is done with random draws.  If the number of samples per class is too low, the number of unique draws may be much lower than the number of permutations and lead to a false calculation of power.
 +
 
 +
The number of possible permutations is given by N!/2*{(N/2)!(N/2)!], where N is the total number of samples in the two classes together.  The two in the divisor is because each draw of case and control arrays can occur in either order with identical result.  That is, the two outcomes are not distinguishable in the calculation.
  
==other input==
+
Note also that because random draws are performed, the number of unique draws will be less than the number of drawsIf the number of samples per class is low, then many repeats of the same draw will be encountered.  
** '''GSEA P-Value''': This parameter appears on the "Main" parameters tabIt indicates the enrichment score p-value below which a regulon is considered enriched in differentially expressed genes.
 
  
===set by program===
+
For N=14 (that is, 7 per class), the number of unique results is 1716.  For N=16 (8 per class), there are 6435 unique draws.
** '''paired''': boolean value indicating if the microarray samples should be treated as paired or not.
 
  
 +
In addition, it is better if the number of samples per class is balanced, rather than say 6 in one class and 60 in the other class.
  
===Network alternative 5-column file format===
+
Finally, it is recommended to treat the calculated p-values with scepticism.  One common correction is to multiply calculated p-values by a factor of 10, so a calculated p-value of 0.001 is treated instead as 0.01.  This correction is not included in geWorkbench, but can be set by the user if so desired.  That is, if you want a p-value of 0.01, instead set a p-value of 0.001.
MARINa in geWorkbench will normally be set up to work with a network that is either already in the Project area, or that is loaded from disk in a standard format, e.g. an  ARACNe-style adjacency matrix or SIF file.  geWorkbench will generate a file in the following 5-column format which is then sent to the remote MARINa service.  However, a file in this format can also be directly loaded into the MARINa component.
+
 
 +
==Unpaired/Paired Runs==
 +
Normally, the user will specify two classes of arrays, e.g. "case" and "control".  Under these conditions, an unpaired calculation is performed.  However, if the user only activates one class of array ("case"), then "paired" analysis is performed.  The assumption here is that the values are not direct expression values but are fold-change values for paired samples from patients.  Such a pre-analyzed dataset would need to be loaded into geWorkbench, as there is no built-in method to performed such a paired analysis.
 +
 
 +
==Network alternative 5-column file format==
 +
MARINa in geWorkbench will normally be set up to work with a network that is either already in the [[Workspace]], or that is loaded from disk in a standard format, e.g. an  ARACNe-style adjacency matrix or SIF file.  geWorkbench will generate a file in the following 5-column format which is then sent to the remote MARINa service.  However, a file in this format can also be directly loaded into the MARINa component.  In this case, the values contained in the file are used directly, and are not replaced by values calculated by geWorkbench.
  
 
Each line represents a network edge and comprises five tab-delimited columns:
 
Each line represents a network edge and comprises five tab-delimited columns:
** '''Transcription factor id''': A string that provides the transcription factor end of the edge. This is usually a probeset id or a gene symbol.
+
* '''Transcription factor id''': A string that provides the transcription factor end of the edge. This is usually a probeset id or a gene symbol.
** '''Target id''': A string that provides the target end of the edge. This is usually a probeset id or a gene symbol
+
* '''Target id''': A string that provides the target end of the edge. This is usually a probeset id or a gene symbol
** '''Mutual information''': The mutual information (MI) of the edge (a real number). If the edge MI is not known/available, the value 1 can be entered here.
+
* '''Mutual information''': The mutual information (MI) of the edge (a real number). If the edge MI is not known/available, the value 1 can be entered here.
** '''Spearman's correlation''': The Spearman's correlation for the edge, computed on the original microarray set that gave rise to the ARACNe network (a real number). If not known/available, the value 1 can be entered here.
+
* '''Spearman's correlation''': The Spearman's correlation for the edge, computed on the original microarray set that gave rise to the ARACNe network (a real number). If not known/available, the value 1 can be entered here.
** '''P-value for Spearman's correlation'''. The p-value associated with Spearman's correlation found in the previous column (a real number between 0 and 1). It this p-value is not known/available, a value of 0 can be entered here.
+
* '''P-value for Spearman's correlation'''. The p-value associated with Spearman's correlation found in the previous column (a real number between 0 and 1). It this p-value is not known/available, a value of 0 can be entered here.
* '''Microarray set file'''. Provides the expression data that MARINa uses for calculating differential expression. Data in this file are given in the .exp format used by geWorkbench ([[Media:Sample_marina_expression_data.txt | here are the first five lines from a sample file]]).
+
 
* '''Class 1 (cases) and Class 2 (controls) files'''. The arrays that comprise the 2 classes which are used for the calculation of differential expression are provided in 2 files, listing the arrays in each of the 2 classes. Here are 2 such example files: [[Media:Resistant_cellines.txt | resistant_cellines.txt]] and [[Media:Sensitive_cellines.txt | sensitive_cellines.txt]]. In each file, every array is listed in a separate line.
+
==Note on use of Activated Marker Sets==
 +
Activated marker sets should not normally be used during a MARINa run.  If a marker set is activated, any markers not in the activated set will not be considered as possible targets.
 +
 
 +
==Example data files for running MARINa==
 +
MARINa runs, even on a large computational cluster, can be very time-consuming. We include here a small test dataset with which to demonstrate running MARINa.
 +
 
 +
[[Media:Mra_grid_test.zip|Example data files (.zip)]]. Includes expression file, network file (5-column format) and files defining two microarray classes. The individual files are:
 +
* mra_grid_data.exp - Affymetrix File Matrix format expression file.
 +
* mra_grid_5colnetwork.txt - A network file matching the expression data.  This file is in the "5-column" format and can only be loaded directly into the MRA component. Network files in the adjacency matrix and SIF formats should normally be used, this file is for demonstration only.
 +
* mra_grid_ixclass1.csv - array set definition file, for loading into the Arrays component to define phenotype class 1 (Case).
 +
* mra_grid_ixclass2.csv - array set definition file, for loading into the Arrays component to define phenotype class 2 (Control).
 +
 
 +
 
 +
'''Note''' - This is a much smaller dataset than the one used to generate results shown further below.
  
 
==Running MARINa==
 
==Running MARINa==
 +
===Prerequisites===
 +
* '''Expression Data'''
 +
** An expression data set must be loaded in geWorkbench. 
 +
** The expression data should be log transformed for best results with the t-test.
 +
* '''Phenotypic classes''' - specified in the Arrays component in the currently active list view.
 +
** '''Two classes (unpaired analysis)''' -  Normally, "case" (reported as class1) and "control" (reported as class2) array sets should be defined and activated in the Arrays component.  There can be no arrays that belong to both classes.  If there are, a warning will be displayed and the analysis will not be launched.  With two specified classes, the calculation will be run in "unpaired" mode.
 +
** '''Single class (paired analysis)''' - If only "case" arrays are activated, then a "paired" analysis will be run.  This mode assumes that the expression data already represents fold-change results, a type of data manipulation that is not directly supported in geWorkbench.  This data should be either e.g. log2(case/control) or log2(case) - log2(control).
 +
* '''Network file''' - a network that is already present in the [[Workspace]] as a child of the expression dataset can be used, or a network can be loaded directly into the MRA component via its Main tab.
 +
 +
 +
Example of activating "case" and "control" array sets in the [[Array_Sets| Arrays]] component.
 +
 +
 +
[[Image:MARINa_activate_array_sets.png]]
 +
 +
===Setting the MARINa Parameters===
 +
 +
Please note that MARINa analysis can only be selected in the Services tab as a grid service.  Just opening the MARINa parameters tab does not select the MARINa service.
 +
 +
 +
====Main Tab====
 +
 +
 +
[[Image:MARINa_Parameters_Main.png|{{ImageMaxWidth}}]]
 +
 +
 +
 +
* In the Main tab of the MRA component, select or load a network.
 +
* Set the desired GSEA p-value.
 +
 +
====Selecting the MARINa grid service====
 +
* You must search for and activate the MARINa grid service in the MRA services tab.  If the local service is selected it will run the FET method.
 +
 +
The figure below shows the Services tab after the MARINa grid service has been retrieved and selected.
 +
 +
[[Image:MRA_Services_Grid_Selected.png|{{ImageMaxWidth}}]]
 +
 +
====MARINa Parameters====
 +
* Inspect and if desired change the parameters.  See the section on "Inputs" above.
 +
 +
 +
[[Image:MARINa_Parameters_Parameters.png|{{ImageMaxWidth}}]]
 +
 +
====MARINa Parameters - Retrieve prior result====
 +
 +
*If the "Retrieve prior results" box is checked, all other parameter fields will be disabled and disregarded.
 +
* In this case, it is not necessary to activate "case" and "control" sets.  They will be retrieved from the server for use in generating the "bar-code" display.
 +
* Enter the MARINa runID from a previous run of MARINa (this runID is found in the name of the MARINa result node in the [[Workspace]], e.g. "mra30980", see an example in the Results section below).
 +
 +
 +
 +
[[Image:MARINa_Retrieve_Prior.png|{{ImageMaxWidth}}]]
 +
 +
====Analyze====
 +
 +
Push the Analyze button to launch the analysis on the grid service.  You will be prompted to enter a grid username and password.
 +
 +
==Viewing MARINa Results==
 +
 +
The MARINa calculation result is placed in the [[Workspace]] as a new data node.  The name of the node takes the form "MARINa Result: mra''num'', where ''num'' is a job id assigned on the server.
 +
 +
 +
[[Image:MARINa_Result_Node.png]]
 +
 +
 +
===Results Viewer===
 +
 +
 +
 +
 +
[[Image:MARINa_Main_Viewer_full.png|{{ImageMaxWidth}}]]
 +
 +
 +
'''Note''' - if no significant MRs are found, an empty result node is returned to the [[Workspace]]. The MRA viewer will appear but be empty.
 +
 +
 +
====Summary Listing====
 +
 +
[[Image:MARINa_summary_listing.png]]
 +
 +
====First Row of Controls====
 +
* '''Symbol''' - display the markers using their gene symbol (if available)
 +
* '''Probeset''' - display markers using their marker (probeset) name.
 +
* '''Graphs for top''' ...  - Restrict the "bar-code graph" to at most the specified number of entries.
 +
* '''Bar height''' ... - set the height of the vertical lines in the bar graph in pixels.
 +
* '''Bars for'''
 +
** '''Regulon''' - draw bars for each marker in the hub marker's regulon
 +
** '''GSEA Leading Edge''' - draw bars for only those markers in the GSEA leading edge set.
 +
 +
====Second Row of Controls====
 +
=====Export Table=====
 +
This command will export the entire master regulator results table to a file.  It exports the same information shown on screen, sorted in the same way if the table has been sorted on one of the columns.  The user can choose to export the table in CSV (.csv) or tab-delimited text format (.txt).
 +
 +
 +
=====Export all targets=====
 +
This command writes a file to disk containing each MR in the table, along with each MR's GSEA leading-edge targets and a differential expression value for each target (-log10(P-value)*sign(t-value)).
 +
 +
Each master regulator is listed on a line, followed by its targets with their transformed t-test values.  Each MR is separated by a blank line from the preceding section.  The order in the file is not changed by sorting the results table prior to export.
 +
 +
'''Export File format''':
 +
 +
marker, gene name, -log10(P-value)*sign(t-value)
 +
 +
Example:
 +
 +
220462_at, CSRNP3
 +
200660_at, S100A11, 12.541623
 +
201474_s_at, ITGA3, 7.4126143
 +
202910_s_at, CD97, 10.785
 +
....
 +
 +
202614_at, SLC30A9
 +
160020_at, MMP14, 4.415267
 +
200808_s_at, ZYX, 9.006654
 +
200859_x_at, FLNA, 8.309419
 +
....
 +
 +
 +
Exported files automatically receive a ".csv" file name extension.
 +
 +
=====Add Targets to Set=====
 +
Create a new marker set in the Markers component containing the GSEA leading-edge targets for the selected master regulator.  The set is named after the master regulator.
 +
 +
====MARINa Viewer Table Columns====
 +
* '''Master Regulator''': transcription factor gene name or probeset id.
 +
* '''GSEA P-value''': p-value of the GSEA result.
 +
* '''Markers in regulon''':
 +
* '''NumLedge''': sum of NumLedgePos and NumLedgeNeg from the Table Viewer.  The number of genes in the combined GSEA leading edge.
 +
* '''OddRatio''': odds of a regulon gene being in the GSEA leading edge set / odds of a regulon gene being in the GSEA trailing edge set. 
 +
* '''NES''': GSEA normalized enrichment score for the regulon of the TF.
 +
* '''absNES''': absolute value of NES
 +
* '''Mode''': taken from the sign of the NES.  Indicates whether the activity of the TF is positive or repressed in the "case" vs "control" samples.
 +
 +
===Detailed Listing===
 +
 +
The detailed list shows the genes/markers contained in the intersection set of the  MR regulon and the signature.
 +
 +
 +
[[Image:MARINa_Detailed_listing.png]]
 +
  
 +
The genes are displayed in a table with the following columns:
 +
* '''Markers in leading edge''': the names of the genes in the GSEA leading-edge set. Either the gene name or the marker/probe set name is used (based on the choice of "Symbol" or "Probe Set" radio buttons).
 +
* '''-log10(p-value) * sign (t-value)''': A modified test statistic combining the -log10(p-value) with the sign of the t-value.  The sign of the t-value indicates positive or negative differential expression.
  
  
==Results==
+
====Export Table====
The MARINa results file contains one line for each TF whose GSEA enrichment score renders it significant above the user-specified p-value threshold (''pvalue_gsea''), as long as such TF is not "shadowed" by other significant TFs. '''NOTE''': for the time being, we will ignore below (and in our implementation of the grid service) the results of the synergy analysis.
+
This command will export to a file on disk the contents of the detailed target results table.
  
An [[Media:Pemetrexed_res_sens_marina_result.txt | example results file is shown here]]. The semantics of the columns are as follows:
+
The file can be written in either tab-delimited (.txt) or CSV (.csv) format.
  
 +
The columns exported are:
 +
 +
* Markers in GSEA leading edge set
 +
* -log10(P-value) * sign of t-value
 +
 +
 +
 +
===Bar Graph View===
 +
====Description====
 +
The bar-code graph is created based on ranked differential expression results for all markers in the dataset.  However, only markers in the TF's regulon or GSEA-leading edge set (depending on the setting chosen) are drawn as vertical bars, allowing their positions in the entire set of markers to be visualized.
 +
 +
The value used to calculate the differential expression display is '''-log10(p-value) * sign (t-value)''', as in the detailed table display described above.
 +
 +
 +
[[Image:MARINa_GBM_bar_code_detail.png|{{ImageMaxWidth}}]]
 +
 +
* '''Vertical bars''' - The vertical bars correspond to ranked positions of the markers belonging to each TF's regulon or intersection set (depending on the setting chosen). 
 +
* '''Bar position on horizontal axis''' - bars for displayed markers are positioned using their rank in a list of all markers ordered by (-log10(p-value) * sign (t-value)), calculated using a t-test for differential expression.
 +
* '''Bar Color''' - The color of each bar indicates the sign of the Spearman's Correlation between the expression profile of the TF and its targets (calculated using data from all microarrays in the experiment, not just those in the case and control sets):
 +
** ''Red'' means that the two markers are positively correlated (r >= 0) while
 +
** ''Blue'' means that correlation is negative (r < 0).
 +
** The color intensity of each bar is scaled to represent the number of overlapping bars at any given point in the graph.
 +
* '''Gradient''' - The red-blue gradient at the bottom of the graph qualitatively represents the ranking between the lowest (blue)  and the highest (red) test statistic. The white area in the middle represents the middle of the ranking (not necessarily zero differential expression).  This gradient does not represent the colors used for the bars themselves, only the relative position in the ranked differential expression results.
 +
 +
====Detailed Examples====
 +
Detail for CEBPD:
 +
 +
[[Image:MARINa_CEBPD_barcodes.png|{{ImageMaxWidth}}]]
 +
 +
The bar graph shown above, for CEBPD, indicates that the positive regulon of CEBPD is up-regulated in the "case" mesenchymal phenotype, whereas the negative regulon is down-regulated.  CEBPD is activated in the "case", mesenchymal phenotype.
 +
 +
 +
Detail for ZNF248:
 +
 +
[[Image:MARINa_znf248_barcodes.png|{{ImageMaxWidth}}]]
 +
 +
 +
The bar graph shown above, for ZNF248, indicates that the positive regulon of CEBPD is down-regulated in the "case" mesenchymal phenotype, whereas the negative regulon is up-regulated.  ZNF248 activity is repressed in the "case", mesenchymal phenotype.
 +
 +
====Save Image====
 +
Via a right-click menu on the bar graph, the user can save an image of the displayed bar graph to
 +
* the [[Workspace]] as an image snapshot, or
 +
* directly to a file on disk.  Available formats are PNG, JPEG, TIF and BMP.
 +
 +
 +
[[Image:MARINa_graph_save_image.png]]
 +
 +
===Table Viewer===
 +
 +
====Result Columns====
 +
 +
This example was run with sample shuffling. When probe shuffling is used, the MeanClass1 and MeanClass2 columns are not shown.
 +
 +
 +
[[Image:MARINa_Table_Viewer_full.png|{{ImageMaxWidth}}]]
 +
 +
 +
 +
The MARINa results file contains one line for each TF whose GSEA enrichment score renders it significant above the user-specified p-value threshold (''pvalue_gsea''), as long as such TF is not "shadowed" by other significant TFs.
 +
 +
'''NOTE''': synergy analysis is not performed in this version of MARINa.
 +
 +
The columns represented in the results are as follows:
 
* '''TFsym''': transcription factor id (usually this is the probeset id).
 
* '''TFsym''': transcription factor id (usually this is the probeset id).
 
* '''GeneName''': transcription factor gene name.
 
* '''GeneName''': transcription factor gene name.
Line 52: Line 303:
 
* '''NES''': GSEA normalized enrichment score for the regulon of the TF.
 
* '''NES''': GSEA normalized enrichment score for the regulon of the TF.
 
* '''absNES''': absolute value of NES
 
* '''absNES''': absolute value of NES
* '''PV''': p-value of normalized (?) enrichment score.
+
* '''PV''': p-value of normalized enrichment score (NES).
* '''OddRatio''': (NumLedge/(microarray set genes in the regulon of the TF))/((number of differentially expressed genes left of the leading edge)/(total number of microarray set genes)) - <Aris>please check with Mukesh that this is the correct definition</Aris>.
+
* '''OddRatio''': odds of a regulon gene being in the GSEA leading edge set / odds of a regulon gene being in the GSEA trailing edge set.
* '''TScore''' - ???
+
* '''TScore''' - t-test t-statistic.
 
* '''MeanClass1''': mean expression value of TF among all "Class 1" arrays.
 
* '''MeanClass1''': mean expression value of TF among all "Class 1" arrays.
 
* '''MeanClass2''': mean expression value of TF among all "Class 2" arrays.
 
* '''MeanClass2''': mean expression value of TF among all "Class 2" arrays.
* Note - there are no MeanClass1 and MeanClass2 when probe shuffling is performed.
+
* Note - there are no columns for MeanClass1 and MeanClass2 when probe shuffling is performed.
* '''Original MRA/Recovered_MRA''': a value of 1 in this column means that the TF was found to be enriched by GSEA (above the significance level ''pvalue_gsea'') <u>and</u> that it was not shadowed by any other TF. A value of 0 means that the TF was found to be enriched by GSEA <u>and</u> that it was shadowed by another TF <u>and</u> that it remained enriched even after the common targets with the other TF were removed from its regulon.
+
* '''Original MRA/Recovered_MRA''': a value of 1 in this column means that the TF was found to be enriched by GSEA (above the significance level ''pvalue_gsea'') <u>and</u> that it was not shadowed by any other TF. A value of 0 means that the TF was found to be enriched by GSEA <u>and</u> that it was shadowed by another TF <u>and</u> that it remained enriched even after the common targets with the other TF were removed from its regulon.  For details on Shadow analysis please see pg. 17 of the [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2913282/bin/msb201031-s1.pdf Supplemental text] to [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2913282 Lefebvre et al. 2010.].
 +
 
 +
====Controls====
 +
* '''Sorting''' - The table can be sorted on the value of any column by clicking on the column header.
 +
 
 +
* '''Export Table button''' - Export the table to a CSV format file.  The table contents are exported in the same order shown on screen, preserving sorting on any column values.
  
==Extensions for the geWorkbench MRA component==
+
=References=
To invoke the MARINa service from within the geWorkbench MRA component (and display the analysis results) we should make the following changes:
+
<span id="Basso2005"></span>
 +
* Basso K, Margolin AA, Stolovitzky G, Klein U, Dalla-Favera R, Califano A (2005). Reverse engineering of regulatory networks in human B cells. Nat Genet 37(4):382-390 ([http://www.nature.com/ng/journal/v37/n4/abs/ng1532.html link to paper]).
 +
<span id="Lefebvre2010"></span>
  
===MRA Analysis component===
+
* Carro MS, Lim WK, Alvarez MJ, Bollo RJ, Zhao X, Snyder EY, Sulman EP, Anne SL, Doetsch F, Colman H, Lasorella A, Aldape K, Califano A, Iavarone A  (2010)  The transcriptional network for mesenchymal transformation of brain tumors.  Nature 463(7279):318-25.
  
* '''MARINa'''. In here we specify the params that are specific to the MARINa analysis.These are:
+
* Lefebvre C, Rajbhandari P, Alvarez MJ, Bandaru P, Lim WK, Sato M, Wang K, Sumazin P, Kustagi M, Bisikirska BC, Basso K, Beltrao P, Krogan N, Gautier J, Dalla-Favera R, Califano A (2010) A human B-cell interactome identifies MYB and FOXM1 as master regulators of proliferation in germinal centers. Mol Syst Biol. 6:377. PMID: [http://www.ncbi.nlm.nih.gov/pubmed/20531406 20531406].
** The Class 1 (i.e. case) and Class 2 (i.e. control) arrays. These should be selected (activated) in the Arrays component available array sets.
+
<span id="Lim2009"></span>
*** Note the two selected array sets have no arrays in common. An error popup should notify the user if either of these conditions are violated.
 
** The values for the applicable parameters listed under "GSEA and shadow/synergy analysis run parameters" above.  
 
*** "paired",  - if the user only includes one class (control), then it is paired.  If both classes are provided, it is unpaired.
 
  
 +
* Lim WK, Lyashenko E, Califano A (2009) Master regulators used as breast cancer metastasis classifier. Pac Symp Biocomput. 2009:504-15.  PMID [http://www.ncbi.nlm.nih.gov/pubmed/19209726 19209726], ([http://psb.stanford.edu/psb-online/proceedings/psb09/lim.pdf link to paper]).
  
===MRA Results Visualization===
+
* Subramanian A1, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.  Proc Natl Acad Sci U S A. 102(43):15545-50.  PMID:  [http://www.ncbi.nlm.nih.gov/pubmed/16199517 16199517].

Latest revision as of 18:45, 6 March 2015

Home | Quick Start | Basics | Menu Bar | Preferences | Component Configuration Manager | Workspace | Information Panel | Local Data Files | File Formats | caArray | Array Sets | Marker Sets | Microarray Dataset Viewers | Filtering | Normalization | Tutorial Data | geWorkbench-web Tutorials

Analysis Framework | ANOVA | ARACNe | BLAST | Cellular Networks KnowledgeBase | CeRNA/Hermes Query | Classification (KNN, WV) | Color Mosaic | Consensus Clustering | Cytoscape | Cupid | DeMAND | Expression Value Distribution | Fold-Change | Gene Ontology Term Analysis | Gene Ontology Viewer | GenomeSpace | genSpace | Grid Services | GSEA | Hierarchical Clustering | IDEA | Jmol | K-Means Clustering | LINCS Query | Marker Annotations | MarkUs | Master Regulator Analysis | (MRA-FET Method) | (MRA-MARINa Method) | MatrixREDUCE | MINDy | Pattern Discovery | PCA | Promoter Analysis | Pudge | SAM | Sequence Retriever | SkyBase | SkyLine | SOM | SVM | T-Test | Viper Analysis | Volcano Plot


Overview

This chapter details the MARINa (MAster Regulator INference algorithm) method of Master Regulator Analysis. Please see the Master Regulator Analysis chapter for a higher-level introduction.

MARINa uses gene set enrichment analysis (GSEA)(Subramanian et al, 2005) to calculate if the regulon of a TF is enriched for genes that are differentially expressed among 2 classes of microarrays. Please note that this does not involve any of the special gene lists available for use with the GSEA algorithm, such as MSigDB. Following the GSEA runs, MARINa performs an additional "shadow" analyses. The "Synergy" analysis option is not currently supported.

The MARINa algorithm runs on a computational cluster at Columbia and access is restricted. It is run from geWorkbench using a grid service, selected using the "Services" tab.


The MARINa algorithm differs substantially from the "FET" method described in the MRA-FET chapter.

  1. The list of signature markers is determined using a t-test directly in the MARINa algorithm, rather than being loaded separately as for the FET method.
  2. The list of candidate Master Regulators is taken from the hub markers in the network, rather than using a separate list as in the "FET" method.
  3. MARINa uses "case" and "control" designations of arrays in the Arrays component if provided. However, unlike the FET method, it will also run if only "case" arrays are designated. This should be used only where the loaded expression dataset has been pre-processed to represent fold-change values. geWorkbench itself cannot produce such a dataset.

A special 2-tailed version of GSEA is used, as described in Lim WK, Lyashenko E, Califano A (2009).

Inputs

Main Tab

These inputs and the GUI are described in detail in the chapter Master Regulator Analysis.

  • Network - the network (e.g. from ARACNe) upon which MARINa will operate.
    • If the network is loaded into MARINa as gene symbols or Entrez IDs, it will be transformed (expanded) to include all probesets annotated to each such gene if an annotation file has been loaded for the expression dataset.
  • GSEA P-Value: The enrichment score p-value below which a regulon is considered enriched in differentially expressed genes.

MARINa Parameters Tab

  • GSEA and shadow/synergy analysis run parameters:
    • Minimum Number of Targets: minimum number of genes in a regulon for a GSEA run (a positive integer).
    • Minimum Number of Samples: minimum number of arrays PER CLASS (e.g. case or control) in the input microarray set in order to run sample-based permutations in GSEA (a positive integer). If sample-based permutations cannot be used, then probe shuffling will be used instead. The default is set to 6, but having a least 7 is better and 8 per class will give much better significance.
    • Number of GSEA permuations: number of GSEA permutations to compute enrichment score distribution (a positive integer).
    • GSEA Tail: whether 1-tailed (1) or 2-tailed GSEA (2) is to be performed. If all correlations in the network file are greater than zero, then one-tail GSEA will be performed regardless of the value the user assigns to "tail".
    • Shadow P-value: significance threshold for shadow analysis.
    • SynergyP-value: significance threshold for synergy analysis.
  • Retrieve prior result with ID (checkbox) - if checked, all parameter fields are disabled and the prior ID field is enabled. The user can retrieve the results of a previous run. In this case, the "case" and "control" array sets do not need to be activated - they are retrieved from the MARINa run directory on the server.

GSEA Tail

If a two-tailed GSEA run is requested, the MARINa algorithm will divide a TF's regulon genes into positive and negative correlation groups, based on the Spearman's correlation between the expression of the TF and each gene in its regulon. However, if all correlations are found to be positive, only a single GSEA run will be performed. The correlation values are calculated by geWorkbench, or can also be provided directly if the "5-column network file" is used.

Sample Shuffling

Sample shuffling is done with random draws. If the number of samples per class is too low, the number of unique draws may be much lower than the number of permutations and lead to a false calculation of power.

The number of possible permutations is given by N!/2*{(N/2)!(N/2)!], where N is the total number of samples in the two classes together. The two in the divisor is because each draw of case and control arrays can occur in either order with identical result. That is, the two outcomes are not distinguishable in the calculation.

Note also that because random draws are performed, the number of unique draws will be less than the number of draws. If the number of samples per class is low, then many repeats of the same draw will be encountered.

For N=14 (that is, 7 per class), the number of unique results is 1716. For N=16 (8 per class), there are 6435 unique draws.

In addition, it is better if the number of samples per class is balanced, rather than say 6 in one class and 60 in the other class.

Finally, it is recommended to treat the calculated p-values with scepticism. One common correction is to multiply calculated p-values by a factor of 10, so a calculated p-value of 0.001 is treated instead as 0.01. This correction is not included in geWorkbench, but can be set by the user if so desired. That is, if you want a p-value of 0.01, instead set a p-value of 0.001.

Unpaired/Paired Runs

Normally, the user will specify two classes of arrays, e.g. "case" and "control". Under these conditions, an unpaired calculation is performed. However, if the user only activates one class of array ("case"), then "paired" analysis is performed. The assumption here is that the values are not direct expression values but are fold-change values for paired samples from patients. Such a pre-analyzed dataset would need to be loaded into geWorkbench, as there is no built-in method to performed such a paired analysis.

Network alternative 5-column file format

MARINa in geWorkbench will normally be set up to work with a network that is either already in the Workspace, or that is loaded from disk in a standard format, e.g. an ARACNe-style adjacency matrix or SIF file. geWorkbench will generate a file in the following 5-column format which is then sent to the remote MARINa service. However, a file in this format can also be directly loaded into the MARINa component. In this case, the values contained in the file are used directly, and are not replaced by values calculated by geWorkbench.

Each line represents a network edge and comprises five tab-delimited columns:

  • Transcription factor id: A string that provides the transcription factor end of the edge. This is usually a probeset id or a gene symbol.
  • Target id: A string that provides the target end of the edge. This is usually a probeset id or a gene symbol
  • Mutual information: The mutual information (MI) of the edge (a real number). If the edge MI is not known/available, the value 1 can be entered here.
  • Spearman's correlation: The Spearman's correlation for the edge, computed on the original microarray set that gave rise to the ARACNe network (a real number). If not known/available, the value 1 can be entered here.
  • P-value for Spearman's correlation. The p-value associated with Spearman's correlation found in the previous column (a real number between 0 and 1). It this p-value is not known/available, a value of 0 can be entered here.

Note on use of Activated Marker Sets

Activated marker sets should not normally be used during a MARINa run. If a marker set is activated, any markers not in the activated set will not be considered as possible targets.

Example data files for running MARINa

MARINa runs, even on a large computational cluster, can be very time-consuming. We include here a small test dataset with which to demonstrate running MARINa.

Example data files (.zip). Includes expression file, network file (5-column format) and files defining two microarray classes. The individual files are:

  • mra_grid_data.exp - Affymetrix File Matrix format expression file.
  • mra_grid_5colnetwork.txt - A network file matching the expression data. This file is in the "5-column" format and can only be loaded directly into the MRA component. Network files in the adjacency matrix and SIF formats should normally be used, this file is for demonstration only.
  • mra_grid_ixclass1.csv - array set definition file, for loading into the Arrays component to define phenotype class 1 (Case).
  • mra_grid_ixclass2.csv - array set definition file, for loading into the Arrays component to define phenotype class 2 (Control).


Note - This is a much smaller dataset than the one used to generate results shown further below.

Running MARINa

Prerequisites

  • Expression Data
    • An expression data set must be loaded in geWorkbench.
    • The expression data should be log transformed for best results with the t-test.
  • Phenotypic classes - specified in the Arrays component in the currently active list view.
    • Two classes (unpaired analysis) - Normally, "case" (reported as class1) and "control" (reported as class2) array sets should be defined and activated in the Arrays component. There can be no arrays that belong to both classes. If there are, a warning will be displayed and the analysis will not be launched. With two specified classes, the calculation will be run in "unpaired" mode.
    • Single class (paired analysis) - If only "case" arrays are activated, then a "paired" analysis will be run. This mode assumes that the expression data already represents fold-change results, a type of data manipulation that is not directly supported in geWorkbench. This data should be either e.g. log2(case/control) or log2(case) - log2(control).
  • Network file - a network that is already present in the Workspace as a child of the expression dataset can be used, or a network can be loaded directly into the MRA component via its Main tab.


Example of activating "case" and "control" array sets in the Arrays component.


MARINa activate array sets.png

Setting the MARINa Parameters

Please note that MARINa analysis can only be selected in the Services tab as a grid service. Just opening the MARINa parameters tab does not select the MARINa service.


Main Tab

MARINa Parameters Main.png


  • In the Main tab of the MRA component, select or load a network.
  • Set the desired GSEA p-value.

Selecting the MARINa grid service

  • You must search for and activate the MARINa grid service in the MRA services tab. If the local service is selected it will run the FET method.

The figure below shows the Services tab after the MARINa grid service has been retrieved and selected.

MRA Services Grid Selected.png

MARINa Parameters

  • Inspect and if desired change the parameters. See the section on "Inputs" above.


MARINa Parameters Parameters.png

MARINa Parameters - Retrieve prior result

  • If the "Retrieve prior results" box is checked, all other parameter fields will be disabled and disregarded.
  • In this case, it is not necessary to activate "case" and "control" sets. They will be retrieved from the server for use in generating the "bar-code" display.
  • Enter the MARINa runID from a previous run of MARINa (this runID is found in the name of the MARINa result node in the Workspace, e.g. "mra30980", see an example in the Results section below).


MARINa Retrieve Prior.png

Analyze

Push the Analyze button to launch the analysis on the grid service. You will be prompted to enter a grid username and password.

Viewing MARINa Results

The MARINa calculation result is placed in the Workspace as a new data node. The name of the node takes the form "MARINa Result: mranum, where num is a job id assigned on the server.


MARINa Result Node.png


Results Viewer

MARINa Main Viewer full.png


Note - if no significant MRs are found, an empty result node is returned to the Workspace. The MRA viewer will appear but be empty.


Summary Listing

MARINa summary listing.png

First Row of Controls

  • Symbol - display the markers using their gene symbol (if available)
  • Probeset - display markers using their marker (probeset) name.
  • Graphs for top ... - Restrict the "bar-code graph" to at most the specified number of entries.
  • Bar height ... - set the height of the vertical lines in the bar graph in pixels.
  • Bars for
    • Regulon - draw bars for each marker in the hub marker's regulon
    • GSEA Leading Edge - draw bars for only those markers in the GSEA leading edge set.

Second Row of Controls

Export Table

This command will export the entire master regulator results table to a file. It exports the same information shown on screen, sorted in the same way if the table has been sorted on one of the columns. The user can choose to export the table in CSV (.csv) or tab-delimited text format (.txt).


Export all targets

This command writes a file to disk containing each MR in the table, along with each MR's GSEA leading-edge targets and a differential expression value for each target (-log10(P-value)*sign(t-value)).

Each master regulator is listed on a line, followed by its targets with their transformed t-test values. Each MR is separated by a blank line from the preceding section. The order in the file is not changed by sorting the results table prior to export.

Export File format:

marker, gene name, -log10(P-value)*sign(t-value)

Example:

220462_at, CSRNP3
200660_at, S100A11, 12.541623
201474_s_at, ITGA3, 7.4126143
202910_s_at, CD97, 10.785
....
202614_at, SLC30A9
160020_at, MMP14, 4.415267
200808_s_at, ZYX, 9.006654
200859_x_at, FLNA, 8.309419
....


Exported files automatically receive a ".csv" file name extension.

Add Targets to Set

Create a new marker set in the Markers component containing the GSEA leading-edge targets for the selected master regulator. The set is named after the master regulator.

MARINa Viewer Table Columns

  • Master Regulator: transcription factor gene name or probeset id.
  • GSEA P-value: p-value of the GSEA result.
  • Markers in regulon:
  • NumLedge: sum of NumLedgePos and NumLedgeNeg from the Table Viewer. The number of genes in the combined GSEA leading edge.
  • OddRatio: odds of a regulon gene being in the GSEA leading edge set / odds of a regulon gene being in the GSEA trailing edge set.
  • NES: GSEA normalized enrichment score for the regulon of the TF.
  • absNES: absolute value of NES
  • Mode: taken from the sign of the NES. Indicates whether the activity of the TF is positive or repressed in the "case" vs "control" samples.

Detailed Listing

The detailed list shows the genes/markers contained in the intersection set of the MR regulon and the signature.


MARINa Detailed listing.png


The genes are displayed in a table with the following columns:

  • Markers in leading edge: the names of the genes in the GSEA leading-edge set. Either the gene name or the marker/probe set name is used (based on the choice of "Symbol" or "Probe Set" radio buttons).
  • -log10(p-value) * sign (t-value): A modified test statistic combining the -log10(p-value) with the sign of the t-value. The sign of the t-value indicates positive or negative differential expression.


Export Table

This command will export to a file on disk the contents of the detailed target results table.

The file can be written in either tab-delimited (.txt) or CSV (.csv) format.

The columns exported are:

  • Markers in GSEA leading edge set
  • -log10(P-value) * sign of t-value


Bar Graph View

Description

The bar-code graph is created based on ranked differential expression results for all markers in the dataset. However, only markers in the TF's regulon or GSEA-leading edge set (depending on the setting chosen) are drawn as vertical bars, allowing their positions in the entire set of markers to be visualized.

The value used to calculate the differential expression display is -log10(p-value) * sign (t-value), as in the detailed table display described above.


MARINa GBM bar code detail.png

  • Vertical bars - The vertical bars correspond to ranked positions of the markers belonging to each TF's regulon or intersection set (depending on the setting chosen).
  • Bar position on horizontal axis - bars for displayed markers are positioned using their rank in a list of all markers ordered by (-log10(p-value) * sign (t-value)), calculated using a t-test for differential expression.
  • Bar Color - The color of each bar indicates the sign of the Spearman's Correlation between the expression profile of the TF and its targets (calculated using data from all microarrays in the experiment, not just those in the case and control sets):
    • Red means that the two markers are positively correlated (r >= 0) while
    • Blue means that correlation is negative (r < 0).
    • The color intensity of each bar is scaled to represent the number of overlapping bars at any given point in the graph.
  • Gradient - The red-blue gradient at the bottom of the graph qualitatively represents the ranking between the lowest (blue) and the highest (red) test statistic. The white area in the middle represents the middle of the ranking (not necessarily zero differential expression). This gradient does not represent the colors used for the bars themselves, only the relative position in the ranked differential expression results.

Detailed Examples

Detail for CEBPD:

MARINa CEBPD barcodes.png

The bar graph shown above, for CEBPD, indicates that the positive regulon of CEBPD is up-regulated in the "case" mesenchymal phenotype, whereas the negative regulon is down-regulated. CEBPD is activated in the "case", mesenchymal phenotype.


Detail for ZNF248:

MARINa znf248 barcodes.png


The bar graph shown above, for ZNF248, indicates that the positive regulon of CEBPD is down-regulated in the "case" mesenchymal phenotype, whereas the negative regulon is up-regulated. ZNF248 activity is repressed in the "case", mesenchymal phenotype.

Save Image

Via a right-click menu on the bar graph, the user can save an image of the displayed bar graph to

  • the Workspace as an image snapshot, or
  • directly to a file on disk. Available formats are PNG, JPEG, TIF and BMP.


MARINa graph save image.png

Table Viewer

Result Columns

This example was run with sample shuffling. When probe shuffling is used, the MeanClass1 and MeanClass2 columns are not shown.


MARINa Table Viewer full.png


The MARINa results file contains one line for each TF whose GSEA enrichment score renders it significant above the user-specified p-value threshold (pvalue_gsea), as long as such TF is not "shadowed" by other significant TFs.

NOTE: synergy analysis is not performed in this version of MARINa.

The columns represented in the results are as follows:

  • TFsym: transcription factor id (usually this is the probeset id).
  • GeneName: transcription factor gene name.
  • NumPosGSet: number of genes (among those in the regulon of the transcription factor) which are positively regulated by the TF. Only genes represented in the microarray set are counted.
  • NumNegGSet: number of genes (among those in the regulon of the transcription factor) which are negatively regulated by the TF. Only genes represented in the microarray set are counted.
  • NumLedgePos: number of genes (among those counted under the NumPosGSet column) which are in the GSEA leading edge.
  • NumLedgeNeg: number of genes (among those counted under the NumNegGSet column) which are in the GSEA leading edge.
  • NumLedge: sum of NumLedgePos and NumLedgeNeg.
  • ES: GSEA enrichment score for the regulon of the TF.
  • NES: GSEA normalized enrichment score for the regulon of the TF.
  • absNES: absolute value of NES
  • PV: p-value of normalized enrichment score (NES).
  • OddRatio: odds of a regulon gene being in the GSEA leading edge set / odds of a regulon gene being in the GSEA trailing edge set.
  • TScore - t-test t-statistic.
  • MeanClass1: mean expression value of TF among all "Class 1" arrays.
  • MeanClass2: mean expression value of TF among all "Class 2" arrays.
  • Note - there are no columns for MeanClass1 and MeanClass2 when probe shuffling is performed.
  • Original MRA/Recovered_MRA: a value of 1 in this column means that the TF was found to be enriched by GSEA (above the significance level pvalue_gsea) and that it was not shadowed by any other TF. A value of 0 means that the TF was found to be enriched by GSEA and that it was shadowed by another TF and that it remained enriched even after the common targets with the other TF were removed from its regulon. For details on Shadow analysis please see pg. 17 of the Supplemental text to Lefebvre et al. 2010..

Controls

  • Sorting - The table can be sorted on the value of any column by clicking on the column header.
  • Export Table button - Export the table to a CSV format file. The table contents are exported in the same order shown on screen, preserving sorting on any column values.

References

  • Basso K, Margolin AA, Stolovitzky G, Klein U, Dalla-Favera R, Califano A (2005). Reverse engineering of regulatory networks in human B cells. Nat Genet 37(4):382-390 (link to paper).

  • Carro MS, Lim WK, Alvarez MJ, Bollo RJ, Zhao X, Snyder EY, Sulman EP, Anne SL, Doetsch F, Colman H, Lasorella A, Aldape K, Califano A, Iavarone A (2010) The transcriptional network for mesenchymal transformation of brain tumors. Nature 463(7279):318-25.
  • Lefebvre C, Rajbhandari P, Alvarez MJ, Bandaru P, Lim WK, Sato M, Wang K, Sumazin P, Kustagi M, Bisikirska BC, Basso K, Beltrao P, Krogan N, Gautier J, Dalla-Favera R, Califano A (2010) A human B-cell interactome identifies MYB and FOXM1 as master regulators of proliferation in germinal centers. Mol Syst Biol. 6:377. PMID: 20531406.

  • Lim WK, Lyashenko E, Califano A (2009) Master regulators used as breast cancer metastasis classifier. Pac Symp Biocomput. 2009:504-15. PMID 19209726, (link to paper).
  • Subramanian A1, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 102(43):15545-50. PMID: 16199517.