ARACNe

Revision as of 17:26, 30 January 2015 by Smith (talk | contribs) (P-Value)




Overview

ARACNe (Algorithm for the Reconstruction of Accurate Cellular Networks) (Basso 2005, Margolin 2006a, 2006b) is an information-theoretic algorithm used to identify transcriptional interactions between gene products using microarray gene expression profile data. By proper selection of samples, a tissue/phenotype-specific set of pairwise regulatory interactions can be obtained – an “interactome”. Such an interactome can form the basis for more complex analysis of cellular regulatory networks. ARACNe has been used to reconstruct networks in mammalian cells through appropriate choice of dataset.

For a dataset with a simple, monotonic relationship between input and output, analysis with a normal (e.g. Pearson's) correlation function may be the most suitable method. Where the input/output function is non-linear or irregular, a method based on the calculation of mutual information, such as ARACNe, may be able to find relationships that Pearson's correlation will not find. Calculation of the mutual information does not require a monotonic relationship. ARACNe has proven to be well suited for the reverse engineering of regulatory networks in the context of specific cellular types.
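The difference between the two measures can be illustrated with a small synthetic example. The sketch below is not ARACNe's estimator: it uses a simple histogram-based (plug-in) MI estimate, and the function name `histogram_mi`, the bin count and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def histogram_mi(x, y, bins=8):
    """Plug-in mutual information estimate (in nats) from a 2-D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)       # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)       # marginal of y, shape (1, bins)
    nz = pxy > 0                              # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 2000)
y_dep = x ** 2 + rng.normal(0.0, 0.05, x.size)   # depends on x, but not monotonically
y_ind = rng.uniform(-1.0, 1.0, x.size)           # independent of x

pearson_dep = float(np.corrcoef(x, y_dep)[0, 1])  # close to zero for this relationship
mi_dep = histogram_mi(x, y_dep)                   # clearly above the independent case
mi_ind = histogram_mi(x, y_ind)                   # small positive value (estimator bias)
print(pearson_dep, mi_dep, mi_ind)
```

In this simulated case Pearson's correlation is near zero even though the second variable is almost fully determined by the first, while the MI estimate separates the dependent pair from the independent one.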

ARACNe performs best with a dataset containing data from a minimum of 100 microarrays (see Margolin, Wang et al. 2006) and representing a number of different states of the same cellular system - for example, cell lines of varying phenotype, or cells subjected to a variety of experimental perturbations. Initial work with ARACNe was performed using a large collection (about 340) of B-cell lines of various phenotypes (Basso et al. 2005). For the "Adaptive Partitioning" option, there is no upper limit on the number of arrays that can be included. For the "Fixed Bandwidth" option, arrays in excess of 300 can lead to long computation times. A subset of the B-Cell dataset, derived from 100 arrays, is included with geWorkbench (Bcell-100.exp).

Within the constraints of available computational power and memory, the more samples that are used, the greater the accuracy of the mutual information calculation.

ARACNe can perform two separate calculations:

  1. Mutual Information: The mutual information (MI) of one or more marker's expression profile(s) is calculated against all other active markers.
  2. Data Processing Inequality (DPI): The DPI calculation is used to remove the weakest interaction (edge) between any three markers. That is, if a MI value is available between each of three possible pairings of three markers, the weakest interaction of the three will be removed from the output. This has the intent of removing indirect interactions. For example, if A->B->C, the indirect interaction A->C will likely be weaker than A->B or B->C and would be removed. A tolerance can be set which relaxes this screening to account for uncertainty in the MI calculation.
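The DPI step described above can be sketched in a few lines of Python. This is an illustration only, not the geWorkbench code: the function name `apply_dpi`, the edge representation, and the exact form of the tolerance rule (the weakest edge is removed only if it falls below the next-weakest edge by more than the tolerance fraction) are assumptions.

```python
from itertools import combinations

def apply_dpi(mi, tolerance=0.0):
    """Return the edges that survive DPI pruning.

    mi:        dict mapping frozenset({a, b}) -> MI value of an above-threshold edge
    tolerance: fraction of the MI treated as sampling error (0.0 to 1.0)
    """
    nodes = sorted({n for edge in mi for n in edge})
    to_remove = set()
    for a, b, c in combinations(nodes, 3):
        tri = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
        if not all(e in mi for e in tri):
            continue                      # DPI only applies to complete triangles
        tri.sort(key=lambda e: mi[e])
        weakest, second = tri[0], tri[1]
        # Assumed tolerance rule: remove the weakest edge only if it is
        # smaller than the next-weakest edge by more than the tolerance.
        if mi[weakest] < (1.0 - tolerance) * mi[second]:
            to_remove.add(weakest)
    # Edges are marked against the original values and removed at the end,
    # so the result does not depend on the order in which triangles are visited.
    return {e: v for e, v in mi.items() if e not in to_remove}

# A -> B -> C with a weaker indirect A-C edge (made-up MI values).
example = {
    frozenset(("A", "B")): 0.8,
    frozenset(("B", "C")): 0.7,
    frozenset(("A", "C")): 0.3,
}
strict = apply_dpi(example, tolerance=0.0)    # A-C is removed
relaxed = apply_dpi(example, tolerance=0.6)   # A-C is kept: 0.3 >= 0.4 * 0.7
```

With a tolerance of zero the indirect A-C edge is dropped; raising the tolerance makes the test harder to satisfy, so fewer edges are removed.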

Parameters described below allow one to incorporate a list of putative transcription factors and optimize the run to discover targets that they may regulate.

Current versions of geWorkbench (starting with release 1.7.0) use a new implementation of ARACNe, ARACNe2, that supports two important new features:

  1. Algorithm - A new algorithm, termed "Adaptive Partitioning", which is much faster than the original "Fixed Bandwidth" method, was added.
  2. Mode - The user can choose to custom-calculate optimal run parameters for a given dataset. See the option "Preprocessing" below.


Further information on ARACNe is available in the References section below.

Setting up an ARACNe run

Prerequisites

  • To use the ARACNe routine, first check that it has been loaded in the Component Configuration Manager.
  • ARACNe is found in the list of available analysis routines in the lower-right Commands quadrant of geWorkbench.
  • A microarray dataset of sufficient size and phenotypic diversity is needed (See the Overview, above).
  • Load the microarray dataset into the Workspace. If available, associate a gene annotation file with the dataset. This will allow the results to be displayed in consolidated fashion in Cytoscape by gene rather than by marker (individual probeset) name.
  • Warning on too few arrays - if a dataset with fewer than 100 microarrays is submitted for ARACNe analysis, a warning message will appear notifying the user of the suggested minimum of 100 arrays.
  • Java maximum heap memory - for large datasets, or for running bootstrapping, you may need to use a larger Java maximum heap memory size. Instructions for starting geWorkbench with larger amounts of memory requested can be found in the geWorkbench Installation Instructions. These increased memory settings can only be used on machines with correspondingly larger amounts of RAM.

Parameters and Settings

ARACNe Parameters.png

Algorithm

Two algorithms are offered with which to calculate the pairwise mutual information between markers:

  • Adaptive Partitioning (default) - should generally be used for all new calculations.
  • Fixed Bandwidth - previous, slower algorithm using a fixed Gaussian kernel.

Adaptive Partitioning

Adaptive Partitioning was added with the incorporation of the ARACNe2 code into geWorkbench. Adaptive Partitioning is much faster than the original Fixed Bandwidth method, and is also considered to produce superior results under certain circumstances. Unlike the Fixed Bandwidth method, it does not use a fixed kernel width when calculating the MI.

Fixed Bandwidth

Fixed Bandwidth was the original algorithm used in ARACNe and is included for compatibility with previous releases. This method uses a kernel-width parameter for a Gaussian function used to calculate the MI.
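To give a feel for what a kernel-width parameter does, the following is a minimal sketch of a generic Gaussian-kernel MI estimate. It is a textbook-style plug-in estimator, not the ARACNe implementation: the function name `gaussian_kernel_mi`, the use of a single width `h` for both variables, and the simulated data are assumptions, and any data transformation ARACNe applies before estimation is omitted.

```python
import numpy as np

def gaussian_kernel_mi(x, y, h):
    """Gaussian-kernel MI estimate (in nats) using one fixed kernel width h."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    # Pairwise Gaussian kernel weights for each variable.
    kx = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2.0 * h * h))
    ky = np.exp(-((y[:, None] - y[None, :]) ** 2) / (2.0 * h * h))
    # Kernel density estimates at each sample point. The Gaussian
    # normalisation constants cancel in the ratio f(x,y) / (f(x) f(y)).
    joint = (kx * ky).sum(axis=1)
    marg_x = kx.sum(axis=1)
    marg_y = ky.sum(axis=1)
    return float(np.mean(np.log(n * joint / (marg_x * marg_y))))

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_dep = x + 0.1 * rng.normal(size=200)   # strongly dependent on x
y_ind = rng.normal(size=200)             # independent of x

mi_dep = gaussian_kernel_mi(x, y_dep, h=0.3)
mi_ind = gaussian_kernel_mi(x, y_ind, h=0.3)
print(mi_dep, mi_ind)
```

The estimate changes with `h`, which is why the Fixed Bandwidth method needs a kernel width that is appropriate for the dataset, either inferred or specified directly.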

Mode

Used to control the calculation and use of runtime parameters from the input dataset.

  • PREPROCESSING - calculates the required parameters and writes them to parameter files.
  • DISCOVERY - The ARACNe mutual information calculation is run. Uses pre-calculated parameter files as needed if they are present.
  • COMPLETE - Preprocessing and Discovery runs are combined into a single step.

Preprocessing

In this mode, runtime parameters are calculated, but no MI calculation is performed. Preprocessing for a given combination of dataset and algorithm need be run only once. The results are written to one or two files in the geWorkbench settings directory (~/.geworkbench/sys_data). The names used for these files incorporate both the name of the dataset (taken from the name shown in the Workspace) and the name of the algorithm, and thus are specific to the particular combination. Each time ARACNe is run in Discovery mode, it will look for the dataset-specific parameter files in its root directory. If the files are not found (Preprocessing has not been run), default parameter values will be used.

  • Fixed Bandwidth (FBW) algorithm - two files are written to the geWorkbench root directory, one containing parameters for calculating the kernel width, and the other containing parameters for calculating a MI threshold from a specified P-value.
  • Adaptive Partitioning (AP) algorithm - only the parameter file for calculating a MI threshold from a specified P-value is written. No kernel-width parameter is used.

Note - if the name of the expression dataset in the Workspace is changed after ARACNe preprocessing has been run, the corresponding parameter file(s) that was/were created will not be found when the discovery run is performed. Please do not alter the expression dataset name after ARACNe preprocessing has been run.

Preprocessing File Naming

As described just above, ARACNe preprocessing creates new parameter files in the geWorkbench settings directory (~/.geworkbench/sys_data).

Preprocessing parameter file names are formed by appending the following to the dataset name:

  • Fixed Bandwidth
    • Kernel width - "_ARACNe_FBW_kernel.txt"
    • Threshold - "_ARACNe_FBW_threshold.txt"
  • Adaptive Partitioning
    • Threshold - "_ARACNe_AP_threshold.txt"
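As a convenience, the naming rule above can be used to check which parameter files a Discovery run would look for. This is a hypothetical helper, not part of geWorkbench: the function names are made up, and whether the dataset name shown in the Workspace includes a file extension is an assumption (the name is used exactly as given).

```python
from pathlib import Path

# Settings directory named in this tutorial.
SETTINGS_DIR = Path.home() / ".geworkbench" / "sys_data"

# Suffixes appended to the dataset name, per algorithm.
SUFFIXES = {
    "FIXED_BANDWIDTH": ["_ARACNe_FBW_kernel.txt", "_ARACNe_FBW_threshold.txt"],
    "ADAPTIVE_PARTITIONING": ["_ARACNe_AP_threshold.txt"],
}

def expected_parameter_files(dataset_name, algorithm):
    """Paths of the preprocessing files expected for this dataset and algorithm."""
    return [SETTINGS_DIR / (dataset_name + suffix) for suffix in SUFFIXES[algorithm]]

def missing_parameter_files(dataset_name, algorithm):
    """Subset of the expected files that do not exist on disk."""
    return [p for p in expected_parameter_files(dataset_name, algorithm) if not p.exists()]

ap_files = expected_parameter_files("Bcell-100.exp", "ADAPTIVE_PARTITIONING")
fbw_files = expected_parameter_files("Bcell-100.exp", "FIXED_BANDWIDTH")
print([p.name for p in ap_files + fbw_files])
```

If `missing_parameter_files` returns a non-empty list, Preprocessing has probably not been run for that combination, or the dataset has been renamed since.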


Preprocessing files included with geWorkbench

Example preprocessing files are no longer included with geWorkbench.

DISCOVERY

The ARACNe mutual information and the DPI (if selected) calculations are run. If dataset-specific parameter files are present, they will be used as needed (based on settings selected for Kernel Width and Threshold).

COMPLETE

A Preprocessing run will be performed, followed immediately by a Discovery run. The dataset-specific parameter files created during the Preprocessing step will be used if needed (based on settings selected for Kernel Width and Threshold).

When is preprocessing not needed?

The preprocessing step can be time consuming. If, for example, you are using Adaptive Partitioning and do not need to specify a p-value threshold for accepting edges, you can set an MI value as the threshold and proceed directly to Discovery mode. This will, however, make interpreting results more difficult, as raw MI values depend on many factors and cannot be directly evaluated.

If ARACNe does not find the dataset-specific parameter files it needs as described above, it will use by default parameters calculated from the B-cell dataset (see Margolin et al., 2006).

Hub Marker(s)

Specifies which gene markers will be treated as "hubs" in the ARACNe mutual information (MI) calculation. The mutual information is calculated for each specified hub marker against all other markers in the submitted dataset. For many uses, it is suggested to use a defined list of known transcription factors as hub markers, rather than using the "All vs All" setting.

  • All vs All - The MI of every pair of markers in the dataset is computed, that is, each is used as a hub.
  • From Sets - allows a set of markers defined in the Markers component to be chosen from a pulldown menu. Alternatively, the user can type in the names of desired markers directly as a comma separated list.
  • From File - allows a comma-separated list of markers to be read in from a file by clicking Load Markers.

Hub marker(s) must appear in active marker set

If a set of markers is activated in the Markers component, rather than using all markers, then the chosen hub marker(s) must also be included in an active set. If a hub marker is missing from the active sets, an error dialog will be displayed. In the picture below, the marker 1973_s_at was entered into the hub field without being part of a subset of markers that had been activated:

T ARACNe hub not in dataset.png

Threshold Type

Mutual Info.

Use the raw MI value calculated by ARACNe. Only interactions with MI above the threshold will be included in the final network. The MI can be any positive value or zero, but not negative.

P-Value

Use a p-value calculated from the MI values as the threshold. For best results, the preprocessing step must be run first to generate the parameters needed to calculate p-values from MI values for the particular data set.

Threshold

Enter the desired threshold value into the text field.

  • Note - Using a P-value for the threshold is preferred to using the raw MI value, as the MI value conveys no information about significance.

Correction

If the threshold type is chosen as P-Value, an option to apply the Bonferroni correction is offered in the adjacent pulldown menu.

  • No Correction (default) - the user can enter any desired threshold into the P-value field. No additional correction is applied.
  • Bonferroni Correction - divide the specified p-value threshold by the (number of markers)*(number of hub genes tested).
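The Bonferroni option amounts to the following arithmetic. The function name and the example counts are hypothetical and chosen only to show the calculation.

```python
def bonferroni_threshold(p_value, n_markers, n_hubs):
    """Per-test p-value threshold after Bonferroni correction."""
    if n_markers < 1 or n_hubs < 1:
        raise ValueError("n_markers and n_hubs must be positive")
    # Divide by (number of markers) * (number of hub genes tested).
    return p_value / (n_markers * n_hubs)

# Hypothetical run: p-value 0.01, 10,000 active markers, 5 hub markers.
corrected = bonferroni_threshold(0.01, 10000, 5)
print(corrected)  # 0.01 / 50000
```

Because the corrected threshold shrinks with both the number of markers and the number of hubs, an All vs All run is filtered much more strictly than a run with a handful of hubs at the same nominal p-value.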

Note - prior to release 2.6.0, the option "Correct by # of markers" was offered instead.

Kernel width

The Kernel width is a scaling parameter used for fitting a Gaussian function to the data when running the FIXED_BANDWIDTH algorithm only, otherwise this field is disabled. If used, the value can be either inferred or specified directly.

  • Inferred: If PREPROCESSING has been run on the dataset (mode is set to PREPROCESSING or COMPLETE), the kernel width is calculated directly from those results. If PREPROCESSING has not been run, the kernel width is inferred based on parameters fitted to a large B-cell dataset (Margolin et al, 2006), extrapolated for the number of samples in the dataset being tested.
  • Specify: The user can enter a value for the kernel width directly, e.g. based on a prior calculation with this dataset.

DPI Tolerance

The Data Processing Inequality can be used to remove the effects of indirect interactions, e.g. if TF1->TF2->Target, DPI can be used to remove the indirect action of TF1 on the target. It is specifically intended to "remove indirect interactions mediated through two transcriptional interactions" (ARACNe Manual). The DPI tolerance specifies the degree of sampling error to be accepted, since an exact MI value cannot be calculated from a finite sample. The higher the tolerance specified, the fewer the edges that will be removed.

  • Do Not Apply - Do not use the DPI.
  • Apply - In the text box, enter the percentage of the estimated MI to be considered as sampling error, expressed as a real number between 0.0 and 1.0. E.g. for 10%, enter 0.1.
    • NOTE - The recommended value for the DPI Tolerance, when used, is now 0.0 (zero) for most purposes.

For a full discussion of the theory and use of DPI in ARACNe, please see Margolin et al. (2006).

DPI Target List

  • The DPI target list can be used to prioritize transcriptional interactions when reverse engineering interaction networks.
  • It prevents transcriptional interactions from being removed by non-transcriptional interactions when DPI is run.
  • The Target List comes into play when DPI examines a triangle of interactions which contains one TF and two non-TFs. If the weakest of the three interactions involves the TF (a TF-nonTF edge), then that edge would be removed by a simple application of the DPI. However, if the TF is included on the DPI Target List, the TF-nonTF edge will not be removed.
  • That is, use of the Target List prevents edges originating on a TF from being removed in favor of an edge between two non-TFs.


The DPI Target List is used to screen out interactions of genes that are tightly co-expressed but are not in a regulatory relationship to each other, for example genes for two proteins that are in a physical complex and hence always produced in the same amounts. A comma-separated list can be typed in, or the list can be loaded from an external file. If used, the DPI Target List should contain all markers that are annotated as transcription factors. Signaling proteins could also be included.

  • Details: If the box is checked, the user selects and loads a file which specifies markers (which should be a list of one or more presumptive transcription factors) which will be given preferential treatment during the DPI edge-removal step.

For further explanation and figures on the DPI Target list, see Chapter 3, note 7 of the ARACNe Manual.
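The effect of the DPI Target List on a single triangle can be sketched as follows. This is one reading of the rule described above, not the ARACNe source: the function name `apply_dpi_with_tf_list`, the edge representation, the tolerance rule, and the handling of triangles with more than one TF (an edge touching a TF is protected only when the third marker is not a TF) are assumptions.

```python
from itertools import combinations

def apply_dpi_with_tf_list(mi, tf_list, tolerance=0.0):
    """DPI pruning that protects TF edges from removal by non-TF edges.

    mi:      dict mapping frozenset({a, b}) -> MI value
    tf_list: markers on the DPI Target List (presumptive TFs)
    """
    tfs = set(tf_list)
    nodes = sorted({n for edge in mi for n in edge})
    to_remove = set()
    for a, b, c in combinations(nodes, 3):
        tri = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
        if not all(e in mi for e in tri):
            continue                          # only complete triangles are tested
        tri.sort(key=lambda e: mi[e])
        weakest, second = tri[0], tri[1]
        if mi[weakest] >= (1.0 - tolerance) * mi[second]:
            continue                          # within tolerance: keep all three edges
        (third,) = {a, b, c} - weakest        # the marker not on the weakest edge
        # Assumed protection rule: an edge involving a listed TF is kept
        # when the third marker of the triangle is not itself a TF.
        if (weakest & tfs) and third not in tfs:
            continue
        to_remove.add(weakest)
    return {e: v for e, v in mi.items() if e not in to_remove}

# One TF and two tightly co-expressed non-TF genes (made-up MI values).
triangle = {
    frozenset(("TF1", "G1")): 0.3,
    frozenset(("TF1", "G2")): 0.5,
    frozenset(("G1", "G2")): 0.9,
}
plain = apply_dpi_with_tf_list(triangle, tf_list=[])            # TF1-G1 is removed
protected = apply_dpi_with_tf_list(triangle, tf_list=["TF1"])   # TF1-G1 is kept
```

Without the list, the strong G1-G2 co-expression edge causes the weaker TF1-G1 edge to be dropped; with TF1 on the list, all three edges survive this triangle.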

Bootstrapping

Bootstrap analysis can be used to generate a more reliable estimate of statistical significance for the interactions. Please see Margolin et al. 2006, Nature Protocols, Vol 1, No. 2, pg. 663-672 for further details (full reference below). Briefly, repeated runs of ARACNe are made, with arrays drawn at random from the full dataset with replacement. The same number of arrays is drawn each time as is present in the original dataset. A permutation test is then used to obtain a null distribution, against which the statistical significance of support for each network edge connection (interaction) can be measured.

  • Bootstrap number: Specifies the number of bootstrapping runs to perform.
  • Consensus threshold (for bootstrapping only): After the bootstrapping runs are made, a permutation test is used to estimate the significance of interactions. The consensus threshold sets the cutoff point for calling the interactions significant and returning them in the final adjacency matrix.
  • Note - bootstrapping does not replace the need to filter the individual ARACNe runs using a p-value or MI threshold. That initial screening reduces the initial network to a tractable size, and is a prerequisite for the bootstrapping permutation step.
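The resampling part of bootstrapping can be sketched as below. This only illustrates drawing arrays with replacement and counting how often each edge is recovered. The permutation test behind the consensus threshold is not implemented, and `infer_edges` is a hypothetical stand-in for a single thresholded ARACNe run (here a simple correlation cutoff, which is not what ARACNe computes).

```python
import numpy as np
from collections import Counter

def bootstrap_edge_support(data, infer_edges, n_bootstrap, seed=0):
    """Count how many bootstrap runs recover each edge.

    data:        2-D array, rows = markers, columns = arrays
    infer_edges: callable taking a resampled array and returning edges
    """
    rng = np.random.default_rng(seed)
    n_arrays = data.shape[1]
    support = Counter()
    for _ in range(n_bootstrap):
        # Draw as many arrays as the original dataset has, with replacement.
        cols = rng.integers(0, n_arrays, size=n_arrays)
        support.update(set(infer_edges(data[:, cols])))
    return support

def correlation_edges(sample, cutoff=0.8):
    """Stand-in for one network run: edges with |Pearson r| >= cutoff."""
    corr = np.corrcoef(sample)
    n = corr.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(corr[i, j]) >= cutoff]

rng = np.random.default_rng(1)
base = rng.normal(size=50)
data = np.vstack([
    base,                                  # marker 0
    base + 0.1 * rng.normal(size=50),      # marker 1, tightly coupled to marker 0
    rng.normal(size=50),                   # marker 2, unrelated
])
support = bootstrap_edge_support(data, correlation_edges, n_bootstrap=20)
print(support)
```

Edges that appear in most or all runs have strong support, while edges that appear only occasionally are candidates for removal at the consensus step.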


Merge multiple probesets

(Replaces the method "Choose edges with highest MI" offered in version 2.2.0).

Checking this box will cause interactions to be summarized at the gene level for each hub marker. The links to individual probesets will not be retained. Thus when this option is selected, the adjacency matrix will contain a single line per hub gene. This option depends on an annotation file being loaded along with the microarray dataset.

Background

On a microarray analysis platform, genes may be represented by more than one marker (probeset). The mapping between markers and genes is specified in the annotation file, if it is read in at the time that the data is loaded. The ARACNe analysis in geWorkbench is performed at the level of probesets. In some cases, an interaction between two genes may be represented by more than one edge, each such edge involving an alternate probeset for at least one of the genes.

When the "Merge multiple probesets" option is not chosen, the full ARACNe adjacency matrix, as calculated at the probeset level, will be retained and placed as a data node in the Workspace.

Merge multiple probesets selected

  • Edges - If a particular hub-target interaction is represented by more than one edge, only the edge with the highest mutual information (MI) will be retained.
  • Adjacency Matrix - The final adjacency matrix stored to the Workspace will contain gene symbols, not the particular marker ids. That is, the data is summarized at the gene level.
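The merge rule above can be sketched as a small function. It is illustrative only: the function name, the edge tuples and the probe and gene names are hypothetical, and falling back to the probeset ID when a marker has no annotation is an assumption.

```python
def merge_probesets(edges, probe_to_gene):
    """Summarize probeset-level edges at the gene level.

    edges:         iterable of (hub_probe, target_probe, mi) tuples
    probe_to_gene: dict mapping probeset ID -> gene symbol
    """
    merged = {}
    for hub, target, mi in edges:
        # Assumption: unannotated probesets keep their own ID as the name.
        key = (probe_to_gene.get(hub, hub), probe_to_gene.get(target, target))
        # Keep only the highest MI among edges mapping to the same gene pair.
        if key not in merged or mi > merged[key]:
            merged[key] = mi
    return merged

# Hypothetical annotation: two probesets for GENE_B, one each for the others.
annotation = {"p1_at": "GENE_A", "p2_at": "GENE_B", "p3_at": "GENE_B", "p4_at": "GENE_C"}
probe_edges = [
    ("p1_at", "p2_at", 0.21),
    ("p1_at", "p3_at", 0.35),   # same gene pair as above, higher MI
    ("p1_at", "p4_at", 0.12),
]
gene_edges = merge_probesets(probe_edges, annotation)
# gene_edges == {("GENE_A", "GENE_B"): 0.35, ("GENE_A", "GENE_C"): 0.12}
```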

Technical Note

  • Multiple Gene IDs - In some cases, a marker may be annotated to more than one gene in the annotation file. Only the first such gene name on an annotation line is used when determining if two probesets map to the same gene.

Array and Marker Set Overrides

  • All Markers: checking this box overrides any activated marker set in the Markers component.
  • All Arrays: checking this box overrides any activated array set in the Arrays/Phenotypes component.

Analysis Actions

  • Analyze - start the ARACNe analysis
  • Save Settings, Delete Settings - The geWorkbench analysis framework provides a standard method for saving one or more sets of parameter settings for each type of analysis component. Please see the Analysis Framework chapter for further details.

Services (Local vs Grid)

ARACNe can be run either locally within geWorkbench, or remotely as a grid job on caGrid. See the Grid Services section for further details on setting up a grid job.

Special Note on running in PREPROCESSING mode on caGrid

When ARACNe is run in PREPROCESSING mode on a grid node, it writes the parameter files to its execution directory on the grid node and exits. No file is returned to geWorkbench. As currently implemented, the ARACNe server detects the lack of a file to return (normally it returns an adjacency matrix) and reports an error. This error can simply be ignored. If ARACNe is run in COMPLETE or DISCOVERY mode, this error will not occur because both modes return adjacency matrices.

Adjacency Matrix Result Node

The result of an ARACNe run is an adjacency matrix, placed as a new data node in the Workspace as a child of the microarray dataset from which it was generated.

The adjacency matrix contains one row for each "hub" marker for which ARACNe was run. Above-threshold interactions with targets are listed, together with the MI value of each such interaction.

The adjacency matrix can be written to disk using either the "Save" or the "Export" commands. "Save" can be found by right-clicking on the adjacency matrix node, and "Export" is available in the top level "File" menu. The node to export must be selected.

IMPORTANT NOTE - For geWorkbench versions between 2.2.1 and 2.4.1, the row for a given hub may not contain all of that hub's interactions. In those versions, while each hub appears on one line, each edge is represented only once. If an edge exists between two hubs, it will appear on the line of only one of those hubs.

Beginning with version 2.5.0, all significant interactions for each hub are listed on its line in the adjacency matrix.


The file format is described further in the File Formats chapter.

Prior to release 2.2.1, the adjacency matrix, if written to disk, was not ordered by hub marker.

Viewing ARACNe results

  • The result of an ARACNe run is an "adjacency matrix". It contains the mutual information value for each pair of markers which exceeded the specified MI threshold. The adjacency matrix is placed into the Workspace as a child of the dataset that was analyzed.
  • The adjacency matrix can be visualized automatically in the Cytoscape component, as shown below.

ARACNe Example Result.png


The integration between Cytoscape and geWorkbench allows for two-way interactions between them:

  1. Nodes selected in Cytoscape appear in the Marker Sets component in the set "Cytoscape selection".
  2. Any set of markers in the Marker Sets component can be projected onto the Cytoscape display, which will cause any matching nodes there to be highlighted.

This interaction is demonstrated further on the Cytoscape tutorial page.

The Cytoscape Viewer maintains a list of networks which it has currently loaded. It allows individual loaded networks to be deleted. However, the network can be reloaded by clicking on its entry in the Workspace. Cytoscape controls are more fully described in the Cytoscape component tutorial.

Dataset History

Details about each run are maintained in the Dataset History component. With the ARACNe result node highlighted in the Workspace, the Dataset History display includes the following information:

  • Input file name
  • Output file name
  • Algorithm
  • Mode
  • No. bins
  • MI threshold
  • MI threshold calculated from P-Value - If supplied, the p-value used to set the MI threshold.
  • DPI tolerance
  • Bootstrapping - number of bootstrapping runs (if > 1).
  • DPI Target List
  • Merge multiple probesets (yes/no)
  • Hub markers setting
  • Hub markers (list)
  • A listing of the microarrays used.
  • A listing of the markers used.


(The remote ("grid") version of ARACNe may not report all of these parameters).


An example of the Dataset History, omitting the lists of arrays and markers, is:

Analysis started at: 2014-04-17 17:07:16
Generated with ARACNE run with paramters:
[PARA] Input file:    Bcell-100.exp
[PARA] Output file:   Bcell-100_k0.174_t0.062_e0.0.adj
[PARA] Algorithm:     ADAPTIVE_PARTITIONING
[PARA] Mode:     DISCOVERY
[PARA] No. bins:      6
[PARA] MI threshold:  0.06164290017653682
[PARA] MI threshold calculated from P-Value:  0.009999999776482582
[PARA] DPI tolerance: 0.0
[PARA] Bootstrapping: 1
[PARA] DPI Target List: 
[PARA] Merge multiple probesets: no
[PARA] Setting for Hub Markers: From Sets: Selection
[PARA] Hub markers: 35158_at
Generated with ARACNE run with data:
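If the Dataset History text is copied out of geWorkbench, the "[PARA]" lines shown above can be turned into a dictionary for record keeping. This is a hypothetical helper based only on the layout of the example; the function name is made up, and other geWorkbench versions may format the history differently.

```python
def parse_history(text):
    """Collect '[PARA] key: value' lines from a Dataset History listing."""
    params = {}
    for line in text.splitlines():
        if not line.startswith("[PARA]"):
            continue                      # skip headers and array/marker listings
        # Split at the first colon only, so values may themselves contain colons.
        key, _, value = line[len("[PARA]"):].partition(":")
        params[key.strip()] = value.strip()
    return params

history = """Analysis started at: 2014-04-17 17:07:16
[PARA] Algorithm:     ADAPTIVE_PARTITIONING
[PARA] Mode:     DISCOVERY
[PARA] DPI tolerance: 0.0
[PARA] DPI Target List: 
[PARA] Setting for Hub Markers: From Sets: Selection
[PARA] Hub markers: 35158_at
"""
params = parse_history(history)
print(params["Algorithm"], params["Setting for Hub Markers"])
```

All values are kept as strings, so numeric fields such as the MI threshold would need to be converted by the caller.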

Example of running ARACNe

This example uses the Bcell-100.exp dataset available in the data/public_data directory of geWorkbench, and further described on the Download page. Briefly, this dataset is composed of 100 Affymetrix HG-U95Av2 arrays on which various B-cell lines, both normal and cancerous, were analyzed. Thus it explores a potentially wide variety of expression phenotypes.

Prerequisites

  1. (Optional) Obtain the annotation file for the HG-U95Av2 array type from the Affymetrix NetAffx website (http://www.affymetrix.com/analysis/index.affx). The name will be similar to "HG_U95Av2.na31.annot.csv", where na31 is the version number. Loading the annotation file associates gene names and other information with the Affymetrix probeset IDs (see the geWorkbench FAQ for details on obtaining these files).
  2. Download the file Mapk_list.csv to your computer.
  3. Load the Bcell-100.exp dataset into geWorkbench as type "Affymetrix File Matrix". (See Local Data Files).
  4. When prompted, and if desired, load the annotation file.

Hub markers can either be loaded directly into the ARACNe component, as described below, or can first be loaded into the Markers component as a new set, and then this set used to specify the hubs.


Setting up the parameters and starting ARACNe

1. In the geWorkbench commands area, select the "ARACNe" analysis.

2. Set "Hub Markers" source to be "From File".

3. Press "Load Markers" and load the file "Mapk_list.csv" into "Hub Markers".

4. Set up the ARACNe parameters as shown

  • Mode: Discovery
  • Algorithm: Adaptive Partitioning.
  • Threshold Type: P-value
  • Threshold Value: 0.01 (default)
  • Correction: Bonferroni
  • DPI Tolerance: Do Not Apply

ARACNe Parameters Example.png

5. Press the "Analyze" button to launch the job.

6. The resulting network is the one depicted above in the "Viewing ARACNe results" section.

Technical Notes

  • The set of hub markers, if supplied, is the same as using the "-s" subnet option with the standalone ARACNe executable. "... a list of probes for which a subnetwork will be constructed".
  • The DPI target list, if supplied, is the same as using the "-l" transcription factor option with the standalone ARACNe executable. "...a list of probes annotated as transcription factors in the input dataset".
    • "This option is ideal for transcriptional network reconstruction. If provided, DPI will not remove any connection of a transcription factor (TF) by connections between two probes not annotated as TFs." (However, this option has no effect if the list of hubs is already limited to transcription factors.)
  • If DPI is applied, the recommended tolerance value is now zero. Previously, values up to 0.15 were recommended.
  • It is recommended to always use bootstrapping when reconstructing transcriptional networks. For larger networks, it may be necessary to use the stand-alone version of ARACNe on a computational cluster to carry this out.
  • In the stand-alone version of ARACNe, preprocessing is done using separate Matlab scripts. These steps have been directly incorporated into the Java version of ARACNe used in geWorkbench, ARACNe2.
  • Running with too few arrays can cause NaNs in the Adaptive Partitioning preprocessing step, and a NullPointerException in Fixed Bandwidth (Mantis issue 2030).
  • The results of an ARACNe run depend to a small extent on the particular order of the arrays in the dataset - that is, reshuffling the arrays can give a numerically different MI. Some actions in geWorkbench can also change the order of the arrays used in the calculation, e.g. compare the case where all arrays are used without array sets activated with the cases where the same arrays are used, but as members of activated array sets. A small difference in result may be seen.

References