NCBC2008

Presentations for the NCBC meeting Aug. 13/14 2008

Introduction to geWorkbench

Welcome to the geWorkbench demo. geWorkbench is a cross-platform Java application that lets you perform a number of analyses on microarray and sequence data in an integrated fashion. It also supports several services developed through the caBIG program, including retrieval of gene annotations through caBIO, retrieval of microarray data from caArray, and running analysis routines remotely using caGrid-based services.

geWorkbench GUI

geWorkbench is divided into four functional areas. At upper left is the project management area, at lower left is the Marker and Array set management area, at lower right is the command area, where a number of analytical routines are available, and at upper right is the data viewing and manipulation area.

Other components

geWorkbench also allows you to work with protein and nucleotide sequences. For example, you can download the upstream sequence associated with markers of interest using the Sequence Retriever component, and then search it for known transcription factor binding sites. There is also an algorithm called Splash, which can be used to discover patterns in a set of sequences.

There are also a number of additional capabilities, including several made available through the Broad Institute’s GenePattern server, such as Principal Component Analysis and K-Nearest Neighbor analysis.

Preprocessing, basic classification of microarrays

  • preparation:
  1. open tflist.csv as text file
  2. ensure that relative is selected in preferences
  3. load caArray table: array.nci.nih.gov:8080; JaglaB; Robo45$x
  • Demonstration
  1. Connect to local version of caArray and demonstrate how one could retrieve data from there.
  2. Open bcell_mas5_254_filtered_noambig.exp as U95 chip set
  3. Show the online help where the formats are described (Project panel - Open Dataset)
  4. show visualizations:
    1. tabular microarray viewer
    2. Microarray viewer; (relative: each marker relative to the mean over all arrays and divided by the standard deviation)
    3. color mosaic;
    4. expression value distribution;
  5. threshold normalizer: set everything below 1 to 1
  6. Apply log 2 transformation
  7. Demonstrate the concept of marker and gene panels and how they affect various components.
    1. in Arrays/Phenotypes create a group for normal (N4-13...) and cancer (CB...)
    2. make cancer "Case"
  8. Demonstrate the group union/intersection/xor functions.
    1. Mark both, normal and cancer, apply union operation and call set all
  9. load marker set "tfs-071108.csv"
  10. Demonstrate the visual properties for marker and array groups. Show an example by
    1. activating both array groups, and
    2. using the scatter plot component to create a scatter plot between 2 markers.
    3. show all markers concept to override selection
    4. change color for an array set
  11. show the list of analysis components,
    1. point out PCA, KNN as Gene pattern components.
    2. show grid services for hierarchical clustering
  12. Marker Annotation (retrieve info from caBIO)
  13. Go to the t-test analysis and run with
    1. ensure that array sets are selected
    2. nothing selected in marker sets
    3. log 2 normalized checked,
    4. p-value cut-off at 5E-05,
    5. just Bonferroni (11 genes show up)
  14. activate significant genes
  15. find myc in marker sets and activate
  16. get the nucleotide sequence for myc through sequence retriever (upstream and downstream sequence from transcription initiation site)
    1. show that protein sequences can be downloaded from EBI
    2. add nucleotide sequence to project
  17. Promoter Panel (JASPAR: biologically validated motifs), PSSM, p-value calculated by randomizing the sequence
    1. select Profiles 3,4,5
    2. run analysis
    3. show sequence view
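The Promoter Panel step above mentions PSSM scoring with a p-value obtained by randomizing the sequence. The notes do not describe the exact procedure, so the following is only a minimal sketch of the general idea: score every window of a sequence against a log-odds matrix, then shuffle the sequence repeatedly and count how often a shuffled copy scores at least as well. The count matrix, the pseudocount, the uniform background, and all function names are illustrative assumptions, not JASPAR data or the geWorkbench implementation.

```python
import math
import random

# Hypothetical 4-position count matrix (one dict per motif position).
# Illustrative values only; not taken from JASPAR.
COUNTS = [
    {"A": 8, "C": 1, "G": 1, "T": 0},
    {"A": 0, "C": 9, "G": 0, "T": 1},
    {"A": 1, "C": 0, "G": 9, "T": 0},
    {"A": 0, "C": 1, "G": 0, "T": 9},
]


def log_odds(counts, pseudocount=0.5, background=0.25):
    """Turn counts into a log-odds PSSM (uniform background is an assumption)."""
    pssm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pssm.append({
            base: math.log2((column[base] + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pssm


def best_score(pssm, sequence):
    """Highest PSSM score over all windows (sequence must be >= motif width)."""
    width = len(pssm)
    return max(
        sum(pssm[i][sequence[start + i]] for i in range(width))
        for start in range(len(sequence) - width + 1)
    )


def shuffle_p_value(pssm, sequence, n_shuffles=1000, seed=0):
    """Empirical p-value: fraction of shuffled sequences scoring >= observed."""
    rng = random.Random(seed)
    observed = best_score(pssm, sequence)
    letters = list(sequence)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        if best_score(pssm, "".join(letters)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_shuffles + 1)


pssm = log_odds(COUNTS)
score, p_value = shuffle_p_value(pssm, "GGACGTGGTTAACC", n_shuffles=200, seed=1)
```

Shuffling preserves the base composition of the sequence, so the p-value reflects how surprising the best match is given that composition. Other null models (for example, preserving dinucleotide frequencies) would give different values.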

...

  1. if time permits list components that can work with sequences.
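The normalization steps in the demonstration above (threshold normalizer, log 2 transformation, and the "relative" display in the Microarray viewer) can be sketched in a few lines. This is an illustration of the arithmetic only, with hypothetical function names. It assumes the relative view uses the sample standard deviation and that the Bonferroni option compares each p-value to alpha divided by the number of tests; the notes do not confirm either detail.

```python
import math
import statistics


def threshold_floor(values, minimum=1.0):
    """Set every value below `minimum` to `minimum` (step 5 above)."""
    return [max(v, minimum) for v in values]


def log2_transform(values):
    """Log 2 transformation (step 6); values must be positive."""
    return [math.log2(v) for v in values]


def relative(values):
    """One marker across all arrays: (value - mean) / standard deviation.

    Uses the sample standard deviation, which is an assumption, and
    requires at least two arrays with non-identical values.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def bonferroni_cutoff(alpha, n_tests):
    """Per-test significance cutoff under a plain Bonferroni correction."""
    return alpha / n_tests


# One hypothetical marker measured on four arrays.
raw = [0.5, 2.0, 8.0, 32.0]
logged = log2_transform(threshold_floor(raw))  # [0.0, 1.0, 3.0, 5.0]
scaled = relative(logged)
```

Applying the threshold before the logarithm matters: values at or below zero have no logarithm, and values between 0 and 1 would otherwise become negative.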

Master Regulator, Aracne, multiple microarray experiments

  1. introduce the 4 regions of geWorkbench:
    1. project panel
    2. visualization panel
    3. marker/array set panel
    4. analysis panel
  2. Load bcell_mas5_20_filtered_noambig.txt as tab delimited file and U95 chip set
  3. show tabular microarray viewer
  4. threshold normalizer 1
  5. log2 normalize
  6. Show dataset History => data has been filtered and normalized
  7. Show some visualization components:
    1. Tabular view.
    2. Scatter plot.
    3. Gene expression profiles.
  8. Demonstrate the group union/intersection/xor functions.
    1. Mark both, normal and cancer, apply union operation and call set all
  9. load marker set "tfs-071108.csv"
  10. Cellular Network Knowledge Base (CNKB). The CNKB is a repository of interactions between pairs of proteins; these interactions can be computationally or experimentally derived. It captures both direct physical interactions and indirect transcriptional relationships (where the interaction is between a transcription factor and its transcriptional target gene). For the purposes of the present document, an interaction is assumed to be a binary relationship between two genes. Each interaction has an associated confidence indicator (a value between 0 and 1) characterizing our level of confidence in the method aggregate (computational or experimental) used to derive the interaction.
    1. find "v-myc" (1973_s_at) and BCL3 in marker list, activate
      1. v-myc is the viral homolog of c-myc, an important transcription factor in the regulation of B-cell activities and the rapid growth of cancer cells
      2. BCL3: this protein functions as a transcriptional co-activator that activates through its association with NF-kappa B homodimers; it is implicated in B-cell leukemia
    2. Marker Annotations retrieve annotations
    3. CNKB;
      1. Refresh
      2. => the green graph indicates a number of protein-protein interactions at a given confidence level. We assign confidence levels based on computational and other models in the database.
      3. activate protein-protein
      4. set threshold to 0.5 using slider
      5. create network
      6. rename adjacency matrix to "v-myc BCL3"
      7. mark all proteins (ctrl a)
      8. rename "selected Genes" to "v-myc BCL3"

=> this is one way of creating an interaction network
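The CNKB steps above (activate protein-protein interactions, set the threshold to 0.5, create a network) amount to filtering interaction records by type and confidence and collecting the survivors into an adjacency structure. The sketch below illustrates that idea only. The record layout, the partner names, and the confidence values are hypothetical and do not come from the CNKB.

```python
from collections import defaultdict

# Hypothetical records: (gene_a, gene_b, interaction_type, confidence).
INTERACTIONS = [
    ("MYC", "PARTNER_1", "protein-protein", 0.9),
    ("MYC", "PARTNER_2", "protein-protein", 0.4),
    ("BCL3", "PARTNER_3", "protein-protein", 0.7),
    ("MYC", "PARTNER_4", "protein-dna", 0.8),
    ("PARTNER_1", "PARTNER_3", "protein-protein", 0.95),
]


def build_network(interactions, seeds, kind="protein-protein", threshold=0.5):
    """Keep interactions of one type that touch a seed gene and meet the threshold."""
    adjacency = defaultdict(dict)
    for gene_a, gene_b, interaction_type, confidence in interactions:
        if interaction_type != kind or confidence < threshold:
            continue
        if gene_a not in seeds and gene_b not in seeds:
            continue
        # Store the edge in both directions: the relationship is binary and undirected here.
        adjacency[gene_a][gene_b] = confidence
        adjacency[gene_b][gene_a] = confidence
    return {gene: dict(neighbors) for gene, neighbors in adjacency.items()}


network = build_network(INTERACTIONS, seeds={"MYC", "BCL3"})
```

With these made-up records, only the MYC/PARTNER_1 and BCL3/PARTNER_3 edges survive: one record is below the threshold, one has the wrong type, and one does not involve a seed gene.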

  1. Gene Ontology
    1. The annotation we loaded contains the quantin

last number shows how many markers in the annotation file (here the U95 annotation file) have been annotated with a given GO annotation. The left number shows the number of markers from the selection with the same property.

    1. remove any unwanted markers from previous steps in Selection.
    2. activate "v-myc BCL3"
    3. Function - Map List
    4. add to set "selected and descendants" under binding - nucleotide binding - dna binding - transcription factor activity
    5. combine all marker set genes to new set called "v-myc interest"
    6. save marker set to desktop "v-myc interest.csv"
    7. show table view
      1. sort by p-value: list of hits gives you a sense of the biology that might be involved.
  1. ARACNE (Algorithm for the Reconstruction of Accurate Cellular Networks). ARACNE is an information-theoretic algorithm used to identify transcriptional interactions between gene products using microarray expression profile data. Similar to other algorithms, ARACNE predicts potential functional associations among genes, or novel functions for uncharacterized genes, by identifying statistical dependencies between gene products. Once a microarray set consisting of mRNA expression profiles under diverse phenotypic conditions has been assembled, ARACNE generates a putative transcriptional network in two computational steps. First, gene pairs that exhibit correlated transcriptional responses across a diverse set of microarrays are identified by measuring the mutual information (MI) between their mRNA expression profiles. Key elements in this step are the determination of the parameters for the computation of the mutual information (i.e., the kernel width of the estimator) and of the MI threshold for statistical independence. In the second step, ARACNE eliminates those statistical dependencies that may be of an indirect nature, such as between two genes that are separated by intermediate steps in a transcriptional cascade. Such genes will likely have correlated expression profiles, resulting in high MI, and may otherwise be selected as candidate interacting genes. Indirect interactions are eliminated by applying a well-known property of mutual information called the Data Processing Inequality (DPI). Given a transcription factor, application of the DPI, under appropriate assumptions, will thus generate predictions about which other genes may be its direct transcriptional targets or its upstream transcriptional regulators. The final result is a matrix of candidate interactions, also called an adjacency matrix, which can be used for further network visualization and analysis.
    1. Hub Marker(s): If the value of this drop-down is “All vs. All” then MI values are computed for all pairs of eligible markers (see below for a definition of eligible markers). If instead the value is “List” the user can manually enter a list of comma separated marker ids (or load such a list from a CSV file). In that case the MI values are computed only for pairs of markers (A, B) where A comes from the user defined list and B comes from the set of eligible markers.
    2. v-myc
    3. select "all markers" and "all arrays" This is to get a base-line of all genes in our experiments that are somehow relevant to v-myc

=> ARACNE usually needs at least about 100 arrays, so the results produced here with this small demo dataset are not meaningful

    1. Run ARACNE with a threshold type mutual inf = 0.3 (all other parameters remain the same).

=> another way to create an interaction network
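The two ARACNE steps described above (MI thresholding, then DPI pruning) can be illustrated with a toy sketch. It is not the ARACNE implementation: it replaces the kernel MI estimator mentioned in the text with a simple equal-frequency binned estimator (in nats), it applies the DPI with no tolerance, and the gene names and expression values are made up.

```python
import math
from collections import Counter
from itertools import combinations


def discretize(values, n_bins=3):
    """Equal-frequency binning by rank (a stand-in for the kernel estimator)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins


def mutual_information(x, y, n_bins=3):
    """Plug-in MI estimate (in nats) between two binned expression profiles."""
    bx, by = discretize(x, n_bins), discretize(y, n_bins)
    n = len(bx)
    px, py, pxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    return sum(
        (count / n) * math.log((count / n) / ((px[i] / n) * (py[j] / n)))
        for (i, j), count in pxy.items()
    )


def aracne_sketch(profiles, mi_threshold, n_bins=3):
    """Step 1: keep pairs with MI >= threshold. Step 2: prune with the DPI."""
    mi = {}
    for a, b in combinations(sorted(profiles), 2):
        value = mutual_information(profiles[a], profiles[b], n_bins)
        if value >= mi_threshold:
            mi[frozenset((a, b))] = value
    # DPI: in each fully connected triplet, mark the weakest edge as indirect.
    # Every triplet is checked against the unpruned edge set; ties are broken arbitrarily.
    indirect = set()
    for a, b, c in combinations(sorted(profiles), 3):
        edges = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
        if all(edge in mi for edge in edges):
            indirect.add(min(edges, key=lambda edge: mi[edge]))
    return {edge: value for edge, value in mi.items() if edge not in indirect}


# Made-up profiles over nine arrays; G2 sits "between" G1 and G3.
PROFILES = {
    "G1": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "G2": [1, 2, 4, 3, 5, 6, 7, 8, 9],
    "G3": [1, 2, 4, 3, 5, 7, 6, 8, 9],
}
edges = aracne_sketch(PROFILES, mi_threshold=0.3)
```

In this toy data all three pairs pass the MI threshold, but the G1/G3 dependency is the weakest of the triplet, so the DPI step removes it and leaves the chain G1, G2, G3.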

  1. Cellular Network Knowledge Base is another method that was introduced in the previous presentation.
  1. Master regulator The goal of the Master Regulator Analysis (MRA) is to identify which transcription factors (TFs) control the regulation of a set of target genes (TGs) that demonstrate significant differential expression across 2 cellular phenotypes, for which microarray gene expression profiles (GEP) are available (“Cases”, “Controls”).
    1. we have a network of interactions (BCi.affyIDs.adj)
      1. that is a pregenerated network from CNKB
    2. load the selection of transcription factors
    3. p-value cutoff = 0.005
    4. select all markers
    5. t-distribution
    6. Adjusted Bonferroni
    7. Equal
    8. 59 is the subset of markers that are differentially expressed in the regulon. 270 is the number of markers in the regulon

Using the master regulator component, we can identify the transcription factors, and their networks of interaction partners, that are most implicated in the biological question at hand. The analysis is based on Fisher's exact test. The bottom line represents the t-test values for all the genes in the microarray set, ordered from highest to lowest: up-regulated on the left, down-regulated on the right.
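The Fisher's exact test behind the regulon counts above (for example, 59 differentially expressed markers in a regulon of 270) can be sketched as a hypergeometric tail sum. This is a minimal illustration that assumes a one-sided enrichment test; the notes do not say which variant the component uses. The function name is hypothetical, and the total marker count and number of significant markers must be supplied by the caller.

```python
from math import comb


def fisher_enrichment_p(overlap, regulon_size, n_significant, n_markers):
    """One-sided Fisher exact p-value.

    Probability of seeing at least `overlap` differentially expressed markers
    in a regulon of `regulon_size`, when `n_significant` of the `n_markers`
    markers on the array are differentially expressed overall.
    """
    upper = min(regulon_size, n_significant)
    total = comb(n_markers, regulon_size)
    tail = sum(
        comb(n_significant, k) * comb(n_markers - n_significant, regulon_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Small worked case: 10 markers, 4 significant, regulon of 3, overlap of 2.
# (C(4,2)*C(6,1) + C(4,3)*C(6,0)) / C(10,3) = 40 / 120
p_small = fisher_enrichment_p(2, 3, 4, 10)
```

For the counts quoted above, the call would look like `fisher_enrichment_p(59, 270, n_significant, n_markers)`, with the last two arguments taken from the t-test results and the array size.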

structure analysis, pudge, skyline, MarkUs

  1. load myc.fasta
  2. run pudge
  3. run Skyline
    1. SCRIPT FLOWCHART
    2. make output directories MODELS and ANALYSIS in the PWD
    3. parse the PDB file:
      1. run DSSP to determine secondary structure
      2. extract sequence (replace CAS by CYS, and MSE by MET)
    4. run one round of BLAST with the PDB sequence and attempt to find the identifier of the full length sequence; if successful, report the offset of the structure to the full-length sequence
    5. run PSI-BLAST (if instructed)
    6. parse the PSI-BLAST output file:
      1. make a report table with e-values, sequence identities, parts of the query/hit sequences aligned, the nr-database identifiers, species and gi numbers for the hits. If a hit's species is not reported in the PSI-BLAST output, its full sequence is used in a one-round BLAST search without a filter to try and pull out a 100% identical hit and transfer the species name. If this doesn't work, the species is left unknown
      2. write all aligned fragments into a FASTA file
      3. use hits' database identifiers to pull out their full length sequences and report those sequences into a FASTA file
    7. remove redundant fragments, i.e. the ones which: a) belong to the same species, b) are more than 95% identical within their regions aligned with the PDB sequence. During this step, a directory called REDUNDANCY is created, which consists of subdirectories named by species that contain files with pairwise alignments of the sequences. From each cluster, take one fragment (the one whose full-length sequence is longest) to be the representative in the non-redundant set. The other sequences are assumed to be (non)modelable in the same way as the cluster representative
    8. make 4 ClustalW alignments of the sequences in the non-redundant set:
      1. only fragments aligned to the PDB sequence
      2. only fragments aligned to the PDB sequence, directed by the DSSP-assigned secondary structure elements for the PDB sequence
      3. full-length sequences
      4. full-length sequences, directed by the DSSP-assigned secondary structure elements for the PDB sequence
    9. from the PSI-BLAST pairwise alignments, construct multiple alignments of the non-redundant aligned fragments
    10. make models for the non-redundant fragments and place all the data in the subdirectories of MODELS
    11. run ProsaII on the models and pick one with the best Z-score
    12. for the best model, calculate the pG score from the Z-score
    13. renumber the best model to start from the residue specified in the PSI-BLAST alignment
    14. report all model data (Z-score, pG score, regions in the ProsaII profile above zero) in a file
    15. make a summary output file and report statistics. The pG score of the redundant cluster representative, which was actually modelled, is transferred to all other sequences from the cluster
    16. OUTPUT
      1. *.log log file recording steps and times
      2. *.dssp dssp output on the input structure
      3. *.ss secondary structure assignment for the input structure (based on the DSSP output)
      4. *.faa sequence extracted from the input PDB file (i.e. PDB sequence)
      5. *.out BLAST output file for the PDB sequence
      6. *.ali PSI-BLAST alignment (input structure or similar structures)
      7. *.fragment.fsa sequence fragments of the PSI-BLAST hits aligned to the PDB sequence (input structure or similar structures)
      8. *.full.fsa full-length sequences of the PSI-BLAST hits (input structure or similar structures)
      9. *.hits.dat summary table of PSI-BLAST hits (input structure or similar structures)
      10. *.redundant clusters of redundant sequences
      11. *.fragment.ali ClustalW alignment 7a
      12. *.fragment.ali.ss ClustalW alignment 7b
      13. *.full.ali ClustalW alignment 7c
      14. *.full.ali.ss ClustalW alignment 7d
      15. *.gap multiple alignment extracted from the pairwise PSI-BLAST alignments (with gaps in the PDB sequence)
      16. *.nogap multiple alignment extracted from the pairwise PSI-BLAST alignments (without gaps in the PDB sequence)
      17. *.models information on the models (pG, Z-score, e-value, sequence id, modelled region, misfolded regions)
      18. *.leverage summary output file
      19. *.comparison comparison of the hit lists among the input structure and any homologous structures found by PSI-BLAST
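The redundancy-removal step of the Skyline flowchart above (same species, more than 95% identity within the aligned region, keep the hit with the longest full-length sequence) can be sketched as follows. The script's actual clustering method is not described, so this sketch assumes single-linkage clustering via union-find. It also assumes that every fragment is given in PDB-sequence coordinates with `-` for gaps. All field names and example hits are hypothetical.

```python
def identity(a, b):
    """Fraction of identical residues over positions where both fragments are aligned."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def representatives(hits, cutoff=0.95):
    """Cluster redundant hits and return one representative per cluster."""
    parent = list(range(len(hits)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(hits)):
        for j in range(i + 1, len(hits)):
            same_species = hits[i]["species"] == hits[j]["species"]
            if same_species and identity(hits[i]["aligned"], hits[j]["aligned"]) > cutoff:
                parent[find(j)] = find(i)

    clusters = {}
    for i, hit in enumerate(hits):
        clusters.setdefault(find(i), []).append(hit)
    # The representative is the member with the longest full-length sequence.
    return [max(members, key=lambda h: h["full_length"]) for members in clusters.values()]


# Hypothetical hits; "aligned" is the fragment in PDB-sequence coordinates.
EXAMPLE_HITS = [
    {"name": "h1", "species": "human", "aligned": "ACDEFGHIKLMNPQRSTVWY", "full_length": 300},
    {"name": "h2", "species": "human", "aligned": "ACDEFGHIKLMNPQRSTVWY", "full_length": 450},
    {"name": "h3", "species": "mouse", "aligned": "ACDEFGHIKLMNPQRSTVWY", "full_length": 200},
    {"name": "h4", "species": "human", "aligned": "YWVTSRQPNMLKIHGFEDCA", "full_length": 100},
]
kept = [hit["name"] for hit in representatives(EXAMPLE_HITS)]
```

Here h1 and h2 collapse into one cluster represented by h2, while h3 (different species) and h4 (different sequence) each stay on their own.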
  4. run MarkUs
    1. First Page
      1. The Analysis Panel on the left contains a brief summary of the protein structure and links to the structure and sequence analysis results.
      2. The function annotation server uses the AstexViewer™ as its structure viewer.
      3. The Structure Panel on the right lets you map structure features and annotations onto the protein structure.
      4. The Sequence Analysis section provides access to individual result pages of the applied sequence analysis methods.
        The amino acid conservation analysis is done by Consurf 1 and can be based on different multiple sequence alignments. In this example, sequence homologs have been collected by three PSI-BLAST 2 iterations using E-value cutoffs of 1e-10, 1e-3, and 1e-1, filtered by 90%, 80%, and 80% sequence identity, respectively. If the target sequence matches a Pfam 3 family, the Pfam seed and full alignments are additionally used for the conservation analysis.
        To map the conservation scores for one of these analyses onto the structure you have to select the analysis from the pull down menu. That will reload the summary page and the conservation scores for the selected analysis can be mapped using the Structure Panel.
        Detailed descriptions for the different analysis results can be browsed here: Conservation (Consurf 1), Multiple Sequence Alignments (Muscle 4, ClustalW 5), Sequence Homologs (PSI-BLAST 2), Domains (InterProScan 6).
      5. For structure visualization, mapping of pipeline results, and interactive manipulation, the AstexViewer™ 12 has been implemented.
      6. The checkboxes on the left of the properties panel can be used to turn on and off the entire protein surface or to display the molecular surfaces for binding sites identified by SCREEN.
      7. The D button maps the electrostatic potentials calculated by DelPhi onto the molecular surface of the entire protein. Blue and red represent positive and negative potentials, respectively.
      8. The C buttons map the amino acid conservation scores calculated by Consurf onto the molecular surface of the entire protein or of the cavities identified by SCREEN. The color spectrum from red to cyan represents decreasing conservation.
    2. SKAN
      1. This page is the central tool of the Function Annotation Server for analyzing structure-function relationships. It lets you browse structure homologs, highlight function annotations, filter by controlled vocabularies like the Gene Ontology, and define subsets for further analysis. The structure alignments are identified using the structure alignment methods Skan 7,8 and Dali 9.
      2. The Structure Alignment Page consists of three sections:
        1. The Structure Alignment Map is a graphical representation of structure homologs identified for the target structure.
        2. The Map Control Panel provides tools to manipulate the Structure Alignment Map by integrating various sources of function annotation. It consists of a menu to select annotation types and a context specific feature panel.
        3. The Structure Alignment Table lists the properties of the pairwise structure alignments and colors hits by SCOP families.
      3. The Structure Alignment Map (SAM) is a graphical representation of structure homologs identified by structure alignment tools like Skan or Dali for a given target structure. The ruler on top of the SAM represents the target sequence numbered sequentially.
        Above the ruler, target residues that form cavities identified by SCREEN are marked. The color spectrum from red to cyan corresponds to the Consurf scores and indicates decreasing conservation. Moving the mouse over a cavity residue opens a pop-up window showing the residue position, the conservation score, and the minimum and maximum conservation scores for the given analysis.
      4. The L fields indicate structure hits that contain Ligands like inhibitors, ions, etc. The information on ligands is extracted from the entire PDB entry. Thus hit structures can be marked as containing ligands even if these chemical compounds are not in contact with the protein structure. Moving the mouse over a 'L' button will open a pop up window listing the three letter code of the chemical compound, the chain the compound is associated with and the residue number. The three letter code is linked to the MSDchem database at the EBI.
      5. The A fields provide protein specific Annotations extracted from various 3rd party databases. Moving the mouse over an 'A' button will open a pop up window. After initially loading or reloading the page the annotation windows will only contain the PDB identifier linked to the Protein Data Bank and the corresponding UniProt accession code linked to UniProt. Using the SAM Control Panel further protein specific annotations can be loaded like Gene Ontology terms and Enzyme Nomenclature EC numbers.
      6. Mapping the GO terms will display the GO Terms control panel beneath the menu. Assigned GO terms can be browsed in the Annotation Pop Up window. To identify structures annotated with a certain GO term, the term can be specified in the text field next to the search button or by selecting one of the linked terms in the annotation pop up window (e.g. GO:0016491 oxidoreductase activity).
      7. A search colors the structure homologs according to their GO term relationship to the query, using the color code: Ancestor, Self Match, Child. The interface always reports the most specific term. Thus, if a structure is annotated with the query term (e.g. GO:0016491) but is also annotated with a more specific child term (e.g. GO:0009055), the structure is colored as a child hit. The hit term is marked in bold (e.g. electron carrier activity). The highlighted A field indicates the structure for which a GO term has been searched.
    3. The Cavities
      1. Surface-accessible cavities are predicted using SCREEN 10. The cavities page reports the cavities ranked according to the Random Forest classifier implemented in SCREEN. For each cavity, various properties are listed: the number of residues and the individual amino acids forming the cavity; geometrical properties like the surface area, the diameter, the largest and second largest moment of inertia (MOI), the average curvature, the average depth, and the maximum depth; composition properties like the sidechain entropy, the amino acid frequencies, and the percentages of secondary structure elements and of polar and non-polar residues; and physico-chemical properties derived using DelPhi 11, like the solvation energy, average charge, and electrostatic potential or field.
      2. The Druggability Index, that ranges from 0 (non-drug-binding cavity) to 1 (drug-binding cavity), represents a measure of the classifier's assessment of the cavity's druggability, with > 0.59 as threshold that yields 99% precision in predicting drug-binding cavities.
        Cavities can be mapped on the protein structure using the Structure Panel.
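The Druggability Index threshold described above can be applied in a couple of lines. The sketch assumes that cavities are ranked by the index itself, which the text implies but does not state outright. The cavity names and index values are made up.

```python
DRUGGABILITY_THRESHOLD = 0.59  # threshold quoted in the text above


def rank_cavities(cavities):
    """Sort cavities by druggability index (highest first) and flag likely drug-binding ones."""
    ranked = sorted(cavities, key=lambda cavity: cavity["index"], reverse=True)
    return [
        (cavity["name"], cavity["index"], cavity["index"] > DRUGGABILITY_THRESHOLD)
        for cavity in ranked
    ]


# Hypothetical cavities with made-up index values.
EXAMPLE_CAVITIES = [
    {"name": "cavity_1", "index": 0.12},
    {"name": "cavity_2", "index": 0.81},
    {"name": "cavity_3", "index": 0.59},
]
ranking = rank_cavities(EXAMPLE_CAVITIES)
```

Note that a cavity with an index of exactly 0.59 is not flagged, because the text gives the threshold as strictly greater than 0.59.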