Difference between revisions of "Pattern Discovery"

(New figures)

Revision as of 11:46, 12 February 2010





Outline

This tutorial describes how to run a pattern discovery algorithm on a set of protein sequences. The steps include:

  • Setting parameters
  • Creating a new session
  • Running the job
  • Viewing the results

Overview

The geWorkbench Pattern Discovery module uses an algorithm called SPLASH (Califano, 2000) to search for common patterns in sets of protein or DNA sequences. Such a search can be used, for example, to find common structural or regulatory elements in otherwise unrelated sequences.


(Note - there is currently no provision for filtering out repeated sequences from genomic sequence. Results on DNA sequences should be evaluated in this light.)

Component layout

T PatternDiscovery Params Basic initial.png


Top-level run controls

T PatternDiscovery Run Controls.png

  • Norm - select the "normal" pattern discovery algorithm.
  • Exhaustive - select the "exhaustive" pattern discovery algorithm. This takes much more time.
  • Curling arrow - start a new pattern discovery run.
  • Stop sign - stop the current pattern discovery run.
  • File folder - load a pattern file.
  • Progress bar - To the right of the control buttons is a progress bar that reports on run progress.

Setting parameters

Several parameters can be adjusted to control the sensitivity of the search.

Basic tab

T PatternDiscovery Params Basic only.png


  • Support - Can be input in number of sequences or in % of sequences containing a given motif.
    • Support Percent % - The pattern must appear in at least the given percentage of sequences.
    • # Support Sequences - The pattern must appear in at least the given number of sequences.
    • # Support Occurances - The pattern must occur at least the given number of times in the set of sequences (can be more than once per sequence).

T PatternDiscovery Params Basic support options.png
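
The three support measures above can be illustrated with a small sketch. It is not geWorkbench or SPLASH code: the function name, the toy sequences, and the use of a Python regular expression (with "." as the wild-card) to stand in for a discovered motif are all assumptions made for illustration.

```python
import re

def support(pattern: str, sequences: list[str]) -> dict[str, float]:
    """Summarise how widely a regex-style motif occurs in a set of sequences."""
    regex = re.compile(pattern)
    # Number of non-overlapping matches found in each sequence.
    per_sequence = [len(regex.findall(seq)) for seq in sequences]
    n_sequences = sum(1 for count in per_sequence if count > 0)
    return {
        "support_sequences": n_sequences,          # sequences with at least one hit
        "support_occurrences": sum(per_sequence),  # total hits, all sequences
        "support_percent": 100.0 * n_sequences / len(sequences),
    }

# Toy sequences, invented for this example only.
toy_sequences = ["MKAPKKAAA", "TTKAPKKGGKAAKK", "GGGGSSSS", "PKAAKKL"]
result = support("KA.KK", toy_sequences)
print(result)
```

Here the motif occurs in three of the four toy sequences, and twice in one of them, so the occurrence count is higher than the sequence count. A search with "# Support Sequences" set to 4 would reject this motif, while one with "# Support Occurances" set to 4 would accept it.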


  • Min Tokens - The minimum number of characters in a discovered motif.
  • Density Window - A sliding window in which at least the number of tokens set in "Density Tokens" must be found.
  • Density Tokens - The minimum number of matching characters within the "Density Window".
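
The interplay of "Density Window" and "Density Tokens" can be sketched as follows. This is one plausible reading of the two parameters, not the SPLASH implementation: it assumes that "." marks a wild-card position, that windows are anchored at non-wild-card characters, and that a motif shorter than the window is checked as a whole. The function name is hypothetical.

```python
def satisfies_density(pattern: str, window: int, min_tokens: int,
                      wildcard: str = ".") -> bool:
    """Check that every full window starting on a token holds enough tokens."""
    is_token = [ch != wildcard for ch in pattern]
    if len(pattern) < window:
        # Assumption: a motif shorter than one window is judged as a whole.
        return sum(is_token) >= min_tokens
    for start in range(len(pattern) - window + 1):
        if not is_token[start]:
            continue  # windows are anchored on tokens, not on wild-cards
        if sum(is_token[start:start + window]) < min_tokens:
            return False
    return True

print(satisfies_density("KA.KK", window=5, min_tokens=4))     # dense motif
print(satisfies_density("K...K..A", window=5, min_tokens=3))  # sparse motif
```

Under this reading, a larger "Density Tokens" value (or a smaller window) rejects motifs with long runs of wild-cards, while relaxing either setting allows sparser motifs.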

Exhaustive tab

Parameters specific to the "exhaustive" search algorithm can be set in this tab.

T PatternDiscovery Params Exhaustive.png

  • Dec. support (%) -
  • Dec. density support -
  • Min support -
  • Min pattern number -

Limits

T PatternDiscovery Params Limits.png

  • Max Pattern Number -
  • Max run time (sec) -

Advanced tab

T PatternDiscovery Params Advanced.png

  • Exact Only -
  • Count Sequences -
  • Similarity matrix choice (default BLOSUM50) -
  • Similarity threshold -


Run Pattern Discovery

  • Clicking the button with the curling arrow icon brings up the session creation box, shown below.
  • A user name must be entered, but it can be any name.


T PatternDiscovery SessionConnect.png


  • Push Create to start the search.

The progress bar shows the successive stages of the search.

T PatternDiscovery RunProgress.png


Viewing results

The result of the search can be viewed both in the Pattern Discovery module itself and in other sequence viewer modules such as "Sequence" and "Promoter". In Pattern Discovery the results are returned in a table, and the hits for the motif(s) selected in the table will be displayed superimposed on the sequences above. In this picture, the results were first sorted by the number of tokens.

T PatternDiscovery Params Basic histone result exact.png

Adding results to the Projects Folder

The results of a run of Pattern Discovery are automatically placed in the Project Folder:


T PatternDiscovery ProjFolder.png


Options to Save or Mask Patterns

Several operations are possible on the returned patterns. The options menu can be seen by right-clicking on a selection of one or more returned patterns.

1. The patterns can be saved, either with their positions on the original sequences, or just as regular expressions.

2. The patterns can be masked out of the query sequence.

The options shown in the picture below are:

Mask Pattern - The selected pattern(s) will be masked out of the sequence for future searches.

Unmask All Patterns - Undo the masking.

Save Patterns (Regex Only) - This will save the selected pattern(s) in the form of regular expressions, that is, letters and wild-card characters.

Save Selected Patterns - This will save both the selected pattern(s) and their hits to the query sequences. The locations (positions on the query sequences) saved are specific to the particular input file used. The name of this file is saved in the pattern file.

Save All Patterns - This will save all of the patterns together with their hits to the query sequences. The locations (positions on the query sequences) saved are specific to the particular input file used. The name of this file is saved in the pattern file.

Add Patterns to Project - This will add the patterns found as a data node in the Projects Folder, with the position information on the input sequences.
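
To make the difference between a pattern saved as a regular expression and a pattern saved with its hits more concrete, the sketch below locates a regex-style pattern in a set of sequences and then masks it out. It is an illustration only: the layout of the geWorkbench pattern file is not described here, and the function names, the 0-based positions, and the mask character "X" are assumptions.

```python
import re

def find_hits(pattern: str, sequences: dict[str, str]) -> list[tuple[str, int, str]]:
    """Return (sequence name, 0-based start, matched text) for every hit."""
    regex = re.compile(pattern)
    return [
        (name, match.start(), match.group())
        for name, seq in sequences.items()
        for match in regex.finditer(seq)
    ]

def mask_pattern(pattern: str, sequence: str, mask_char: str = "X") -> str:
    """Replace every hit of the pattern with a run of mask characters."""
    return re.sub(pattern, lambda m: mask_char * len(m.group()), sequence)

# Toy sequences, invented for this example only.
toy = {"seq1": "MKAPKKAAA", "seq2": "PKAAKKL"}
hits = find_hits("KA.KK", toy)
masked = mask_pattern("KA.KK", toy["seq1"])
print(hits)
print(masked)
```

The regular expression alone ("KA.KK") can be applied to any sequence set, whereas the list of hits is meaningful only for the input file it was computed from, which is why the saved hits record that file's name.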


T PatternDiscovery SavePatternsMenu.png

Logical complexities in the display...

  1. The other sequence display components, including Sequence, Promoter, and Position Histogram, are only available when the parent sequence object is selected in the Project Folder. The results of the Pattern Discovery run are still available in that component.
  2. Selecting another sequence object will cause the Pattern Discovery component to be cleared. However, the results can be reloaded by again selecting the Pattern Discovery result in the Project Folder.
  3. Pattern Discovery results can only be displayed in the context of the sequences from which they were derived.

References

Califano, A. (2000). SPLASH: structural pattern localization analysis by sequential histograms. Bioinformatics, 16(4):341-57.


Example Pattern Discovery run

1. We will use a file containing a number of histone sequences, H1H5_HistoneDB_NHGRI.fasta.

2. We will try parameters set to allow longer matches. No changes from default are made in the other parameter tabs. In particular, this search uses exact matching of the sequence letters, without substitutions.



T PatternDiscovery Params Basic histone run.png


3. The result (sequence line view):

T PatternDiscovery Histones Result.png

4. The result displayed in full sequence view:

T PatternDiscovery Params Basic histone result exact seqs.png


5. We can also change the Advanced Parameters to use the BLOSUM50 similarity matrix rather than requiring exact matches.

T PatternDiscovery Params Advanced BLOSUM.png


6. The result in the line view:

T PatternDiscovery Params Basic histone result blosum50.png

7. The result in full sequence view.

T PatternDiscovery Params Basic histone result blosum50 seqs.png
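
The difference between the exact search (steps 2 to 4) and the similarity-matrix search (steps 5 to 7) can be sketched as below. This is not how geWorkbench or SPLASH is implemented. The scores are toy values, not BLOSUM50 entries, and applying the threshold separately at each position is an assumption, since the tutorial does not describe how "Similarity threshold" is used.

```python
# Toy similarity scores, invented for illustration (NOT real BLOSUM50 values).
IDENTITY_SCORE = 4
MISMATCH_SCORE = -2
TOY_SIMILAR = {frozenset("KR"): 2, frozenset("DE"): 2, frozenset("IL"): 2}

def score(a: str, b: str) -> int:
    """Toy residue similarity: identical > listed similar pair > everything else."""
    if a == b:
        return IDENTITY_SCORE
    return TOY_SIMILAR.get(frozenset((a, b)), MISMATCH_SCORE)

def matches(motif: str, segment: str, threshold=None, wildcard: str = ".") -> bool:
    """Exact matching when threshold is None, else per-position similarity."""
    if len(motif) != len(segment):
        return False
    for m, s in zip(motif, segment):
        if m == wildcard:
            continue  # wild-card positions accept any residue
        if threshold is None:
            if m != s:
                return False
        elif score(m, s) < threshold:
            return False
    return True

print(matches("KA.KK", "RAPKK"))               # exact: K vs R is a mismatch
print(matches("KA.KK", "RAPKK", threshold=2))  # similarity: K and R are close
print(matches("KA.KK", "GAPKK", threshold=2))  # G is not similar to K
```

With similarity matching, residues that commonly substitute for one another count as hits, so a run using a similarity matrix would be expected to report hits that an exact search misses.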


Extra screenshots

T PatternDiscovery Params Basic.png