Difference between revisions of "Pattern Discovery"

(Viewing results)
(Details as Hover Text)
 
(98 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
{{TutorialsTopNav}}
 
{{TutorialsTopNav}}
  
 
__TOC__
 
 
 
==Outline==
 
In this tutorial, running a pattern discovery algorithm on a set of protein sequences is described.  The steps include:
 
* Setting parameters
 
* Creating a new session
 
* Running the job
 
* Viewing the results
 
  
 
==Overview==
 
==Overview==
The geWorkbench '''Pattern Discovery''' module uses an algorithm called '''SPLASH''' (Califano, 2000) to search for common patterns in sets of protein or DNA sequences. This type of search could be used, for example, to search for common structural or regulatory elements in otherwise unrelated sequences.
 
 
 
 
('''Note''' - there is currently no provision for filtering out repeated sequences from genomic sequence.  Results on DNA sequences should be evaluated in this light.)
 
 
==Component layout==
 
 
The Pattern discovery component control area has three sections.
 
# Run controls are shown at top,
 
# the hit display area is in the center (empty in the figure below), and
 
# the parameter settings are at bottom.
 
  
[[Image:T_PatternDiscovery_Params_Basic_initial.png]]
+
Sequence Pattern Discovery is the process of identifying nucleotide or amino acid arrangements, also called motifs, that are enriched in a set of sequences. Such motifs may identify regions that have been preserved by evolution and which therefore may play a key functional or structural role. geWorkbench provides two modes of Sequence Pattern Discovery: Regular Discovery and Exhaustive Discovery.
  
==Top-level run controls==
+
Regular Discovery is based on the SPLASH algorithm (Califano, A., 2000); it generates a list of all regular expression patterns (motifs) that satisfy user-defined minimum support and minimum density criteria. The former determines the minimum number of times a pattern must occur in the sequence set to be reported. This can also be expressed as the minimum percent of sequences that must contain the pattern. The latter determines how sparse the pattern can be, in other words the minimum number of matching characters k (any character except for the dot character “.”) over a window of predefined length w.
  
[[Image:T_PatternDiscovery_Run_Controls.png]]
+
SPLASH-based motif discovery is extremely efficient and can process most large protein super-families in a few minutes on a conventional workstation. Discovery is uniquely effective in identifying sparse patterns using extremely low-density constraints, and the results obtained with Discovery can provide the core for a large number of more specific local alignments. 
  
* '''Norm''' - select the "normal" pattern discovery algorithm.
+
Exhaustive Discovery starts from a relatively high minimum support (e.g. patterns occurring in 75% of the sequences) and it progressively reduces the support, until a statistically significant pattern is discovered. Discovered patterns are reported and then masked in the sequence set so that they are no longer discovered. Then the process continues iteratively until the minimum support reaches a lower user-defined limit. Exhaustive Discovery, thus, produces a list of non-overlapping motifs in order of support.
* '''Exhaustive''' - select the "exhaustive" pattern discovery algorithm.  This takes much more time.
 
  
* '''Curling arrow''' - start a new pattern discovery run.
+
The Pattern Discovery component is available when a sequence (FASTA format) file has been loaded in the [[Workspace]], and its data node is selected.
* '''Stop sign''' - Cancel the current pattern discovery run
 
* '''File folder''' - load a pattern file
 
  
* '''Progress bar''' - To the right of the control buttons is a progress bar that reports on the current stage of discovery being executed.
 
  
 
==Setting parameters==
 
==Setting parameters==
Line 48: Line 22:
  
  
[[Image:T_PatternDiscovery_Params_Basic_only.png]]
+
[[Image:PatternDiscovery_Params_Basic.png]]
 +
 
  
 
Three options for specifying support are available via a pulldown menu:
 
Three options for specifying support are available via a pulldown menu:
* '''Support Percent %''' - The pattern must appear in at least given percentage of sequences.
+
* '''Support (Percent of Sequences)''' - The pattern must appear in at least the given percentage of sequences.
* '''# Support Sequences''' - The pattern must appear in at least the given number of sequences.
+
* '''Support (Number of Sequences)''' - The pattern must appear in at least the given number of sequences.
* '''# Support Occurrences''' - The pattern must occur at least the given number of times in the set of sequences (can be more than once per sequence).
+
* '''Support (Number of Occurrences)''' - The pattern must occur at least the given number of times in the set of sequences (can be more than once per sequence).
  
[[Image:T_PatternDiscovery_Params_Basic_support_options.png]]
 
  
 +
[[Image:PatternDiscovery_Params_Basic_support_options.png]]
  
* '''Min Tokens''' - The minimum number of characters in a discovered motif.
+
Additional options are:
* '''Density Window''' - A sliding window in which at least the number of tokens set in "Density Tokens" must be found.
+
* '''Minimum Tokens''' - The minimum number of tokens in a discovered motif.
* '''Density Tokens''' - the minimum number of matching characters within the "Density Window".
+
* '''Density Window''' - A sliding window in which at least the number of tokens set in "Density Window Min. Tokens" must be found.
 +
* '''Density Window Min. Tokens''' - the minimum number of matching full character tokens (not wildcards) within the "Density Window".
  
 
===Exhaustive tab===
 
===Exhaustive tab===
 
Parameters specific to the "exhaustive" search algorithm can be set in this tab.
 
Parameters specific to the "exhaustive" search algorithm can be set in this tab.
  
[[Image:T_PatternDiscovery_Params_Exhaustive.png]]
+
[[Image:PatternDiscovery_Params_Exhaustive.png]]
  
* '''Dec. support (%)''' - sets the size of intervals by which support level is decremented in successive searches (default is 5)
+
* '''Decrement support (%)''' - sets the size of intervals by which support level is decreased in successive searches (default is 5). The decrease is multiplicative, e.g. if one enters 5%, support will be reduced to 95% of its previous value at each step.
* '''Dec. density support''' - not used.
+
* '''Minimum Support (Number of Sequences)''' - sets the lower limit on the number of sequences that must contain a specific motif (default is 10). 
* '''Min support''' - sets the lower limit on the percentage of sequences that must contain a specific motif (default is 10%).
+
** '''Note''' - In versions of geWorkbench prior to 2.2.1, if a % sign was included in the minimum support text field, the percentage entered was applied to the initial minimum support value.  For example, if the initial minimum support was 70%, and "10%" is entered in this field, the final stopping value for the calculation would be 7%.
* '''Min pattern number''' - sets a lower limit on the number of motifs in a cluster.
+
* '''Minimum Pattern Number''' - sets a lower limit on the number of motifs in a cluster (point at which support decrease stops).
  
 
===Limits===
 
===Limits===
  
[[Image:T_PatternDiscovery_Params_Limits.png]]
+
[[Image:PatternDiscovery_Params_Limits.png]]
  
* '''Max Pattern Number''' - limits the number of patterns to discover.
+
* '''Max. Pattern Number''' - limits the number of patterns to discover. The actual upper limit to the number of patterns the server will return is 99,999.
* '''Max run time (sec)''' - limits search time.
 
  
 
===Advanced tab===
 
===Advanced tab===
  
[[Image:T_PatternDiscovery_Params_Advanced.png]]
+
* '''Exact Only''' (default checked) 
 +
** When checked, no substitution matrix will be used.  Exact character matches are required. 
 +
** When checked, the choices for similarity matrix and similarity threshold are disabled.
 +
** '''Note''' - "Exact only" should always be used for DNA as no DNA substitution matrix is provided.
 +
** When unchecked, the choices for similarity matrix and threshold are enabled.
  
* '''Exact Only''' (default checked) - When checked, no substitution matrix will be used.  Exact character matches are required.  "Exact only" should always be used for DNA as no DNA substitution matrix is provided.
 
* '''Count Sequences''' (default not checked, but see following) - intended to allow additional sorting options.  Will be automatically checked (selected) if the support option "Support percentage" or "# Support Sequences" is chosen in the basic parameters tab.
 
 
* '''Similarity matrix choice (default BLOSUM50)''' - Other choices are BLOSUM62 and BLOSUM100.
 
* '''Similarity matrix choice (default BLOSUM50)''' - Other choices are BLOSUM62 and BLOSUM100.
* '''Similarity threshold''' -
+
* '''Similarity threshold''' (default 2) - pairs of amino acids with a score higher than the specified threshold in the chosen BLOSUM similarity matrix are considered similar.  The input is restricted to integers.  (Note that the threshold is however stored as a double).
 +
 
 +
The Advanced tab when "Exact Only" is checked:
 +
 
 +
 
 +
[[Image:PatternDiscovery_Params_Advanced_Exact.png]]
 +
 
 +
 
 +
The Advanced tab when "Exact Only" is unchecked:
  
  
 +
[[Image:PatternDiscovery_Params_Advanced_Matrix.png]]
  
 
==Run Pattern Discovery==
 
==Run Pattern Discovery==
Line 95: Line 81:
  
  
[[Image:T_PatternDiscovery_SessionConnect.png]]
+
[[Image:PatternDiscovery_SessionConnect.png]]
  
 
===Discovery Session===
 
===Discovery Session===
Line 104: Line 90:
  
 
===Discovery Session Server===
 
===Discovery Session Server===
* '''Server and Port''': Columbia supports a Pattern Discovery server at splash.cu-genome.org, Port 80.
+
* '''Server and Port''': Columbia supports a Pattern Discovery server at splash.c2b2.columbia.edu, Port 80.
 
* '''Username''': Any name can be entered to identify the job.
 
* '''Username''': Any name can be entered to identify the job.
 
* '''Password''': none currently required.
 
* '''Password''': none currently required.
  
 +
 +
The progress bar will show the sequence upload:
  
  
 +
[[Image:PatternDiscovery_UploadProgressBar.png]]
  
  
The progress bar will show the several stages of a search:
+
and run steps:
* Uploading
 
* Processing seeds
 
* Discovering
 
* Collating
 
* Done
 
  
[[Image:T_PatternDiscovery_RunProgress.png]]
+
 
 +
[[Image:PatternDiscovery_RunningProgressBar.png]]
  
 
==Viewing results==
 
==Viewing results==
Line 129: Line 114:
 
Motifs found are listed in a table with the following columns:
 
Motifs found are listed in a table with the following columns:
  
* '''Hits''' - The total number of times a motif was found in the set of sequences, including any multiple hits in individual sequences.   
+
* '''Hits''' - The total number of times a motif was found in the sequence dataset, including any multiple hits in individual sequences.   
* '''Sequences Hit''' - The number of sequences which contained at least one occurence of the discovered motif.
+
* '''Sequences Hit''' - The number of sequences which contained at least one occurrence of the discovered motif.
* '''# of Tokens''' - The length of the discovered motif (characters)
+
* '''# of Tokens''' - the number of full-character tokens in the motif.
* '''ZScore''' - A measure of the probability of finding this motif by chance.
+
* '''ZScore''' - a measure of how often the motif would be found in a random set of sequences of the same size and composition as the current dataset.
* '''Motif''' - The motif found.
+
* '''Motif''' - a sequence of tokens, each of which is either a full character or a wildcard.
** A period (.) matches any character.
+
** A period (.) represents a wildcard and matches any character.
** Square brackets are used to indicate multiple possible characters at a given position (occurs when a substitution matrix is used during discovery).
+
** Square brackets are used to indicate multiple possible characters at a given position (occurs when a substitution matrix (BLOSUM) is used during discovery).
  
 
===Result (Exact matches) sequence line view===
 
===Result (Exact matches) sequence line view===
  
 +
====Single Pattern====
 
Each sequence is shown as a line with length proportional to the sequence length.  All sequences are left-aligned.
 
Each sequence is shown as a line with length proportional to the sequence length.  All sequences are left-aligned.
  
 
The positions of matches along the sequence are shown using colored boxes (here blue).
 
The positions of matches along the sequence are shown using colored boxes (here blue).
  
 +
If a given sequence hit is clicked, it will appear in the horizontally scrollable detail view below.
  
  
[[Image:T_PatternDiscovery_Params_Basic_histone_result_exact.png]]
+
[[Image:PatternDiscovery_Basic_histone_result_exact.png]]
  
If more than one pattern is selected, each will be displayed using a separate color:
 
  
[[Image:T_PatternDiscovery_Histones_Result_exact_2.png]]
+
If the "All/Matching Pattern" box is checked (red arrow), only sequences that have a match to a selected pattern will be shown.
  
===Result (Exact matches) full sequence view===
 
  
In full sequence view, the actual sequence letters are displayed and matches are again outlined in colored boxes (here blue).
+
[[Image:PatternDiscovery_Basic_histone_result_exact_Matching.png]]
 +
 
 +
====Multiple Patterns====
 +
 
 +
If more than one pattern is selected, each will be displayed using a separate color (note we have scrolled down in the result list, so these are not the same hits as above):
  
[[Image:T_PatternDiscovery_Params_Basic_histone_result_exact_seqs.png]]
+
[[Image:PatternDiscovery_Basic_Histones_Result_exact_3.png]]
  
  
 +
====Details as Hover Text====
 +
If the mouse cursor is placed over a particular pattern match, details of any matches at that point are displayed as hover text.
  
 +
Details include
 +
* current cursor location on sequence
 +
* pattern(s) matching
 +
* start and end positions of matching patterns in angle brackets <>.
  
===Result (BLOSUM50) in sequence line view===
+
[[Image:PatternDiscovery_Basic_Histones_Result_exact_3_hover.png]]
  
BLOSUM50 was selected in the Advanced parameters tab. 
 
  
[[Image:T_PatternDiscovery_Params_Basic_histone_result_blosum50.png]]
+
The result of a different search using the BLOSUM50 substitution matrix shows the regular expression in the hover text as well as the sequence location in the detail strip below:
  
  
===Result (BLOSUM50) full sequence view===
+
[[Image:Pattern_discovery_sequence_tooltip3.png]]
  
BLOSUM50 was selected in the Advanced parameters tab. 
+
===Result (Exact matches) full sequence view===
  
[[Image:T_PatternDiscovery_Params_Basic_histone_result_blosum50_seqs.png]]
+
In full sequence view, the actual sequence letters are displayed and matches are again outlined in colored boxes (here blue).
  
 +
[[Image:PatternDiscovery_Basic_histone_result_exact_seqs.png]]
  
 +
===Result (BLOSUM50) full sequence view===
  
 +
BLOSUM50 was selected in the Advanced parameters tab and the search repeated with the same parameters otherwise.  Note that the "All/Matching Pattern" box is checked.  The correspondence between the highlighted sequence and the selected regular expression result can be seen.
  
 +
[[Image:PatternDiscovery_Basic_histone_result_blosum50_seqs.png]]
  
 
===Position Histogram===
 
===Position Histogram===
Line 181: Line 179:
 
The Position Histogram displays (binned) support of each motif along the length of the sequences.
 
The Position Histogram displays (binned) support of each motif along the length of the sequences.
  
* Support - The number of times a motif is found among all the sequences at a particular position or within a particular range of positions (bin step size) is counted. The result is displayed as fractional support for the motif at that position.
+
* '''Support''' - Out of all occurrences of a particular pattern, the percentage that start within the bin beginning at a particular location.
 +
* '''Position''' - Position along the sequences, out to the last location containing a motif match.
  
* '''Step''' - the sequences are divided into bins of the specified "step" size.   
+
* '''Step''' - the sequences are divided into bins of the specified "step" size.  The step size is entered as an integer.
 
* '''Plot Position''' - push to draw a new histogram.
 
* '''Plot Position''' - push to draw a new histogram.
* '''Image snapshot''' - place a copy of the histogram as an image into the Project Folders component.
+
* '''Image snapshot''' - place a copy of the histogram as an image into the [[Workspace]].
 
 
  
  
 
Note that this will only be useful to the extent that the sequences are in some way aligned before analysis.
 
Note that this will only be useful to the extent that the sequences are in some way aligned before analysis.
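
The binning behind the histogram can be sketched in a few lines of Python. This is only an illustration of the definition of '''Support''' given above, not geWorkbench code: the function name <code>position_histogram</code> is hypothetical, positions are assumed to be 0-based start positions of one motif's occurrences, and bins are assumed to begin at multiples of the step size.

```python
from collections import Counter

def position_histogram(start_positions, step):
    """Return {bin_start: percent of occurrences that start in that bin}.

    Assumes 0-based start positions and bins that begin at multiples of step.
    """
    if step <= 0:
        raise ValueError("step must be a positive integer")
    if not start_positions:
        return {}
    # Map every start position to the first position of its bin.
    bins = Counter((pos // step) * step for pos in start_positions)
    total = len(start_positions)
    return {b: 100.0 * n / total for b, n in sorted(bins.items())}

# Six occurrences of one motif, binned with a step of 10.
hist = position_histogram([3, 7, 12, 14, 18, 41], step=10)
print(hist)
```

Because the percentages are taken over all occurrences of a single motif, the bars for each motif sum to 100, which is why the plot is only informative when the sequences share a common frame of reference.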
  
The figure below show the same two motifs selected as in the previous diagram.  Each is displayed in a unique color.
+
The figure below shows three motifs selected.  Each is displayed in a unique color in the position histogram.
  
[[Image:T_PatternDiscovery_Histones_Result_exact_Position_Histogram.png]]
+
[[Image:PatternDiscovery_Basic_Histones_Result_exact_3_histogram.png]]
  
Note that in this example, the sequences are NOT aligned.
+
==Adding results to the Workspace==
  
==Adding results to the Projects Folder==
+
The results of a run of '''Pattern Discovery''' are automatically placed in the [[Workspace]]:
  
The results of a run of '''Pattern Discovery''' are automatically placed in the Project Folder:
 
  
 
+
[[Image:PatternDiscovery_Result_Node.png]]
[[Image:T_PatternDiscovery_Result_Node.png]]
 
  
 
==Options to Save or Mask Patterns==
 
==Options to Save or Mask Patterns==
Line 215: Line 211:
 
The options shown in the picture below are:
 
The options shown in the picture below are:
  
* '''Mask Pattern''' - The selected pattern(s) will be masked out of the sequence for future searches.
+
* '''Mask Pattern''' - The selected pattern(s) will be masked out of the sequence for future searches.  They will not be re-discovered.
  
 
* '''Unmask All Patterns''' - Undo the masking.
 
* '''Unmask All Patterns''' - Undo the masking.
Line 225: Line 221:
 
* '''Save All Patterns''' - This will save all of the patterns together with their hits to the query sequences.  The locations (positions on the query sequences) saved are specific to the particular input file used.  The name of this file is saved in the pattern file.
 
* '''Save All Patterns''' - This will save all of the patterns together with their hits to the query sequences.  The locations (positions on the query sequences) saved are specific to the particular input file used.  The name of this file is saved in the pattern file.
  
* '''Add Patterns to Project''' - This will add the patterns found as a data node in the Projects Folder, with the position information on the input sequences.
+
'''Note''' - for the last two options, the location of the sequence hit is indicated relative to the position of the sequence in the list of sequences actually searched.  If only a subset of sequences were active (a marker set was activated in the Markers component), then the position recorded is relative to the sequence order in this subset.  However, if the marker set is subsequently deactivated and the saved pattern file reloaded from disk, it will still match against the proper sequences. The sequence line numbers recorded in the file are not used. 
  
  
[[Image:T_PatternDiscovery_Right_click_menu.png]]
+
 
 +
[[Image:PatternDiscovery_Right_click_menu.png]]
  
 
==Notes on display of results==
 
==Notes on display of results==
  
 
# Pattern Discovery results can only be displayed in the context of the sequences from which they were derived.
 
# Pattern Discovery results can only be displayed in the context of the sequences from which they were derived.
 
+
# To reload saved pattern files, see the [[Local_Data_Files|Local Data Files]] tutorial.
 +
# There is no provision for filtering out repeats from genomic DNA sequence.  Affected sequences should be masked before loading into geWorkbench.
  
 
==Example Pattern Discovery runs==
 
==Example Pattern Discovery runs==
Line 240: Line 238:
  
 
# Make sure that Pattern Discovery and Position Histogram (if desired) are loaded in the Component Configuration Manager.
 
# Make sure that Pattern Discovery and Position Histogram (if desired) are loaded in the Component Configuration Manager.
# Load a file containing the sequence or sequences to be analyzed into the Project Folders component.
+
# Load a file containing the sequence or sequences to be analyzed into the [[Workspace]].
  
 
For this example, we use a file containing a number of histone sequences, [[Media:H1H5_HistoneDB_NHGRI.fasta | H1H5_HistoneDB_NHGRI.fasta]].
 
For this example, we use a file containing a number of histone sequences, [[Media:H1H5_HistoneDB_NHGRI.fasta | H1H5_HistoneDB_NHGRI.fasta]].
  
 
The Pattern Discovery component appears in the Command area of geWorkbench (lower right quadrant) when a sequence data node is selected.
 
The Pattern Discovery component appears in the Command area of geWorkbench (lower right quadrant) when a sequence data node is selected.
 +
 +
===Sequence Selection===
 +
 +
By default, Pattern Discovery will be run on all sequences in the currently selected sequence data node.  However, subsets of sequences can be created and activated in the Markers component.  The Pattern Discovery component will respect any activated marker sets, restricting discovery to those sequences in activated sets.  If no marker sets are activated, then all sequences will be used.
  
 
===Setup (Exact matches)===
 
===Setup (Exact matches)===
Line 258: Line 260:
  
  
[[Image:T_PatternDiscovery_Params_Basic_histone_run.png]]
+
[[Image:PatternDiscovery_Params_Basic_histone_setup_new.png]]
 +
 
  
  
 
The results are pictured above in '''Viewing Results'''.
 
The results are pictured above in '''Viewing Results'''.
 
  
 
===Setup (BLOSUM50)===
 
===Setup (BLOSUM50)===
 
If the "Exact only" checkbox in the Advanced Parameters tab is unchecked, a selected BLOSUM substitution matrix for protein sequences will be used.   
 
If the "Exact only" checkbox in the Advanced Parameters tab is unchecked, a selected BLOSUM substitution matrix for protein sequences will be used.   
  
[[Image:T_PatternDiscovery_Params_Advanced_BLOSUM.png]]
+
* Similarity Threshold - the threshold is input as an integer number.  (However, note that it is passed to the Splash server as a double).
 +
 
 +
[[Image:PatternDiscovery_Params_Advanced_Matrix.png]]
  
 
The results of repeating the same run as before but using BLOSUM50 are pictured above in '''Viewing Results'''.
 
The results of repeating the same run as before but using BLOSUM50 are pictured above in '''Viewing Results'''.

Latest revision as of 21:39, 26 January 2015




Overview

Sequence Pattern Discovery is the process of identifying nucleotide or amino acid arrangements, also called motifs, that are enriched in a set of sequences. Such motifs may identify regions that have been preserved by evolution and which therefore may play a key functional or structural role. geWorkbench provides two modes of Sequence Pattern Discovery: Regular Discovery and Exhaustive Discovery.

Regular Discovery is based on the SPLASH algorithm (Califano, A., 2000); it generates a list of all regular expression patterns (motifs) that satisfy user-defined minimum support and minimum density criteria. The former determines the minimum number of times a pattern must occur in the sequence set to be reported. This can also be expressed as the minimum percent of sequences that must contain the pattern. The latter determines how sparse the pattern can be, in other words the minimum number of matching characters k (any character except for the dot character “.”) over a window of predefined length w.

SPLASH-based motif discovery is extremely efficient and can process most large protein super-families in a few minutes on a conventional workstation. Discovery is uniquely effective in identifying sparse patterns using extremely low-density constraints, and the results obtained with Discovery can provide the core for a large number of more specific local alignments.

Exhaustive Discovery starts from a relatively high minimum support (e.g. patterns occurring in 75% of the sequences) and it progressively reduces the support, until a statistically significant pattern is discovered. Discovered patterns are reported and then masked in the sequence set so that they are no longer discovered. Then the process continues iteratively until the minimum support reaches a lower user-defined limit. Exhaustive Discovery, thus, produces a list of non-overlapping motifs in order of support.

The Pattern Discovery component is available when a sequence (FASTA format) file has been loaded in the Workspace, and its data node is selected.


Setting parameters

Several parameters can be adjusted to tune the sensitivity of the search.

Basic tab

PatternDiscovery Params Basic.png


Three options for specifying support are available via a pulldown menu:

  • Support (Percent of Sequences) - The pattern must appear in at least the given percentage of sequences.
  • Support (Number of Sequences) - The pattern must appear in at least the given number of sequences.
  • Support (Number of Occurrences) - The pattern must occur at least the given number of times in the set of sequences (can be more than once per sequence).
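
The three support measures can be illustrated with a short Python sketch. It is not geWorkbench or SPLASH code. It assumes that a motif can be read directly as a Python regular expression (a period as a wildcard, square brackets as a character set) and that overlapping occurrences are counted separately; the function name support_measures and the toy sequences are made up for illustration.

```python
import re

def support_measures(motif, sequences):
    """Count how strongly a motif is supported by a list of sequences.

    Assumes the motif is valid Python regex syntax ('.' wildcard, '[..]' sets).
    """
    # A zero-width lookahead lets overlapping occurrences be counted.
    finder = re.compile(f"(?=({motif}))")
    per_sequence = [len(finder.findall(seq)) for seq in sequences]
    occurrences = sum(per_sequence)
    sequences_hit = sum(1 for n in per_sequence if n > 0)
    percent = 100.0 * sequences_hit / len(sequences) if sequences else 0.0
    return {
        "occurrences": occurrences,   # Support (Number of Occurrences)
        "sequences": sequences_hit,   # Support (Number of Sequences)
        "percent": percent,           # Support (Percent of Sequences)
    }

# Toy data: the motif occurs twice in the first sequence.
toy = ["AKRGAKTG", "MAKSG", "GGGG", "AKQG"]
result = support_measures("AK.G", toy)
print(result)
```

In this toy set the motif would pass a threshold of 3 under "Number of Sequences" and a threshold of 4 under "Number of Occurrences", which shows why the two options can admit different patterns.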


PatternDiscovery Params Basic support options.png

Additional options are:

  • Minimum Tokens - The minimum number of tokens in a discovered motif.
  • Density Window - A sliding window in which at least the number of tokens set in "Density Window Min. Tokens" must be found.
  • Density Window Min. Tokens - the minimum number of matching full character tokens (not wildcards) within the "Density Window".
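
How the two density parameters interact can be sketched as follows. The exact rule used by the SPLASH server is not spelled out here, so this sketch rests on stated assumptions: a bracketed set counts as one full-character token, only windows that begin on a full-character token are checked, and a motif shorter than the window is checked as a whole. The names tokenize, count_full_tokens and is_dense are hypothetical.

```python
import re

# A token is either a bracketed set such as [KR] or a single character.
TOKEN_RE = re.compile(r"\[[^\]]+\]|.")

def tokenize(motif):
    """Split a motif string into tokens; '.' is the wildcard token."""
    return TOKEN_RE.findall(motif)

def count_full_tokens(motif):
    """Number of non-wildcard tokens (assumed: a bracketed set counts once)."""
    return sum(1 for t in tokenize(motif) if t != ".")

def is_dense(motif, window, min_tokens):
    """Check an assumed density rule: every window of `window` tokens that
    starts on a full-character token holds at least `min_tokens` of them."""
    tokens = tokenize(motif)
    if len(tokens) < window:
        windows = [tokens]
    else:
        windows = [tokens[i:i + window]
                   for i in range(len(tokens) - window + 1)
                   if tokens[i] != "."]
    return all(sum(t != "." for t in w) >= min_tokens for w in windows)

print(tokenize("A.[KR]G"))        # tokens of a motif with a bracketed set
print(is_dense("A..K.G", 4, 2))   # sparse but still within the constraint
print(is_dense("A...K", 4, 2))    # too many wildcards in the first window
```

Lowering the minimum tokens or widening the window admits sparser motifs, at the cost of a larger search space.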

Exhaustive tab

Parameters specific to the "exhaustive" search algorithm can be set in this tab.

PatternDiscovery Params Exhaustive.png

  • Decrement support (%) - sets the size of intervals by which support level is decreased in successive searches (default is 5). The decrease is multiplicative, e.g. if one enters 5%, support will be reduced to 95% of its previous value at each step.
  • Minimum Support (Number of Sequences) - sets the lower limit on the number of sequences that must contain a specific motif (default is 10).
    • Note - In versions of geWorkbench prior to 2.2.1, if a % sign was included in the minimum support text field, the percentage entered was applied to the initial minimum support value. For example, if the initial minimum support was 70%, and "10%" is entered in this field, the final stopping value for the calculation would be 7%.
  • Minimum Pattern Number - sets a lower limit on the number of motifs in a cluster (point at which support decrease stops).
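
The multiplicative decrement can be made concrete with a small Python sketch. It only reproduces the arithmetic of the support levels; the statistical test and the masking step of Exhaustive Discovery are not modelled. The function name support_schedule is hypothetical, and the stopping rule (stop once support falls below the minimum) is an assumption.

```python
def support_schedule(initial_support, decrement_percent, minimum_support):
    """List the support levels visited by a multiplicative decrement.

    Assumed stopping rule: stop as soon as support drops below the minimum.
    """
    if not 0 < decrement_percent < 100:
        raise ValueError("decrement_percent must be between 0 and 100")
    factor = 1.0 - decrement_percent / 100.0
    levels = []
    support = float(initial_support)
    while support >= minimum_support:
        levels.append(support)
        support *= factor  # e.g. 5% leaves 95% of the previous value
    return levels

# Start at 100 sequences, decrease by 5% per step, stop below 80 sequences.
levels = support_schedule(initial_support=100, decrement_percent=5,
                          minimum_support=80)
print(levels)
```

Because each step is a fraction of the previous level, the steps shrink as the search proceeds. The pre-2.2.1 behaviour in the note above is a separate calculation: an initial support of 70% with "10%" entered gave a stopping value of 70 × 0.10 = 7%.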

Limits

PatternDiscovery Params Limits.png

  • Max. Pattern Number - limits the number of patterns to discover. The actual upper limit to the number of patterns the server will return is 99,999.

Advanced tab

  • Exact Only (default checked)
    • When checked, no substitution matrix will be used. Exact character matches are required.
    • When checked, the choices for similarity matrix and similarity threshold are disabled.
    • Note - "Exact only" should always be used for DNA as no DNA substitution matrix is provided.
    • When unchecked, the choices for similarity matrix and threshold are enabled.
  • Similarity matrix choice (default BLOSUM50) - Other choices are BLOSUM62 and BLOSUM100.
  • Similarity threshold (default 2) - pairs of amino acids with a score higher than the specified threshold in the chosen BLOSUM similarity matrix are considered similar. The input is restricted to integers. (Note that the threshold is however stored as a double).
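
The effect of the similarity threshold can be sketched in Python. The scores below are made-up values for illustration only and are not taken from BLOSUM50 or any real matrix; the names TOY_SCORES, are_similar and similar_set are hypothetical. The one rule taken from the description above is that a pair counts as similar only when its score is strictly higher than the threshold.

```python
# Made-up pair scores for illustration only; NOT real BLOSUM values.
TOY_SCORES = {
    ("K", "R"): 3,
    ("I", "V"): 4,
    ("I", "L"): 2,
    ("D", "E"): 2,
    ("A", "W"): -3,
}

def score(a, b, scores=TOY_SCORES):
    """Look up a symmetric pair score; unlisted pairs get a low default."""
    if (a, b) in scores:
        return scores[(a, b)]
    return scores.get((b, a), -4)

def are_similar(a, b, threshold=2, scores=TOY_SCORES):
    """Identical residues always match; other pairs need score > threshold."""
    return a == b or score(a, b, scores) > threshold

def similar_set(residue, alphabet, threshold=2, scores=TOY_SCORES):
    """Residues that could share a bracketed position with `residue`."""
    return sorted(x for x in alphabet
                  if are_similar(residue, x, threshold, scores))

print(similar_set("I", "AILVKR"))               # default threshold of 2
print(similar_set("I", "AILVKR", threshold=1))  # looser threshold
```

With these toy scores, lowering the threshold from 2 to 1 lets the I/L pair count as similar, so more residues can appear inside one pair of square brackets in a reported motif.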

The Advanced tab when "Exact Only" is checked:


PatternDiscovery Params Advanced Exact.png


The Advanced tab when "Exact Only" is unchecked:


PatternDiscovery Params Advanced Matrix.png

Run Pattern Discovery

  • Pushing the button with the curling arrow icon brings up the session creation box:


PatternDiscovery SessionConnect.png

Discovery Session

  • Discovery Session Name: A name is auto-generated for identifying the job on the server, but any name can be entered.
  • Create - Push to start the search.
  • Cancel - Cancel the discovery run.

Discovery Session Server

  • Server and Port: Columbia supports a Pattern Discovery server at splash.c2b2.columbia.edu, Port 80.
  • Username: Any name can be entered to identify the job.
  • Password: none currently required.


The progress bar will show the sequence upload:


PatternDiscovery UploadProgressBar.png


and run steps:


PatternDiscovery RunningProgressBar.png

Viewing results

The result of the search can be viewed both in the Pattern Discovery module itself and in sequence viewer modules such as "Sequence", "Promoter" and "Position Histogram". In Pattern Discovery the results are returned in a table, and the hits for the motif(s) selected in the table are superimposed on the sequences shown, e.g. in the Sequence Viewer located above in the Visual area. In this picture, the results were first sorted by the number of tokens.

Note that if a substitution matrix was used during discovery, the motif may contain a range of possible residues, enclosed by square brackets, for a particular position. This is depicted below under the BLOSUM50 result headings.

Motifs found are listed in a table with the following columns:

  • Hits - The total number of times a motif was found in the sequence dataset, including any multiple hits in individual sequences.
  • Sequences Hit - The number of sequences which contained at least one occurrence of the discovered motif.
  • # of Tokens - the number of full-character tokens in the motif.
  • ZScore - a measure of how often the motif would be found in a random set of sequences of the same size and composition as the current dataset.
  • Motif - a sequence of tokens, each of which is either a full character or a wildcard.
    • A period (.) represents a wildcard and matches any character.
    • Square brackets are used to indicate multiple possible characters at a given position (occurs when a substitution matrix (BLOSUM) is used during discovery).

Result (Exact matches) sequence line view

Single Pattern

Each sequence is shown as a line with length proportional to the sequence length. All sequences are left-aligned.

The positions of matches along the sequence are shown using colored boxes (here blue).

If a given sequence hit is clicked, it will appear in the horizontally scrollable detail view below.


PatternDiscovery Basic histone result exact.png


If the "All/Matching Pattern" box is checked (red arrow), only sequences that have a match to a selected pattern will be shown.


PatternDiscovery Basic histone result exact Matching.png

Multiple Patterns

If more than one pattern is selected, each will be displayed using a separate color (note we have scrolled down in the result list, so these are not the same hits as above):

PatternDiscovery Basic Histones Result exact 3.png


Details as Hover Text

If the mouse cursor is placed over a particular pattern match, details of any matches at that point are displayed as hover text.

Details include:

  • current cursor location on sequence
  • pattern(s) matching
  • start and end positions of matching patterns in angle brackets <>.

PatternDiscovery Basic Histones Result exact 3 hover.png


The result of a different search using the BLOSUM50 substitution matrix shows the regular expression in the hover text as well as the sequence location in the detail strip below:


Pattern discovery sequence tooltip3.png

Result (Exact matches) full sequence view

In full sequence view, the actual sequence letters are displayed and matches are again outlined in colored boxes (here blue).

PatternDiscovery Basic histone result exact seqs.png

Result (BLOSUM50) full sequence view

BLOSUM50 was selected in the Advanced parameters tab and the search was repeated with otherwise identical parameters. Note that the "All/Matching Pattern" box is checked. The correspondence between the highlighted sequence and the selected regular expression result can be seen.

PatternDiscovery Basic histone result blosum50 seqs.png

Position Histogram

The Position Histogram displays (binned) support of each motif along the length of the sequences.

  • Support - Out of all occurrences of a particular pattern, the percentage that start within the bin beginning at a particular location.
  • Position - Position along the sequences, out to the last location containing a motif match.
  • Step - the sequences are divided into bins of the specified "step" size. The step size is entered as an integer.
  • Plot Position - push to draw a new histogram.
  • Image snapshot - place a copy of the histogram as an image into the Workspace.


Note that this will only be useful to the extent that the sequences are in some way aligned before analysis.
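
The binned support described above can be sketched in Python as follows. This is an illustration only, not the Position Histogram implementation: the function name `position_histogram` is hypothetical, positions are assumed to be zero-based, and bins are assumed to start at multiples of the step size.

```python
import re
from collections import Counter

def position_histogram(motif, sequences, step):
    """Percentage of motif occurrences whose start falls in each bin of width `step`."""
    pattern = re.compile(f"(?={motif})")  # lookahead also finds overlapping matches
    starts = [m.start() for seq in sequences for m in pattern.finditer(seq)]
    if not starts:
        return {}
    # Label each bin by the position at which it begins.
    counts = Counter((s // step) * step for s in starts)
    total = len(starts)
    return {bin_start: 100.0 * n / total for bin_start, n in sorted(counts.items())}

if __name__ == "__main__":
    for bin_start, support in position_histogram("A.K", ["AAKAKK", "ARKG"], 2).items():
        print(f"bin starting at {bin_start}: {support:.1f}% of occurrences")
```

Because the bins are defined by absolute position, the percentages are comparable across sequences only when the sequences share a common frame of reference.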

The figure below shows three motifs selected. Each is displayed in a unique color in the position histogram.

PatternDiscovery Basic Histones Result exact 3 histogram.png

Adding results to the Workspace

The results of a run of Pattern Discovery are automatically placed in the Workspace:


PatternDiscovery Result Node.png

Options to Save or Mask Patterns

Several operations are possible on the returned patterns. The options menu can be seen by right-clicking on a selection of one or more returned patterns.

1. The patterns can be saved, either with their positions on the original sequences, or just as regular expressions.

2. The patterns can be masked out of the query sequence.

The options shown in the picture below are:

  • Mask Pattern - The selected pattern(s) will be masked out of the sequence for future searches. They will not be re-discovered.
  • Unmask All Patterns - Undo the masking.
  • Save Patterns (Regex Only) - This will save the selected pattern(s) in the form of regular expressions, that is, letters and wild-card characters.
  • Save Selected Patterns - This will save both the selected pattern(s) and their hits to the query sequences. The locations (positions on the query sequences) saved are specific to the particular input file used. The name of this file is saved in the pattern file.
  • Save All Patterns - This will save all of the patterns together with their hits to the query sequences. The locations (positions on the query sequences) saved are specific to the particular input file used. The name of this file is saved in the pattern file.

Note - for the last two options, the location of the sequence hit is indicated relative to the position of the sequence in the list of sequences actually searched. If only a subset of sequences was active (i.e., a marker set was activated in the Markers component), then the position recorded is relative to the sequence order in this subset. However, if the marker set is subsequently deactivated and the saved pattern file reloaded from disk, it will still match against the proper sequences. The sequence line numbers recorded in the file are not used.


PatternDiscovery Right click menu.png
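
The effect of the Mask Pattern option can be pictured with the following Python sketch. How geWorkbench masks sequences internally is not described in this tutorial, so the function name `mask_pattern`, the mask character "X", and the approach of overwriting matched residues are all assumptions made for illustration.

```python
import re

def mask_pattern(sequence, motif, mask_char="X"):
    """Replace every residue covered by a motif match with mask_char."""
    chars = list(sequence)
    # Matches are located on the original sequence; the capturing group
    # inside the lookahead also reports overlapping occurrences.
    for m in re.finditer(f"(?=({motif}))", sequence):
        start, end = m.start(1), m.end(1)
        chars[start:end] = mask_char * (end - start)
    return "".join(chars)

if __name__ == "__main__":
    print(mask_pattern("AAKAKKG", "A.K"))  # XXXXXXG
```

A later search on the masked sequence can no longer report the masked motif, because none of its occurrences remain.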

Notes on display of results

  1. Pattern Discovery results can only be displayed in the context of the sequences from which they were derived.
  2. To reload saved pattern files, see the Local Data Files tutorial.
  3. There is no provision for filtering out repeats from genomic DNA sequence. Affected sequences should be masked before loading into geWorkbench.

Example Pattern Discovery runs

Prerequisites

  1. Make sure that Pattern Discovery and Position Histogram (if desired) are loaded in the Component Configuration Manager.
  2. Load a file containing the sequence or sequences to be analyzed into the Workspace.

For this example, we use a file containing a number of histone sequences, H1H5_HistoneDB_NHGRI.fasta.

The Pattern Discovery component appears in the Command area of geWorkbench (lower right quadrant) when a sequence data node is selected.

Sequence Selection

By default, Pattern Discovery will be run on all sequences in the currently selected sequence data node. However, subsets of sequences can be created and activated in the Markers component. The Pattern Discovery component will respect any activated marker sets, restricting discovery to those sequences in activated sets. If no marker sets are activated, then all sequences will be used.

Setup (Exact matches)

We will try parameters set to allow longer matches. No changes from default are made in the other parameter tabs. In particular, this search uses exact matching of the sequence letters, without substitutions.

  • Support Percent: 30
  • Min tokens: 10
  • Density Window: 7
  • Density tokens: 4

These parameters were chosen empirically. You can try variations to see how they affect the results.
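
To get a feel for the density parameters, the following Python sketch checks whether a motif has at least a given number of full-character tokens in every window of a given length. It is a simplified reading of the density criterion, not the SPLASH implementation, whose exact definition may differ; the function names are hypothetical, and a bracketed position is assumed to count as one full character.

```python
import re

def motif_positions(motif):
    """Split a motif into positions; a bracketed class such as [KR] is one position."""
    return re.findall(r"\[[^\]]+\]|.", motif)

def satisfies_density(motif, window=7, min_tokens=4):
    """True if every window of `window` positions holds at least `min_tokens` non-wildcards."""
    full = [p != "." for p in motif_positions(motif)]
    # A motif shorter than the window is checked as a single window.
    last_start = max(1, len(full) - window + 1)
    return all(sum(full[i:i + window]) >= min_tokens for i in range(last_start))

if __name__ == "__main__":
    print(satisfies_density("AK.G..R"))        # True: 4 full characters in 7 positions
    print(satisfies_density("A...K...G...R"))  # False: first window holds only 2
```

Under this reading, raising the Density tokens value or shrinking the Density Window rejects sparser motifs.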


PatternDiscovery Params Basic histone setup new.png


The results are pictured above in Viewing Results.

Setup (BLOSUM50)

If the "Exact only" checkbox in the Advanced Parameters tab is unchecked, a selected BLOSUM substitution matrix for protein sequences will be used.

  • Similarity Threshold - the threshold is entered as an integer. (Note, however, that it is passed to the SPLASH server as a double.)

PatternDiscovery Params Advanced Matrix.png

The results of repeating the same run as before but using BLOSUM50 are pictured above in Viewing Results.


References

Califano, A. (2000). SPLASH: structural pattern localization analysis by sequential histograms. Bioinformatics 16(4):341-357.