Overview

Background

Protein structure analysis ideally starts from an experimentally determined (crystal) structure of the protein. Because obtaining one experimentally can be expensive, a computational technique called protein structure prediction is used to infer the structure from the protein's sequence. The prediction methods below can be used either independently or together:

Ab initio - a simulation technique
Homology Modeling

PUDGE uses these techniques together: its workflow is a series of steps that apply homology modeling first and ab initio prediction second. The steps are detailed below.

Workflow

PUDGE has a number of steps in its workflow, each taking the output of the previous step as its input. Advanced users interact with PUDGE by viewing and analyzing the output of a step and then either rerunning that step with different input or modifying its output before it goes into the next step. The following is adapted from Donald Petrey's PUDGE manual:

http://wiki.c2b2.columbia.edu/honiglab/index.php/Structure_prediction_in_the_Honig_Lab

PIPELINE INPUT: A protein sequence. This is called the "target" sequence.

Stage 1 - Template selection

To predict the structure of a sequence with an unknown structure using homology, one must first find a protein or set of proteins likely to have a structure similar to the target. This is usually accomplished by finding proteins with experimentally determined structures that can be shown to have an evolutionary relationship to the target. Such proteins are usually referred to as "templates". In most homology modeling exercises, useful information can come from more than one template.

INPUT: A "target" protein sequence with an unknown structure
OUTPUT: A list of "template" protein structures which have similar sequences to the target.

Stage 2 - Sequence-to-structure alignment (Change to Sequence Alignment)

(I do not think any structural alignment is being done here - KCS) To make a model, it is necessary to first identify the residues in the target sequence that are structurally equivalent to residues in the template structure. This pairwise correspondence of amino acids is called a "sequence-to-structure alignment", or simply an "alignment". It is not possible, in general, to generate the single "best" alignment without incorporating structural information. It is usually necessary to make many different alignments, especially for remote homologs (and even for close homologs in regions of low sequence identity, e.g. T0142 from CASP6).

INPUT: A list of protein structures which have similar sequences to the target
OUTPUT: A series of residue-residue alignments of the templates to the target

Stage 3 - Model building

Given an alignment, an initial model is made by replacing the residues in the template with their corresponding residues in the target (based on the alignment). The backbone is generally kept fixed within secondary structure elements while the conformations of unconserved side-chains and insertions/deletions are predicted ab initio. Many models may be produced based on the available set of templates and alignments.

INPUT: A series of residue-residue alignments of the templates to the target
OUTPUT: A set of model structures, whose size is not necessarily equal to the number of alignments

Stage 4 - Model refinement

Homology modeling assumes that structures will be similar if sequences are similar, but the target structure usually differs from the template structure in one or several regions. Identifying these regions and predicting their structure is called "model refinement" or simply "refinement". Refinement procedures may again produce multiple models.

INPUT: A series of model structures
OUTPUT: A refined series of model structures

Stage 5 - Model evaluation

The steps described above may produce thousands of models. Choosing the best one using a scoring function or effective energy function is called "model evaluation". Several methods are usually applied in sequence: computationally inexpensive simplified scoring functions or statistical potentials first remove grossly incorrect models, and more detailed physical-chemical energy functions are then applied to choose the "best" structure.

INPUT: A refined series of model structures
OUTPUT: A set of structures, with a ranking
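
The two-pass evaluation described in this stage can be sketched as follows. This is only an illustration under stated assumptions, not PUDGE's implementation: the cheap_score and detailed_energy callables, the cutoff, and the convention that lower values are better are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def evaluate_models(
    models: Sequence[str],
    cheap_score: Callable[[str], float],
    detailed_energy: Callable[[str], float],
    cheap_cutoff: float,
) -> List[Tuple[str, float]]:
    """Rank models in two passes (assumption: lower values are better)."""
    # Pass 1: an inexpensive score discards grossly incorrect models.
    survivors = [m for m in models if cheap_score(m) <= cheap_cutoff]
    # Pass 2: the expensive energy function runs only on the survivors.
    scored = [(m, detailed_energy(m)) for m in survivors]
    # Best (lowest-energy) model first.
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Made-up scores for three hypothetical model files.
    cheap = {"m1.pdb": 0.2, "m2.pdb": 5.0, "m3.pdb": 0.4}
    energy = {"m1.pdb": -80.0, "m2.pdb": -300.0, "m3.pdb": -120.0}
    ranking = evaluate_models(
        list(cheap), cheap.__getitem__, energy.__getitem__, cheap_cutoff=1.0
    )
    print(ranking)
```

Filtering before the expensive pass keeps the cost of the detailed energy function proportional to the number of plausible models rather than to the full set.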

PIPELINE OUTPUT: A set of structures, with a ranking
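
The chaining of the five stages, where each stage consumes the previous stage's output, might be sketched like this. The stage functions are placeholders that only relabel their input, and the optional review hook (standing in for an advanced user inspecting or modifying intermediate output) is an assumption made for illustration.

```python
from typing import Any, Callable, List, Optional, Tuple

# A stage is a (name, function) pair; the function maps input to output.
Stage = Tuple[str, Callable[[Any], Any]]


def run_pipeline(
    target_sequence: str,
    stages: List[Stage],
    review: Optional[Callable[[str, Any], Any]] = None,
) -> Any:
    """Pass the target through the stages; each output feeds the next stage."""
    data: Any = target_sequence
    for name, stage in stages:
        data = stage(data)
        if review is not None:
            # Hypothetical hook: inspect or modify a stage's output here.
            data = review(name, data)
    return data


# Placeholder stages: they do no real modeling, only relabel their input.
def select_templates(target: str) -> List[str]:
    return ["templateA", "templateB"]


def align(templates: List[str]) -> List[str]:
    return [t + ":aln" for t in templates]


def build_models(alignments: List[str]) -> List[str]:
    return [a.replace(":aln", ":model") for a in alignments]


def refine_models(models: List[str]) -> List[str]:
    return [m + ":refined" for m in models]


def rank_models(models: List[str]) -> List[str]:
    return sorted(models)


STAGES: List[Stage] = [
    ("template_selection", select_templates),
    ("alignment", align),
    ("model_building", build_models),
    ("model_refinement", refine_models),
    ("model_evaluation", rank_models),
]

if __name__ == "__main__":
    print(run_pipeline("MKVLAT", STAGES))
```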

Phase 1

PUDGE - geWorkbench Integration

Alpha Release

The alpha phase of this integration will replace the existing web form with geWorkbench as the user interface. The inputs on the interface are:

protein sequence
methods

This alpha release will be a basic, non-interactive integration. The sequence is submitted to PUDGE via the existing server functionality (explained below). PUDGE will then run with default parameters and return one PDB file if PUDGE can find any model to fit the sequence. This structure can be viewed using the existing functionality of geWorkbench (JMol).

Development Environment

All development is done on caldev.

caGrid 1.0 is installed on caldev at:

 C:\Documents and Settings\xiaoqing\cagrid-1-0\

For external programs:

Globus ws-core 4.0.3 is installed at:

 C:\Documents and Settings\xiaoqing\cagrid-1-0\external\ws-core-4.0.3

Tomcat 5.0.28 is installed at:

 C:\Documents and Settings\xiaoqing\cagrid-1-0\external\jakarta-tomcat-5.0.28

The PUDGE code is created at:

 C:\Documents and Settings\xiaoqing\PUDGE

Implementation

Set up the PUDGE application server. To run standalone PUDGE, the following variables must be set exactly as listed:

PUDGE=/razor/0/common/pudge
PDB_DIRECTORY=/razor/0/databases/pdb
TROLLTOP=/razor/0/common/pudge/dat/allh.top
SUBMAT=/razor/0/common/pudge/dat/blosum62
JACKALDIR=/razor/0/common/pudge/dat/jackal.dir
MOUNTDIR=/razor/0
JAVA_HOME=/nfs/apollo/1/server_data/www/pudge2/jdk1.5.0_05
PATH=$JAVA_HOME/bin:$PATH
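
Before starting the server, it can help to confirm that the variables listed above are present in the environment. The following is a small sketch of such a check; the function name and the idea of running it as a pre-flight script are assumptions, and it only tests that each variable is set, not that its value matches the listing.

```python
import os
from typing import List, Mapping

# Variable names taken from the listing above (PATH is omitted because
# it is always set; only its JAVA_HOME prefix is PUDGE-specific).
REQUIRED_VARS = [
    "PUDGE",
    "PDB_DIRECTORY",
    "TROLLTOP",
    "SUBMAT",
    "JACKALDIR",
    "MOUNTDIR",
    "JAVA_HOME",
]


def missing_pudge_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_pudge_vars()
    if missing:
        print("Missing PUDGE variables: " + ", ".join(missing))
    else:
        print("All PUDGE variables are set.")
```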

Note that the cluster uses both C shell and bash, so both .cshrc and .bashrc must be changed to set the variables above. The application server (Tomcat) is currently located at:

/nfs/apollo/1/server_data/www/pudge2/jakarta-tomcat-5.0.28

The user name is: pudgewb

Integration

When the user presses Submit in the UI, the sequence is submitted to the grid service (addwnld). The service creates the corresponding FASTA file and invokes a Perl script to start the PUDGE pipeline. The script is located at:

/razor/0/common/pudge/scr/psub.pl

The service runs the following command:

perl /razor/0/common/pudge/scr/psub.pl 2resub.cfg

After the Perl script is called, the pipeline program is invoked and a working folder is generated at:

/razor/0/pudge/common/pudge/wrk/P<rand4>/
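
The submission step described above (write a FASTA file, then call the Perl script) could look roughly like the sketch below. The script path and the 2resub.cfg argument come from the text; the FASTA file name, the header line, the 60-column wrapping, and the choice of working directory are assumptions, and the actual grid service may do this differently.

```python
import subprocess
from pathlib import Path
from typing import List

PSUB_SCRIPT = "/razor/0/common/pudge/scr/psub.pl"  # path given in the text
CONFIG_FILE = "2resub.cfg"                         # argument given in the text


def write_fasta(sequence: str, path: Path, header: str = "target") -> Path:
    """Write one sequence as FASTA, wrapping residues at 60 columns."""
    residues = "".join(sequence.split()).upper()
    lines = [residues[i:i + 60] for i in range(0, len(residues), 60)]
    path.write_text(">" + header + "\n" + "\n".join(lines) + "\n")
    return path


def build_command(config: str = CONFIG_FILE) -> List[str]:
    """Return the argument list for the command shown above."""
    return ["perl", PSUB_SCRIPT, config]


def submit(sequence: str, run_dir: Path, fasta_name: str = "target.fasta") -> None:
    """Write the FASTA file and start the pipeline (file name is hypothetical)."""
    write_fasta(sequence, run_dir / fasta_name)
    # check=True raises CalledProcessError if the script exits non-zero.
    subprocess.run(build_command(), cwd=run_dir, check=True)
```

Passing the command as a list rather than a single shell string avoids shell quoting problems if the configuration file name ever contains spaces.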

2resub.cfg Format

<Option_Name><colon><tab><Selection_List>

where <Option_Name> is one of the following:

template_selection
template_selection_analysis
alignment
alignment_analysis
model_building
model_building_analysis
model_refinement
model_refinement_analysis
model_evaluation
model_evaluation_analysis
target_seq

<colon> and <tab> refer to the literal colon and tab characters

<Selection_List> is a comma-delimited list (no spaces or tabs) of options specific to each of: template_selection, template_selection_analysis, alignment, alignment_analysis, model_building, model_building_analysis, model_refinement, model_refinement_analysis, model_evaluation, model_evaluation_analysis

For target_seq, <Selection_List> is always "input_txt", identifying the target file mentioned above.
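
As an illustration of the line format described above, the following sketch parses and writes text in that shape. It is not code from PUDGE: the function names are hypothetical, and skipping blank lines and rejecting unknown option names are assumptions about how a reader of this format might behave.

```python
from typing import Dict, List

# Option names as listed in the format description above.
OPTION_NAMES = [
    "template_selection", "template_selection_analysis",
    "alignment", "alignment_analysis",
    "model_building", "model_building_analysis",
    "model_refinement", "model_refinement_analysis",
    "model_evaluation", "model_evaluation_analysis",
    "target_seq",
]


def parse_config(text: str) -> Dict[str, List[str]]:
    """Parse '<Option_Name>:<tab><Selection_List>' lines into a dict."""
    options: Dict[str, List[str]] = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        name, sep, selections = line.partition(":\t")
        if not sep:
            raise ValueError("line %d: expected '<name>:<tab><list>'" % line_no)
        if name not in OPTION_NAMES:
            raise ValueError("line %d: unknown option %r" % (line_no, name))
        # The selection list is comma-delimited with no spaces or tabs.
        options[name] = selections.split(",")
    return options


def format_config(options: Dict[str, List[str]]) -> str:
    """Serialize a dict back into the same line format."""
    return "".join(
        name + ":\t" + ",".join(values) + "\n" for name, values in options.items()
    )
```

For example, parse_config("target_seq:\tinput_txt\n") yields a dict mapping "target_seq" to a one-element list, and format_config turns that dict back into the same line.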

As discussed above, during the alpha phase PUDGE will use default method parameters to allow quick generation and retrieval of PDB files. These will be:

template_selection:	test_template_selection
template_selection_analysis:	test_template_selection_analysis
alignment:	test_alignment_analysis
model_building:	test_model_building
model_building_analysis:	test_model_building_analysis
model_refinement_analysis:	test_model_refinement_analysis
model_evaluation:	test_model_evaluation
model_evaluation_analysis:	test_model_evaluation_analysis
target_seq:	input_txt

At the end, the pipeline program writes the Protein Data Bank (PDB) file into a "models" folder, which it creates under the working directory. While the job is running, the working folder contains a file named "lock". When the job is done, the lock file is removed and a file named DONEALL.txt appears.
The PDB file is then transferred back to geWorkbench and displayed using JMol.
The 2resub.cfg file must be placed in the tomcat/bin folder, and all files in that folder must be executable.
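
A client waiting for a job could use the lock and DONEALL.txt files as a completion signal. The sketch below polls a working directory and then collects the files in its "models" folder. The polling interval, the timeout, the "unknown" state, and the assumption that model files end in .pdb are illustrative choices, not documented PUDGE behavior.

```python
import time
from pathlib import Path
from typing import List


def job_state(work_dir: Path) -> str:
    """Classify a job from the marker files in its working folder."""
    if (work_dir / "DONEALL.txt").exists():
        return "done"
    if (work_dir / "lock").exists():
        return "running"
    return "unknown"  # neither marker present, e.g. job not started yet


def wait_for_models(
    work_dir: Path, poll_seconds: float = 10.0, timeout_seconds: float = 3600.0
) -> List[Path]:
    """Block until DONEALL.txt appears, then return the PDB files found."""
    deadline = time.monotonic() + timeout_seconds
    while job_state(work_dir) != "done":
        if time.monotonic() >= deadline:
            raise TimeoutError("job in %s did not finish in time" % work_dir)
        time.sleep(poll_seconds)
    # Assumption: model files in the "models" folder carry a .pdb extension.
    return sorted((work_dir / "models").glob("*.pdb"))
```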

The solution to the classloader issue.

The standalone PUDGE client can connect to the service without any problem, but when the code is plugged into geWorkbench, the following exception is raised:

 java.lang.ClassCastException: org.apache.xerces.jaxp.DocumentBuilderFactoryImpl

To solve this problem, the j2ee.jar and arrayexpress.jar files from the geWorkbench core library need to be updated: the old version of DocumentBuilderFactoryImpl must be removed from both jar files. Fixed versions have been stored in CVS under the names j2eeWithoutDoc.jar and arrayexpressWithoutDoc.jar. These files are located in the geWorkbench lib directory and the caArrayMageom directory, respectively.

Finally, the Java runtime parameters need to be set as follows (either in build.xml or in the IDE, e.g. IDEA):

-Djavax.xml.parsers.SAXParserFactory=org.apache.xerces.jaxp.SAXParserFactoryImpl -Djavax.xml.parsers.DocumentBuilderFactory=org.apache.xerces.jaxp.DocumentBuilderFactoryImpl

The complete section in the build.xml file, after the addition of the two lines, looks like this:

Phase 2

Use Case

Primary actions

User provides one or more sequences for which he/she wishes to model the 3-D structure.
The sequences are loaded into geWorkbench as FASTA files.
In a geWorkbench PUDGE interface, the user chooses methods for template selection, alignment, model building, refinement and evaluation. At each stage, one or more methods to analyze the stage results may also be chosen.
1. Following the current PUDGE web interface practice, no parameters will be set manually. All analyses are prepackaged.
2. (Alternatively, the selected FASTA file(s) could be sent to the server and the method selection process handed off to the PUDGE server itself via a web browser window. This would allow access to the latest improvements made by the PUDGE group.)
The user submits the job.
The user can monitor job progress:
1. The list of chosen methods and analyses is displayed so that progress through them can be determined.
2. Is the job still running?
3. What is the cumulative CPU time used?
4. Which stages have completed?
5. Within a stage, which methods have completed (if multiple were chosen)?
The user can examine intermediate and final results. The primary results are:
1. Lists of alignments or structures and their P-values.
2. Output from separate Analysis methods available at each stage - the results are graphical web pages.
In addition, further visualizations are available for the primary alignments or structure files:
1. Web page data visualizations generated by the "Show Results" program.
Final results, both sequence alignments and 3-D structures, can be imported into geWorkbench. The existing web server provides a mechanism for concatenating the result files for export.
Imported alignments or structures can be viewed in geWorkbench.
1. For structures, the existing JMol viewer can be used.
2. For alignments, a new viewer may be needed - the alignment files are text.
Complete or selected results should also be directly downloadable to the user's computer. The existing web server provides a mechanism for concatenating the result files for export.
The user can resubmit a job from an intermediate step. For example, he/she may wish to choose a different modeling algorithm or analysis.
1. Resubmission is only allowed for completed jobs.
2. Intermediate resubmissions of the job run in the same directory structure under the same jobID.
The user can rerun the entire job, e.g. if it has crashed and must be restarted.

Overview of the current web-based PUDGE

Sequence selection

Method selection

Job status page

Status page:

Results menu choices

Details of current web-based PUDGE

Each pipeline stage result section has a pulldown menu. The menus vary slightly by stage. The contents are listed below by stage: