MPIBLAST


Background for parallel BLAST

NCBI does not support any parallel BLAST running on clusters and does not recommend any existing products. NCBI has, however, developed a threaded BLAST for SMP machines to speed up BLAST; the speedup is still limited because SMP architectures restrict the number of processors a machine can have.

Commercial Products

TimeLogic. It provides the DeCypher engine, which accelerates CPU-intensive computations such as BLAST, Smith-Waterman, and HMM searches. The engine (a physical card) can be installed in PC, Unix, or Linux machines, but it may not be easy to install on Linux clusters. They sell either the card or a whole cluster. Price: the products are somewhat expensive, $25,000 and up for a cluster, $7,500 and up for the card.

BioTeam. It provides iNquiry (an "instant informatics clustering kit"). It is a software package that combines the Sun Grid Engine for resource allocation and the Ganglia cluster monitoring system with popular bioinformatics tools such as BLAST, mpiBLAST, HMMER, BLAT, and ClustalW. Price: $4,995 for a single cluster of any size; additional technical support fees may apply. It also includes a bioinformatics portal with a Web interface.

Aethia (an Italian company). It provides PowerBLAST, which can run BLAST on Linux clusters. (Note that NCBI created an unrelated PowerBLAST in 1996, which is no longer available from NCBI.) http://www.aethia.com/en/prod_powerblast_en.shtml Price: unknown; no one answered my email.

SGI: it provides supercomputers and clusters. One product is called GenomeCluster, which includes software called CT-BLAST for running BLAST on clusters. CT-BLAST is a cluster-aware implementation of the NCBI BLAST program. It does not modify any NCBI BLAST code; rather, it restructures the flow of input and output from multiple copies of NCBI BLAST running on cluster compute nodes. CT-BLAST minimizes the time to return results by load balancing queries across the compute nodes.

Price: it appears they do not sell the software separately. No response yet.


Out of business list:

Paracel Company. The original Paracel went out of business last year, but someone from the company started a new development company that inherited the Paracel website. It claims it can provide support for old Paracel machines.

TurboGenomics. It provided TurboBLAST, a product developed together with IBM, but the company changed its name to TurboWorx and then went out of business.

Open source/free products:


Almost all active research is based on mpiBLAST; the other projects have stopped or made no recent progress.

mpiBLAST, developed at Los Alamos National Lab, divides the sequence database into small fragments distributed across the nodes. When each fragment fits into a node's physical memory, the achievable speedup can be superlinear. The latest version of mpiBLAST is based on pioBLAST.

pioBLAST, developed by North Carolina State University and Oak Ridge National Lab, is based on mpiBLAST and adds optimizations for parallel I/O and virtual partitioning.

Sge_mpiblast scripts. Provided by Scalable Informatics LLC and licensed under the GPLv2. A short shell script that makes running mpiBLAST on a cluster easier by calling mpiBLAST from SGE (Sun Grid Engine).

Par-BLAST, developed by Indiana University. The code is currently not open to the public; I am waiting for a response from them.

Hyper-BLAST, dead. No code can be found now.

The Installation and Configuration of MPIBLAST

Progress made at adcluster

What is installed on adcluster for parallel computing?

MPICH

LAM/MPI


MPICH: not the latest version.

Problems related to MPICH:

1. It can compile and run the example Hello World code but cannot run any serious MPI program.

2. It needs autoconf 2.58 or a higher version; adcluster needs an upgrade.

The installed LAM/MPI is version 7.1.1. The latest is 7.1.2; according to the LAM/MPI website, there is no huge difference between the two versions.

Which one is better, MPICH or LAM/MPI?

One argument is that MPICH is better because it does not require a lamboot step to build a LAM universe. In fact, if you use "mpiexec" from LAM, you can run one-shot MPI programs too. Considering the problems we met with MPICH on adcluster, most of the following discussion concentrates on LAM/MPI.


Packages required before the Installation

1. Patched NCBI toolkit. Download the NCBI October 2004 toolkit and apply the patch with the patch script.

2. MPICH or LAMMPI

3. c/c++ compiler

Installation of mpiBLAST.

  • Download the mpi-blast1.4.0.gz file from the mpiblast.org website and untar it into a temporary folder.
  • Compile and install mpiBLAST by running ./configure --with-ncbi=/path_to_ncbi_path, make, and make install.
  • Configure mpiBLAST by editing the ~/.ncbirc file.

The file will look like the following:

  [NCBI]
  Data=/path/to/shared/storage/data
  [BLAST]
  BLASTDB=/path/to/shared/storage
  BLASTMAT=/path/to/shared/storage/data
  [mpiBLAST]
  Shared=/path/to/shared/storage
  Local=/path/to/local/storage
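
As a convenience, the configuration above can be generated with a short shell snippet. This is only a sketch: the SHARED and LOCAL paths are placeholders you must replace with your cluster's real storage locations, and the file is written to the current directory before being copied to $HOME.

```shell
# Sketch: generate the .ncbirc shown above. SHARED and LOCAL are
# placeholder paths; point them at your shared and node-local storage,
# then copy the result to $HOME/.ncbirc.
SHARED=/path/to/shared/storage
LOCAL=/path/to/local/storage
cat > .ncbirc <<EOF
[NCBI]
Data=$SHARED/data
[BLAST]
BLASTDB=$SHARED
BLASTMAT=$SHARED/data
[mpiBLAST]
Shared=$SHARED
Local=$LOCAL
EOF
grep -c '^\[' .ncbirc   # counts the section headers; should print 3
```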

Formatting a database


Before processing BLAST queries, the sequence database must be formatted with mpiformatdb. The command line syntax looks like this:

  mpiformatdb -N 25 -i nt 


The above command would format the nt database into 25 fragments, ideal for 25 worker nodes. mpiformatdb accepts the same command line options as NCBI's formatdb. See the README.formatdb file that comes with the NCBI BLAST distribution for more details.


  mpiformatdb reads the ~/.ncbirc file and creates the formatted database fragments in the shared storage directory.

Run MPIBLAST

The main gateway class is called NsClient.java; it wraps the different steps together. It is currently located at /users/xiaoqing/Client/src/NsClient.java


Before running any class, a few steps are required:

1. Create your .ncbirc and .mpiblastrc files in your home directory. Please take a look at xiaoqing's files (/users/xiaoqing/.mpiblastrc and /users/xiaoqing/.ncbirc).

2. Update your .bashrc/.bash_profile file to add LAM/MPI to the path.

 PATH=$PATH:/projects/local/x86.linux/LAM-MPI/7.1.1-rocks-build/bin
 LD_LIBRARY_PATH=/usr/lib:/usr/local/lib:${HOME}:/projects/local/x86.linux/LAM-MPI/7.1.1-rocks-build/lib
 export PATH LD_LIBRARY_PATH
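
A quick sanity check for step 2 (the install path is the one shown above; adjust it if your build lives elsewhere) verifies that the LAM bin directory actually ended up on PATH:

```shell
# Check that the LAM/MPI bin directory is on PATH after the edit above.
LAMBIN=/projects/local/x86.linux/LAM-MPI/7.1.1-rocks-build/bin
PATH=$PATH:$LAMBIN
case ":$PATH:" in
  *":$LAMBIN:"*) echo "LAM on PATH" ;;
  *)             echo "LAM missing from PATH" ;;
esac
```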


To run the NsClient, use

 java NsClient mpiblast -p blastp -d pdbaa.fa -i Protein7.aa -o output.txt

The program will do the following three steps:

1. Check the average load of each node and, based on the result, add the names of idle processors to a machines list file.

2. Check whether lamboot needs to be run; if so, run lamboot to start the LAM daemons.

3. Submit the job to the cluster.
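
Step 1 can be sketched in shell as follows. This is only an illustration: the node names, the load-report format, and the 0.5 threshold are all assumptions, and NsClient itself does this in Java; in reality the load figures would be gathered from the nodes (e.g. via ssh and uptime) rather than hard-coded.

```shell
# Step 1, sketched: keep nodes whose 1-minute load average is below a
# threshold and write their names to the machines file. The loads here
# are hard-coded sample data standing in for per-node measurements.
THRESHOLD=0.5
cat > loads.txt <<'EOF'
compute-0-2 0.03
compute-0-3 1.92
compute-0-4 0.11
EOF
awk -v t="$THRESHOLD" '$2 + 0 < t + 0 { print $1 }' loads.txt > machines
cat machines   # only the idle nodes remain
```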


Two helper shell scripts are produced at the same time.

1. runlamboot.sh

2. runmpiblast.sh


If running NsClient directly is not successful, you can run the two automatically generated scripts above and see whether they succeed.

NsClient is also part of a web service package and will be used to generate a web service. Sometimes a user may want to run LAM/MPI directly from the command line; see the instructions below.



Command Line mpiBLAST.



To use LAM/MPI from the command line, the first step is to use `qstat -f` to see which nodes are idle. Then add the node names to a file, say "machines", one node per line. It looks like the following:

 frontend-0  
 compute-0-10  
 compute-0-2
 compute-0-3
 compute-0-4
 compute-0-15
 compute-0-11
 compute-0-12
 compute-0-13 



Then run

 `lamboot -v machines`


The mpiblast command-line syntax is nearly identical to that of NCBI's blastall program. For example, running a query on 4 worker nodes (plus the two service processes described below, hence -np 6) would look like:

  mpirun -np 6 mpiblast  -p blastp -d pdbaa.fa -i Protein7.fasta -o blast_results.txt 


The above command queries the sequences in Protein7.fasta against the pdbaa database and writes results to the blast_results.txt file in the current working directory. The optional --config-file argument specifies the location of mpiblast.conf. To get the best performance, it is important to start at least one more process than the number of processors in the cluster, because one of the mpiBLAST processes is dedicated to scheduling, which is not CPU-intensive. Furthermore, mpiBLAST needs at least 3 processes to perform a search: one process performs file output and another schedules search tasks, while any additional processes actually perform search tasks.
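
Following the process-count rule above (one process per worker node, plus one for file output and one for scheduling), a small hypothetical wrapper can derive the -np value from the machines file. The node names below are sample data, and the final mpirun invocation is only echoed rather than executed:

```shell
# Derive -np from the machines file: one process per worker node,
# plus two service processes (file output and scheduling).
cat > machines <<'EOF'
compute-0-2
compute-0-3
compute-0-4
compute-0-15
EOF
WORKERS=$(wc -l < machines)
NP=$((WORKERS + 2))
# Echo the command that would be run; remove the echo to run it.
echo "mpirun -np $NP mpiblast -p blastp -d pdbaa.fa -i Protein7.fasta -o blast_results.txt"
```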
