mpiBLAST release announcement

Aaron Darling darling at cs.wisc.edu
Fri Feb 21 04:42:39 EST 2003

mathog at caltech.edu (David Mathog) wrote in message news:<20030214104034.7369b190.mathog at caltech.edu>...
> Is your "blastall" a shell that calls regular blastall on the compute
> nodes or did you actually manage to make the merge function coexist with the
> NCBI code?  If the latter I'm really, really, really, impressed, because
> for those of you who have not had the joy of modifying blastall let me tell
> you that the code is hideously complex.  

It's the latter.
One goal in developing mpiBLAST was to make it as simple to use as
possible.  To achieve that goal, we decided to keep its interface as
close to NCBI-BLAST as possible so that users would not need to learn
a new interface.

Each mpiBLAST worker process initializes the NCBI Library with the
appropriate set of command line arguments.  The workers execute the
BLAST search using the library functions, and the library is directed
to output results in ASN.1 format.  The ASN.1 results are communicated
to the master node, which reads them in using the NCBI library ASN.1
BLAST result reader.  Finally, the master node merges the result data
structures and calls the NCBI library's output function on them.
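
As a rough illustration of the merge step only, here is a minimal
Python sketch. It assumes each worker returns its hits already sorted
by descending score. The names Hit and merge_hits are hypothetical and
stand in for the NCBI library's result structures, which mpiBLAST
actually merges; this is not the mpiBLAST code.

```python
import heapq
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical stand-in for one alignment result from a fragment."""
    subject_id: str
    score: float
    fragment: int


def merge_hits(per_fragment_hits, max_hits=None):
    """Merge per-fragment hit lists into one list ordered by descending score.

    Assumes every input list is already sorted by descending score,
    as a per-fragment search result would be.
    """
    merged = list(heapq.merge(*per_fragment_hits, key=lambda h: -h.score))
    return merged if max_hits is None else merged[:max_hits]


if __name__ == "__main__":
    frag0 = [Hit("a", 90.0, 0), Hit("b", 50.0, 0)]
    frag1 = [Hit("c", 70.0, 1), Hit("d", 10.0, 1)]
    for hit in merge_hits([frag0, frag1], max_hits=3):
        print(hit.subject_id, hit.score, hit.fragment)
```

Merging structured results once and formatting afterwards is what lets
a single output routine serve every report format.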

Our initial approach was to merge the text file results, but our goal
of transparently supporting all of the NCBI BLAST output formats left
us with two options:  either implement text parsing and merging for
every possible output format, or directly output results using the
NCBI library.  The latter seemed easier, more maintainable, and more
interesting to program.

> PHIBLAST isn't supported here.  Does yours include it?

Our parallelization is limited to the BLAST search types that are
included in the NCBI blastall tool.

> > Because each node's segment of the database is smaller it can
> > usually reside in the buffer-cache, yielding a significant
> > speedup due to the elimination of disk I/O. 
> Right.  The benefits of file caching are not irrelevant for folks
> with just one machine sitting on their desk.  My package may also
> be used for "serial parallelization" to take advantage of this effect.
> That is, if a researcher has a 10000 entry query and wants to find
> those hits in the human genome on his workstation the database typically
> won't fit into memory and the search will take forever.  This same
> search can be speeded up immensely by fragmenting the database, running
> the same query to completion on each fragment, and then merging the
> results with blastmerge.  The fragmented method is faster so long as:
>   number of database fragments < ratio (uncached run time / cached run time)
> Ie, if splitting the database 3 ways makes it small enough to stay in
> cache, and the ratio is 30, searching the fragments sequentially will
> be 10x faster than searching the entire database at once.

That is a very good point.  Both of our parallel BLASTs can improve
single-node performance with database segmentation.  The NCBI-BLAST
implementation implicitly assumes that the entire database fits in
buffer-cache because it searches the entire database for any given
query before moving on to the next query.  When the DB is larger than
core memory, blocks read early in one pass tend to be evicted before
the next query needs them, so each query ends up re-reading the DB
from disk, which is the worst case for performance.
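
The break-even rule quoted above can be checked with a few lines of
Python. This is a sketch under stated assumptions: the ratio is read
as (uncached run time on the whole database) / (cached run time on
one fragment), which is the reading that reproduces the 10x figure in
the quoted example, and merge time is ignored. The function names are
hypothetical.

```python
def sequential_fragment_speedup(n_fragments, uncached_over_cached):
    """Estimated speedup from searching n cached fragments one after another.

    uncached_over_cached is assumed to be the whole-database uncached
    run time divided by the cached run time of a single fragment.
    Merge cost is ignored.
    """
    if n_fragments < 1:
        raise ValueError("n_fragments must be at least 1")
    return uncached_over_cached / n_fragments


def fragmenting_pays_off(n_fragments, uncached_over_cached):
    """True when the fragmented run is expected to be faster (speedup > 1)."""
    return sequential_fragment_speedup(n_fragments, uncached_over_cached) > 1.0


if __name__ == "__main__":
    # The quoted example: 3 fragments, ratio of 30.
    print(sequential_fragment_speedup(3, 30))
    print(fragmenting_pays_off(3, 30))
```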

> > It does not require a dedicated cluster.
> I can't speak for your implementation but here if any other jobs run
> at the same time their load must be exceedingly well balanced.
> Since the merge step cannot complete until all nodes
> finish a CPU hog on just one node will bog down the BLAST system.
> Moreover, these other jobs can't use too much memory either or they'll
> bump the blast databases out of cache.  Since parallel blast itself is
> pretty well balanced we allow two blastalls to run at once on each node,
> one at higher precedence (for shorter jobs) and another at lower precedence
> (for longer jobs).  No adverse interactions from doing so have shown
> up so far. 

mpiBLAST can do some basic load balancing across cluster nodes that
mitigates problems of imbalance.  The database can be fragmented into
a large number of fragments such that each fragment takes a short
period of time to search.  mpiBLAST will assign unsearched fragments
to workers as they complete each fragment search.  Thus, nodes that
are searching fragments quickly will get more work assigned to them.
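
The effect can be seen in a small simulation. The following Python
sketch is not mpiBLAST code; it assumes each fragment has a known
search cost and each worker a fixed relative speed, and the function
name simulate_dynamic_assignment is hypothetical. Whenever a worker
becomes free it receives the next unsearched fragment.

```python
import heapq


def simulate_dynamic_assignment(fragment_costs, worker_speeds):
    """Simulate handing out fragments to whichever worker is free first.

    fragment_costs: search cost of each fragment, in arbitrary time units.
    worker_speeds: relative speed of each worker (1.0 = nominal).
    Returns (finish_time, fragments_searched_per_worker).
    """
    # Min-heap of (time at which the worker becomes free, worker index).
    free_at = [(0.0, w) for w in range(len(worker_speeds))]
    heapq.heapify(free_at)
    counts = [0] * len(worker_speeds)
    finish_time = 0.0
    for cost in fragment_costs:
        now, worker = heapq.heappop(free_at)
        done = now + cost / worker_speeds[worker]
        counts[worker] += 1
        finish_time = max(finish_time, done)
        heapq.heappush(free_at, (done, worker))
    return finish_time, counts


if __name__ == "__main__":
    # Twelve equal fragments; the third worker runs at half speed,
    # for example because another job is sharing its CPU.
    finish_time, counts = simulate_dynamic_assignment([1.0] * 12, [1.0, 1.0, 0.5])
    print(finish_time, counts)
```

In this toy setup the slow worker searches 2 fragments and the other
two search 5 each, finishing at time 5.0.  A static split of 4
fragments per worker would instead wait 8.0 time units for the slow
worker.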

We have included a patch to NCBI formatdb that allows it to generate
more than 100 fragments.

Good hearing from you.  I'm interested in any further insights or
critiques you could provide.
