I`m looking into different parallelized BLASTs and would be really
interested in hearing your responses to David`s queries.
mathog at caltech.edu (David Mathog) wrote in message news:<20030214104034.7369b190.mathog at caltech.edu>...
> On 13 Feb 2003 10:16:15 -0000
>darling at cs.wisc.edu (Aaron Darling) wrote:
>> > The mpiBLAST team anounces the release of mpiBLAST, an MPI based parallel
> > implementation of NCBI BLAST. mpiBLAST is a pair of programs that replace
> > formatdb and blastall with versions that execute BLAST jobs in parallel on
> > a cluster of computers with MPI installed.
>> Is your "blastall" a shell that calls regular blastall on the compute
> nodes or did you actually manage to make the merge function coexist with the
> NCBI code? If the latter I'm really, really, really, impressed, because
> for those of you who have not had the joy of modifying blastall let me tell
> you that the code is hideously complex. I looked into that approach and
> gave up and took the easy way out - postprocessing the blastall
> output in a separate program. My parallel blast version has many more
> programs (and scripts) but otherwise sounds like it does pretty much what
> yours does. It may be obtained here:
>>ftp://saf.bio.caltech.edu/pub/software/molbio/parallelblast.tar>> This version uses PVM instead of MPI for running the compute node
> jobs. It doesn't do any message passing other than "start job" and
> "job is completed". There is a cgi front end which is designed for
> on site use - it would need minor modifications to allow off site users.
> The merge step is not built into blastall but carried out separately by
> "blastmerge", a program which was announced here some time ago.
>> There are also some patches to the BLAST distribution (2.2.3) which
> fix a few problems in rpsblast and extend the "gi" mechanism so that
> searches of nr/nt may be restricted by taxon id using only two files.
> (Ie, not a separate gi file for each taxon).
>> PHIBLAST isn't supported here. Does yours include it?
>> > Because each node's segment of the database is smaller it can
> > usually reside in the buffer-cache, yielding a significant
> > speedup due to the elimination of disk I/O.
>> Right. The benefits of file caching are not irrelevant for folks
> with just one machine sitting on their desk. My package may also
> be used for "serial parallelization" to take advantage of this effect.
> That is, if a researcher has a 10000 entry query and wants to find
> those hits in the human genome on his workstation the database typically
> won't fit into memory and the search will take forever. This same
> search can be speeded up immensely by fragmenting the database, running
> the same query to completion on each fragment, and then merging the
> results with blastmerge. The fragmented method is faster so long as:
>> number of database fragments < ratio (uncached run time / cached run time)
>> Ie, if splitting the database 3 ways makes it small enough to stay in
> cache, and the ratio is 30, searching the fragments sequentially will
> be 10x faster than searching the entire database at once.
>> The other advantages to fragmenting N ways are that if the database does
> need to load from local disk it does so N times faster than it would have
> on a single node. Also, and this is the primary reason I wrote my version,
> you can run databases which do not fit into a single node's memory. Our
> 20 node cluster has 20Gb of memory. Subtract 50 Mb for the OS and 100Mb
> for blastall (roughly) and it leaves 17 Gb to cache databases. That's
> more than big enough to hold nt, nr, and a couple of mammalian genomes.
> But 850Mb (the free amount on each node) isn't. When the databases
> finally outgrow cluster memory then you can add more nodes. With
> unfragmented databases you have to add memory to each machine, and
> if they won't support enough memory, replace the lot of them with a model
> that does.
>> > It does not require a dedicated cluster.
>> I can't speak for your implementation but here if any other jobs run
> at the same time their load must be exceedingly well balanced.
> Since the merge step cannot complete until all nodes
> finish a CPU hog on just one node will bog down the BLAST system.
> Moreover, these other jobs can't use too much memory either or they'll
> bump the blast databases out of cache. Since parallel blast itself is
> pretty well balanced we allow two blastalls to run at once on each node,
> one at higher precedence (for shorter jobs) and another at lower precedence
> (for longer jobs). No adverse interactions from doing so have shown
> up so far.
>> There is one other minor "gotcha" for potential users of
> either parallel BLAST. Some linux distributions come with "locate"
> which puts "slocate.cron" into /etc/daily. Even on our 1Gb compute
> nodes when slocate.cron runs it throws everything out of cache. That
> was causing the first job that ran afterwards to be abnormally
> slow since it had to reload the database fragments from local disk.
> Had the fragments been NFS mounted the hit would have been
> even greater. Compute nodes usually don't need reindexing that often
> so move that file to /etc/monthly. It would probably make sense to
> take it out of cron entirely.
>> David Mathog
>mathog at caltech.edu> Manager, Sequence Analysis Facility, Biology Division, Caltech