Many of the key sequence analysis packages are now threaded. That means
they can run on SMP machines and use multiple CPUs to speed things up. The
other type of multi-CPU "machine" in common use these days is a distributed
system, like a Beowulf cluster. In terms of $/Spec, distributed machines
tend to be a lot cheaper than the SMP equivalents, especially so when you
start looking at N >> 2. For this reason, many "supercomputers" are now
The database search algorithms are naturals for distributed calculations.
Recent versions of Fasta come with both threaded and PVM variants. I've
not seen a comparison of their performance, though. Has anybody tried it on
an N-node SMP machine vs. an N-node distributed machine, with equivalent
CPUs on both?
BLAST, too, seems like a good candidate for distributed computing. For
instance, imagine running BLAST against "nr" on an N-node distributed
machine:
1. format nr into N "equal" sized BLAST databases (for instance, by
   assigning sequence j to a database via modulo(j,N)).
2. run each query sequence on N machines, each with one chunk of the
   common database preloaded into memory.
3. merge the results from the N machines.
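
Step 1 can be sketched in a few lines of Python. This is only an
illustration of the modulo(j,N) assignment, not part of any BLAST
distribution; the function names read_fasta and partition_round_robin are
made up for the example, and it assumes plain FASTA input. Note that it
balances the number of sequences per chunk, not the number of residues,
which is why "equal" is in quotes above.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    parts: List[str] = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif header is not None and line:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def partition_round_robin(records: Iterable[Record], n: int) -> List[List[Record]]:
    """Assign record j to chunk j % n (hypothetical helper for step 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    chunks: List[List[Record]] = [[] for _ in range(n)]
    for j, record in enumerate(records):
        chunks[j % n].append(record)
    return chunks
```

Each chunk would then be written to its own FASTA file and formatted as a
separate BLAST database, one per node.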
Other than needing a correction for the true database size (the E-values
reported for each chunk depend on the size of the database searched), it
at least seems straightforward. However, while BLAST is available
threaded, I've not been able to find anything which appears to be for use
in distributed systems.
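
For what it's worth, here is a rough Python sketch of step 3 including the
database size correction. It assumes each node returns hits with an
E-value computed against its own chunk, and that the E-value scales
linearly with the number of residues searched. That is an approximation,
since real BLAST statistics use effective lengths. The Hit type and the
function names are hypothetical, invented for this example.

```python
from typing import List, NamedTuple, Optional, Sequence


class Hit(NamedTuple):
    subject_id: str
    bit_score: float
    evalue: float  # as reported against a single chunk


def rescale(hit: Hit, chunk_residues: int, total_residues: int) -> Hit:
    """Scale a per-chunk E-value up to the full database size.

    Assumes E is proportional to the number of residues searched.
    """
    return hit._replace(evalue=hit.evalue * total_residues / chunk_residues)


def merge_hits(per_chunk_hits: Sequence[Sequence[Hit]],
               chunk_residues: Sequence[int],
               top: Optional[int] = None) -> List[Hit]:
    """Combine hits from all chunks into one list, best E-value first."""
    total = sum(chunk_residues)
    merged = [
        rescale(hit, size, total)
        for hits, size in zip(per_chunk_hits, chunk_residues)
        for hit in hits
    ]
    # Lowest E-value first; ties broken by higher bit score.
    merged.sort(key=lambda h: (h.evalue, -h.bit_score))
    return merged if top is None else merged[:top]
```

An alternative that avoids rescaling afterwards would be to tell each
chunk search the size of the full database up front, if the BLAST version
in use has an option for setting the effective database length.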
Is there a distributed implementation of BLAST as well?
If not, is it because it's been tried and failed, or because nobody has
attempted it yet?
mathog at seqaxp.bio.caltech.edu
Manager, sequence analysis facility, biology division, Caltech