Announcing Parallelhmmer

David Mathog mathog at caltech.edu
Tue Jul 27 06:50:22 EST 2004

A modified version of Sean Eddy's HMMER 2.3.2 is available here:


Modifications include:

1.  The PVM parts have been reworked so that they can use split
    databases, which greatly reduces the CPU load on the master as
    well as network traffic.  In the original 2.3.2 PVM variant
    on our 20 node beowulf, some queries overloaded the master, resulting
    in NFS, yp, and other failures on the compute nodes.  This
    has not been observed yet with the split variant, but I can't
    say what might happen if you have 2000 compute nodes.  Here
    are a few example run times (split means PVM on 20 compute
    nodes,  otherwise on one compute node with database already
    stored in disk cache):

    hmmpfam of A1HU.pfa against  pfam_fs:                102 seconds
    hmmpfam of A1HU.pfa against (split) pfam_fs:           6 seconds
    hmmsearch of Peptidase_M28 against swissprot:        719 seconds
    hmmsearch of Peptidase_M28 against (split) swissprot: 40 seconds
    hmmsearch of Peptidase_M28 against a 6 frame
     translation of the (split) D. melanogaster genome:  405 seconds

2.  Some of the code has been modified to make it run a little
    faster, at least on Athlons.

3.  It can now read BLAST formatted sequence databases directly
    (allowing it to use the same databases as my parallelblast
    or, I suspect, those that MPIBLAST utilizes).
    This is implemented with the blastdb_api software already
    released.  Taxonid restriction is also supported to the extent
    possible; it is limited by the NCBI taxon dmp files currently
    assigning only one taxon to each gi, even when that gi
    describes multiple species.

4.  A cgi script "hmmercontrol.pl" is supplied so that all the
    HMMER programs may be run through the web, and most command line
    options have been implemented.
    Note that this one was written for our needs - the current
    handling of account names and email addresses will NOT be
    sufficient if you want to serve off site users, although such 
    changes would not be difficult to make.  You will definitely
    need to modify the configuration lines at the top, since much
    of that information is site specific.  You might also want to
    give local users higher job priorities.  It uses SGE but PBS
    or any other queueing system should work as well.

    It does not support the graphics options used in the cgi
    supplied by the PFAM/HMMER folks.  However, it does support
    HMMPFAM searches on 6 frame translated nucleic acid databases
    - sometimes slowly.  This type of search takes 2 hours on
    20 Athlon 2200MP nodes against a mammalian genome, at 99% cpu
    usage on each node, and about 11 seconds against the E. coli genome.

5.  Man pages have been modified to show the new options present in
    most of the programs.  There was no current man page for
    sreformat so I could not add the new switches --omit and --retain.

6.  HMMSEARCH has been modified so that it may optionally emit in
    fasta format the hits it finds.  These may then be fed directly
    into HMMALIGN or some separate alignment program without having
    to go back and extract each hit from the database.

See AAAREADME.TXT for complete installation instructions.
In a nutshell:
    A.  download hmmer 2.3.2
    B.  unpack the parallelhmmer and copy various files over
        those in the 2.3.2 distribution.
    C.  ./configure --enable-pvm --enable-lfs --prefix=/usr/common
        (or as appropriate for your site)
    D.  make
    E.  make install
    F.  move the PVM slaves and the extra scripts to their
        proper locations
    G.  split databases out across PVM nodes (PFAM tools supplied here,
        BLAST tools in the parallelblast package).
    H.  set up PVM, test the PVM programs.
    I.  set up the *db.txt files that the cgi script needs
        for a description of your split databases.
    J.  customize, install, and test the cgi script.
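
To give a rough idea of what step G involves, here is a minimal Python
sketch that deals FASTA records out round-robin into N parts, one per
compute node.  This is an assumption-laden illustration, not the tools
supplied with parallelhmmer or parallelblast: the real tools may balance
the parts differently, and the function names here are hypothetical.

```python
# Illustrative sketch only: NOT the splitting tools shipped with
# parallelhmmer or parallelblast.  Round-robin dealing is an assumption.


def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def split_round_robin(records, n_parts):
    """Deal records into n_parts lists, one list per compute node."""
    parts = [[] for _ in range(n_parts)]
    for i, record in enumerate(records):
        parts[i % n_parts].append(record)
    return parts


if __name__ == "__main__":
    demo = [">a", "ACGT", ">b", "GG", "TT", ">c", "A"]
    for node, part in enumerate(split_round_robin(read_fasta(demo), 2)):
        print("node", node, [header for header, _ in part])
```

Round-robin keeps the record counts nearly equal across parts, but not
necessarily the residue counts; a database with a few very long sequences
would need a size-aware scheme to keep the nodes evenly loaded.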

Please report bugs, comments, etc.


David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
