FASTA-SWAP and FASTA-PAT: Gene Function Identification
by Searching Protein Queries against Pattern Databases
For sequences that neither BLAST nor FASTA database searches can identify,
we have developed two new pattern tools, FASTA-SWAP and
FASTA-PAT, which are modified versions of Bill Pearson's FASTA
program. Unlike profiles, these tools represent each aligned position as
the presence or absence of each amino acid. They are fundamentally
different from the earlier FASTAPAT and BLASTPAT because they use our
new binary representation, in which each of the one million possible
combinations of aligned amino acids is coded by a unique number. This code
allows compact representation of any multiple alignment.
Users can currently search nine databases of multiple alignments:
- EC, a database of multiple alignments of 15,000 sequences
with known EC numbers (this is the database with the highest
information content);
- the Pattern Induced Multiple Alignment (PIMA) Pattern Database,
a comprehensive multiple alignment database with 22,422 patterns as
aligned using PIMA;
- three EntrezClus10 databases, sequence clusters from the PIMA alignment
databases with >= 10 sequences, aligned using PIMA, CLUSTALW or the
MAP program;
- PIR-ALN database;
- BLOCKS database;
- PRINTS database; and
- FSSP, Families of Structurally Similar Proteins database.
Searches can be performed using the Baylor College of Medicine
(Houston, Texas) WWW Search Launcher:
http://gc.bcm.tmc.edu:8088/search-launcher/launcher.html
A sequence is searched against these databases using new
log-odds scoring matrices (rapidly calculated "on the fly") that utilize
the one million combinations. In contrast to standard scoring
matrices like PAM or BLOSUM, these new pattern-based matrices
distinguish between conserved and variable positions, increasing
search sensitivity and selectivity. Generally we recommend the more
precise but somewhat slower diagonal search + Smith-Waterman
refinement as implemented in FASTA-SWAP. For queries longer than 1000
residues, however, FASTA-PAT, which uses hashing + Smith-Waterman
refinement, may be considerably faster.
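To show why a pattern-based log-odds score can separate conserved from
variable positions, here is a generic Python sketch. It is not the scoring
scheme used by FASTA-SWAP or FASTA-PAT, whose formula is not given here: the
uniform background frequencies, the pseudocount value, the residue ordering
and the function name log_odds are all assumptions made for illustration.

```python
import math

# Assumed ordering and a uniform background; real background frequencies differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def log_odds(residue, code, pseudo=0.01):
    """Score a query residue against a coded column as log2(p_column / p_background).

    Illustrative model only: the residues present in the column share a
    probability mass of (1 - pseudo), and the absent residues share pseudo.
    """
    if residue not in BACKGROUND:
        return 0.0  # assumption: unknown symbols score neutrally
    present = [aa for i, aa in enumerate(AMINO_ACIDS) if (code >> i) & 1]
    n = len(present)
    if n == 0 or n == len(AMINO_ACIDS):
        return 0.0  # an empty or fully variable column carries no information
    if residue in present:
        p = (1.0 - pseudo) / n
    else:
        p = pseudo / (len(AMINO_ACIDS) - n)
    return math.log2(p / BACKGROUND[residue])


if __name__ == "__main__":
    conserved = 1 << AMINO_ACIDS.index("A")  # column containing only A
    variable = (1 << 10) - 1                 # column containing ten residues, A among them
    print(log_odds("A", conserved))          # match at a conserved position
    print(log_odds("A", variable))           # match at a variable position
    print(log_odds("C", conserved))          # mismatch at a conserved position
```

In this toy model a match at a conserved position scores higher than a match
at a variable one, and a mismatch at a conserved position is penalized,
a distinction that a single fixed PAM or BLOSUM matrix cannot make per
position.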
A more detailed description of these programs is available
on our WWW pages:
http://dot.imgen.bcm.tmc.edu:9331/seq-search/Help/fastpat.html
_________________________________________________________________________
Steve (Istvan) Ladunga, Brent A. Wiese, and Randall F. Smith
Human Genome Center, Department of Molecular and
Human Genetics and Department of Cell Biology,
Baylor College of Medicine,
Houston, TX 77030, USA
istvanl at bcm.tmc.edu; Phone: (713) 798 8089, FAX: (713) 798 5386
_________________________________________________________________________