FASTA-SWAP searches of a protein sequence against pattern databases

Istvan Ladunga istvanl at bcm.tmc.edu
Mon Mar 4 13:54:20 EST 1996


FASTA-SWAP and FASTA-PAT: Gene Function Identification 
by Searching Protein Queries against Pattern Databases
 
 
For sequences that neither BLAST nor FASTA database searches can
identify, we have developed two new pattern tools, FASTA-SWAP and
FASTA-PAT, which are modified versions of Bill Pearson's FASTA
program. Here, unlike in profiles, we represent each aligned position
by the presence or absence of each amino acid.  These tools are
fundamentally different from the earlier FASTAPAT and BLASTPAT because
we use our new binary representation, in which each of the one million
possible combinations of aligned amino acids is coded by a unique
number. This code allows compact representation of any multiple
alignment.
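
As an illustration only, the presence/absence coding can be sketched
in Python as one bit per amino acid, so that a column of a multiple
alignment becomes a single integer. With 20 amino acids this gives
2^20 = 1,048,576 possible codes, which matches the "one million"
figure above. The bit order and the function names are assumptions
made for this sketch; the actual numbering used by FASTA-SWAP and
FASTA-PAT is not described here.

```python
# Assumed ordering of the 20 standard amino acids (one bit each).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BIT = {aa: 1 << i for i, aa in enumerate(AMINO_ACIDS)}


def encode_column(column):
    """Return an integer whose set bits mark the residues present in a column."""
    code = 0
    for residue in column:
        # Gaps and non-standard symbols contribute no bit.
        code |= BIT.get(residue.upper(), 0)
    return code


def decode_column(code):
    """Return the residues present in a column code, in AMINO_ACIDS order."""
    return "".join(aa for aa in AMINO_ACIDS if code & BIT[aa])


def encode_alignment(rows):
    """Encode every column of an alignment given as equal-length strings."""
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all aligned rows must have the same length")
    return [encode_column(column) for column in zip(*rows)]


if __name__ == "__main__":
    alignment = ["MKV-L", "MRVAL", "MKIAL"]
    for code in encode_alignment(alignment):
        print(code, decode_column(code))
```

Under this scheme an alignment of any depth is stored as one integer
per column, regardless of how many sequences it contains.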
 
Users can currently search 9 databases of multiple alignments:
 
- EC, a database of multiple alignments of 15,000 sequences 
  with known EC numbers (this is the database with the highest 
  information content);
 
- the Pattern Induced Multiple Alignment (PIMA) Pattern Database,
  a comprehensive multiple alignment database with 22,422 patterns as
  aligned using PIMA;
 
- three EntrezClus10 databases, sequence clusters from the PIMA alignment
  databases with >= 10 sequences, aligned using PIMA, CLUSTALW or the
  MAP program;
 
- PIR-ALN database;
 
- BLOCKS database;
 
- PRINTS database; and
 
- FSSP, Families of Structurally Similar Proteins database.
 
 
Searches can be performed using the Baylor College of Medicine
(Houston, Texas) WWW Search Launcher:
 
   http://gc.bcm.tmc.edu:8088/search-launcher/launcher.html
 
 
Searches of a sequence against these databases use new log-odds
scoring matrices (rapidly calculated "on the fly") that utilize the
one million combinations.  In contrast to standard scoring matrices
like PAM or BLOSUM, these new pattern-based matrices distinguish
between conserved and variable positions, increasing search
sensitivity and selectivity. Generally we recommend the more precise
but somewhat slower diagonal search + Smith-Waterman refinement as
implemented in FASTA-SWAP. For queries longer than 1000 residues,
however, FASTA-PAT, which uses hashing + Smith-Waterman refinement,
may be considerably faster.
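
To show why a pattern-based log-odds score can separate conserved from
variable positions, here is a minimal Python sketch of the generic
log-odds form, log2(P(residue | column code) / P(residue)). It is not
the matrix calculation used by FASTA-SWAP or FASTA-PAT, which is not
given in this message. The function name, the (column code, residue)
training pairs, the background frequencies and the pseudocount are all
assumptions made for illustration.

```python
import math
from collections import Counter


def log_odds_scores(observations, background, pseudocount=1.0):
    """Return {(column_code, residue): score in bits}.

    observations: hypothetical (column_code, residue) pairs, i.e. which
        residue was seen aligned to a column with that presence/absence code.
    background: residue -> background frequency (assumed to sum to 1).
    """
    pair_counts = Counter(observations)
    code_totals = Counter()
    for (code, _residue), n in pair_counts.items():
        code_totals[code] += n

    scores = {}
    for code, total in code_totals.items():
        for residue, p_bg in background.items():
            # The pseudocount is spread according to background frequency,
            # so unseen residues get a finite negative score.
            p_obs = (pair_counts[(code, residue)] + pseudocount * p_bg) / (
                total + pseudocount
            )
            scores[(code, residue)] = math.log2(p_obs / p_bg)
    return scores


if __name__ == "__main__":
    # Toy two-letter alphabet: code 1 = only "A" present, code 3 = "A" and "C".
    background = {"A": 0.5, "C": 0.5}
    observations = [(1, "A")] * 3 + [(3, "A"), (3, "C")]
    for key, score in sorted(log_odds_scores(observations, background).items()):
        print(key, round(score, 3))
```

In this toy data the conserved column code gives a clearly positive
score for the conserved residue and a clearly negative score for the
other one, while the variable column code scores both residues near
zero.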

A more detailed description of these programs is available 
at our WWW pages:
  http://dot.imgen.bcm.tmc.edu:9331/seq-search/Help/fastpat.html
_________________________________________________________________________

 
Steve (Istvan) Ladunga, Brent A. Wiese, and Randall F. Smith
 
Human Genome Center, Department of Molecular and 
Human Genetics and Department of Cell Biology, 
Baylor College of Medicine, 
Houston, TX 77030, USA
istvanl at bcm.tmc.edu; Phone: (713) 798 8089, FAX: (713) 798 5386
_________________________________________________________________________

 




