************************************************************
Announcing Rel. 4.0 of the MBCRR's Protein Pattern Library
and Search Tool (PLSEARCH)
************************************************************
The MBCRR Protein Pattern Library is a database of
"consensus-like" protein sequence patterns, each pattern
derived from a set of homologous sequences in the SWISS-PROT
Protein Sequence Database. Families of related protein
sequences are identified by running the entire SWISS-PROT
database against itself (using BLAST, the NLM/NCBI's new
high-speed similarity search tool); the resulting set of
pair-wise scores is then clustered into families using a
maximal-linkage clustering algorithm. A pattern construc-
tion algorithm (Smith and Smith 1990, PNAS 87:118-122) is
then used to generate a single pattern for each family; the
patterns, which we call amino acid class covering (AACC)
patterns, are functionally equivalent to 'regular expres-
sion' patterns and represent the conserved primary sequence
elements common to all members of each family. This new
release of the pattern library (based on SWISS-PROT rel. 13)
contains 5199 entries: 2026 patterns derived from all fami-
lies of 2 or more members (encompassing 10664 of the 13837
sequences in SWISS-PROT rel. 13) plus the remaining 3173
"non-related" sequences (i.e. from those loci that did not
cluster into any family).
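The clustering step described above can be illustrated with
a small sketch. It is not the MBCRR code: it assumes that
"maximal-linkage" means complete linkage (two clusters may
merge only if every cross-cluster pair scores well), and the
sequence names, scores and threshold below are invented for
illustration.

```python
from itertools import combinations

def complete_linkage_families(items, score, threshold):
    """Greedy agglomerative clustering with complete linkage.

    `score` maps frozenset({a, b}) to a pairwise similarity;
    missing pairs are treated as 0 (no detectable similarity).
    """
    clusters = [frozenset([x]) for x in items]

    def linkage(a, b):
        # Similarity of two clusters is the score of their
        # least similar cross-cluster pair.
        return min(score.get(frozenset((x, y)), 0.0) for x in a for y in b)

    while len(clusters) > 1:
        best = max(combinations(clusters, 2), key=lambda p: linkage(*p))
        if linkage(*best) < threshold:
            break  # no remaining pair of clusters is similar enough
        clusters = [c for c in clusters if c not in best] + [best[0] | best[1]]
    return clusters

# Hypothetical pairwise scores (stand-ins for BLAST scores).
scores = {
    frozenset("AB"): 90, frozenset("AC"): 80, frozenset("BC"): 85,
    frozenset("DE"): 70, frozenset("CD"): 20,
}
families = complete_linkage_families("ABCDEF", scores, threshold=50)
multi_member = [f for f in families if len(f) >= 2]   # would get patterns
singletons = [f for f in families if len(f) == 1]     # "non-related"
```

With these toy scores, A, B and C form one family, D and E
form another, and F remains an unclustered singleton, which
mirrors the split between pattern entries and "non-related"
sequences in the library.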
The MBCRR distributes the pattern library with a
dynamic programming-based search tool (PLSEARCH) for match-
ing and aligning newly generated protein sequences against
the pattern database. We have shown that covering patterns
can be more diagnostic for family membership than any of the
individual sequences used to construct a pattern (see Smith
and Smith, 1990); pattern searches can therefore be more
sensitive than traditional sequence vs. sequence database
search tools.
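To give a feel for what a class-covering pattern looks like
as a regular expression, here is a minimal sketch. It is an
assumption-laden toy, not the Smith and Smith algorithm and
not PLSEARCH: the amino acid classes are invented for this
example, the input sequences are assumed to be pre-aligned
with no gaps, and matching uses a plain regular expression
rather than dynamic programming.

```python
import re

# Hypothetical amino acid classes, chosen for illustration only.
CLASSES = ["DE", "KRH", "ST", "NQ", "FWY", "AVLIM"]

def covering_pattern(aligned):
    """Build a regex from equal-length, gap-free aligned sequences."""
    parts = []
    for column in zip(*aligned):
        residues = set(column)
        if len(residues) == 1:
            parts.append(column[0])      # fully conserved position
            continue
        covers = [c for c in CLASSES if residues <= set(c)]
        if covers:
            # Smallest class that covers every residue in the column.
            parts.append("[" + min(covers, key=len) + "]")
        else:
            parts.append(".")            # no single class covers the column
    return "".join(parts)

family = ["GKSTLA", "GRSSIA", "GKTTVA"]   # made-up family members
pattern = covering_pattern(family)

is_member = re.fullmatch(pattern, "GHSTMA") is not None
is_outsider = re.fullmatch(pattern, "GDSTMA") is not None
```

Here the sequence GHSTMA matches the pattern even though it
is not identical to any single family member at the second
and fifth positions, while GDSTMA is rejected because D falls
outside the class observed at the second position.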
Also included in the package is our new multi-sequence
alignment program (PIMA: Pattern-Induced Multi-Alignment).
This program is now being used routinely by the Human Retro-
virus and AIDS Sequence Database Group (Los Alamos Natl.
Labs) to multi-align HIV protein sequences for phylogenetic
analyses.
PLSEARCH is written in 'C' and can run under both Unix
and VMS operating systems; PIMA employs Unix shell scripts
and thus is currently a Unix-only implementation.
The entire package is available electronically and is
free of charge to non-profit organizations (commercial users
must arrange payment of a distribution fee). Copies can be
obtained:
1) directly from the MBCRR via INTERNET anonymous ftp:
mbcrr.harvard.edu = 134.174.51.4; the package is in a
single compressed tar file in the 'plsearch' sub-
directory,
2) by electronic mail from the Univ. of Houston Genbank-
Server: genbank-server@bchs.uh.edu (INTERNET) or
genbank-server%bchs.uh.edu@cunyvm (BITNET/EARN).
Send a mail message containing the line "SEND UNIX HELP"
to start; the files are in the Unix area and are uuen-
coded, compressed text files of approximately 300K each.
The package is also available in the same form via
anonymous FTP to lavaca.uh.edu, 129.7.1.19, in
~ftp/pub/genbank-server/Unix, as plsrchaa, plsrchab,
plsrchac, etc.
or 3) by electronic mail via the EMBL File Server:
send the message "HELP SOFTWARE" to netserv@embl.bitnet
to obtain specifics on retrieving the files.
When using anonymous FTP or e-mail, please transfer files
during off-hours (after 5 PM, the remote machine's local
time); when e-mailing, ask for only a few files at once to
avoid filling up your mail spool area or mailbox.
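For the pieces fetched by FTP (plsrchaa, plsrchab, ...), the
following sketch joins them back into one file. It rests on
an assumption not stated above: that the pieces are
consecutive chunks of a single uuencoded file and must be
concatenated in name order before decoding. The function
name and the output file name are hypothetical.

```python
from pathlib import Path

def join_parts(directory, prefix="plsrch", out_name="plsearch.uue"):
    """Concatenate split pieces (plsrchaa, plsrchab, ...) in name order."""
    directory = Path(directory)
    # Keep only files named <prefix> plus a two-character suffix.
    parts = sorted(
        p for p in directory.iterdir()
        if p.name.startswith(prefix) and len(p.name) == len(prefix) + 2
    )
    out_path = directory / out_name
    with open(out_path, "wb") as out:
        for part in parts:
            out.write(part.read_bytes())
    return [p.name for p in parts], out_path
```

The joined file would then still need to be uudecoded and
uncompressed with the usual Unix utilities.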
------------------------------------------------------------
Randall Smith and Temple Smith
Molecular Biology Computer Research Resource,
Galleria Level 1
Dana-Farber Cancer Institute and School of Public Health
Harvard University
44 Binney St., Boston MA 02115 USA
(617)732-3746
INTERNET: rsmith@mbcrr.harvard.edu
BITNET: rsmith%mbcrr@husc6.bitnet
------------------------------------------------------------