The dFLASH Group wishes to announce release 1.1.0 of the dFLASH electronic
mail server. Beginning with this release of the server, we will be supporting
the latest release of the GENBANK, PIR and SWISSPROT databases.
In particular, users can now carry out searches in
GENBANK Release 85 (September 30, 1994)
PIR Release 42 (September 30, 1994) --> DEFAULT Database <--
SWISSPROT Release 30 (October 30, 1994)
Full bibliographic references can optionally be included with the computed
alignments, for all three databases.
Notice that a number of necessary changes and additions have been incorporated
in the "query language". For example, since we now support a larger set of
databases, "target protein" is not a valid directive anymore! The appended help
file describes the changes and available functions in detail.
NEW FEATURES:
o the reported results can now be sorted using a sorting key specified by
the user via the "query language"
o a smart-email filter has been implemented: various specification
errors are now caught and corrected automatically; notifications
are sent to the user for all actions taken.
It is our intention to update the server with the latest release of each of the
above databases within the first two weeks after it becomes available.
The server is accessible through the Internet and is now operating 24 hours a
day, 7 days a week and can be accessed both directly and through "Grail" of the
Oak Ridge National Lab.
Sincerely,
The dFLASH Group
------------------------------> CUT HERE <-----------------------------------
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! The dFLASH server now supports the GenBank, PIR and SWISSPROT databases. !!
!! The supported releases are: !!
!! GENBANK Release 85 (September 30, 1994) !!
!! PIR Release 42 (September 30, 1994) --> DEFAULT Database <-- !!
!! SWISSPROT Release 30 (October 30, 1994) !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! N O T A B E N E !!
!! The dFLASH server is still under development. If some of the answers do !!
!! not make sense it is very likely that this is due to a bug in our code. !!
!! Please, email bug reports and comments to dflash at watson.ibm.com with !!
!! subject line "bug" or "comments". !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Dear User, welcome to Release 1.1.0 of the dFLASH server!
The dFLASH server is a "homologous sequence retrieval" program for PROTEIN
and DNA sequences.
dFLASH is a parallel system running on an IBM SP/x architecture. Intra-node
communication, evidence integration and alignment are performed in parallel.
The system has been implemented using IBM's Concert/C language for distributed
programming. The server is available 24 hours a day, 7 days a week and can be
accessed both directly and through "Grail" of the Oak Ridge National Lab.
Incremental changes and improvements made to the server will be reflected
in the "Message of the day" at the beginning of this help file: we recommend
that users periodically issue a `send help' request for up to date information
on the server.
For the moment, we can process requests originating from email addresses of
the form
user@[machine.][subdomain.]institution.type
or
user%machine@[machine.][subdomain.]institution.type
or
"string::user"@[machine.][subdomain.]institution.type
We plan to further expand the accepted formats, depending on demand.
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
HOW TO USE THE SERVER: You can use the dFLASH facilities by sending an email
---------------------- message with the appropriate syntax to the address
"dflash at watson.ibm.com" (without the quotes).
SUBJECT LINE: It is important that the "Subject" line of your message contain
------------- one of: { dflash, dFlash, dFLASH, DFLASH }. Messages whose
subject line does NOT conform to this rule, **WILL BE LEFT
UNPROCESSED**. The reason for that restriction is that we want
to be able to automatically distinguish between messages that are
addressed to the server and those that are meant for one of the
group members.
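A client-side check of this rule could be sketched in Python as follows. The
function name is_server_subject is hypothetical, and the sketch assumes that
the accepted spelling may appear anywhere in the subject line; the server's
actual matching rule is not spelled out above.

```python
# Spellings accepted by the server, as listed in the help text.
ACCEPTED_SUBJECT_WORDS = ("dflash", "dFlash", "dFLASH", "DFLASH")


def is_server_subject(subject: str) -> bool:
    """Return True if the subject contains one of the accepted spellings.

    Assumption: a plain substring match is sufficient.
    """
    return any(word in subject for word in ACCEPTED_SUBJECT_WORDS)
```

Any other spelling (for example "Dflash") is not in the list and is rejected
by this sketch.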
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
MESSAGE FORMAT: The typical message-body of an email request looks as follows
---------------
BLOSUM 62 (optional | DIRECTIVE)
VERBOSE 10 20 (optional | DIRECTIVE)
SEQUENCES 100 (optional | DIRECTIVE)
ALIGNMENTS 50 (optional | DIRECTIVE)
THRESHOLD 30 (optional | DIRECTIVE)
KEY XMATCH (optional | DIRECTIVE)
SOURCE PROTEIN (optional | DIRECTIVE)
TARGET SP (optional | DIRECTIVE)
BEGIN (mandatory | DIRECTIVE)
>A_ONE_LINE_TEST_SEQ_LABEL (mandatory -- notice the '>' )
a_sequence_of_{amino_acids,nucleic_acids,spaces,tabs}
1 (mandatory terminator)
The PAM/BLOSUM, VERBOSE, SEQUENCES, ALIGNMENTS, THRESHOLD, KEY, SOURCE and
TARGET directives can appear in any order but they *must* precede the BEGIN
directive. The BEGIN line must be followed by the LABEL line which in turn
should be followed by the test sequence.
The test sequence should contain at least 18 (protein) or 54 (DNA) and not
more than 1500 amino acid or nucleotide characters. It may, however, contain
ANY NUMBER of CARRIAGE RETURN, TAB and SPACE characters; these are of course
not counted when computing the length of the test sequence. There is NO
case sensitivity in the label and the test sequence itself. If the test
sequence is longer than 1500 characters, the e-mail filter will truncate it to
the first 1500 characters and will send a note to that effect to the originator
of the query; the filter will then submit the truncated sequence to the search
engine.
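The length rules above can be checked before a request is mailed. The
following Python sketch is illustrative only: prepare_sequence is a
hypothetical helper, and raising an error for a sequence that is too short is
an assumption, since the text above does not say how the server reacts in
that case.

```python
# Length limits taken from the help text.
MIN_RESIDUES = {"protein": 18, "dna": 54}
MAX_RESIDUES = 1500


def prepare_sequence(raw: str, source: str = "protein"):
    """Strip whitespace, check the minimum length, truncate to 1500.

    Returns (sequence, was_truncated). Carriage returns, tabs, spaces and
    newlines are removed first, so they do not count toward the length.
    """
    residues = "".join(raw.split())
    if len(residues) < MIN_RESIDUES[source]:
        # Assumption: we refuse to send sequences that are too short.
        raise ValueError("test sequence is too short")
    truncated = len(residues) > MAX_RESIDUES
    return residues[:MAX_RESIDUES], truncated
```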
NOTA BENE: The words appearing on the lines marked DIRECTIVE above can be in
---------- lower case or upper case; in other words, you can have pam or PAM,
threshold or THRESHOLD, alignments or ALIGNMENTS, etc. However,
something like ThReShOlD will not work.
The directive pertaining to the scoring matrix allows the user to specify
the matrix to be used for computing the alignment scores. You can use either
the word PAM followed by a space and the desired distance, or the word BLOSUM
followed by space and the desired distance. Examples: PAM 250, BLOSUM 62 etc.
If no matrix directive is included in the message, PAM 250 is used as the
default. Depending on the value of the TARGET directive (see below), the
matrix directive, if present, may be ignored.
The VERBOSE line allows the sender to also retrieve the data about authors,
dates, entries, superfamilies etc. that are contained in the original PIR,
SwissProt and GenBank databases. This directive accepts one OR two arguments;
for example:
verbose 15 25
means "send me the text data for the sequences occupying positions 15 through 25
in the final ranking." On the other hand,
verbose 15
means "send me the text data for the sequences occupying the first 15 positions
in the final ranking." If no verbose line appears, no citation data is sent.
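The two forms of the VERBOSE directive map to a range of ranks as in the
following Python sketch. The function verbose_range is a hypothetical name
used for illustration only; it is not part of the server.

```python
def verbose_range(args):
    """Map VERBOSE arguments to an inclusive (first, last) range of ranks.

    One argument N means positions 1 through N; two arguments A and B
    mean positions A through B.
    """
    if len(args) == 1:
        return (1, args[0])
    if len(args) == 2:
        return (args[0], args[1])
    raise ValueError("VERBOSE accepts one or two arguments")
```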
The SEQUENCES line allows one to restrict the reported sequences to the
given number. This directive controls the number of entries in the ``short
list'' of recovered database sequences only. If no SEQUENCES line is given,
the server code will set it to an appropriate default value (100).
The ALIGNMENTS line allows one to restrict the reported alignments to the
given number. If no ALIGNMENTS line is given, the server code will set it to
an appropriate default value (100). The ALIGNMENTS value cannot exceed 5000.
Values larger than 5000 are reduced to 5000.
The THRESHOLD line allows one to restrict the number of reported sequences
(and thus alignments) to only those whose Score exceeds the given THRESHOLD
value. If no THRESHOLD line is given the server code will set it to an
appropriate default value. The default values are 50 for DNA sequences, and 80
for protein sequences. There is also a *hard* threshold value of 40 for DNA,
and 30 for PROTEIN sequences; if the user-requested values are smaller than
these hard-thresholds, the requested threshold will be increased accordingly.
NOTA BENE: (1) if the THRESHOLD value is too small, you are running the danger
---------- of upsetting your mailer program since chances are that you will
receive a very big file as a reply from the server.
(2) if the THRESHOLD is too high the list of recovered entries
will be empty, or very short; you should decrease the threshold's
value and resubmit your query.
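The interplay of the default and hard threshold values can be summarized by
the following Python sketch. The function effective_threshold is a
hypothetical name; only the numeric values come from the text above.

```python
# Values quoted in the help text.
DEFAULT_THRESHOLD = {"dna": 50, "protein": 80}
HARD_THRESHOLD = {"dna": 40, "protein": 30}


def effective_threshold(source, requested=None):
    """Return the threshold that would apply to a query.

    With no THRESHOLD line the default is used; a requested value below
    the hard threshold is raised to the hard threshold.
    """
    if requested is None:
        return DEFAULT_THRESHOLD[source]
    return max(requested, HARD_THRESHOLD[source])
```

For instance, a protein query with "threshold 10" would be searched with a
threshold of 30 under these rules.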
The KEY line allows the user to specify the key to be used when sorting the
results (retrieved sequences) corresponding to a submitted search request. The
keyword KEY can be followed by one of { SCORE,score, LENGTH,length, PEAK,
peak, GAP,gap, MATCH,match, XMATCH,xmatch }. By setting KEY to one of
{SCORE,score} the user indicates that the retrieved sequences should be sorted
in decreasing order of total computed score. By setting KEY to one of {LENGTH,
length} the user indicates that the retrieved sequences be sorted in decreasing
order of their length. Setting KEY to one of {PEAK,peak} will result in the
retrieved sequences being sorted in decreasing order of the maximum score value
over *any* 18(=proteins) or 54(=dna) residue window of the recovered match.
Setting KEY to one of {GAP,gap} will result in the retrieved sequences being
sorted in decreasing order of the largest gap inserted to obtain the best
alignment with the query strand. Setting KEY to one of {MATCH,match} will
result in the retrieved sequences being sorted in decreasing order of the total
(=conservative+exact) number of matches with the query strand. Finally, setting
KEY to one of {XMATCH,xmatch} will sort the retrieved sequences in decreasing
order of the number of exact matches with the query strand. If no KEY directive
is specified, the retrieved sequences will be sorted in order of decreasing
"score".
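The effect of the KEY directive can be pictured with the following Python
sketch. It assumes, purely for illustration, that each retrieved sequence is
a dictionary with one numeric field per sort key; sort_hits and the field
names are hypothetical and do not describe the server's internals.

```python
VALID_KEYS = {"score", "length", "peak", "gap", "match", "xmatch"}


def sort_hits(hits, key=None):
    """Sort hits in decreasing order of the field named by KEY.

    The key may be all lower case or all upper case; with no key the
    hits are sorted by decreasing score.
    """
    if key is None:
        field = "score"
    elif key in VALID_KEYS or (key.isupper() and key.lower() in VALID_KEYS):
        field = key.lower()
    else:
        raise ValueError("unknown KEY value: " + key)
    return sorted(hits, key=lambda hit: hit[field], reverse=True)
```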
The SOURCE line allows the user to specify the type of the query strand as
being a { PROTEIN,protein, DNA,dna } sequence. By setting SOURCE to one of
{PROTEIN,protein} the user indicates that the query strand is a sequence of
amino acids. By setting SOURCE to one of {DNA,dna} the user indicates that the
query strand is a sequence of nucleotides.
The TARGET line allows the user to specify the type of the target database
to be one of { PIR,pir, SP,sp, GB,gb }. This way the user controls the
database in which the search will be carried out. If TARGET is set to one of
{PIR,pir}, the search will take place in the PIR database. If TARGET is set to
one of {SP,sp} the search will take place in the SWISSPROT database. If
TARGET is set to one of {GB,gb}, the search will take place in the GenBank
database. Requests for searches in unsupported databases will be *IGNORED* by
the server and generate a complaint message that will be sent back to the
originator of the request.
If *only* SOURCE is specified, then the TARGET will be set automatically: in
particular, if SOURCE is set to one of { protein, PROTEIN } then the search
will be carried out in the "PIR" database, whereas if source is set to one
of { dna, DNA } then the search will take place in the "GB" database. If
*neither* SOURCE *nor* TARGET lines are given, the server will assume it is
dealing with an amino acid strand and carry out the search against the "PIR"
database.
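The defaulting rules for SOURCE and TARGET can be written down as the
following Python sketch. The function resolve_source_target is hypothetical.
Two simplifications are assumptions of the sketch: values are lower-cased
before checking, and when only TARGET is given the source is left as None,
because the text above does not say how the server chooses it.

```python
DEFAULT_TARGET = {"protein": "pir", "dna": "gb"}
SUPPORTED_TARGETS = {"pir", "sp", "gb"}


def resolve_source_target(source=None, target=None):
    """Apply the SOURCE/TARGET defaults and return (source, target)."""
    source = source.lower() if source else None
    target = target.lower() if target else None
    if source is not None and source not in DEFAULT_TARGET:
        raise ValueError("unknown SOURCE")
    if target is not None and target not in SUPPORTED_TARGETS:
        raise ValueError("unsupported TARGET database")
    if target is None:
        # Neither line given: assume an amino acid strand and PIR.
        source = source or "protein"
        target = DEFAULT_TARGET[source]
    return source, target
```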
The LABEL line allows the user to enter mnemonic information pertaining to
the test sequence, the time of the day etc. The information of this line will
be reproduced in the Subject line of the reply message. Notice that the
LABEL line *must* begin with the character '>'.
All the submitted messages must be terminated by the number '1'. This
number can follow the last character of the test sequence or be in a line by
itself.
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
A 'SMART' FILTER: The email filter that allows for the above message format
----------------- has been improved in this release. In particular, the
filter is 'smart' enough to catch inconsistencies in the user's message. The
filter will correct them and send a note to the originator of the message.
*Unlike* older releases of the filter, this version will submit the corrected
message to the search engine. The filter will also send one email note to the
originator of the query for *every* change it has carried out; the note(s)
will contain information about the actions that the filter has taken.
For example, if the user's note contains the following lines
sequences 20
alignments 50
verbose 10 30
the filter will reset the value of 'alignments' to 20, and of 'verbose_to'
to 20, and subsequently submit the corrected query to the search engine. Since
two changes took place, the filter will also send two email notes to the
originator of the query detailing the actions it has taken.
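The correction in this example can be reproduced with the following Python
sketch. It covers only the one rule illustrated above (neither 'alignments'
nor the upper VERBOSE bound may exceed 'sequences'); the function name
reconcile_counts and the wording of the notes are hypothetical, and the real
filter checks more than this.

```python
def reconcile_counts(sequences, alignments, verbose_to):
    """Clamp alignments and verbose_to to the number of sequences.

    Returns (alignments, verbose_to, notes), with one note per change,
    mirroring the one-note-per-change behaviour of the filter.
    """
    notes = []
    if alignments > sequences:
        notes.append(f"alignments reset from {alignments} to {sequences}")
        alignments = sequences
    if verbose_to > sequences:
        notes.append(f"verbose_to reset from {verbose_to} to {sequences}")
        verbose_to = sequences
    return alignments, verbose_to, notes
```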
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
EXAMPLES: Two example inputs follow
---------
Example 1:
pam 250
sequences 50
alignments 30
threshold 100
target pir
begin
> HBA_HUMAN STANDARD; PRT; 141 AA. P01922; HEMOGLOBIN ALPHA
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFP
TTKTYFPHFDLSHGSAQVKGHG KKVADALTNA
V A H V D D M PNALSALSDLHAHKLRVDPVNFK
llshcllvtlaahlpaeftpavhasldkflasvstvltskyr
1
Note: all amino acids from "VLSP" through "ltskyr" will be used
in the search. Not more than the 50 top scoring sequences will be
reported in the short list. Also, the alignments for the top 30
scoring sequences will be returned. No reported sequence will have a
score that is less than 100, and the reported sequences will be
sorted in order of decreasing score. The test sequence is declared
to be a sequence of amino acids and should be searched against the
PIR database.
Example 2:
BLOSUM 62
KEY XMATCH
BEGIN
> Sequence sent to dflash on Fri May 20 13:40:17 EDT 1994
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFP
TTKTYFPHFDLSHGSAQVKGHG KKVADALTNA
V A H V D D M PNALSALSDLHAHKLRVDPVNFK
llshcllvtlaahlpaeftpavhasldkflasvstvltskyr
1
Note: all amino acids from "VLSP" through "ltskyr" will be used
in the search. The server code will set the various parameters to
appropriate default values. The server will treat the test sequence
as a sequence of amino acids (default) and will search against the
"PIR" database (default) with a score threshold set at 80 (default).
The retrieved sequences will be reported in order of decreasing
number of exact matches with the query strand.
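Message bodies like the two examples can also be generated by a small script.
The Python sketch below only assembles the text in the documented order
(directives, BEGIN, label line, sequence, terminator); build_request is a
hypothetical helper, and mailing the result is left to the user's own mail
program.

```python
def build_request(label, sequence, directives=None):
    """Assemble a request body in the documented format.

    directives maps a directive name to its value, e.g.
    {"BLOSUM": 62, "KEY": "XMATCH"}; all directives precede BEGIN.
    """
    if "\n" in label or "\r" in label:
        raise ValueError("the label must fit on one line")
    lines = [f"{name} {value}" for name, value in (directives or {}).items()]
    lines.append("BEGIN")
    lines.append(">" + label)  # the label line must start with '>'
    lines.append(sequence)
    lines.append("1")          # mandatory terminator
    return "\n".join(lines) + "\n"
```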
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
SCORING MATRICES:
-----------------
You can use both PAM and BLOSUM scoring matrices for protein searches. These
can be requested via the optional { pam, PAM, blosum, BLOSUM } directive. The
currently supported distances are
for BLOSUM: 30, 35, 40, 45, 50, 55, 60, 62, 65, 70, 75, 80, 85, 90, 100
for PAM: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150,
160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280,
290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 410,
420, 430, 440, 450, 460, 470, 480, 490, and 500.
For DNA searches, the PAM/BLOSUM declarations are ignored.
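A request can be checked against the list of supported distances before it is
sent, as in the following Python sketch. The function is_supported_matrix is
a hypothetical name; the distances are copied from the list above.

```python
SUPPORTED_DISTANCES = {
    "BLOSUM": {30, 35, 40, 45, 50, 55, 60, 62, 65, 70, 75, 80, 85, 90, 100},
    "PAM": set(range(10, 501, 10)),  # 10, 20, ..., 500
}


def is_supported_matrix(family, distance):
    """Return True if e.g. ("PAM", 250) names a supported matrix.

    The family may be all lower case or all upper case, as for the
    other directives.
    """
    if family not in (family.lower(), family.upper()):
        return False
    return distance in SUPPORTED_DISTANCES.get(family.upper(), set())
```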
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
NOTE ON ALIGNMENT:
------------------
The server's alignment code implements the Smith-Waterman algorithm (dynamic
programming) to align each of the retrieved sequences with the test input. This
is *NOT* to be confused with the indexing method that we use to determine the
candidates to be aligned.
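For readers unfamiliar with the method, the following Python sketch computes
a Smith-Waterman local alignment score. It is a textbook illustration and not
the server's code: it uses a fixed match/mismatch score and a linear gap
penalty (all three values are arbitrary assumptions), whereas the server
scores with the selected PAM or BLOSUM matrix.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring table
    best = 0
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            # A cell never drops below zero: an alignment may start anywhere.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best
```

Only two rows of the table are kept, which is enough for the score; recovering
the alignment itself would require the full table and a traceback.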
The meaning of the variables in the listing that is returned by the dFLASH
server
.....
....
Score Matrix: PAM250
Max Reported Sequences: 1000
Max Reported Alignments: 10
Score Threshold At: 65
Id Label: Score NRes Ex% Tot% Sig Pk
----------------------------------------------------------------------------
1. HAHU hemoglobin alpha chain - human 655 141 100% 100% 100 89
2. HACZ hemoglobin alpha chain - chimpanzee 655 141 100% 100% 100 89
3. HACZP hemoglobin alpha chain - pygmy chi 655 141 100% 100% 100 89
4. HAGO hemoglobin alpha chain - lowland go 654 141 99% 100% 99 89
5. HAMQP hemoglobin alpha chain - hanuman l 653 141 97% 100% 99 89
6. B27792 hemoglobin alpha-1 chain - orangu 649 141 97% 100% 99 89
7. A25126 hemoglobin alpha-1 chain - Sumatr 649 141 97% 100% 99 89
...
.....
..
is the following:
NRes: the number of residues (amino acids) in the recovered match
Score: sequence similarity score of the recovered sequence based on the
selected mutation matrix
Ex%: percentage of *exact* matching residues
Tot%: percentage of *total* (=exact+conservative) matching residues
Sig: 100 times the ratio between the actual computed score and the score
obtained by matching the retrieved sub-segment with itself; the
denominator is the maximum obtainable score for the sub-segment in
question (all gaps removed).
Pk: the peak, i.e. the maximum score value over *any* 18(=proteins) or
54(=dna) residue window of the recovered match.
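The definition of Sig amounts to the small computation sketched below in
Python. The function name significance is hypothetical, and truncating the
result to an integer is an assumption: the listing shows whole numbers, but
the rounding rule used by the server is not stated.

```python
def significance(score, self_score):
    """Return 100 * score / self_score, truncated to an integer.

    self_score is the score obtained by matching the retrieved
    sub-segment with itself (all gaps removed).
    """
    if self_score <= 0:
        raise ValueError("self_score must be positive")
    return 100 * score // self_score
```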
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
TO OBTAIN HELP:
---------------
You can obtain this message at any moment by sending a message with one of:
{ dflash, dFlash, dFLASH, DFLASH } in the "Subject" line and a body containing
one of { help, HELP, send help, SEND HELP }.
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
TO OBTAIN ON-LINE REPRINTS OF PAPERS
------------------------------------
You can obtain reprints (in PostScript) of relevant papers by sending a
message with one of: { dflash, dFlash, dFLASH, DFLASH } in the "Subject" line
and a body containing
one of {flashpaper, FLASHPAPER, send flashpaper, SEND FLASHPAPER }
---> returns to the originator of the
request a copy of the FLASH paper
that will appear in `CABIOS'
one of {dflashpaper, DFLASHPAPER, send dflashpaper, SEND DFLASHPAPER }
---> returns to the originator of the
request a copy of a paper that contains
a description of dFLASH that has
appeared in `IEEE Computational Science
and Engineering'
one of {concertpaper, CONCERTPAPER, send concertpaper, SEND CONCERTPAPER }
---> returns to the originator of the
request a copy of a high-level paper
describing the CONCERT/C language
one of {bayespaper, BAYESPAPER, send bayespaper, SEND BAYESPAPER }
---> returns to the originator of the
request a copy of a paper describing
a computer-vision application based
on indexing principles similar to
those of dFLASH, which will appear in `CVGIP-IU'
Notice there can only be *one* such request per message! Also, make sure
you do not issue a new paper request until after the previous request has
returned to you all of the postscript files and you have removed the latter
from your mailbox: the returned messages are rather big (between 1 and 4
Megabytes) and are guaranteed to overflow the disk set aside for mail messages
on most systems.
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
Thank you for your interest in the dFLASH server.
Sincerely,
The dFLASH Group
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
COMMENTS?? We will appreciate receiving your feedback, suggestions, comments,
---------- or bug reports; all of these can be sent to "dflash at watson.ibm.com".
Please, make sure your "Subject" line contains the word "comments"
or "bug".
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
REFERENCES If you make use of the dFLASH server, please reference
----------
A. Califano and I. Rigoutsos, "FLASH: A Fast Look-up Algorithm for String
Homology." In CABIOS. To appear.
I. Rigoutsos and A. Califano, "Searching In Parallel for Similar Protein
Strings." In IEEE Computational Science and Engineering, June 1994.
If you wish to find out more, you can contact Isidore Rigoutsos and Andrea
Califano at {rigoutso,acal}@watson.ibm.com
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
For more information on the Concert/C language, please refer to
J. Auerbach, D. Bacon, A. Goldberg, G. Goldszmidt, A. Gopal, M. Kennedy,
A. Lowry, J. Russell, W. Silverman, R. Strom, D. Yellin, and S. Yemini,
"High-level language support for programming reliable distributed
systems." In Proceedings of the International Conference on Computer
Languages, April 1992, Oakland, California.
or contact Jim Russell (jrussell at watson.ibm.com)
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
------------------------------> CUT HERE <-----------------------------------