Chris Upton asked for the help file for FASTA searches at GenBank. Sending the single-word message "Help"
is supposed to retrieve it, but Chris said he had tried that, so I'm posting the instructions here.
As I have recently moved to a new machine, I didn't have my FASTA instructions handy (and since I use Thon de Boer's
wonderful little shell script, I rarely use the instructions anymore; kudos to Thon), so I pulled this up
with WAIS off the biosci source. I'm glad you're back, biosci, I really missed you! Thanks to all those
who helped get the source up and running again.
As an aside, it is truly amazing how many people don't know about this FASTA service. I guess it's one more
argument for the value of participating in the newsgroups.
Hope this helps,
Dan Jacobson
danj@welchgate.welch.jhu.edu
---------------------------------------------------------------------------------------------------------

FASTA Server Help

GenBank now offers the FASTA program for nucleic acid and protein
sequence similarity searching of sequence databases. You can access
the GenBank FASTA Server through a number of different networks,
including Internet, BITNET, EARN, NETNORTH and JANET.
The FASTA program allows you to send a specially formatted mail
message containing the nucleic acid or protein query sequence to the
FASTA Server at GenBank. A FASTA sequence similarity search is then
performed against the specified database using the FASTA program
developed by William Pearson and David Lipman as described in their
paper:
Pearson, W.R. and Lipman, D.J. 1988. Improved Tools for
Biological Sequence Comparison. Proc. Natl. Acad. Sci.,
85: 2444-2448.
If you use FASTA as a research tool, we ask that this reference be
cited in your paper. The results of the FASTA search will be returned
to your local mail file as soon as they are processed and can be saved
in a separate disk file.
The following databases are currently available for FASTA searches:

Designator                  Database
----------                  --------
GenBank/all                 Latest GenBank quarterly release PLUS
                            sequences added since last release.
GenBank/new                 GenBank sequences added since last release.
GenBank/primate             GenBank subdivisions
GenBank/rodent
GenBank/other_mammalian
GenBank/other_vertebrate
GenBank/invertebrate
GenBank/plant
GenBank/organelle
GenBank/bacterial
GenBank/structural_rna
GenBank/viral
GenBank/phage
GenBank/synthetic
GenBank/unannotated
GenPept/all                 Translated protein reading frames from
                            the latest GenBank release. Note that
                            GenPept contains translations only of
                            reading frames that are explicitly
                            mentioned in the GenBank sequence entry
                            annotations!
GenPept/new                 Translated protein reading frames from
                            GenBank daily updates (translated from
                            GenBank/new).
EMBL/all                    Latest EMBL Data Library release PLUS
                            sequences added since last release.
EMBL/new                    EMBL sequences added since last release.
SWISS-PROT/all              All of the SWISS-PROT protein database.

GenBank and EMBL are nucleic acid sequence databases and SWISS-PROT is
a protein sequence database. GenPept is produced by GenBank and
consists of translations of open reading frames as documented in the
sequence entry annotations ("pept" in features table).
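
The split between nucleic acid and protein databases described above can be written down as a small lookup, which is handy when choosing search values later. This is only an illustrative Python sketch and not part of the server: the function name database_kind is hypothetical, and treating designators case-insensitively is an assumption, since the help text does not say how the server compares them.

```python
# Sequence type of each database family (the part before the "/"),
# as stated in the help text.
DATABASE_KINDS = {
    "genbank": "nucleic acid",
    "embl": "nucleic acid",
    "genpept": "protein",
    "swiss-prot": "protein",
}


def database_kind(designator: str) -> str:
    """Return 'nucleic acid' or 'protein' for a designator such as 'GenBank/all'."""
    # Case-insensitive matching is an assumption made for convenience.
    family = designator.split("/", 1)[0].strip().lower()
    if family not in DATABASE_KINDS:
        raise ValueError(f"unknown database designator: {designator!r}")
    return DATABASE_KINDS[family]
```

For example, database_kind("GenPept/new") gives "protein", so a protein query sequence would be appropriate for that database.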

Accessing the FASTA program

To access the program, send an electronic mail message containing the
formatted query sequence (as described below) to the following Internet
address:
    SEARCH@GENBANK.BIO.NET
If you are not on Internet, you may need to change the format of the
address. Consult your systems manager to determine the correct address.

Obtaining Help

If you would like to receive instructions on using the FASTA program,
send a mail message to the address above containing the word "Help" on
a single line of the mail message. Leave the Subject line in the mail
header blank. The help text will be updated when new information is
available for FASTA searches (such as new databases on-line). For
additional help on using FASTA, contact GenBank at (415) 962-7307 or
send an electronic mail message to the address:
    CONSULTANT@GENBANK.BIO.NET
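
As a rough illustration of the help request described above, the following Python sketch builds such a message with the standard email package. The function name build_help_request and the sender argument are hypothetical, and the server address is written with an @ sign, as it would be entered in a mail program. The message is only constructed here, not sent.

```python
from email.message import EmailMessage

# Address of the FASTA server, as given in the help text.
SEARCH_ADDRESS = "SEARCH@GENBANK.BIO.NET"


def build_help_request(sender: str) -> EmailMessage:
    """Build a mail message asking the FASTA server for its help text."""
    msg = EmailMessage()
    msg["From"] = sender  # hypothetical: your own mail address
    msg["To"] = SEARCH_ADDRESS
    # No Subject header is set, which leaves the subject line blank.
    msg.set_content("Help\n")  # the word "Help" on a single line
    return msg
```

Actually sending the message depends on your local mail setup. One possibility, assuming a mail server is listening on the local machine, is smtplib.SMTP("localhost").send_message(msg).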

Formatting a Query

Queries consist of a mail message with search parameters identifying
the database to be searched, values related to the search and the
query sequence to be used in the search. The mail message has two
mandatory lines, three optional lines and a line identifying the query
sequence as described below. These lines are typed into the body of
the mail message in the order shown below:

Search
Parameter    Mandatory   Explanation

DATALIB      Yes         This line specifies the database to be
                         searched (as described in the beginning of
                         this text) for the query sequence and must
                         be included in the message.

KTUP         No          This line identifies the Ktup value which
                         specifies the sensitivity of the search.
                         Values range between 3 and 6 for nucleic acid
                         searches and between 1 and 2 for protein
                         searches. Lower values specify more sensitive
                         searches but require more time to complete.
                         For DNA sequences longer than 200 base pairs,
                         use a Ktup value of 4 or greater; lower values
                         are unnecessary and take longer to complete.
                         Protein searches will benefit from having a
                         Ktup value of 1 if you expect significant
                         matches with evolutionary amino acid
                         replacements but few exact amino acid matches.
                         The default value is 4 for nucleic acids and
                         1 for proteins.

SCORES       No          This line specifies the number of best-ranked
                         sequences to be listed in the results. The
                         default value is 100.

ALIGNMENTS   No          This line identifies the maximum number of
                         best-ranked sequences to be aligned in the
                         results. The default value is 20.

BEGIN        Yes         This line must be included in the message. No
                         other information is typed on it.

The remainder of the message contains the query sequence in either
Pearson FASTA format or in IntelliGenetics format.
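
To make the order of the lines concrete, here is a hedged Python sketch that assembles the body of a query message. It assumes that each parameter line is the keyword followed by a space and its value (for example "KTUP 4"); the parameter table above names the keywords but does not spell out that syntax. The function name build_query and its arguments are hypothetical. Optional lines are left out when no value is given, so the server defaults apply.

```python
from typing import Optional

# Allowed Ktup ranges, taken from the parameter table in the help text.
KTUP_RANGE = {"nucleic acid": (3, 6), "protein": (1, 2)}


def build_query(datalib: str, sequence_text: str, kind: str,
                ktup: Optional[int] = None,
                scores: Optional[int] = None,
                alignments: Optional[int] = None) -> str:
    """Assemble the message body: parameter lines, BEGIN, then the sequence.

    `kind` is 'nucleic acid' or 'protein' and is used only to check Ktup.
    The "KEYWORD value" line syntax is an assumption.
    """
    lines = [f"DATALIB {datalib}"]  # mandatory
    if ktup is not None:
        low, high = KTUP_RANGE[kind]
        if not low <= ktup <= high:
            raise ValueError(f"Ktup for {kind} searches must be {low} to {high}")
        lines.append(f"KTUP {ktup}")
    if scores is not None:
        lines.append(f"SCORES {scores}")
    if alignments is not None:
        lines.append(f"ALIGNMENTS {alignments}")
    lines.append("BEGIN")  # mandatory, nothing else on this line
    lines.append(sequence_text.strip("\n"))  # Pearson or IntelliGenetics text
    return "\n".join(lines) + "\n"
```

Checking Ktup before mailing avoids waiting for a reply only to learn that the value was out of range for the chosen kind of search.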

Preparing Files for Similarity Searches

Only one query sequence is allowed per mail message. The query sequence
that you would like searched against the database must be contained in its
own file. Your sequence file must be in either Pearson format or
IntelliGenetics format. GenBank database file format is not currently
accepted; however, it is possible to use an editor to change the file
to Pearson format as described below. Note: all lines must be less
than 80 characters in length; longer lines will be truncated.
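
Because longer lines are truncated, it can help to re-wrap a sequence before mailing it. The Python sketch below is one way to do that; the helper names wrap_sequence and overlong_lines are hypothetical, and the default width of 60 is just a convenient choice under the 80-character limit.

```python
from typing import List


def wrap_sequence(sequence: str, width: int = 60) -> List[str]:
    """Split a sequence into lines of at most `width` characters."""
    if not 1 <= width < 80:
        raise ValueError("width must be between 1 and 79")
    # Remove existing line breaks and spaces before re-wrapping.
    residues = "".join(sequence.split())
    return [residues[i:i + width] for i in range(0, len(residues), width)]


def overlong_lines(text: str, limit: int = 80) -> List[int]:
    """Return the 1-based numbers of lines that are not shorter than `limit`."""
    return [number for number, line in enumerate(text.splitlines(), 1)
            if len(line) >= limit]
```

Running overlong_lines over the finished message text before sending shows which lines, if any, would be cut off.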

Pearson Format

Pearson is the preferred format to use for query sequences. The format
includes a mandatory comment line beginning with a greater-than sign ">"
followed by the name of the sequence, a space, and an optional note
about the sequence. The sequence data begin on the next line without
the greater-than sign. For example:
>AGREP4 Monkey SV40-like genomic segment promoting transcription.
ccccttcaaatctattacaaggtgagcgtctcgccaaggcaatgaaatcgcaatatgatg
tttccatttactttggattatacgtcattataaa
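
A sequence held as a plain string can be written out in this layout with a few lines of Python. This is a minimal sketch under the rules stated above; the function name to_pearson is hypothetical, and the line width of 60 is an arbitrary value below the 80-character limit.

```python
def to_pearson(name: str, sequence: str, note: str = "", width: int = 60) -> str:
    """Format a sequence in Pearson format: a '>' comment line, then the data."""
    if not name or any(ch.isspace() for ch in name):
        raise ValueError("name must be non-empty and contain no spaces")
    if not 1 <= width < 80:
        raise ValueError("width must be between 1 and 79")
    # Comment line: '>' + name, then a space and the optional note.
    # A long note can push this line past the 80-character limit.
    header = f">{name} {note}".rstrip()
    residues = "".join(sequence.split())  # drop spaces and line breaks
    body = [residues[i:i + width] for i in range(0, len(residues), width)]
    return "\n".join([header] + body) + "\n"
```

Called with the name, note and sequence from the example above, this reproduces the same three lines.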

IntelliGenetics Format

If your sequence was derived using one of the IntelliGenetics programs,
it can be used for a FASTA search. Comment lines are optional and
begin with a semicolon ";". The name of the sequence and the
sequence data appear on separate lines without a semicolon. At the
end of the sequence data a number must follow to indicate if the
sequence is linear (1) or circular (2). For example:
;Monkey SV40-like genomic segment promoting transcription.
AGMREP4
ccccttcaaatctattacaaggtgagcgtctcgccaaggcaatgaaatcgcaatatgatg
tttccatttactttggattatacgtcattataaa1
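
The same kind of helper can be sketched for IntelliGenetics format. The function name to_intelligenetics is hypothetical. Following the example above, the terminating 1 or 2 is appended directly to the last line of sequence data; whether the server also accepts it on a line of its own is not stated here.

```python
def to_intelligenetics(name: str, sequence: str, comment: str = "",
                       circular: bool = False, width: int = 60) -> str:
    """Format a sequence in IntelliGenetics format.

    Optional ';' comment lines, the name on its own line, the sequence data,
    and a final 1 (linear) or 2 (circular) after the last residue.
    """
    if not 1 <= width < 79:
        raise ValueError("width must leave room for the terminator under 80")
    residues = "".join(sequence.split())  # drop spaces and line breaks
    if not name or not residues:
        raise ValueError("name and sequence must not be empty")
    lines = [";" + text for text in comment.splitlines()]
    lines.append(name)
    body = [residues[i:i + width] for i in range(0, len(residues), width)]
    body[-1] += "2" if circular else "1"  # topology marker ends the data
    lines.extend(body)
    return "\n".join(lines) + "\n"
```

With the comment, name and sequence from the example above and circular left at False, the output matches the four lines shown.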
GenBank Flat-File Forma