Fasta search help

Dan Jacobson danj at WELCHGATE.WELCH.JHU.EDU
Wed Mar 18 17:35:32 EST 1992

Chris Upton made a request for the help file for fasta searches at genbank.  Sending the single message Help
is supposed to retrieve this, but Chris said he'd tried that - so I'm posting the instructions here.
As I have recently moved to a new machine I didn't have my fasta instructions handy (and as I use Thon de Boer's
wonderful little shell script, I rarely use the instructions anymore! - kudos to Thon) so I pulled this up
with WAIS off the biosci source - I'm glad you're back biosci, I really missed you!  Thanks to all those
who helped get the source up and running again.

As an aside, it is truly amazing how many people don't know about this fasta service - I guess it's one more
argument for the value of participating in the newsgroups.

Hope this helps,

Dan Jacobson

danj at welchgate.welch.jhu.edu


FASTA Server Help
GenBank now offers the FASTA program for nucleic acid sequence and
protein similarity searching of sequence databases.  You can access
the GenBank FASTA Server through a number of different networks,
including Internet, BITNET, EARN, NETNORTH and JANET.

The FASTA program allows you to send a specially formatted mail
message containing the nucleic acid or protein query sequence to the
FASTA Server at GenBank.  A FASTA sequence similarity search is then
performed against the specified database using the FASTA program
developed by William Pearson and David Lipman as described in their paper:
	Pearson, W.R. and Lipman, D.J. 1988.  Improved Tools for 
	Biological Sequence Comparison.  Proc. Natl. Acad. Sci., 
	85: 2444-2448.
If you use FASTA as a research tool, we ask that this reference be
cited in your paper. The results of the FASTA search will be returned
to your local mail file as soon as they are processed and can be saved
in a separate disk file.

The following databases are currently available for FASTA searches:

   Designator                  Database
   ----------                  --------
   GenBank/all                 Latest GenBank quarterly release PLUS 
                               sequences added since last release.
   GenBank/new                 GenBank sequences added since last release.
   GenBank/primate             GenBank subdivisions

   GenPept/all                 Translated protein reading frames from
                               the latest GenBank release.  Note that
                               GenPept contains translations only of
                               reading frames that are explicitly
                               mentioned in the GenBank sequence entry
   GenPept/new                 Translated protein reading frames from
                               GenBank daily updates (translated from 

   EMBL/all                    Latest EMBL Data Library release PLUS
                               sequences added since last release.
   EMBL/new                    EMBL sequences added since last release.

   SWISS-PROT/all              All of the SWISS-PROT protein database.

GenBank and EMBL are nucleic acid sequence databases and SWISS-PROT is
a protein sequence database.  GenPept is produced by GenBank and
consists of translations of open reading frames as documented in the
sequence entry annotations ("pept" in features table).

Accessing the FASTA program

To access the program, send an electronic mail message containing the 
formatted query sequence (as described below) to the following Internet 


If you are not on Internet, you may need to change the format of the 
address.  Consult your systems manager to determine the correct address.

Obtaining Help

If you would like to receive instructions on using the FASTA program,
send a mail message to the address above containing the word "Help" on
a single line of the mail message.  Leave the Subject line in the mail
header blank. The help text will be updated when new information is
available for FASTA searches (such as new databases on-line). For
additional help on using FASTA, contact GenBank at (415) 962-7307 or
send an electronic mail message to the address:


Formatting a Query

Queries consist of a mail message with search parameters identifying
the database to be searched, values related to the search and the
query sequence to be used in the search.  The mail message has two
mandatory lines, three optional lines and a line identifying the query
sequence as described below.  These lines are typed into the body of
the mail message in the order shown below:

Parameter	Mandatory			Explanation

DATALIB		   Yes		This line specifies the database to be 
				searched (as described in the beginning of
				this text) for the query sequence and must 
				be included in the message.  
KTUP		   No		This line identifies the Ktup value which 
				specifies the sensitivity of the search. 
				Values range between 3 and 6 for nucleic acid
				searches and between 1 and 2 for protein 
				searches. Lower values specify more sensitive 
				searches but require more time to complete.  
				For DNA sequences longer than 200 base pairs, 
				use a Ktup value of 4 or greater; lower values
 				are unnecessary and take longer to complete.  
				Protein searches will benefit from having a 
				Ktup value of 1 if you expect significant 
				matches with evolutionary amino acid replace-
				ments but few exact amino acid matches. The 
				default value for nucleic acids is 4 and 1 
				for proteins.
SCORES		   No		This line specifies the number of best-ranked 
				sequences to be listed in the results.  The 
				default value is 100.
ALIGNMENTS	   No		This line identifies the maximum number of 
				best-ranked sequences to be aligned in the 
				results.  The default value is 20.
BEGIN		   Yes		This line must be included in the message.  No 
				other information is typed on it.

The remainder of the message contains the query sequence in either
Pearson FASTA format or in IntelliGenetics format.
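
The line order described above can be sketched in Python.  This is only
an illustration: the help text does not show how a value is written on
its parameter line, so the "KEYWORD value" layout used here is an
assumption, and build_query is a hypothetical helper, not part of the
FASTA Server.  The sequence in the usage note is a made-up placeholder.

```python
# Hypothetical helper; the "KEYWORD value" line layout is an assumption.
NUCLEIC_KTUP = range(3, 7)  # 3 to 6, per the help text
PROTEIN_KTUP = range(1, 3)  # 1 to 2, per the help text


def build_query(datalib, sequence_text, ktup=None, scores=None,
                alignments=None, protein=False):
    """Return the body of a FASTA Server mail query as one string."""
    if not datalib:
        raise ValueError("DATALIB is mandatory")
    lines = [f"DATALIB {datalib}"]
    if ktup is not None:
        allowed = PROTEIN_KTUP if protein else NUCLEIC_KTUP
        if ktup not in allowed:
            raise ValueError(f"KTUP {ktup} is outside the documented range")
        lines.append(f"KTUP {ktup}")
    if scores is not None:
        lines.append(f"SCORES {scores}")
    if alignments is not None:
        lines.append(f"ALIGNMENTS {alignments}")
    # BEGIN is mandatory and carries no other information.
    lines.append("BEGIN")
    # The query sequence (Pearson or IntelliGenetics format) follows BEGIN.
    lines.append(sequence_text.rstrip("\n"))
    return "\n".join(lines) + "\n"
```

For example, build_query("GenBank/all", ">SEQ1 test\nACGT", ktup=4)
yields the DATALIB, KTUP and BEGIN lines followed by the sequence.
Optional lines that are left out fall back to the server defaults
listed in the table.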

Preparing Files for Similarity Searches

Only one sequence query is allowed per mail query.  The query sequence
that you would like searched in the database must be contained in its
own file.  Your sequence file must be in either Pearson format or
IntelliGenetics format.  GenBank database file format is not currently
accepted; however, it is possible to use an editor to change the file
to Pearson format as described below.  Note: all lines must be less
than 80 characters in length; longer lines will be truncated.

Pearson Format

Pearson is the preferred format to use for query sequences.  The format 
includes a mandatory comment line beginning with a greater-than sign ">" 
followed by the name of the sequence, a space, and an optional note 
about the sequence.  The sequence data begin on the next line without 
the greater-than sign.  For example:

>AGREP4 Monkey SV40-like genomic segment promoting transcription.
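
A minimal Python sketch of writing a sequence in the Pearson layout
described above, keeping every line under 80 characters.  The function
name to_pearson and the default width of 70 are illustrative choices,
not something the help text prescribes.

```python
def to_pearson(name, sequence, note="", width=70):
    """Format a sequence as Pearson text with lines under 80 characters."""
    if not 0 < width < 80:
        raise ValueError("width must be between 1 and 79")
    # Comment line: ">" then the name, a space, and an optional note.
    header = f">{name} {note}".rstrip()
    if len(header) >= 80:
        raise ValueError("comment line would be truncated by the server")
    # Drop any whitespace, then wrap the sequence data.
    seq = "".join(sequence.split())
    chunks = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + chunks) + "\n"
```

Calling to_pearson("AGREP4", seq, "Monkey SV40-like genomic segment")
with seq holding the sequence letters returns text that can be placed
after the BEGIN line of a query.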

IntelliGenetics Format

If your sequence was derived using one of the IntelliGenetics programs,
it can be used for a FASTA search.  Comment lines are optional and
begin with a semicolon ";".  The name of the sequence and the
sequence data appear on separate lines without a semicolon.  At the
end of the sequence data a number must follow to indicate if the
sequence is linear (1) or circular (2).  For example:

;Monkey SV40-like genomic segment promoting transcription.
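
The IntelliGenetics layout described above can be sketched the same
way.  This assumes the trailing 1 or 2 is appended directly to the end
of the sequence data; the help text says only that the number must
follow the data.  The name to_intelligenetics is hypothetical.

```python
def to_intelligenetics(name, sequence, comment="", circular=False, width=70):
    """Format a sequence in IntelliGenetics layout (a sketch)."""
    if not 0 < width < 80:
        raise ValueError("width must be between 1 and 79")
    lines = []
    if comment:
        # Optional comment lines begin with a semicolon.
        lines.append(f";{comment}")
    # The sequence name sits on its own line, without a semicolon.
    lines.append(name)
    if any(len(line) >= 80 for line in lines):
        raise ValueError("a line would be truncated by the server")
    # Terminator: 1 marks a linear sequence, 2 a circular one.
    seq = "".join(sequence.split()) + ("2" if circular else "1")
    lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```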

GenBank Flat-File Format
