Sequence formats

Peter Rice pmr at sanger.ac.uk
Tue Mar 25 13:19:50 EST 1997

In article <33380285.7D6A at nibsc.ac.uk> ajenkins at nibsc.ac.uk (Adrian Jenkins) writes:
>   On my 'wish-list' for features regarding molecular biology programs, the
>   main feature would be a universal format.

Probably better in a more general newsgroup, but anyway ...

The most general format seems to be "FASTA". GCG can now read FASTA
format, and so can applications like BLAST.

However, even FASTA format has variations. The "sequence name" can
include many defined fields (see for example the FASTA format of dbEST
and dbSTS), and after the "sequence name" some applications like to
define some format for the remainder of the line. For example, the
next text field might be reserved for an accession number, or other
delimiters could be used for other information.
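
To make the header variations concrete, here is a rough Python sketch of how a reader might split a FASTA title line. The function name parse_header, the "|" delimiter for fields inside the name, and the sample header are illustrative assumptions, not a defined standard; applications differ on exactly these points.

```python
def parse_header(line):
    """Split a FASTA title line into (name, fields, description).

    Assumptions for illustration only: the name is everything up to the
    first whitespace, and fields inside the name are separated by "|".
    """
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    body = line[1:].strip()
    parts = body.split(None, 1)          # split once on any whitespace
    name = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    fields = name.split("|") if name else []
    return name, fields, description


# Hypothetical header, made up for this example
name, fields, description = parse_header(">gi|12345|gb|T00001|T00001 example EST")
print(name)
print(fields)
print(description)
```

Whether the text after the name is free-form or carries an accession number in a fixed position is precisely what a parser like this cannot know without being told which convention is in use.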

Even the sequence has alternatives - fixed-length lines (if so, how
long?), spaces at the start of the line or between blocks of
characters, whether protein sequences should end with a "*", what gap
character(s) are allowed, and so on.
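
A tolerant reader can paper over most of these differences by normalising the sequence lines before use. The Python sketch below is one possible approach, not an agreed rule: the function name normalize_sequence, the choice of "-" and "." as gap characters, and the decision to drop a trailing "*" are all assumptions made for illustration.

```python
def normalize_sequence(lines, gap_chars="-.", keep_gaps=False):
    """Join sequence lines into one string, ignoring layout differences.

    Assumptions for illustration only: any whitespace is layout, a single
    trailing "*" is a terminator, and "-" and "." are gap characters.
    """
    # Remove leading spaces, block separators and line breaks
    seq = "".join("".join(line.split()) for line in lines)
    # Drop a trailing terminator if present
    if seq.endswith("*"):
        seq = seq[:-1]
    # Optionally remove gap characters
    if not keep_gaps:
        seq = "".join(ch for ch in seq if ch not in gap_chars)
    return seq


# Made-up protein fragment with leading spaces, a block gap, gaps and "*"
print(normalize_sequence(["  MKT AYIAK", "QR-.QI*"]))
print(normalize_sequence(["  MKT AYIAK", "QR-.QI*"], keep_gaps=True))
```

Line length then stops mattering on input, although a writer still has to pick one length on output.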

Peter Rice                           | Informatics Division,
E-mail: pmr at sanger.ac.uk             | The Sanger Centre,
Tel: (44) 1223 494967                | Wellcome Trust Genome Campus,
Fax: (44) 1223 494919                | Hinxton, Cambridge, CB10 1SA,
URL: http://www.sanger.ac.uk/~pmr/   | England
