FASTA format - proposed max line limit

Osborne, Brian brian.osborne at CADUS.COM
Fri Nov 6 08:55:11 EST 1998


To the group,

My favorite horror (to use Peter's term) is this one (and its brothers
and sisters in dbEST):

>gb|AA406996|acc:AA406966 EST02002 Mouse 7.5 dpc embryo ectoplacental
cone cDNA library Mus musculus cDNA clone C0016E04 3' similar to Mouse
mitochondrial genome. >gb, score = 1720, mRNA sequence - Mus musculus,
404 bp (RNA)

Enjoy!        ;-)

Brian O.

Brian Osborne
Cadus Pharmaceutical Corporation
777 Old Saw Mill River Rd.
Tarrytown NY USA
10591-6705
brian.osborne at cadus.com
TEL 914 467 6291
FAX 914 345 3565




> -----Original Message-----
> From:	Peter Rice [SMTP:pmr at sanger.ac.uk]
> Sent:	Friday, November 06, 1998 5:21 AM
> To:	bio-soft at net.bio.net
> Subject:	Re: FASTA format - proposed max line limit
> 
> mathog at seqaxp.bio.caltech.edu writes:
> 
> > We all know that the FASTA format is a bit restrictive in that there is
> > only the one line for comments, but can the software/database community
> > *please* agree on some reasonable maximum line length for both the
> > comments and the sequence?
> 
> I would welcome a standard "unique identifier" format after the ">".
> 
> We use FASTA format extensively at the Sanger Centre, but we need to
> hold both an identifier and an accession number. In specific cases
> we also need a database name. Often there is other information used
> (typically numeric) to generate several unique forms from one
> original name.
> 
> Curiously, one reason for the expansion of FASTA format is BLAST, as
> it takes as its database a file of many FASTA format sequences which
> need to have unique identifiers.
> 
> One option to get extra identifier information is to use the NCBI
> style with "|" characters to split the fields. Sometimes this style
> also seems to carry special information in the first word(s) of the
> description, for example in dbEST.weekly.FASTA:
> 
>    >gi|1622446|dbj|C21336|C21336 HUMGS0003372, Human Gene Signature, \
> 	3'-directed cDNA sequence
> 
> (actually this is followed by "ctrl-A" and more description - see
> "other horrors" below)
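
As an illustration of the NCBI style described above, here is a minimal
Python sketch that splits such a header into its "|"-separated fields,
the description, and anything that follows a ctrl-A. The function name
parse_ncbi_defline is hypothetical, and the sketch assumes the
identifier is everything up to the first space. It does not interpret
the individual fields.

```python
def parse_ncbi_defline(line):
    """Split an NCBI-style FASTA header into fields and description.

    Assumption: the identifier is the first whitespace-free word after
    ">", and ctrl-A (\x01) separates any additional descriptions.
    """
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    # Everything after the first ctrl-A is kept as raw extra text.
    first, *extra = line[1:].rstrip("\r\n").split("\x01")
    ident, _, description = first.partition(" ")
    return {
        "fields": ident.split("|"),
        "description": description.strip(),
        "extra": extra,
    }


header = (">gi|1622446|dbj|C21336|C21336 HUMGS0003372, "
          "Human Gene Signature, 3'-directed cDNA sequence")
print(parse_ncbi_defline(header)["fields"])
# ['gi', '1622446', 'dbj', 'C21336', 'C21336']
```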
> 
> Another, since we have some FASTA files generated by GCG, is the GCG
> syntax of:
> 
>    >DB:entryname accnum yet...more...description
> 
> We generate FASTA files from our unfinished sequence data where the
> unique name is built from the clone and contig, using "." as a
> delimiter, for example:
> 
>    >bK109G6.05061 Unfinished sequence: bK109G6  Contig_ID: 05061  \
> 	acc=AL023879  Length: 25298 bp 
>    >bK109G6.05234 Unfinished sequence: bK109G6  Contig_ID: 05234  \
> 	acc=AL023879  Length: 129756 bp 
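
The clone.contig naming above could be taken apart along these lines. This
is only a sketch based on the two example headers: the function name
parse_contig_header is hypothetical, and it assumes the first "." in the
name separates clone from contig and that "acc=" and "Length: N bp"
appear in the description as shown.

```python
import re


def parse_contig_header(line):
    """Parse a header of the form '>clone.contig ... acc=X  Length: N bp'.

    Assumption: layout inferred from the two examples quoted above;
    missing pieces are returned as None rather than raising.
    """
    name, _, description = line.lstrip(">").strip().partition(" ")
    clone, _, contig = name.partition(".")
    acc = re.search(r"acc=(\S+)", description)
    length = re.search(r"Length:\s*(\d+)\s*bp", description)
    return {
        "name": name,
        "clone": clone,
        "contig": contig,  # kept as text so leading zeros survive
        "accession": acc.group(1) if acc else None,
        "length": int(length.group(1)) if length else None,
    }
```

The contig number is left as a string on purpose, since "05061" would
lose its leading zero as an integer and no longer match the name.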
> 
> I have seen various other styles of identifier to represent
> subsequences with a unique name, typically needed in protein
> domain databases, for example:
> 
>    >entryname-start-end  (e.g. SBASE)
>    >entryname/start-end  (horrible for generating filenames)
>    >entryname\start-end  (the "\" still causes confusion with filenames)
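
A reader that has to cope with all three subsequence styles might use
something like the following sketch. The names SUBSEQ_ID,
split_subsequence_id and filename_safe are hypothetical, and the
replacement of "/" and "\" by "_" for filenames is just one possible
convention, not anything agreed on the list.

```python
import re

# entryname, then one of "-", "/" or "\", then start-end as integers.
SUBSEQ_ID = re.compile(r"^(?P<entry>.+)[-/\\](?P<start>\d+)-(?P<end>\d+)$")


def split_subsequence_id(identifier):
    """Return (entry, start, end); start and end are None if absent."""
    match = SUBSEQ_ID.match(identifier)
    if match is None:
        return identifier, None, None
    return match.group("entry"), int(match.group("start")), int(match.group("end"))


def filename_safe(identifier):
    """Replace path separators so the identifier can name a file."""
    return re.sub(r"[/\\]", "_", identifier)


print(split_subsequence_id("ABC_HUMAN/10-200"))  # ('ABC_HUMAN', 10, 200)
print(filename_safe("ABC_HUMAN/10-200"))         # ABC_HUMAN_10-200
```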
> 
> Other horrors:
> 
> Using control characters to fake extra lines in the description,
> for example ctrl-A appears in NCBI's dbEST.weekly.FASTA files.
> 
> UniGene's "seq.all" file has clusters of FASTA format sequences
> headed by comment lines starting with "#".
> 
> BLAST1.4 pressdb fails if the sequence lines are not all the same
> length.
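
The horrors listed above are easy enough to screen for before a file
goes anywhere near a database build. The sketch below is one way to do
it. The names lint_fasta and _uneven are hypothetical, the default
max_header of 255 is an arbitrary placeholder (no limit has been agreed
in this thread), and it assumes the last sequence line of a record is
allowed to be shorter than the others.

```python
def _uneven(widths):
    """True if any line but the last differs in length, or the last is longer."""
    if len(widths) < 2:
        return False
    body, last = widths[:-1], widths[-1]
    return len(set(body)) > 1 or last > body[0]


def lint_fasta(lines, max_header=255):
    """Return a list of (line_number, message) for suspicious lines.

    For uneven sequence lines, the reported number is the line of the
    record's header.
    """
    problems = []
    widths, record_line = [], None
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new record starts: judge the sequence lines of the previous one.
            if _uneven(widths):
                problems.append((record_line, "uneven sequence line lengths"))
            widths, record_line = [], number
            if len(line) > max_header:
                problems.append((number, "header longer than %d characters" % max_header))
            if any(ord(c) < 32 and c != "\t" for c in line):
                problems.append((number, "control character in header"))
        elif line.startswith("#"):
            problems.append((number, "'#' comment line"))
        elif line:
            widths.append(len(line))
    if _uneven(widths):
        problems.append((record_line, "uneven sequence line lengths"))
    return problems
```

Calling lint_fasta(open("some.fasta")) works because it only needs an
iterable of lines; "some.fasta" is a placeholder name.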
> 
> An additional need is to have parseable information in the description
> which can be used to efficiently markup blast search results for a Web
> service.
> 
> >The Fasta-1998 REFERENCE format is very similar to the SEQUENCE format.
> >
> >R1. The reference file will hold information that didn't fit
> >      inside the Sequence file.
> >
> >R1.a  The comment line for each entry in the reference file must
> >        contain the ">" followed by the identifier, but no other
> >        information.
> 
> A nice idea. I would certainly support this kind of format for EMBOSS.
> 
> It is of course closely related to NBRF format, and its derivative
> GCG database format(s).
> 
> File naming could be a problem - the right FASTA REFERENCE file has to
> be associated with a FASTA SEQUENCE file. A ".ref" extension would help,
> but the sequence file may itself have various (or no) file extensions.
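
One way around the naming problem would be to append ".ref" to the whole
sequence file name rather than replacing whatever extension it has, so
that files with any extension, or none, map to a distinct reference
file. This is only a suggestion for illustration, not part of the
proposal, and the function name reference_path is hypothetical.

```python
from pathlib import Path


def reference_path(sequence_path):
    """Return the reference file path for a given sequence file.

    Assumption: ".ref" is appended to the full name, so "est.fasta"
    maps to "est.fasta.ref" and "contigs" maps to "contigs.ref".
    """
    path = Path(sequence_path)
    return path.with_name(path.name + ".ref")
```

Appending instead of replacing means "est.fasta" and "est.seq" in the
same directory do not end up sharing one reference file.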
> 
> 
> -- 
> ----------------------------------------------------------------------
> Peter Rice                | Informatics Division, The Sanger Centre,
> E-mail: pmr at sanger.ac.uk  | Wellcome Trust Genome Campus,
> Tel: (44) 1223 494967     | Hinxton, Cambridge, CB10 1SA, England
> Fax: (44) 1223 494919     | URL: http://www.sanger.ac.uk/Users/pmr/



