To the group,
My favorite horror (to use Peter's term) is this one (and its brothers
and sisters in dbEST):
>gb|AA406996|acc:AA406966 EST02002 Mouse 7.5 dpc embryo ectoplacental
cone cDNA library Mus musculus cDNA clone C0016E04 3' similar to Mouse
mitochondrial genome. >gb, score = 1720, mRNA sequence - Mus musculus,
404 bp (RNA)
Enjoy! ;-)
Brian O.
Brian Osborne
Cadus Pharmaceutical Corporation
777 Old Saw Mill River Rd.
Tarrytown NY USA
10591-6705
brian.osborne at cadus.com
TEL 914 467 6291
FAX 914 345 3565
> -----Original Message-----
> From: Peter Rice [SMTP:pmr at sanger.ac.uk]
> Sent: Friday, November 06, 1998 5:21 AM
> To: bio-soft at net.bio.net
> Subject: Re: FASTA format - proposed max line limit
>
> mathog at seqaxp.bio.caltech.edu writes:
>
> > We all know that the FASTA format is a bit restrictive in that there is
> > only the one line for comments, but can the software/database community
> > *please* agree on some reasonable maximum line length for both the
> > comments and the sequence?
>
> I would welcome a standard "unique identifier" format after the ">".
>
> We use FASTA format extensively at the Sanger Centre, but we need to
> hold both an identifier and an accession number. In specific cases
> we also need a database name. Often there is other information used
> (typically numeric) to generate several unique forms from one
> original name.
>
> Curiously, one reason for the expansion of FASTA format is BLAST, as
> it takes as its database a file of many FASTA format sequences which
> need to have unique identifiers.
>
> One option to get extra identifier information is to use the NCBI
> style with "|" characters to split the fields. Sometimes this seems to
> have special information in the first word(s) of the description too,
> for example in dbEST.weekly.FASTA
>
> >gi|1622446|dbj|C21336|C21336 HUMGS0003372, Human Gene Signature, \
> 3'-directed cDNA sequence
>
> (actually this is followed by "ctrl-A" and more description - see
> "other horrors" below)
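For anyone wanting to pull such a header apart, here is a minimal Python sketch. It assumes the identifier is the first whitespace-delimited word, that "|" separates its fields, and that ctrl-A (0x01) separates packed descriptions. The function name parse_ncbi_defline is made up for illustration and is not part of any NCBI tool.

```python
def parse_ncbi_defline(line):
    """Split a '>' header line into identifier fields and descriptions."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    body = line[1:].rstrip("\r\n")
    # Assumption: the identifier is the first whitespace-delimited word.
    parts = body.split(None, 1)
    ident = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    # NCBI-style identifiers use "|" between fields, e.g. gi|1622446|dbj|...
    fields = ident.split("|")
    # Ctrl-A (0x01) is used to fake extra description lines on one line.
    descriptions = [d.strip() for d in description.split("\x01") if d.strip()]
    return fields, descriptions
```

The fields come back as a flat list, since the number and meaning of the "|" fields differs between databases.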
>
> Another, since we have some FASTA files generated by GCG, is the GCG
> syntax of:
>
> >DB:entryname accnum yet...more...description
>
> We generate FASTA files from our unfinished sequence data where the
> unique name is built from the clone and contig, using "." as a
> delimiter, for example:
>
> >bK109G6.05061 Unfinished sequence: bK109G6 Contig_ID: 05061 \
> acc=AL023879 Length: 25298 bp
> >bK109G6.05234 Unfinished sequence: bK109G6 Contig_ID: 05234 \
> acc=AL023879 Length: 129756 bp
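A rough Python sketch of reading the clone.contig style back, assuming the header has already been joined onto one line (the "\" above is only a continuation mark in this mail). The function name and the dictionary keys are hypothetical, and the "acc=" and "Length:" patterns are taken from the two examples only.

```python
import re

def parse_unfinished_header(line):
    """Parse a header such as '>bK109G6.05061 ... acc=AL023879 Length: 25298 bp'."""
    body = line.lstrip(">").strip()
    ident, _, description = body.partition(" ")
    # The unique name is clone and contig joined by ".".
    clone, _, contig = ident.partition(".")
    acc = re.search(r"acc=(\S+)", description)
    length = re.search(r"Length:\s*(\d+)\s*bp", description)
    return {
        "id": ident,
        "clone": clone,
        "contig": contig,  # kept as text so the leading zero survives
        "acc": acc.group(1) if acc else None,
        "length": int(length.group(1)) if length else None,
    }
```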
>
> I have seen various other styles of identifier to represent
> subsequences with a unique name, typically needed in protein
> domain databases, for example:
>
> >entryname-start-end (e.g. SBASE)
> >entryname/start-end (horrible for generating filenames)
> >entryname\start-end (the "/" still causes confusion with
> filenames)
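The three subsequence styles can be read with one pattern. This is only a sketch in Python: it assumes start and end are plain integers, and the names SUBSEQ_ID, parse_subsequence_id and to_filename are invented here. Using "_" in to_filename is one arbitrary way to avoid "/" and "\" in file names.

```python
import re

# entryname, then "-", "/" or "\", then start-end at the end of the identifier
SUBSEQ_ID = re.compile(r"^(?P<name>.+?)[-/\\](?P<start>\d+)-(?P<end>\d+)$")

def parse_subsequence_id(ident):
    """Return (entryname, start, end), or None if the identifier does not fit."""
    m = SUBSEQ_ID.match(ident)
    if m is None:
        return None
    return m.group("name"), int(m.group("start")), int(m.group("end"))

def to_filename(name, start, end):
    """Build a name without path separators (an arbitrary choice of '_')."""
    return "{}_{}_{}".format(name, start, end)
```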
>
> Other horrors:
>
> Using control characters to fake extra lines in the description,
> for example ctrl-a appears in NCBI's dbEST.weekly.FASTA files.
>
> UniGene's "seq.all" file has clusters of FASTA format sequences
> headed by comment lines starting with "#"
>
> BLAST1.4 pressdb fails if the sequence lines are not all the same
> length.
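A defensive reader and writer can work around the last two of these. The Python sketch below skips "#" lines and rewraps every sequence to one fixed width. The function names and the width of 60 are my own assumptions, and the final line of a sequence can still come out shorter than the rest; I have not checked whether pressdb accepts that.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs, ignoring blank and '#' comment lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)

def format_fasta(header, sequence, width=60):
    """Write one record with every sequence line (bar the last) 'width' long."""
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + header] + rows) + "\n"
```

Usage would be along the lines of writing format_fasta(h, s) for each (h, s) from read_fasta(open_file) into a new file before indexing it.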
>
> An additional need is to have parseable information in the description
> which can be used to efficiently mark up blast search results for a Web
> service.
>
> >The Fasta-1998 REFERENCE format is very similar to the SEQUENCE
> >format.
> >
> >R1. The reference file will hold information that didn't fit
> > inside the Sequence file.
> >
> >R1.a The comment line for each entry in the reference file must
> > contain the ">" followed by the identifier, but no other
> > information.
>
> A nice idea. I would certainly support this kind of format for EMBOSS.
>
> It is of course closely related to NBRF format, and its derivative
> GCG database format(s).
>
> File naming could be a problem - the right FASTA REFERENCE file has to
> be associated with a FASTA SEQUENCE file. A ".ref" extension would help,
> but the sequence file may itself have various (or no) file extensions.
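One possible convention, sketched in Python and not part of the Fasta-1998 proposal as far as I know: append ".ref" to the complete sequence file name instead of replacing its extension, so the pairing works whatever extension (or none) the sequence file has. The function name reference_path is hypothetical.

```python
from pathlib import Path

def reference_path(sequence_path):
    """Return the assumed reference file name for a sequence file."""
    p = Path(sequence_path)
    # Append instead of replacing, so "seq.fa" and "seq.fasta" in the same
    # directory map to different reference files.
    return p.with_name(p.name + ".ref")
```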
>
>
> --
> ----------------------------------------------------------------------
> Peter Rice | Informatics Division, The Sanger Centre,
> E-mail: pmr at sanger.ac.uk | Wellcome Trust Genome Campus,
> Tel: (44) 1223 494967 | Hinxton, Cambridge, CB10 1SA, England
> Fax: (44) 1223 494919 | URL: http://www.sanger.ac.uk/Users/pmr/