IUBio

FASTA format - proposed max line limit

Andrew Dalke dalke at bioreason.com
Mon Nov 16 21:11:09 EST 1998


François Jeanmougin <jeanmougin at igbmc.u-strasbg.fr>
>        I don't think it is the point. keeping comments out of the
>        sequence will dramatically increase the speed of both 
>        sequence similarities and keyword searches. It is not so
>        hard to use two files for a sequence, one for the sequence
>        itself and one for the comments.
>
>        It is not for the programmer's sake but for the researcher
>        comfort.

I haven't noticed much of a performance problem with fasta on
long records.  End-of-line detection is quite fast, and the program
uses memory-mapped files, so buffering the input isn't a problem.

Alignment is far more expensive than per-character checks, so I
don't believe the comments can become a bottleneck unless they get
a lot longer than they are now.

However, if you are concerned about speed for keyword searches,
why not take a look at the Glimpse search engine?  (From
glimpse.cs.arizona.edu, I think.)  By default it will ignore words
longer than about 10 characters, which means most of the sequences
will be skipped (excepting those <= 10 characters :).  If that's
a hassle you can add a filter to the indexer and strip out all
sequence lines (replace them with spaces to keep the line/character
count the same).
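
A minimal sketch of such a filter, in Python.  The function name is
made up for illustration, and it assumes that every line not starting
with ">" is sequence data.  How the filter gets hooked into the
Glimpse indexer is not shown here.

```python
def blank_sequence_lines(lines):
    """Yield header lines unchanged and sequence lines as spaces.

    Assumption: any line that does not start with ">" is sequence data.
    Each blanked line keeps its original length and line ending, so
    line numbers and character offsets in the output match the input.
    """
    for line in lines:
        if line.startswith(">"):
            yield line
        else:
            body = line.rstrip("\r\n")
            ending = line[len(body):]
            yield " " * len(body) + ending
```

Because offsets are preserved, a hit reported against the filtered
text points at the same place in the original file.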

Then your keyword search will be *really* fast compared to grep,
since the files need only be indexed once (when the database
changes).  This is part of what we did for the DiscoveryBase product
at the Molecular Applications Group (www.mag.com).

More to the original thread, proposing an 80-character line
length has as much likelihood of becoming a standard as any
other variation I've seen, meaning (almost) none.  We've already
got NCBI using embedded ^A in their line to emulate newlines, and
what incentive is there for them to change?
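
For anyone who has to cope with those lines, here is a rough sketch in
Python of pulling such a header apart.  It assumes the separator is
the single control character ^A (0x01); the function name and the
sample IDs in the usage note are made up for illustration.

```python
def split_defline(header_line):
    """Split one ">" header line into its ^A-separated entries.

    Assumption: entries are separated by the control character ^A
    (0x01).  The leading ">" and the line ending are dropped.
    """
    text = header_line.rstrip("\r\n")
    if text.startswith(">"):
        text = text[1:]
    return text.split("\x01")
```

Called on ">id1 first description^Aid2 second description" (with a
real ^A character), this returns the two entries as separate strings.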

As I understand the whole process, having all the data on one
line is:
  annoying to those that need to edit it
  cumbersome to those that write parsers and need to use (possibly)
    dynamic arrays for input
  unneeded for those with a database; they only need the database
    reference ID in the >comment line.

The first is "fixed" by a couple of possibilities:
  write a filter to split/join multiple lines
  write an emacs mode that recognizes lines that start with a ">"
     and "line wrap" visually but not textually.
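
The split/join filter might look something like the Python sketch
below.  FASTA itself has no continuation syntax for header lines, so
the ">+" marker is purely a hypothetical local convention invented for
this example, as are the function names.  It would misfire on a file
where a real header starting with ">+" directly follows another
header.

```python
CONT = ">+"  # hypothetical continuation marker, not part of FASTA


def split_headers(lines, width=80):
    """Break long ">" lines into pieces of at most `width` characters."""
    step = width - len(CONT)
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">") and len(line) > width:
            yield line[:width]
            rest = line[width:]
            for i in range(0, len(rest), step):
                yield CONT + rest[i:i + step]
        else:
            yield line


def join_headers(lines):
    """Undo split_headers: glue continuation lines back onto the header."""
    pending = None  # header currently being reassembled
    for line in lines:
        line = line.rstrip("\r\n")
        if pending is not None and line.startswith(CONT):
            pending += line[len(CONT):]
            continue
        if pending is not None:
            yield pending
            pending = None
        if line.startswith(">"):
            pending = line
        else:
            yield line
    if pending is not None:
        yield pending
```

Splitting is done at fixed character positions rather than at word
boundaries, so joining restores the original header exactly.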

The second is a well solved problem in C/C++, and trivial in Perl
and Python.  It's a lot harder for Fortran, but that's a 'hole
'nother story.  (Besides, I've got a lot of binary Fortran files
that I want to read in C, so we're even :)
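
To show what "trivial" means here, a minimal Python reader follows.
It is only a sketch: the name read_fasta is made up, and it assumes
plain ">" headers with the sequence on the lines that follow.  Strings
and lists grow as needed, so there is no fixed line buffer to overflow
however long the header gets.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open text file.

    Assumption: a record is a ">" header line followed by zero or
    more sequence lines; text before the first header is ignored.
    """
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Usage: a 5000-character header is read like any other line.
sample = io.StringIO(">a " + "x" * 5000 + "\nAC\nGT\n>b\nTT\n")
records = list(read_fasta(sample))
```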

As for the third: well, not all of us need the database.  Just doing
straight FASTA searches and parsing/cross-referencing the results
is sufficient for a lot of people.  Perhaps another way of looking
at it is that FASTA is used both for database exchange (with all
the extra text) and under local conventions (with references to
local databases).

  Or, we can do everything in XML  :)
						Andrew
						dalke at bioreason.com




More information about the Bio-soft mailing list