FASTA format - proposed max line limit

tendo tendo at fas.harvard.edu
Mon Nov 16 23:02:06 EST 1998

Andrew, I'm sorry, but I didn't quite understand your point.
Could you explain it more simply if I have misunderstood?
What I really couldn't tell is whether you meant that we don't need a new
standard, or that we just need an 80-character limit for a single comment
line.  Or something else?

>However, if you are concerned about speed for keyword searches,
>why not take a look at the Glimpse search engine?  (From
>glimpse.cs.arizona.edu, I think.)  By default it will ignore words
>longer than about 10 characters, which means most of the sequences
>will be skipped (excepting those <= 10 characters :).  If that's
>a hassle you can add a filter to the indexer and strip out all
>sequence lines (replace them with spaces to keep the line/character
>count the same).

Thank you, I'll take a look.
Does it use a hashed index or something?
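
The sequence-stripping filter suggested in the quote above could be sketched
roughly as follows in Python.  This is only an illustration, not part of
Glimpse: the function name is made up, and it assumes that every line not
starting with ">" is sequence data.

```python
def blank_sequence_lines(lines):
    """Yield each input line with sequence data replaced by spaces.

    Header lines (starting with '>') pass through unchanged.  Every other
    line is replaced by the same number of spaces, so the line count and
    all character offsets stay the same for the indexer.
    """
    for line in lines:
        body = line.rstrip("\r\n")
        ending = line[len(body):]  # keep the original line ending as-is
        if body.startswith(">"):
            yield line
        else:
            yield " " * len(body) + ending
```

As a command-line filter it could be driven with something like
sys.stdout.writelines(blank_sequence_lines(sys.stdin)).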

Well, a secondary comment is a requirement for me, because what I really need
is much more information attached to the sequence, and I don't want to handle
two files every time I access the sequence.  I'm not smart enough to remember
hundreds of sequence IDs with their features, and I usually use the same set
of sequences many times, so keeping the information attached to the sequence
is very important.  I guess it's the same for other researchers who really
work on sequences with biological meaning.
So what I do is modify the source code of the programs, whenever it is
available, to accept a secondary comment.  I thought that if anybody sets up
a standard, it would be much nicer to have the secondary comment in the
standard, so that I never have to change the source code myself.

For many biologists, in fact, using Unix commands is not easy at all.  So if
the corresponding details are available with the sequence without any further
operation, it's nicer for them, I believe.

>More to the original thread, proposing an 80 column character
>length has as much likelihood of becoming a standard as any
>other variation I've seen, meaning (almost) none.  We've already
>got NCBI using embedded ^A in their line to emulate newlines, and
>what incentive is there for them to change?

I think they put so much information in a single comment line because they
know that information is very important to biologists who are not familiar
with computers.  So even if there is an 80-character limit in a standard,
they will dare to break the rule, because that information is necessary and
they already have the code to read the data without any problem.
If secondary comment was allowed, they wouldn't have used ^A for multiple

Human readability of the current form is a problem, as is the difficulty of
coding, but they chose to retain the information.
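
For readers who do write code, the ^A convention mentioned in the quote is
not hard to undo.  A minimal sketch in Python, assuming the header is a
single line starting with ">" and the merged entries are separated by the
Ctrl-A character (written "\x01" in Python); the function name is made up
for illustration:

```python
def split_defline(header):
    """Split one '>' header line into its Ctrl-A separated entries."""
    text = header.rstrip("\r\n")
    if text.startswith(">"):
        text = text[1:]  # drop the leading '>'
    # "\x01" is the Ctrl-A character used in place of a newline
    return text.split("\x01")
```

A header without any ^A simply comes back as a one-element list.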

>The first is "fixed" by a couple possibilities:
>  write a filter to split/join multiple lines
>  write an emacs mode that recognizes lines that start with a ">"
>     and "line wrap" visually but not textually.

This is a solution, but only for those who can write a filter or Emacs Lisp.
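
To give an idea of what such a split/join filter involves, here is a rough
sketch in Python.  The convention is purely hypothetical and not part of any
standard: the first line keeps its ">" and every continuation line starts
with ";".  It also assumes the header words are separated by single spaces,
and a single word longer than the limit is left unbroken.

```python
import textwrap

def split_header(header, width=80):
    """Split one long '>' header into lines of at most `width` columns."""
    text = header.rstrip("\r\n")[1:]  # drop the leading '>'
    chunks = textwrap.wrap(
        text,
        width=width - 1,          # leave room for the '>' or ';' prefix
        break_long_words=False,   # never cut an identifier in half
        break_on_hyphens=False,
    )
    if not chunks:
        return [">"]
    # Hypothetical convention: continuation lines start with ';'
    return [">" + chunks[0]] + [";" + chunk for chunk in chunks[1:]]

def join_header(lines):
    """Rebuild the single-line header from the lines of split_header()."""
    return ">" + " ".join(line[1:] for line in lines)
```

Under those assumptions, join_header(split_header(h)) gives back the original
header h.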

>The second is a well solved problem in C/C++, and trivial in Perl
>and Python.  It's a lot harder for Fortran, but that's a 'hole
>'nother story.  (Besides, I've got a lot of binary Fortran files
>that I want to read in C, so we're even :)

That's fine for professional or well-trained programmers.
Beginner programmers who want to write a program to handle sequences will
find it hard to write the reading code, though.
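
For comparison, this is roughly what the reading code looks like in Python,
where the length of the header line does not matter.  It is only a minimal
sketch: the function name is made up, and it assumes that records start with
a single ">" line and that every other non-blank line is sequence data.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None and line.strip():
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)
```

Since an open file yields its lines one at a time, read_fasta(open(path))
works on a whole file without loading it into memory first.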

>The 3rd; well, not all of us need the database.  Just doing
>straight FASTA searches and parsing/cross-referencing the results
>is sufficient for a lot of people.  Perhaps another way of looking
>at it is that FASTA is used for database exchange (with all the
>extra text) and for local conventions (with references to local

Well, if enough information about the individual sequences were given in the
FASTA output, that would be fine for many people.

>  Or, we can do everything in XML  :)

This is an option, but not really practical, is it?

