IUBio

FASTA format - proposed max line limit

Andrew Dalke dalke at bioreason.com
Sun Dec 6 01:39:00 EST 1998


David Mathog <mathog at seqaxp.bio.caltech.edu> said:
> Yes, it is, but not in the sense you meant it.  The fundamental problem
> with lines >80 characters is that there is no consistency in how they
> will be displayed.  They might wrap, they might truncate, they might be
> scrolled off the right hand side of the screen (which an end user might
> not notice when scanning quickly through a 100 entry FASTA file with a
> tool like "nedit" or "notepad").  There are even a few tools around
> which will do nasty things when they encounter overly long "text"
> records, for instance EDT on VMS will truncate them to 255 characters. 

I'll argue that they're using the wrong tools or not using their tools
well enough.  There exist solutions for all those platforms (well, I'm
not sure about VMS but you can probably get emacs for it).

  NEdit will let you choose among several wrap styles under "Preferences-Wrap":
None, Auto Newline, and Continuous (you probably want the last one).
You could probably define a macro to go into the right mode by default
when a .fasta file is read.
  
  Notepad is, I understand, a lousy text viewer.  One recommendation I've
heard is at
http://x6.dejanews.com/getdoc.xp?AN=414147166&CONTEXT=912925389.2082734198&hitnum=2
which suggests getting the free NoteTab Light from
http://www.notetab.com/ .  It even has a menu option to replace or
reinstall Notepad as the default viewer.  I don't have an MS machine
so I cannot tell how it deals with word wraps.  The description
online says it can do a lot.

  I'll bet BBEdit for the Mac handles line wrapping issues easily
  as well.

  Or, put it another way.  Suppose you're looking at a structure
file.  Do you want to look at the 2D/3D coordinates in a text editor
(after all, they are text files) or in a structure viewer?  FASTA
happens to contain simple enough data that a basic text viewer
usually works well, but it doesn't always work.  Given that more
advanced, free, powerful, easier to use and in some cases fully
backwards compatible tools exist, why stunt an existing format
(and break compatibility with other software)?


> FASTA is a TEXT format, so fasta files should look very much the
> same with the widest range of existing text tools.  Long lines are
> not compatible with that goal.

  Just because something is in text, or more specifically ASCII (since
Unicode is text too), doesn't mean it is designed to be human readable.
Here are a couple of counter-examples, from an interactive Python
session:

>>> import pickle, StringIO, sys, uu
>>> print pickle.dumps( ("This is a comment", "EKLADWERDNA") )
(S'This is a comment'
p0
S'EKLADWERDNA'
p1
tp2
.
>>> infile = StringIO.StringIO(">This is a comment\nEKLADWERDNA\n")
>>> uu.encode(infile, sys.stdout)
begin 666 -
?/E1H:7,@:7,@82!C;VUM96YT"D5+3$%$5T521$Y!"@  
 
end

Both of these are TEXT representations, but the first is designed to
store complex data structures in a form easily parsable by a
computer and somewhat parsable by a human (so says the format
documentation) while the second is designed to send arbitrary data via
email/usenet to software designed for limited text displays.  In the
case of uuencode/uudecode, the text width is about 61 characters as
many older terminals had 65 character displays.  Be thankful we can
expect 80 these days :)


  I persist in saying that the FASTA format is poorly designed and
overused.  It's 90% good enough, and simple, so it has become the de
facto sequence format.  It's the other 10% that makes it nasty,
such as not anticipating that people would overload the single comment
line, which forces long lines that make it less readable.

  To stay in the strictures of the existing format, your proposal is
to limit the format even more by keeping the line width under 80
characters.  As I pointed out in an earlier email, there is existing
software that expects it can put more than 80 characters in a line. 
If you're going to change the format to make existing software
unusable, why not remove restrictions instead of adding new ones?

  Here's a format that I'll call FASTA-Next Generation (in homage to
IP-NG which was in homage to Star Trek, Next Generation :).  The
extension for this format is ".fng" and the MIME type is
chemical/fasta-ng.  It should make a FASTA variant that's 95% good
enough, and as good as I think you can do and still be FASTA-like.

  All this format does is extend the existing FASTA format by allowing
multiple successive comment lines.  Instead of having one long line,
that line can be folded across many lines.

Here's a more formal definition in BNF:

  FNG_FILE ::=
       NEWLINE*
     | NEWLINE* RECORD+
     ;

  NEWLINE ::=  "\r\n";

  RECORD ::=  COMMENT+ SEQUENCE+ NEWLINE*;

  COMMENT ::= '>' text+  NEWLINE;

  SEQUENCE ::= text+ NEWLINE;

The definition of "text" is given by RFC 822, Appendix D (see
http://info.internet.isi.edu:80/in-notes/rfc/files/rfc822.txt) as:
text        =  <any CHAR, including bare CR & bare LF, but NOT
                including CRLF>
CHAR        =  <any ASCII character>  ; (  octal 0-177,  decimal 0-127)
CR          =  <ASCII CR, carriage return>  ; ( octal 15, decimal 13)
CRLF        =  CR LF
LF          =  <ASCII LF, linefeed>         ; ( octal 12, decimal 10)

My format proposal requires that lines end with CRLF, also as suggested
in the RFC.

It is urged that implementers of FNG parsers offer an option to
recognize ASCII CR (Macs) and ASCII LF (Unix) as alternate newline
characters to support noncompliant implementations.  The option should
not be enabled by default because of the definition of CHAR given
above.  It is also urged that writers of FNG files do not use CR or LF
other than as CRLF because of these portability issues.
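
As a rough illustration of that option, here is a minimal sketch in
present-day Python 3 (not the Python of the session earlier in this
message).  The function name split_fng_lines and its lenient flag are
hypothetical names chosen for illustration; they are not part of the
proposal.  Strict mode is the default, as urged above.

```python
import re


def split_fng_lines(data, lenient=False):
    """Split FNG text into lines (hypothetical helper, for illustration).

    Strict mode treats only CRLF as a line terminator, so a bare CR or
    bare LF stays inside the line as ordinary "text".  Lenient mode also
    accepts bare CR (Mac) and bare LF (Unix) as terminators.
    """
    # CRLF is listed first so it is consumed as one terminator, not two.
    pattern = r"\r\n|\r|\n" if lenient else r"\r\n"
    lines = re.split(pattern, data)
    # A terminator at the very end leaves one empty trailing piece; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    return lines
```

In strict mode the input ">c\r\nAC\nGT\r\n" is two lines, the second
containing a bare LF; in lenient mode it is three.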

For readability, long text in the comment and sequence fields
> may be "folded" onto multiple lines of the actual field.
>
> "Long" is commonly interpreted to mean greater than 65 or 72
> characters.  The former length serves as a limit, when the message
> is to be viewed on most simple terminals which use simple display
> software; however, the limit is not imposed by this standard.
> 
>    Note:  Some display software often can selectively fold lines,
>           to  suit  the display terminal.  In such cases, sender-
>           provided  folding  can  interfere  with   the   display
>           software.
                               -- RFC 822, section 3.4.8

It is strongly urged that users of this format not place complex
information inside of the comment section.  Other formats, such as
XML, are more appropriate.  Whitespace should only be used in the
comment section to indicate separations between words and must not
be used to align information between multiple lines (that is,
whitespace must not be used for vertical alignment).  No implicit
whitespace is assumed by a CRLF.

In other words, the following comment lines
>This is a
> comm
>ent line.

are semantically equivalent to "This is a comment line."  (There is no
whitespace at the end of any of the three lines.)
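
The same rule can be sketched as a small record reader in present-day
Python 3.  The name parse_fng is hypothetical, and the sketch assumes
the input has already been split into lines with the terminators
removed.  It is also more tolerant than the BNF: it skips empty lines
anywhere and yields a final record even if it has no sequence lines.

```python
def parse_fng(lines):
    """Yield (comment, sequence) pairs from already-split FNG lines.

    Hypothetical helper for illustration.  Successive '>' lines are
    concatenated with nothing added between them, since a line break
    implies no whitespace.
    """
    comment, sequence = [], []
    for line in lines:
        if line == "":
            continue  # assumption: empty lines carry no data
        if line.startswith(">"):
            if sequence:
                # A '>' line after sequence data starts the next record.
                yield "".join(comment), "".join(sequence)
                comment, sequence = [], []
            comment.append(line[1:])  # drop the leading '>'
        else:
            if not comment:
                raise ValueError("sequence data before any '>' comment line")
            sequence.append(line)
    if comment:
        yield "".join(comment), "".join(sequence)
```

Fed the three comment lines above followed by a sequence line, it
yields one record whose comment is "This is a comment line."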


There should be a paragraph here saying that (I think) whitespace is
important inside of the sequence section and that tabs and spaces are
not interconvertible.  I don't know enough about the use of spaces as
a gap character to know how to write this.  In this case, CRLF should
still not act as any sort of whitespace.


  This proposal is:
   o well-defined
   o cross platform
   o a superset of the existing FASTA format (excepting newline
       conversion)
   o easy to parse
   o a "normal looking" extension of common usage
   o great for MS Windows users since that OS uses CRLF as the system
      default newline.

Because the comment and sequence fields can be folded across multiple
lines without losing any semantic meaning, it is also possible to
convert any FNG file to an FNG file that uses fewer than 80 characters
per line.  Conversion of such a folded file to a "classic FASTA"
format is trivial: remove each CRLF between comment lines in the
same record, along with the '>' that starts the following line.  If a
program doesn't support "long" lines the input can be filtered with
no loss of information.  Or the output can be filtered to normalize
to the in-house standards.
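
Both filters can be sketched in a few lines of present-day Python 3.
The names fold_record and to_classic_fasta are hypothetical, the
default width of 79 is just one way to stay under 80 characters, and
the sketch assumes a non-empty comment and a sequence that contains no
'>' character.  Lines are handled without their CRLF terminators.

```python
def fold_record(comment, sequence, width=79):
    """Return the FNG lines of one record, each at most `width` long.

    Hypothetical helper for illustration.  No whitespace is added or
    removed at the fold points, so the folding is reversible.
    """
    if width < 2:
        raise ValueError("width must leave room for '>' plus one character")
    body = width - 1  # room left on a comment line after the leading '>'
    lines = [">" + comment[i:i + body] for i in range(0, len(comment), body)]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return lines


def to_classic_fasta(lines):
    """Merge each run of successive comment lines into a single line."""
    out = []
    previous_was_comment = False
    for line in lines:
        is_comment = line.startswith(">")
        if is_comment and previous_was_comment:
            out[-1] += line[1:]  # drop the '>' and join with no whitespace
        else:
            out.append(line)
        previous_was_comment = is_comment
    return out
```

Folding a record and then passing the result through to_classic_fasta
gives back the original single comment line; the sequence lines are
left folded, which classic FASTA already allows.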

Also, excepting the newline issue and some whitespace details, there's
already software that understands this format since it's a natural
extension of the existing format.

There are other things I would like, including format version
information, but to stay close to the existing FASTA format I'll hold
back.


Any comments?  Any reason this format is inappropriate for those
places where the current FASTA format is appropriate?  Any problems with
it?  Remember, any suggestion that changes current practice (like
limiting line length) is incompatible with existing software, so why
not make things better instead of worse?

						Andrew Dalke
						dalke at bioreason.com




More information about the Bio-soft mailing list
