IUBio

FASTA format - proposed max line limit

tendo tendo at nucleus.harvard.edu
Thu Nov 12 02:43:19 EST 1998


Nice proposal!
But why not allow comment lines in addition to the identifier line?  A
semicolon could serve as the character that marks a comment line.  In fact,
the FASTA programs seem to accept lines starting with a semicolon as comments.
This is nice because 1) you can put as many comments as you want in the
same place as the sequence data, 2) you can separate the identifier line from
other comments, and 3) comments can be removed very easily, without removing
the identifier, using the grep command.  The third feature is actually
important because those lines need to be easy to strip when they cause
problems: BLAST and CLUSTAL W don't seem to treat them as comments.
A separate reference file is a good idea in this sense, but handling
references separately is often messy.
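
To illustrate point 3, here is a minimal Python sketch of such a filter. It
assumes a comment line is any line with a semicolon in the first column; the
function name strip_comments and the sample record are made up for
illustration.

```python
def strip_comments(fasta_text):
    """Drop lines that start with ';' and keep everything else,
    including the '>' identifier line and the sequence lines."""
    return "".join(
        line
        for line in fasta_text.splitlines(keepends=True)
        if not line.startswith(";")
    )


# Hypothetical record with two semicolon comment lines.
sample = (
    ">seq1 made-up example record\n"
    "; free-form comment\n"
    "ACGT\n"
    "; another comment\n"
    "TTGA\n"
)
cleaned = strip_comments(sample)
print(cleaned, end="")
```

From the shell, grep -v '^;' gb109.fa should do the same job.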


I have one more proposal: there should be a BLANK LINE AFTER EACH SEQUENCE.
This makes it easy to search a database by keyword with a short Perl
script, e.g.

[keyword.pl]
#!/usr/bin/perl
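$kw = join '|', @ARGV;   # keywords from the command line, OR-ed into one pattern
$/ = "";                 # paragraph mode: one blank-line-delimited record per read
while (<STDIN>) {        # read STDIN so the keywords are not opened as files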
    /$kw/o and print;
}

[usage]
% cat gb109.fa | keyword.pl globin | keyword.pl horse > result

will yield sequences that contain both globin and horse.
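
The same idea can be sketched in Python, with the two keywords combined in
one pass instead of two piped runs. This is only a sketch under stated
assumptions: records are separated by one or more blank lines, keywords are
matched as plain case-sensitive substrings rather than regular expressions,
and the name search_records and the sample records are made up.

```python
import re


def search_records(fasta_text, keywords):
    """Return the blank-line-separated records that contain every keyword."""
    # Split on one or more blank lines, much like Perl's paragraph mode.
    records = [r for r in re.split(r"\n\s*\n", fasta_text.strip()) if r]
    return [r for r in records if all(k in r for k in keywords)]


# Hypothetical three-record database.
sample = (
    ">rec1 horse alpha globin (made-up)\nACGT\n\n"
    ">rec2 human beta globin (made-up)\nGGCC\n\n"
    ">rec3 horse cytochrome c (made-up)\nTTAA\n"
)
hits = search_records(sample, ["globin", "horse"])
for record in hits:
    print(record)
    print()
```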

Only 5 lines of Perl code are enough for a search!  It also makes coding
easier in C, I think (I don't know about Fortran... but who cares?)

This makes the database slightly bigger, but that wouldn't be a serious
problem.
Better solutions?


Toshinori Endo
tendo at fas.harvard.edu
Harvard University Biological Laboratories.





More information about the Bio-soft mailing list
