IUBio

FASTA format - proposed max line limit

tendo tendo at nucleus.harvard.edu
Fri Nov 13 13:53:15 EST 1998


>I'm so used to having the reference information in a separate file (GCG
>databases), that I don't view it as an inconvenience.  Rather, I want
>to separate the messy problem of parsing the reference information,
>which can be in ASN.1, Genbank, etc, or even just free comments, from the
>much simpler problem of parsing the DNA/Protein sequences.

I get your point.
However, I think the major problem with the current FASTA format is the
unlimited length of the comment line.  People put so much information there
simply because they need that information WITH the sequence.  If you ignore
this point, nobody will use this standard.

As a researcher, I do want to include a lot of reference information in the
sequence file because I need it.  For example, I want to put the source
species, the gene name, the intron positions with their phases, and the
translation WITH the DNA sequence.  If that information is kept separate, it
is too much work for me to identify the actual intron positions in the
sequence, as well as to check the translation of the sequence.  This matters
especially for the translation data: the theoretical translation and the
translation provided by GenBank often differ, and the latter might be either
experimentally determined or just an input error.

So it's not a good idea to establish a standard only for the programmer's
sake.


>The proposed
>(new) reference FASTA format contains just enough information (the unique
>identifier, common between sequence and reference) to allow programs
>to match up the two pieces of information - so that database generators
>can safely put the (often) unformatted text that has been going onto
>the ">" lines into a safer place, and the end users can still find it, even
>if they are only using "more", or Microsoft Word. ie, they need only search
>for "{NEWLINE}>identifier{ENDofLINE}".  If the people writing the
>database show a bit of restraint in their reference files,
>">identifier" alone would suffice.


This is only true when the user handles a small amount of data.
If there were a very good database manager that let you obtain the reference
data immediately from FASTA output without searching for each entry
individually, for example, it might be acceptable, though.



>>Only 5 lines of perl code is enough for search!  Also, it makes coding
>>easier also in C, I think.
>
>Blank lines and extra spaces should be ignored (that would fall under the
>alphabet business, which I didn't specify.)  They should not be required to
>convey any particular meaning. For sure they don't make reading a FASTA
>file in C any easier:
>
>char buffer[80];           /* conformant FASTA files only */
>  if(gets(buffer)){         /* else it blows up here */
>     if(buffer[0] == '>'){  /* uniquely identifies start of an entry */
>     }
>     else {                 /* remainder of an entry */
>     }
>  }


You missed the point.
When you read a file with multiple sequences, it's nicer this way:

#define LINELEN 80
char buffer[LINELEN+2];                 /* 80 characters + '\n' + '\0' */
while (fgets(buffer,LINELEN+2,stdin)) {
    if (buffer[0] == '>') {             /* header line */
        code_for_the_new_sequence;
    } else if (buffer[0] == '\n') {     /* blank line ends the sequence */
        code_for_the_previous_sequence;
    } else {
        store_sequence;
    }
}

rather than

char buffer[LINELEN+2];
while (fgets(buffer,LINELEN+2,stdin)) {
    if (buffer[0] == '>') {
        if (this_is_not_the_first_sequence) {
            code_for_the_previous_sequence;
        }
        code_for_the_new_sequence;
    } else {
        store_sequence;
    }
}
if (there_is_unprocessed_sequence) {
    code_for_the_previous_sequence;
}

What do you think?


Besides the above, you had better avoid gets, which is insecure, and set
the buffer length to 81 (if you do use gets) or 82 (for fgets) instead of
80: an 80-character line needs room for the terminating '\0', and fgets
also stores the newline.  See man gets(3).
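
To make the two points above concrete, here is a minimal sketch of the
blank-line-terminated loop written as a complete function.  It assumes the
proposed 80-character line limit and that every entry ends with a blank
line; the function name count_records and its residues parameter are
hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LINELEN 80   /* assumed maximum characters per line (the proposed limit) */

/* Hypothetical helper: counts the entries in a FASTA stream in which every
 * entry is terminated by a blank line, and stores the total number of
 * sequence characters in *residues. */
int count_records(FILE *in, size_t *residues)
{
    char buffer[LINELEN + 2];   /* 80 characters + '\n' + '\0' */
    int records = 0;
    int in_record = 0;

    assert(in != NULL && residues != NULL);
    *residues = 0;

    while (fgets(buffer, sizeof buffer, in)) {
        if (buffer[0] == '>') {
            in_record = 1;              /* header line: a new entry starts */
        } else if (buffer[0] == '\n') {
            if (in_record) {            /* blank line: the entry is complete */
                records++;
                in_record = 0;
            }
        } else {
            /* sequence line: count its characters, excluding the newline */
            *residues += strcspn(buffer, "\n");
        }
    }
    return records;
}
```

Calling count_records(stdin, &n) on a conformant file returns the number of
entries.  An entry that is not followed by a blank line is not counted,
which is exactly the case the second loop above has to handle after the
while.  A line longer than LINELEN is read by fgets in several pieces, so
this sketch is only meaningful for files that respect the limit.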

te





More information about the Bio-soft mailing list