I've written a sequence assembly algorithm that, as part of its output,
generates a set of files, each containing a representation of a multiple
sequence alignment. So far I've used the GDE flat file format:
#name1(offset)
sequence1
#name2(offset)
sequence2
.
.
.
because it's simple, it doesn't make me waste disk space encoding leading
gaps in the alignment as "-"s, and it's easy to load into GDE. Another
scientist wants to be able to load my contig files into his multiple
sequence editor. He has modified his editor to read the GDE flat file
format. We will probably extend the format by adding information to the #
lines of the file, which GDE luckily seems to ignore. Before we go to the
trouble of defining our own standard, I thought
it best to consult the community about existing standards that would fit
the bill instead of creating yet another file format. I have looked at all
of the formats GDE will generate using "save as" or "export foreign format"
and found none of GDE's renditions of them satisfactory. Most formats
either force leading gaps to be written out explicitly, are overly complex,
or both. I'd love to hear about your favorite file format that does not
suffer from these problems.
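
For readers unfamiliar with the layout shown above, here is a minimal Python sketch of a reader for it. It is only an illustration under stated assumptions, not GDE's own parser: the function names are hypothetical, the offset is taken to be a non-negative integer counting leading gap columns, a missing "(offset)" is treated as 0, and a sequence is allowed to span several lines.

```python
import re

# Assumed header shape: "#name(offset)"; anything after it on the line is ignored.
HEADER = re.compile(r"^#(?P<name>[^(\s]+)(?:\((?P<offset>\d+)\))?")

def parse_gde_flat(text):
    """Return a list of (name, offset, sequence) tuples from flat-file text."""
    records = []
    name, offset, chunks = None, 0, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # A new header closes the record collected so far.
            if name is not None:
                records.append((name, offset, "".join(chunks)))
            match = HEADER.match(line)
            if match is None:
                raise ValueError(f"unrecognized header line: {line!r}")
            name = match.group("name")
            offset = int(match.group("offset") or 0)
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, offset, "".join(chunks)))
    return records

def padded_rows(records):
    """Expand each offset into explicit leading '-' characters."""
    return ["-" * offset + seq for _, offset, seq in records]
```

The second function shows the disk-space point: the leading gaps only have to exist once the alignment is expanded for display, not in the file itself.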
One enhancement to the GDE flat file format we have been considering is a
way to have more than one contig per file. We are considering doing this
by tagging every consensus sequence, and hence delimiting its associated
sequences, with a leading ## instead of a single #.
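
A rough Python sketch of how a reader might split such a file into contigs follows. It assumes, beyond what is stated here, that each ## consensus record comes before the # records that belong to it; the function name and the dictionary layout are hypothetical.

```python
def split_contigs(text):
    """Group records into contigs, starting a new contig at each '##' line."""
    contigs = []
    current = None  # the record currently receiving sequence lines
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("##"):
            # Consensus record: opens a new contig.
            current = {"header": line[2:], "sequence": ""}
            contigs.append({"consensus": current, "members": []})
        elif line.startswith("#"):
            # Ordinary record: belongs to the most recent consensus.
            if not contigs:
                raise ValueError("member record found before any ## line")
            current = {"header": line[1:], "sequence": ""}
            contigs[-1]["members"].append(current)
        elif current is not None:
            current["sequence"] += line
    return contigs
```

A reader that only understands single # headers would still see every record, since ## also begins with #, though it would read the extra # as part of the name.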
Another enhancement would be to add some data integrity information on the
# line such as sequence length and composition and perhaps a checksum.
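
To make the integrity idea concrete, here is one possible shape for such a header line, sketched in Python. Everything about the layout is an assumption made for illustration: the key names len, comp and crc are hypothetical, and CRC-32 (via the standard zlib module) is just one candidate checksum.

```python
import zlib
from collections import Counter

def integrity_fields(sequence):
    """Compute length, base composition and a CRC-32 for one sequence."""
    seq = sequence.upper()
    counts = Counter(seq)
    return {
        "len": str(len(seq)),
        "comp": ",".join(f"{base}:{counts[base]}" for base in sorted(counts)),
        "crc": format(zlib.crc32(seq.encode("ascii")), "08x"),
    }

def format_header(name, offset, sequence):
    """Build a '#name(offset) key=value ...' line (hypothetical layout)."""
    extras = " ".join(f"{k}={v}" for k, v in integrity_fields(sequence).items())
    return f"#{name}({offset}) {extras}"

def header_matches(header, sequence):
    """Check the key=value fields on a header against the sequence."""
    stored = dict(item.split("=", 1) for item in header.split()[1:])
    return stored == integrity_fields(sequence)
```

For example, format_header("r1", 2, "GTAA") yields a line beginning "#r1(2) len=4 comp=A:2,G:1,T:1 crc=", and header_matches lets a reader detect a sequence that was truncated or altered after the file was written.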
Thanks in advance for all comments via email or netnews.
Granger Sutton
The Institute for Genomic Research
grange at tigr.org