Alex Smith wrote:
>.. I am looking for different options for data
>storage, and was wondering if there were any XML or database standards
>which are widely used for storing genome information (ie. sequence,
Wish there were; it would make life easier for many genome miners.
Generally I'd suggest making your software multi-format aware
and/or planning to reformat the data it uses into a common form.
For euGenes*, I collect eukaryote genome data from various sources and
put it into a common format for use w/ the euGenes software (genome maps,
gene data search and reporting).
XML variants are available for fly and weed genome data (each
different). GFF for feature lists is also common. I prefer a
condensed EMBL/GenBank feature table format myself for describing
features*. GFF is similar, but spreads a single feature out through a file
(e.g. separate exon entries for each mRNA), so you may have to read
a huge file just to get one complete feature. Worm genome data
comes best in AceDB format (w/ GFF for chromosome features).
Human (public) genome data is best found in the GoldenPath MySQL
dumps, plus fasta sequence data. For yeast genome data, I use
NCBI GenBank format.
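
To illustrate the GFF point above, here is a rough Python sketch (not euGenes code). It assumes the usual 9-column tab-separated GFF layout with the group name in the last column; the function name collect_features and the sample rows are made up for the example.

```python
import io
from collections import defaultdict

def collect_features(gff_lines):
    """Group GFF rows by their ninth (group) column.

    Every row has to be read before any one group is known to be
    complete, since related rows need not be adjacent in the file.
    """
    groups = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue                      # skip blank and comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue                      # skip malformed rows in this sketch
        ftype, start, end = cols[2], int(cols[3]), int(cols[4])
        groups[cols[8]].append((ftype, start, end))
    return groups

# Made-up rows: exons of two hypothetical transcripts, interleaved.
demo = io.StringIO(
    "chr1\tdemo\texon\t100\t200\t.\t+\t.\tmRNA1\n"
    "chr1\tdemo\texon\t150\t260\t.\t+\t.\tmRNA2\n"
    "chr1\tdemo\texon\t300\t400\t.\t+\t.\tmRNA1\n"
)
features = collect_features(demo)
```

Note that the second exon of mRNA1 only turns up after a row for mRNA2, which is why the whole input is scanned before anything is returned.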
For sequence data sans features, fasta is commonly used. At
euGenes I use the raw sequence (no formatting, just the long
string of DNA in a chromosome) so that software can index
features into it efficiently.
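
A minimal Python sketch of why the raw form is convenient, assuming the file holds nothing but sequence characters (no header line, no newlines) and that feature coordinates are 1-based and inclusive. The function name read_feature and the 20-base test sequence are hypothetical, not part of euGenes.

```python
import os
import tempfile

def read_feature(raw_path, start, end):
    """Return bases start..end (1-based, inclusive) from a raw sequence file."""
    with open(raw_path, "rb") as fh:
        fh.seek(start - 1)          # byte offset equals base offset in a raw file
        return fh.read(end - start + 1).decode("ascii")

# Tiny demonstration with a made-up 20-base "chromosome".
with tempfile.NamedTemporaryFile("w", suffix=".raw", delete=False) as tmp:
    tmp.write("ACGTACGTTTGACCATGGCA")
    demo_path = tmp.name

exon = read_feature(demo_path, 9, 14)
os.remove(demo_path)
```

With fasta you would first have to account for the header line and the line breaks before a coordinate could be turned into a file offset; with raw sequence the coordinate is the offset.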
Overall, it makes sense to maintain the sequence as separate
fasta/raw files, with feature entry files in some variant of
GenBank/EMBL, GFF, or XML. For other genome data (functions, data
links, reference info, synonyms, etc.) I use a key=value
structure along the lines of LocusLink for efficiency; an XML variant
would be useful here if a generally used standard emerges.
The gene function/location/process information from the GeneOntology.org
group is a useful standard for that set of info.
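
As a sketch of the key=value idea, here is a small Python parser. The record layout is purely an assumption for illustration (one key=value per line, a blank line between records, repeated keys allowed); it is not the actual LocusLink or euGenes format, and the keys and gene ids are invented.

```python
from collections import defaultdict

def parse_records(text):
    """Parse blank-line-separated records of key=value lines.

    Repeated keys (e.g. several synonyms) are collected into lists.
    """
    records = []
    current = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:               # a blank line closes the open record
                records.append(dict(current))
                current = defaultdict(list)
            continue
        key, sep, value = line.partition("=")
        if sep:                       # ignore lines without '='
            current[key.strip()].append(value.strip())
    if current:
        records.append(dict(current))
    return records

# Hypothetical records, for illustration only.
demo = """\
id=GENE0001
symbol=abc1
synonym=abcA
synonym=abc-one
function=hypothetical kinase

id=GENE0002
symbol=xyz2
"""
genes = parse_records(demo)
```

A flat layout like this can be read line by line with no parser library, which is most of the efficiency argument; the cost is that nesting and typing are left to convention.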
* http://iubio.bio.indiana.edu/eugenes/ and for the bulk data
euGenes feature table formats described at
-- gilbertd at bio.indiana.edu