
XML standards for genome data

Don Gilbert gilbertd at bio.indiana.edu
Fri May 11 03:30:37 EST 2001

Alex Smith wrote:
>.. I am looking for different options for data
>storage, and was wondering if there were any XML or database standards
>which are widely used for storing genome information (ie. sequence,
>genes, etc).

I wish there were; it would make life easier for many genome miners.
Generally I'd suggest making your software multi-format aware,
planning to reformat the data it uses to a common form, or both.

For euGenes*, I collect eukaryote genome data from their sources and
put it into a common format for use with the euGenes software (genome
maps, gene data search and reporting).

XML variants are available for fly and weed genome data (each
different). GFF for feature lists is also common.  I prefer a
condensed EMBL/GenBank feature table format myself for describing
features*. GFF is similar, but it spreads a single feature out through
a file (e.g. separate exon entries for each mRNA), so there is no
guarantee you won't have to read a huge file just to get one complete
feature.  The worm genome comes best in AceDB format (w/ GFF for
chromosome features). Human (public) genome data is best found in the
GoldenPath MySQL dumps, plus FASTA sequence data.  For yeast genome
data, I use the NCBI GenBank format.
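
To illustrate the point about GFF spreading one feature over many
lines, here is a minimal Python sketch. It assumes tab-separated GFF
with the group field in column 9; the function name, the sample lines
and the gene names are made up for illustration and are not from any
real data set or from euGenes.

```python
from collections import defaultdict


def group_gff_features(lines):
    """Collect GFF lines into features keyed by the group column (column 9)."""
    features = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        if len(cols) < 9:
            continue  # no group column, so the line cannot be assigned
        ftype, start, end, strand = cols[2], int(cols[3]), int(cols[4]), cols[6]
        features[cols[8]].append((ftype, start, end, strand))
    return dict(features)


# Hypothetical sample: the two exons of geneA-1 are not adjacent in the file.
gff_sample = [
    "chr2L\tdemo\texon\t100\t200\t.\t+\t.\tmRNA geneA-1\n",
    "chr2L\tdemo\texon\t900\t950\t.\t-\t.\tmRNA geneB-1\n",
    "chr2L\tdemo\texon\t300\t400\t.\t+\t.\tmRNA geneA-1\n",
]
grouped = group_gff_features(gff_sample)
```

Note that the whole input has to be consumed before any one feature
can be called complete, which is the cost described above.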

For sequence data without features, FASTA is commonly used. At
euGenes I use the raw sequence (no formatting, just the long
string of DNA in a chromosome) so that software can index features
into it efficiently.
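
A rough sketch of why the raw form helps: with no header or line
breaks, a feature's coordinates map directly to byte offsets in the
file. The example below assumes one byte per base and 1-based,
inclusive coordinates as in GenBank/EMBL feature tables; the function
name and the demo file are hypothetical, not part of euGenes.

```python
import os
import tempfile


def fetch_subsequence(raw_path, start, end):
    """Return bases start..end (1-based, inclusive) from a raw sequence file."""
    with open(raw_path, "rb") as handle:
        handle.seek(start - 1)  # base N sits at byte offset N-1
        return handle.read(end - start + 1).decode("ascii")


# Write a tiny made-up "chromosome" to a temporary raw file.
with tempfile.NamedTemporaryFile("wb", suffix=".raw", delete=False) as tmp:
    tmp.write(b"ACGTACGTTTGACCA")
    demo_path = tmp.name

exon = fetch_subsequence(demo_path, 5, 8)
os.remove(demo_path)
```

With FASTA the same lookup would first have to account for the header
line and the newline at the end of every sequence line.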

Overall, it makes sense to maintain the sequence as separate
fasta/raw files, with feature entry files in some variant of
GenBank/EMBL, GFF, or XML.  For other genome data (functions, data
links, reference info, synonyms, etc.) I use a key=value
structure along the lines of LocusLink for efficiency; an XML variant
would be useful here if any generally used standard emerges.
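
As a sketch of what a key=value structure can look like in practice,
the Python below parses a made-up layout: one key=value pair per line,
a blank line between records, and repeated keys (such as synonyms)
collected into lists. The layout, key names and identifiers are
assumptions for illustration only; they are not the actual LocusLink
or euGenes formats.

```python
def parse_records(text):
    """Parse blank-line-separated records of key=value lines into dicts of lists."""
    records = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:  # a blank line closes the current record
                records.append(current)
                current = {}
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # ignore lines without an '=' sign
        current.setdefault(key.strip(), []).append(value.strip())
    if current:
        records.append(current)
    return records


# Hypothetical records; none of these identifiers are real.
kv_sample = """\
id=GENE0001
symbol=abc1
synonym=abcA
synonym=abc-alpha

id=GENE0002
symbol=xyz2
"""
records = parse_records(kv_sample)
```

A flat layout like this can be read in one pass with very little
code, which is the kind of efficiency meant above.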

For gene function/location/process information, the work of the
GeneOntology.org group is a useful standard.

-- Don

* http://iubio.bio.indiana.edu/eugenes/ and for the bulk data
euGenes feature table formats described at

-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- gilbertd at bio.indiana.edu


More information about the Bio-soft mailing list
