You should look into Brian Fristensky's (sp?) XYLEM package, which
I believe will do just this. I have done this with my own code, but
I wouldn't foist THAT on anyone else :-). You should try searching
SeqAnalRef for the citation. SeqAnalRef is available from
In case you do wish to roll your own, you are actually better off
IGNORING the exon features in the feature table. Why? Because many
entries don't have them, and the nomenclature for alternatively spliced
exons is assuredly inconsistent. What most folks do is use the CDS
features (if interested in coding regions, which I think you are) or the
mRNA features (less preferable -- again, they may not always be present)
and look for ones with multiple segments [CDS join(..]. There are a few rare cases in which
the segments do not correspond with exons (perfectly legal: the feature
table tells you how to assemble, not necessarily how nature does it).
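
If you do roll your own, picking the segments out of a CDS location might
look something like the sketch below. The function name parse_segments is
made up for illustration. It only handles simple local locations, optionally
wrapped in join(...) and/or complement(...); anything fancier (remote
accessions, complement() nested inside join(), order()) is deliberately
rejected rather than guessed at.

```python
import re

def parse_segments(location):
    """Parse a simple feature-table location string.

    Returns (strand, segments): strand is 1 or -1, and segments is a list
    of 1-based inclusive (start, end) pairs in the order they are listed.
    Partial markers (< and >) are accepted and ignored.
    """
    loc = location.replace(" ", "")
    strand = 1
    m = re.fullmatch(r"complement\((.*)\)", loc)
    if m:
        strand = -1
        loc = m.group(1)
    m = re.fullmatch(r"join\((.*)\)", loc)
    if m:
        loc = m.group(1)
    segments = []
    for part in loc.split(","):
        m = re.fullmatch(r"[<>]?(\d+)(?:\.\.[<>]?(\d+))?", part)
        if m is None:
            # Hypothetical policy: refuse anything this sketch does not understand.
            raise ValueError("unsupported location segment: " + part)
        start = int(m.group(1))
        end = int(m.group(2)) if m.group(2) else start
        segments.append((start, end))
    return strand, segments

def is_multi_segment(location):
    """True when the location lists more than one segment."""
    return len(parse_segments(location)[1]) > 1
```

A CDS whose location gives is_multi_segment(...) == True is a candidate
spliced gene, subject to the caveat above that segments need not be exons.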
For partial coding regions, there is a qualifier /codon_start=1 (or 2 or 3)
which tells you the initial frame, i.e. which base of the feature is the
first base of the first complete codon.
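
As a rough sketch of how the segments and /codon_start fit together, the
function below (spliced_cds is a made-up name) joins the segments, reverse
complements if the feature is on the complementary strand, and then drops
the leading bases that precede the first complete codon. It assumes
1-based inclusive coordinates and a plain DNA string with no ambiguity
codes.

```python
def spliced_cds(sequence, segments, strand=1, codon_start=1):
    """Assemble a coding sequence from feature-table segments.

    sequence    -- the entry's DNA as a string
    segments    -- list of 1-based inclusive (start, end) pairs
    strand      -- 1 for the given strand, -1 for complement(...)
    codon_start -- 1, 2 or 3, as in the /codon_start qualifier
    """
    if codon_start not in (1, 2, 3):
        raise ValueError("codon_start must be 1, 2 or 3")
    # Convert each 1-based inclusive segment to a 0-based Python slice.
    cds = "".join(sequence[start - 1:end] for start, end in segments)
    if strand == -1:
        # Reverse complement the whole joined sequence.
        cds = cds[::-1].translate(str.maketrans("ACGTacgt", "TGCAtgca"))
    # Skip the bases before the first complete codon.
    return cds[codon_start - 1:]
```

For example, with sequence "CCATGGCGTAAGTTAAGG" and segments
[(3, 7), (13, 16)], this gives "ATGGCTTAA" for codon_start=1 and
"TGGCTTAA" for codon_start=2.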
One last warning: while all non-pseudogenes in GenBank translate,
not all have the splice junctions properly annotated -- they are
often off by 1 nt, with balancing errors in one exon's 3' end
and the next exon's 5' end. If you are doing an analysis which will be
sensitive to this, WATCH OUT!
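
One way to screen for the off-by-1 problem is sketched below. It leans on
an assumption of mine, not anything in the feature table: that introns
usually begin with GT and end with AG. That is only the common consensus,
so a miss is a flag for a closer look, not proof of an error. The name
check_junctions is made up, and the sketch handles forward-strand features
only.

```python
def check_junctions(sequence, segments):
    """Check each annotated intron for GT...AG, allowing a 1 nt shift.

    segments -- list of 1-based inclusive (start, end) pairs, forward strand.
    Returns a list of ((exon_end, next_exon_start), status), where status is
    "ok", "shift -1", "shift +1" or "no GT-AG within 1 nt".
    """
    seq = sequence.upper()
    report = []
    for (_, exon_end), (next_start, _) in zip(segments, segments[1:]):
        status = "no GT-AG within 1 nt"
        # A balanced error moves both boundaries of the intron the same way.
        for shift in (0, -1, 1):
            intron = seq[exon_end + shift:next_start - 1 + shift]
            if len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG"):
                status = "ok" if shift == 0 else "shift {:+d}".format(shift)
                break
        report.append(((exon_end, next_start), status))
    return report
```

With sequence "CCATGGCGTAAGTTAAGG", the segments [(3, 7), (13, 16)] are
reported as "ok", while [(3, 8), (14, 16)] come back as "shift -1",
meaning both boundaries look one base too far to the right.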
In other words, enjoy! -- but caveat emptor!
Department of Cellular and Developmental Biology
Department of Genetics / HHMI
robison at mito.harvard.edu