I am writing a program that parses the CDS keys of the FEATURE tables
of GenBank and EMBL entries.
I have discovered an inconsistent use of the "/codon_start=" qualifier in
several EMBL entries.
Page 55 of the manual "The DDBJ/EMBL/GenBank Feature Table: Definition"
(version 1.04) states that the "/codon_start=" qualifier can take values
of 1, 2, or 3 (indicating the relative offset of the reading frame from
the start nucleotide). However, in the EMBL entry shown in the example
below, the qualifier instead holds an absolute position in the sequence
(in each case, the first base of the coding region). (I believe this was
the meaning implied for this qualifier in version 1.03 of the feature
table definition.) I have not yet observed this problem in GenBank entries.
An inconsistency like this doubles the work of the programmer, whose
parser must test for both interpretations.
Is this something that will be corrected soon?
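To make the double test concrete, here is a minimal Python sketch of what a parser would have to do. The function name normalize_codon_start and its parameters are my own invention for illustration. The sketch assumes a simple start..end location (no join) and treats any value of 1, 2, or 3 as a relative offset, as version 1.04 specifies. That assumption is ambiguous for a feature that begins within the first three bases of the sequence.

```python
def normalize_codon_start(value, start, end, complement=False):
    """Return /codon_start as a relative offset (1, 2, or 3).

    value      -- integer taken from the /codon_start= qualifier
    start, end -- bounds of a simple CDS location (start <= end)
    complement -- True if the location is complement(start..end)
    """
    # Interpretation 1 (version 1.04): already a relative offset.
    if value in (1, 2, 3):
        return value

    # Interpretation 2: an absolute position in the sequence.
    if not start <= value <= end:
        raise ValueError(
            "codon_start %d lies outside %d..%d" % (value, start, end))

    # On the complementary strand the feature is read from 'end' downward.
    offset = (end - value + 1) if complement else (value - start + 1)
    if offset not in (1, 2, 3):
        raise ValueError(
            "codon_start %d is not within the first three bases" % value)
    return offset
```

With the values from the entry below, normalize_codon_start(233, 233, 1285) and normalize_codon_start(33, 1, 33, complement=True) both give an offset of 1.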
Example:
--------
[SSCYS] Synechococcus sp. cysA, 3' end, spbA, cysT, orf81, cysR, and cysW
genes, complete cds.
ID [LOC]        SSCYS standard; DNA; PRO; 4127 BP.
ACCESSION [ACC] M65247;
[ ... several lines omitted ... ]
FEATURES [FEA]
Key             Location/Qualifiers
CDS             complement(<1..33)
                /note="homologous to nucleotide binding
                polypeptides of other permease systems"
                /gene="cysA" /codon_start=33
CDS             233..1285
                /function="sulfate-binding protein" /gene="sbpA"
                /codon_start=233
CDS             1360..2196
                /note="integral membrane polypeptide of the
                sulfate permease" /gene="cysT" /codon_start=1360
CDS             2220..2465
                /label=orf81 /codon_start=2220
CDS             2493..3113
                /gene="cysR" /codon_start=2493
CDS             3159..4019
                /note="integral membrane polypeptide of the
                sulfate permease" /gene="cysW" /codon_start=3159
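
Entries like this one can be found by scanning the feature table for "/codon_start=" values outside 1..3. The Python sketch below is only an illustration: the function name suspect_codon_starts is hypothetical, and it assumes that each CDS location fits on the key line and that every "/codon_start=" qualifier belongs to the most recent CDS key (other feature keys are not tracked).

```python
import re

# A CDS key line, with or without the "FT" prefix used in EMBL flat files.
CDS_RE = re.compile(r"^\s*(?:FT\s+)?CDS\s+(\S+)")
CODON_START_RE = re.compile(r"/codon_start=(\d+)")


def suspect_codon_starts(lines):
    """Return (location, value) pairs whose codon_start is not 1, 2, or 3."""
    suspects = []
    location = None
    for line in lines:
        key = CDS_RE.match(line)
        if key:
            location = key.group(1)
        qualifier = CODON_START_RE.search(line)
        if qualifier and location is not None:
            value = int(qualifier.group(1))
            if value not in (1, 2, 3):
                suspects.append((location, value))
    return suspects
```

Run over the feature table above, this reports all six CDS features, for example ("233..1285", 233).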
--
Conrad Halling
c-halling at uchicago.edu