Dear GenBank Users,
Although this group isn't specifically intended for discussions about
an XML representation known as INSDSeq, enough GenBank users are
using INSDSeq XML that we feel some recent changes should be
announced here.
(INSD == International Nucleotide Sequence Database == the collaboration
among DDBJ, EMBL, and GenBank.)
INSDSeq is a collaborative XML DTD for sequence records that all three
members of the INSD support. The current version of the DTD (INSDSeq 1.4)
is still quite reminiscent of the GenBank, EMBL, and DDBJ flatfile
representations. However, additional structure is gradually being
introduced for various data elements, which we hope will prove useful for
XML users. The current DTD can be found at:
http://www.ncbi.nlm.nih.gov/data_specs/dtd/INSD_INSDSeq.dtd
http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.mod.dtd
Here is GenBank record M10101 in INSDSeq format:
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&list_uids=146274&dopt=gbc
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Several semantic changes in NCBI's generation of INSDSeq 1.4 have been
(or will soon be) made:
1.) The basepair abbreviations for nucleotide sequences in the
INSDSeq_sequence element have been switched from upper-case
to lower-case letters.
This change has already been implemented.
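Software that compares or indexes INSDSeq_sequence values as raw strings may be affected by the case switch. A minimal sketch of a case-insensitive comparison follows; the function name same_sequence is hypothetical and not part of any NCBI toolkit.

```python
def same_sequence(a, b):
    """Compare two nucleotide strings, ignoring letter case and
    leading/trailing whitespace."""
    return a.strip().lower() == b.strip().lower()


# An upper-case sequence from an older dump still matches the
# lower-case form now emitted in INSDSeq_sequence.
matches = same_sequence("ACGTTGCA", "acgttgca")
```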
2.) The INSDReference_reference element now contains *only* the serial
number of a reference.
Using M10101 as an example, the first reference is:
<INSDReference>
<INSDReference_reference>1</INSDReference_reference>
<INSDReference_position>1768..3531</INSDReference_position>
<INSDReference_authors>
<INSDAuthor>Tiedeman,A.A.</INSDAuthor>
<INSDAuthor>Smith,J.M.</INSDAuthor>
<INSDAuthor>Zalkin,H.</INSDAuthor>
</INSDReference_authors>
....
</INSDReference>
Previously, the basepair position was redundantly presented in
both the INSDReference_reference *and* INSDReference_position
elements.
This change has already been implemented.
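With the serial number and the basepair position now in separate elements, a reference can be read without any string splitting. The sketch below uses Python's standard xml.etree.ElementTree on a fragment modeled on the M10101 example above; the enclosing INSDReference start tag, the omission of the other child elements, and the function name summarize_reference are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Fragment modeled on the M10101 example; a real record carries
# additional child elements that are left out here.
SAMPLE = """
<INSDReference>
  <INSDReference_reference>1</INSDReference_reference>
  <INSDReference_position>1768..3531</INSDReference_position>
  <INSDReference_authors>
    <INSDAuthor>Tiedeman,A.A.</INSDAuthor>
    <INSDAuthor>Smith,J.M.</INSDAuthor>
    <INSDAuthor>Zalkin,H.</INSDAuthor>
  </INSDReference_authors>
</INSDReference>
"""


def summarize_reference(ref):
    """Return (serial number, position string or None, author list)."""
    # The reference element now holds only the serial number,
    # so it can be converted to an integer directly.
    serial = int(ref.findtext("INSDReference_reference").strip())
    # findtext returns None if the position element is absent.
    position = ref.findtext("INSDReference_position")
    authors = [a.text for a in ref.findall("INSDReference_authors/INSDAuthor")]
    return serial, position, authors


summary = summarize_reference(ET.fromstring(SAMPLE.strip()))
```

Note that the int() conversion would raise ValueError on records generated before this change, where the same element also carried the position.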
3.) A tilde character ( ~ ) within INSDSeq_comment #PCDATA values will
soon be used to indicate a linebreak.
A doubled tilde ( ~~ ) should be interpreted as a literal,
single tilde character.
The need for such a convention can be seen by examining the format
of the COMMENT section in the GenBank flatfile representation of
GenBank record AC183761:
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=95147495
The semi-structured paragraph-oriented nature of the COMMENT can
be reproduced by XML rendering software with the adoption of
tilde as a linebreak character.
This convention for tilde has been in use for ASN.1 data provided
by NCBI for many years, so its use in INSDSeq seems warranted.
We expect that this change will be implemented by October 15, 2006
or earlier.
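For rendering software, the tilde convention described above could be decoded roughly as follows. This is a sketch, not NCBI code: the function name decode_comment is hypothetical, and reading runs of tildes left to right (so that three tildes become a literal tilde followed by a linebreak) is an assumption, since the announcement does not say how odd-length runs are to be treated.

```python
import re


def decode_comment(text):
    """Decode the tilde convention in an INSDSeq_comment value.

    '~~' becomes a literal '~'; a single '~' becomes a linebreak.
    The alternation tries '~~' first, so pairs are consumed left to right.
    """
    return re.sub(
        r"~~|~",
        lambda m: "~" if m.group(0) == "~~" else "\n",
        text,
    )


decoded = decode_comment("First line~Second line with a ~~ in it")
```

Doing the substitution in a single pass avoids the trap of replacing every tilde with a linebreak first and thereby losing the escaped ones.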
4.) INSDSeq_strandedness will soon be populated for all sequences.
Currently, double-stranded DNA and single-stranded RNA sequences
are presented without any INSDSeq_strandedness element. Only when
the strandedness is something *other* than the defaults which are
appropriate for DNA/RNA is INSDSeq_strandedness provided.
This practice will change by October 15, 2006, such that a strandedness
value is always presented, for all sequences.
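Until the element is always present, a parser has to supply the default itself. The sketch below prefers an explicit INSDSeq_strandedness value and otherwise falls back on the molecule type. The function name, the use of INSDSeq_moltype as the source of the molecule type, and the value strings "single" and "double" are assumptions that should be checked against the DTD and real records.

```python
import xml.etree.ElementTree as ET


def strandedness(seq):
    """Return the strandedness of an INSDSeq element.

    An explicit INSDSeq_strandedness value wins. Otherwise the default
    described in the announcement is applied: double-stranded for DNA,
    single-stranded for RNA. Returns None if neither can be determined.
    """
    explicit = seq.findtext("INSDSeq_strandedness")
    if explicit is not None and explicit.strip():
        return explicit.strip()
    # Assumed: the molecule type is in INSDSeq_moltype, with values
    # such as "DNA" or "mRNA".
    moltype = (seq.findtext("INSDSeq_moltype") or "").upper()
    if "RNA" in moltype:
        return "single"  # assumed value string
    if "DNA" in moltype:
        return "double"  # assumed value string
    return None


record = ET.fromstring(
    "<INSDSeq><INSDSeq_moltype>DNA</INSDSeq_moltype></INSDSeq>"
)
result = strandedness(record)
```

Once strandedness is populated for all sequences, the fallback branch should simply never be reached, so code written this way keeps working across the change.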
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
BTW: If you are a developer of software tools, and if the INSDSeq XML
representation is of interest to you (as opposed to the full-blown XML
equivalents of NCBI's ASN.1 specifications), we would like to hear from
you! Please send your suggestions for INSDSeq changes to the NCBI Service
Desk:
info at ncbi.nlm.nih.gov
Mark Cavanaugh
GenBank
NCBI/NLM/NIH/HHS