[Genbank-bb] GenBank Release 221.0 Problem : 21 records with illegally line-wrapped text qualifiers in the Feature Table

Cavanaugh, Mark (NIH/NLM/NCBI) [E] via genbankb@net.bio.net (by cavanaug from ncbi.nlm.nih.gov)
Fri Sep 1 13:54:32 EST 2017


Greetings GenBank Users,

Staff from Chemical Abstracts Service (CAS) detected invalid flatfile
formatting in one of the GenBank 221.0 release files and reported it to NCBI
on Friday, August 25th.

NCBI confirmed the problem and isolated a test case. CAS ultimately found that
five of the release files were affected:

	gbhtg86.seq.gz
	gbpri2.seq.gz
	gbpri24.seq.gz
	gbpri3.seq.gz
	gbpri56.seq.gz

NCBI identified a total of 21 affected records among these five files.
Accession lists are provided below.

The formatting problem is the presence of completely empty/null lines within
the Feature Table section. They are caused by text qualifiers whose values
include one or more lines of whitespace, intended to make the text more
readable. Here is an example:

LOCUS       AC202819               70919 bp    DNA     linear   HTG 27-JUN-2008
DEFINITION  Gossypium hirsutum chromosome UNKNOWN clone ZMMBBb-510L8, ***
            SEQUENCING IN PROGRESS ***, 21 unordered pieces.
ACCESSION   AC202819
VERSION     AC202819.1
....
     misc_feature    1..4949
                     /note=";
                     This clone was previously submitted as Zea mays as part of
                     the Maize sequencing project. It appears that the original
                     maize BAC library (ZMMBBb) is contaminated with Gossypium
                     hirsutum cv Maaxacotton. During the mapping project, these
                     clones clustered together to form small unanchored mapping
                     contigs and clones from these contigs were chosen for
                     sequencing. In order to make the best use of this data the
                     clones have been reclassified as cotton and left in the
                     public domain for potential use.
                     
                      assembly_name:Contig100
                     
                     SOURCE INFORMATION:
                     The ZMMBBb Corn BAC Library was constructed by Jeff
                     Tomkins at Clemson University Genomics Institute from Zea
                     mays cultivar B73. For more information about this library
                     or to obtain a clone, please refer to the online ordering
                     system at the CUGI BAC/EST Resource Center
                     (https://www.genome.clemson.edu)."

Because the blank lines lie within the body of a line-wrapped value for
the /note qualifier, they should *not* actually be completely empty. Rather,
they should consist of 21 space characters, followed by a newline:

                     public domain for potential use.
                     
^^^^^^^^^^^^^^^^^^^^^
                      assembly_name:Contig100

These leading spaces were erroneously trimmed, and the result technically
breaks the GenBank flatfile specification. Most likely, only those who perform
a fairly deep parse of the flatfile structure will be impacted by this error.
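For those who keep local copies of the unpatched files, the repair can be
sketched as follows. This is a minimal illustration, not the patch NCBI
applied: the function name pad_blank_feature_lines is hypothetical, and it
assumes that the feature table starts at the FEATURES line and ends at the
next line with a keyword in column 1 (for example ORIGIN, CONTIG or //).

```python
QUALIFIER_INDENT = " " * 21  # qualifier text begins in column 22


def pad_blank_feature_lines(lines):
    """Yield the input lines, restoring the 21-space indent on
    whitespace-only lines that fall inside a feature table.

    Assumption: any line with a non-space character in column 1
    either opens the feature table (FEATURES) or closes it.
    """
    in_features = False
    for line in lines:
        text = line.rstrip("\r\n")
        ending = line[len(text):]  # keep the original line terminator
        if text[:1].strip():
            # Keyword in column 1: only FEATURES opens the table.
            in_features = text.startswith("FEATURES")
        elif in_features and not text.strip():
            # Blank line inside a wrapped qualifier value.
            line = QUALIFIER_INDENT + ending
        yield line
```

Blank lines outside the feature table are passed through unchanged, so the
release-file header and record separators are not touched.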

The five GenBank release files were patched on Friday, Sept 1, 2017 and then
installed on the FTP site:

-rw-r--r-- 1 gbupdate giprog  63705600 Sep  1 13:46 gbpri56.seq.gz
-rw-r--r-- 1 gbupdate giprog  80114309 Sep  1 13:46 gbpri3.seq.gz
-rw-r--r-- 1 gbupdate giprog  19750385 Sep  1 13:46 gbpri24.seq.gz
-rw-r--r-- 1 gbupdate giprog  65577608 Sep  1 13:46 gbpri2.seq.gz
-rw-r--r-- 1 gbupdate giprog  83466269 Sep  1 13:46 gbhtg86.seq.gz

In addition to fixing the flatfile generator bug, we have implemented
stricter flatfile parse-checks during GenBank Release processing, which
will prevent similar problems in the future.
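A check of this kind might look like the sketch below. It is an assumption
about what such a test could do, not the parse-check used in GenBank Release
processing, and the function name find_empty_feature_lines is hypothetical.
It reports each feature-table line that is blank and shorter than the
21-column qualifier indent, together with the accession of its record.

```python
def find_empty_feature_lines(lines):
    """Return (accession, line_number) pairs for blank feature-table
    lines that are shorter than the 21-space qualifier indent.

    Assumption: the ACCESSION line precedes FEATURES in each record,
    and a keyword in column 1 opens or closes the feature table.
    """
    problems = []
    accession = None
    in_features = False
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if text[:1].strip():
            # Keyword line: track the table state and the accession.
            in_features = text.startswith("FEATURES")
            if text.startswith("ACCESSION"):
                parts = text.split()
                accession = parts[1] if len(parts) > 1 else None
            continue
        if in_features and not text.strip() and len(text) < 21:
            problems.append((accession, number))
    return problems
```

For a compressed release file such as gbhtg86.seq.gz, the lines could be
supplied by opening a local copy with Python's gzip.open(path, "rt").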

We would like to thank our customers at Chemical Abstracts Service for
alerting us to this problem. We appreciate the scrutiny of our products
that GenBank users provide, and we welcome error reports.

Mark Cavanaugh
GenBank
NCBI/NLM/NIH/HHS

htg86:

AC202819
AC202820
AC202822
AC202823
AC202824
AC202825
AC202827
AC202829
AC202830
AC202831

pri2:

AC002116
AC003107
AC004143
AC004151
AC004472
AC004602
AC004770

pri3:

AC005175
AC005306

pri24:

AF037222

pri56:

U95626


