[Genbank-bb] GenBank Release 186.0 Available : October 15 2011

Cavanaugh, Mark (NIH/NLM/NCBI) [E] via genbankb@net.bio.net (by cavanaug@ncbi.nlm.nih.gov)
Sat Oct 15 14:10:14 EST 2011


Greetings GenBank Users,

  GenBank Release 186.0 is now available via FTP from the
National Center for Biotechnology Information (NCBI):

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 186.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 186.0

  Close-of-data for GenBank 186.0 occurred on 10/13/2011. Uncompressed,
the Release 186.0 flatfiles require roughly 518 GB (sequence files only)
or 557 GB (including the 'short directory', 'index' and the *.txt
files). The ASN.1 data require approximately 426 GB.

Recent statistics for non-WGS, non-CON sequences:

  Release  Date      Base Pairs    Entries

  185      Aug 2011  130671233801  142284608
  186      Oct 2011  132067413372  144458648

Recent statistics for WGS sequences:

  Release  Date      Base Pairs    Entries

  185      Aug 2011  208315831132   64997137
  186      Oct 2011  218666368056   68330215

  During the 60 days between the close dates for GenBank Releases 185.0
and 186.0, the non-WGS/non-CON portion of GenBank grew by 1,396,179,571
basepairs and by 2,174,040 sequence records. During that same period,
789,185 records were updated. An average of 49,387 non-WGS/non-CON
records were added and/or updated per day.

  Between releases 185.0 and 186.0, the WGS component of GenBank grew by
10,350,536,924 basepairs and by 3,333,078 sequence records.
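
  The growth figures quoted above can be re-derived from the two
statistics tables. The following Python sketch is offered only as an
illustration: the variable and function names are assumptions chosen
for this example, and the numbers are copied from this announcement.

```python
# Figures copied from the two statistics tables in this announcement:
# release number -> (base pairs, entries).
non_wgs = {185: (130671233801, 142284608), 186: (132067413372, 144458648)}
wgs = {185: (208315831132, 64997137), 186: (218666368056, 68330215)}


def growth(table, old=185, new=186):
    """Return (basepair growth, record growth) between two releases."""
    return (table[new][0] - table[old][0], table[new][1] - table[old][1])


non_wgs_bp, non_wgs_records = growth(non_wgs)
wgs_bp, wgs_records = growth(wgs)

updated_records = 789185  # non-WGS/non-CON records updated, per the text
days_between = 60         # days between the two close dates, per the text

# Average number of records added and/or updated per day, rounded.
per_day = round((non_wgs_records + updated_records) / days_between)

print(non_wgs_bp, non_wgs_records)
print(wgs_bp, wgs_records)
print(per_day)
```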

  For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 186.0 and Upcoming Changes) have been appended
below for your convenience.

                ** Important Notes **

*  GenBank 'index' files are now provided without any EST content, and
   without most GSS content. See Section 1.3.4 of the release notes for
   further details.

   NCBI is considering ceasing support for the index files, so we
   encourage affected users to review that section and provide feedback.

  Release 186.0 data, and subsequent updates, are available now via
NCBI's Entrez and Blast services.

  As a general guideline, we suggest first transferring the GenBank
release notes (gbrel.txt) whenever a release is being obtained. Check
to make sure that the date and release number in the header of the
release notes are current (e.g., October 15 2011, 186.0). If they are
not, interrupt the remaining transfers and then request assistance from
the NCBI Service Desk.

  A comprehensive check of the headers of all release files after your
transfers are complete is also suggested. Here's how one might go about
this on a Unix or Linux platform, using csh/tcsh :

        set files = `ls gb*.*`
        foreach i ($files)
                head -10 $i | grep Release
        end

Or, if the files are compressed, perhaps:

        gzcat $i | head -10 | grep Release
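
  For sites that prefer a script to an interactive shell, the same
check might be sketched in Python. This is only an illustration, not an
NCBI-supplied tool: the function names, the "gb*.*" pattern and the
assumption that compressed files end in ".gz" are choices made for this
example.

```python
import glob
import gzip


def release_lines(path, max_lines=10):
    """Return the lines mentioning 'Release' among the first max_lines."""
    # Assumption: compressed release files carry a ".gz" suffix.
    opener = gzip.open if path.endswith(".gz") else open
    hits = []
    with opener(path, "rt", encoding="ascii", errors="replace") as handle:
        for line_number, line in enumerate(handle):
            if line_number >= max_lines:
                break
            if "Release" in line:
                hits.append(line.strip())
    return hits


def check_release(pattern="gb*.*", expected="186.0"):
    """Map each matching file to True if its header names the release."""
    return {
        path: any(expected in line for line in release_lines(path))
        for path in sorted(glob.glob(pattern))
    }
```

Calling check_release() in the directory that holds the release files
returns a dictionary; any file mapped to False deserves a closer look.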

  If you encounter problems while ftp'ing or uncompressing Release
186.0, please send email outlining your difficulties to:

        info@ncbi.nlm.nih.gov

Mark Cavanaugh, Michael Kimelman, Ilya Dondoshansky, Sergey Zhdanov
GenBank
NCBI/NLM/NIH/HHS


1.3 Important Changes in Release 186.0

1.3.1 Organizational changes

The total number of sequence data files increased by 23 with this release:

  - the BCT division is now composed of  77 files (+2)
  - the CON division is now composed of 152 files (+3)
  - the ENV division is now composed of  43 files (+1)
  - the EST division is now composed of 450 files (+3)
  - the HTC division is now composed of  14 files (-1)
  - the HTG division is now composed of 136 files (+1)
  - the INV division is now composed of  30 files (-1)
  - the PAT division is now composed of 171 files (+3)
  - the PLN division is now composed of  51 files (+1)
  - the PRI division is now composed of  44 files (+2)
  - the TSA division is now composed of  43 files (+8)
  - the VRL division is now composed of  19 files (+1)

The number of HTC division files decreased by one because nearly 37,000
high-throughput sequences for human cDNA clones were removed at the
submitter's request.

The number of INV division files decreased by one because WGS project CABG01
has replaced approximately 19,000 (non-WGS) records for the Schistosoma mansoni
genome.

The total number of 'index' files increased by 2 with this release:

  - the AUT (author name) index is now composed of 91 files (+2)

1.3.2 New centromere and telomere features

  Telomeres and centromeres are essential features of chromosomes and
disrupting their structure affects the viability and life span of an
organism. Centromeric sequence varies from a compact, non-repetitive
region of less than 150 base pairs in S. cerevisiae to a highly
repetitive and complex region of several hundred thousand base pairs in
other eukaryote genomes. The sequence at the telomeric ends is unique
compared to the rest of the chromosome and protects the chromosome ends
from recombination, fusion to other chromosomes or degradation by
nucleases.
Until now, telomere and centromere features may have been under-annotated
because there were no specific feature keys for them. The INSDC has
therefore approved the creation of two new features, which are legal as
of GenBank Release 186.0 (October 15 2011).

  These two features are intended for use only when the centromere or
telomere has actually been sequenced.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Feature Key          centromere

Definition           region of biological interest identified as a centromere
                     and which has been experimentally characterized

Optional qualifiers  /centromere_type=<centromere_type>
                     /citation=[number]
                     /db_xref="<database>:<identifier>"
                     /experiment="[CATEGORY:]text"
                     /inference="[CATEGORY:]TYPE[ (same species)][:EVIDENCE_BASIS]"
                     /note="text"
                     /standard_name="text"

Comment              the centromere feature describes the interval of DNA
                     that corresponds to a region where chromatids are held
                     and a kinetochore is formed

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Qualifier            /centromere_type=

Definition           type of centromere

Value format         point, regional

Example              /centromere_type=point

Comment              the values are case-insensitive, i.e. both "POINT" and
                     "point" are valid;

                     Definitions of the values:

                     regional : the DNA sequence consists of large arrays of
                     repetitive DNA, where the sequence of individual repeat
                     elements is similar but not identical

                     point : consists of well-defined and conserved DNA
                     sequences that are sufficient to confirm centromere
                     identity and function

Important Note: The necessity of a /centromere_type qualifier is still under
discussion. If it is determined that the type can *always* be inferred based
on the organism alone (e.g., 'point' for S. cerevisiae and 5 other budding yeasts,
'regional' for all other eukaryotes) then /centromere_type will not be
implemented.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Feature Key          telomere

Definition           region of biological interest identified as a telomere
                     and which has been experimentally characterized

Optional qualifiers  /citation=[number]
                     /db_xref="<database>:<identifier>"
                     /experiment="[CATEGORY:]text"
                     /inference="[CATEGORY:]TYPE[ (same species)][:EVIDENCE_BASIS]"
                     /note="text"
                     /rpt_type=<repeat_type>
                     /rpt_unit_range=<base_range>
                     /rpt_unit_seq="text"
                     /standard_name="text"

Comment              the telomere feature describes the interval of DNA
                     that corresponds to a specific structure at the end of
                     the linear eukaryotic chromosome which is required for
                     the integrity and maintenance of the end; this region
                     is unique compared to the rest of the chromosome and
                     represents the physical end of the chromosome

1.3.3 New assembly_gap feature, and /gap_type and /linkage_evidence qualifiers

  Complete genomes are often submitted to the INSDC via a small (or large)
set of independent sequence records, which can be assembled into chromosomes
and/or scaffolds. The CON-division records representing these scaffolds
and chromosomes are usually built using information from "AGP files"
provided by the submitter. See:

   http://www.ncbi.nlm.nih.gov/genome/assembly/agp/AGP_Update.shtml **

  The AGP 2.0 specification includes provisions for a variety of different
gap types, as well as information about whether a gap between two
scaffold or chromosome components is an unspanned gap or a spanned gap.
There are also biological gap types: telomere, centromere and repeat.
AGP 2.0 also supports terminology to describe the type of evidence used
to establish the linkage connecting the components on either side of a
spanned gap within a scaffold or chromosome. Until now, however, there
has been no mechanism to represent any of this information in the
Feature Table.

  To address this, the INSDC has decided to implement an assembly_gap
feature, and /gap_type and /linkage_evidence qualifiers, all of which
are now legal as of this October 15 2011 GenBank Release.

  The new centromere and telomere features (see Section 1.3.2) should
only be used when the actual sequence of a centromere/telomere has been
determined. If this is not the case, then an assembly_gap feature with
a /gap_type of "centromere" or "telomere" should be used instead to
indicate the locations of centromeres and telomeres within a genome
assembly.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Feature Key          assembly_gap

Definition           gap between two components of a CON-division record that
                     is part of a genome assembly

Mandatory qualifiers /estimated_length=unknown or <integer>
                     /gap_type="TYPE"
                     /linkage_evidence="TYPE" (Note: Mandatory only if the
                       /gap_type is "within scaffold" or "repeat within scaffold".
                       For all other types of assembly_gap features, use of the
                       /linkage_evidence qualifier is invalid.)

Comment              the location span of the assembly_gap feature for an unknown
                     gap is 100 bp, with the 100 bp indicated as 100 "n"'s in the
                     sequence. Where estimated length is indicated by an integer,
                     this is indicated by the same number of "n"'s in the sequence.
                     No upper or lower limit is set on the size of the gap.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Qualifier            /gap_type=TYPE

Definition           kind of gap connecting components in records of a genome
                     assembly, or the kind of biological gap in a record that
                     is part of a genome assembly

Value format         "between scaffolds", "within scaffold", "telomere",
                     "centromere", "short arm", "heterochromatin",
                     "repeat within scaffold", "repeat between scaffolds"

Example              /gap_type="between scaffolds"
                     /gap_type="within scaffold"

Comment              This qualifier is used only for assembly_gap features and
                     its values are controlled by the AGP Specification version 2.0:
                     http://www.ncbi.nlm.nih.gov/genome/assembly/agp/AGP_Update.shtml **

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Qualifier            /linkage_evidence=TYPE

Definition           type of evidence establishing linkage across an assembly_gap.
                     Only allowed to be used with assembly_gap features that have
                     a /gap_type value of "within scaffold" or "repeat within scaffold"

Value format         "paired-ends", "align genus", "align xgenus", "align trnscpt",
                     "within clone", "clone contig", "map", "strobe", "unspecified"

Example              /linkage_evidence="paired-ends"
                     /linkage_evidence="within clone"

Comment              This qualifier is used only for assembly_gap features and its
                     values are controlled by the AGP Specification version 2.0:
                     http://www.ncbi.nlm.nih.gov/genome/assembly/agp/AGP_Update.shtml **

** The current URL for the AGP 2.0 specification is temporary. When a permanent link
   becomes available, the Feature Table will be updated accordingly.

  Although these new features and qualifiers are allowed as of Oct 15 2011,
it is unlikely that they will begin to appear on sequence records before
December 1st. However, users should begin preparing for them now, to prevent
processing problems later.
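
  As one way of preparing, the qualifier rules described above for
assembly_gap features can be written down as a small check. The Python
sketch below is an illustration only: it assumes that a parser (not
shown) has already turned a feature's qualifiers into a dictionary with
the quotes removed, and the function name assembly_gap_problems() is
hypothetical. The allowed values are copied from the definitions above.

```python
# Allowed values, copied from the /gap_type and /linkage_evidence
# definitions in this section.
GAP_TYPES = {
    "between scaffolds", "within scaffold", "telomere", "centromere",
    "short arm", "heterochromatin", "repeat within scaffold",
    "repeat between scaffolds",
}
LINKAGE_EVIDENCE = {
    "paired-ends", "align genus", "align xgenus", "align trnscpt",
    "within clone", "clone contig", "map", "strobe", "unspecified",
}
# /linkage_evidence is mandatory for these gap types, invalid otherwise.
NEEDS_LINKAGE = {"within scaffold", "repeat within scaffold"}


def assembly_gap_problems(qualifiers):
    """Return a list of problems found in a dict of qualifiers."""
    problems = []

    length = qualifiers.get("estimated_length")
    if length is None:
        problems.append("missing /estimated_length")
    elif length != "unknown" and not str(length).isdigit():
        problems.append("/estimated_length must be unknown or an integer")

    gap_type = qualifiers.get("gap_type")
    if gap_type not in GAP_TYPES:
        problems.append("missing or unrecognized /gap_type")

    evidence = qualifiers.get("linkage_evidence")
    if gap_type in NEEDS_LINKAGE:
        if evidence not in LINKAGE_EVIDENCE:
            problems.append("missing or unrecognized /linkage_evidence")
    elif evidence is not None:
        problems.append("/linkage_evidence not allowed for this /gap_type")

    return problems
```

An empty list means that the qualifiers satisfy the rules as stated in
this section; the check does not examine the feature location or the
run of "n"'s in the sequence.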

1.3.4 Changes in the content of index files

  As described in the GB 153 release notes, the 'index' files which accompany
GenBank releases (see Section 3.3) are considered to be a legacy data product by
NCBI, generated mostly for historical reasons. FTP statistics from January 2005
seemed to support this: the index files were transferred only half as frequently as
the files of sequence records. The inherent inefficiencies of the index file
format also lead us to suspect that they have little serious use by the user
community, particularly for EST and GSS records.

  The software that generated the index file products received little
attention over the years, and finally reached its limitations in
February 2006 (Release 152.0). The required multi-server queries which
obtained and sorted many millions of rows of terms from several different
databases simply outgrew the capacity of the hardware used for GenBank
Release generation.

  Our short-term solution is to cease generating some index-file content
for all EST sequence records, and for GSS sequence records that originate
via direct submission to NCBI.

  The three gbacc*.idx index files continue to reflect the entirety of the
release, including all EST and GSS records; however, the file contents are
unsorted.

  These 'solutions' are really just stop-gaps, and we will likely pursue
one of two options:

a) Cease support of the 'index' file products altogether.

b) Provide new products that present some of the most useful data from
   the legacy 'index' files, and cease support for other types of index data.

  If you are a user of the 'index' files associated with GenBank releases, we
encourage you to make your wishes known, either via the GenBank newsgroup,
or via email to NCBI's Service Desk:

   info@ncbi.nlm.nih.gov

  Our apologies for any inconvenience that these changes may cause.

1.3.5 GSS File Header Problem

  GSS sequences at GenBank are maintained in two different systems, depending
on their origin, and the dumps from those systems occur in parallel. Because
the second dump (for example) has no prior knowledge of exactly how many GSS
files will be dumped by the first, it does not know how to number its own
output files.

  There is thus a discrepancy between the filenames and file headers for
103 of the GSS flatfiles in Release 186.0. Consider gbgss146.seq :

GBGSS1.SEQ          Genetic Sequence Data Bank
                         October 15 2011

                NCBI-GenBank Flat File Release 186.0

                           GSS Sequences (Part 1)

   87119 loci,    64001369 bases, from    87119 reported sequences

  Here, the file number and part number in the header are both "1", though
the file has been renamed as "146" based on the number of files dumped from
the other system. We hope to resolve this discrepancy at some point, but
its priority is much lower than that of many other tasks.
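
  Users whose software compares filenames with file headers may wish to
identify the affected files in advance. The Python sketch below is an
illustration only: the function name header_mismatch() is hypothetical,
and it assumes that the first line of each flatfile header begins with
the file's name, as in the gbgss146.seq example above.

```python
import os
import re


def header_mismatch(path, header_first_line):
    """Compare a flatfile's name with the name on its first header line.

    Returns None when the two agree, otherwise a tuple of
    (filename, name found in the header), both in lower case.
    """
    filename = os.path.basename(path).lower()
    # Assumption: compressed copies simply add a ".gz" suffix.
    if filename.endswith(".gz"):
        filename = filename[:-3]
    match = re.match(r"\s*(\S+)", header_first_line)
    header_name = match.group(1).lower() if match else ""
    return None if filename == header_name else (filename, header_name)
```

For the example above, the function reports the pair "gbgss146.seq" and
"gbgss1.seq"; for a file whose header agrees with its name it returns
None.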

1.4 Upcoming Changes

  There are no changes scheduled for implementation in the February 2012
GenBank Release.





