
[Genbank-bb] GenBank Release 223.0 Available : December 19 2017

Cavanaugh, Mark (NIH/NLM/NCBI) [E] via genbankb@net.bio.net (by cavanaug from ncbi.nlm.nih.gov)
Tue Dec 19 19:06:09 EST 2017


Greetings GenBank Users,

  GenBank Release 223.0 is now available via FTP from the National Center
for Biotechnology Information (NCBI):

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 223.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 223.0

  Close-of-data for GenBank 223.0 occurred on 12/15/2017. Uncompressed,
the Release 223.0 flatfiles require roughly 862 GB (sequence files only).
The ASN.1 data require approximately 712 GB.

Recent statistics for 'traditional' sequences (including non-bulk-oriented
TSA, and excluding WGS, bulk-oriented TSA, TLS, and the CON-division):

  Release  Date      Base Pairs    Entries

  222      Oct 2017  244914705468  203953682
  223      Dec 2017  249722163594  206293625

Recent statistics for WGS sequencing projects:

  Release  Date      Base Pairs     Entries

  222      Oct 2017  2318156361999  508825331
  223      Dec 2017  2466098053327  551063065
  
Recent statistics for bulk-oriented TSA sequencing projects:

  Release  Date      Base Pairs    Entries

  222      Oct 2017  172909268535  192754804
  223      Dec 2017  181394660188  201559502
  
Recent statistics for bulk-oriented TLS sequencing projects:

  Release  Date      Base Pairs  Entries

  222      Oct 2017  2993818315  9479460
  223      Dec 2017  4458042616  12695198
  
During the 62 days between the close dates for GenBank Releases 222.0
and 223.0, the 'traditional' portion of GenBank grew by 4,807,458,126
basepairs and by 2,339,943 sequence records. During that same period,
112,692 records were updated. An average of 39,559 'traditional' records
were added and/or updated per day.
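
  The growth and per-day figures above follow directly from the
'traditional' statistics table: records added plus records updated,
divided by the 62 days between close dates. A short Python check of that
arithmetic, using only numbers quoted in this announcement (the variable
names are illustrative, not part of any NCBI tool):

```python
# Figures copied from the 'traditional' statistics table and the text above.
entries_222 = 203_953_682
entries_223 = 206_293_625
bases_222 = 244_914_705_468
bases_223 = 249_722_163_594
updated_records = 112_692
days_between_closes = 62

added_records = entries_223 - entries_222
added_bases = bases_223 - bases_222
per_day = (added_records + updated_records) / days_between_closes

print(f"{added_bases:,} basepairs added")        # 4,807,458,126
print(f"{added_records:,} records added")        # 2,339,943
print(f"{round(per_day):,} records per day")     # 39,559
```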

  Between releases 222.0 and 223.0, the WGS component of GenBank grew by
147,941,691,328 basepairs and by 42,237,734 sequence records.

  Between releases 222.0 and 223.0, the TSA component of GenBank grew by
8,485,391,653 basepairs and by 8,804,698 sequence records.

  Between releases 222.0 and 223.0, the TLS component of GenBank grew by
1,464,224,301 basepairs and by 3,215,738 sequence records.

  For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 223.0 and Upcoming Changes) have been appended
below for your convenience.

  Release 223.0 data, and subsequent updates, are available now via
NCBI's Entrez and Blast services.

  As a general guideline, we suggest first transferring the GenBank
release notes (gbrel.txt) whenever a release is being obtained. Check
to make sure that the date and release number in the header of the
release notes are current (e.g., December 15 2017, 223.0). If they are
not, interrupt the remaining transfers and then request assistance from
the NCBI Service Desk.

  A comprehensive check of the headers of all release files after your
transfers are complete is also suggested. Here's how one might go about
this on a Unix or Linux platform, using csh/tcsh :

        set files = `ls gb*.*`
        foreach i ($files)
                head -10 $i | grep Release
        end

Or, if the files are compressed, perhaps replace the loop body with (zcat or
gzip -dc can stand in for gzcat where it is not installed):

        gzcat $i | head -10 | grep Release
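
  For those who would rather not use csh, the same check can be sketched
in Python. This is only an illustration, not an NCBI-supplied script: the
function names and the default "gb*.*" pattern are assumptions carried over
from the csh example, and the test for "Release 223.0" in the first ten
lines mirrors the head/grep pipeline above.

```python
import glob
import gzip
from itertools import islice


def header_lines(path, count=10):
    """Return the first `count` lines of a release file, gzip-compressed or not."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="ascii", errors="replace") as handle:
        return [line.rstrip("\n") for line in islice(handle, count)]


def files_with_wrong_release(pattern="gb*.*", release="223.0"):
    """List files whose first ten lines do not mention the expected release."""
    wanted = f"Release {release}"
    bad = []
    for path in sorted(glob.glob(pattern)):
        if not any(wanted in line for line in header_lines(path)):
            bad.append(path)
    return bad


# Example use, run from the directory holding the transferred files:
#     for path in files_with_wrong_release():
#         print("unexpected header:", path)
```

  Unlike the grep version, this reports only the files that fail the check.
Any file matched by the pattern that has no flatfile-style header at all
will also be reported, so narrow the pattern if needed.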

  If you encounter problems while ftp'ing or uncompressing Release
223.0, please send email outlining your difficulties to:

        info from ncbi.nlm.nih.gov

Mark Cavanaugh, Michael Kimelman, Ilya Dondoshansky
GenBank
NCBI/NLM/NIH/HHS


1.3 Important Changes in Release 223.0

1.3.1 Organizational changes

  The total number of sequence data files increased by 60 with this release:

  - the BCT division is now composed of 428 files (+21)
  - the CON division is now composed of 363 files (+3)
  - the ENV division is now composed of 100 files (+1)
  - the EST division is now composed of 485 files (+3)
  - the INV division is now composed of 159 files (+2)
  - the PAT division is now composed of 320 files (+19)
  - the PLN division is now composed of 166 files (+9)
  - the PRI division is now composed of  58 files (+1)
  - the VRL division is now composed of  51 files (+1)

1.3.2 Records missing from catalog files

  A problem during generation of the GenBank 223.0 catalog files caused
16,855 records to be excluded from the "other" (non-EST, non-GSS) catalogs.
Because the problem will take at least a week to resolve, we decided to
provide a supplemental catalog file for the records in question:

        gb223.catalog.suppl.txt.gz

  This file lacks the NCBI Taxonomy ID which normally appears in column
seven. Also, note that we could not provide lists of PubMed IDs and
Gene Symbols for these records in the time available, so there are no
accompanying "pmid_list" or "gene_list" files for the catalog supplement.

  Our apologies for any inconvenience that this might cause.

1.3.3 GSS File Header Problem

  GSS sequences at GenBank are maintained in two different systems, depending
on their origin, and the dumps from those systems occur in parallel. Because
the second dump (for example) has no prior knowledge of exactly how many GSS
files will be dumped by the first, it does not know how to number its own
output files.

  There is thus a discrepancy between the filenames and file headers for 130
of the GSS flatfiles in Release 223.0. Consider gbgss175.seq :

GBGSS1.SEQ          Genetic Sequence Data Bank
                         December 15 2017

                NCBI-GenBank Flat File Release 223.0

                           GSS Sequences (Part 1)

   87375 loci,    64103840 bases, from    87375 reported sequences
   
  Here, the filename and part number in the header are "1", though the file
has been renamed as "175" based on the number of files dumped from the other
system. Files gbgss175.seq.gz through gbgss304.seq.gz are affected. We hope
to resolve this discrepancy at some point, but the priority is certainly
much lower than many other tasks.
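
  Anyone validating headers against filenames may want to allow for this
offset. A minimal Python sketch follows, assuming that the header part
numbers of the affected files run consecutively from 1 (so gbgss175 is
Part 1 and gbgss304 is Part 130) and that all other GSS files carry their
own file number. Only the gbgss175 case is shown in these notes; the rest
is inferred, and the function names are hypothetical.

```python
FIRST_AFFECTED = 175  # gbgss175.seq carries the header "GBGSS1.SEQ ... (Part 1)"
LAST_AFFECTED = 304   # last affected file named in the release notes


def header_part_for(file_number):
    """Return the part number expected in the header of gbgss<file_number>.seq.

    Assumption: header parts are consecutive across the affected files.
    """
    if FIRST_AFFECTED <= file_number <= LAST_AFFECTED:
        return file_number - FIRST_AFFECTED + 1
    return file_number


def header_name_for(file_number):
    """Return the filename expected on the first header line, e.g. GBGSS1.SEQ."""
    return f"GBGSS{header_part_for(file_number)}.SEQ"


print(header_name_for(175))  # GBGSS1.SEQ
print(header_part_for(304))  # 130
```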

1.4 Upcoming Changes

1.4.1 New source feature qualifier : /submitter_seqid

  Data submitters typically have their own identifiers for genomic contigs and
scaffolds of Whole Genome Shotgun (WGS) sequencing projects, the RNA sequences
of Transcriptome Shotgun Assembly (TSA) sequencing projects, and the genomic
loci of Targeted Locus Study (TLS) sequencing projects. These identifiers can
be very simple (contig01, contig02, etc.), or they can have a bit more
meaning/structure. Examples of the latter include:

   gcontig_1106166512749 (ABDU01000001)
   CCB157_001            (BDDQ01000001)
   Lo7_v2_contig_2306    (CCJQ010001199)

  The INSDC has decided that it would be helpful to provide these identifiers
in a formalized way, since they may be known to, or used by, parties other
than the submitters themselves: for example, if a submitter has made them
public in data products, or if they are used by genome browsers, cited in an
analysis, or mentioned on websites.

  These submitter identifiers will be provided via a new qualifier of the
source feature : /submitter_seqid . The value format for the qualifier will
be free text. 

  A complete definition of the qualifier will be provided when it becomes
available. The earliest implementation date would be as of GenBank Release
223.0 on December 15th 2017. But a more realistic timeframe is January or
February of 2018.

1.4.2 New /gap_type value : "contamination"

  When contamination is discovered in a sequence record, removing the
bases from the sequence data can be problematic (especially at the 5' end)
because the length of the sequence changes. If there exist higher-level
scaffold/CON-division records (possibly chromosomes), the resulting change
in length requires an adjustment to the coordinate system of the scaffold/
chromosome, and the features annotated on it. The impact of such a change
on both data submitters and users can impose quite a burden.

  To address this, the INSDC has decided to introduce a new Gap Type for
the assembly_gap feature : contamination . When sequence contamination
is discovered, the submitter will have the option of replacing the affected
base pairs with Ns, via a terminal assembly_gap feature. For example:

     assembly_gap    1..2956
                     /estimated_length=2956
                     /gap_type="contamination"
                     /note="contamination masked with Ns"

  An updated definition of the /gap_type qualifier will be provided when
it becomes available. The earliest implementation date would be as of
GenBank Release 223.0 on December 15th 2017. But a more realistic timeframe
is January or February of 2018.
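
  Software that reads flatfiles may want to recognize the new value once it
appears. The Python sketch below scans feature-table text for assembly_gap
features whose /gap_type is "contamination" and returns their locations. It
is an illustration only: it assumes the usual flatfile layout (feature key
indented five spaces, qualifiers on their own indented lines beginning with
"/"), handles only single-line qualifier values, and uses hypothetical names
throughout. The sample text is the example feature shown above.

```python
import re

EXAMPLE = '''\
     assembly_gap    1..2956
                     /estimated_length=2956
                     /gap_type="contamination"
                     /note="contamination masked with Ns"
'''

# A feature line: five spaces, the feature key, then the location.
FEATURE_KEY = re.compile(r"^ {5}(\S+)\s+(\S+)")
# A qualifier line: leading whitespace, then /name=value.
QUALIFIER = re.compile(r"^\s+/(\w+)=(.*)$")


def contamination_gaps(feature_text):
    """Return locations of assembly_gap features with /gap_type="contamination"."""
    gaps = []
    current_key = None
    current_location = None
    for line in feature_text.splitlines():
        key_match = FEATURE_KEY.match(line)
        if key_match:
            current_key, current_location = key_match.groups()
            continue
        qual = QUALIFIER.match(line)
        if qual and current_key == "assembly_gap":
            name = qual.group(1)
            value = qual.group(2).strip().strip('"')
            if name == "gap_type" and value == "contamination":
                gaps.append(current_location)
    return gaps


print(contamination_gaps(EXAMPLE))  # ['1..2956']
```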



