[Genbank-bb] GenBank Release 225.0 Available : April 18 2018

Cavanaugh, Mark (NIH/NLM/NCBI) [E] via genbankb@net.bio.net (by cavanaug from ncbi.nlm.nih.gov)
Wed Apr 18 16:36:27 EST 2018

Greetings GenBank Users,

  GenBank Release 225.0 is now available via FTP from the National Center
for Biotechnology Information (NCBI):

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 225.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 225.0

 Close-of-data for GenBank 225.0 occurred on 04/14/2018. Uncompressed,
the Release 225.0 flatfiles require roughly 885 GB (sequence files only).
The ASN.1 data require approximately 727 GB.

Recent statistics for 'traditional' sequences (including non-bulk-oriented
TSA, and excluding WGS, bulk-oriented TSA, TLS, and the CON-division):

  Release  Date      Base Pairs    Entries

  224      Feb 2018  253630708098  207040555
  225      Apr 2018  260189141631  208452303

Recent statistics for WGS sequencing projects:

  Release  Date      Base Pairs     Entries

  224      Feb 2018  2608532210351  564286852
  225      Apr 2018  2784740996536  621379029

Recent statistics for bulk-oriented TSA sequencing projects:

  Release  Date      Base Pairs    Entries

  224      Feb 2018  193940551226  214324264
  225      Apr 2018  205232396043  227364990

Recent statistics for bulk-oriented TLS sequencing projects:

  Release  Date      Base Pairs  Entries

  224      Feb 2018  4531966831  12819978
  225      Apr 2018  5612769448  14782654

During the 60 days between the close dates for GenBank Releases 224.0
and 225.0, the 'traditional' portion of GenBank grew by 6,558,433,533
basepairs and by 1,411,748 sequence records. During that same period,
86,960 records were updated. An average of 24,978 'traditional' records
were added and/or updated per day.

  Between releases 224.0 and 225.0, the WGS component of GenBank grew by
176,208,786,185 basepairs and by 57,092,177 sequence records.

  Between releases 224.0 and 225.0, the TSA component of GenBank grew by
11,291,844,817 basepairs and by 13,040,726 sequence records.

  Between releases 224.0 and 225.0, the TLS component of GenBank grew by
1,080,802,617 basepairs and by 1,962,676 sequence records.
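
  The growth figures above can be re-derived from the statistics tables.
The following is a minimal Python sketch, not part of the release tooling:
the dictionary and function names are illustrative, and the totals are
copied by hand from this announcement.

```python
# (base pairs, entries) for Release 224.0 and Release 225.0, copied from
# the statistics tables in this announcement.
TOTALS = {
    "traditional": ((253630708098, 207040555), (260189141631, 208452303)),
    "WGS": ((2608532210351, 564286852), (2784740996536, 621379029)),
    "TSA": ((193940551226, 214324264), (205232396043, 227364990)),
    "TLS": ((4531966831, 12819978), (5612769448, 14782654)),
}


def growth(component):
    """Return (base-pair growth, record growth) between the two releases."""
    (old_bp, old_n), (new_bp, new_n) = TOTALS[component]
    return new_bp - old_bp, new_n - old_n


def daily_average(added, updated, days=60):
    """Average number of records added and/or updated per day."""
    return (added + updated) / days


if __name__ == "__main__":
    for name in TOTALS:
        bp, records = growth(name)
        print(f"{name:12s} +{bp:,} basepairs  +{records:,} records")
    # 1,411,748 added plus 86,960 updated over 60 days: about 24,978 per day
    print(round(daily_average(growth("traditional")[1], 86960)))
```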

  For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 225.0 and Upcoming Changes) have been appended
below for your convenience.

                * * * Important Note * * *

  Section 1.4.1 of the GenBank release notes describes future accession
format changes for WGS/TSA/TLS sequencing projects, and for protein
sequences. These important changes are likely to be of interest to many
GenBank users, and we encourage a review of the section.

  Release 225.0 data, and subsequent updates, are available now via
NCBI's Entrez and Blast services.

  As a general guideline, we suggest first transferring the GenBank
release notes (gbrel.txt) whenever a release is being obtained. Check
to make sure that the date and release number in the header of the
release notes are current (e.g., April 15 2018, 225.0). If they are
not, interrupt the remaining transfers and then request assistance from
the NCBI Service Desk.

  A comprehensive check of the headers of all release files after your
transfers are complete is also suggested. Here's how one might go about
this on a Unix or Linux platform, using csh/tcsh :

        set files = `ls gb*.*`
        foreach i ($files)
                head -10 $i | grep Release
        end

Or, if the files are compressed, the command inside the loop might be:

        gzcat $i | head -10 | grep Release
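
  For sites without csh, roughly the same check can be sketched in Python.
This is only an illustration: the function names, the "gb*.*" pattern, and
the expected string "Release 225.0" are assumptions taken from the example
above, and the script simply reports files whose first ten lines do not
mention the expected release.

```python
import glob
import gzip
import itertools


def header_lines(path, count=10):
    """Return the first `count` lines of a plain or gzip-compressed file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="ascii", errors="replace") as handle:
        return list(itertools.islice(handle, count))


def files_missing_release(pattern="gb*.*", release="225.0"):
    """List files whose header does not mention the expected release."""
    expected = f"Release {release}"
    return [
        path
        for path in sorted(glob.glob(pattern))
        if not any(expected in line for line in header_lines(path))
    ]


if __name__ == "__main__":
    for path in files_missing_release():
        print("unexpected header:", path)
```

Only the first ten lines of each file are read, so compressed files are not
fully decompressed.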

  If you encounter problems while ftp'ing or uncompressing Release
225.0, please send email outlining your difficulties to:

        info from ncbi.nlm.nih.gov

Mark Cavanaugh, Michael Kimelman, Ilya Dondoshansky

1.3 Important Changes in Release 225.0

1.3.1 Organizational changes

  The total number of sequence data files increased by 59 with this release:

  - the BCT division is now composed of 474 files (+24)
  - the CON division is now composed of 365 files (+3)
  - the ENV division is now composed of 102 files (+1)
  - the HTG division is now composed of 155 files (+1)
  - the INV division is now composed of 163 files (+2)
  - the MAM division is now composed of  55 files (+16)
  - the PAT division is now composed of 330 files (+7)
  - the PLN division is now composed of 172 files (+4)
  - the VRL division is now composed of  54 files (+1)

1.3.2 GSS File Header Problem

  GSS sequences at GenBank are maintained in two different systems, depending
on their origin, and the dumps from those systems occur in parallel. Because
the second dump (for example) has no prior knowledge of exactly how many GSS
files will be dumped by the first, it does not know how to number its own
output files.

  There is thus a discrepancy between the filenames and file headers for 130
of the GSS flatfiles in Release 225.0. Consider gbgss175.seq :

GBGSS1.SEQ          Genetic Sequence Data Bank
                           April 15 2018

                NCBI-GenBank Flat File Release 225.0

                           GSS Sequences (Part 1)

   87375 loci,    64103840 bases, from    87375 reported sequences

  Here, the filename and part number in the header are "1", though the file
has been renamed as "175" based on the number of files dumped from the other
system. Files gbgss175.seq.gz through gbgss304.seq.gz are affected. We hope
to resolve this discrepancy at some point, but the priority is certainly
much lower than many other tasks.
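
  Anyone validating headers against filenames may want to allow for this
offset. A small Python sketch follows. It assumes that the header part
numbers of the affected files run consecutively from 1, which is implied by
the gbgss175.seq example but not stated for every file; the names used are
illustrative.

```python
FIRST_AFFECTED = 175  # gbgss175.seq carries the header "GBGSS1.SEQ"
LAST_AFFECTED = 304   # last affected file named in the release notes


def expected_header_part(file_number):
    """Part number expected in the header of gbgss<file_number>.seq."""
    if FIRST_AFFECTED <= file_number <= LAST_AFFECTED:
        # Assumption: the second dump numbers its files 1, 2, 3, ...
        return file_number - FIRST_AFFECTED + 1
    return file_number


if __name__ == "__main__":
    print(expected_header_part(175))                # 1
    print(LAST_AFFECTED - FIRST_AFFECTED + 1)       # 130 affected files
```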

1.4 Upcoming Changes

1.4.1 Changes to accession formats for WGS/TSA/TLS sequence projects, and proteins

  The accession format used for Whole Genome Shotgun, Transcriptome Shotgun
Assembly, and Targeted Locus Study sequencing projects consists of a
four-letter Project Code prefix, a two-digit Assembly-Version number, and
then 6, 7, or 8 digits (depending on the number of sequences in the project).
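
  As an illustration of the current 4 + 2 + 6/7/8 layout, here is a minimal
Python sketch that splits such an accession into its three parts. The class
and function names are hypothetical, and the pattern covers only the format
described in this paragraph.

```python
import re
from typing import NamedTuple, Optional

# Four-letter Project Code, two-digit Assembly-Version, then 6 to 8 digits.
CURRENT_FORMAT = re.compile(r"([A-Z]{4})(\d{2})(\d{6,8})")


class ProjectAccession(NamedTuple):
    project_code: str
    assembly_version: int
    sequence_number: int


def parse_project_accession(accession: str) -> Optional[ProjectAccession]:
    """Split a WGS/TSA/TLS accession, or return None if it does not fit."""
    match = CURRENT_FORMAT.fullmatch(accession)
    if match is None:
        return None
    code, version, serial = match.groups()
    return ProjectAccession(code, int(version), int(serial))


if __name__ == "__main__":
    print(parse_project_accession("ABDU01000001"))
    print(parse_project_accession("CCJQ010001199"))
```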

  With only four letters available for the Project Code prefix, the total
number of projects that can be supported is 456,976. As of April 2018,
the INSDC has become uncomfortably close to this limit, so discussions are
underway to expand the format.

  One of the approaches being considered is a 6 + 2 + 7/8/9 format. This
could mean that a legacy WGS contig might have an accession of AAAA02000001,
while a new WGS contig might be accessioned as AAAAAA020000001. But even
with a 6-letter prefix, the total number of projects is limited to about
309 million. Given that there are food-safety and pathogen surveillance
projects which can yield 100,000 genomes, this would mean that only about
3100 efforts on that scale could be supported.
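
  The capacity figures quoted above follow from the 26-letter alphabet. A
short Python check, with illustrative names:

```python
ALPHABET_SIZE = 26


def prefix_capacity(letters):
    """Number of distinct Project Codes for a prefix of the given length."""
    return ALPHABET_SIZE ** letters


four_letter = prefix_capacity(4)          # 456,976
six_letter = prefix_capacity(6)           # 308,915,776 (about 309 million)
large_efforts = six_letter // 100_000     # 3,089 (about 3100)

if __name__ == "__main__":
    print(f"{four_letter:,} {six_letter:,} {large_efforts:,}")
```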

  So other approaches are also being explored. The INSDC might add a prefix
to the accession format to indicate the project type: "WGS-", "TSA-", or
"TLS-". That prefix would be followed by a project code of at least six
letters, with the number of characters in the code allowed to increase as
needed.

  There are similar concerns for the three-letter and five-digit protein
accession format. One option under consideration is to increase the number
of digits. Imagine that the AAA-AZZ 3+5 series has been consumed. The INSDC
could return to that 3-letter series, but start issuing accessions at 3+7:
AAA0000001. The number of digits could be allowed to grow, perhaps up to 3+10.

  Essentially, all of the already-used protein accession prefixes
would become available for use once again, but with much greater
capacities. And there would be at least a small visual distinction
between the old and the new: compare values like AAA77854 vs AAA7785401,
or AAA7777854. Or AAA00001 vs AAA0000001.
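
  To show how software might tell the two forms apart by digit count alone,
here is a hedged Python sketch. The 3+7 through 3+10 pattern is hypothetical,
since the new format has not been finalized, and the names are illustrative.
Version suffixes (such as ".1") are not handled.

```python
import re

LEGACY_PROTEIN = re.compile(r"[A-Z]{3}\d{5}")
# Hypothetical: 3+7 up to 3+10 is only under discussion, not a final format.
PROPOSED_PROTEIN = re.compile(r"[A-Z]{3}\d{7,10}")


def classify_protein_accession(accession):
    """Label an accession as legacy 3+5, proposed 3+7..3+10, or neither."""
    if LEGACY_PROTEIN.fullmatch(accession):
        return "legacy 3+5"
    if PROPOSED_PROTEIN.fullmatch(accession):
        return "proposed 3+7..3+10"
    return "unrecognized"


def accessions_per_prefix(digits):
    """How many accessions one 3-letter prefix can hold at a digit count."""
    return 10 ** digits


if __name__ == "__main__":
    for value in ("AAA77854", "AAA7785401", "AAA0000001"):
        print(value, classify_protein_accession(value))
```

Moving from five to seven digits gives each prefix 100 times its former
capacity.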

  Bottom Line: These accession formats *WILL* be changing soon. Although the
details aren't finalized yet, this seems to warrant an announcement to our
users now. Final decisions about the new accession formats are expected to
be made at the May 2018 INSDC meeting. The soonest that the new formats would
be utilized is two GenBank releases after they are formally announced, which
would be October of 2018.

1.4.2 New source feature qualifier : /submitter_seqid

  Data submitters typically have their own identifiers for genomic contigs and
scaffolds of Whole Genome Shotgun (WGS) sequencing projects, the RNA sequences
of Transcriptome Shotgun Assembly (TSA) sequencing projects, and the genomic
loci of Targeted Locus Study (TLS) sequencing projects. These identifiers can
be very simple (contig01, contig02, etc.), or they can have a bit more
meaning/structure. Examples of the latter include:

   gcontig_1106166512749 (ABDU01000001)
   CCB157_001            (BDDQ01000001)
   Lo7_v2_contig_2306    (CCJQ010001199)

  The INSDC has decided that it would be helpful to provide these identifiers
in a formalized way, since they may be known to, or used by, parties other than
the submitters themselves. For example, a submitter might have made them
public in data products, displayed them in genome browsers, cited them in an
analysis, or mentioned them on websites.

  These submitter identifiers will be provided via a new qualifier of the
source feature : /submitter_seqid . The value format for the qualifier will
be free text. 

  A complete definition of the qualifier will be provided when it becomes
available. The earliest implementation date is within the two month period
after GenBank Release 224.0 on February 15th 2018. But a more realistic
timeframe is April 15th of 2018.

1.4.3 New /gap_type value : "contamination"

  When contamination is discovered in a sequence record, removing the
bases from the sequence data can be problematic (especially at the 5' end)
because the length of the sequence changes. If there exist higher-level
scaffold/CON-division records (possibly chromosomes), the resulting change
in length requires an adjustment to the coordinate system of the scaffold/
chromosome, and the features annotated on it. The impact of such a change
on both data submitters and users can impose quite a burden.

  To address this, the INSDC has decided to introduce a new Gap Type for
the assembly_gap feature : contamination . When sequence contamination
is discovered, the submitter will have the option of replacing the affected
base pairs with Ns, via a terminal assembly_gap feature. For example:

     assembly_gap    1..2956
                     /note="contamination masked with Ns"
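
  The difference between masking and trimming can be sketched in Python.
This is an illustration only, with hypothetical function names; it is not
how NCBI processes submissions. Masking keeps the sequence length, so
downstream coordinates are unchanged, while trimming shifts them.

```python
def mask_span(sequence, start, end):
    """Replace bases start..end (1-based, inclusive) with Ns."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("span lies outside the sequence")
    return sequence[:start - 1] + "N" * (end - start + 1) + sequence[end:]


def trim_span(sequence, start, end):
    """Remove bases start..end (1-based, inclusive), shortening the record."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("span lies outside the sequence")
    return sequence[:start - 1] + sequence[end:]


if __name__ == "__main__":
    original = "ACGTACGT"
    print(mask_span(original, 1, 3))   # NNNTACGT, still 8 bases
    print(trim_span(original, 1, 3))   # TACGT, only 5 bases
```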

  An updated definition of the /gap_type qualifier will be provided when
it becomes available. The earliest implementation date is within the two
month period after GenBank Release 224.0 on February 15th 2018. But a
more realistic timeframe is April 15th of 2018.
