[Genbank-bb] GenBank Release 216.0 Available : October 14 2016

Cavanaugh, Mark (NIH/NLM/NCBI) [E] via genbankb@net.bio.net (by cavanaug@ncbi.nlm.nih.gov)
Fri Oct 14 16:52:43 EST 2016


Greetings GenBank Users,

  GenBank Release 216.0 is now available via FTP from the National Center
for Biotechnology Information (NCBI):

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 216.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 216.0

 Close-of-data for GenBank 216.0 occurred on 10/11/2016. Uncompressed,
the Release 216.0 flatfiles require roughly 797 GB (sequence files only).
The ASN.1 data require approximately 656 GB.

Recent statistics for 'traditional' sequences (including non-bulk-oriented
TSA, and excluding WGS, bulk-oriented TSA, and the CON-division):

  Release  Date      Base Pairs    Entries

  215      Aug 2016  217971437647  196120831
  216      Oct 2016  220731315250  197390691

Recent statistics for WGS sequencing projects:

  Release  Date      Base Pairs     Entries

  215      Aug 2016  1637224970324  359796497
  216      Oct 2016  1676238489250  363213315

Recent statistics for bulk-oriented TSA sequencing projects:

  Release  Date      Base Pairs    Entries

  215      Aug 2016  103399742586  113179607
  216      Oct 2016  113209225762  124199597

  As of this release, the total number of bases has exceeded the
two Terabase threshold : 2010179030262 basepairs.

  During the 53 days between the close dates for GenBank Releases 215.0
and 216.0, the 'traditional' portion of GenBank grew by 2,759,877,603
basepairs and by 1,269,860 sequence records. During that same period,
326,607 records were updated. An average of 30,122 'traditional' records
were added and/or updated per day.

  Between releases 215.0 and 216.0, the WGS component of GenBank grew by
39,013,518,926 basepairs and by 3,416,818 sequence records.

  Between releases 215.0 and 216.0, the TSA component of GenBank grew by
9,809,483,176 basepairs and by 11,019,990 sequence records.
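
  As a rough cross-check of the figures quoted above, the following Python
sketch recomputes the growth of each component, the combined base count for
Release 216, and the daily average for 'traditional' records. The numbers are
copied from the tables in this announcement; the names STATS and growth are
illustrative only and are not part of any NCBI tool.

```python
# Figures copied from the statistics tables above: (base pairs, entries).
STATS = {
    "traditional": {215: (217971437647, 196120831), 216: (220731315250, 197390691)},
    "WGS":         {215: (1637224970324, 359796497), 216: (1676238489250, 363213315)},
    "bulk TSA":    {215: (103399742586, 113179607), 216: (113209225762, 124199597)},
}

def growth(category):
    """Return (base-pair growth, entry growth) between Releases 215 and 216."""
    old_bp, old_n = STATS[category][215]
    new_bp, new_n = STATS[category][216]
    return new_bp - old_bp, new_n - old_n

# Combined base count across the three components for Release 216.
total_bases_216 = sum(STATS[c][216][0] for c in STATS)

# Daily average for 'traditional' records: added plus updated, over 53 days.
DAYS_BETWEEN_CLOSES = 53
UPDATED_RECORDS = 326607
daily_average = (growth("traditional")[1] + UPDATED_RECORDS) / DAYS_BETWEEN_CLOSES

for name in STATS:
    bp, n = growth(name)
    print(f"{name:12s} +{bp:,} bp  +{n:,} entries")
print(f"total bases in 216: {total_bases_216:,}")
print(f"daily average: {daily_average:,.0f}")
```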

  For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 216.0 and Upcoming Changes) have been appended
below for your convenience.

                    * * * IMPORTANT * * *
  A significant change is described in Section 1.4.3 of the release
notes: Removal of NCBI GI sequence identifiers from GenBank, GenPept,
and FASTA sequence formats. Users who make use of GIs in their information
systems and analysis pipelines should take particular note of that section.

  Implementation of this change for GenBank Release and Update FTP
products has been postponed until March 2017. However, GIs will cease
to be displayed in NCBI's Entrez retrieval system as early as late
October of 2016.
                    * * * IMPORTANT * * *

  Release 216.0 data, and subsequent updates, are available now via
NCBI's Entrez and Blast services.

  As a general guideline, we suggest first transferring the GenBank
release notes (gbrel.txt) whenever a release is being obtained. Check
to make sure that the date and release number in the header of the
release notes are current (eg: October 15 2016, 216.0). If they are
not, interrupt the remaining transfers and then request assistance from
the NCBI Service Desk.

  A comprehensive check of the headers of all release files after your
transfers are complete is also suggested. Here's how one might go about
this on a Unix or Linux platform, using csh/tcsh :

        set files = `ls gb*.*`
        foreach i ($files)
                head -10 $i | grep Release
        end

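Or, if the files are compressed, perhaps use the following inside the
same loop (gzip -dc behaves like gzcat, and is also available on systems
that lack a gzcat command):

        gzip -dc $i | head -10 | grep Release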
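
  For sites that prefer a scripting language to csh, here is a minimal
Python sketch of the same header check. It handles both plain and
gzip-compressed files. The function names release_lines and check_headers
are hypothetical, and the gb*.* pattern simply mirrors the csh example:

```python
import glob
import gzip
from itertools import islice

def release_lines(path, max_lines=10):
    """Return lines mentioning 'Release' among the first max_lines of a file.

    Files ending in .gz are decompressed on the fly; others are read as text.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="ascii", errors="replace") as handle:
        return [line.rstrip("\n") for line in islice(handle, max_lines)
                if "Release" in line]

def check_headers(pattern="gb*.*"):
    """Print the 'Release' header lines of every file matching pattern."""
    for name in sorted(glob.glob(pattern)):
        for line in release_lines(name):
            print(f"{name}: {line.strip()}")
```

Calling check_headers() in the directory that holds the release files
prints one line per file, which can then be compared against the expected
release number.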

  If you encounter problems while ftp'ing or uncompressing Release
216.0, please send email outlining your difficulties to:

        info@ncbi.nlm.nih.gov

Mark Cavanaugh, Michael Kimelman, Ilya Dondoshansky, Sergey Zhdanov,
GenBank
NCBI/NLM/NIH/HHS


1.3 Important Changes in Release 216.0

1.3.1 Organizational changes

The total number of sequence data files increased by 30 with this release:

  - the BCT division is now composed of 281 files (+14)
  - the CON division is now composed of 352 files (+1)
  - the INV division is now composed of 151 files (+7)
  - the PAT division is now composed of 268 files (+5)
  - the PLN division is now composed of 135 files (+1)
  - the ROD division is now composed of  30 files (-1)
  - the VRL division is now composed of  44 files (+1)
  - the VRT division is now composed of  63 files (+2)

  Note : The decrease in the number of ROD division files is due to an
on-going effort to convert legacy "segmented set" GenBank records to
a simpler gapped-sequence representation.

1.3.2 GSS File Header Problem

  GSS sequences at GenBank are maintained in two different systems, depending
on their origin, and the dumps from those systems occur in parallel. Because
the second dump (for example) has no prior knowledge of exactly how many GSS
files will be dumped by the first, it does not know how to number its own
output files.

  There is thus a discrepancy between the filenames and file headers for 128
of the GSS flatfiles in Release 216.0. Consider gbgss174.seq :

GBGSS1.SEQ          Genetic Sequence Data Bank
                         October 15 2016

                NCBI-GenBank Flat File Release 216.0

                           GSS Sequences (Part 1)

   87032 loci,    63853715 bases, from    87032 reported sequences

  Here, the filename and part number in the header are both "1" (GBGSS1.SEQ,
Part 1), though the file has been renamed as gbgss174.seq based on the number
of files dumped from the other system.  We hope to resolve this discrepancy at
some point, but the priority is certainly much lower than many other tasks.
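
  Pipelines that validate file headers against filenames may want to
detect this mismatch rather than fail on it. The Python sketch below
compares the part number in a header with the numeric suffix of the
filename. The function names and the assumed filename pattern (gb, a
division code, a number, then .seq with an optional .gz) are illustrative
and not an official specification:

```python
import re

def header_part_number(header_text):
    """Extract N from a '(Part N)' header line, or None if absent."""
    match = re.search(r"\(Part\s+(\d+)\)", header_text)
    return int(match.group(1)) if match else None

def filename_part_number(filename):
    """Extract the numeric suffix from a name such as 'gbgss174.seq'."""
    match = re.fullmatch(r"gb[a-z]+(\d+)\.seq(?:\.gz)?", filename.lower())
    return int(match.group(1)) if match else None

def header_matches_filename(filename, header_text):
    """True only when both numbers are present and equal."""
    expected = filename_part_number(filename)
    found = header_part_number(header_text)
    return expected is not None and expected == found
```

For the gbgss174.seq header shown above, header_matches_filename returns
False, so a checker built this way would need to treat affected GSS files
as a known exception.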

1.4 Upcoming Changes

1.4.1 New feature introduced : propeptide

  A new feature, propeptide, will be added to the INSDC Feature Table Document
as of October 2016. Here is a preliminary definition of the new feature:

Feature Key           propeptide

Definition            propeptide coding sequence; coding sequence for the
                      domain of a proprotein that is cleaved to form the
                      mature protein product

Optional qualifiers   /allele="text"
                      /citation=[number]
                      /db_xref="<database>:<identifier>"
                      /experiment="[CATEGORY:]text"
                      /function="text"
                      /gene="text"
                      /gene_synonym="text"
                      /inference="[CATEGORY:]TYPE[ (same species)][:EVIDENCE_BASIS]"
                      /locus_tag="text" (single token)
                      /map="text"
                      /note="text"
                      /old_locus_tag="text" (single token)
                      /product="text"
                      /pseudo
                      /pseudogene="TYPE"
                      /standard_name="text"

  Although propeptide will technically be legal as of October, implementation
of the new feature key is likely to require several months.

1.4.2 New qualifier introduced : /recombination_class

  A new qualifier, /recombination_class, will be added to the INSDC Feature
Table Document as of October 2016, for use with the misc_recomb feature. Here
is a preliminary definition of the new qualifier:

Qualifier       /recombination_class

Definition      a structured description of the classification of recombination 
                hotspot region within a sequence

Value format    "TYPE"

Example         /recombination_class="meiotic_recombination"
                /recombination_class="breakpoint_junction"

Comment         TYPE is a term taken from the INSDC controlled vocabulary for
                recombination classes. On 15-OCT-2016 the following terms
                were valid:

                meiotic_recombination
                mitotic_recombination
                non_allelic_homologous_recombination
                breakpoint_junction
                other	

  Although /recombination_class will technically be legal as of October,
implementation of the new qualifier is likely to require several months.
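
  Software that parses feature tables could check /recombination_class
values against the vocabulary listed above. The following Python sketch
shows one way to do so. The term set reflects only the list valid on
15-OCT-2016 and may change, and the function name is hypothetical:

```python
# Terms listed as valid on 15-OCT-2016 in the preliminary definition above.
RECOMBINATION_CLASSES = frozenset({
    "meiotic_recombination",
    "mitotic_recombination",
    "non_allelic_homologous_recombination",
    "breakpoint_junction",
    "other",
})

def is_valid_recombination_class(qualifier):
    """Check a qualifier string of the form /recombination_class="TYPE"."""
    prefix = '/recombination_class="'
    text = qualifier.strip()
    if not (text.startswith(prefix) and text.endswith('"')
            and len(text) > len(prefix)):
        return False
    # Strip the prefix and the closing quote, then look up the term.
    return text[len(prefix):-1] in RECOMBINATION_CLASSES
```

Terms must match exactly, so a value written with a space in place of an
underscore is rejected.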

1.4.3 GI sequence identifiers to be removed from GenBank/GenPept/FASTA formats
      and FASTA header to be simplified

                        **NOTE**
  The removal of GI identifiers from GenBank (and RefSeq) Release and
  Update products has been postponed until March of 2017. NCBI will initially
  implement this change only in the Entrez system, in order to better gauge
  the impact that it will have on users. We expect that the Entrez change will
  take place in late October or early November of 2016.
                        **NOTE**

  As of March 15 2017, the integer sequence identifiers known as "GIs" will
no longer be included in the GenBank, GenPept, and FASTA formats for GenBank
Release and GenBank Update products. The FASTA header will be further simplified,
to report only the sequence Accession.Version for records that originate within
the International Sequence Database Collaboration (INSDC).

  As first described in the Release Notes for GenBank 199.0 in December 2013,
NCBI is in the process of moving to storage solutions which utilize only
Accession.Version identifiers. See Section 1.4.4 of these release notes for
additional background information about those developments.

  Given this shift to non-GI-based systems, the importance of using
Accession.Version identifiers cannot be overstated. As early as October 2016,
NCBI will cease displaying GI identifiers in the flatfile and FASTA views
generated within the Entrez:Nucleotide and Entrez:Protein resources. So
any web-related processes solely dependent on GIs will need to be adjusted
before that time or they will cease to work.  

  Previously-assigned GI sequence identifiers will continue to exist
'behind the scenes', and NCBI services which accept GIs as inputs will
continue to be supported. NCBI will be adding support for Accession.Version
identifiers to any services that currently do not support them. As NCBI
makes this transition, we encourage any users who have workflows that
depend on GIs to begin planning to use Accession.Version identifiers instead.

  The FASTA format will also be changed for all sequence records originating
within the INSDC, to report only the Accession.Version and the record title.
This will improve compatibility with other file types provided by NCBI and
others, including GFF3, Gene, and dbSNP download files. This FASTA format
change has already been made for the redesigned genomes FTP site based on
user requests to have a single consistent sequence identifier for both GFF3
and FASTA formats.

  At this time, we plan to continue to provide database source information in
the FASTA header/definition line for non-INSDC sources of sequence data,
including UniProt, PDB structures, PIR, and Patent sequences.

Example 1 : An INSDC nucleotide record

  In the sample record below, nucleotide sequence AF123456 was assigned a
GI of 6633795, and the protein translated from its coding region feature
was assigned a GI of 6633796 :

LOCUS       AF123456                1510 bp    mRNA    linear   VRT 12-APR-2012
DEFINITION  Gallus gallus doublesex and mab-3 related transcription factor 1
            (DMRT1) mRNA, partial cds.
ACCESSION   AF123456
VERSION     AF123456.2  GI:6633795
....
     CDS             <1..936
                     /gene="DMRT1"
                     /note="cDMRT1"
                     /codon_start=1
                     /product="doublesex and mab-3 related transcription factor
                     1"
                     /protein_id="AAF19666.1"
                     /db_xref="GI:6633796"
                     /translation="PAAGKKLPRLPKCARCRNHGYSSPLKGHKRFCMWRDCQCKKCSL
                     IAERQRVMAVQVALRRQQAQEEELGISHPVPLPSAPEPVVKKSSSSSSCLLQDSSSPA
                     HSTSTVAAAAASAPPEGRMLIQDIPSIPSRGHLESTSDLVVDSTYYSSFYQPSLYPYY
                     NNLYNYSQYQMAVATESSSSETGGTFVGSAMKNSLRSLPATYMSSQSGKQWQMKGMEN
                     RHAMSSQYRMCSYYPPTSYLGQGVGSPTCVTQILASEDTPSYSESKARVFSPPSSQDS
                     GLGCLSSSESTKGDLECEPHQEPGAFAVSPVLEGE"

  After March 15 2017, the Accession.Version will be the sole sequence
version identifier. The GI value on the VERSION line and the GI /db_xref
qualifier for the coding region feature will no longer be visible:

LOCUS       AF123456                1510 bp    mRNA    linear   VRT 12-APR-2012
DEFINITION  Gallus gallus doublesex and mab-3 related transcription factor 1
            (DMRT1) mRNA, partial cds.
ACCESSION   AF123456
VERSION     AF123456.2
....
     CDS             <1..936
                     /gene="DMRT1"
                     /note="cDMRT1"
                     /codon_start=1
                     /product="doublesex and mab-3 related transcription factor
                     1"
                     /protein_id="AAF19666.1"
                     /translation="PAAGKKLPRLPKCARCRNHGYSSPLKGHKRFCMWRDCQCKKCSL
                     IAERQRVMAVQVALRRQQAQEEELGISHPVPLPSAPEPVVKKSSSSSSCLLQDSSSPA
                     HSTSTVAAAAASAPPEGRMLIQDIPSIPSRGHLESTSDLVVDSTYYSSFYQPSLYPYY
                     NNLYNYSQYQMAVATESSSSETGGTFVGSAMKNSLRSLPATYMSSQSGKQWQMKGMEN
                     RHAMSSQYRMCSYYPPTSYLGQGVGSPTCVTQILASEDTPSYSESKARVFSPPSSQDS
                     GLGCLSSSESTKGDLECEPHQEPGAFAVSPVLEGE"

Example 2 : A GenPept record for an INSDC sequence

The current GenPept display format includes GI identifiers in the VERSION lines
(note that the coding region feature for GenPept has never included the display
of GI identifiers) :

LOCUS       AAF19666                 311 aa            linear   VRT 12-APR-2012
DEFINITION  doublesex and mab-3 related transcription factor 1, partial [Gallus
            gallus].
ACCESSION   AAF19666
VERSION     AAF19666.1  GI:6633796
DBSOURCE    accession AF123456.2
....
     CDS             1..311
                     /gene="DMRT1"
                     /coded_by="AF123456.2:<1..936"

After March 15 2017, the VERSION line will no longer include the GI value:

LOCUS       AAF19666                 311 aa            linear   VRT 12-APR-2012
DEFINITION  doublesex and mab-3 related transcription factor 1, partial [Gallus
            gallus].
ACCESSION   AAF19666
VERSION     AAF19666.1
DBSOURCE    accession AF123456.2
....
     CDS             1..311
                     /gene="DMRT1"
                     /coded_by="AF123456.2:<1..936"

Example 3: FASTA format for an INSDC nucleotide and protein sequence

The current FASTA display, for most products, includes GI and database
source information (eg, 'gb' for GenBank, 'emb' for ENA, 'dbj' for
DDBJ), using the '|' character as a delimiter:

>gi|6633795|gb|AF123456.2| Gallus gallus doublesex and mab-3 related transcription factor 1 (DMRT1) mRNA, partial cds
CCGGCGGCGGGCAAGAAGCTGCCGCGTCTGCCCAAGTGTGCCCGCTGCCGCAACCACGGCTACTCCTCGC
CGCTGAAGGGGCACAAGCGGTTCTGCATGTGGCGGGACTGCCAGTGCAAGAAGTGCAGCCTGATCGCCGA
[....]

>gi|6633796|gb|AAF19666.1| doublesex and mab-3 related transcription factor 1, partial [Gallus gallus]
PAAGKKLPRLPKCARCRNHGYSSPLKGHKRFCMWRDCQCKKCSLIAERQRVMAVQVALRRQQAQEEELGI
SHPVPLPSAPEPVVKKSSSSSSCLLQDSSSPAHSTSTVAAAAASAPPEGRMLIQDIPSIPSRGHLESTSD
LVVDSTYYSSFYQPSLYPYYNNLYNYSQYQMAVATESSSSETGGTFVGSAMKNSLRSLPATYMSSQSGKQ
WQMKGMENRHAMSSQYRMCSYYPPTSYLGQGVGSPTCVTQILASEDTPSYSESKARVFSPPSSQDSGLGC
LSSSESTKGDLECEPHQEPGAFAVSPVLEGE


As of March 15 2017, just the Accession.Version will be provided:

>AF123456.2 Gallus gallus doublesex and mab-3 related transcription factor 1 (DMRT1) mRNA, partial cds
CCGGCGGCGGGCAAGAAGCTGCCGCGTCTGCCCAAGTGTGCCCGCTGCCGCAACCACGGCTACTCCTCGC
CGCTGAAGGGGCACAAGCGGTTCTGCATGTGGCGGGACTGCCAGTGCAAGAAGTGCAGCCTGATCGCCGA
[....]

>AAF19666.1 doublesex and mab-3 related transcription factor 1, partial [Gallus gallus]
PAAGKKLPRLPKCARCRNHGYSSPLKGHKRFCMWRDCQCKKCSLIAERQRVMAVQVALRRQQAQEEELGI
SHPVPLPSAPEPVVKKSSSSSSCLLQDSSSPAHSTSTVAAAAASAPPEGRMLIQDIPSIPSRGHLESTSD
LVVDSTYYSSFYQPSLYPYYNNLYNYSQYQMAVATESSSSETGGTFVGSAMKNSLRSLPATYMSSQSGKQ
WQMKGMENRHAMSSQYRMCSYYPPTSYLGQGVGSPTCVTQILASEDTPSYSESKARVFSPPSSQDSGLGC
LSSSESTKGDLECEPHQEPGAFAVSPVLEGE
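
During the transition, tools may encounter both FASTA header styles. The
Python sketch below extracts the Accession.Version from either one. It
covers only the INSDC forms shown in Example 3 ('gb', 'emb', 'dbj') and
deliberately rejects other '|'-delimited headers. The function name is
hypothetical:

```python
INSDC_TAGS = {"gb", "emb", "dbj"}

def accession_version(defline):
    """Return the Accession.Version from an old- or new-style FASTA header."""
    if not defline.startswith(">"):
        raise ValueError("FASTA definition lines start with '>'")
    parts = defline[1:].split(None, 1)
    if not parts:
        raise ValueError("empty FASTA definition line")
    identifier = parts[0]
    if "|" not in identifier:
        # New style: >AF123456.2 title
        return identifier
    fields = identifier.split("|")
    if len(fields) >= 4 and fields[0] == "gi" and fields[2] in INSDC_TAGS:
        # Old style: >gi|6633795|gb|AF123456.2| title
        return fields[3]
    raise ValueError(f"unrecognized identifier format: {identifier!r}")
```

Because both styles map to the same Accession.Version string, an index
keyed on that value should keep working across the format change.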

Please direct any inquiries about these changes to the NCBI Service Desk:

  info@ncbi.nlm.nih.gov

1.4.4 GI sequence identifiers are being phased out at NCBI

  The numeric GI sequence identifier that NCBI used to assign to all
nucleotide and protein sequences was first introduced for GenBank Release
products as of GenBank 81.0, in February 1994. See:

     ftp://ftp.ncbi.nih.gov/genbank/release.notes/gb81.release.notes

 These simple, uniform, integer-based unique identifiers (which predated the
introduction of Accession.Version sequence identifiers) were crucial to the
development of NCBI's Entrez retrieval system, and have served their purpose
very well for over 20 years. 

  However, as NCBI considers how best to address the expected increase in the
volume of submitted sequence data, it is clear that prior practices will need
to be re-thought. As an example, imagine 100,000 pathogen-related
genomes/samples, each with 5000 proteins, most of which are common to all. We
will be moving toward solutions that represent each unique protein *once*.
The coding region protein products for each genome will likely continue to be
assigned their own Accession.Version identifiers, but (within the NCBI data
model) they will simply *reference* the unique proteins. And, they will no
longer be issued GIs of their own.

  Such a change will likely have a significant impact on NCBI users who
utilize GIs in their own information systems and analysis pipelines, so it is
being implemented gradually. Unannotated WGS projects consisting of millions
of contigs and scaffolds, and unannotated TSA projects, are the first two
classes of records for which GIs are no longer being assigned. But the practice
will ultimately expand to include other classes of records.

  If GIs are central to your operations, NCBI strongly urges that you begin
planning a switch to the use of Accession.Version identifiers instead.

  The contigs and scaffolds of the ALWZ04 WGS project are good examples of
sequences that lack GIs. Below are excerpts from the flatfile representation
of the first ALWZ04 contig, and the 'singleton scaffold' which is constructed
from it. Note the absence of a GI value on the VERSION line of these two
records:

LOCUS       ALWZ040000001           1191 bp    DNA     linear   PLN 13-MAR-2015
DEFINITION  Picea glauca, whole genome shotgun sequence.
ACCESSION   ALWZ040000001 ALWZ040000000
VERSION     ALWZ040000001.1
DBLINK      BioProject: PRJNA83435
            BioSample: SAMN01120252
....
ORIGIN      
        1 ctataatacc cctatgccaa acgaacccaa ttgtaaatgt aaatgcaaat gtacttaggc
       61 tggttagttg tttaatatca ttttttgtat gcaccttcca tggtataatg cgcacatgta
      121 tagcgcacta aaattatgaa gtgtgcccat tccaagatat tgcgcgtaaa aaacttaagt
      181 gtgcatgatt ttgagactag ggagactttg tgtatatgtt gtgttttata tgctggagag
      241 acaattatta ttagttagga ggattatgtt ttgtactagg caagagagcc tagatgttaa
      301 aggctagtga gcctattttt gtatatgtct catcattaat ataatacatc attgtgtgta
....
      901 ttgttgggaa ttgatttcct gaatgtgtta aactgcattg atagggatct gagaattcct
      961 ttctggccta ttgctgaagc tttggaaggg aggtggggca accgagggac tgttgagaag
     1021 agaagggtca cacttcctgg ggtgggacaa gcatgtgggg aattagggat tgcaggatgt
     1081 tagtttgaat tggcacctat gacagagtct ttcctattgt ctgagatatg tcagcttggt
     1141 taggaaaccc tttacctggg tagagtttag tcccagctcg ggggtgaccc a
//

LOCUS       ALWZ04S0000001          1191 bp    DNA     linear   CON 13-MAR-2015
DEFINITION  Picea glauca Pg-01r141201s0000001, whole genome shotgun sequence.
ACCESSION   ALWZ04S0000001 ALWZ040000000
VERSION     ALWZ04S0000001.1
DBLINK      BioProject: PRJNA83435
            BioSample: SAMN01120252
....
CONTIG      join(ALWZ040000001.1:1..1191)
//

Sample URLs from which ALWZ04 data may be obtained include:

  http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=ALWZ04#contigs
  http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=ALWZ04#scaffolds

  http://www.ncbi.nlm.nih.gov/Traces/wgs/?download=ALWZ04.gbff.1.gz
  http://www.ncbi.nlm.nih.gov/Traces/wgs/?download=ALWZ04S.gbff.1.gz

  ftp://ftp.ncbi.nlm.nih.gov/genbank/wgs/wgs.ALWZ.*.gbff.gz
  ftp://ftp.ncbi.nlm.nih.gov/genbank/wgs/wgs.ALWZ.scflds.*.gbff.gz
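
The CONTIG line of the scaffold record above refers to its component by
Accession.Version alone, so it can be resolved without GIs. Here is a
minimal Python sketch for pulling the components out of such a line. It
handles only simple accession:start..end segments like the one shown, and
ignores gap() and complement() operators. The function name is
hypothetical:

```python
import re

# One segment: Accession.Version, a colon, then start..end coordinates.
_SEGMENT_RE = re.compile(r"([A-Z0-9_]+\.\d+):(\d+)\.\.(\d+)")

def contig_segments(contig_line):
    """Return (accession.version, start, end) tuples from a CONTIG line."""
    return [(acc, int(start), int(end))
            for acc, start, end in _SEGMENT_RE.findall(contig_line)]
```

For the singleton scaffold ALWZ04S0000001, the single segment spans
1..1191, which matches the 1191 bp length on its LOCUS line.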




