GenBank Release 145.0 Now Available

Mark Cavanaugh cavanaug at ncbi.nlm.nih.gov
Wed Dec 22 23:13:24 EST 2004

Greetings GenBank Users,

  GenBank Release 145.0 is now available via ftp from the National
Center for Biotechnology Information (NCBI):

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 145.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 145.0

  Close-of-data was 12/16/2004. Six business days were required to build
Release 145.0. Uncompressed, the Release 145.0 flatfiles require approximately
153 GB (sequence files only) or 170 GB (including the 'short directory' and
'index' files).  The ASN.1 version requires approximately 133 GB. From
the release notes:

   Release  Date       Base Pairs   Entries

   144      Oct 2004   43194602655  38941263
   145      Dec 2004   44575745176  40604319

In the eight-week period between the close dates for GenBank Releases 144.0
and 145.0, the non-WGS portion of GenBank grew by 1,381,142,521 basepairs
and by 1,663,056 sequence records. During that same period, 418,105 records
were updated. Combined, this yields an average of about 32,518 new and/or
updated records per day.
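
  The growth figures can be reproduced from the release table. Here is a
minimal sketch in Python, with the totals copied from the table (the
variable and function names are illustrative only):

```python
# Totals copied from the release table (non-WGS portion of GenBank).
releases = {
    144: {"base_pairs": 43194602655, "entries": 38941263},
    145: {"base_pairs": 44575745176, "entries": 40604319},
}

def growth(old, new):
    """Return (base-pair growth, entry growth) between two releases."""
    return (new["base_pairs"] - old["base_pairs"],
            new["entries"] - old["entries"])

bp_growth, entry_growth = growth(releases[144], releases[145])
print(f"{bp_growth:,} basepairs, {entry_growth:,} records")
# prints "1,381,142,521 basepairs, 1,663,056 records"
```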

  Between releases 144.0 and 145.0, the WGS component of GenBank grew by
4,137,665,849 basepairs and by 125,139 sequence records.

        * * * Important * * * 

        The SDSC GenBank mirror site is experiencing problems
	caused by disk space limitations. Users of this site should closely
	check the file content (total number of files and their dates) at
	the mirror before using it. We will provide further details about
	the status of the SDSC mirror as they become available.

  As a general guideline, we suggest first transferring the GenBank release
notes (gbrel.txt) whenever a release is being obtained. Check to make sure
that the date and release number in the header of the release notes are
current (e.g., December 15 2004, 145.0). If they are not, interrupt the
remaining transfers and then request assistance from the NCBI Service Desk.

  A comprehensive check of the headers of all release files after your
transfers are complete is also suggested. Here's how one might go about
this on a unix platform with csh/tcsh:

	set files = `ls gb*.*`
	foreach i ($files)
		head -10 $i | grep Release
	end

Or, if the files are compressed, perhaps:

	gzcat $i | head -10 | grep Release
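
  For sites that prefer a script over an interactive shell, the same header
check might be sketched in Python. This is only an illustration: the
function names are made up here, and it assumes that every release file
names the release within its first ten lines, as the commands above do.

```python
import glob
import gzip
import os

def header_lines(path, count=10):
    """Return the first `count` lines of a plain or gzip-compressed file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="latin-1") as handle:
        lines = []
        for line in handle:
            lines.append(line.rstrip("\n"))
            if len(lines) >= count:
                break
        return lines

def files_missing_release(directory, release="145.0"):
    """List gb* files whose first ten lines do not mention the release."""
    wanted = "Release " + release
    missing = []
    for path in sorted(glob.glob(os.path.join(directory, "gb*.*"))):
        if not any(wanted in line for line in header_lines(path)):
            missing.append(os.path.basename(path))
    return missing

if __name__ == "__main__":
    # Report every file in the current directory with an unexpected header.
    for name in files_missing_release("."):
        print("unexpected header:", name)
```

An empty report means every gb* file in the directory carries the expected
release number near the top of the file.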

  For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 145.0 and Upcoming Changes) have been appended below.

  Release 145.0 data, and subsequent updates, are available now via
NCBI's Entrez and Blast services.

  If you encounter problems while ftp'ing or uncompressing Release
145.0, please send email outlining your difficulties to:

	info at ncbi.nlm.nih.gov 

Mark Cavanaugh, Vladimir Alekseyev, Aleksey Vysokolov, Michael Kimelman

1.3 Important Changes in Release 145.0

1.3.1 Organizational changes

  The total number of sequence data files increased by 21 with this release:

  - the BCT division is now comprised of  11 files (+1)
  - the EST division is now comprised of 355 files (+6)
  - the GSS division is now comprised of 132 files (+12)
  - the PAT division is now comprised of  17 files (+1)
  - the ROD division is now comprised of  15 files (+1)

1.3.2 New gap feature

  A new feature key for sequence gaps becomes legal as of this December 2004
GenBank release:

Feature key           gap

Definition            gap in the sequence
Mandatory qualifiers  /estimated_length=unknown or <integer>
Optional qualifiers   /map="text"
Comment               the location span of the gap feature for an unknown 
                      gap is 100 bp, with the 100 bp indicated as 100 "n"s in 
                      the sequence.  Where estimated length is indicated by 
                      an integer, this is indicated by the same number of 
                      "n"s in the sequence. 
                      No upper or lower limit is set on the size of the gap.

  Gap features will begin to appear in post-Release 145.0 GenBank Update files
in early January of 2005. They will frequently be seen in Phase 0, 1, and 2 HTG
records: each gap feature will coincide with the runs of N's in the sequence
data that separate adjacent sequence contigs.
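
  Since each gap feature is expected to coincide with a run of N's, software
that processes HTG records may want to locate those runs in advance. A
minimal Python sketch follows; the function name and the toy sequence are
hypothetical, and coordinates are reported 1-based and inclusive, as in
feature locations:

```python
import re

def find_n_runs(sequence, min_length=1):
    """Return 1-based inclusive (start, end) spans of consecutive n/N bases."""
    spans = []
    for match in re.finditer(r"[nN]+", sequence):
        if match.end() - match.start() >= min_length:
            spans.append((match.start() + 1, match.end()))
    return spans

# Toy sequence: two short contigs separated by 100 n's.
example = "acgt" + "n" * 100 + "ttga"
for start, end in find_n_runs(example):
    print(f"gap candidate {start}..{end} ({end - start + 1} bp)")
# prints "gap candidate 5..104 (100 bp)"
```

Note that a run of exactly 100 n's cannot by itself distinguish a gap of
unknown length from an estimated length of 100; only the /estimated_length
qualifier carries that information.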

1.3.4 GSS File Header Problem

  GSS sequences at GenBank are maintained in two different systems, depending
on their origin, and the dumps from those systems occur in parallel. Because
the second dump (for example) has no prior knowledge of exactly how many GSS
files will be dumped from the first, it does not know how to number its own
output files.

  There is thus a discrepancy between the filenames and file headers for
twenty-three GSS flatfiles in Release 145.0. Consider gbgss110.seq:

GBGSS1.SEQ           Genetic Sequence Data Bank
                          December 15 2004

                NCBI-GenBank Flat File Release 145.0

                           GSS Sequences (Part 1)

   88212 loci,    65541827 bases, from    88212 reported sequences

  Here, the filename and part number in the header are "1", though the file
has been renamed as "110" based on the number of files dumped from the other
system.  We will work to resolve this discrepancy in future releases, but the
priority is certainly much lower than many other tasks.
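
  Software that validates headers against filenames will need to tolerate
this mismatch for the affected GSS files. A small Python sketch of such a
comparison is shown here; the function names are hypothetical, and it
assumes the header's first line begins with the uppercase filename, as in
the gbgss110.seq example:

```python
import os

def header_filename(first_line):
    """Return the filename token that starts a flatfile header, lowercased."""
    tokens = first_line.split()
    return tokens[0].lower() if tokens else ""

def header_matches(path, first_line):
    """True when the header's filename token equals the file's actual name."""
    return header_filename(first_line) == os.path.basename(path).lower()

line = "GBGSS1.SEQ           Genetic Sequence Data Bank"
print(header_matches("gbgss110.seq", line))  # prints "False"
print(header_matches("gbgss1.seq", line))    # prints "True"
```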

1.4 Upcoming Changes

1.4.1 New ENV Division in April 2005

  A new division for sequences obtained via environmental sampling methods
will be introduced with GenBank Release 147.0 in April 2005. Records in this
new division will have these characteristics:

  1. ENV division code on the LOCUS line
  2. ENV keyword
  3. /environmental_sample qualifier in the source feature

This new division will segregate sequences for which the source organism is
unknown, or can only be inferred by sequence comparison.

  Sequences from WGS projects that involve environmental sampling will *not*
be distributed via this new division. All WGS projects will continue to be
distributed using project-specific data files at the NCBI FTP site:


  Additional information about the new ENV division will be provided via
these release notes and the GenBank newsgroup.  

1.4.2 Continuous ranges of secondary accessions

  With the removal of sequence length limits, some genomes (typically
bacterial) that had been split into many pieces are gradually being
replaced by a single sequence record. U00096 is a good example.

  When this happens, the accessions of the former small pieces become
secondary accessions for the single large sequence record. When each
secondary is separately listed, the ACCESSION line becomes excessively long.

  As of GenBank Release 146.0 in February 2005, it will be legal to
represent continuous ranges of secondary accessions by a start accession,
a dash character, and an end accession. In the case of U00096, the
ACCESSION line would thus look like:

	ACCESSION   U00096 AE000111-AE000510

  Further details about the conventions for secondary accession ranges
will be provided via these release notes and the GenBank newsgroup.  
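
  Parsers that read ACCESSION lines may need to expand such ranges. A
possible approach is sketched below in Python. Because the conventions have
not yet been published, this is a guess at the likely rules: it assumes both
ends of a range share the same letter prefix and the same number of digits.
The function name is hypothetical.

```python
import re

def expand_accession_range(token):
    """Expand 'AE000111-AE000510' into individual accessions.

    Assumes both ends share a letter prefix and digit width; this is an
    illustrative guess, not the published convention.
    """
    if "-" not in token:
        return [token]
    start, end = token.split("-", 1)
    pattern = re.compile(r"([A-Za-z]+)(\d+)")
    first, last = pattern.fullmatch(start), pattern.fullmatch(end)
    if not first or not last:
        raise ValueError(f"not an accession range: {token!r}")
    if (first.group(1) != last.group(1)
            or len(first.group(2)) != len(last.group(2))):
        raise ValueError(f"mismatched range ends: {token!r}")
    low, high = int(first.group(2)), int(last.group(2))
    if low > high:
        raise ValueError(f"descending range: {token!r}")
    width = len(first.group(2))
    return [f"{first.group(1)}{n:0{width}d}" for n in range(low, high + 1)]

line = "ACCESSION   U00096 AE000111-AE000510"
accessions = [acc for tok in line.split()[1:]
              for acc in expand_accession_range(tok)]
print(len(accessions), accessions[0], accessions[1], accessions[-1])
# prints "401 U00096 AE000111 AE000510"
```

The primary accession plus the 400 secondaries in the range give 401 in all.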


- gttaacaattaaagagtgtttatcgaaattcattatatagtggtttatatagaccacttc
- GenBank newsgroup see: http://www.bio.net/hypermail/genbankb/       
- GENBANKB e-mail: messages sent to genbankb at net.bio.net
- subscribe: e-mail biosci-server at net.bio.net with: subscribe genbankb
- unsub: e-mail biosci-server at net.bio.net with: unsubscribe genbankb      
- GenBank on the WWW, see:  http://www.ncbi.nlm.nih.gov/Genbank/
- problems with GENBANKB? E-mail moderator: francis at bioinformatics.ubc.ca                  
