GenBank Release 146.0 Now Available

Mark Cavanaugh cavanaug at ncbi.nlm.nih.gov
Mon Feb 21 21:25:19 EST 2005

Greetings GenBank Users,

  GenBank Release 146.0 is now available via ftp from the National
Center for Biotechnology Information (NCBI):

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 146.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 146.0

  Close-of-data was 02/16/2005. Five business days were required to build
Release 146.0. Uncompressed, the Release 146.0 flatfiles require approximately
162 GB (sequence files only) or 180 GB (including the 'short directory' and
'index' files).  The ASN.1 version requires approximately 140 GB. From
the release notes:

   Release  Date       Base Pairs   Entries

   145      Dec 2004   44575745176  40604319
   146      Feb 2005   46849831226  42734478

In the nearly nine-week period between the close dates for GenBank Releases 145.0
and 146.0, the non-WGS portion of GenBank grew by 2,274,086,050 basepairs
and by 2,130,159 sequence records. During that same period, 489,419 records
were updated. Combined, this yields an average of about 42,250 new and/or
updated records per day.
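
  The figures above can be cross-checked against the table quoted from the
release notes. The short Python sketch below is an illustration only, not part
of the release. It recomputes the growth between the two releases and the
length of the period implied by the quoted daily average; the variable names
are hypothetical.

```python
# Totals from the release-notes table (Releases 145 and 146).
rel145 = {"base_pairs": 44575745176, "entries": 40604319}
rel146 = {"base_pairs": 46849831226, "entries": 42734478}

new_base_pairs = rel146["base_pairs"] - rel145["base_pairs"]
new_records = rel146["entries"] - rel145["entries"]

updated_records = 489419      # records updated during the same period
quoted_daily_average = 42250  # "about 42,250" new and/or updated per day

# New plus updated records, and the number of days the average implies.
touched = new_records + updated_records
implied_days = touched / quoted_daily_average

print(new_base_pairs)          # 2274086050
print(new_records)             # 2130159
print(touched)                 # 2619578
print(round(implied_days, 1))  # 62.0
```

An implied period of about 62 days is consistent with "nearly nine weeks"
(63 days).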

  Between releases 145.0 and 146.0, the WGS component of GenBank grew by
3,067,043,856 basepairs and by 701,367 sequence records.

  As a general guideline, we suggest first transferring the GenBank release
notes (gbrel.txt) whenever a release is being obtained. Check to make sure
that the date and release number in the header of the release notes are
current (e.g., February 15 2005, 146.0). If they are not, interrupt the
remaining transfers and then request assistance from the NCBI Service Desk.

  A comprehensive check of the headers of all release files after your
transfers are complete is also suggested. Here's how one might go about
this on a Unix platform with csh/tcsh:

	set files = `ls gb*.*`
	foreach i ($files)
		head -10 $i | grep Release
	end

Or, if the files are compressed, perhaps:

	gzcat $i | head -10 | grep Release
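
  For sites that prefer a script to an interactive shell loop, the same check
can be sketched in Python. This is a minimal illustration, not an NCBI tool.
The function names are hypothetical, and it assumes that compressed files carry
a .gz suffix and that the release number appears within the first ten lines of
each file, as in the shell commands above.

```python
import gzip
from itertools import islice
from pathlib import Path


def release_lines(path, n=10):
    """Return the lines among the first n of a file that mention 'Release'."""
    path = Path(path)
    # Assumption: compressed files end in .gz; anything else is plain text.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="ascii", errors="replace") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)
                if "Release" in line]


def stale_files(directory, expected="Release 146.0"):
    """List gb*.* files whose header does not mention the expected release."""
    stale = []
    for path in sorted(Path(directory).glob("gb*.*")):
        if not any(expected in line for line in release_lines(path)):
            stale.append(path.name)
    return stale
```

Calling stale_files(".") in the download directory should return an empty
list when every file header carries the expected release number.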

  For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 146.0 and Upcoming Changes) have been appended below.

  Release 146.0 data, and subsequent updates, are available now via
NCBI's Entrez and Blast services.

  If you encounter problems while ftp'ing or uncompressing Release
146.0, please send email outlining your difficulties to:

	info at ncbi.nlm.nih.gov 

Mark Cavanaugh, Vladimir Alekseyev, Aleksey Vysokolov, Michael Kimelman

1.3 Important Changes in Release 146.0

1.3.1 Organizational changes

  The total number of sequence data files increased by 37 with this release:

  - the EST division is now comprised of 377 files (+22)
  - the GSS division is now comprised of 138 files (+6)
  - the HTG division is now comprised of  63 files (+1)
  - the PLN division is now comprised of  15 files (+2)
  - the ROD division is now comprised of  16 files (+1)
  - the STS division is now comprised of   9 files (+4)
  - the VRT division is now comprised of   8 files (+1)

1.3.2 Continuous ranges of secondary accessions

  With the removal of sequence length limits, some genomes (typically
bacterial) that had been split into many pieces are gradually being
replaced by a single sequence record. U00096 is a good example.

  When this happens, the accessions of the former small pieces become
secondary accessions for the single large sequence record. When each
secondary is separately listed, the ACCESSION line becomes excessively long.

  As of this February 2005 GenBank Release, continuous ranges of secondary
accessions (represented by a start accession, a dash character, and an end
accession) will begin to appear, initially within the GenBank Updates. In
the case of U00096, the ACCESSION line would look like:

	ACCESSION   U00096 AE000111-AE000510
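
  Software that parses ACCESSION lines will need to expand such ranges. The
Python sketch below shows one way this might be done. It is an illustration
under stated assumptions, not a specification: it assumes that both ends of a
range share the same letter prefix and the same number of digits, and the
function names are hypothetical.

```python
import re

# An accession here is a run of capital letters followed by a run of digits.
_ACCESSION = re.compile(r"^([A-Z]+)(\d+)$")


def expand_accession_token(token):
    """Expand 'AE000111-AE000510' into its accessions; pass others through."""
    if "-" not in token:
        return [token]
    start, end = token.split("-", 1)
    m_start, m_end = _ACCESSION.match(start), _ACCESSION.match(end)
    if not (m_start and m_end):
        raise ValueError(f"unrecognized accession range: {token}")
    prefix, first_digits = m_start.groups()
    end_prefix, last_digits = m_end.groups()
    # Assumption: a range never spans two prefixes or two digit widths.
    if prefix != end_prefix or len(first_digits) != len(last_digits):
        raise ValueError(f"inconsistent accession range: {token}")
    first, last = int(first_digits), int(last_digits)
    if first > last:
        raise ValueError(f"descending accession range: {token}")
    width = len(first_digits)
    return [f"{prefix}{n:0{width}d}" for n in range(first, last + 1)]


def parse_accession_line(line):
    """Return (primary, [secondaries]) for a single ACCESSION line."""
    tokens = line.split()[1:]  # drop the 'ACCESSION' keyword
    secondaries = []
    for token in tokens[1:]:
        secondaries.extend(expand_accession_token(token))
    return tokens[0], secondaries


primary, secondaries = parse_accession_line(
    "ACCESSION   U00096 AE000111-AE000510")
print(primary, len(secondaries))  # U00096 400
```

Lines that list each secondary separately, as in earlier releases, pass
through unchanged.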

1.3.3 GSS File Header Problem

  GSS sequences at GenBank are maintained in two different systems, depending
on their origin, and the dumps from those systems occur in parallel. Because
the second dump (for example) has no prior knowledge of exactly how many GSS
files will be dumped from the first, it does not know how to number its own
output files.

  There is thus a discrepancy between the filenames and file headers for
twenty-four GSS flatfiles in Release 146.0. Consider gbgss115.seq:

GBGSS1.SEQ           Genetic Sequence Data Bank
                          February 15 2005

                NCBI-GenBank Flat File Release 146.0

                           GSS Sequences (Part 1)

   87937 loci,    65332512 bases, from    87937 reported sequences

  Here, the filename and part number in the header are both "1", though the
file has been renamed as "115" based on the number of files dumped from the
other system.  We will work to resolve this discrepancy in future releases, but
it is a lower priority than many other tasks.
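
  Scripts that validate file headers against filenames may want to detect
this condition rather than fail on it. The Python sketch below is one
hypothetical way to do so. It assumes the header layout shown above (the
filename as the first word of the first line, and a "(Part N)" label on a
later line); the function names are not part of any NCBI tool.

```python
import re


def header_filename(first_line):
    """Return the filename recorded in the first header line, lowercased."""
    return first_line.split()[0].lower()


def part_number(header_lines):
    """Return N from a '(Part N)' label in the header, or None if absent."""
    for line in header_lines:
        match = re.search(r"\(Part (\d+)\)", line)
        if match:
            return int(match.group(1))
    return None


def filename_number(filename):
    """Return the numeric suffix of a name like 'gbgss115.seq', or None."""
    match = re.search(r"(\d+)\.seq$", filename.lower())
    return int(match.group(1)) if match else None


def header_mismatch(filename, header_lines):
    """True when the name inside the header differs from the actual filename."""
    return header_filename(header_lines[0]) != filename.lower()


header = [
    "GBGSS1.SEQ           Genetic Sequence Data Bank",
    "                          February 15 2005",
    "",
    "                NCBI-GenBank Flat File Release 146.0",
    "",
    "                           GSS Sequences (Part 1)",
]
print(header_mismatch("gbgss115.seq", header))               # True
print(part_number(header), filename_number("gbgss115.seq"))  # 1 115
```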

1.4 Upcoming Changes

1.4.1 New ENV Division in April 2005

  A new division for sequences obtained via environmental sampling methods
will be introduced with GenBank Release 147.0 in April 2005. Records in this
new division will have these characteristics:

  1. ENV division code on the LOCUS line
  2. ENV keyword
  3. /environmental_sample qualifier in the source feature

This new division will segregate sequences for which the source organism is
unknown, or can only be inferred by sequence comparison.

  Sequences from WGS projects that involve environmental sampling will *not*
be distributed via this new division. All WGS projects will continue to be
distributed using project-specific data files at the NCBI FTP site:


  Additional information about the new ENV division will be provided via
these release notes and the GenBank newsgroup.  

1.4.2 Removal of MEDLINE linetype in April 2005

The PUBMED linetype was introduced in December of 1997, as a means of
linking references in sequence records to the PubMed biomedical literature
database, based on a PubMed ID (PMID).

Since then, we have been displaying both the PMID and its predecessor
(Medline Unique ID / MUID) for all references. For example:

LOCUS       ECOGUABA                3531 bp    DNA     linear   BCT
DEFINITION  Escherichia coli guaBA operon operon, complete sequence.
ACCESSION   M10101 M10102
VERSION     M10101.1  GI:146274
REFERENCE   1  (bases 1768 to 3531)
  AUTHORS   Tiedeman,A.A., Smith,J.M. and Zalkin,H.
  TITLE     Nucleotide sequence of the guaA gene encoding GMP synthetase of
            Escherichia coli K12
  JOURNAL   J. Biol. Chem. 260 (15), 8676-8679 (1985)
  MEDLINE   85261223
   PUBMED   3894345

Subsequent to 1997, PMID article identifiers subsumed MUIDs. Some background
information about that evolution can be found at:


Starting with GenBank Release 147.0 in April of 2005, the older MEDLINE
linetype will be displayed in GenBank sequence records only for (very rare)
articles that lack a PMID identifier.

For the vast majority of cases, this means that the MEDLINE linetype will
no longer be displayed; only the PUBMED identifier will be presented.
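
Parsers that currently rely on the MEDLINE linetype may need to treat it as
optional. The Python sketch below illustrates one possible approach for a
single REFERENCE block: read whichever identifiers are present and prefer the
PUBMED value. The function names are hypothetical, and the second sample is
simply the example above with its MEDLINE line removed.

```python
def literature_ids(reference_text):
    """Collect MEDLINE and PUBMED identifiers from one REFERENCE block."""
    ids = {"MEDLINE": None, "PUBMED": None}
    for line in reference_text.splitlines():
        parts = line.split()
        # An identifier line is a linetype keyword followed by one number.
        if len(parts) == 2 and parts[0] in ids and parts[1].isdigit():
            ids[parts[0]] = parts[1]
    return ids


def preferred_id(ids):
    """Prefer the PubMed ID; fall back to the MUID only when no PMID exists."""
    for linetype in ("PUBMED", "MEDLINE"):
        if ids[linetype] is not None:
            return linetype, ids[linetype]
    return None


current = """REFERENCE   1  (bases 1768 to 3531)
  JOURNAL   J. Biol. Chem. 260 (15), 8676-8679 (1985)
  MEDLINE   85261223
   PUBMED   3894345
"""
# Same reference with the MEDLINE line dropped.
without_medline = """REFERENCE   1  (bases 1768 to 3531)
  JOURNAL   J. Biol. Chem. 260 (15), 8676-8679 (1985)
   PUBMED   3894345
"""

print(preferred_id(literature_ids(current)))          # ('PUBMED', '3894345')
print(preferred_id(literature_ids(without_medline)))  # ('PUBMED', '3894345')
```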
