GenBank Release 128.0 Available

Mark Cavanaugh cavanaug at ncbi.nlm.nih.gov
Thu Feb 21 18:02:38 EST 2002

Greetings GenBank Users,

  GenBank Release 128.0 is now available via ftp from the National Center
for Biotechnology Information (NCBI):

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 128.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 128.0

  Uncompressed, the Release 128.0 flatfiles require roughly 60.91 GB
(sequence files only) or 67.46 GB (including the 'index' files).  The
ASN.1 version requires roughly 53.95 GB. From the release notes:

   Release  Date       Base Pairs   Entries

   127      Dec 2001   15849921438  14976310
   128      Feb 2002   17089143893  15465325

  Close-of-data was 02/13/2002. Six working days were required to prepare
this release. In the eight-week period between close-of-data for GenBank
releases 127.0 and 128.0, GenBank grew by 1.239 billion basepairs and by
489,015 sequence records. During that same period, 218,540 records were
updated. Combined, this yields an average of nearly 12,000 new/updated
records per day.
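
  The growth figures above can be re-derived from the two table rows. The
following is a minimal sketch, not part of the release notes. The 56-day
span is an assumption based on the stated eight-week period, and all
variable names are illustrative.

```python
# Totals copied from the release-notes table above.
REL_127 = {"base_pairs": 15_849_921_438, "entries": 14_976_310}
REL_128 = {"base_pairs": 17_089_143_893, "entries": 15_465_325}

UPDATED_RECORDS = 218_540  # records updated during the same period
DAYS = 8 * 7               # assumption: "eight-week period" taken as 56 days

new_base_pairs = REL_128["base_pairs"] - REL_127["base_pairs"]
new_records = REL_128["entries"] - REL_127["entries"]
per_day = (new_records + UPDATED_RECORDS) / DAYS

print(f"new base pairs: {new_base_pairs:,}")
print(f"new records:    {new_records:,}")
print(f"new/updated records per day: {per_day:,.0f}")
```

  The daily average depends on the exact number of days between the two
close-of-data dates, which this announcement does not state.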

  We would like to remind our users that GenBank mirrors are available
at ftp://genbank.sdsc.edu/pub and ftp://bio-mirror.net/biomirror/genbank .
Those who experience slow FTP transfers of large files (entire releases, the
GenBank Cumulative Update, etc.) might realize an improvement in transfer
rates from these alternate sites when traffic at the NCBI is high.

  For additional release information, see the README files in either of the
directories mentioned above, and the release notes (gbrel.txt) in the
genbank directory. Sections 1.3 and 1.4 of the release notes (Changes in
Release 128.0 and Upcoming Changes) have been appended below.

  Release 128.0 data are currently available via NCBI's Entrez and Blast
servers, and the 'query' email server.

  New GenBank cumulative update files (gbcu.flat.Z and gbcu.aso.Z), containing
only those entries new/updated since the Release 128.0 close-of-data, should be
available by 10:00am EST, February 20. Please note that the new CUs will be
smaller than previous versions you might have obtained after Release 127.0 was

  If you encounter problems while ftp'ing or uncompressing Release 128.0,
please send email outlining your difficulties to info at ncbi.nlm.nih.gov .

Mark Cavanaugh, Vladimir Alekseyev, Anton Butanaev

1.3 Important Changes in Release 128.0

1.3.1 Organizational changes

  Due to database growth, the BCT division is now being split into 5 pieces.

  Due to database growth, the EST division is now being split into 148 pieces.

  Due to database growth, the GSS division is now being split into 49 pieces.

  Due to database growth, the HTG division is now being split into 29 pieces.

  Due to database growth, the INV division is now being split into 5 pieces.

  Due to database growth, the PAT division is now being split into 4 pieces.

  Due to database growth, the PRI division is now being split into 17 pieces.

1.3.2 GSS File Header Problem

  GSS sequences at GenBank are maintained in one of two different systems,
depending on their origin. One recent change to release processing involves
the parallelization of the dumps from those systems. Because the second dump
(for example) has no prior knowledge of exactly how many GSS files will be
dumped from the first, it doesn't know how to number its own output files.
There is thus a discrepancy between the filenames and file headers of eight
GSS flatfiles in Release 128.0. Consider the gbgss42.seq file:

GBGSS1.SEQ           Genetic Sequence Data Bank
                          February 15 2002

                 NCBI-GenBank Flat File Release 128

                           GSS Sequences (Part 1)

  Here, the part number in the header is "1", though the file has been
renamed as "42" based on the files dumped from the other system. We will
work to resolve this discrepancy in future releases, but the priority is
admittedly much lower than many other tasks.
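
  Software that trusts the header part number may want to detect this
mismatch. The sketch below compares the number in a filename with the
number in the header text. The function names are hypothetical, and the
regular expressions assume the header layout shown above.

```python
import re

def filename_part(filename):
    """Return the numeric part of a name like 'gbgss42.seq', or None."""
    m = re.fullmatch(r"gbgss(\d+)\.seq", filename.lower())
    return int(m.group(1)) if m else None

def header_part(header_text):
    """Return N from a 'GSS Sequences (Part N)' header line, or None."""
    m = re.search(r"GSS Sequences \(Part (\d+)\)", header_text)
    return int(m.group(1)) if m else None

# Header text copied from the gbgss42.seq example above.
sample_header = """\
GBGSS1.SEQ           Genetic Sequence Data Bank
                          February 15 2002

                 NCBI-GenBank Flat File Release 128

                           GSS Sequences (Part 1)
"""

name_no = filename_part("gbgss42.seq")
head_no = header_part(sample_header)
if name_no != head_no:
    print(f"mismatch: filename says part {name_no}, header says part {head_no}")
```

  For the affected files, the filename is the value to rely on, since the
header number reflects only the position within one of the two dumps.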

1.4 Upcoming Changes

1.4.1 New CONSRTM linetype for references.

  In order to capture the names of consortia and other groups that are involved
in large-scale sequencing projects, a new linetype called CONSRTM will become
legal in the REFERENCE block of the GenBank flatfile format as of June 2002.

  Consider, for example, the literature citation associated with PubMed
identifier 11237011:

  Nature 2001 Feb 15;409(6822):860-921
  Initial sequencing and analysis of the human genome.

  In addition to the very long list of author names, a consortium is associated
with this publication:

  International Human Genome Sequencing Consortium

  With the addition of a CONSRTM linetype, collective names like this will
have a dedicated location in the flatfile format. Records which currently
attempt to force consortium names into the last entry of the AUTHORS line
will be updated to utilize the new linetype.

  Note that multiple consortia for a REFERENCE may exist, in which case
they will be separated by a semi-colon. It is also possible that references
with a CONSRTM linetype will not have any individual AUTHORS at all.
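
  Parsers that read REFERENCE blocks will need to handle the new linetype.
The sketch below is one possible approach, not a specification. It assumes
CONSRTM follows the same 12-column keyword layout as AUTHORS and TITLE and
that long values wrap onto continuation lines. The sample record and the
second consortium name ("Example Sequencing Group") are made up for
illustration.

```python
def parse_reference(block):
    """Map each linetype in a REFERENCE block to its joined value."""
    fields = {}
    key = None
    for line in block.splitlines():
        if not line.strip():
            continue
        # Assumption: keyword in the first 12 columns, value after that.
        tag, value = line[:12].strip(), line[12:].strip()
        if tag:
            key = tag
            fields[key] = value
        elif key is not None:
            fields[key] += " " + value  # continuation line
    return fields

def consortia(fields):
    """Split a CONSRTM value on semicolons; empty list if absent."""
    raw = fields.get("CONSRTM", "")
    return [name.strip() for name in raw.split(";") if name.strip()]

# Hypothetical record with two consortia and no individual AUTHORS.
SAMPLE = (
    "REFERENCE   1  (bases 1 to 2858)\n"
    "  CONSRTM   International Human Genome Sequencing Consortium; Example\n"
    "            Sequencing Group\n"
    "  TITLE     Initial sequencing and analysis of the human genome\n"
)

ref = parse_reference(SAMPLE)
print(consortia(ref))
print("has AUTHORS:", "AUTHORS" in ref)
```

  Treating AUTHORS as optional, as done here with a plain dictionary lookup,
covers the case of references that carry only a CONSRTM line.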

1.4.2 New REFERENCE type for on-line journals

  Agreement was reached at the May 1999 collaborative DDBJ/EMBL/GenBank
meeting that an effort should be made to accommodate references which are
published only on-line. Until specifications for such references are
available from library organizations, GenBank will present them in the
following manner, starting with GenBank Release 129.0 in April 2002:

	REFERENCE   1  (bases 1 to 2858)
	  AUTHORS   Smith, J.
	  TITLE     Cloning and expression of a phospholipase gene
	  JOURNAL   Online Publication
	  REMARK    Online-Journal-name; Article Identifier; URL
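
  The REMARK line above packs three fields into one semicolon-separated
value. A minimal sketch of splitting it follows. The class and function
names are hypothetical, the sample value is invented, and the exact
delimiter handling is an assumption, since no formal specification is
given here.

```python
from typing import NamedTuple

class OnlineRef(NamedTuple):
    journal: str
    article_id: str
    url: str

def parse_online_remark(remark):
    """Split 'Journal; Article Identifier; URL' into its three fields."""
    # Split at most twice so semicolons inside the URL are preserved.
    parts = [p.strip() for p in remark.split(";", 2)]
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected three fields, got: {remark!r}")
    return OnlineRef(*parts)

# Invented example value, for illustration only.
ref = parse_online_remark(
    "Example Online Journal; art-0001; http://example.org/a?x=1;y=2"
)
print(ref.journal)
print(ref.article_id)
print(ref.url)
```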

1.4.3 Selenocysteine representation

  Selenocysteine residues within the protein translations of coding
region features have been represented in GenBank via the letter 'X'
and a /transl_except qualifier. At the May 1999 DDBJ/EMBL/GenBank
collaborative meeting, it was learned that IUPAC plans to adopt the
letter 'U' for selenocysteine.

  DDBJ, EMBL, and GenBank will thus use this new amino acid abbreviation
in their /translation qualifiers. Although a timetable for its appearance
has not been finalized, we are mentioning this now because the introduction
of a new residue abbreviation is a fairly fundamental change.

  Details about the use of 'U' will be made available via these release
notes and the GenBank newsgroup as they become available.
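
  Software that validates protein translations against a fixed alphabet
may reject 'U' once it appears. The sketch below shows one way to make the
accepted alphabet configurable. The function name is hypothetical, and the
base alphabet used here (the 20 standard residues plus 'X') is an
assumption; the set of letters actually permitted in /translation values
is not stated in these notes.

```python
# Assumption: 20 standard one-letter residue codes.
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")

def unexpected_residues(translation, allow_u=True):
    """Return the sorted letters in a translation that are not accepted."""
    allowed = STANDARD | {"X"}
    if allow_u:
        allowed.add("U")  # selenocysteine, per the planned IUPAC letter
    return sorted(set(translation.upper()) - allowed)

# Invented peptide fragment containing a selenocysteine.
print(unexpected_residues("MKULX"))
print(unexpected_residues("MKULX", allow_u=False))
```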

