GenBank Release 127.0 Available

Mark Cavanaugh cavanaug at zeus.nlm.nih.gov
Fri Dec 21 18:01:29 EST 2001

Greetings GenBank Users,

  GenBank Release 127.0 is now available via ftp from the National Center
for Biotechnology Information (NCBI):

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 127.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 127.0

  Uncompressed, the Release 127.0 flatfiles require roughly 58.12 GB
(sequence files only) or 65.72 GB (including the 'index' files).  The
ASN.1 version requires roughly 51.75 GB. From the release notes:

   Release  Date       Base Pairs   Entries

   126      Oct 2001   14396883064  13602262
   127      Dec 2001   15849921438  14976310

  Close-of-data was 12/17/2001. Four business days were required to prepare
this release. In the eight-week period between close-of-data for GenBank
releases 126.0 and 127.0, GenBank grew by 1.453 billion basepairs and by
1,374,048 sequence records. These single-release increases are the second
largest and the very largest in the database's history, respectively.
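
  The growth figures quoted above follow directly from the two rows of the
release-notes table. A minimal sketch of the subtraction, with the dictionary
layout and variable names chosen here purely for illustration:

```python
# Totals copied from the Release 126 and Release 127 rows of the table above.
releases = {
    126: {"base_pairs": 14_396_883_064, "entries": 13_602_262},
    127: {"base_pairs": 15_849_921_438, "entries": 14_976_310},
}

# Single-release growth is the difference between consecutive releases.
bp_growth = releases[127]["base_pairs"] - releases[126]["base_pairs"]
entry_growth = releases[127]["entries"] - releases[126]["entries"]

print(f"{bp_growth:,} base pairs")   # 1,453,038,374 (about 1.453 billion)
print(f"{entry_growth:,} records")   # 1,374,048
```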


  The new LOCUS line format for GenBank flatfiles, announced in April of
this year, has been introduced with this release. Complete details about
the format change can be found in Section 1.3 of the release notes (see below).
The new format will also be utilized by the GenBank Updates, starting with
the nc1222 update.

			GSS File Header Problem

  GSS sequences at GenBank are maintained in one of two different systems,
depending on their origin. One recent change to release processing involves
the parallelization of the dumps from those systems. Because the second dump
(for example) has no prior knowledge of exactly how many GSS files will be
dumped from the first, it doesn't know how to number its own output files.
There is thus a discrepancy between the filenames and file headers of eight
GSS flatfiles in Release 127.0. Consider the gbgss41.seq file:

GBGSS1.SEQ           Genetic Sequence Data Bank
                          December 15 2001

                 NCBI-GenBank Flat File Release 127

                           GSS Sequences (Part 1)

  Here, the part number in the header is "1", though the file has been
renamed as "41" based on the files dumped from the other system. We will
work to resolve this discrepancy in future releases, but the priority is
admittedly much lower than other tasks.
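
  Software that relies on the part number may want to detect the mismatch
rather than trust either source. The following is a rough sketch only: the
function names are hypothetical, and it assumes that GSS flatfiles are named
gbgssN.seq and that the header carries a "(Part N)" line, as in the example
above.

```python
import re

# Header text as shown above for the gbgss41.seq file.
HEADER = """\
GBGSS1.SEQ           Genetic Sequence Data Bank
                          December 15 2001

                 NCBI-GenBank Flat File Release 127

                           GSS Sequences (Part 1)
"""

def part_from_filename(filename: str) -> int:
    """Return the numeric suffix of a name such as 'gbgss41.seq'."""
    match = re.fullmatch(r"gbgss(\d+)\.seq", filename, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a GSS flatfile name: {filename!r}")
    return int(match.group(1))

def part_from_header(header: str) -> int:
    """Return N from the '(Part N)' line of a flatfile header."""
    match = re.search(r"\(Part\s+(\d+)\)", header)
    if match is None:
        raise ValueError("no '(Part N)' line found in header")
    return int(match.group(1))

name_part = part_from_filename("gbgss41.seq")
header_part = part_from_header(HEADER)
if name_part != header_part:
    print(f"mismatch: filename says {name_part}, header says {header_part}")
```

  For the affected files, the number in the filename is the one that reflects
the file's position in the release.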

  We would like to remind our users that GenBank mirrors are available
at ftp://genbank.sdsc.edu/pub and ftp://bio-mirror.net/biomirror/genbank .
Those who experience slow FTP transfers of large files (entire releases, the
GenBank Cumulative Update, etc.) might realize an improvement in transfer
rates from these alternate sites when traffic at the NCBI is high.

  For additional release information, see the README files in either of the
directories mentioned above, and the release notes (gbrel.txt) in the
genbank directory. Sections 1.3 and 1.4 of the release notes (Changes in
Release 127.0 and Upcoming Changes) have been appended below.

  Release 127.0 data are currently available via NCBI's Entrez and Blast
servers, and the 'query' email server.

  New GenBank cumulative update files (gbcu.flat.Z and gbcu.aso.Z), containing
only those entries new/updated since the Release 127.0 close-of-data, should be
available by 07:00am EST, December 22. Please note that the new CUs will be
smaller than previous versions you might have obtained after Release 126.0 was
released.

  If you encounter problems while ftp'ing or uncompressing Release 127.0,
please send email outlining your difficulties to info at ncbi.nlm.nih.gov .

Mark Cavanaugh, Vladimir Alekseyev, Anton Butanaev

1.3 Important Changes in Release 127.0

1.3.1 Organizational changes

  Due to database growth, the EST division is now being split into 142 pieces.

  Due to database growth, the GSS division is now being split into 48 pieces.

  Due to database growth, the PRI division is now being split into 16 pieces.

  Due to database growth, the HTG division is now being split into 26 pieces.

  Due to database growth, the PLN division is now being split into 5 pieces.

1.3.2 NCBI's ftp address has changed

  NCBI's FTP server has a new address:

	old address: ncbi.nlm.nih.gov
	new address: ftp.ncbi.nih.gov

  Although the old address still works, it is no longer officially supported.
All users of the NCBI FTP server should therefore either be using the new
address now or be actively switching to it. Please contact the NCBI
Service Desk if you have any questions about this change:

	info at ncbi.nlm.nih.gov

1.3.3 LOCUS line format change: to accommodate longer names and sequences

  When the LOCUS line format for the GenBank flatfile was designed nearly
two decades ago, sequences over 10 Mbp in length were not anticipated. As
a result, the maximum length of a LOCUS name was nine characters, and the
maximum length of a sequence was 9,999,999 bases.

  With this release, a new LOCUS line format has been introduced which
accommodates names of up to sixteen characters and sequences as long as
99,999,999,999 bases:

1       10        20        30        40        50        60        70       79
LOCUS       16Char_LocusName 99999999999 bp ss-snoRNA  circular DIV DD-MMM-YYYY

Positions  Contents
---------  --------
01-05      LOCUS
06-12      spaces
13-28      Locus name
29-29      space
30-40      Length of sequence, right-justified
41-41      space
42-43      bp
44-44      space
45-47      spaces, ss- (single-stranded), ds- (double-stranded), or
           ms- (mixed-stranded)
48-53      NA, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA), 
           mRNA (messenger RNA), uRNA (small nuclear RNA), snRNA,
           snoRNA. Left justified.
54-55      spaces
56-63      'linear' followed by two spaces, or 'circular'
64-64      space
65-67      The division code (see Section 3.3)
68-68      space
69-79      Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991)
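
  As a cross-check of the layout, the sketch below slices the template line
by column position. The dictionary and function names are illustrative only,
and the column ranges are simply those listed in the table (1-based,
inclusive); the separator columns are left out.

```python
# 1-based, inclusive column ranges for the data fields in the table above.
LOCUS_COLUMNS = {
    "keyword":  (1, 5),
    "name":     (13, 28),
    "length":   (30, 40),
    "unit":     (42, 43),
    "strand":   (45, 47),
    "molecule": (48, 53),
    "topology": (56, 63),
    "division": (65, 67),
    "date":     (69, 79),
}

def slice_locus(line: str) -> dict:
    """Cut a LOCUS line into fields by column position, stripping padding."""
    return {
        field: line[start - 1:end].strip()
        for field, (start, end) in LOCUS_COLUMNS.items()
    }

# The template line from above, rebuilt piece by piece so that the width of
# every field and separator is explicit.
template = (
    "LOCUS" + " " * 7 + "16Char_LocusName" + " " + "99999999999" + " "
    + "bp" + " " + "ss-" + "snoRNA" + " " * 2 + "circular" + " "
    + "DIV" + " " + "DD-MMM-YYYY"
)

fields = slice_locus(template)
print(len(template))     # 79
print(fields["name"])    # 16Char_LocusName
print(fields["length"])  # 99999999999
```

  Column slicing of this kind breaks whenever the spacing changes, so it is
probably best kept for validating the layout rather than for routine parsing.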

  This change solves several problems: a) meaningful names of more than
nine characters can now be utilized; b) LOCUS names for many segmented
sets of more than ten members will no longer be truncated; c) invalid
LOCUS lines will no longer be generated when very large sequences are
displayed in GenBank format (e.g., contig records such as NT_011520).

  Here's how two records now appear using the new LOCUS format:

LOCUS       AB000383                5423 bp    DNA     circular VRL 05-FEB-1999
DEFINITION  Leucania seperata nuclear polyhedrosis virus DNA for p13, xe,
            envelope protein, complete cds.

LOCUS       AF345888                 147 bp ss-RNA     linear   VRL 21-JUN-2001
DEFINITION  Chikungunya virus nonstructural protein 4 gene, partial cds.

  We encourage software developers to switch to a token-based LOCUS parsing
approach, rather than a column-specific approach. If this is done, then future
changes to the LOCUS line that affect only the spacing of its data values will
not require any modifications to software.
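
  One way a token-based parser might look is sketched below. This is not NCBI
code: the Locus type and parse_locus function are hypothetical, and the
sketch assumes that every nucleotide LOCUS line splits into exactly eight
whitespace-separated tokens, with any strandedness prefix attached to the
molecule type (as in "ss-RNA").

```python
from typing import NamedTuple, Optional

class Locus(NamedTuple):
    name: str
    length: int
    strand: Optional[str]   # "ss", "ds", "ms", or None when absent
    molecule: str
    topology: str
    division: str
    date: str

STRAND_PREFIXES = ("ss-", "ds-", "ms-")

def parse_locus(line: str) -> Locus:
    """Parse a LOCUS line by splitting on whitespace, ignoring columns."""
    tokens = line.split()
    if len(tokens) != 8 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError(f"unexpected LOCUS line: {line!r}")
    _, name, length, _, molecule, topology, division, date = tokens
    strand = None
    if molecule.startswith(STRAND_PREFIXES):
        # Separate the two-letter strandedness code from the molecule type.
        strand, molecule = molecule[:2], molecule[3:]
    return Locus(name, int(length), strand, molecule, topology, division, date)

first = parse_locus(
    "LOCUS       AB000383                5423 bp    DNA     circular VRL 05-FEB-1999"
)
second = parse_locus(
    "LOCUS       AF345888                 147 bp ss-RNA     linear   VRL 21-JUN-2001"
)
print(first.name, first.length, first.molecule, first.topology)
print(second.name, second.length, second.strand, second.molecule)
```

  Because only the order of the tokens matters, the same function accepts a
line whose fields have been re-spaced.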

1.4 Upcoming Changes

1.4.1 Selenocysteine representation

  Selenocysteine residues within the protein translations of coding
region features have been represented in GenBank via the letter 'X'
and a /transl_except qualifier. At the May 1999 DDBJ/EMBL/GenBank
collaborative meeting, it was learned that IUPAC plans to adopt the
letter 'U' for selenocysteine.

  DDBJ, EMBL, and GenBank will thus use this new amino acid abbreviation
for their /translation qualifiers. Although a timetable for its appearance
has not been finalized, we are mentioning this now because the introduction
of a new residue abbreviation is a fairly fundamental change.

  Details about the use of 'U' will be made available via these release
notes and the GenBank newsgroup as they become available.

1.4.2 New REFERENCE type for on-line journals

  Agreement was reached at the May 1999 collaborative DDBJ/EMBL/GenBank
meeting that an effort should be made to accommodate references which are
published only on-line. Until specifications for such references are
available from library organizations, GenBank will present them in a manner
like this:

	REFERENCE   1  (bases 1 to 2858)
	  AUTHORS   Smith, J.
	  TITLE     Cloning and expression of a phospholipase gene
	  JOURNAL   Online Publication
	  REMARK    Online-Journal-name; Article Identifier; URL

  This format is still tentative; additional information about this new
reference type will be made available via these release notes.


