GenBank Release 125.0 Available

Mark Cavanaugh cavanaug at ncbi.nlm.nih.gov
Fri Aug 24 15:43:17 EST 2001

Greetings GenBank Users,

  GenBank Release 125.0 is now available via ftp from the National Center
for Biotechnology Information (NCBI):

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 125.0 flatfiles
  ^^^^^^^^^^^^^^^^   ncbi-asn1   ASN.1 data used to create Release 125.0

  PLEASE NOTE : NCBI's FTP address has changed to the value shown
above. Further information can be found in Section 1.3 of the release
notes, appended below.

  Uncompressed, the Release 125.0 flatfiles require roughly 49.72 GB
(sequence files only) or 55.23 GB (including the 'index' files).  The
ASN.1 version requires roughly 45.06 GB. From the release notes:

   Release  Date       Base Pairs   Entries

   124      Jun 2001   12973707065  12243766
   125      Aug 2001   13543364296  12813516

  Close-of-data was 08/19/2001. Five business days were required to prepare
this release. In the eight-week period between close-of-data for GenBank
releases 124.0 and 125.0, GenBank grew by 0.570 billion basepairs and 569,750
sequence records.
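
  The growth figures quoted above follow directly from the two table rows.
As a quick check (a sketch only; the dictionary layout and variable names
are illustrative assumptions, the totals are copied from the table):

```python
# Totals copied from the release table above.
releases = {
    124: {"base_pairs": 12_973_707_065, "entries": 12_243_766},
    125: {"base_pairs": 13_543_364_296, "entries": 12_813_516},
}

bp_growth = releases[125]["base_pairs"] - releases[124]["base_pairs"]
entry_growth = releases[125]["entries"] - releases[124]["entries"]

print(f"{bp_growth / 1e9:.3f} billion basepairs")  # 0.570 billion basepairs
print(f"{entry_growth:,} sequence records")        # 569,750 sequence records
```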

  GSS File Header Problem : GSS sequences at GenBank are maintained in one
of two different systems, depending on their origin. One recent change to
release processing involves the parallelization of the dumps from the two
systems. Because the second dump (for example) has no prior knowledge of
exactly how many GSS files will be dumped from the first, it doesn't know how
to number its own output files. There is thus a discrepancy between the
filenames and file headers of six GSS flatfiles in Release 125.0. Consider
the gbgss35.seq file:

   GBGSS1.SEQ           Genetic Sequence Data Bank
                              August 15 2001

                    NCBI-GenBank Flat File Release 125

                              GSS Sequences (Part 1)

  Here, the part number in the header is "1", though the file has been
renamed as "35" based on the files dumped from the other system.
We will work to resolve this discrepancy in future releases, probably
by using a tool that modifies part numbers in flatfile headers, unless
we find that we can do away with flatfile headers entirely, which would
be a much simpler solution.
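
  Users who rely on the header part number may want to detect the affected
files themselves. The following is a minimal sketch, assuming the header's
first line starts with a name of the form GBGSSn.SEQ and the file is named
gbgssN.seq, as in the example above. The function names are hypothetical
and are not part of any NCBI tool:

```python
import re
from pathlib import Path

def header_part_number(first_line):
    """Return the part number from a header line such as 'GBGSS1.SEQ ...'."""
    match = re.match(r"\s*GBGSS(\d+)\.SEQ\b", first_line, re.IGNORECASE)
    return int(match.group(1)) if match else None

def filename_part_number(filename):
    """Return the part number from a file name such as 'gbgss35.seq'."""
    name = Path(filename).name
    match = re.fullmatch(r"gbgss(\d+)\.seq", name, re.IGNORECASE)
    return int(match.group(1)) if match else None

def header_matches_filename(filename, first_line):
    """True when the file name and the header agree on the part number."""
    expected = filename_part_number(filename)
    return expected is not None and header_part_number(first_line) == expected

header = "GBGSS1.SEQ           Genetic Sequence Data Bank"
print(header_matches_filename("gbgss35.seq", header))  # False
print(header_matches_filename("gbgss1.seq", header))   # True
```

  In practice the first line would be read from each downloaded file; only
the comparison logic is shown here.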

  We would like to remind our users that GenBank mirrors are available
at ftp://genbank.sdsc.edu/pub and ftp://bio-mirror.net/biomirror/genbank .
Those who experience slow FTP transfers of large files (entire releases, the
GenBank Cumulative Update, etc.) might realize an improvement in transfer
rates from these alternate sites when traffic at the NCBI is high.

  For additional release information, see the README files in either of the
directories mentioned above, and the release notes (gbrel.txt) in the
genbank directory. Sections 1.3 and 1.4 of the release notes (Changes in
Release 125.0 and Upcoming Changes) have been appended below.


  One of the changes described in Section 1.4 is a redefinition of the
LOCUS line of the GenBank flatfile format, to be introduced in December
of 2001. Every record in GenBank will be affected. If you parse the LOCUS
line of the flatfile, please pay special attention to this upcoming change!

  Release 125.0 data are currently available via NCBI's Entrez and Blast
servers, and the 'query' email server.

  New GenBank cumulative update files (gbcu.flat.Z and gbcu.aso.Z), containing
only those entries new/updated since the Release 125.0 close-of-data, should be
available by 07:00am EDT, August 25. Please note that the new CUs will be
smaller than previous versions you might have obtained after Release 124.0 was
released.

  If you encounter problems while ftp'ing or uncompressing Release 125.0,
please send email outlining your difficulties to info at ncbi.nlm.nih.gov .

Mark Cavanaugh, Vladimir Alekseyev, Anton Butanaev

1.3 Important Changes in Release 125.0

1.3.1 Organizational changes

  Due to database growth, the EST division is now being split into 123 pieces.

  Due to database growth, the GSS division is now being split into 40 pieces.

  Due to database growth, the PRI division is now being split into 13 pieces.

1.3.2 NCBI's ftp address has changed

  NCBI's FTP server has a new address:

	old address: ncbi.nlm.nih.gov
	new address: ftp.ncbi.nih.gov

  For the moment, the old address still works. But due to the volume of FTP
traffic at our site, we cannot continue to support the old address much beyond
October 15, 2001.

(In fact, if FTP traffic levels increase substantially, we may not be able to
 continue supporting the old address even that long.)

  So we urge all users of the NCBI FTP server to switch to the new address
sooner rather than later. Please contact the NCBI Service Desk if you have
any questions about this change:

	info at ncbi.nlm.nih.gov

1.4 Upcoming Changes

1.4.1 LOCUS line format change : to accommodate longer names and sequences

  When the LOCUS line format for the GenBank flatfile was designed nearly
two decades ago, sequences over 10 Mbp in length were not anticipated. As
a result, the maximum length of a LOCUS name is nine characters, and the
maximum length of a sequence is 9,999,999 bases :

1       10        20        30        40        50        60        70       79
LOCUS       AB000383     5423 bp    DNA   circular  VRL       05-FEB-1999

Positions  Contents
---------  --------
01-05      LOCUS
06-12      spaces
13-21      Locus name
22-22      space
23-29      Length of sequence, right-justified
31-32      bp
34-36      Blank, ss- (single-stranded), ds- (double-stranded), or
           ms- (mixed-stranded)
37-42      Blank, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA), 
           mRNA (messenger RNA), uRNA (small nuclear RNA), snRNA
43-52      Blank (implies linear) or circular
53-55      The division code (see Section 3.3)
63-73      Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991)
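
  For reference, a column-specific parser for this original format might
look like the sketch below. The field names and the function name
parse_old_locus are illustrative assumptions; the column ranges are taken
from the table above:

```python
# Old-format fields as (name, first column, last column), 1-based inclusive,
# copied from the position table above.
OLD_LOCUS_FIELDS = [
    ("locus_name", 13, 21),
    ("length", 23, 29),
    ("strand", 34, 36),
    ("molecule", 37, 42),
    ("topology", 43, 52),
    ("division", 53, 55),
    ("date", 63, 73),
]

def parse_old_locus(line):
    """Slice an old-format LOCUS line by column and strip the padding."""
    if not line.startswith("LOCUS"):
        raise ValueError("not a LOCUS line")
    record = {name: line[first - 1:last].strip()
              for name, first, last in OLD_LOCUS_FIELDS}
    record["length"] = int(record["length"])
    # A blank topology field implies a linear molecule.
    record["topology"] = record["topology"] or "linear"
    return record

sample = "LOCUS       AB000383     5423 bp    DNA   circular  VRL       05-FEB-1999"
print(parse_old_locus(sample))
```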

  This has led to several problems: a) meaningful names of more than nine
characters cannot be utilized; b) the nine-character limit causes LOCUS
names to be truncated for many segmented sets of more than ten members
(see AF272557, AF272558, etc.); c) invalid LOCUS lines result when the
GenBank flatfile format is used to display other types of sequence data.

  For (c), consider human contig Hs22_11677 derived from primary (archival)
sequences in the HTG division of GenBank:

LOCUS       Hs22_1167722998459 bp    DNA            PRI       10-FEB-2001
DEFINITION  Homo sapiens chromosome 22 working draft sequence segment.

  The LOCUS name ( Hs22_11677 ) collides with the sequence length ( 22998459 )
due to the restrictions of the LOCUS line format.
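
  To illustrate the consequence for column-specific software, the short
sketch below slices the line above at the old name and length columns
(13-21 and 23-29). The variable names are illustrative only. Both values
come out truncated, and the length still converts to an integer, so the
error would go unnoticed:

```python
line = "LOCUS       Hs22_1167722998459 bp    DNA            PRI       10-FEB-2001"

name_field = line[12:21]    # columns 13-21 of the old format
length_field = line[22:29]  # columns 23-29 of the old format

print(repr(name_field))     # 'Hs22_1167'  (real name: Hs22_11677)
print(int(length_field))    # 2299845      (real length: 22998459)
```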

  To address the LOCUS problems, a new LOCUS line format which allows names
of up to 16 characters, sequences of up to 99,999,999,999 bases, and a uniform
number of data values (eight) will be utilized for all GenBank records starting
with Release 127.0 in December 2001.

  There have been several changes to this new format since it was originally
announced in April 2001. Because a new molecule type of snoRNA was approved at
a recent GenBank/EMBL/DDBJ collaborative meeting, the molecule type has been
increased to 6 characters. An additional space was reserved in case 7 character
molecule types ever appear. This reduced the space available for LOCUS names
from 18 characters to 16 characters. Lastly, we realized that the presence of
spaces for linear molecules and 'circular' for circular molecules makes simple
token-based parsing of the LOCUS line a harder task. So 'linear' will be
present in the new LOCUS format.

  The last change is also intended to encourage software developers to switch
to a token-based LOCUS parsing approach, rather than a column-specific approach.
If this is done, then future changes to the LOCUS line that affect only the
spacing of its data values will not require any modifications to software.

  Because of these changes, we are delaying the introduction of the new format
by two more months (Release 127.0 in December 2001). Here is the revised LOCUS
line format:

1       10        20        30        40        50        60        70       79
LOCUS       16Char_LocusName 99999999999 bp ss-snoRNA  circular DIV DD-MMM-YYYY

Positions  Contents
---------  --------
01-05      LOCUS
06-12      spaces
13-28      Locus name
29-29      space
30-40      Length of sequence, right-justified
41-41      space
42-43      bp
44-44      space
45-47      spaces, ss- (single-stranded), ds- (double-stranded), or
           ms- (mixed-stranded)
48-53      NA, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA), 
           mRNA (messenger RNA), uRNA (small nuclear RNA), snRNA,
           snoRNA. Left justified.
54-55      spaces
56-63      'linear' followed by two spaces, or 'circular'
64-64      space
65-67      The division code (see Section 3.3)
68-68      space
69-79      Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991)
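
  Developers who write GenBank-style flatfiles may find it useful to see
the column positions above expressed as a formatter. This is a sketch
under the assumption that every field is padded exactly as the table
describes; the function name format_new_locus and its parameters are
hypothetical:

```python
def format_new_locus(name, length, molecule, division, date,
                     strand="", topology="linear"):
    """Build a 79-column LOCUS line using the revised column positions."""
    if len(name) > 16:
        raise ValueError("LOCUS name is limited to 16 characters")
    if not 0 <= length <= 99_999_999_999:
        raise ValueError("sequence length must fit in 11 digits")
    return (
        "LOCUS" + " " * 7                # columns 01-12
        + f"{name:<16} "                 # columns 13-28, then a space
        + f"{length:>11} bp "            # columns 30-40, space, 'bp', space
        + f"{strand:<3}{molecule:<6}  "  # columns 45-53, then two spaces
        + f"{topology:<8} "              # columns 56-63, then a space
        + f"{division:<3} {date}"        # columns 65-67, space, 69-79
    )

line = format_new_locus("AB000383", 5423, "DNA", "VRL", "05-FEB-1999",
                        topology="circular")
print(line)
print(len(line))  # 79
```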

  Here's how two existing records will appear using this new format:

LOCUS       AB000383                5423 bp    DNA     circular VRL 05-FEB-1999
DEFINITION  Leucania seperata nuclear polyhedrosis virus DNA for p13, xe,
            envelope protein, complete cds.

LOCUS       AF345888                 147 bp ss-RNA     linear   VRL 21-JUN-2001
DEFINITION  Chikungunya virus nonstructural protein 4 gene, partial cds.
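
  The token-based parsing approach encouraged above can be sketched as
follows. This is only an illustration: the function name parse_new_locus
and the dictionary keys are assumptions, and the code relies on the new
format always carrying eight whitespace-separated values:

```python
STRAND_PREFIXES = ("ss-", "ds-", "ms-")

def parse_new_locus(line):
    """Split a new-format LOCUS line on whitespace into its eight values."""
    tokens = line.split()
    if len(tokens) != 8 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError("unexpected LOCUS line: " + line)
    _, name, length, _, molecule, topology, division, date = tokens
    strand = ""
    if molecule.startswith(STRAND_PREFIXES):
        # The strandedness prefix is attached to the molecule type token.
        strand, molecule = molecule[:3], molecule[3:]
    return {
        "locus_name": name,
        "length": int(length),
        "strand": strand,
        "molecule": molecule,
        "topology": topology,
        "division": division,
        "date": date,
    }

samples = [
    "LOCUS       AB000383                5423 bp    DNA     circular VRL 05-FEB-1999",
    "LOCUS       AF345888                 147 bp ss-RNA     linear   VRL 21-JUN-2001",
]
for sample in samples:
    print(parse_new_locus(sample))
```

  Because the fields are located by position in the token list rather than
by column, the same code keeps working if only the spacing between values
changes in a later release.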

  Sample GenBank flatfiles utilizing the new LOCUS line format will be made
available after Releases 125.0 (August) and 126.0 (October), so that developers
can test software that parses GenBank flatfiles. Further announcements about
the LOCUS line change will be made via these release notes and the GenBank
newsgroup (bionet.molbio.genbank).

1.4.2 Selenocysteine representation

  Selenocysteine residues within the protein translations of coding
region features have been represented in GenBank via the letter 'X'
and a /transl_except qualifier. At the May 1999 DDBJ/EMBL/GenBank
collaborative meeting, it was learned that IUPAC plans to adopt the
letter 'U' for selenocysteine.

  DDBJ, EMBL, and GenBank will thus use this new amino acid abbreviation
for their /translation qualifiers. Although a timetable for its appearance
has not been finalized, we are mentioning this now because the introduction
of a new residue abbreviation is a fairly fundamental change.

  Details about the use of 'U' will be announced via these release
notes and the GenBank newsgroup as they become available.

1.4.3 New REFERENCE type for on-line journals

  Agreement was reached at the May 1999 collaborative DDBJ/EMBL/GenBank
meeting that an effort should be made to accommodate references which are
published only on-line. Until specifications for such references are
available from library organizations, GenBank will present them in a manner
like this:

	REFERENCE   1  (bases 1 to 2858)
	  AUTHORS   Smith, J.
	  TITLE     Cloning and expression of a phospholipase gene
	  JOURNAL   Online Publication
	  REMARK    Online-Journal-name; Article Identifier; URL

  This format is still tentative; additional information about this new
reference type will be made available via these release notes.


