GenBank Release 130.0 Available

Mark Cavanaugh cavanaug at ncbi.nlm.nih.gov
Wed Jun 26 02:36:20 EST 2002

Greetings GenBank Users,

  GenBank Release 130.0 is now available via ftp from the National Center
for Biotechnology Information (NCBI):

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 130.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 130.0

  Uncompressed, the Release 130.0 flatfiles require roughly 70.15 GB
(sequence files only) or 78.75 GB (including the 'short directory' and
'index' files).  The ASN.1 version requires roughly 61.72 GB. From the
release notes:

   Release  Date       Base Pairs   Entries

   129      Apr 2002   19072679701  16769983
   130      Jun 2002   20648748345  17471130

  Close-of-data was 06/20/2002. Four working days were required to prepare
this release. In the eight-week period between close-of-data for GenBank
releases 129.0 and 130.0, GenBank grew by 1.576 billion basepairs and by
701,147 sequence records. During that same period, 130,116 records were
updated. Combined, this yields an average of about 13,850 new/updated
records per day.
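
  The growth figures above follow directly from the release table. A minimal
sketch of the arithmetic is given here; the names RELEASES, growth and
daily_rate are illustrative only. The 60-day period passed to daily_rate is
an assumption, chosen because it reproduces the quoted average; the text
itself describes the interval only as an eight-week period.

```python
# Figures copied from the release table in this announcement.
RELEASES = {
    129: {"base_pairs": 19_072_679_701, "entries": 16_769_983},
    130: {"base_pairs": 20_648_748_345, "entries": 17_471_130},
}
UPDATED_RECORDS = 130_116  # records updated during the same period


def growth(prev, curr):
    """Return the base-pair and entry growth between two releases."""
    return {
        "base_pairs": RELEASES[curr]["base_pairs"] - RELEASES[prev]["base_pairs"],
        "entries": RELEASES[curr]["entries"] - RELEASES[prev]["entries"],
    }


def daily_rate(new_records, updated_records, days):
    """Average number of new/updated records per day over `days` days."""
    return (new_records + updated_records) / days


if __name__ == "__main__":
    delta = growth(129, 130)
    print(delta["base_pairs"])  # 1576068644, i.e. about 1.576 billion
    print(delta["entries"])     # 701147
    # days=60 is an assumption, not a figure stated in the announcement.
    print(round(daily_rate(delta["entries"], UPDATED_RECORDS, days=60)))  # 13854
```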

  We would like to remind our users that GenBank mirrors are available
at ftp://genbank.sdsc.edu/pub and ftp://bio-mirror.net/biomirror/genbank .
Those who experience slow FTP transfers of large files (entire releases, the
GenBank Cumulative Update, etc.) might realize an improvement in transfer
rates from these alternate sites when traffic at the NCBI is high.

  For additional release information, see the README files in either of the
directories mentioned above, and the release notes (gbrel.txt) in the
genbank directory. Sections 1.3 and 1.4 of the release notes (Changes in
Release 130.0 and Upcoming Changes) have been appended below.

  Release 130.0 data, and subsequent updates, are available now via NCBI's
Entrez and Blast services.

  New GenBank cumulative update files (gbcu.flat.Z and gbcu.aso.Z), containing
only those entries new/updated since the Release 130.0 close-of-data, should be
available by 10:00am EDT, June 26. Please note that the new CUs will be
smaller than previous versions you might have obtained after Release 129.0
became available.

  If you encounter problems while ftp'ing or uncompressing Release 130.0,
please send email outlining your difficulties to info at ncbi.nlm.nih.gov .

Mark Cavanaugh, Vladimir Alekseyev, Anton Butanaev

1.3 Important Changes in Release 130.0

1.3.1 Organizational changes

  Due to database growth, the EST division is now being split into 168 pieces.

  Due to database growth, the HTG division is now being split into 38 pieces.

  Due to database growth, the PAT division is now being split into 5 pieces.

  Due to database growth, the PLN division is now being split into 6 pieces.

  Due to database growth, the PRI division is now being split into 21 pieces.

  Due to database growth, the VRT division is now being split into 2 pieces.
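
  Scripts that fetch a whole division need to know how many pieces to expect.
The sketch below enumerates the expected filenames from the piece counts
listed above. It assumes every division follows the gb<division><n>.seq
pattern seen in the gbgss44.seq example later in these notes; the helper name
division_files is hypothetical.

```python
from typing import Dict, List

# Piece counts as announced for Release 130.0.
PIECES: Dict[str, int] = {
    "EST": 168, "HTG": 38, "PAT": 5, "PLN": 6, "PRI": 21, "VRT": 2,
}


def division_files(division: str, pieces: int) -> List[str]:
    """List expected flatfile names, assuming the gb<div><n>.seq pattern."""
    return [f"gb{division.lower()}{i}.seq" for i in range(1, pieces + 1)]


if __name__ == "__main__":
    for division, pieces in PIECES.items():
        names = division_files(division, pieces)
        print(division, names[0], "...", names[-1])
```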

1.3.2 New CONSRTM linetype for references.

  In order to capture the names of consortia and other groups that are involved
in large-scale sequencing projects, a new linetype called CONSRTM is now
legal for the REFERENCE block of the GenBank flatfile format, as of June 2002.

  Consider, for example, the literature citation associated with PubMed
identifier 11237011:

  Nature 2001 Feb 15;409(6822):860-921
  Initial sequencing and analysis of the human genome.

In addition to the very long list of author names, a consortium is associated
with this publication:

  International Human Genome Sequencing Consortium

  With the addition of a CONSRTM linetype, collective names like this will
have a dedicated location in the flatfile format. Records which currently
attempt to force consortium names into the last entry of the AUTHORS line
will be updated to utilize the new linetype in upcoming months.

  Note that multiple consortia for a REFERENCE may exist, in which case
they will be separated by a semicolon. It is also possible that references
with a CONSRTM linetype will not have any individual AUTHORS at all.
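
  Parsers that read the REFERENCE block will need to accept the new linetype
and must not assume that AUTHORS is present. The following is a minimal
sketch, not a complete flatfile parser. It assumes the usual flatfile layout
in which the keyword occupies the first 12 columns and continuation lines
leave those columns blank; the sample record and the function names are
illustrative.

```python
def parse_reference(lines):
    """Collect REFERENCE-block linetypes into a dict of keyword -> text.

    Lines whose first 12 columns are blank are treated as continuations
    of the previous linetype.
    """
    fields = {}
    current = None
    for line in lines:
        keyword, value = line[:12].strip(), line[12:].strip()
        if keyword:
            current = keyword
            fields[current] = value
        elif current is not None:
            fields[current] += " " + value
    return fields


def split_consortia(consrtm_value):
    """Split a CONSRTM value into individual consortium names."""
    return [name.strip() for name in consrtm_value.split(";") if name.strip()]


# Illustrative sample only; note that it has no AUTHORS line at all.
SAMPLE = [
    "REFERENCE   1",
    "  CONSRTM   International Human Genome Sequencing Consortium",
    "  TITLE     Initial sequencing and analysis of the human genome",
    "  JOURNAL   Nature 409 (6822), 860-921 (2001)",
]

if __name__ == "__main__":
    ref = parse_reference(SAMPLE)
    print(split_consortia(ref.get("CONSRTM", "")))
    print(ref.get("AUTHORS"))  # None: callers must handle a missing AUTHORS
```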

1.3.3 GSS File Header Problem

  GSS sequences at GenBank are maintained in one of two different systems,
depending on their origin. One recent change to release processing involves
the parallelization of the dumps from those systems. Because the second dump
(for example) has no prior knowledge of exactly how many GSS files will be
dumped from the first, it doesn't know how to number its own output files.
There is thus a discrepancy between the filenames and file headers of eight
GSS flatfiles in Release 130.0. Consider the gbgss44.seq file:

GBGSS1.SEQ           Genetic Sequence Data Bank
                           June 15 2002

                 NCBI-GenBank Flat File Release 130

                           GSS Sequences (Part 1)

  Here, the header still gives the filename and part number as "1", though the
file has been renamed as "44" based on the files dumped from the other system.
We will work to resolve this discrepancy in future releases, but the priority
is admittedly much lower than many other tasks.
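
  Users whose software checks file headers against filenames may want to
detect the affected files rather than fail on them. A minimal sketch of such
a check follows. It assumes the header's first whitespace-delimited token is
the filename, as in the gbgss44.seq example above; the function names are
hypothetical.

```python
import re


def part_number(name):
    """Extract the trailing part number from a name such as gbgss44.seq."""
    match = re.search(r"(\d+)\.seq$", name, re.IGNORECASE)
    return int(match.group(1)) if match else None


def header_mismatch(filename, header_line):
    """Return True when the header's filename differs from the real one.

    The comparison ignores case, since headers are upper-case and the
    files on the ftp site are lower-case.
    """
    tokens = header_line.split()
    if not tokens:
        return True  # an empty header line cannot match any filename
    return tokens[0].lower() != filename.lower()


if __name__ == "__main__":
    header = "GBGSS1.SEQ           Genetic Sequence Data Bank"
    print(header_mismatch("gbgss44.seq", header))                     # True
    print(part_number("gbgss44.seq"), part_number(header.split()[0]))  # 44 1
```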

1.4 Upcoming Changes

1.4.1 Change to the SOURCE and ORGANISM format

The GenBank flatfile format utilizes two different formats for the SOURCE
linetype, depending on the existence of a designated common name in the
GenBank Taxonomy Database -

   --- Current GenBank format ---

SOURCE    [organism name] OR [common name]
ORGANISM  [organelle prefix] organism name

Starting with GenBank release 132.0 in October 2002, a new, more flexible
SOURCE format will be adopted, allowing for the display of several types
of secondary names (common names, acronyms, synonyms, anamorphs for the fungi)
which can be derived either from the taxonomy database *or* from the source
feature annotation provided by the submitter.

In addition, the optional organelle prefix will move from the ORGANISM line 
(in the old format) to the SOURCE line in the new format. The ORGANISM line
will contain only the unadorned organism name, the name by which a sequence
entry is indexed in the taxonomy database.

   --- NEW GenBank format ---

SOURCE    [organelle prefix] organism name ([optional second name])
ORGANISM  organism name

The optional second name can be one of the following (ordered by precedence) -

  'synonym' from the source feature organism modifiers (submitter-supplied)
  'acronym' from the source feature organism modifiers (submitter-supplied)
  'anamorph' from the source feature organism modifiers (submitter-supplied)
  'common' from the source feature organism modifiers (submitter-supplied)

  'genbank synonym' from the taxonomy database
  'genbank acronym' from the taxonomy database
  'genbank anamorph' from the taxonomy database
  'genbank common name' from the taxonomy database

The first set allows us to customize the flatfiles of particular entries;
the second set allows us to add useful information from the taxonomy
database (with a more reasonable presentation than in the current
flatfiles).

The 'anamorph' names will appear within parentheses prefixed with
(anamorph: ---). The 'common name', 'acronym' and 'synonym' fields will be 
parenthesized without a prefix (see examples below).
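
The precedence and parenthesization rules above can be sketched as follows.
This is an illustration of the rules as described here, not NCBI's
implementation; the dictionary keys reuse the name types listed above, and
the function names pick_second_name and format_source are hypothetical.

```python
# (origin, name type) pairs, highest precedence first.
PRECEDENCE = [
    ("feature", "synonym"),
    ("feature", "acronym"),
    ("feature", "anamorph"),
    ("feature", "common"),
    ("taxonomy", "genbank synonym"),
    ("taxonomy", "genbank acronym"),
    ("taxonomy", "genbank anamorph"),
    ("taxonomy", "genbank common name"),
]


def pick_second_name(feature_names, taxonomy_names):
    """Return (name type, value) for the first available name, or None."""
    for origin, kind in PRECEDENCE:
        names = feature_names if origin == "feature" else taxonomy_names
        value = names.get(kind)
        if value:
            return kind, value
    return None


def format_source(organism, feature_names=None, taxonomy_names=None,
                  organelle_prefix=""):
    """Build the text of a new-format SOURCE line (without the keyword)."""
    picked = pick_second_name(feature_names or {}, taxonomy_names or {})
    text = f"{organelle_prefix} {organism}".strip()
    if picked:
        kind, value = picked
        if kind.endswith("anamorph"):
            value = f"anamorph: {value}"  # only anamorphs carry a prefix
        text += f" ({value})"
    return text


if __name__ == "__main__":
    print(format_source("Sus scrofa",
                        taxonomy_names={"genbank common name": "pig"}))
    print(format_source("Emericella nidulans",
                        taxonomy_names={"genbank anamorph": "Aspergillus nidulans"}))
```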

The SOURCE line organelle prefix will correspond to the most detailed portion
of the string value for the /organelle qualifier of the source feature. This
allows us to annotate everything with the correct general terms, yet prominently
display the familiar 'Chloroplast' & 'Kinetoplast' :

  organelle qualifier           SOURCE organelle prefix
  -----------------             -----------------------
  "plastid"                     plastid
  "mitochondrion"               mitochondrion
  "nucleomorph"                 nucleomorph
  "mitochondrion: kinetoplast"  kinetoplast
  "plastid: chloroplast"        chloroplast
  "plastid: apicoplast"         apicoplast
  "plastid: chromoplast"        chromoplast
  "plastid: cyanelle"           cyanelle
  "plastid: leucoplast"         leucoplast
  "plastid: protoplast"         protoplast
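
Reading "the most detailed portion" as the text after the last colon
reproduces every row of the table above. A minimal sketch under that
assumption (the function name organelle_prefix is hypothetical):

```python
def organelle_prefix(organelle_value):
    """Return the most detailed part of an /organelle qualifier value.

    Assumes the general term, when present, precedes a colon and the
    specific term follows it, as in "plastid: chloroplast".
    """
    return organelle_value.split(":")[-1].strip()


if __name__ == "__main__":
    for value in ("plastid", "mitochondrion: kinetoplast", "plastid: chloroplast"):
        print(f"{value!r:32} {organelle_prefix(value)}")
```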

======          Examples of the new format       ======

In all of the examples below, the source feature qualifiers given
in the first part of the example will automatically generate the
SOURCE & ORGANISM lines shown:


  /organism="Sus scrofa"

SOURCE      Sus scrofa (pig)
ORGANISM    Sus scrofa

'pig' is the genbank common name from the GenBank taxonomy database.


  /organism="Sus scrofa"
  /note="common: Japanese wild boar"

SOURCE      Sus scrofa (Japanese wild boar)
ORGANISM    Sus scrofa

The common name from the source feature (submitter-supplied) for
the entry overrides the common name from the GenBank taxonomy database
with the new SOURCE format.


  /organism="Takifugu rubripes"

SOURCE       Takifugu rubripes (Fugu rubripes)
ORGANISM     Takifugu rubripes

'genbank synonym' from the taxonomy database is displayed on the SOURCE
line.


  /organism="Takifugu rubripes"
  /note="common: Sydney's pufferfish"

SOURCE       Takifugu rubripes (Sydney's pufferfish)
ORGANISM     Takifugu rubripes

Any of the customizing fields from the entry itself take precedence
over the default values from the taxonomy database.


  /organism="Cauliflower mosaic virus"

SOURCE       Cauliflower mosaic virus (CaMV)
ORGANISM     Cauliflower mosaic virus

If there is a single acronym listed in the taxonomy database,
it will appear on the SOURCE line.


'genbank anamorph' (from the taxonomy database) 

  /organism="Emericella nidulans"

SOURCE       Emericella nidulans (anamorph: Aspergillus nidulans)
ORGANISM     Emericella nidulans

The 'anamorph' nametype is prefixed with "anamorph:" on the SOURCE line
to distinguish it from a taxonomic synonym.


  /organism="Mytilus californicus"

SOURCE       mitochondrion Mytilus californicus (California mussel)
ORGANISM     Mytilus californicus

Organelle prefix moved to SOURCE, with common name from the
GenBank taxonomy database.

Additional information about this change will be presented via the
GenBank release notes, and via the GenBank newsgroup.

1.4.2 Selenocysteine representation

  Selenocysteine residues within the protein translations of coding
region features have been represented in GenBank via the letter 'X'
and a /transl_except qualifier. At the May 1999 DDBJ/EMBL/GenBank
collaborative meeting, it was learned that IUPAC plans to adopt the
letter 'U' for selenocysteine.

  DDBJ, EMBL, and GenBank will thus use this new amino acid abbreviation
for their /translation qualifiers. Although a timetable for its appearance
has not been finalized, we are mentioning this now because the introduction
of a new residue abbreviation is a fairly fundamental change.

  Details about the use of 'U' will be made available via these release
notes and the GenBank newsgroup as they become available.
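
  Software that validates protein translations against a fixed residue
alphabet will need to accept 'U' once the change takes effect. The sketch
below shows the substitution on a translation string. It assumes the
selenocysteine residue positions (1-based) have already been worked out from
the /transl_except qualifiers, a step that is not shown; the function name
and the sample sequence are hypothetical.

```python
def mark_selenocysteine(translation, sec_positions):
    """Replace 'X' with 'U' at the given 1-based residue positions.

    Raises ValueError if a listed position does not currently hold 'X',
    so that unrelated residues are never overwritten.
    """
    residues = list(translation)
    for pos in sec_positions:
        if residues[pos - 1] != "X":
            raise ValueError(f"residue {pos} is {residues[pos - 1]!r}, not 'X'")
        residues[pos - 1] = "U"
    return "".join(residues)


if __name__ == "__main__":
    # Made-up five-residue translation with selenocysteine at position 3.
    print(mark_selenocysteine("MKXLA", [3]))  # MKULA
```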

