[Genbank-bb] GenBank Release 148.0 Now Available

Mark Cavanaugh cavanaug at ncbi.nlm.nih.gov
Mon Jun 20 23:50:21 EST 2005

Greetings GenBank Users,

  GenBank Release 148.0 is now available via ftp from the National
Center for Biotechnology Information (NCBI):

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 148.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 148.0

  Close-of-data was 06/14/2005. Five business days were required to build
Release 148.0. Uncompressed, the Release 148.0 flatfiles require approximately
172 GB (sequence files only) or 189 GB (including the 'short directory' and
'index' files).  The ASN.1 version requires approximately 148 GB. From
the release notes:

   Release  Date       Base Pairs   Entries

   147      Apr 2005   48235738567  44202133
   148      Jun 2005   49398852122  45236251

In the eight-week period between the close dates for GenBank Releases 147.0
and 148.0, the non-WGS portion of GenBank grew by 1,163,113,555 basepairs
and by 1,034,118 sequence records. During that same period, 501,858 records
were updated. Combined, this yields an average of about 27,400 new and/or
updated records per day.
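
  The growth figures and the daily average can be reproduced from the
numbers quoted in this announcement. The following is only a sketch in
Python; it assumes that the eight-week period is counted as exactly 56 days,
which the announcement does not state.

```python
# Figures taken from the release statistics quoted in this announcement.
base_pairs = {147: 48_235_738_567, 148: 49_398_852_122}
entries = {147: 44_202_133, 148: 45_236_251}
updated_records = 501_858

# Assumption: "eight weeks" is treated as exactly 56 days.
days = 8 * 7

new_base_pairs = base_pairs[148] - base_pairs[147]
new_records = entries[148] - entries[147]
per_day = (new_records + updated_records) / days

print(f"new base pairs: {new_base_pairs:,}")
print(f"new records:    {new_records:,}")
print(f"new and/or updated records per day: {per_day:,.0f}")
```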

  Between releases 147.0 and 148.0, the WGS component of GenBank grew by
7,244,024,219 basepairs and by 2,026,178 sequence records.

  For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 148.0 and Upcoming Changes) have been appended below.

  **NOTE** Problems were encountered generating the gbacc.idx and
gbkey.idx 'index' files that accompany GenBank Releases. See Section
1.3.1 for further details.

  Release 148.0 data, and subsequent updates, are available now via
NCBI's Entrez and BLAST services.

  As a general guideline, we suggest first transferring the GenBank release
notes (gbrel.txt) whenever a release is being obtained. Check to make sure
that the date and release number in the header of the release notes are
current (eg: June 15 2005, 148.0). If they are not, interrupt the
remaining transfers and then request assistance from the NCBI Service Desk.

  A comprehensive check of the headers of all release files after your
transfers are complete is also suggested. Here's how one might go about
this on a unix platform with csh/tcsh :

	set files = `ls gb*.*`
	foreach i ($files)
		head -10 $i | grep Release
	end

Or, if the files are compressed, perhaps:

	gzcat $i | head -10 | grep Release

  If you encounter problems while ftp'ing or uncompressing Release
148.0, please send email outlining your difficulties to:

	info at ncbi.nlm.nih.gov 

Mark Cavanaugh, Vladimir Alekseyev, Aleksey Vysokolov, Michael Kimelman

1.3 Important Changes in Release 148.0

1.3.1 Problems generating accession number and keyword indexes

  Continuing software problems again prevented the generation of
the gbacc.idx and gbkey.idx 'index' files which normally accompany
GenBank releases.

  A version of gbacc.idx was built manually. However, the first field
contains just an accession number rather than Accession.Version .

  The gbkey.idx index could not be created without substantial
additional delays in release processing, so it is completely absent
from 148.0 .

  Our apologies for any inconvenience that this may cause.

1.3.2 Organizational changes

  The total number of sequence data files increased by 17 with this release:

  - the EST division is now comprised of 397 files (+9)
  - the GSS division is now comprised of 144 files (+2)
  - the HTC division is now comprised of   7 files (+1)
  - the HTG division is now comprised of  65 files (+2)
  - the PAT division is now comprised of  18 files (+1)
  - the VRL division is now comprised of   5 files (+1)
  - the VRT division is now comprised of   9 files (+1)

1.3.3 GSS File Header Problem

  GSS sequences at GenBank are maintained in two different systems, depending
on their origin, and the dumps from those systems occur in parallel. Because
the second dump (for example) has no prior knowledge of exactly how many GSS
files will be dumped from the first, it does not know how to number its own
output files.

  There is thus a discrepancy between the filenames and file headers for
twenty-six GSS flatfiles in Release 148.0. Consider gbgss119.seq :

GBGSS1.SEQ           Genetic Sequence Data Bank
                            June 15 2005

                NCBI-GenBank Flat File Release 148.0

                           GSS Sequences (Part 1)

   87186 loci,    64735440 bases, from    87186 reported sequences

  Here, the filename and part number in the header both use "1", though the file
has been renamed as "119" based on the number of files dumped from the other
system.  We will work to resolve this discrepancy in future releases, but the
priority is certainly much lower than many other tasks.
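
  Until the discrepancy is resolved, a header check that compares the
recorded file name with the actual file name will report these GSS files as
mismatches. The Python sketch below illustrates this using the header quoted
above. The function names header_identity and matches_filename are
hypothetical, and the parsing assumes only the header layout shown in this
section.

```python
import re

def header_identity(header_text):
    """Extract the file name and part number recorded in a flatfile header."""
    non_blank = [line for line in header_text.splitlines() if line.strip()]
    # The first token of the first non-blank line is the recorded file name.
    recorded_name = non_blank[0].split()[0].lower()
    # The division line carries the part number, e.g. "GSS Sequences (Part 1)".
    match = re.search(r"\(Part\s+(\d+)\)", header_text)
    part = int(match.group(1)) if match else None
    return recorded_name, part

def matches_filename(filename, header_text):
    """True when the header's recorded name equals the actual file name."""
    recorded_name, _ = header_identity(header_text)
    return recorded_name == filename.lower()

# Header of gbgss119.seq as quoted in section 1.3.3.
sample_header = """\
GBGSS1.SEQ           Genetic Sequence Data Bank
                            June 15 2005

                NCBI-GenBank Flat File Release 148.0

                           GSS Sequences (Part 1)

   87186 loci,    64735440 bases, from    87186 reported sequences
"""

print(header_identity(sample_header))
print(matches_filename("gbgss119.seq", sample_header))
```

  For the affected GSS files, a mismatch of this kind is expected in
Release 148.0 and does not by itself indicate a failed transfer.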

1.4 Upcoming Changes

  Several changes related to the Feature Table were agreed to during the
May 2005 collaborative meeting among DDBJ, EMBL, and GenBank. The descriptions
of the changes provided below are preliminary; complete definitions will appear
in future release notes.

1.4.1 New qualifiers for the source feature

  A set of eight new source feature qualifiers will be legal as of the
October 2005 release.

    /lat_lon : GPS coordinates for the location at which a specimen,
               from which the sequence was obtained, was collected.
               Format: Decimal degrees (N/S, E/W). 

    /collected_by : Name of the person who collected the specimen.

    /collection_date : Date that the specimen was collected. 
               Format: DD-MMM-YYYY (two-digit day, three-letter
               month abbreviation, 4-digit year)

    /identified_by : Name of the person who identified the specimen.

    /fwd_primer_seq : Forward PCR primer sequence used to amplify
               the sequence.

    /fwd_primer_name : Name of the forward PCR primer.

    /rev_primer_seq : Reverse PCR primer sequence used to amplify
               the sequence.

    /rev_primer_name : Name of the reverse PCR primer.

  These qualifiers will most likely see their first use in association
with environmental sampling projects and the BarCode project.
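
  As an illustration of the /collection_date format described above, the
Python sketch below checks a value against DD-MMM-YYYY. Since these
definitions are preliminary, it rests on assumptions: English three-letter
month abbreviations, accepted in any letter case. The function name
parse_collection_date is hypothetical.

```python
import re
from datetime import date

# Assumption: English three-letter month abbreviations.
MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

def parse_collection_date(value):
    """Return a date for a DD-MMM-YYYY string, or None if it is not valid."""
    match = re.fullmatch(r"(\d{2})-([A-Za-z]{3})-(\d{4})", value)
    if match is None:
        return None
    day, month, year = match.groups()
    month = month.upper()
    if month not in MONTHS:
        return None
    try:
        # date() rejects impossible calendar dates such as 31 February.
        return date(int(year), MONTHS.index(month) + 1, int(day))
    except ValueError:
        return None

print(parse_collection_date("14-Jun-2005"))
print(parse_collection_date("31-Feb-2005"))
print(parse_collection_date("2005-06-14"))
```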

1.4.2 /evidence qualifier to be replaced

  Two new qualifiers designed to replace /evidence will be legal as
of the October 2005 GenBank release : /experiment and /inference .

  The current /evidence="not_experimental" qualifier will be replaced
by /inference . The /inference values will be from a controlled list
which is intended to capture several different classes of inferential
evidence.

  The current /evidence="experimental" qualifier will be replaced
by /experiment. This will be a free-text qualifier in which a brief
description of the nature of the bench experiment which supports
the associated feature can be provided by the submitter.

1.4.3 New /organelle qualifier value

  As of the October 2005 GenBank release, a new value for the /organelle
qualifier will be legal : hydrogenosome 

  This will support the annotation of sequences from anaerobic protozoa
and fungi, for which the hydrogenosome has a role in anaerobic respiration. 

1.4.4 Two new CDS qualifiers

  As of the October 2005 GenBank release, two new CDS feature qualifiers
will be introduced:


  Coding regions involved in such processes will be more easily identified
with the addition of these qualifiers.

1.4.5 New /exception qualifier value

  Coding regions for which the conceptual protein translation differs from
the supplied /translation qualifier are flagged with an /exception 
qualifier. The value :

	"rearrangement required for product"

will be legal for this qualifier as of the October 2005 GenBank release.

1.4.6 /repeat_unit qualifier to be replaced

  Two new qualifiers designed to replace /repeat_unit will be legal as
of the October 2005 GenBank release : /repeat_unit_seq and /repeat_unit_range .

  The current qualifier accommodates both integer ranges (eg: "10..20") and
characters that represent a repeat unit pattern (eg: (AT)2(AA)5 ). Introducing
a distinct qualifier for each of these representations will make it easier
to submit and validate them.
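
  To illustrate why separate qualifiers simplify validation, the Python
sketch below sorts an existing /repeat_unit value into one of the two
representations. It rests on assumptions: the function name
classify_repeat_unit is hypothetical, the two regular expressions cover only
the example forms quoted above, and the mapping of patterns to
/repeat_unit_seq and of ranges to /repeat_unit_range is inferred from the
qualifier names.

```python
import re

# Integer range such as 10..20
RANGE_RE = re.compile(r"(\d+)\.\.(\d+)")
# Repeat unit pattern such as (AT)2(AA)5
PATTERN_RE = re.compile(r"(?:\([A-Za-z]+\)\d+)+")

def classify_repeat_unit(value):
    """Return the name of the new qualifier a /repeat_unit value would map to."""
    value = value.strip()
    match = RANGE_RE.fullmatch(value)
    # A range is accepted only if its start does not exceed its end.
    if match and int(match.group(1)) <= int(match.group(2)):
        return "repeat_unit_range"
    if PATTERN_RE.fullmatch(value):
        return "repeat_unit_seq"
    return None

for value in ["10..20", "(AT)2(AA)5", "20..10"]:
    print(value, "->", classify_repeat_unit(value))
```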
