[Genbank-bb] GenBank Release 149.0 Now Available

Mark Cavanaugh cavanaug at ncbi.nlm.nih.gov
Fri Aug 19 18:41:10 EST 2005


Greetings GenBank Users,

  GenBank Release 149.0 is now available via ftp from the National
Center for Biotechnology Information (NCBI):

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 149.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 149.0

  Close-of-data was 08/15/2005. Five business days were required to build
Release 149.0. Uncompressed, the Release 149.0 flatfiles require approximately
179 GB (sequence files only) or 195 GB (including the 'short directory' and
'index' files).  The ASN.1 version requires approximately 156 GB. From
the release notes:

   Release  Date       Base Pairs   Entries

   148      Jun 2005   49398852122  45236251
   149      Aug 2005   51674486881  46947388

In the nearly nine-week period between the close dates for GenBank Releases 148.0
and 149.0, the non-WGS portion of GenBank grew by 2,275,634,759 basepairs
and by 1,711,137 sequence records. During that same period, 482,321 records
were updated. Combined, this yields an average of about 35,400 new and/or
updated records per day.
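
  The growth figures above can be cross-checked with a little arithmetic.
What follows is a minimal Python sketch that uses only the numbers quoted in
this announcement. The variable names are illustrative, and the interval
length is back-calculated from the rounded daily average, so it is only
approximate.

```python
# Totals quoted from the release notes table above.
releases = {
    148: {"base_pairs": 49_398_852_122, "entries": 45_236_251},
    149: {"base_pairs": 51_674_486_881, "entries": 46_947_388},
}

# Growth between the two releases.
new_base_pairs = releases[149]["base_pairs"] - releases[148]["base_pairs"]
new_records = releases[149]["entries"] - releases[148]["entries"]

updated_records = 482_321   # quoted in the text
approx_daily_rate = 35_400  # quoted, rounded average of new and/or updated records per day

# Number of days implied by the quoted (rounded) average; approximate only.
implied_days = (new_records + updated_records) / approx_daily_rate

print("new base pairs:", new_base_pairs)
print("new records:   ", new_records)
print("implied days:  ", round(implied_days, 1))
```

The differences reproduce the quoted growth figures, and the implied interval
is a little under nine weeks.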

  Between releases 148.0 and 149.0, the WGS component of GenBank grew by
6,579,373,219 basepairs and by 1,564,237 sequence records.

  Note that Release 149.0 represents a significant milestone for GenBank.
The total number of basepairs (WGS and non-WGS) now exceeds 100 billion:

	105,021,092,665

Keeping pace with the continued exponential growth of the database is possible
only through the dedicated efforts of many talented NCBI staff.

  For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 149.0 and Upcoming Changes) have been appended
below.

  **NOTE** Problems were encountered generating the gbacc.idx and
gbkey.idx 'index' files that accompany GenBank Releases. See Section
1.3.1 for further details.

  Release 149.0 data, and subsequent updates, are available now via
NCBI's Entrez and Blast services.

  As a general guideline, we suggest first transferring the GenBank release
notes (gbrel.txt) whenever a release is being obtained. Check to make sure
that the date and release number in the header of the release notes are
current (eg: August 15 2005, 149.0). If they are not, interrupt the
remaining transfers and then request assistance from the NCBI Service Desk.

  A comprehensive check of the headers of all release files after your
transfers are complete is also suggested. Here's how one might go about
this on a unix platform with csh/tcsh :

	set files = `ls gb*.*`
	foreach i ($files)
		head -10 $i | grep Release
	end

Or, if the files are compressed, perhaps:

	gzcat $i | head -10 | grep Release
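
  For sites without csh/tcsh, the same header check can be sketched in
Python. This is only an illustration under stated assumptions: the function
names, the default "gb*.*" pattern, and the expected string "Release 149.0"
are choices made here, not part of any NCBI tool, and files ending in ".gz"
are assumed to be gzip-compressed.

```python
import glob
import gzip
import itertools


def release_header_lines(path, max_lines=10):
    """Return the lines among the first max_lines of path that mention 'Release'."""
    # Assumption: a ".gz" suffix means the file is gzip-compressed.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="ascii", errors="replace") as handle:
        head = itertools.islice(handle, max_lines)
        return [line.rstrip("\n") for line in head if "Release" in line]


def check_release(pattern="gb*.*", expected="Release 149.0"):
    """Map each matching file whose header lacks the expected string to the lines found."""
    suspect = {}
    for path in sorted(glob.glob(pattern)):
        lines = release_header_lines(path)
        if not any(expected in line for line in lines):
            suspect[path] = lines
    return suspect


if __name__ == "__main__":
    for path, lines in check_release().items():
        print(path, lines)
```

An empty result from check_release() means every matching file mentioned the
expected release in its first ten lines.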

  If you encounter problems while ftp'ing or uncompressing Release
149.0, please send email outlining your difficulties to:

	info at ncbi.nlm.nih.gov 

Mark Cavanaugh, Vladimir Alekseyev, Aleksey Vysokolov, Michael Kimelman
GenBank
NCBI/NLM/NIH/HHS


1.3 Important Changes in Release 149.0

1.3.0 GenBank Exceeds 100 Gigabases!

  GenBank reaches a milestone with 149.0, exceeding 100 gigabases of sequence
data. It is interesting to note that the Whole Genome Shotgun (WGS) portion
of the database has grown to exceed the non-WGS portion in just 3.5 years.
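
  The relative sizes of the two components follow from figures quoted
elsewhere in this announcement. The sketch below is a hedged illustration: it
assumes that the Release 149 base-pair count in the summary table is the
non-WGS total (as the surrounding text indicates) and derives the WGS total
by subtraction.

```python
# Figures quoted in this announcement.
total_base_pairs = 105_021_092_665    # WGS and non-WGS combined
non_wgs_base_pairs = 51_674_486_881   # Release 149 total from the summary table (assumed non-WGS)

# Derived, not quoted: the WGS component is whatever remains.
wgs_base_pairs = total_base_pairs - non_wgs_base_pairs

print("WGS base pairs:    ", wgs_base_pairs)
print("non-WGS base pairs:", non_wgs_base_pairs)
print("WGS is larger:     ", wgs_base_pairs > non_wgs_base_pairs)
```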

1.3.1 Problems generating accession number and keyword indexes

  Continuing software problems again prevented the generation of
the gbacc.idx and gbkey.idx 'index' files which normally accompany
GenBank releases.

  A version of gbacc.idx was built manually. However, the first field
contains just an accession number rather than Accession.Version .

  The gbkey.idx index could not be created without substantial
additional delays in release processing, so it is completely absent
from 149.0 .

  Our apologies for any inconvenience that this may cause.

1.3.2 Organizational changes

  The total number of sequence data files increased by 25 with this release:

  - the EST division is now comprised of 413 files (+16)
  - the GSS division is now comprised of 151 files (+7)
  - the HTG division is now comprised of  68 files (+3)
  - the PRI division is now comprised of  29 files (+1)
  - the ROD division is now comprised of  20 files (+2)

1.3.3 GSS File Header Problem

  GSS sequences at GenBank are maintained in two different systems, depending
on their origin, and the dumps from those systems occur in parallel. Because
the second dump (for example) has no prior knowledge of exactly how many GSS
files will be dumped from the first, it does not know how to number its own
output files.

  There is thus a discrepancy between the filenames and file headers for
twenty-seven of the GSS flatfiles in Release 149.0. Consider gbgss125.seq :

GBGSS1.SEQ           Genetic Sequence Data Bank
                           August 15 2005

                NCBI-GenBank Flat File Release 149.0

                           GSS Sequences (Part 1)

   87189 loci,    64730609 bases, from    87189 reported sequences

  Here, the header still carries the filename GBGSS1.SEQ and the part number 1,
though the file itself has been renamed to gbgss125.seq based on the number of
files dumped from the other system.  We will work to resolve this discrepancy
in future releases, but it is a lower priority than many other tasks.
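
  Software that cross-checks filenames against file headers may want to
detect this situation rather than fail on it. The following Python sketch
assumes the header layout shown in the gbgss125.seq example above (filename
as the first token of the first line, and a "(Part N)" marker further down);
the function names are hypothetical.

```python
import os
import re


def header_identity(header_text):
    """Extract (lower-cased file name, part number) from flatfile header text."""
    lines = header_text.splitlines()
    tokens = lines[0].split() if lines else []
    name = tokens[0].lower() if tokens else None
    part = re.search(r"\(Part (\d+)\)", header_text)
    return name, (int(part.group(1)) if part else None)


def header_matches_filename(path, header_text):
    """True when the name recorded in the header equals the actual file name."""
    name, _part = header_identity(header_text)
    return name == os.path.basename(path).lower()
```

With the header shown above, header_identity() yields ("gbgss1.seq", 1), so
header_matches_filename("gbgss125.seq", ...) is False for the affected files.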

1.4 Upcoming Changes

  Several changes related to the Feature Table were agreed to during the
May 2005 collaborative meeting among DDBJ, EMBL, and GenBank. The descriptions
of the changes provided below are preliminary; complete definitions will appear
in future release notes.

1.4.1 New qualifiers for the source feature

  A set of five new source feature qualifiers will be legal as of the
October 2005 release.

    /lat_lon : GPS coordinates for the location at which a specimen,
               from which the sequence was obtained, was collected.
               Format: Decimal degrees (N/S, E/W). 

    /collected_by : Name of the person who collected the specimen.

    /collection_date : Date that the specimen was collected. 
               Format: DD-MMM-YYYY (two-digit day, three-letter
               month abbreviation, four-digit year)

    /identified_by : Name of the person who identified the specimen.

    /PCR_primers="fwd_name: XXX, fwd_seq: aaatttgggccc,
                  rev_name: YYY, rev_seq: gggcccaaattt"

Four separate primer-related qualifiers were initially proposed
(and announced), but in subsequent discussion it was decided to
combine them into a single structured /PCR_primers qualifier.

fwd_seq and rev_seq are mandatory, and their values must be from
the IUPAC nucleotide alphabet. fwd_name and rev_name are both
optional. The primer names (if present) must be a single token,
without whitespace.

The order of the elements within a /PCR_primers qualifier must always be
as shown above. Multiple /PCR_primers qualifiers may exist on a
source feature.
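
The rules above can be expressed as a small validator. This Python sketch is
based only on the preliminary description given here, so details may differ
from the final definition. The function name is hypothetical, and accepting
both upper- and lower-case IUPAC letters, and flexible whitespace after the
colons and commas, are assumptions.

```python
import re

# IUPAC nucleotide codes; accepting either case is an assumption.
_IUPAC = "acgturyswkmbdhvn"
_SEQ = "[" + _IUPAC + _IUPAC.upper() + "]+"
_NAME = r"[^\s,]+"  # a single token without whitespace

# Elements must appear in this fixed order; the two names are optional.
_PCR_PRIMERS = re.compile(
    r"(?:fwd_name:\s*(?P<fwd_name>" + _NAME + r"),\s*)?"
    r"fwd_seq:\s*(?P<fwd_seq>" + _SEQ + r"),\s*"
    r"(?:rev_name:\s*(?P<rev_name>" + _NAME + r"),\s*)?"
    r"rev_seq:\s*(?P<rev_seq>" + _SEQ + r")"
)


def parse_pcr_primers(value):
    """Return a dict of the four elements, or None if the value is not valid."""
    match = _PCR_PRIMERS.fullmatch(value.strip())
    return match.groupdict() if match else None
```

For the example value shown earlier, parse_pcr_primers() returns the two
names and two sequences; omitted names come back as None.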

  These qualifiers will most likely see their first use in association
with environmental sampling projects and the BarCode project.
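
  As an illustration of the /collection_date format described above
(DD-MMM-YYYY), here is a minimal Python sketch of a parser. The function name
is hypothetical, English three-letter month abbreviations are assumed, and
accepting them in any letter case is also an assumption, since the
preliminary description does not say.

```python
import datetime
import re

_MONTHS = {
    "JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
    "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12,
}
_DATE = re.compile(r"(\d{2})-([A-Za-z]{3})-(\d{4})")


def parse_collection_date(value):
    """Return a datetime.date for a DD-MMM-YYYY string, or None if it is invalid."""
    match = _DATE.fullmatch(value)
    if match is None:
        return None
    day, month_abbrev, year = match.groups()
    month = _MONTHS.get(month_abbrev.upper())
    if month is None:
        return None
    try:
        # Rejects impossible dates such as 31-FEB-2005.
        return datetime.date(int(year), month, int(day))
    except ValueError:
        return None
```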

1.4.2 /evidence qualifier to be replaced

  Two new qualifiers designed to replace /evidence will be legal as
of the October 2005 GenBank release : /experiment and /inference .

  The current /evidence="not_experimental" qualifier will be replaced
by /inference . The /inference values will be from a controlled list
which is intended to capture several different classes of inferential
methods.

  The current /evidence="experimental" qualifier will be replaced
by /experiment. This will be a free-text qualifier in which a brief
description of the nature of the bench experiment which supports
the associated feature can be provided by the submitter.

1.4.3 New /organelle qualifier value

  As of the October 2005 GenBank release, a new value for the /organelle
qualifier will be legal : hydrogenosome 

  This will support the annotation of sequences from anaerobic protozoa
and fungi, for which the hydrogenosome has a role in anaerobic respiration. 

1.4.4 Two new CDS qualifiers

  As of the October 2005 GenBank release, two new CDS feature qualifiers
will be introduced:

	/trans_splicing
	/ribosomal_slippage

  Coding regions involved in such processes will be more easily identified
with the addition of these qualifiers.

1.4.5 New /exception qualifier value

  Coding regions for which the conceptual protein translation differs from
the supplied /translation qualifier are flagged with an /exception 
qualifier. The value :

	"rearrangement required for product"

will be legal for this qualifier as of the October 2005 GenBank release.

1.4.6 /repeat_unit qualifier to be replaced

  Two new qualifiers designed to replace /repeat_unit will be legal as
of the October 2005 GenBank release : /repeat_unit_seq and /repeat_unit_range .

  The current qualifier accommodates both integer ranges (eg: "10..20") and
characters that represent a repeat unit pattern (eg: (AT)2(AA)5 ). Introducing
a distinct qualifier for each of these representations will make it easier
to submit and validate them.
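
  Software that reads existing /repeat_unit values may need to decide which
of the two new qualifiers a value belongs to. The Python sketch below shows
one possible way to make that split. It is an assumption-laden illustration:
the function name is hypothetical, the final syntax of the new qualifiers has
not been published here, and any value that is not a plain "start..end"
integer range is simply treated as a sequence pattern.

```python
import re

_RANGE = re.compile(r"(\d+)\.\.(\d+)")


def classify_repeat_unit(value):
    """Map an old /repeat_unit value to (new qualifier name, parsed value)."""
    text = value.strip()
    match = _RANGE.fullmatch(text)
    if match:
        start, end = (int(group) for group in match.groups())
        return "repeat_unit_range", (start, end)
    # Assumption: everything else is a repeat unit pattern such as (AT)2(AA)5.
    return "repeat_unit_seq", text
```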


