Greetings GenBank Users,
GenBank Release 149.0 is now available via ftp from the National
Center for Biotechnology Information (NCBI):
Ftp Site          Directory  Contents
----------------  ---------  ---------------------------------------
ftp.ncbi.nih.gov  genbank    GenBank Release 149.0 flatfiles
                  ncbi-asn1  ASN.1 data used to create Release 149.0
Close-of-data was 08/15/2005. Five business days were required to build
Release 149.0. Uncompressed, the Release 149.0 flatfiles require approximately
179 GB (sequence files only) or 195 GB (including the 'short directory' and
'index' files). The ASN.1 version requires approximately 156 GB. From
the release notes:
Release  Date      Base Pairs   Entries
148      Jun 2005  49398852122  45236251
149      Aug 2005  51674486881  46947388
In the nearly nine-week period between the close dates for GenBank Releases 148.0
and 149.0, the non-WGS portion of GenBank grew by 2,275,634,759 basepairs
and by 1,711,137 sequence records. During that same period, 482,321 records
were updated. Combined, this yields an average of about 35,400 new and/or
updated records per day.
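The growth figures above can be rechecked from the two rows of the release-notes table. The following is a minimal sketch in Python that uses only numbers quoted in this announcement. The exact close date of Release 148.0 is not given here, so rather than assume one, the sketch backs out the number of days implied by the quoted daily average.

```python
# Non-WGS totals, as quoted from the release notes table above.
rel_148 = {"base_pairs": 49_398_852_122, "entries": 45_236_251}
rel_149 = {"base_pairs": 51_674_486_881, "entries": 46_947_388}

# Growth between the two releases.
new_base_pairs = rel_149["base_pairs"] - rel_148["base_pairs"]
new_records = rel_149["entries"] - rel_148["entries"]

# Records updated during the same period (quoted above).
updated_records = 482_321
new_or_updated = new_records + updated_records

# The announcement quotes "about 35,400" per day; derive the implied period.
quoted_daily_average = 35_400
implied_days = new_or_updated / quoted_daily_average

print(f"new base pairs:        {new_base_pairs:,}")
print(f"new records:           {new_records:,}")
print(f"new and/or updated:    {new_or_updated:,}")
print(f"implied period (days): {implied_days:.1f}")
```

The implied period comes out just under nine weeks, which agrees with the "nearly nine-week" description.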
Between releases 148.0 and 149.0, the WGS component of GenBank grew by
6,579,373,219 basepairs and by 1,564,237 sequence records.
Note that Release 149.0 represents a significant milestone for GenBank.
The total number of basepairs (WGS and non-WGS) now exceeds 100 billion:
105,021,092,665
Keeping pace with the continued exponential growth of the database is possible
only through the dedicated efforts of many talented NCBI staff.
For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 149.0 and Upcoming Changes) have been appended
below.
**NOTE** Problems were encountered generating the gbacc.idx and
gbkey.idx 'index' files that accompany GenBank Releases. See Section
1.3.1 for further details.
Release 149.0 data, and subsequent updates, are available now via
NCBI's Entrez and Blast services.
As a general guideline, we suggest first transferring the GenBank release
notes (gbrel.txt) whenever a release is being obtained. Check to make sure
that the date and release number in the header of the release notes are
current (e.g., August 15 2005, 149.0). If they are not, interrupt the
remaining transfers and then request assistance from the NCBI Service Desk.
A comprehensive check of the headers of all release files after your
transfers are complete is also suggested. Here's how one might go about
this on a Unix platform with csh/tcsh:
    set files = `ls gb*.*`
    foreach i ($files)
        head -10 $i | grep Release
    end
Or, if the files are compressed, perhaps use this inside the loop:
        gzcat $i | head -10 | grep Release
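For sites without csh, the same header check can be sketched in Python. This is an illustration rather than an NCBI-supplied tool. It assumes that compressed files carry a ".gz" suffix and that the release line falls within the first ten lines, as the csh loop above does. The function names are hypothetical.

```python
import glob
import gzip
from itertools import islice

EXPECTED = "Release 149.0"  # release number expected in every header


def header_lines(path, count=10):
    """Return the first `count` lines of a plain or gzip-compressed file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="ascii", errors="replace") as handle:
        return list(islice(handle, count))


def release_line(path):
    """Return the first header line that mentions 'Release', or None."""
    for line in header_lines(path):
        if "Release" in line:
            return line.strip()
    return None


def check_files(pattern="gb*.*", expected=EXPECTED):
    """Map each file whose header lacks the expected release to what was found."""
    problems = {}
    for path in sorted(glob.glob(pattern)):
        found = release_line(path)
        if found is None or expected not in found:
            problems[path] = found
    return problems


if __name__ == "__main__":
    for path, found in check_files().items():
        print(f"{path}: {found or 'no Release line in first 10 lines'}")
```

Run from the directory that holds the transferred files, the script prints nothing when every header carries the expected release number.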
If you encounter problems while ftp'ing or uncompressing Release
149.0, please send email outlining your difficulties to:
info at ncbi.nlm.nih.gov
Mark Cavanaugh, Vladimir Alekseyev, Aleksey Vysokolov, Michael Kimelman
GenBank
NCBI/NLM/NIH/HHS
1.3 Important Changes in Release 149.0
1.3.0 GenBank Exceeds 100 Gigabases!
GenBank reaches a milestone with 149.0, exceeding 100 gigabases of sequence
data. It is interesting to note that the Whole Genome Shotgun (WGS) portion
of the database has grown to exceed the non-WGS portion in just 3.5 years.
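The comparison between the WGS and non-WGS portions can be reproduced from figures quoted in this announcement. The sketch below assumes that the base-pair count in the Release 149.0 row of the release-notes table is the non-WGS total. That reading is consistent with the quoted non-WGS growth, but it is an inference rather than something the table states.

```python
# Combined WGS + non-WGS total quoted in the announcement.
total_base_pairs = 105_021_092_665

# Assumption: the Release 149.0 table row is the non-WGS total.
non_wgs_base_pairs = 51_674_486_881

# The WGS portion is whatever remains.
wgs_base_pairs = total_base_pairs - non_wgs_base_pairs
wgs_share = wgs_base_pairs / total_base_pairs

print(f"WGS base pairs:     {wgs_base_pairs:,}")
print(f"non-WGS base pairs: {non_wgs_base_pairs:,}")
print(f"WGS share of total: {wgs_share:.1%}")
```

Under that assumption the WGS portion is the larger of the two, as section 1.3.0 describes.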
1.3.1 Problems generating accession number and keyword indexes
Continuing software problems again prevented the generation of
the gbacc.idx and gbkey.idx 'index' files which normally accompany
GenBank releases.
A version of gbacc.idx was built manually. However, the first field
contains just an accession number rather than Accession.Version .
The gbkey.idx index could not be created without substantial
additional delays in release processing, so it is completely absent
from 149.0 .
Our apologies for any inconvenience that this may cause.
1.3.2 Organizational changes
The total number of sequence data files increased by 25 with this release:
- the EST division is now comprised of 413 files (+16)
- the GSS division is now comprised of 151 files (+7)
- the HTG division is now comprised of 68 files (+3)
- the PRI division is now comprised of 29 files (+1)
- the ROD division is now comprised of 20 files (+2)
1.3.3 GSS File Header Problem
GSS sequences at GenBank are maintained in two different systems, depending
on their origin, and the dumps from those systems occur in parallel. Because
the second dump (for example) has no prior knowledge of exactly how many GSS
files will be dumped from the first, it does not know how to number its own
output files.
There is thus a discrepancy between the filenames and file headers for
twenty-seven of the GSS flatfiles in Release 149.0. Consider gbgss125.seq :
GBGSS1.SEQ Genetic Sequence Data Bank
August 15 2005
NCBI-GenBank Flat File Release 149.0
GSS Sequences (Part 1)
87189 loci, 64730609 bases, from 87189 reported sequences
Here, the file number and part number in the header are both "1", though the file
has been renamed as "125" based on the number of files dumped from the other
system. We will work to resolve this discrepancy in future releases, but the
priority is certainly much lower than many other tasks.
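Software that cross-checks filenames against file headers may need to allow for this discrepancy. The sketch below shows one way to detect it. It assumes that the first header line begins with the upper-case filename, as in the gbgss125.seq example above. The function name and regular expressions are hypothetical.

```python
import re

# First header line, e.g. "GBGSS1.SEQ   Genetic Sequence Data Bank".
HEADER_NAME = re.compile(r"^\s*(GB[A-Z]+)(\d+)\.SEQ\b")
# Name of the file on disk, e.g. "gbgss125.seq".
FILE_NAME = re.compile(r"^(gb[a-z]+)(\d+)\.seq$")


def header_matches_filename(filename, first_header_line):
    """Return True if division and file number agree, False if they
    differ, and None if either name cannot be parsed."""
    name = FILE_NAME.match(filename.lower())
    head = HEADER_NAME.match(first_header_line.upper())
    if not name or not head:
        return None
    same_division = name.group(1) == head.group(1).lower()
    same_number = int(name.group(2)) == int(head.group(2))
    return same_division and same_number


# The example from this section: the header says 1, the file is 125.
print(header_matches_filename("gbgss125.seq",
                              "GBGSS1.SEQ Genetic Sequence Data Bank"))
```

A result of False for a GSS file in Release 149.0 reflects the numbering issue described here, not a corrupted transfer.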
1.4 Upcoming Changes
Several changes related to the Feature Table were agreed to during the
May 2005 collaborative meeting among DDBJ, EMBL, and GenBank. The descriptions
of the changes provided below are preliminary; complete definitions will appear
in future release notes.
1.4.1 New qualifiers for the source feature
A set of five new source feature qualifiers will be legal as of the
October 2005 release.
/lat_lon         : GPS coordinates for the location at which a specimen,
                   from which the sequence was obtained, was collected.
                   Format: Decimal degrees (N/S, E/W).
/collected_by    : Name of the person who collected the specimen.
/collection_date : Date that the specimen was collected.
                   Format: DD-MMM-YYYY (two-digit day, three-letter
                   month abbreviation, four-digit year)
/identified_by   : Name of the person who identified the specimen.
/PCR_primers="fwd_name: XXX, fwd_seq: aaatttgggccc,
              rev_name: YYY, rev_seq: gggcccaaattt"
Four separate primer-related qualifiers were initially proposed
(and announced), but in subsequent discussion it was decided to
combine them into a single structured /PCR_primers qualifier.
fwd_seq and rev_seq are mandatory, and their values must be from
the IUPAC nucleotide alphabet. fwd_name and rev_name are both
optional. The primer names (if present) must be a single token,
without whitespace.
The order of the elements within the /PCR_primers must always be
as shown above. Multiple /PCR_primers qualifiers may exist on a
source feature.
These qualifiers will most likely see their first use in association
with environmental sampling projects and the BarCode project.
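Parser authors who want to prepare for these qualifiers could start from a sketch like the one below. It is based only on the preliminary descriptions above and may not match the final definitions. It assumes that the /PCR_primers elements are separated by a comma and a space inside one quoted value, that a wrapped value is rejoined with single spaces, and that month abbreviations are the usual English ones. The date check is coarse: it does not verify the day against the month. All names are hypothetical.

```python
import re

MONTHS = {"JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"}
# IUPAC nucleotide codes, lower case (u included for RNA).
IUPAC_NUCLEOTIDES = set("acgtumrwsykvhdbn")

DATE = re.compile(r"^(\d{2})-([A-Za-z]{3})-(\d{4})$")
# Fixed element order; names optional, sequences mandatory.
PRIMERS = re.compile(
    r"^(?:fwd_name: (?P<fwd_name>[^\s,]+), )?"
    r"fwd_seq: (?P<fwd_seq>[A-Za-z]+), "
    r"(?:rev_name: (?P<rev_name>[^\s,]+), )?"
    r"rev_seq: (?P<rev_seq>[A-Za-z]+)$"
)


def valid_collection_date(value):
    """Coarse check of the DD-MMM-YYYY format."""
    match = DATE.match(value)
    if not match:
        return False
    day, month = int(match.group(1)), match.group(2).upper()
    return month in MONTHS and 1 <= day <= 31


def parse_pcr_primers(value):
    """Return the primer fields as a dict, or None if the value is invalid."""
    match = PRIMERS.match(" ".join(value.split()))  # rejoin wrapped lines
    if not match:
        return None
    fields = match.groupdict()
    for key in ("fwd_seq", "rev_seq"):
        if not set(fields[key].lower()) <= IUPAC_NUCLEOTIDES:
            return None
    return fields


print(valid_collection_date("15-AUG-2005"))
print(parse_pcr_primers("fwd_name: XXX, fwd_seq: aaatttgggccc, "
                        "rev_name: YYY, rev_seq: gggcccaaattt"))
```

Since a source feature may carry several /PCR_primers qualifiers, a caller would apply parse_pcr_primers to each value separately.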
1.4.2 /evidence qualifier to be replaced
Two new qualifiers designed to replace /evidence will be legal as
of the October 2005 GenBank release : /experiment and /inference .
The current /evidence="not_experimental" qualifier will be replaced
by /inference . The /inference values will be from a controlled list
which is intended to capture several different classes of inferential
methods.
The current /evidence="experimental" qualifier will be replaced
by /experiment. This will be a free-text qualifier in which a brief
description of the nature of the bench experiment which supports
the associated feature can be provided by the submitter.
1.4.3 New /organelle qualifier value
As of the October 2005 GenBank release, a new value for the /organelle
qualifier will be legal : hydrogenosome
This will support the annotation of sequences from anaerobic protozoa
and fungi, for which the hydrogenosome has a role in anaerobic respiration.
1.4.4 Two new CDS qualifiers
As of the October 2005 GenBank release, two new CDS feature qualifiers
will be introduced:
/trans_splicing
/ribosomal_slippage
Coding regions involved in such processes will be more easily identified
with the addition of these qualifiers.
1.4.5 New /exception qualifier value
Coding regions for which the conceptual protein translation differs from
the supplied /translation qualifier are flagged with an /exception
qualifier. The value :
"rearrangement required for product"
will be legal for this qualifier as of the October 2005 GenBank release.
1.4.6 /repeat_unit qualifier to be replaced
Two new qualifiers designed to replace /repeat_unit will be legal as
of the October 2005 GenBank release : /repeat_unit_seq and /repeat_unit_range .
The current qualifier accommodates both integer ranges (e.g., "10..20") and
characters that represent a repeat unit pattern (e.g., (AT)2(AA)5 ). Introducing
a distinct qualifier for each of these representations will make it easier
to submit and validate them.
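Software that handles existing /repeat_unit values could separate the two representations along the following lines. The mapping is inferred from the qualifier names (ranges to /repeat_unit_range, everything else to /repeat_unit_seq) and is an assumption, not a published conversion rule. The helper name is hypothetical.

```python
import re

# An integer range such as "10..20".
RANGE = re.compile(r"^\d+\.\.\d+$")


def split_repeat_unit(value):
    """Return (assumed new qualifier name, value) for a /repeat_unit value."""
    value = value.strip()
    if RANGE.match(value):
        return ("repeat_unit_range", value)
    # Anything that is not a plain integer range is treated as a pattern.
    return ("repeat_unit_seq", value)


for example in ("10..20", "(AT)2(AA)5"):
    print(split_repeat_unit(example))
```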