[Genbank-bb] GenBank Release 154.0 Now Available

Mark Cavanaugh cavanaug at ncbi.nlm.nih.gov
Fri Jun 16 01:29:48 EST 2006

Greetings GenBank Users,

  GenBank Release 154.0 is now available via FTP from the National
Center for Biotechnology Information (NCBI):

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 154.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 154.0

  Close-of-data for GenBank 154.0 occurred on 06/09/2006. Uncompressed, the
Release 154.0 flatfiles require roughly 222 GB (sequence files only)
or 232 GB (including the 'short directory', 'index' and the *.txt files). 
The ASN.1 data require approximately 192 GB.

Statistics for non-WGS sequences:

  Release  Date       Base Pairs   Entries

  153      Apr 2006   61582143971  56620500
  154      Jun 2006   63412609711  58890345

And for WGS sequences:

  Release  Date        Base Pairs   Entries

  153      Apr 2006    67488612571  13573144
  154      Jun 2006    78858635822  17733973

  During the 59 days between the close dates for GenBank Releases 153.0
and 154.0, the non-WGS portion of GenBank grew by 1,830,465,740 basepairs
and by 2,269,845 sequence records. During that same period, 2,184,755 records
were updated. An average of about 75,500 non-WGS records were added and/or
updated per day.

  Between releases 153.0 and 154.0, the WGS component of GenBank grew by
11,370,023,251 basepairs and by 4,160,829 sequence records.

  The combined (WGS and non-WGS) basepair growth of 13,200,488,991 bases
experienced for GenBank 154.0 represents the largest single-release increase
in the history of the database.
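
  The growth figures quoted above can be re-derived from the two statistics
tables. The following Python sketch is illustrative only: the variable and
function names are ours, while the 59-day interval and the count of updated
records are taken from the text.

```python
# Figures copied from the statistics tables above: (base pairs, entries).
non_wgs = {153: (61582143971, 56620500), 154: (63412609711, 58890345)}
wgs = {153: (67488612571, 13573144), 154: (78858635822, 17733973)}

def growth(table):
    """Return (base-pair growth, record growth) from Release 153 to 154."""
    (bp_old, n_old), (bp_new, n_new) = table[153], table[154]
    return bp_new - bp_old, n_new - n_old

non_wgs_bp, non_wgs_records = growth(non_wgs)
wgs_bp, wgs_records = growth(wgs)
combined_bp = non_wgs_bp + wgs_bp

# Records added plus records updated, averaged over the 59 days
# between the two close dates.
updated_records = 2184755
per_day = (non_wgs_records + updated_records) / 59

print(f"non-WGS:  {non_wgs_bp:,} bp, {non_wgs_records:,} records")
print(f"WGS:      {wgs_bp:,} bp, {wgs_records:,} records")
print(f"combined: {combined_bp:,} bp; about {per_day:,.0f} records per day")
```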

  For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 154.0 and Upcoming Changes) have been appended
below.

		** Important Note #1 **

  A new protein residue abbreviation for the 22nd naturally occurring
amino acid, pyrrolysine, will become legal in GenBank protein sequences
as of October 2006 (Release 156.0). Please see Section 1.4.1 for further
details.

		** Important Note #2 **

  After recent problems generating the 'index' files which normally
accompany GenBank Releases, these files are once again being provided,
though without any EST content, and without most GSS content. See Section
1.3.3 for further details. NCBI is considering ceasing support for the
index files, so we strongly encourage affected users to review that section
and provide feedback.

  Release 154.0 data, and subsequent updates, are available now via
NCBI's Entrez and BLAST services.

  As a general guideline, we suggest first transferring the GenBank release
notes (gbrel.txt) whenever a release is being obtained. Check to make sure
that the date and release number in the header of the release notes are
current (e.g., June 15 2006, 154.0). If they are not, interrupt the
remaining transfers and then request assistance from the NCBI Service Desk.

  A comprehensive check of the headers of all release files after your
transfers are complete is also suggested. Here's how one might go about
this on a unix platform with csh/tcsh :

	set files = `ls gb*.*`
	foreach i ($files)
		head -10 $i | grep Release
	end

Or, if the files are compressed, perhaps:

	gzcat $i | head -10 | grep Release
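
  Where csh is not convenient, the same check could be sketched in Python.
This is a minimal sketch and not an NCBI-supplied tool: the function names
and the "gb*.*" pattern are our own choices, and it assumes only that the
release number appears within the first ten lines of each file, as the
commands above do.

```python
import glob
import gzip

def header_lines(path, count=10):
    """Return the first `count` lines of a flatfile, gzip-compressed or not."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="ascii", errors="replace") as handle:
        # zip() stops after `count` lines without reading the rest of the file.
        return [line.rstrip("\n") for _, line in zip(range(count), handle)]

def mentions_release(path, release="154.0"):
    """True if a header line contains both 'Release' and the expected number."""
    return any("Release" in line and release in line
               for line in header_lines(path))

if __name__ == "__main__":
    # Report every gb* file in the current directory.
    for name in sorted(glob.glob("gb*.*")):
        status = "ok" if mentions_release(name) else "CHECK"
        print(f"{status:5} {name}")
```

Files reported as CHECK would be the ones to inspect by hand before
contacting the Service Desk.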

  If you encounter problems while ftp'ing or uncompressing Release
154.0, please send email outlining your difficulties to:

	info at ncbi.nlm.nih.gov

Mark Cavanaugh, Vladimir Alekseyev, Aleksey Vysokolov, Michael Kimelman

1.3 Important Changes in Release 154.0

1.3.1 New JOURNAL type for Pre-Grant Patent Publications

  Sequences associated with granted patents from the US Patent and
Trademark Office (USPTO) typically have references that look like this:

  REFERENCE   1  (bases 1 to 22)
    AUTHORS   Stewart,L.J.
    TITLE     Screening methods for identifying ligands
    JOURNAL   Patent: US 6950757-A 2 27-SEP-2005;

The "Patent:" token indicates that the JOURNAL line pertains to a
patent document, as opposed to a published article in the scientific
literature.

But sequence data can be available well in advance of the point at which
an actual patent has been granted. As of GenBank Release 154 in June 2006,
a patent sequence associated with a "Pre-Grant Publication" is now
indicated via a slight change to the JOURNAL line:

  REFERENCE   1  (bases 1 to 190)
    AUTHORS   Xu,M. and Humphreys,R.
    TITLE     Inhibition of li expression in mammalian cells
    JOURNAL   Pre-Grant Patent: US 20060008448A1 1 12-JAN-2006;

The introduction of "Pre-Grant Patent:" at the start of the JOURNAL
line distinguishes sequences associated with these two different
states in USPTO's patenting process.

Note that the pre-grant identifier in this example is alphanumeric and ends
in a document-type code ("A1") written without a hyphen, unlike the
hyphenated "-A" suffix in the granted-patent example above.
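
Software that reads flatfiles may want to tell the two JOURNAL forms apart.
The Python sketch below shows one possible way; the function name and its
return values are hypothetical, and it relies only on the two leading
tokens shown in the examples above.

```python
def patent_journal_kind(line):
    """Classify a flatfile JOURNAL line by its leading patent token.

    Returns "pre-grant", "granted", or None for non-patent references.
    """
    text = line.strip()
    # Drop the JOURNAL keyword itself, if present, and the padding after it.
    if text.startswith("JOURNAL"):
        text = text[len("JOURNAL"):].strip()
    if text.startswith("Pre-Grant Patent:"):
        return "pre-grant"
    if text.startswith("Patent:"):
        return "granted"
    return None

granted = "    JOURNAL   Patent: US 6950757-A 2 27-SEP-2005;"
pre_grant = "    JOURNAL   Pre-Grant Patent: US 20060008448A1 1 12-JAN-2006;"
print(patent_journal_kind(granted), patent_journal_kind(pre_grant))
```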

1.3.2 Organizational changes

  The total number of sequence data files increased by 32 with this release:

  - the BCT division is now comprised of  15 files (+1)
  - the CON division is newly split into  3 pieces (+2)
  - the EST division is now comprised of 528 files (+16)
  - the GSS division is now comprised of 177 files (+3)
  - the HTG division is now comprised of  83 files (+2)
  - the INV division is now comprised of   9 files (+1)
  - the PAT division is now comprised of  24 files (+4)
  - the PLN division is now comprised of  18 files (+1)
  - the VRL division is now comprised of   6 files (+1)
  - the VRT division is now comprised of  11 files (+1)

  In addition, the Short-Directory 'index' file has also been split into
  three pieces:

  gbsdr1.txt : non-EST and non-GSS short directory entries
  gbsdr2.txt : EST short directory entries
  gbsdr3.txt : GSS short directory entries

1.3.3 Changes in the content of index files

  As described in the GB 153 release notes, the 'index' files which accompany
GenBank releases (see Section 3.3) are considered to be a legacy data product by
NCBI, generated mostly for historical reasons. FTP statistics since January 2005
seem to support this: the index files are transferred only half as frequently as
the files of sequence records. The inherent inefficiencies of the index file
format also lead us to suspect that they have little serious use by the user
community, particularly for EST and GSS records.

  The software that generated the index file products received little
attention over the years, and finally reached its limitations in
February 2006 (Release 152.0). The required multi-server queries which
obtained and sorted many millions of rows of terms from several different
databases simply outgrew the capacity of the hardware used for GenBank
Release generation.

  Our short-term solution is to cease generating index-file content
for all EST sequence records, and for GSS sequence records that originate
via direct submission to NCBI. GenBank 154.0 thus contains these ten index
files, which lack all EST and most GSS content:


  In addition, a version of gbacc.idx which encompasses the entirety of the
release was built manually, but note that the first field contains just an
accession number, rather than Accession.Version, and that the file is unsorted.

  These 'solutions' are really just stop-gaps, and we will likely pursue
one of two options within the next year:

a) Cease support of the 'index' file products altogether.

b) Provide new products that present some of the most useful data from
   the legacy 'index' files, and cease support for other types of index data.

  If you are a user of the 'index' files associated with GenBank files, we
encourage you to make your wishes known, either via the GenBank newsgroup,
or via email to NCBI's Service Desk:

   info at ncbi.nlm.nih.gov

  Our apologies for any inconvenience that these changes may cause.

1.3.4 GSS File Header Problem

  GSS sequences at GenBank are maintained in two different systems, depending
on their origin, and the dumps from those systems occur in parallel. Because
the second dump (for example) has no prior knowledge of exactly how many GSS
files will be dumped from the first, it does not know how to number its own
output files.

  There is thus a discrepancy between the filenames and file headers for
thirty-three of the GSS flatfiles in Release 154.0. Consider gbgss145.seq :

GBGSS1.SEQ           Genetic Sequence Data Bank
                            June 15 2006

                NCBI-GenBank Flat File Release 154.0

                           GSS Sequences (Part 1)

   86832 loci,    64420446 bases, from    86832 reported sequences

  Here, the filename and part number in the header are "1", though the file
has been renamed as "145" based on the number of files dumped from the other
system.  We will work to resolve this discrepancy in future releases, but the
priority is certainly much lower than many other tasks.
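
  Until the discrepancy is resolved, a script could flag the affected files
by comparing each file's name with the name recorded on the first header
line. This is only a sketch: the function names are ours, and it assumes
that the header begins with the upper-cased filename, as in the
gbgss145.seq example above.

```python
import os

def header_filename(first_line):
    """Return the lower-cased filename token that starts a flatfile header."""
    tokens = first_line.split()
    return tokens[0].lower() if tokens else ""

def header_matches_filename(path, first_line):
    """True if the name recorded in the header agrees with the file's name."""
    actual = os.path.basename(path).lower()
    # A compressed copy carries an extra .gz suffix that the header lacks.
    if actual.endswith(".gz"):
        actual = actual[:-3]
    return header_filename(first_line) == actual

first = "GBGSS1.SEQ           Genetic Sequence Data Bank"
print(header_matches_filename("gbgss145.seq", first))
```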

1.4 Upcoming Changes

1.4.1 New protein residue abbreviation for Pyrrolysine

  Sequence databases use single-letter amino acid abbreviations to
record the primary structure (sequence) of amino acids in a polypeptide.
The table of abbreviations includes only those amino acids that are
encoded in the genetic code and directly inserted by a tRNA during the
process of protein translation.  Post-translational modifications are
not represented in the sequence data itself, but may be described by
features annotated on the sequence.

  The discovery of the 22nd naturally encoded amino acid, pyrrolysine,
and the recent submission of sequence records that should contain
this residue, require the adoption of a new amino acid abbreviation.
Because several letters are assigned to represent different experimental
ambiguities, the only letter still available for use is O (uppercase
letter o).  Scientists working in the field have independently suggested
use of this letter, and it has a reasonable mnemonic, pyrrOlysine.

  IUPAC, the body which is responsible for biochemical nomenclature,
has agreed that Pyl/O will be recommended for this amino acid.

  The consequences for flatfile users are that O will appear in CDS
/translation qualifiers, and that Pyl (the three-letter abbreviation)
will appear in CDS /transl_except qualifiers and in the /product and
/anticodon qualifiers of tRNA features. These changes will take effect
as of the October 2006 GenBank release.

  Sample records in ASN.1, FASTA, GenBank flatfile, and INSDSeq XML
formats will be made available on the NCBI ftp site for the purpose of
testing software prior to the public introduction of 'O' in protein
sequences.

  For BLAST and other sequence similarity search tools, we expect to map
pyrrolysine (O) to unknown (X), as is already done with selenocysteine
(U), the 21st naturally encoded amino acid.  One reason is that the PAM
and BLOSUM substitution matrices do not accommodate these more recently
discovered amino acids.  The other reason is that selenocysteine and
pyrrolysine both appear to be used as active sites in certain enzymes,
and thus do not simply substitute for cysteine or lysine.

  Here are a few literature references which provide more information
about pyrrolysine :

  G. Srinivasan, C. M. James, J. A. Krzycki.  Pyrrolysine encoded by
  UAG in Archaea: charging of a UAG-decoding specialized tRNA.  Science
  2002, 296:1459-1462.

  B. Hao, W. Gong, T.K. Ferguson, C.M. James, J.A. Krzycki, M.K.
  Chan.  A new UAG-encoded residue in the structure of a methanogen
  methyltransferase.  Science 2002, 296:1462-1466.

  C. Polycarpo, A. Ambrogelly, A. Berube, S.M. Winbush, J.A.
  McCloskey, P. F. Crain, J. L. Wood, D. Soll.  An aminoacyl-tRNA
  synthetase that specifically activates pyrrolysine.  Proc. Natl. Acad.
  Sci. (USA) 2004, 101:12450-12454.

  C. Fenske, G.J. Palm, W. Hinrichs.  How unique is the genetic code?
  Angew. Chem. Int. Ed. 2003, 42:606-610.

1.4.2 Protein residue J for leucine/isoleucine ambiguities

  The residue abbreviation J is reserved for mass spectrometry experiments that
cannot distinguish leucine from isoleucine. Although this abbreviation has
been part of the IUPAC recommendations for some time, it has not previously
appeared in protein sequences in the GenBank database.

  As of October 2006, abbreviation J will be legal in CDS /translation
qualifiers, and Xle (the three-letter abbreviation) will be allowed in CDS
/transl_except qualifiers and in the /product and /anticodon qualifiers of
tRNA features.

  J will also be mapped to unknown (X) for the purpose of BLAST and other
sequence similarity search tools.
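
  Local tools that are not yet aware of these residues could apply the same
mapping before running a search. The Python sketch below illustrates the
idea; the names are hypothetical, and it is not a description of how BLAST
itself implements the mapping.

```python
# Residues that the text says are (or will be) treated as unknown (X) in
# similarity searches: selenocysteine (U), pyrrolysine (O), and the
# leucine/isoleucine ambiguity code (J).
TO_UNKNOWN = str.maketrans({"U": "X", "O": "X", "J": "X"})

def mask_for_search(protein):
    """Replace U, O and J with X, leaving all other residues untouched."""
    return protein.translate(TO_UNKNOWN)

print(mask_for_search("MKOLUJA"))
```

Only upper-case residues are handled, on the assumption that the input
comes from a flatfile /translation qualifier.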

1.4.3 /PCR_primers and modified bases

  PCR primers are sometimes constructed which utilize modified bases,
such as those listed in the table of modified bases included in the
Feature Table document:


In October 2006, it will be legal to use modified-base abbreviations
for the /PCR_primers qualifier. For example:

         /PCR_primers="fwd_seq: gcagtt<i>caag<gal q>tggagtgaa, rev_seq:

Here, modified bases inosine and beta,D-galactosylqueosine are included
in the forward sequence of the primer pair, and enclosed between angle
brackets ( <...> ) .

Each pair of angle brackets will include only a single modified base.
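
Parsers that currently treat a primer sequence as a plain string of bases
will need to recognize the bracketed abbreviations. The Python sketch below
shows one way to split such a sequence; the function name and the token
representation are our own assumptions, and the input is the forward
sequence from the example above.

```python
import re

# A token is either one modified-base abbreviation enclosed in <...>
# or a single plain base letter.
_TOKEN = re.compile(r"<([^<>]+)>|([A-Za-z])")

def split_primer(seq):
    """Split a primer sequence into plain bases and ('mod', name) tuples."""
    tokens = []
    pos = 0
    for match in _TOKEN.finditer(seq):
        # Any gap between tokens means the input held unexpected text,
        # for example an unclosed angle bracket.
        if match.start() != pos:
            raise ValueError(f"unexpected text at position {pos}")
        modified, plain = match.groups()
        tokens.append(("mod", modified) if modified else plain)
        pos = match.end()
    if pos != len(seq):
        raise ValueError(f"unexpected text at position {pos}")
    return tokens

fwd = split_primer("gcagtt<i>caag<gal q>tggagtgaa")
print([t for t in fwd if isinstance(t, tuple)])
```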

1.4.4 Introduction of /mobile_element qualifier

  For repeat_region features, the /transposon and /insertion_seq
qualifiers can be used to describe two specific classes of mobile
elements. But not all mobile elements fall into these two categories,
so a new structured /mobile_element qualifier will be introduced
as of GenBank 157.0 in December 2006. The preliminary description
of the new qualifier is as follows:

  Qualifier: /mobile_element

  Description: Type, and name (or identifier), of the mobile element
  which is described by the parent feature.

  Value format: <mobile_element_type>:<mobile_element_id>
  Where mobile element type is one of the following: transposon,
  integron, insertion_sequence, other .

  Example: /mobile_element="transposon:Tnp9"

  Further details about this new qualifier, the domain of mobile element
types in particular, will be provided in these release notes and via the
GenBank newsgroup as they become available.
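
  As an illustration of the value format, the Python sketch below splits
and checks a /mobile_element value. It follows the preliminary description
above, which may still change; the function name is hypothetical, and
treating an empty identifier as an error is our assumption.

```python
# Mobile element types listed in the preliminary description.
MOBILE_ELEMENT_TYPES = {"transposon", "integron", "insertion_sequence", "other"}

def parse_mobile_element(value):
    """Split '<mobile_element_type>:<mobile_element_id>' into its two parts."""
    element_type, sep, element_id = value.partition(":")
    if not sep or not element_id:
        raise ValueError(f"expected '<type>:<id>', got {value!r}")
    if element_type not in MOBILE_ELEMENT_TYPES:
        raise ValueError(f"unknown mobile element type: {element_type!r}")
    return element_type, element_id

print(parse_mobile_element("transposon:Tnp9"))
```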

1.4.5 New /mol_type value

  A new legal molecule type value for viral cRNA sequences will be
introduced as of October 2006:

	/mol_type="viral cRNA"

  This value will also be legal for the molecule type field on the
LOCUS line of the GenBank flatfile format. Additional details about
the usage of this new molecule type value will be provided via
these release notes and the GenBank newsgroup.

1.4.6 Feature location syntax X.Y to be discontinued

  The Feature Table currently supports feature locations of the
format X.Y, to represent a base position which is greater than or
equal to X, and less than or equal to Y. For example:

	misc_feature    1.10..20
	misc_feature    join(100..150,200.210..250)

  In the first example, the misc_feature starts somewhere between
bases 1 and 10 (inclusive), and ends at basepair 20. In the second,
the 51 bases from 100..150 are joined together with a second basepair
interval, which could be anywhere from 200..250 to 210..250 .

  Although this syntax seems like a reasonable way to capture an
uncertain interval, it is used for features on a vanishingly small
number of sequence records, most database submission mechanisms
don't support it, and the meaning of its use in a join() context
is not entirely clear.

  As of October 2006, this type of location will no longer be 
supported. Those records with features which utilize X.Y locations
will be reviewed and converted to a non-uncertain format prior to
that date.
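
  Users who want to find out whether their own records or parsers depend on
this syntax could scan location strings for it. The Python sketch below is
one possible approach, not an NCBI tool: the function name is ours, and the
pattern assumes that a single dot between two integers, not preceded by a
letter or digit, always denotes an X.Y position.

```python
import re

# X.Y: two integers separated by exactly one dot. The lookbehind skips
# digits that belong to an identifier such as an Accession.Version prefix.
_UNCERTAIN = re.compile(r"(?<![A-Za-z0-9_])(\d+)\.(\d+)")

def uncertain_positions(location):
    """Return (X, Y) integer pairs for every X.Y position in a location."""
    return [(int(x), int(y)) for x, y in _UNCERTAIN.findall(location)]

for loc in ("1.10..20", "join(100..150,200.210..250)", "100..150"):
    print(loc, uncertain_positions(loc))
```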

1.4.7 /operon to become legal for rRNA features

  With the October 2006 GenBank release, the /operon qualifier will
be legal for use on rRNA features.
