GCG database reorganization - the time has come

mathog at seqaxp.bio.caltech.edu
Fri Jun 30 12:49:44 EST 1995

Hi folks,

A couple of days ago I finished downloading and installing Genbank 89.
This exercise pointed out how GCG's method of handling databases is
becoming increasingly cumbersome as the data increases in size. 

For instance, at the end of the conversion from Genbank format to GCG
format, the EST division of Genbank took up 1.15 Gbytes of disk space.  Of
this, 560 Mb was the Genbank formatted GBEST.SEQ file, and the rest was the
GCG formatted version of the same.  (The GCG version is a tiny bit bigger
because of the offset, numbers and names files.)  One can delete the Genbank
formatted file as soon as the GCG formatted one is generated - but NOT
before.  It is also worth noting that the reference section took up 5.3X the
disk space of the sequence section, which is only slightly higher than the
overall ratio (for the whole database) of 4.6X.

The entire Genbank 89 distribution in GCG format takes up 1.1 Gbytes, so
it cannot share a 2Gb disk with the scratch directory.  (I don't delete
the existing GCG database until the new one is ready - too often something
breaks during the download and we would be without a database.)

In order to handle the conversions I use this configuration:

 1 2Gb disk for a scratch area to convert Genbank -> GCG
 1 2Gb disk to hold the final, converted, GCG form of Genbank
 1 2Gb disk to hold the GCG (and other) software.

GBEST is growing at something like 30% per release.  In two releases it
may not be possible to unpack it on a single 2Gb disk!
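
To put numbers on that, here is a rough projection in Python.  It assumes
the 1.15 Gbyte scratch footprint quoted above and a steady 30% growth per
release; both the steady rate and the function name are my own
simplifications, not anything GCG or Genbank publishes.

```python
def project_sizes(current_gb, growth_per_release, n_releases):
    """Return the projected footprint for the current release and the next n."""
    sizes = [current_gb]
    for _ in range(n_releases):
        # Compound growth: each release is (1 + rate) times the previous one.
        sizes.append(sizes[-1] * (1.0 + growth_per_release))
    return sizes


if __name__ == "__main__":
    DISK_GB = 2.0  # nominal capacity of the scratch disk
    for n, size in enumerate(project_sizes(1.15, 0.30, 3)):
        verdict = "fits" if size < DISK_GB else "does NOT fit"
        print("release +%d: %.2f Gbytes (%s)" % (n, size, verdict))
```

Two releases out the footprint is already about 1.94 Gbytes, which leaves
essentially no headroom on a 2Gb disk once file system overhead is counted,
and the release after that is well over.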

I suggest that the following changes are urgently needed:

1.  Modification of genbanktogcg so that it can read a compressed (.Z)
Genbank file.  I.e., why decompress the entire file and leave it lying about
on disk when it could just as easily be decompressed on the fly?
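
As an illustration of the idea only (this is not how genbanktogcg works),
here is a minimal Python sketch that reads entries straight out of a
compressed stream without ever writing the decompressed file to disk.
Assumptions: Python's standard library has no reader for the LZW .Z format,
so gzip stands in for it here; in practice one would pipe the .Z file
through something like "uncompress -c".  The function names are made up.
The only format detail relied on is that each Genbank flat file entry ends
with a line containing just "//".

```python
import gzip
import io


def iter_entries(stream):
    """Yield one flat file entry at a time from an open text stream."""
    lines = []
    for line in stream:
        lines.append(line)
        if line.rstrip() == "//":  # end-of-entry marker
            yield "".join(lines)
            lines = []
    if any(l.strip() for l in lines):
        # Trailing text with no terminator; hand it back rather than drop it.
        yield "".join(lines)


def iter_compressed_entries(source):
    """Decompress on the fly; source is a path or a binary file object.

    Any header text before the first entry ends up attached to the first
    entry in this sketch.
    """
    with gzip.open(source, "rt", encoding="ascii", errors="replace") as stream:
        for entry in iter_entries(stream):
            yield entry


if __name__ == "__main__":
    # Tiny made-up two-entry file, compressed in memory for the demo.
    raw = ("LOCUS       DEMO1\nORIGIN\n        1 acgt\n//\n"
           "LOCUS       DEMO2\nORIGIN\n        1 ttga\n//\n")
    packed = io.BytesIO(gzip.compress(raw.encode("ascii")))
    for entry in iter_compressed_entries(packed):
        print(entry.splitlines()[0])
```

Only one entry is held in memory at a time, so the scratch disk needs room
for the compressed input and the converted output but not for the
decompressed original.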

2.  Modification of the REFERENCE part of the database to leave the
REFERENCE information compressed.  There are two parts to this.  First,
huge numbers of the GBEST reference entries are very, very similar to
each other (same lab, organism, cloning vector, etc, etc, etc.).  These
could be represented as Template 1 + changes.  Second, and more generally,
the reference section could very easily be left in a compressed form
and decompressed on the fly.  It's just text and spaces and should easily
compress by 3X or so.  Modern processors are so much faster than their
disks that this might even speed up retrieval times.  In any case, we do
very few stringsearches compared to sequence searches of various types, so
even if it did slow FETCH and STRINGSEARCH down a bit, in the grand scheme
of things it would be a minor effect.
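
Both halves of this can be sketched with one mechanism.  The Python below
uses zlib's preset dictionary feature (the zdict argument) as a stand-in for
"Template 1 + changes": each entry is compressed on its own, so it can still
be fetched individually, but text it shares with the template costs almost
nothing.  This is my own illustration, not a GCG format; pack, unpack and
the sample records are all made up.

```python
import zlib


def pack(entry, template):
    """Compress one reference entry, using the template as a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=template)
    return comp.compress(entry) + comp.flush()


def unpack(blob, template):
    """Recover one entry; needs the same template that was used to pack it."""
    decomp = zlib.decompressobj(zdict=template)
    return decomp.decompress(blob) + decomp.flush()


if __name__ == "__main__":
    # Hypothetical EST-like reference text; real entries are longer.
    template = (b"DEFINITION  EST clone, 5' end, mRNA sequence.\n"
                b"SOURCE      human.\n"
                b"REFERENCE   1\n"
                b"  AUTHORS   Example,A. and Sample,B.\n"
                b"  TITLE     A hypothetical cDNA sequencing project\n"
                b"COMMENT     Library: hypothetical; Vector: hypothetical\n")
    entry = template.replace(b"5' end", b"3' end")

    with_template = pack(entry, template)
    plain = zlib.compress(entry, 9)
    print(len(entry), len(plain), len(with_template))
    assert unpack(with_template, template) == entry
```

The existing offset files would then point at compressed blobs instead of
plain text, and FETCH would call the equivalent of unpack on just the entry
it needs.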

3.  Start thinking about breaking the Genbank database up into smaller
pieces that can be relocated to multiple locations.  Soon all of Genbank
won't fit on a single 2Gb disk, and not long after that, individual
divisions won't fit on that size disk.  There should be an option to fragment
these files into, say, 200 Mb pieces, and the database software should
understand how to deal with such pieces as a single logical unit.  This would
be preferable to the kludgy methods that are available now, such as splitting
the database across disks with logicals like:

  $ define genbankdir disk1:[genbankdir],disk2:[genbankdir]

or creating virtual huge disks by making shadow sets of multiple disks.
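
To make "a single logical unit" concrete, here is a minimal Python sketch of
the bookkeeping involved: a file is cut into fixed-size pieces (which could
live on different disks), and a reader maps a logical offset onto the right
piece.  The class and function names are hypothetical, and nothing here
reflects how GCG actually stores its files.

```python
import bisect
import os
import tempfile


def split_file(path, piece_size, out_dir):
    """Cut path into pieces of at most piece_size bytes; return their paths.

    A sketch: each piece is read into memory whole, which a real tool
    handling 200 Mb pieces would avoid.
    """
    pieces = []
    with open(path, "rb") as src:
        while True:
            chunk = src.read(piece_size)
            if not chunk:
                break
            name = os.path.join(
                out_dir, "%s.%03d" % (os.path.basename(path), len(pieces)))
            with open(name, "wb") as dst:
                dst.write(chunk)
            pieces.append(name)
    return pieces


class SplitFile(object):
    """Read-only view of several piece files as one logical byte range."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.starts = []  # logical offset at which each piece begins
        total = 0
        for p in self.paths:
            self.starts.append(total)
            total += os.path.getsize(p)
        self.size = total

    def read_at(self, offset, length):
        """Return up to length bytes starting at logical offset."""
        if offset < 0 or length < 0:
            raise ValueError("offset and length must be non-negative")
        end = min(offset + length, self.size)
        chunks = []
        while offset < end:
            # Last piece whose starting offset is <= the requested offset.
            i = bisect.bisect_right(self.starts, offset) - 1
            with open(self.paths[i], "rb") as f:
                f.seek(offset - self.starts[i])
                data = f.read(end - offset)  # stops at the end of this piece
            if not data:
                break  # piece shorter than expected; give up quietly
            chunks.append(data)
            offset += len(data)
        return b"".join(chunks)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "demo.ref")
        with open(src, "wb") as f:
            f.write(bytes(range(256)) * 4)
        view = SplitFile(split_file(src, 300, d))
        print(len(view.paths), view.size, len(view.read_at(250, 400)))
```

Because the existing offset files already record where each entry starts,
only the lowest level read routine would need to know about the pieces.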


David Mathog
mathog at seqaxp.bio.caltech.edu
Manager, sequence analysis facility, biology division, Caltech 
