rate of growth of the sequence databases

S S Sturrock sss at castle.ed.ac.uk
Mon Mar 8 08:34:34 EST 1993

In article <73E0FEC5429F007D61 at HKUMD1.HKU.HK> HRMBDKC at HKUMD1.HKU.HK writes:
>I'd like some information about the rate of growth of the sequence
>databases both in the past and projections for the future. This is

It is generally accepted that the databases are growing exponentially, with
doubling times of around 18 months.  This information is often given in the
release notes for a particular database.  Note also that many databases are
now available for free, or on CD-ROM for a small charge (e.g. EMBL), so you
need not be limited to the databases that a particular software company sees
fit to supply in its own format.  Add to this the updates available by FTP
and things can get interesting.
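
To put rough numbers on that for disc space planning, here is a minimal
sketch in Python.  It assumes steady exponential growth with the 18 month
doubling time quoted above, which is only a rule of thumb.  The function
names and the 100 MB starting size are made up for illustration and are not
figures for any real database release.

```python
import math

# Doubling time quoted in the post; a rule of thumb, not a measurement.
DOUBLING_TIME_MONTHS = 18.0


def projected_size(current_size, months_ahead, doubling_months=DOUBLING_TIME_MONTHS):
    """Size after months_ahead months, assuming steady exponential growth."""
    return current_size * 2.0 ** (months_ahead / doubling_months)


def months_until(current_size, capacity, doubling_months=DOUBLING_TIME_MONTHS):
    """Months until a database of current_size outgrows capacity (same units)."""
    if current_size <= 0 or capacity <= 0:
        raise ValueError("sizes must be positive")
    return doubling_months * math.log2(capacity / current_size)


if __name__ == "__main__":
    size_mb = 100.0  # hypothetical starting size, for illustration only
    for years in (1, 3, 5):
        print(years, "years:", round(projected_size(size_mb, 12 * years)), "MB")
```

On these assumptions a database grows about tenfold in five years, so a disc
bought with only a little headroom will not last long.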

>a serious consideration when setting up one's own on-site database
>a serious consideration when subscribing to the database services
>currently available. This is also a consideration in deciding which
>type of main-frame system to go for.

This is always a difficult choice.  The wide range of available software and
systems, not to mention the many wild claims made for the sensitivity and
such like of differing systems/code, makes this a nearly impossible question
to answer.  Database size is really just a matter of disc space, unless you
are dealing with specific implementations of searching code whose memory
requirements may make the machine you buy obsolete very quickly, or at least
put you in line for large memory upgrades, not that manufacturers would
complain.  Also, as databases increase in size the time taken to search them
will increase, so your hardware may become obsolete quickly for that reason
too.  Depending on how rigorous you want your searches to be, you can go for
one of the heuristic algorithms such as FASTA or BLAST, which allow quick
searches on cheap machines but may miss some interesting alignments.  In many
cases you may not see any difference.  There are also various
implementations of the classic Smith & Waterman dynamic programming
algorithm, which is exhaustive: it compares every base/residue in the
database with every one in the query and so can be very slow.  Again, it
depends on who wrote the code.  I know of implementations of the same
algorithm which differ in performance by orders of magnitude on the same
hardware, so any claim that a heuristic method is much faster than the
exhaustive algorithm depends on which implementation of that algorithm it
was compared against.
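
To show why the exhaustive approach is slow, here is a minimal Python sketch
of Smith & Waterman local alignment scoring.  It is a deliberately
simplified illustration and not any particular package's implementation.
The match/mismatch/gap values and the function names are assumptions chosen
for the example, and real programs normally use substitution matrices and
affine gap penalties instead.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between two sequences.

    Simplified scoring (assumed for illustration): a fixed match/mismatch
    score and a linear gap penalty.
    """
    prev = [0] * (len(target) + 1)  # previous row of the scoring matrix
    best = 0
    for q in query:
        curr = [0]  # first column is always zero
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            up = prev[j] + gap        # gap in the target
            left = curr[j - 1] + gap  # gap in the query
            cell = max(0, diag, up, left)  # zero floor makes the alignment local
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def search(query, database):
    """Score query against every entry; return (name, score) pairs, best first."""
    hits = [(name, smith_waterman_score(query, seq)) for name, seq in database.items()]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Every cell of a len(query) by len(target) matrix is filled in for every
database entry, so the total work grows in direct proportion to the size of
the database.  That is where the search time goes, and it is also why two
implementations of this same loop can differ so much in speed.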

That's the long answer.  The short answer is that you just can't tell.  New
hardware is coming along at a rate of knots, and implementing code on these
platforms takes time, with various people claiming speed records and someone
always waiting in the wings to go faster on some other machine; this is
especially true of parallel architectures.  You must also consider that if
you buy software you are going to be paying for support, which needs to be
budgeted for each year.  The popular suites such as GCG offer a huge array
of functions/routines for a reasonable price, but performance will suffer
since they are normally run on low performance hardware.  This isn't really
helping much, is it?

Either way, it's going to cost in time or money.

Assess just what your users want and how patient they are willing to be.
In many cases the real question is how likely they are to use the software
to best effect if it takes several hours to see the results of changing one
parameter, which makes them more likely to accept the defaults.  IMHO this
can lead to people not finding all the possible clues to homology that are
available given a little time and tinkering.  Most of all, don't just assume
that since a computer gave these results they are the gospel truth.  The
code is produced by programmers and may be biased in terms of weighting
schemes, significance of results and so on.  There are programs out there
which can produce very convincing results which are just plain wrong, but
how can you tell?  It is a well known fact that *ALL* code contains bugs;
how dangerous these are depends on how long they go unseen.  Well supported
software should (I stress *SHOULD*) be mature enough to have a low level of
errors and rapid corrections.

I hope this helps.

PS: Disclaimer.  These are my own personal views and are not necessarily
shared by my employers.
Shane Sturrock, Biocomputing Research Unit, Darwin Building, Mayfield Road,
University of Edinburgh, Scotland, Commonwealth of Independent Kingdoms.  :-)

Civilisation is a Haggis Supper with salt and sauce and a bottle of Irn Bru.
