In article <9206092231.AA21818 at temin.lanl.gov>, pgil at TEMIN.LANL.GOV (Paul Gilna) writes...
[Heavily edited...]
>Regardless of Medline/Genbank/Entrez as the data
>source, the sentiment that I am picking up is that centralised access
>to the data is essential.
Yup, that is how I feel.
>etc.
>While neither system can be claimed to be perfect, it's clear that there
>is a strong preference for the login/server path over off-line
>access, and that we are doing a good job in meeting the demands on this
>service.
Yes, the service is pretty good; you folks deserve more credit than you get.
>Yet look at the evidence we are facing today; the available servers are
>processing literally thousands of queries per day, and this rate is
>climbing with no sign of abating; GenBank is already placing load
>limiters on the retrieval queues, and I would guess that one if not
>more of the CPUs dedicated to this service is in permanent FASTA
>mode.
Good, their machines are doing what they are supposed to be doing with
minimal idle time.
>Together, these factors dramatically increase the significance
>of the consequences of a system failure (or even planned downtime) to a
>community becoming increasingly dependent on a centralised data
>distribution mechanism--CD-ROMs may be slow, but you're going to get,
>er, annoyed when you cannot get your FASTA results back because you
>are behind 500 other jobs and you are three time zones away.
>And I'll bet that's only a glimmer of what would happen if the entire Medline
>user community could suddenly dial or internet in to NLM!
>etc.
There are a couple of different strings in this last part. You don't have
to know too much about industry to know that they have the same problems
and that they have time-tested solutions to them. I'm going to use VAXes
as an example here (no flames, please; it's just an example, and I'm sure
Unix equivalents exist).
String 1. Current services are getting overloaded.
Solutions: Put a few more computers on line or buy faster computers. The
incremental cost of adding a machine at an existing center is negligible;
having everybody go out and buy one to duplicate the service would cost a
fortune. (Ignore the hardware costs; think about the labor and
maintenance.) I don't know how difficult it will be for the various
biology services to set up the appropriate load sharing, but it can't be
that hard; after all, it's been available on VAXclusters for, what, ten
years now?
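
For what it's worth, here is a rough sketch (Python, purely
illustrative) of the dispatch half of load sharing: hand each incoming
query to whichever machine in the pool currently has the fewest jobs
outstanding. The host names and the submission step are placeholders I
made up; a real service would presumably watch true machine load rather
than a simple counter.

    # Minimal least-loaded dispatcher; hypothetical host names.
    HOSTS = ["fasta1.example.org", "fasta2.example.org",
             "fasta3.example.org"]

    class Dispatcher:
        def __init__(self, hosts):
            # jobs handed to each host and not yet reported finished
            self.pending = {h: 0 for h in hosts}

        def submit(self, query):
            # pick the host with the fewest outstanding jobs
            host = min(self.pending, key=self.pending.get)
            self.pending[host] += 1
            # ... ship `query` off to `host` here ...
            return host

        def finished(self, host):
            # call when a host reports a job complete
            self.pending[host] -= 1

Adding a machine to the pool is then one more name in HOSTS, which is
exactly the incremental-cost point above.
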
String 2a. "We're putting all of our eggs in one basket, what if it breaks?"
(Single machine failures bringing down a service.)
Solution: Implement distributed sharing as above.
String 2b. "We're putting all of our eggs in one basket, what if it breaks?"
(Data center failure due to hurricane, power failure, etc. bringing down a
service.)
Solution: Roll over to alternate data centers. It's almost this way now
for some services - if NCBI BLAST is down we go to GenBank and vice versa.
Distributed processing helps out here too. For instance, a large
VAXcluster can be physically located at multiple data centers and it will
not fail even if several of the data centers go down. If the service under
discussion can go away for a couple of hours, then it would also be
possible to contract with one of the (many) disaster recovery companies, or
even a big lab, for a backup site.
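
To make the rollover concrete, here is a small Python sketch of a
client that tries each data center in turn and uses the first one that
answers. The server names are invented, and the liveness test is just
"can I open a TCP connection"; the actual submission mechanism (mail,
telnet session, whatever) would slot in where the comment is.

    import socket

    # invented names; substitute the real service hosts
    SERVERS = ["server1.example.org", "server2.example.org"]

    def reachable(host, port=23, timeout=10):
        # crude liveness test: can we open a TCP connection at all?
        try:
            socket.create_connection((host, port), timeout=timeout).close()
            return True
        except OSError:
            return False

    def run_search(query):
        # try each data center in order; fail only if all are down
        for host in SERVERS:
            if reachable(host):
                # ... submit `query` to `host` here ...
                return host
        raise RuntimeError("all data centers are down")
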
On a separate note, I'm not a big fan of CD-ROM distributions because:
1. Either you need as many drives as you've got databases, or you have
to load it all onto a different disk anyway, or you have to swap
CD-ROMs all the time.
2. The disk production costs are too high. Send it out on 8 mm tape and
you can probably get the per-unit cost down to $20 or so (plus we can
reuse the tapes). Sure, we'd have to buy 8 mm tape drives and disks big
enough to hold the data, but at least we'd get a backup method and
flexibility for future changes.
3. Isn't the data capacity of a CD-ROM about 600 MB? When we get there
in a few years, are we going to have multivolume CD-ROM sets?
4. I'm worried about CD-ROM format incompatibilities.
My preferred solution: distribute database releases as difference sets.
These will always be a lot smaller than the whole database, and you can
stay with existing distribution channels for a while. If anybody
wants to start from scratch, they just obtain the full series of updates.
Nothing novel about this; ACEDB is distributed this way. For GenBank files
this wouldn't be terribly difficult to implement at either end; we'd just
need a file that said something like:

    discard entries X83423,...
    include updated entries X83423...
    include new entries...
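
As a sanity check that the receiving end really does stay simple, here
is a rough Python sketch that applies such a difference file to a local
copy, modeling the database as a dictionary keyed by accession number.
The three keywords follow the made-up format above, and new_entries
stands in for the entry texts shipped with the update; real GenBank
flat-file parsing would of course be more involved.

    def apply_diff(database, diff_lines, new_entries):
        # database: accession -> entry text (the local copy)
        # new_entries: accession -> entry text shipped with the update
        for line in diff_lines:
            words = line.split()
            if line.startswith("discard entries"):
                for acc in words[2:]:
                    # drop superseded or withdrawn entries
                    database.pop(acc.strip(","), None)
            elif line.startswith(("include updated entries",
                                  "include new entries")):
                for acc in words[3:]:
                    acc = acc.strip(",")
                    database[acc] = new_entries[acc]
        return database

Ship the difference file plus the new entry texts over the existing
channels and every site stays current without anyone re-sending the
whole release.
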
Anyway, that's what I think.
David Mathog
mathog at seqvax.caltech.edu
manager, sequence analysis facility, biology division, Caltech