Updating databases

Reinhard Doelz doelz at comp.bioz.unibas.ch
Fri Jan 6 02:34:36 EST 1995

David Mathog (mathog at seqvax.caltech.edu) wrote:
: It would be *really* nice if the folks who maintain and distribute
: databases would make it a bit easier to automate updates.  It isn't all 
: that hard now, but it seems that every time I go to do a set of updates
: some file has been moved, or changed names, or is .txt.Z when it was just
: txt the previous time, or in one way or another mutated so as to break my
: retrieval software. 


: Comments?

Comment 1: 
The proposed stubs try to compensate for the weakness of a transaction
based on a poll mechanism without a query. Rather than letting the
customer ask in detail for what is needed, the current schema of FTP
requires the FTP site to be (1) fixed in its file names (btw, what does
your software do if a new division appears?) and (2) prepared in advance
for any request, be it day-by-day, week-by-week, or other, and (3) it does
not at all tackle the question of what to do with the data once retrieved.
The proposed schema does not mention whether the target is to update
(1) formatted (which format?), (2) unformatted or (3) incremental data.
In particular, the latter is the most appropriate on wide area networks
because it saves bandwidth, but it raises management problems at the
local site.
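
To make the incremental case concrete, here is a minimal sketch in Python.
It assumes each release can be viewed as a mapping from an entry
identifier to the entry text; the names Delta, make_delta and apply_delta
are hypothetical and do not describe any existing tool or distribution
format.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Delta:
    """Hypothetical incremental update: entries to store and entries to drop."""
    upserts: dict[str, str] = field(default_factory=dict)
    deletes: set[str] = field(default_factory=set)


def make_delta(old: dict[str, str], new: dict[str, str]) -> Delta:
    """Provider side: compare two releases keyed by entry identifier."""
    # New or changed entries are shipped in full; unchanged ones are skipped.
    upserts = {ident: text for ident, text in new.items() if old.get(ident) != text}
    # Entries that vanished from the new release must be removed locally.
    deletes = set(old) - set(new)
    return Delta(upserts, deletes)


def apply_delta(local: dict[str, str], delta: Delta) -> dict[str, str]:
    """Local side: return an updated copy, leaving the input untouched."""
    updated = {ident: text for ident, text in local.items()
               if ident not in delta.deletes}
    updated.update(delta.upserts)
    return updated
```

Only the changed and deleted entries cross the network, which is where
the bandwidth is saved. The management problem shows up in apply_delta:
it is only correct if the local copy really matches the release the delta
was computed against.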

Comment 2:
The proposed schema seems to imply resource discovery based on static
listings. These are notoriously difficult to maintain and do not
necessarily offer any quality control (see below). Even if this were a
possibility, it would imply that all sites referencing each other have
the same 'free for all' policy and honor the same quality standards.

Comment 3:
The major problem in sequence database updating is that the quality of
an update cannot be judged if you download it as a file. Neither the date
nor the contents are sufficiently characterized in the file's format. A
synchronisation is required (or at least desirable) that allows you to
crosscheck the contents of your local, adapted, formatted copy against
the data originally present at the provider. Versions and dates are nice
but insufficient to characterize the contents of incremental updates.
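
One way such a crosscheck could look is sketched below in Python. It
assumes the provider publishes a manifest mapping each entry identifier
to a digest of the format-independent content. The manifest, the
functions entry_digest and crosscheck, and the choice of SHA-256 are all
assumptions made for illustration, not a description of any existing
service.

```python
from __future__ import annotations

import hashlib


def entry_digest(sequence: str) -> str:
    """Digest of the format-independent content (here: only the residues)."""
    # Strip whitespace and case so local reformatting does not change the digest.
    canonical = "".join(sequence.split()).upper()
    return hashlib.sha256(canonical.encode("ascii")).hexdigest()


def crosscheck(local: dict[str, str],
               manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare local entries against a provider manifest of identifier -> digest."""
    report: dict[str, list[str]] = {"missing": [], "stale": [], "extra": []}
    for ident, digest in manifest.items():
        if ident not in local:
            report["missing"].append(ident)    # provider has it, we do not
        elif entry_digest(local[ident]) != digest:
            report["stale"].append(ident)      # our content differs
    # Entries we still hold that the provider no longer lists.
    report["extra"] = [ident for ident in local if ident not in manifest]
    for names in report.values():
        names.sort()
    return report
```

Because the digest is taken over the content rather than the file, the
check still works on a local copy that has been reformatted, which a
version number or a file date cannot do.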

Comment 4:
The proposed schema requires a considerable amount of coordination
between providers. The resources for the update business are fairly low,
as you, and many others, are neither prepared nor willing to pay for the
service you request. I'm not telling you that all researchers ought to be
tapped, but as I had to realize recently, lots of our customers don't even
know where the data come from, and when asked for funding, the granting
organizations want to bless just these customers with 'self-sustaining'
fees to cover the costs of such a service. In other words, even if the
problems in items 1-3 could be eliminated, money is tight.

Comment 5:
The market is fairly small. In Europe, we have 26 EMBnet nodes mirroring the
EMBL database, and presumably about 100 sites that care about updates. A
referee's recent comment on a paper we submitted to XXXXX about a new
compression system was (quote) "Not many sites update their databases
regularly over the Internet". This might be a guess on the low side, but
you will not be able to count on commercial providers to help you develop
such a mechanism easily.

Comment 6:
We are aware of all these problems. We deal with them on a daily basis, as
we update all sequence databases available on a daily or weekly schedule.
We have developed suitable mechanisms to receive, distribute and redistribute
updates. The reason why we haven't made it a 'release' yet is the resource
issue: it works for us, and that does not necessarily require us to make the
methods publicly available. I am currently finishing HASSLE with SRS
access, but afterwards I will try to work on the DBTOOLS for release in early
summer (sorry, it was originally scheduled for fall '94 but was postponed).
This should be symptomatic: it shows that once something works at the local
site, any effort to make it usable on a wider basis is not profitable for the
individual site and is therefore usually not pursued very hard.

Reinhard Doelz
EMBnet Switzerland 

 R.Doelz         Klingelbergstr.70| Tel. x41 61 267 2247  Fax x41 61 267 2078|
 Biocomputing        CH 4056 Basel| electronic Mail    doelz at ubaclu.unibas.ch|
 Biozentrum der Universitaet Basel|-------------- Switzerland ---------------|
 EMBnet Switzerland: info at ch.embnet.org   http://beta.embnet.unibas.ch/

More information about the Bio-soft mailing list
