Do we need local databases?

ewan birney birney at molbiol.ox.ac.uk
Tue Feb 14 13:38:39 EST 1995

doelz at comp.bioz.unibas.ch (Reinhard Doelz) wrote:
> Colleagues, 
> the last posting in the 'GCG on a PC' thread raised an interesting 
> hypothesis which I would like to get more input on. The author mentions that, 
> due to the speed and ease of the NCBI network server, it would be easy to 
> omit local databases and rely on network resources entirely. 
> We have seriously investigated this earlier and concluded that the 
> work done by many of the 'casual' users (i.e., type in sequence, search 
> sequence, retrieve top hit) can indeed be done by networked databases. 
> However, to the residual 30% of users, who do not stop after having noticed
> merely insignificant hits, what happens if (1) you need to search 
> for subsets in the database, (2) you need _many_ database entries 
> (i.e., 100 or 1000), and (3) you do many comparisons, statistical or 
> evolutionary analysis, and individual work which should be done anyhow
> after a reasonable search.
> One of my favourites is to use GCG's feature of files of sequence names 
> in order to group sequences and process these in any other operation. 
> Unless you have a very sophisticated network system, this can only be 
> achieved if your database is in the same environment as your process runs
> on local resources most of the time. In order to have _this_ achieved 
> with networks, we needed a much more sophisticated way to communicate 
> the search set which we want to tackle. I don't think of database 
> divisions here but of sets of data which do not use the whole length 
> but rather a short fragment of it.
> How would you imagine running this type of search in a networked environment?
> Regards
> Reinhard

I'm not sure if this is implicit in your proposal or not, but
one thing I am increasingly coming up against is the fact that
one gets "hits" against large (genomic) S. cerevisiae chromosomes, and
also C.elegans contigs in virtually every search I do. So you
definitely need to have a way of not simply specifying sets of
sequences but also segments of sequences as a set. I know GCG
has /begin= and /end= (but do they work on every sort of input?)

If you want to be really formal, I guess you would need to have
disjoint segments allowed (so you could splice if you like).
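
As a rough sketch of what such a segment set might look like, here is a
small Python example. Everything in it is an assumption made for
illustration: the SegmentSpec name, the toy entry names, and the toy
sequences are invented, and none of it is GCG syntax. Coordinates are
taken to be 1-based and inclusive, in the spirit of /begin= and /end=.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class SegmentSpec:
    """One entry plus one or more 1-based, inclusive (begin, end) ranges."""
    entry: str
    ranges: Tuple[Tuple[int, int], ...]


def extract(spec: SegmentSpec, sequences: Dict[str, str]) -> str:
    """Return the listed ranges of one entry, spliced in the order given."""
    seq = sequences[spec.entry]
    parts = []
    for begin, end in spec.ranges:
        if not (1 <= begin <= end <= len(seq)):
            raise ValueError(f"range {begin}..{end} is outside {spec.entry}")
        # Convert 1-based inclusive coordinates to a Python slice.
        parts.append(seq[begin - 1:end])
    return "".join(parts)


# Toy data standing in for a database; the entry names are made up.
toy_db = {
    "chrIII_toy": "AAACCCGGGTTTAAACCC",
    "cosmid_toy": "GATTACAGATTACA",
}

# A search set made of segments rather than whole entries. The first
# item uses two disjoint ranges, which are spliced together.
search_set = [
    SegmentSpec("chrIII_toy", ((4, 6), (10, 12))),
    SegmentSpec("cosmid_toy", ((1, 7),)),
]

spliced = [extract(spec, toy_db) for spec in search_set]
print(spliced)
```

A search tool that accepted a list like search_set would compare against
only the named fragments instead of the whole chromosome or contig.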

Otherwise, I am not sure if you are trying to suggest some sort
of "standard query language" for biological information? I think
this would be useful, especially if I/we/one could develop tools
which automatically retrieved the data from the most appropriate place
(i.e., if it is held locally, use the local copy, else try nearby
network sites, else try more distant network sites). This
*hopefully* would be hidden from both the user and the programmer.
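
The "local first, then nearby, then distant" idea could be sketched
roughly as below. This is only an illustration under stated assumptions:
the sources are plain Python dictionaries and a stub function standing in
for real databases and network sites, and the names retrieve, unreachable,
entry_a and entry_b are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# A "source" is any function that returns the sequence for an entry name,
# or None if that source does not hold the entry.
Fetcher = Callable[[str], Optional[str]]


def retrieve(entry: str, sources: List[Tuple[str, Fetcher]]) -> Tuple[str, str]:
    """Try each source in order (nearest first); return (source name, sequence)."""
    for name, fetch in sources:
        try:
            sequence = fetch(entry)
        except OSError:
            # Treat an unreachable source as a miss and try the next tier.
            continue
        if sequence is not None:
            return name, sequence
    raise KeyError(f"{entry} not found in any source")


def unreachable(entry: str) -> Optional[str]:
    """Stand-in for a network site that is currently down."""
    raise ConnectionError("site is down")


# Toy stand-ins for a local database and a distant network site.
local = {"entry_a": "ACGT"}
distant = {"entry_a": "ACGT", "entry_b": "GGCC"}

sources = [
    ("local", local.get),
    ("nearby", unreachable),
    ("distant", distant.get),
]

hit_a = retrieve("entry_a", sources)  # found locally, no network needed
hit_b = retrieve("entry_b", sources)  # falls through to the distant site
print(hit_a, hit_b)
```

The caller only ever sees retrieve(), so where the data actually came
from stays hidden unless one asks for the source name.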

... aaah dreaming again....


birney at molbiol.ox.ac.uk


More information about the Info-gcg mailing list