I've just spent some time messing with NCBI retrieve. The turnaround time
for a request varied from 2 hours (17:00 PST) to 5 minutes (9:30 PST),
which is fast enough.
However, the service is a bit schizophrenic - it tries to be both a search
service and a retrieval service, and so is not really very good at either.
First of all, it's completely nuts that you cannot retrieve an entry by its
accession number or name without also getting hit with spurious cross
references from other entries.
Similarly, the search function is compromised by its insistence on sending
the entire database entry for every hit. Here's a fun one: try "brown AND
drosophila" against GenBank. There are 18 hits, and most of them pair
somebody named Brown with a Drosophila entry. However, you also get Yeast
Chromosome III (somebody named Brown, and a homology to Drosophila). When
the server tries to mail this entry it exceeds the maximum number of
lines/entry and truncates the list at entry 6 - you don't even get to look
at the last 12! This problem is only going to get worse with time as other huge
genomic stretches are entered into the databases. All we need back from a
search statement like this is enough information to determine if the hits
are the ones we want. I'd also REALLY like to be able to restrict the
search either by fields, or at least by proximity, but we'll leave that
discussion for another time.
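
To make the truncation problem concrete, here is a rough Python sketch of
the behavior as I observed it. It assumes the server mails hits in order
and gives up at the first entry that exceeds the per-entry line cap; the
function name, the cap value, and all the line counts are invented for
illustration and are not taken from the server.

```python
def split_at_oversized(line_counts, max_lines_per_entry):
    """Model the observed behavior: hits are mailed in order, and the list
    is cut off at the first entry longer than the per-entry line cap.

    Returns (entries reached, entries never seen). This is an assumed model
    of the server, not its actual implementation.
    """
    for position, n_lines in enumerate(line_counts, start=1):
        if n_lines > max_lines_per_entry:
            # The oversized entry itself arrives truncated; the rest are lost.
            return position, len(line_counts) - position
    return len(line_counts), 0


# Invented numbers: 18 hits, with one huge genomic entry in sixth place.
hits = [200] * 5 + [100000] + [200] * 12
reached, lost = split_at_oversized(hits, max_lines_per_entry=5000)
print(reached, lost)
```

With one oversized entry in sixth place, the remaining twelve hits are
never seen, no matter how short they are.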
Anyway, for a quick fix, what about breaking this up into two services?
NCBI SEARCH: Command words: DATALIB, MAXHITS, BEGIN
  Same search lines as now, but return =
  Accession number (or entry name) + # of lines in entry + description line
  (Needed for the MAXLINES parameter ^^^^^^^^^^^^^^^^^^^ in retrieve)

NCBI RETRIEVE: Command words: DATALIB, MAXDOCS, MAXLINES, SEPARATE, BEGIN
  Standard keyed retrieval by accession numbers or names.
  (SEPARATE = "mail each entry separately")
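
Here is a minimal Python sketch of how a client might chain the two
proposed services. Everything about the message layout is an assumption:
I am supposing that a SEARCH reply has one hit per line in the form
"accession, line count, description" separated by whitespace, and that a
RETRIEVE request is the command words followed by the accession numbers.
The names Hit, parse_search_reply, and build_retrieve_request are
hypothetical, as are the accession numbers in the usage note.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    accession: str
    n_lines: int
    description: str


def parse_search_reply(text):
    """Parse an assumed SEARCH reply: 'ACCESSION  N_LINES  DESCRIPTION' per line."""
    hits = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines in the mail body
        accession, n_lines, description = line.split(None, 2)
        hits.append(Hit(accession, int(n_lines), description))
    return hits


def build_retrieve_request(hits, datalib, line_budget):
    """Build an assumed RETRIEVE request for the hits short enough to mail.

    MAXLINES is set from the line counts reported by SEARCH, which is the
    reason SEARCH needs to return them.
    """
    wanted = [h for h in hits if h.n_lines <= line_budget]
    if not wanted:
        return ""
    request = [
        f"DATALIB {datalib}",
        f"MAXDOCS {len(wanted)}",
        f"MAXLINES {max(h.n_lines for h in wanted)}",
        "SEPARATE",
        "BEGIN",
    ]
    request.extend(h.accession for h in wanted)
    return "\n".join(request)
```

For example, given a reply containing "FAKE0001 120 some Drosophila gene"
and "FAKE0002 90000 some yeast chromosome", a line budget of 5000 would
produce a request for FAKE0001 only, and the user could then decide
separately whether the huge entry is worth fetching.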
mathog at seqvax.caltech.edu
manager, sequence analysis facility, biology division, Caltech