David Mathog had several useful comments on the RETRIEVE server; I'll try
to address them in turn.
mathog at seqvax.caltech.edu (David Mathog) writes:
: I've just spent some time messing with NCBI retrieve. The turnaround time
: for a request varied from 2 hours (17:00 PST) to 5 minutes (9:30 PST),
: which is fast enough.
This delay must be in the network -- we're not queueing the RETRIEVE
requests yet, so we process each message as soon as it comes in and mail
the results right back. A local message is processed in under 30 seconds.
: However, the service is a bit schizophrenic - it tries to be both a search
: service and a retrieval service, and so is not really very good at either.
Agreed. I think we've taken a step to differentiate the two. There is now
an option 'TITLES' which will retrieve just the definition lines, thereby
giving you an opportunity to do a subsequent search for the entire record.
Just use the keyword TITLES followed by yes in the message.
... query follows
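
A minimal sketch, in Python, of how a titles-only request could be put
together. The one-command-per-line layout, the BEGIN line ahead of the query,
and the "genbank" value for DATALIB are assumptions based on the command words
mentioned in this thread, not a statement of the server's exact format;
build_request is a hypothetical helper.

```python
def build_request(query, datalib="genbank", titles=False, maxdocs=None):
    """Return the body of a RETRIEVE mail message as a string.

    Assumed layout: one 'KEYWORD value' line per option, then BEGIN,
    then the query itself.
    """
    lines = ["DATALIB " + datalib]
    if maxdocs is not None:
        lines.append("MAXDOCS %d" % maxdocs)
    if titles:
        # TITLES yes asks for the definition lines only.
        lines.append("TITLES yes")
    lines.append("BEGIN")
    lines.append(query)
    return "\n".join(lines) + "\n"


print(build_request("brown AND drosophila", titles=True))
```

The idea is to send the query once with titles=True, pick the interesting
records from the definition lines, and then send a second request for the full
entries.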
: First of all, it's completely nuts that you cannot retrieve an entry by its
: accession number or name without also getting hit with spurious cross
: references from other entries.
There are ways to limit/eliminate the cross references; we obviously need to
give some specific examples in the documentation. For example, you can
limit the fields that you want to search by adding a field qualifier.
If you want to retrieve the record with accession number x56813 from
GenBank, you can enter the query as:
Field qualifiers for other GenBank fields are as follows:
ACCESSION NO. [ACC]
(We'll add this to the documentation sent in response to 'help' along with field
descriptions for the other databases.)
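
As a rough illustration of the field-qualifier syntax, here is a small Python
sketch that composes qualified queries. It assumes every qualifier is written
as the term followed by the field code in square brackets, as in the
"brown [key]" form used elsewhere in this message; qualify and combine are
hypothetical helpers, not part of the server.

```python
def qualify(term, field=None):
    """Append a field qualifier such as [ACC] to a search term.

    With no field given, the term is left unrestricted.
    """
    if field is None:
        return term
    return "%s [%s]" % (term, field)


def combine(op, *terms):
    """Join already-qualified terms with a boolean operator such as AND."""
    return (" %s " % op).join(terms)


print(qualify("x56813", "ACC"))
print(combine("AND", qualify("brown", "key"), qualify("drosophila")))
```

Restricting the accession number to the [ACC] field is what keeps
cross-references in other fields of other entries from matching.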
: Similarly, the search function is compromised by insisting that it send the
: entire database entry on a hit. Here's a fun one, try "brown AND
: drosophila" against Genbank. There are 18 hits, most of them are for
: somebody named Brown and a Drosophila entry. However, you also get Yeast
Again, using field restriction you could do something like:
brown [key] AND drosophila
brown [def] and drosophila
One record should be returned.
: Chromosome III (somebody named Brown, and a homology to Drosophila). When
: it tries to mail this entry it exceeds the maximum number of lines/entry
: and truncates the list at entry 6 - you don't even get to look at the last
: 12!. This problem is only going to get worse with time as other huge
: genomic stretches are entered into the databases. All we need back from a
: search statement like this is enough information to determine if the hits
: are the ones we want. I'd also REALLY like to be able to restrict the
: search either by fields, or at least by proximity, but we'll leave that
: discussion for another time.
We don't have true proximity searching, but there is an approximation to it. You can
use double quotes around terms if you want ALL the terms to appear in a
single field (the field doesn't have to be specified if you don't want it to
be). For example, if you want to get records dealing with creatine kinase,
you can use:
That will avoid the default OR'ing of creatine OR kinase and retrieve some
79 records as opposed to 1546.
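
To make the difference concrete, here is a toy Python model of the two
behaviours. It is only a sketch of the semantics described above, not the
server's implementation; the three records are invented for illustration, and
match_or and match_quoted are hypothetical names.

```python
# Invented toy records; "def" and "key" stand in for two searchable fields.
RECORDS = [
    {"def": "human creatine kinase mRNA", "key": "kinase"},
    {"def": "protein kinase C", "key": "phosphorylation"},
    {"def": "creatine transporter", "key": "kinase inhibitor"},
]


def words(text):
    """Split a field value into a set of lower-case words."""
    return set(text.lower().split())


def match_or(record, terms):
    """Unquoted terms: a hit if any term occurs in any field."""
    all_words = set()
    for value in record.values():
        all_words |= words(value)
    return any(t in all_words for t in terms)


def match_quoted(record, terms):
    """Quoted terms: a hit only if all terms occur in one single field."""
    return any(all(t in words(v) for t in terms) for v in record.values())


terms = ["creatine", "kinase"]
or_hits = [r["def"] for r in RECORDS if match_or(r, terms)]
quoted_hits = [r["def"] for r in RECORDS if match_quoted(r, terms)]
print(len(or_hits), len(quoted_hits))
```

In this toy data all three records match the unquoted form, while only the
first has both words in the same field.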
: Anyway, for a quick fix, what about breaking this up into two services?
: NCBI SEARCH: Command words: DATALIB, MAXHITS, BEGIN
: Same search lines as now, but return =
: Accession number (or entry name) + # of lines in entry + description line
: (Needed for the MAXLINES parameter ^^^^^^^^^^^^^^^^^^^ in retrieve)
: NCBI RETRIEVE: Command words: DATALIB, MAXDOCS, MAXLINES, SEPARATE, BEGIN
: Standard keyed retrieval by accession numbers or names.
: (SEPARATE = "mail each entry separately")
I think the TITLES option helps accomplish the above suggestion; we will give
some consideration to the SEPARATE option if it's felt that a record per
mail message is necessary.
We appreciate the comments. For specific questions or comments, please use
the address retrieve-help at ncbi.nlm.nih.gov; we scan it more frequently
than the newsgroup and can hopefully offer faster turnaround.