Suggestions for future versions of SRS
I think SRS is great software for many genome informatics needs.
We are now using it as the main search engine in the FlyBase project
(http://flybase.bio.indiana.edu/, see e.g., the Genes section searches).
Based on using SRS for GenBank and related sequence data at IUBio, and for
a wide variety of Drosophila data, I have several suggestions.
These are in rough order of importance to me, and possibly to others.
Maybe some of these are possible now and I haven't looked hard enough.
If not, I will try to add some of them and pass code on to Thure
and colleagues.
-- Sequence output format
-- default should always be the native format, untouched by SRS. The
current return of GenBank data is a bogus format that can't be
interpreted well: it is missing the ORIGIN line, and the sequence data
is in EMBL rather than GENBANK style. Maybe part of this is due to
Icarus indexing mistakes.
-- offer GENBANK and PIR/CODATA output formats as primary standard
sequence formats
-- Query symbol neutrality
The symbols that SRS now requires in queries for operations and parsing
clash with symbols used above (in unix and http command strings) and below
(in biological data). Especially because of the latter, it is difficult
to use escape characters to do the kinds of queries needed.
There should be query-time switches for getz, wgetz and such that let the
caller set symbols used for query parsing, including &|![]={}-.
At the least, offer query-time symbol swapping, so that any single parsing
symbol can be replaced by another with the same meaning. The high ASCII
set would make a good option. But it would also be nice to allow strings,
such as _AND_ for &, _OR_ for |, _OPEN_PHRASE_ for [, _CLOSE_PHRASE_ for ],
etc. in queries.
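
The string form of this idea can be sketched as a small pre-processing
step that runs before the query reaches the parser. This is only an
illustration in Python, not SRS code: the function name translate_query
and the table WORD_TO_SYMBOL are hypothetical, and the sample query uses
a made-up library and field name.

```python
# Hypothetical word-to-symbol table, taken from the examples in the text.
WORD_TO_SYMBOL = {
    "_AND_": "&",
    "_OR_": "|",
    "_OPEN_PHRASE_": "[",
    "_CLOSE_PHRASE_": "]",
}


def translate_query(query, table=WORD_TO_SYMBOL):
    """Replace word-style operators with the symbols the parser expects."""
    # Replace longer names first so that no name is damaged by a shorter one.
    for word in sorted(table, key=len, reverse=True):
        query = query.replace(word, table[word])
    return query


if __name__ == "__main__":
    # "genes" and "symbol" are illustrative names only.
    print(translate_query(
        "_OPEN_PHRASE_genes-symbol:adh_CLOSE_PHRASE_ _OR_ "
        "_OPEN_PHRASE_genes-symbol:w_CLOSE_PHRASE_"
    ))
```

A caller-supplied table would give the same effect as query-time symbol
swapping, since any entry could map to any other symbol.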
-- Case sensitive searches
This should be available as a query-time user choice for any field.
If that is too compute-expensive at query time, perhaps there should be an
index-time switch that marks a field as potentially case sensitive.
-- Index numeric ranges
For example, a map range such as "123-456" should be indexed so it
can be queried as a numeric range. Queries such as 124, 234, or 345 should
all match such a range. Several ranges per field must be possible.
In WAIS, we just stored the text string of such a field, and did a numeric
range test at query time.
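
The query-time approach described above (store the text, test numerically
when queried) might look roughly like the following Python sketch. The
names parse_ranges and field_matches are hypothetical, and the sketch
assumes ranges of non-negative integers written as "low-high", with
several ranges allowed in one field.

```python
import re

# Matches ranges of non-negative integers written like "123-456".
RANGE_RE = re.compile(r"(\d+)\s*-\s*(\d+)")


def parse_ranges(field_text):
    """Return all (low, high) pairs found in the stored field text."""
    ranges = []
    for first, second in RANGE_RE.findall(field_text):
        a, b = int(first), int(second)
        # Tolerate ranges written high-to-low.
        ranges.append((min(a, b), max(a, b)))
    return ranges


def field_matches(field_text, value):
    """True if value falls inside any range stored in the field."""
    return any(low <= value <= high for low, high in parse_ranges(field_text))
```

An index-time version would store the parsed (low, high) pairs instead of
re-parsing the text on every query.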
-- Cache query results and use them for quick lookups of next-page data
wgetz and other SRS query drivers offer a page of results for a given
query, plus links to additional pages. Following one of these links redoes
the same query, sometimes at a large CPU cost. It would be nice to have
the full match set for each query cached (for maybe an hour, in SRSTMP:)
and used to serve multipage requests for the same query.
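
A minimal sketch of such a cache, in Python and under several assumptions:
the class QueryCache and the run_query callback are hypothetical, and the
cache lives in memory for brevity. A real version for a CGI driver like
wgetz would more likely write each match set to a file (e.g. under
SRSTMP:), since every page request is a separate process.

```python
import time


class QueryCache:
    """Keep the full match set per query and serve pages from it."""

    def __init__(self, ttl_seconds=3600, clock=time.time):
        self.ttl = ttl_seconds      # lifetime of a cached result, in seconds
        self.clock = clock          # injectable clock, handy for testing
        self._store = {}            # query string -> (timestamp, hit list)

    def get_page(self, query, page, page_size, run_query):
        """Return one page of hits, running the query only when needed."""
        now = self.clock()
        entry = self._store.get(query)
        if entry is None or now - entry[0] > self.ttl:
            # Cache miss or expired entry: run the expensive query once.
            entry = (now, list(run_query(query)))
            self._store[query] = entry
        hits = entry[1]
        start = page * page_size
        return hits[start:start + page_size]
```

Pages are numbered from 0 here; the second and later page requests within
the hour reuse the stored match set instead of redoing the query.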
-- Relevance ranking
Allow fields to store word counts per record in indexes,
and use these counts for one form of relevance calculation. Relevance
ranking can markedly improve the usability of query results: the records
with the most query words (or whatever else is defined as most relevant)
are sorted to the top of the results list. Relevance ranking has been
standard in WAIS and related text indexing software.
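
One simple form of this calculation is sketched below in Python. It is
not how WAIS or SRS computes relevance; it only counts how often the
query words occur in each record. The names index_record and rank are
hypothetical.

```python
from collections import Counter


def index_record(text):
    """Index-time step: store a word count per record."""
    return Counter(text.lower().split())


def rank(records, query_words):
    """Return record ids sorted by total query-word count, highest first.

    records maps a record id to the Counter built by index_record.
    Records containing none of the query words are left out.
    """
    scores = {}
    for record_id, counts in records.items():
        score = sum(counts[word.lower()] for word in query_words)
        if score > 0:
            scores[record_id] = score
    # Ties are broken by record id so the order is stable.
    return sorted(scores, key=lambda record_id: (-scores[record_id], record_id))
```

Other definitions of relevance (for example weighting rare words more
heavily) could replace the sum without changing the stored counts.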
-- Lists of words to ignore in indexing
Use lists/files of common words to ignore at indexing (a, and, the, ...).
Let the Icarus parsing script read such a list from a common file or data
source and apply it when storing indices for any particular field. Maybe
we can do this now in the rich Icarus language; if so, an example would be
nice.
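
The filtering step itself is small. The sketch below is in Python, not
Icarus, so it does not answer whether Icarus can do this today; it only
shows the intended behavior. The names load_stop_words and index_terms
are hypothetical, and the stop list is assumed to hold one word per line.

```python
def load_stop_words(lines):
    """Build the stop-word set from an iterable of lines (e.g. an open file)."""
    return {line.strip().lower() for line in lines if line.strip()}


def index_terms(text, stop_words):
    """Return the words of a field that should be stored in the index."""
    return [word for word in text.lower().split() if word not in stop_words]
```

The same stop-word set could be shared by every field that opts in, with
other fields indexed in full.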
--
-- d.gilbert--biocomputing--indiana u--bloomington--gilbertd at bio.indiana.edu