There have been several articles about the use of indexes for extracting
entries from the sequence libraries, and the problems that GCG has when
an accession number occurs in more than one entry.
Mention was made of a package (SRS) that does not have this problem.
Readers may be interested to know of our own method of dealing with
sequence library indexes, which also does not suffer from the
accession number problem.
We decided to use the indexing system that is included on the EMBL
CD-ROM. With it we can extract entries by accession number or by
entry name. In addition we can perform instantaneous author and
text searches (the text indexes include every non-trivial word throughout
an entry, not just the keywords, so we find them very useful).
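To illustrate the idea of a free-text index, here is a minimal sketch in
Python. It is not the EMBL CD-ROM index format and not our software: the
names build_text_index, search and TRIVIAL_WORDS are hypothetical, and
the rule used for "non-trivial" (longer than two characters and not in
a small stop list) is an assumption made for the example.

```python
import re
from collections import defaultdict

# Assumed stop list; the real definition of a "trivial" word is not given here.
TRIVIAL_WORDS = {"the", "and", "of", "in", "for", "from", "with", "to"}


def build_text_index(entries):
    """Map each non-trivial word to the set of entry names containing it.

    `entries` is a dict of entry name -> full entry text (hypothetical input).
    """
    index = defaultdict(set)
    for name, text in entries.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if len(word) > 2 and word not in TRIVIAL_WORDS:
                index[word].add(name)
    return index


def search(index, *words):
    """Return the sorted entry names that contain every one of `words`."""
    hits = [index.get(word.lower(), set()) for word in words]
    return sorted(set.intersection(*hits)) if hits else []
```

Because every word is looked up directly, a search costs a few
dictionary lookups and a set intersection rather than a scan of the
library, which is why such searches feel immediate.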
This allows us to use EMBL and SWISS-PROT from the CD-ROM (or copied
to disk for extra speed), but we also wanted to be able to use EMBL updates,
PIR and GenBank. To make this possible we wrote software to
create EMBL CD-ROM style indexes for all libraries, i.e. we create
entry name, accession number, author and free-text indexes for all
library formats. It is important to realise that we do not change
the libraries, but leave them as distributed. Not having to reformat
or change the libraries obviously saves a great deal of time and
temporary disk space.
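The principle of indexing a library without changing it can be sketched
as follows. This is only an illustration under stated assumptions, not
the EMBL CD-ROM index layout and not our program: it assumes EMBL-style
flat-file lines ("ID", "AC", "//"), the function names and the sample
entries are hypothetical, and the indexes are held in memory rather
than written to disk. Each index stores byte offsets into the untouched
library file, and an accession number maps to a list of offsets, so one
that occurs in more than one entry is not a problem.

```python
import io
from collections import defaultdict


def build_indexes(handle):
    """Scan a flat file opened in binary mode; return (by_name, by_accession).

    Both indexes hold byte offsets of entry starts; the file is only read.
    """
    by_name = {}
    by_accession = defaultdict(list)
    entry_offset = None
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"ID   "):
            entry_offset = offset
            name = line.split()[1].decode("ascii").rstrip(";")
            by_name[name] = offset
        elif line.startswith(b"AC   ") and entry_offset is not None:
            for acc in line[2:].decode("ascii").replace(";", " ").split():
                if entry_offset not in by_accession[acc]:
                    by_accession[acc].append(entry_offset)
    return by_name, dict(by_accession)


def fetch_entry(handle, offset):
    """Read one entry starting at `offset`, up to and including the '//' line."""
    handle.seek(offset)
    lines = []
    for line in iter(handle.readline, b""):
        lines.append(line)
        if line.startswith(b"//"):
            break
    return b"".join(lines).decode("ascii")


# Two made-up entries that share the accession number X00002.
SAMPLE = (
    b"ID   AAAA01     standard; DNA; 10 BP.\n"
    b"AC   X00001; X00002;\n"
    b"SQ   Sequence 10 BP;\n"
    b"     acgtacgtac\n"
    b"//\n"
    b"ID   BBBB02     standard; DNA; 8 BP.\n"
    b"AC   X00003; X00002;\n"
    b"SQ   Sequence 8 BP;\n"
    b"     ttgacatg\n"
    b"//\n"
)

if __name__ == "__main__":
    library = io.BytesIO(SAMPLE)
    names, accessions = build_indexes(library)
    for offset in accessions["X00002"]:
        print(fetch_entry(library, offset).splitlines()[0])
```

Looking up X00002 in this sketch yields both entries, and the library
bytes are never rewritten.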
An article describing the initial work on this subject was published
as Staden, R. and Dear, S. "Indexing the sequence libraries: software
providing a common indexing system for all the standard sequence
libraries". DNA Sequence 3, 99-105 (1992).
Rodger Staden, Medical Research Council Laboratory Of Molecular Biology,
Hills Road, Cambridge, CB2 2QH, ENGLAND Telephone: 0223 402389
Internet: rs at UK.AC.Cam.MRC-LMB Facsimile: 0223 412282