--
Hi,
As most of you are aware, the growth of the EBI and GenBank databases
is not a minor problem.
Here I maintain a GCG-formatted version of the EBI databank (I exclude
the EST division, because ESTs are mostly used in BLAST searches at
remote sites).
It appears that today the cumulative weekly updates since the last CD-ROM
release of EBI are as large as the last release itself.
This is due in part, of course, to the increasing number of newly
determined sequences. But it is also due to a great number of ESTs and a
huge number of "duplicates", where a duplicate is an entry in EBI that
has been corrected or modified: it appears with the same accession
number or ID in both the full release and the updates.
My question is: do you know of an efficient program which, starting from
the EBI flat file and the weekly update flat files, will remove the
redundant entries, keeping only the most recently updated version of
each, and possibly also remove the ESTs from the updates?
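
To make the request concrete, the processing I have in mind could be
sketched roughly as below, in Python. This is only an illustration, not
an existing program. It assumes that each flat-file entry ends with a
"//" line, that the first accession number on the AC line identifies
the entry, and that the division code (for example EST) appears as a
semicolon-separated field on the ID line. The function names
(read_entries, primary_accession, is_est, merge) are hypothetical. The
sketch also keeps every entry in memory, which a real tool for a full
release would probably have to avoid, for instance by indexing file
offsets instead.

```python
from typing import Dict, Iterable, Iterator, List, Optional


def read_entries(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each entry as a list of lines (assumes '//' ends an entry)."""
    entry: List[str] = []
    for line in lines:
        entry.append(line)
        if line.startswith("//"):
            yield entry
            entry = []
    if entry:  # trailing entry without a terminator
        yield entry


def primary_accession(entry: List[str]) -> Optional[str]:
    """Return the first accession number found on an AC line, if any."""
    for line in entry:
        if line.startswith("AC"):
            return line[2:].split(";")[0].strip()
    return None


def is_est(entry: List[str]) -> bool:
    """Assumption: the division is a ';'-separated field on the ID line."""
    for line in entry:
        if line.startswith("ID"):
            fields = [f.strip() for f in line[2:].split(";")]
            return "EST" in fields
    return False


def merge(release: Iterable[str],
          updates: Iterable[Iterable[str]],
          drop_est: bool = True) -> Dict[str, List[str]]:
    """Keep, per accession number, the entry from the latest source read.

    The release is read first, then the updates in the order given, so
    updates must be passed oldest first.
    """
    latest: Dict[str, List[str]] = {}
    for source in [release, *updates]:
        for entry in read_entries(source):
            acc = primary_accession(entry)
            if acc is None:
                continue  # skip anything without an AC line
            if drop_est and is_est(entry):
                latest.pop(acc, None)  # also drop an older non-EST copy
                continue
            latest[acc] = entry  # later sources overwrite earlier ones
    return latest
```

With open file handles this would be called as
merge(release_file, [update1_file, update2_file]), and the values of the
returned dictionary written back out as a single non-redundant flat
file.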
Thank you for your help,
Jean-Loup
PS. I am cross-posting this message to both bionet.software.gcg and
bionet.software.
-------------------------------
Jean-Loup RISLER
risler at cgmvax.cgm.cnrs-gif.fr
Centre de Genetique Moleculaire
91198 Gif sur Yvette France
-------------------------------