In article <20JAN94.20563898 at shrsys.hslc.org>, bshoop at shrsys.hslc.org writes:
>> HELP!!!
>> I am looking for an Anomymous FTP site for the EMBL Database. Any suggestions?
Well, I'd hoped someone else would speak up, but since this thread has been
quiet so far, I'll pass on my (mostly unhelpful) understanding of the
situation.
The EMBL database is available for anonymous ftp from at least two sites that
I know of - ftp.embl-heidelberg.de and Reinhard Doelz' EMBnet site in
Switzerland - *if* you're a European site. In the past, they were incurring
great expense for ftp transfers to US sites (their network costs are borne much
more directly than at most US sites), so they had to shut down transatlantic
access and close Mike Cherry's mirror site at MGH. US sites like NCBI don't
carry it because they feel that improved data sharing among database
maintainers will result in GenBank, EMBL, and DDBJ containing all of each
other's entries RSN. (I'm not sure where GSDB fits in here; they communicate
all submissions to their site to the other databases, but the last description
of it I saw made it seem like a superset of the other three.)
For a US site, the need for EMBL may actually be less than you think. Last
Fall, I tried to compile a set of those entries present in EMBL release 36 but
not in GenBank release 79. It turns out that there is no easy way to do this,
since entries are changing constantly, and the numbers I got depended on the
way I performed the comparisons (e.g. comparing primary accession numbers or
all accession numbers). Despite the fuzziness, the following seemed to be the
case:
o There are around 3000 primary accession numbers of EMBL 36 entries not
present in GenBank 79. About 1200 of these are in the patent
section, which I was told occurred because NCBI didn't have information
on EMBL patent entries in time for GenBank 79. (These may be entries which
are present in GenBank, but have different accession numbers, since I'm
told that the methods by which they were added to each database differed
at the time.) Around 1000 are in the synthetic section, for reasons
unknown to me. The rest are pretty evenly distributed across the
database. They include entries over a broad span of time, though it
seems that there may be a preference for older entries. (This makes
sense, since there's a backlog of old entries to be sorted out between
databases.) An unknown amount of this is also due to 'dead' entries
originating from one database which haven't been removed from the other.
o A quick check of EMBL 36 vs. GenBank 80 by one method (primary accession
numbers of EMBL entries which don't appear in any GenBank entry) shows
3200 such entries. Interestingly, nearly 1800 of these are in the
backbone section, and so are presumably somewhere in GenBank (perhaps
EMBL assigns its own accession number to entries in the temporary BB
section).
o Many of the entries are small sequences (e.g. PCR primers for STSs, found
in the synthetic section). I'm relying on a correspondent for this
judgement; I didn't check this out myself.
o According to the GenBank maintainers, much of this difference should be
resolved in the near future.
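
To illustrate why the numbers depend on how the comparison is performed, here
is a minimal sketch in Python. It is not the procedure actually used for the
counts above. It assumes each entry has already been reduced to a list of
accession numbers with the primary one first; the function names and the
accession numbers in the toy data are made up for illustration.

```python
def missing_by_primary(embl_entries, genbank_entries):
    """Primary accessions of EMBL entries that are not the *primary*
    accession of any GenBank entry."""
    gb_primaries = {accs[0] for accs in genbank_entries if accs}
    return [accs[0] for accs in embl_entries
            if accs and accs[0] not in gb_primaries]


def missing_by_any(embl_entries, genbank_entries):
    """Primary accessions of EMBL entries that do not appear *anywhere*
    (primary or secondary) in any GenBank entry."""
    gb_all = {acc for accs in genbank_entries for acc in accs}
    return [accs[0] for accs in embl_entries
            if accs and accs[0] not in gb_all]


# Toy data (hypothetical accession numbers); the first item of each list
# is treated as the primary accession, the rest as secondary ones.
embl = [["X00001"], ["X00002", "M10000"], ["X00003"], ["A00001"]]
genbank = [["X00001"], ["M10000", "X00002"], ["M20000", "X00003"]]

by_primary = missing_by_primary(embl, genbank)
by_any = missing_by_any(embl, genbank)

print(len(by_primary), by_primary)
print(len(by_any), by_any)
```

With this toy data the primary-only comparison reports three "missing"
entries, while the comparison against all accession numbers reports one; the
other two are present in the second database under a different primary
accession number.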
I've made a copy of the EMBL36-v-GenBank79 exclusion set available for
anonymous ftp here (genetics.upenn.edu) in the directory [.bio.gcg]. There
are two files:
embl36unique_gcgvms.zip - a ZIP file of the exclusion set in GCG 7.3
for VMS format
embl36unique_gcgunix.tar_z - a compressed tarfile of the exclusion set in
GCG 7.2 (I think) for Unix
Please remember that these are EMBL 36 vs GenBank 79, and that I make no
guarantees that they are complete.
If I get a chance someday, I may repeat the comparison using GenBank 81 and
EMBL 37. If so, I'll put that set up for ftp here.
The bottom line appears to be that there are a number of entries that haven't
been consolidated between GenBank and EMBL yet, but there's no easy way to get
the entries present in EMBL but not GenBank. You'll have to decide for
yourself whether
o the exclusion set here will fill your needs, or
o you would be better off relying on searches against NCBI's nr database
(which does include the EMBL entries) via mailserver, or
o it's worthwhile to get a tape copy of EMBL releases, or
o you can ignore the differences in most cases.
I'm sorry not to be of more help, but I hope that this at least better outlines
the situation for you. Good luck.
Regards,
Charles Bailey
!-------------------------------------------------------------------------------
! Computational Biology and Informatics Laboratory
! Dept. of Genetics, Univ. of Pennsylvania School of Medicine
! Philadelphia, PA USA 19104 Tel. (215) 573-3112
! Internet: bailey at genetics.upenn.edu (IN 128.91.200.37)
!-------------------------------------------------------------------------------