FTP site needed for EMBL Database...

Charles Bailey bailey at hmivax.humgen.upenn.edu
Tue Jan 25 15:05:20 EST 1994

In article <20JAN94.20563898 at shrsys.hslc.org>, bshoop at shrsys.hslc.org writes:
> HELP!!!
> I am looking for an Anomymous FTP site for the EMBL Database.  Any suggestions?

Well, I'd hoped someone else would speak up, but since this thread has been
quiet so far, I'll pass on my (mostly unhelpful) understanding of the situation.

The EMBL database is available for anonymous ftp from at least two sites of which
I'm aware - ftp.embl-heidelberg.de and Reinhard Doelz' EMBnet site in
Switzerland - *if* you're a European site.  In the past, they were incurring
great expense for ftp transfers to US sites (their network costs are borne much
more directly than at most US sites), so they had to shut down transatlantic
access and close Mike Cherry's mirror site at MGH.  US sites like NCBI don't
carry it because they feel that improved data sharing among database
maintainers will result in GenBank, EMBL, and DDBJ containing all of each
other's entries RSN.  (I'm not sure where GSDB fits in here; they communicate
all submissions to their site to the other databases, but the last description
of it I saw made it seem like a superset of the other three.)

For a US site, the need for EMBL may actually be less than you think.  Last
Fall, I tried to compile a set of those entries present in EMBL release 36 but
not in GenBank release 79.  It turns out that there is no easy way to do this,
since entries are changing constantly, and the numbers I got depended on the
way I performed the comparisons (e.g. comparing primary accession numbers or
all accession numbers).  Despite the fuzziness, the following seemed to be the
case:
  o There are around 3000 primary accession numbers of EMBL 36 entries not
    present in GenBank 79.  About 1200 of these are in the patent
    section, which I was told occurred because NCBI didn't have information
    on EMBL patent entries in time for GenBank 79. (These may be entries which
    are present in GenBank, but have different accession numbers, since I'm
    told that the methods by which they were added to each database differed
    at the time.)  Around 1000 are in the synthetic section, for reasons
    unknown to me.  The rest are pretty evenly distributed across the
    database.  They include entries over a broad span of time, though it
    seems that there may be a preference for older entries.  (This makes
    sense, since there's a backlog of old entries to be sorted out between
    databases.)  An unknown amount of this is also due to 'dead' entries
    originating from one database which haven't been removed from the other.

  o A quick check of EMBL 36 vs. GenBank 80 by one method (primary accession
    numbers of EMBL entries which don't appear in any GenBank entry), shows
    3200 such entries.  Interestingly, nearly 1800 of these are in the
    backbone section, and so are presumably somewhere in GenBank (perhaps
    EMBL assigns its own accession number to entries in the temporary BB
    section).

  o Many of the entries are small sequences (e.g. PCR primers for STSs, found
    in the synthetic section).  I'm relying on a correspondent for this
    judgement; I didn't check this out myself.

  o According to the GenBank maintainers, much of this difference should be
    resolved in the near future.
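
For anyone who wants to try a comparison of this kind, here is a minimal
sketch in Python.  It is an illustration only, not the procedure used to
produce the numbers above.  It assumes the standard flat-file layouts (EMBL
"AC" lines with semicolon-separated accessions, GenBank "ACCESSION" lines with
space-separated accessions, "//" ending each entry); the function names are
made up for this example.  The genbank_all switch shows why the counts depend
on the method: comparing against GenBank primary accessions only gives a
larger set than comparing against every accession in every GenBank entry.

```python
from typing import Iterable, List, Set


def embl_accessions(lines: Iterable[str]) -> List[List[str]]:
    """Return one accession list per EMBL entry, primary accession first."""
    entries, current = [], []
    for line in lines:
        if line.startswith("AC"):
            # AC lines look like: "AC   X12345; Y00001;"
            current += [a.strip() for a in line[2:].split(";") if a.strip()]
        elif line.startswith("//"):
            if current:
                entries.append(current)
            current = []
    if current:
        entries.append(current)
    return entries


def genbank_accessions(lines: Iterable[str]) -> List[List[str]]:
    """Return one accession list per GenBank entry, primary accession first."""
    entries, current, in_acc = [], [], False
    for line in lines:
        if line.startswith("ACCESSION"):
            current += line.split()[1:]
            in_acc = True
        elif in_acc and line[:1].isspace() and line.strip():
            # Indented continuation of a long ACCESSION line
            current += line.split()
        else:
            in_acc = False
            if line.startswith("//"):
                if current:
                    entries.append(current)
                current = []
    if current:
        entries.append(current)
    return entries


def unique_to_embl(embl: List[List[str]],
                   genbank: List[List[str]],
                   genbank_all: bool = True) -> Set[str]:
    """EMBL primary accessions that GenBank does not know about.

    genbank_all=False compares against GenBank primary accessions only;
    genbank_all=True compares against every accession in every GenBank entry.
    """
    if genbank_all:
        known = {acc for entry in genbank for acc in entry}
    else:
        known = {entry[0] for entry in genbank}
    return {entry[0] for entry in embl} - known
```

In practice you would pass open file handles for the two releases to the
parsers and then call unique_to_embl() both ways to see how far the two
counts diverge.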

I've made a copy of the EMBL36-v-GenBank79 exclusion set available for
anonymous ftp here (genetics.upenn.edu) in the directory [.bio.gcg].  There
are two files:
  embl36unique_gcgvms.zip -    a ZIP file of the exclusion set in GCG 7.3
                               for VMS format
  embl36unique_gcgunix.tar_z - a compressed tarfile of the exclusion set in
                               GCG 7.2 (I think) for Unix
Please remember that these are EMBL 36 vs GenBank 79, and that I make no
guarantees that they are complete.

If I get a chance someday, I may repeat the comparison using GenBank 81 and
EMBL 37.  If so, I'll put that set up for ftp here.

The bottom line appears to be that there are a number of entries that haven't
been consolidated between GenBank and EMBL yet, but there's no easy way to get
the entries present in EMBL but not GenBank.  You'll have to decide for
yourself whether 
  o the exclusion set here will fill your needs, or
  o you would be better off relying on searches against NCBI's nr database
    (which does include the EMBL entries) via mailserver, or 
  o it's worthwhile to get a tape copy of EMBL releases, or
  o you can ignore the differences in most cases.

I'm sorry not to be of more help, but I hope that this at least better outlines
the situation for you.  Good luck.

