EPD for GCG. Was: Question about READSEQ...

Cary O'Donnell ODONNELL at ARCB.AFRC.AC.UK
Mon Mar 2 10:13:00 EST 1992


> I am trying to change the eukaryotic promoter database (epd29.seq)
> from a FASTA format to a GCG format so GCG can perform FASTA
> searches.
>

The EPD.DAT file distributed by EMBL does not contain sequence information
at all. So it cannot be "formatted" for GCG.

EPD is a list of database entries which are eukaryotic promoters.

The first entry is:

XX
FP   Pv snRNA U1         :+S  PLN:PVUG1      1+     352; 17001.098
XX
DO        Experimental evidence: 4
DO        Expression/Regulation: housekeeping gene
RF        PNAS84:9094

The entry code is PLN:PVUG1. In GCG you can obtain a copy of this entry using
$ FETCH EMBL:PVUG1


If you wish to use the whole EPD "database" with GCG you should do the
following:

a) Make a file of sequence names (FOSN) from the EPD.DAT file. The FOSN
   holds all the entry codes. You could write a simple program to parse
   the EPD.DAT file to extract all the codes. For example, call the
   following file EPD.LIS:

   This is a FOSN for the eukaryotic promoter database
   ..
   EM:PVUG1
   etc
   etc


b) Use the FOSN for database searches. For example, when FASTA asks for
   the database to search:

    Search for query in what sequence(s) (* GenEMBL:* *) ? @EPD.LIS


c) Or - use DATASET to make a separate GCG-readable database. Again, when
   it asks for the data:

    Assemble DATASET from what sequence(s) ? @EPD.LIS
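
The "simple program" suggested in step (a) might look something like the
Python sketch below. It is only an illustration and rests on assumptions:
that every entry code sits on an FP line as a whitespace-separated token
shaped like PLN:PVUG1 (as in the one entry shown above), and that the list
file wants a comment line, a ".." divider and then EM: names, as in the
EPD.LIS example. The function names extract_codes and make_fosn are made
up for this sketch.

```python
import re

# Assumed shape of an entry code on an FP line, e.g. "PLN:PVUG1":
# a short upper-case division prefix, a colon, then the entry name.
ENTRY_CODE = re.compile(r"[A-Z]{2,4}:([A-Z0-9_]+)")


def extract_codes(lines):
    """Return entry names found on FP lines, in order, without duplicates."""
    seen = set()
    codes = []
    for line in lines:
        if not line.startswith("FP"):
            continue  # only FP lines are assumed to carry entry codes
        for token in line.split():
            match = ENTRY_CODE.fullmatch(token)
            if match and match.group(1) not in seen:
                seen.add(match.group(1))
                codes.append(match.group(1))
    return codes


def make_fosn(codes, title="This is a FOSN for the eukaryotic promoter database"):
    """Build the list-file text: a comment line, the '..' divider, then names."""
    body = [title, ".."] + ["EM:" + code for code in codes]
    return "\n".join(body) + "\n"


if __name__ == "__main__":
    # Demo on the single entry quoted in this message.
    sample = [
        "XX",
        "FP   Pv snRNA U1         :+S  PLN:PVUG1      1+     352; 17001.098",
        "XX",
        "DO        Experimental evidence: 4",
    ]
    print(make_fosn(extract_codes(sample)), end="")
```

To build the real list you would pass the open EPD.DAT file to
extract_codes and write the result of make_fosn to EPD.LIS. Check the
output against a few entries by hand, since the FP line layout assumed
here is taken from a single example.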


BEWARE:
If you use the GCG-provided databases, then EMBL is only a subset of
the full EMBL database: entries that duplicate GenBank are left out. You
will need a full copy of the EMBL database for the above to work correctly.

(Alternatively, you will need to cross-identify the accession numbers
 to find the GenBank entry codes that correspond to the EMBL codes.
 This is not difficult if you start off with a full copy of EMBL and GenBank
 and use Peter Rice's GBONLY facility (or modify it) in the GCGUNSUPPORTED
 set.)

regards

Cary O'Donnell
*****************************************************************************
AFRC Computing Division         JANET   : AFRC.ARCB::ODONNELL
West Common                     INTERNET: ODONNELL at ARCB.AFRC.AC.UK
Harpenden                       Tel: (+44) 582 762271 ext 229
Herts AL5 2JE                   Fax: (+44) 582 761710
U.K.                            (AFRC = Agricultural & Food Research Council)
-----------------------------------------------------------------------------

============================================================================
Here is an extract from the EMBL release notes:

4.2  Eukaryotic Promoter Database (EPD)

EPD provides additional information about eukaryotic promoters which are present
in the main nucleotide sequence database. EPD is maintained and distributed
concurrently with the EMBL nucleotide sequence database.



