Software to extract annotation fields from EMBL/GenBank entries.

Peter Rice pmr at sanger.ac.uk
Mon Jun 3 10:49:48 EST 1996

In article <b.robertson-0306961409530001 at cb-11.sm.ic.ac.uk> b.robertson at ic.ac.uk (Brian Robertson) writes:
>   The amount of bacterial genome data available as sequenced cosmids of
>   30-40 kb is increasing rapidly. Our problem is that we need to keep track
>   of newly discovered genes as they appear, so they can be incorporated into
>   our research program as appropriate. For this we need to create lists of
>   probable genes identified in the annotations for each cosmid. This can
>   then be circulated to laboratory workers.
>   An example of this kind of annotation is shown below. We would like to
>   extract the "/note" field, which contains the probable function of the
>   gene, and create a list of these for each cosmid.

>FT   CDS_pept        complement(3043..4155)
>FT                   /note="MTCY190.03c, probable anthranilate
>FT                   phosphoribosyltransferase, trpD, len: 370, similar to eg
>FT                   SW:TRPD_LACCA P17170, (43.2% identity in 308 aa overlap),
>FT                   initiation codon uncertain, gtg at 4086 favoured by
>FT                   homology but this has no clear ribosome binding site"

Clearly a job for SRS and ICARUS. Try the bionet.software.srs
newsgroup (this message is crossposted there) ...

But beware: these fields often describe homologies rather than
confirmed functional assignments (how firm the assignments are varies
from project to project).

Also, different projects will use different methods for formatting
the initial annotation. There is as yet no consensus on how the data
should be presented.

Then again, you also need to keep up with changes to the entries
in case the annotation includes a new homology, or the predicted
gene changes (splice sites in eukaryotes, or the gtg alternative start
codon in the example you used).
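
One way to keep up with such changes is to compare the notes from the
previous run against the current ones. This is only a rough sketch in
Python, assuming you already have each run as a mapping from ORF name
to note text. The function name diff_annotations and the sample entries
ORF_A, ORF_B and ORF_C are made-up placeholders, not real data.

```python
def diff_annotations(previous, current):
    """Compare two runs; each argument maps an ORF name to its note text.

    Returns three sorted lists of ORF names: new, changed, removed.
    """
    new = sorted(name for name in current if name not in previous)
    changed = sorted(
        name for name in current
        if name in previous and current[name] != previous[name]
    )
    removed = sorted(name for name in previous if name not in current)
    return new, changed, removed


# Hypothetical data for illustration only.
last_run = {
    "ORF_A": "probable transferase",
    "ORF_B": "unknown",
}
this_run = {
    "ORF_A": "probable transferase, start codon revised",
    "ORF_C": "similar to hypothetical protein",
}

new, changed, removed = diff_annotations(last_run, this_run)
print("new:", new)
print("changed:", changed)
print("removed:", removed)
```

A plain text comparison like this flags any edit to a note, including
trivial rewording, so the "changed" list still needs a human eye.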

>If a shell script is required, can anyone help with writing one? I'm
>afraid it's beyond my capabilities.....

Sadly, it's a little more complicated than that :-)

What SRS can do is extract the features of interest for new/changed
entries, for a given organism, since the last run.

Something like:

% getz "([emnew-org:Mycobacterium tuberculosis] & \
         [emnew-dat#19960500:])" \
         -f 'id acc dat def fts' >! mtcds.may96

... which takes only a second or two to run. You can also do the same
query through any of the SRSWWW servers (it will work on EMNEW or on
GBNEW, updates for EMBL or GenBank).

ICARUS in turn can be used to parse out the feature table fields of
interest, assuming that all entries share some common format (in this
case, all ORF names start with MTC, but you could also look for "/gene="
to pick out other entries).

ICARUS is due sometime soon (in SRS 5). Meanwhile, other languages like Perl
can do the job, although they are a little more complicated to write.
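
As an illustration of the scripting route, here is a minimal sketch in
Python (rather than Perl) that pulls the "/note" values out of EMBL
feature table lines. It assumes each line starts with "FT", that the
feature key begins in column 6, and that a note value contains no
embedded double quotes. The names extract_notes and SAMPLE are
hypothetical. Treat it as a starting point, not a complete parser.

```python
def extract_notes(lines):
    """Return a list of (feature_key, location, note) tuples.

    Multi-line /note values are joined with single spaces.
    """
    results = []
    key, location = None, None
    parts = None  # pieces of the note currently being collected, if any

    for line in lines:
        if not line.startswith("FT"):
            continue
        body = line[2:].rstrip()
        text = body.strip()
        if parts is None:
            # A feature line has its key right after three spaces.
            if body[:3] == "   " and body[3:4].strip():
                key, _, location = text.partition(" ")
                location = location.strip()
                continue
            if not text.startswith('/note="'):
                continue  # some other qualifier or a location continuation
            parts = []
            text = text[len('/note="'):]
        if text.endswith('"'):
            # Closing quote: the note is complete.
            parts.append(text[:-1])
            results.append((key, location, " ".join(p for p in parts if p)))
            parts = None
        else:
            parts.append(text)
    return results


# The annotation quoted earlier in this message, without the ">" marks.
SAMPLE = [
    'FT   CDS_pept        complement(3043..4155)',
    'FT                   /note="MTCY190.03c, probable anthranilate',
    'FT                   phosphoribosyltransferase, trpD, len: 370, similar to eg',
    'FT                   SW:TRPD_LACCA P17170, (43.2% identity in 308 aa overlap),',
    'FT                   initiation codon uncertain, gtg at 4086 favoured by',
    'FT                   homology but this has no clear ribosome binding site"',
]

for key, location, note in extract_notes(SAMPLE):
    print(key, location)
    print("   ", note)
```

Run over the getz output, this gives one note per feature. Filtering
on the MTC prefix or on "/gene=" would be a small addition.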

Peter Rice                           | Informatics Division
E-mail: pmr at sanger.ac.uk             | The Sanger Centre
Tel: (44) 1223 494967                | Hinxton Hall, Hinxton,
Fax: (44) 1223 494919                | Cambs, CB10 1RQ
URL: http://www.sanger.ac.uk/~pmr/   | England
