Software to extract annotation fields from EMBL/GenBank entries.

James Knight knight at quad.cs.ucdavis.edu
Wed Jun 5 19:06:07 EST 1996

This type of problem is one of the reasons I wrote my SEQIO
package.  Below is a complete C program which will extract
all of the /note fields from the CDS_pept features.  The
program takes a list of files and outputs the list of
entries and the note fields as follows:

For entry embl:MELLP
   late lactation protein precursor
For entry embl:MEMRNAAL
   pot. alpha lactalbumin preprotein
For entry embl:SC9920
   YM9920.01c, unknown, partial, len: 956, CAI: 0.14; PS00061 Short-chain alcohol dehydrogenase family signature
   YM9920.02c, unknown, len: 61, CAI: 0.17, possible small spliced gene
   YM9920.03c, unknown, len: 55, CAI: 0.13, possible small spliced gene
   YM9920.04, unknown, len: 585, CAI: 0.18, putative glutamate decarboxylase gene

It also takes care of splicing a /note value that spans
several lines back into a single line.  It is, however, a
little limited: it can only handle EMBL format files, and it
only looks for the /note field in each CDS_pept feature
(both are hard-coded).
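
To make the splicing step concrete, here is a minimal sketch of
it in isolation.  The function name join_ft_lines and its
buffer-based interface are hypothetical (they are not part of
SEQIO, and the full program does the same job in place rather
than into a separate buffer).  It assumes the text starts just
after the opening quote of the value and that continuation
lines begin with "FT" followed by blanks.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of SEQIO.  Copies a quoted qualifier
 * value that may span several EMBL feature-table lines into `out`,
 * replacing each line break (plus the "FT" line code and indentation
 * that follow it) with a single space.  `src` must point just past the
 * opening quote.  Copying stops at the closing quote, at the end of the
 * string, or when `out` is full.  Returns the number of characters
 * written, not counting the terminating '\0'.
 */
size_t join_ft_lines(const char *src, char *out, size_t outsize)
{
  size_t n = 0;

  if (outsize == 0)
    return 0;

  while (*src != '\0' && *src != '"' && n + 1 < outsize) {
    if (*src == '\n') {
      src++;                                 /* skip the newline */
      if (src[0] == 'F' && src[1] == 'T')
        src += 2;                            /* skip the "FT" line code */
      while (*src != '\n' && isspace((unsigned char) *src))
        src++;                               /* skip the indentation */
      out[n++] = ' ';
    }
    else
      out[n++] = *src++;
  }
  out[n] = '\0';
  return n;
}
```

Called on the text following /note=" in a feature, this should
leave the whole note on one line in the output buffer.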

For someone with a little programming experience, it should
be simple to modify the program to look for a different feature
or a different sub-field of a feature.
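
As a rough illustration of that kind of change, the sketch
below takes the feature key and the sub-field as arguments
instead of hard-coding "CDS_pept" and "/note".  The names
next_feature and find_qualifier are hypothetical helpers, not
SEQIO functions, and the sketch assumes the same layout the
program relies on: a feature starts with "FT", three blanks
and the key, and its continuation lines have a blank in the
column where the key would start.

```c
#include <ctype.h>
#include <string.h>

/*
 * Hypothetical helper, not part of SEQIO.  Searches `text` for the next
 * feature whose key is `key` (e.g. "CDS_pept").  Returns a pointer to
 * the "FT" that starts the feature, or NULL if there is none.  On
 * success, *end points at the newline that ends the feature's last line
 * (or at the terminating '\0' if the text stops there).
 */
const char *next_feature(const char *text, const char *key, const char **end)
{
  size_t klen = strlen(key);
  const char *s = text, *name, *e;

  while ((s = strstr(s, "\nFT   ")) != NULL) {
    name = s + 6;
    if (strncmp(name, key, klen) == 0 && isspace((unsigned char) name[klen])) {
      /* Step over the key line, then over each continuation line. */
      e = strchr(name, '\n');
      while (e != NULL && strncmp(e, "\nFT   ", 6) == 0 &&
             isspace((unsigned char) e[6]))
        e = strchr(e + 1, '\n');
      *end = (e != NULL) ? e : name + strlen(name);
      return s + 1;
    }
    s++;
  }
  return NULL;
}

/*
 * Hypothetical helper.  Looks for `qual` (e.g. "/note=\"") between
 * `feat` and `end`, and returns a pointer to the first character of its
 * value, or NULL if the feature has no such sub-field.
 */
const char *find_qualifier(const char *feat, const char *end, const char *qual)
{
  const char *q = strstr(feat, qual);

  return (q != NULL && q < end) ? q + strlen(qual) : NULL;
}
```

With helpers like these, looking for, say, the /product field
of each CDS_pept feature would only mean changing the two
string arguments.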

To compile the program, first extract the text from this
message, then ftp my SEQIO package from the following site


unpack it, and compile seqio.c and the main program together
using a C compiler.  It should compile and run on any
Unix or Windows NT/95 machine (in Windows, run the program
from a DOS shell).

I wrote the SEQIO package for things just like this, where
someone with a little programming experience needs to do
something that no one has provided software for.  The
package simplifies all of the file and sequence I/O, so
that the programmer can concentrate on the new piece 
of code.  Its use does require C/C++ programming experience,
but, as I hope you can see from the example below, not
that much experience.
(***End of Advert***)


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include "seqio.h"

int main(int argc, char *argv[])
{
  int len, i, flag;
  char *entry, *s, *t, *s2, *t2, *feature, *note, *id;
  SEQFILE *sfp;

  for (i=1; i < argc; i++) {
    if ((sfp = seqfopen2(argv[i])) == NULL)
      continue;
    if (strcmp(seqfformat(sfp, 0), "EMBL") != 0) {
      fprintf(stderr, "%s:  Not an EMBL file.\n", seqffilename(sfp, 0));
      seqfclose(sfp);
      continue;
    }

    /*
     * Read the entries.
     */
    while ((entry = seqfgetentry(sfp, &len, 0)) != NULL) {
      s = entry;
      flag = 0;
      while ((s = strstr(s, "\nFT   CDS_pept")) != NULL) {
        /*
         * Found a CDS, so print the entry's id, if it's the first CDS found.
         */
        if (!flag) {
          if ((id = seqfmainid(sfp, 0)) != NULL ||
              (id = seqfmainacc(sfp, 0)) != NULL)
            printf("For entry %s\n", id);
          else
            printf("For an unknown entry\n");
          flag = 1;
        }

        /*
         * Find the end of the feature lines for that CDS, and make
         * it NUL-terminated.  Continuation lines start with "FT"
         * followed by blanks up to the qualifier column.
         */
        feature = ++s;
        while (*s != '\n') s++;
        while (strncmp(s, "\nFT   ", 6) == 0 &&
               isspace((unsigned char) s[6])) {
          s++;
          while (*s != '\n') s++;
        }
        *s = '\0';

        /*
         * Look for the /note field, then find the strings between
         * the quotes, squeezing out any line breaks.
         */
        if ((t = strstr(feature, "/note=\"")) != NULL) {
          note = t2 = s2 = t + 7;
          while (s2 < s && *s2 != '"') {
            if (*s2 == '\n') {
              s2 += 6;        /* Skip the "\nFT   " and then the spaces */
              while (*s2 != '\n' && isspace((unsigned char) *s2))
                s2++;
              *t2++ = ' ';
            }
            else {
              if (t2 != s2)
                *t2 = *s2;
              s2++;
              t2++;
            }
          }
          *t2 = '\0';
          printf("   %s\n", note);
        }

        /* Restore the newline so the search for the next CDS can continue. */
        *s = '\n';
      }
    }
    seqfclose(sfp);
  }

  return 0;
}

