This type of problem is one of the reasons I wrote my SEQIO
package. Below is a complete C program which will extract
all of the /note fields from the CDS_pept features. The
program takes a list of files and outputs the list of
entries and the note fields as follows:

    For entry embl:MELLP
     late lactation protein precursor
    For entry embl:MEMRNAAL
     pot. alpha lactalbumin preprotein
    For entry embl:SC9920
     YM9920.01c, unknown, partial, len: 956, CAI: 0.14; PS00061 Short-chain alcohol dehydrogenase family signature
     YM9920.02c, unknown, len: 61, CAI: 0.17, possible small spliced gene
     YM9920.03c, unknown, len: 55, CAI: 0.13, possible small spliced gene
     YM9920.04, unknown, len: 585, CAI: 0.18, putative glutamate decarboxylase gene

The program also splices together notes that span several
lines. It is, however, a little limited: it can only handle
EMBL format files, and it only looks for the /note field in
each CDS_pept feature (both are hard-coded).
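
As a rough sketch of the splicing step on its own, independent of
SEQIO: the helper below copies a quoted value into a separate buffer
and replaces each line break, together with the "FT" line code and
padding of the continuation line, with a single space. The name
splice_value and its signature are hypothetical, made up for this
illustration; the full program further down does the same job in place.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of SEQIO).  src points just past the
 * opening quote of a qualifier value.  Copies the value into dst up to
 * the closing quote, turning every line break plus the "FT" prefix and
 * blanks of the next line into one space.  Returns the number of
 * characters written, not counting the terminating '\0'.
 */
size_t splice_value(const char *src, char *dst, size_t dstsize)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    if (dstsize == 0)
        return 0;

    while (*src != '\0' && *src != '"' && n + 1 < dstsize) {
        if (*src == '\n') {
            src++;                                /* the line break   */
            if (strncmp(src, "FT", 2) == 0)       /* the line code    */
                src += 2;
            while (*src == ' ' || *src == '\t')   /* the padding      */
                src++;
            dst[n++] = ' ';
        } else {
            dst[n++] = *src++;
        }
    }
    dst[n] = '\0';
    return n;
}
```

Output that does not fit in dst is silently truncated, which is a
simplification; a real version would probably want to report that.
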
For someone with a little programming experience, it should
be simple to modify the program to look for a different feature
or a different sub-field of a feature.
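
One possible way to make that modification, sketched without SEQIO so
it can be tried on a plain string: pass the feature key and the
qualifier name in as parameters instead of hard-coding them. The
function names find_feature and find_qualifier are hypothetical, and
the sketch assumes the usual EMBL layout ("FT", three blanks, then the
key, with blanks in the key column on continuation lines).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of SEQIO).  Finds the first feature
 * with the given key after `from`.  Returns a pointer to its "FT" line
 * and stores the length of the feature (key line plus continuation
 * lines) in *len, or returns NULL if there is none.  Like the full
 * program, it only sees features that are preceded by a line break.
 */
const char *find_feature(const char *from, const char *key, size_t *len)
{
    char pat[64];
    const char *start, *end;
    int n;

    assert(from != NULL && key != NULL && len != NULL);
    n = snprintf(pat, sizeof pat, "\nFT   %s", key);
    if (n < 0 || (size_t) n >= sizeof pat)
        return NULL;                      /* key too long for pat */

    for (start = strstr(from, pat); start != NULL;
         start = strstr(start + 1, pat)) {
        /* The key must end here, so "CDS" does not match "CDS_pept". */
        char c = start[n];
        if (c == ' ' || c == '\t' || c == '\n' || c == '\0')
            break;
    }
    if (start == NULL)
        return NULL;
    start++;                              /* step over the line break */

    end = start;
    for (;;) {
        end = strchr(end, '\n');
        if (end == NULL) {                /* feature runs to the end */
            end = start + strlen(start);
            break;
        }
        /* A continuation line has a blank where a key would start. */
        if (strncmp(end, "\nFT    ", 7) != 0)
            break;
        end++;
    }
    *len = (size_t) (end - start);
    return start;
}

/*
 * Hypothetical helper.  Looks for /name=" inside feat[0..len) and
 * returns a pointer just past the opening quote, or NULL.
 */
const char *find_qualifier(const char *feat, size_t len, const char *name)
{
    char pat[64];
    const char *p;
    int n;

    assert(feat != NULL && name != NULL);
    n = snprintf(pat, sizeof pat, "/%s=\"", name);
    if (n < 0 || (size_t) n >= sizeof pat)
        return NULL;
    for (p = feat; p + n <= feat + len; p++)
        if (memcmp(p, pat, (size_t) n) == 0)
            return p + n;
    return NULL;
}
```

Calling find_feature(entry, "CDS_pept", &len) and then
find_qualifier(feature, len, "note") corresponds to what the program
below does; other keys and qualifier names can be swapped in. Unlike
the program, this sketch leaves the entry text unmodified.
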
To compile the program, first extract the text from this
message, then ftp my SEQIO package from the following site:

    ftp://ftp.cs.ucdavis.edu/pub/strings/seqio.tar.gz

Unpack it, and compile seqio.c and the main program together
using a C compiler. It should compile and run on any
Unix or Windows NT/95 machine (in Windows, run the program
from a DOS shell).
(***Advertisement***)
I wrote the SEQIO package for things just like this, where
someone with a little programming experience needs to do
something that no one has provided software for. The
package simplifies all of the file and sequence I/O, so
that the programmer can concentrate on the new piece
of code. Its use does require C/C++ programming experience,
but, as I hope you can see from the example below, not
that much experience.
(***End of Advert***)
Jim
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include "seqio.h"
int main(int argc, char *argv[])
{
  int len, i, flag;
  char *entry, *s, *t, *s2, *t2, *feature, *note, *id;
  SEQFILE *sfp;

  for (i=1; i < argc; i++) {
    if ((sfp = seqfopen2(argv[i])) == NULL)
      continue;

    if (strcmp(seqfformat(sfp, 0), "EMBL") != 0) {
      fprintf(stderr, "%s: Not an EMBL file.\n", seqffilename(sfp, 0));
      seqfclose(sfp);  /* close the file before skipping it */
      continue;
    }

    /*
     * Read the entries.
     */
    while ((entry = seqfgetentry(sfp, &len, 0)) != NULL) {
      s = entry;
      flag = 0;
      while ((s = strstr(s, "\nFT   CDS_pept")) != NULL) {
        /*
         * Found a CDS, so print the entry's id, if it's the first CDS found.
         */
        if (!flag) {
          if ((id = seqfmainid(sfp, 0)) != NULL ||
              (id = seqfmainacc(sfp, 0)) != NULL)
            printf("For entry %s\n", id);
          else
            printf("For an unknown entry\n");
          flag = 1;
        }

        /*
         * Find the end of the feature lines for that CDS, and make
         * it NUL-terminated.  A continuation line has "FT" and blanks
         * where a new feature would have its key.
         */
        feature = ++s;
        while (*s != '\n') s++;
        while (strncmp(s, "\nFT   ", 6) == 0 && isspace((unsigned char) s[6])) {
          s++;
          while (*s != '\n') s++;
        }
        *s = '\0';

        /*
         * Look for the /note field, then find the strings between
         * the quotes, squeezing out any line breaks.
         */
        if ((t = strstr(feature, "/note=\"")) != NULL) {
          note = t2 = s2 = t + 7;
          while (s2 < s && *s2 != '"') {
            if (*s2 == '\n') {
              s2 += 6;     /* Skip the "\nFT   " and then the spaces */
              while (*s2 != '\n' && isspace((unsigned char) *s2))
                s2++;
              *t2++ = ' ';
            }
            else {
              if (t2 != s2)
                *t2 = *s2;
              t2++;
              s2++;
            }
          }
          *t2 = '\0';
          printf(" %s\n", note);
        }

        *s = '\n';  /* restore the line break that was overwritten above */
      }
    }
    seqfclose(sfp);
  }

  return 0;
}