This is the initial release of the SEQIO package, a set of C functions
that can read and write biological sequence files in a variety of
formats and can be used to perform efficient searches of biological
databases. It's essentially a
successor to the "readseq" program, but geared more toward being used
in programs than just as a file conversion program (although it can do
that too).

The package currently supports the following file formats: GenBank
Flat File, PIR/CODATA, EMBL/Swiss-Prot, FASTA, NBRF, IG/Stanford and
ASN.1 text files. More formats will be included as I can find out the
details about them.
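
As a rough illustration of why several of these formats can share one reader, the first line of an entry is often enough to tell them apart. The sketch below is not taken from the SEQIO sources: the enum, its values and the function guess_format() are hypothetical names made up for this example, and the rules are a simplified heuristic based on the usual opening lines of each format (LOCUS for GenBank, ID for EMBL/Swiss-Prot, ">XX;" for NBRF, ">" for FASTA, ";" comments for IG/Stanford).

```c
#include <string.h>

/* Hypothetical names for this sketch only; none of them come from seqio.h. */
typedef enum {
    FMT_UNKNOWN, FMT_GENBANK, FMT_EMBL, FMT_NBRF, FMT_FASTA, FMT_IG
} seq_format_guess;

/* Guess the file format from the first line of an entry (heuristic). */
static seq_format_guess guess_format(const char *line)
{
    if (strncmp(line, "LOCUS", 5) == 0)
        return FMT_GENBANK;     /* GenBank entries open with a LOCUS line */
    if (strncmp(line, "ID   ", 5) == 0)
        return FMT_EMBL;        /* EMBL/Swiss-Prot entries open with an ID line */
    if (line[0] == '>') {
        /* NBRF headers look like ">P1;CODE": two characters, then ';'. */
        if (line[1] != '\0' && line[2] != '\0' && line[3] == ';')
            return FMT_NBRF;
        return FMT_FASTA;       /* any other '>' header is treated as FASTA */
    }
    if (line[0] == ';')
        return FMT_IG;          /* IG/Stanford entries start with ';' comments */
    return FMT_UNKNOWN;
}
```

A real reader would need more than this: a FASTA header whose fourth character happens to be ';' is misclassified here, and PIR/CODATA and ASN.1 text files are not covered at all.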
The package is freely available to anyone and can be downloaded by FTP
from:

    ftp://ftp.cs.ucdavis.edu/pub/strings/seqio.tar.gz

It is a gzip'ed tar file containing the package code and
documentation files. I don't have a Web site up yet, but it's coming
soon.

What I'm looking for now are four things:

  1) Users to begin writing programs with the package (see below
     for an example program).

  2) People who have examples and/or descriptions of other file
     formats so I can include them (it takes me on average about
     an hour per file format). High on my list of formats to
     include are the Phylip formats, FASTA/BLAST output and any
     multiple sequence alignment formats. A more complete list
     is given in the documentation.

  3) Information about the organization and file formats used
     by any databases out there (if you look at the documentation
     to the package, you'll see what I mean).

  4) Folks who are interested enough in getting the package to
     run on their machine that they would help me port it. It
     is currently Unix-specific software and has been tested under
     SunOS, Ultrix and IRIX, because those are the only machines
     I have access to. I'm willing to do as much as I can to
     get it to work on any and all machines.

The main goal of the package was to make reading and writing sequences
as easy as reading and writing normal files, while still being able to
handle large databases like GenBank. As an example, this complete
program takes a keyword and a database name, checks all of the sequences
in the database and outputs the entries whose sequences contain the
keyword:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>     /* for strstr() */
    #include "seqio.h"

    int main(int argc, char *argv[])
    {
        int len;
        char *seq, *entry;
        SEQFILE *sfp;

        if (argc != 3) {
            fprintf(stderr, "Usage: match keyword database\n");
            exit(1);
        }

        /* Open the named database for reading. */
        if ((sfp = seqfopendb(argv[2])) == NULL)
            exit(1);

        /* Read each sequence; print the whole entry if it contains
           the keyword. */
        while ((seq = seqfgetseq(sfp, &len, 0)) != NULL) {
            if (len > 0 && strstr(seq, argv[1])) {
                entry = seqfentry(sfp, NULL, 0);
                if (entry != NULL)
                    fputs(entry, stdout);
            }
        }

        seqfclose(sfp);
        return 0;
    }

This program scanned all of the GenBank database, Flat File Release
87.0 (about 800 MB of text, of which 249 MB is sequence), for a randomly
generated 20-character sequence in under 8 minutes on a DEC 5000/240
(not an Alpha).