This message is to announce the release of version 1.2 of the SEQIO
package. The package is freely available to anyone for commercial or
non-commercial use, and can be ftp'ed from the following FTP site:
It is a gzip'ed tar file (356K compressed) containing the package
code and documentation files. I've also set up a web site for the
Also, see the description below for more information about the
Major changes from version 1.1:
* Added the GCG, MSF and BLAST program output formats
(including the ability to convert between non-GCG and
GCG forms of GenBank, PIR, EMBL, Swiss-Prot, FASTA, NBRF
and IG/Stanford formats without losing any header info).
* Added the ability to index database entries based on any (or
all) identifiers given in the entry, including the new NID and
* Added the ability to handle database identifiers (with wildcards)
and randomly access the specified entries.
* Added a "single entry access specification" mode for regular
files, so that you can extract, say, just the third entry in a
file, or the entry whose identifier is "sp:104k_thepa".
* Added a number of example programs to show how to use the package.
* The file conversion program (fmtseq) can now do "big alignments"
of BLAST output, and can do conversions between GCG and non-GCG
forms of sequence entries without losing any header information.
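
To make the "single entry access specification" idea concrete, here is a minimal sketch in C of how a selector such as "3,4" or "sp:104k_thepa" might be told apart and split into its parts. The names selector and parse_selector, and the fixed buffer sizes, are hypothetical and chosen for illustration only; this is not SEQIO's code or API, and the package's real rules may differ.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical types for illustration; SEQIO's own representation may differ. */
enum selector_kind { SEL_INVALID, SEL_NUMBERS, SEL_IDENTIFIER };

struct selector {
    enum selector_kind kind;
    long numbers[16];   /* 1-based entry numbers when kind == SEL_NUMBERS */
    int count;          /* how many entries of numbers[] are filled in */
    char prefix[16];    /* database prefix such as "sp" (may be empty) */
    char ident[64];     /* identifier or wildcard pattern after the ':' */
};

/* Classify `text` as a comma-separated list of entry numbers ("3,4")
 * or as an identifier with an optional "db:" prefix ("sp:104k_thepa"). */
static struct selector parse_selector(const char *text)
{
    struct selector sel;
    const char *colon, *id;
    size_t plen;

    memset(&sel, 0, sizeof sel);
    sel.kind = SEL_INVALID;

    /* Only digits and commas: treat it as a list of entry numbers. */
    if (text[0] != '\0' && strspn(text, "0123456789,") == strlen(text)) {
        const char *p = text;
        while (sel.count < 16) {
            char *end;
            long n = strtol(p, &end, 10);
            if (end == p || n < 1)
                return sel;              /* empty field or entry number 0 */
            sel.numbers[sel.count++] = n;
            if (*end == '\0') {
                sel.kind = SEL_NUMBERS;
                return sel;
            }
            p = end + 1;                 /* skip the ',' separator */
        }
        return sel;                      /* too many numbers for this sketch */
    }

    /* Otherwise: optional "db:" prefix followed by an identifier. */
    colon = strchr(text, ':');
    id = colon ? colon + 1 : text;
    plen = colon ? (size_t)(colon - text) : 0;
    if (*id == '\0' || plen >= sizeof sel.prefix || strlen(id) >= sizeof sel.ident)
        return sel;
    memcpy(sel.prefix, text, plen);      /* struct was zeroed, so terminated */
    strcpy(sel.ident, id);
    sel.kind = SEL_IDENTIFIER;
    return sel;
}
```

With this sketch, parse_selector("3,4") yields two entry numbers, parse_selector("sp:104k_thepa") yields the prefix "sp" and the identifier "104k_thepa", and a malformed list such as "3,,4" is rejected.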
For those of you who were at ISMB'96 and to whom I promised that both
this package and my new database search algorithm would be available
last weekend, the database search algorithm isn't ready yet. (I still
haven't gotten the hang of the difference between estimates for "down
the hallway" software, i.e. software written for the folks down the
hallway, and real product quality software.) However, it will be
ready soon. Certainly, by the time my post-doc runs out at the end of
For those of you who weren't at ISMB (or whom I didn't tell about my
database search algorithm), I've developed an alternative to FASTA and
BLAST that should produce Smith-Waterman quality alignments, i.e. the
same alignments you'd get if you ran the full-blown Smith-Waterman
search, but with the speed of FASTA and BLASTP. (It will probably
take me until the next version of this program to get to BLASTN
Which reminds me. My post-doc here at UC Davis ends at the end of
July, and so I'm looking for a job (either an industry,
algorithm/software-development position or a postdoc working with
biologists). If you have such a position or know about such a
position (that hasn't been widely advertised on the newsgroups or on
the WWW, because I've seen those), please let me know. I would
SEQIO: A C/C++ Package for Reading and Writing Sequences
The SEQIO package is a C/C++ package (or library) which makes reading
and writing sequences and biological databases as easy as reading and
writing files, while at the same time supporting I/O in the following
Raw/Plain, GenBank, PIR (CODATA), EMBL, Swiss-Prot, FASTA, NBRF,
IG/Stanford, ASN.1 text files, GCG, MSF, PHYLIP, Clustalw, and
output from the FASTA and BLAST suites of programs
supporting completely configurable databases, using the new BIOSEQ
standard for describing databases, like this one for GenBank:
# The GenBank Flat-File Database
# GenBank files as found at ftp site ncbi.nlm.nih.gov in /genbank.
gbbct.seq, gbest?.seq, gbinv.seq, gbmam.seq, gbpat.seq, gbphg.seq
gbpln.seq, gbpri.seq, gbrna.seq, gbrod.seq, gbsts.seq, gbsyn.seq
gbuna.seq, gbvrl.seq, gbvrt.seq
and supporting the transparent specification (as far as the program is
concerned) of single entries of databases, like "gb:humhb*" for all of
the human beta globin GenBank entries, and of single entries of any
files, like "myseqs@3,4" or "myseqs@gb:humhba1" to specify either the
third or fourth entry of file "myseqs" or the entry in "myseqs" whose
identifier is the GenBank HUMHBA1 locus. Also, the database entries
can be specified using the database-specific identifiers (e.g.,
GenBank locus numbers, PIR entry names, ...) or using the
cross-database accession, NID or PID numbers. Or all three, if you
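
As an illustration of how a wildcard specification like "gb:humhb*" could be compared against the identifiers in an index, here is a small glob-style matcher in C. It is only a sketch: the function name id_matches is hypothetical, and two things are assumptions rather than documented SEQIO behavior, namely that '?' matches a single character and that comparison ignores case.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of SEQIO. Returns 1 if `id` matches
 * `pattern`, where '*' matches any run of characters (including none)
 * and '?' matches exactly one character. Case is ignored (assumption). */
static int id_matches(const char *pattern, const char *id)
{
    const char *star = NULL;    /* most recent '*' seen in the pattern */
    const char *retry = NULL;   /* where in id that '*' started matching */

    while (*id != '\0') {
        if (*pattern == '*') {
            star = pattern++;   /* remember it; first try matching nothing */
            retry = id;
        } else if (*pattern == '?' ||
                   tolower((unsigned char)*pattern) ==
                   tolower((unsigned char)*id)) {
            pattern++;
            id++;
        } else if (star != NULL) {
            pattern = star + 1; /* backtrack: let '*' absorb one more char */
            id = ++retry;
        } else {
            return 0;
        }
    }
    while (*pattern == '*')     /* trailing stars match the empty string */
        pattern++;
    return *pattern == '\0';
}
```

Under these assumptions, id_matches("gb:humhb*", "gb:HUMHBA1") returns 1, while id_matches("gb:humhb*", "sp:104k_thepa") returns 0.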
The distribution also comes with a reimplementation and
extension of Don Gilbert's readseq program (called fmtseq). Besides
a much better user interface, this program can perform "no loss"
conversions between the non-GCG and GCG forms of GenBank, PIR, EMBL,
Swiss-Prot, FASTA, NBRF and IG/Stanford entries, and can take the
output from one of the FASTA or BLAST programs and construct a "big
alignment" by lining up all of the pairwise alignments into a
multiple alignment.
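
One simple way to picture a "big alignment" is to anchor every pairwise alignment on the shared query sequence. The C sketch below does that for a single hit. The struct pairwise layout and the function stack_row are hypothetical, and the approach is a deliberate simplification (columns where the query has a gap are dropped); it is not a description of how fmtseq actually builds its output.

```c
#include <string.h>

/* Hypothetical input: one pairwise alignment against a shared query.
 * query_row and subject_row are equal-length strings using '-' for gaps;
 * query_start is the 0-based query position of the first aligned residue. */
struct pairwise {
    const char *query_row;
    const char *subject_row;
    int query_start;
};

/* Fill `row` (at least query_len + 1 bytes) with the subject residues
 * placed under the query positions they align to. Positions outside the
 * alignment are written as '.'. Returns 0 on success, -1 on bad input. */
static int stack_row(const struct pairwise *aln, int query_len, char *row)
{
    size_t n = strlen(aln->query_row);
    int pos = aln->query_start;
    size_t i;

    if (strlen(aln->subject_row) != n || pos < 0 || query_len < 0)
        return -1;

    memset(row, '.', (size_t)query_len);
    row[query_len] = '\0';

    for (i = 0; i < n; i++) {
        if (aln->query_row[i] == '-')
            continue;            /* subject insertion: no query column */
        if (pos >= query_len)
            return -1;           /* alignment runs past the query end */
        row[pos++] = aln->subject_row[i];
    }
    return 0;
}
```

For an 8-residue query, a hit with query_row "CGTA", subject_row "CGAA" and query_start 1 produces the row ".CGAA...", and a hit with query_row "GT-AC", subject_row "GTTA-" and query_start 2 produces "..GTA-..". Printing the query followed by one such row per hit gives a stacked, query-anchored view of all the pairwise alignments.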
There's some other stuff too, but really, don't you think that's enough?