Parser for GenBank flatfile

Rob Read xxae021 at chpc.utexas.edu
Tue May 14 15:34:07 EST 1991


This is to announce a software system available from the 
GenTools project at the University of Texas Center for High
Performance Computing which may be of interest to programmers 
(and their bosses) who are trying to extract information from 
the GenBank flat file format.

I have written a parser for this format using Flex and Yacc (or
Bison).  I hope it will make it easy for a C programmer with some
Yacc experience to extract information from GenBank flat files,
or to translate them into some other format.  The code translates
most (99%) of the GenBank entries into a Prolog-like language; a
programmer could easily produce output in any other required format.
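
As a rough illustration of the kind of translation described above,
here is a small Python sketch.  It is not gbparse (which is written
with Flex and Yacc) and it does not reproduce gbparse's output; the
fact layout, the function names parse_entry and to_facts, and the
sample entry are all made up for the example.  It assumes only that
top-level keywords start in column 1, that continuation lines are
indented, and that "//" ends an entry.  It reads header fields only
and stops at the FEATURES line.

```python
# Hypothetical sketch: header fields of one flat-file entry -> Prolog-like facts.
# The sample entry below is invented and is not real GenBank data.
SAMPLE = """\
LOCUS       EXAMPLE1     24 bp    DNA             UNA       01-JAN-1991
DEFINITION  Made-up entry used only to exercise
            the sketch.
ACCESSION   X00000
KEYWORDS    .
FEATURES             Location/Qualifiers
     misc_feature    1..24
                     /note="not parsed by this sketch"
ORIGIN
        1 acgtacgtac gtacgtacgt acgt
//
"""


def parse_entry(lines):
    """Collect the top-level header fields of one entry into a dict."""
    fields = {}
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):          # end-of-entry marker
            break
        if not line.strip():               # skip blank lines
            continue
        if not line[0].isspace():          # keyword in column 1: new field
            keyword, _, value = line.partition(" ")
            if keyword == "FEATURES":      # feature table is out of scope here
                break
            current = keyword
            fields.setdefault(current, []).append(" ".join(value.split()))
        elif current is not None:          # indented line: continuation
            fields[current].append(" ".join(line.split()))
    # Join the pieces of each field; repeated keywords are simply merged.
    return {key: " ".join(p for p in parts if p) for key, parts in fields.items()}


def quote(text):
    """Quote text as a Prolog atom, doubling any embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def to_facts(fields):
    """Emit one fact per field, keyed by accession (an invented layout)."""
    accession = (fields.get("ACCESSION") or "unknown").split()[0]
    facts = []
    for keyword, value in fields.items():
        if keyword == "ACCESSION":
            continue
        facts.append(f"{keyword.lower()}({quote(accession)}, {quote(value)}).")
    return facts


for fact in to_facts(parse_entry(SAMPLE.splitlines())):
    print(fact)
```

A real translator also has to handle the feature table and the
sequence, which is where a proper grammar earns its keep.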

This is expected to be useful to many of those who cannot afford
or gain access to the (undoubtedly superior) relational format
of GenBank on the RDBMS SYBASE, or those who wish to write
special programs to extract information from the feature tables.

The software is an alpha release; few others have tested it and
I have only tested it on Sun SPARCstations.  However, I suspect many
programmers would like to see the code (grammar) I have written,
even if they do not intend to use it, because it represents the most
concrete description of the GenBank format (including the feature
table) that I have seen.

I call the code "gbparse".  There is documentation in the
package.  The code is not trouble free, in part because it
must deal with actual syntax errors in the distributed flat files.
Although the grammar may be wrong in several ways, many of the
"parsing errors" which it reports for Release 66 are in fact
mismatched quotes in the files, which are hard to deal with.
The program is somewhat robust in reporting errors in particular
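
To make the mismatched-quote problem concrete, here is a hedged
Python sketch of a pre-check that could be run over feature table
lines before parsing.  It is not part of gbparse.  It assumes that a
qualifier starts on a line whose first non-blank character is "/",
and that a quote inside a value is written as two double quotes, so a
well-formed qualifier always contains an even number of them.  The
function name and the sample lines are invented for the example.

```python
# Hypothetical pre-check: report qualifiers whose double quotes do not balance.
FEATURE_LINES = [
    '     CDS             1..30',
    '                     /product="made-up protein"',
    '                     /note="this value is never closed',
    '                     /gene="abc"',
    '     misc_feature    40..50',
    '                     /note="spans two',
    '                     lines but is balanced"',
]


def find_unbalanced_qualifiers(lines):
    """Return (line_number, first_line) for each qualifier with an odd quote count."""
    problems = []
    start, first, text = None, "", ""
    for number, raw in enumerate(lines, start=1):
        stripped = raw.strip()
        if stripped.startswith("/"):
            # A new qualifier begins; the previous one must be closed by now.
            if text.count('"') % 2 == 1:
                problems.append((start, first))
            start, first, text = number, stripped, stripped
        elif text.count('"') % 2 == 1:
            # The quoted value is still open, so this line continues it.
            text += " " + stripped
        else:
            # Not a qualifier and nothing is open: feature key or location line.
            start, first, text = None, "", ""
    if text.count('"') % 2 == 1:           # qualifier left open at end of input
        problems.append((start, first))
    return problems


for line_number, qualifier in find_unbalanced_qualifiers(FEATURE_LINES):
    print(f"line {line_number}: unbalanced quotes in {qualifier}")
```

A continuation line of free text that happens to begin with "/" would
fool this check, which is one reason such errors are hard to deal
with in general.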

Gbparse-0.0 is available by e-mailing a request to:

gentools at chpc.utexas.edu

Due to legal complications here, our distribution
will not be by anonymous ftp, at least at first.
It will, of course, be free to non-profit
organizations.  The source is distributed (in this
case the source is all that is useful).

Thanks to Jacob Engelbrecht and Jo Pelkey? for some initial testing
and Dan Davison for a starting grammar.

Questions, comments, bugs, and so on, should be reported to:

gentools at chpc.utexas.edu

Robert L. Read
GenTools Project Programmer
UT-Center for High Performance Computing (CHPC)
Balcones Research Center
10100 North Burnet Road, CMS 1.154
Austin, Texas 78712
