a generic CGI program to retrieve biological database entries
in various formats and styles (using SRS)
by Heikki Lehvaslaiho <heikki at ebi.ac.uk>
The program dbfetch is a Perl CGI.pm-based script to serve database
entries from up-to-date servers. It is an extension of an interface used
by the EBI script emblfetch. Serving raw sequence entries via the HTTP
protocol makes it easy to create application programs that access any
sequence by its ID alone.
I started to write this program with these specifications in mind:
1. Retrieve biological database entries over the Web based on unique
2. Offer consistent, platform and database engine independent
3. Easy-to-write URL syntax where the ID is simply added to the end.
4. Serve entries not only in HTML but also in raw, easy to parse
5. Modular, expandable structure.
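The URL syntax in point 3 can be illustrated with a small client-side
sketch. The base URL below is a placeholder, not the real dbfetch
address, and the helper name entry_url is hypothetical. It only shows
the idea of appending a URL-escaped ID to a fixed prefix.

```perl
use strict;
use warnings;

# Placeholder base URL (an assumption for illustration only).
my $BASE = 'http://example.org/cgi-bin/dbfetch?';

# Hypothetical helper: append the entry ID to the end of the base URL.
sub entry_url {
    my ($id) = @_;
    die "empty ID\n" unless defined $id && length $id;
    # Percent-encode anything outside the unreserved URL characters.
    (my $escaped = $id) =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $BASE . $escaped;
}

print entry_url('J02231'), "\n";
```

Keeping the ID as the last part of the URL means a client only needs
string concatenation to address any entry.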
Although the underlying database engine used in this script is SRS,
the program can easily be modified to access other indexing systems
(e.g. through the EMBOSS entret program).
This script is NOT (at the moment) offering free text or keyword
Most importantly, this approach is not dependent on some heavy,
hard-to-maintain technology (CORBA). All it needs is an HTTP connection and a
parser for a database ASCII format. These parsers are now available in
various open source projects (bioperl, biopython, biojava).
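To give a feel for how little is needed on the client side, here is a
minimal sketch that pulls the identifier and accession numbers out of
an EMBL-style flat-file entry that has already been retrieved as text.
The sample entry is abbreviated and made up for illustration, and
parse_embl_header is a hypothetical helper. Real programs would rather
use the parsers from the projects mentioned above.

```perl
use strict;
use warnings;

# Hypothetical helper: read the ID and AC lines of an EMBL-style entry.
# Each line starts with a two-letter code; a line with '//' ends the entry.
sub parse_embl_header {
    my ($text) = @_;
    my %entry = (id => undef, accessions => []);
    for my $line (split /\n/, $text) {
        last if $line =~ m{^//};
        if ($line =~ /^ID\s+(\S+)/) {
            # Drop a trailing ';' in case the name is terminated by one.
            ($entry{id} = $1) =~ s/;$//;
        }
        elsif ($line =~ /^AC\s+(.+)/) {
            # Accession numbers are separated by semicolons.
            push @{ $entry{accessions} }, grep { length } split /;\s*/, $1;
        }
    }
    return \%entry;
}

# Abbreviated, made-up sample entry (not real database content).
my $raw = <<'END';
ID   EXAMPLE1   standard; DNA; UNC; 12 BP.
AC   X00001; X00002;
DE   Illustrative entry only.
SQ   Sequence 12 BP;
     acgtacgtac gt                                                        12
//
END

my $e = parse_embl_header($raw);
print "id: $e->{id}\n";
print "accessions: @{ $e->{accessions} }\n";
```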
Casual users need simple ways to access and browse database entries on
the web. The HTML form-based interface caters for these users.
Increasingly, users of bioinformatics services write small programs to
analyze sequence and other database entries. However, it is difficult
to maintain locally up-to-date databases and, in a larger environment,
make those databases visible to all users. dbfetch makes it easy to
access database entries from anywhere.
As a first step, the BioPerl module Bio::DB::EMBL uses dbfetch to retrieve
data into Bio::Seq objects. The whole process can be written in three
lines of BioPerl code:
$embl = new Bio::DB::EMBL;
$seq = $embl->get_Seq_by_acc('J02231');
# do whatever is needed with the entry
print "seqid is ", $seq->id, "\n";
Currently Bio::DB::SwissProt accesses swissprot entries from the
Expasy server and users can point their requests to its mirrors. This
Expasy script has limitations (it does not serve TREMBL entries) and it
is not available for other databases. dbfetch tries to overcome these
limitations.
dbfetch uses local SRS calls to retrieve entries. Each style (html
or raw) is defined in its own subroutine. The details about each
database (name, update database name, id field names, format) are kept
in a global hash. Another hash stores a regular expression to extract
a unique identifier from an entry. These two hashes and the subroutine
building the web page are the only places that need to be touched when
a new database is added. After modification it is advisable to run
dbfetch from the command line to trigger a subroutine that checks the two
hashes for consistency.
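The two-hash layout and the consistency check could look roughly like
the sketch below. All names, keys and values in it (%DB, %ID_RE,
check_consistency, the database entries) are illustrative assumptions
and are not taken from the real dbfetch script.

```perl
use strict;
use warnings;

# All names, keys and values below are illustrative assumptions,
# not the contents of the real dbfetch script.

# Per-database details: name, update database name, ID field names, format.
my %DB = (
    embl => {
        name     => 'EMBL',
        update   => 'EMBLNEW',
        idfields => ['id', 'acc'],
        format   => 'embl',
    },
    swall => {
        name     => 'SWALL',
        update   => undef,          # no separate update database assumed
        idfields => ['id', 'acc'],
        format   => 'swissprot',
    },
);

# Per-database regular expression that picks the unique ID out of an entry.
my %ID_RE = (
    embl  => qr/^ID\s+(\S+)/m,
    swall => qr/^ID\s+(\S+)/m,
);

# Return a list of problems; an empty list means the two hashes agree.
sub check_consistency {
    my ($db, $id_re) = @_;
    my @problems;
    for my $key (sort keys %$db) {
        push @problems, "no ID regex for '$key'" unless exists $id_re->{$key};
        for my $field (qw(name idfields format)) {
            push @problems, "'$key' lacks '$field'"
                unless defined $db->{$key}{$field};
        }
    }
    for my $key (sort keys %$id_re) {
        push @problems, "ID regex for unknown database '$key'"
            unless exists $db->{$key};
    }
    return @problems;
}

my @problems = check_consistency(\%DB, \%ID_RE);
print( @problems ? join("\n", @problems) . "\n" : "hashes are consistent\n" );
```

With this layout, adding a database means adding one key to each hash,
and the check reports a key that is present in only one of them.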
WHAT YOU COULD DO
Please install dbfetch on your local server and let me know that it is
available for inclusion into bioperl modules.
Bioperl and related open source projects (e.g. biojava and biopython)
have so far focused on sequence analysis. dbfetch makes it easier than
ever to work with other data types. If you are willing to create, or
already have, suitable code for parsing and creating objects for other
formats, please join in. To start with, the BioPerl project would welcome
classes to store and manipulate literature references (Medline) and
protein structure (PDB) data, which are database entries served by
the current EBI dbfetch script.
The dbfetch script is running at:
It is available under the Perl Artistic License from the BioPerl
(http://bioperl.org) CVS repository or directly from:
and in due course in the next release (0.8) of BioPerl.