Building Mol Biol programs and the NCBI toolkit...

Keith Robison robison at nucleus.harvard.edu
Fri Jul 22 08:36:33 EST 1994

Tony Nugent (tnugent at gucis.cit.gu.edu.au) wrote:

: Hi all,

: I am in the early stages of developing a program that will analyse
: molecular sequences (nucleotide and amino acids).  What functionality
: it will finally have will be largely a matter of time and resources
: (this is for a third-year university project subject, but has every
: possibility of growing in a post-grad honours project too).

: I may be re-inventing the wheel, but that is not the point (at least
: not for me)... I want to use this project to LEARN - about C++,
: portability issues, algorithm development and implementation,
: tuning up my researching skills, learning more about molecular
: biology itself, and so on.

Go! Go! Go!

: I am planning to use C++ (using make), and make it as portable as possible
: for both DOS and UNIX (and NO windoze pleez!:)  Somehow I doubt that
: I will get much more done than to build a basic "data engine" and a
: user-interface, but I want to design this so that just about any
: sort of analytical functionality can be enabled ("plugged in" as
: it were) whenever it is designed and written.

: I have been gathering and reading literature and journals, and the
: magnitude of such a project is starting to dawn on me - it's 
: overwhelming, but not _too_ frightening!  :-)

Start small, then build.
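
To make "start small" concrete, here is a minimal sketch of the
"data engine plus plug-in analyses" idea from the quoted paragraph.
Every name in it (Sequence, Analysis, GcContent, Engine, plug_in) is
hypothetical and invented for illustration; none of it comes from the
NCBI toolkit, and it assumes a C++11 compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical names throughout; this is not NCBI toolkit code.

// Data side: a sequence is a name plus a string of residues.
class Sequence {
public:
    Sequence(std::string name, std::string residues)
        : name_(std::move(name)), residues_(std::move(residues)) {}
    const std::string& name() const { return name_; }
    const std::string& residues() const { return residues_; }
    std::size_t length() const { return residues_.size(); }
private:
    std::string name_;
    std::string residues_;
};

// Plug-in side: every analysis implements this one interface.
class Analysis {
public:
    virtual ~Analysis() = default;
    virtual std::string run(const Sequence& seq) const = 0;
};

// One small analysis: G+C content as a whole-number percentage.
class GcContent : public Analysis {
public:
    std::string run(const Sequence& seq) const override {
        if (seq.length() == 0) return "0";
        std::size_t gc = 0;
        for (char c : seq.residues()) {
            if (c == 'G' || c == 'C' || c == 'g' || c == 'c') ++gc;
        }
        // Integer division: the result is truncated, not rounded.
        return std::to_string(100 * gc / seq.length());
    }
};

// The engine only knows the Analysis interface, so new analyses can be
// registered later without changing the engine itself.
class Engine {
public:
    void plug_in(const std::string& name, std::unique_ptr<Analysis> a) {
        assert(a != nullptr);
        analyses_[name] = std::move(a);
    }
    bool has(const std::string& name) const {
        return analyses_.count(name) != 0;
    }
    // Returns an empty string when no analysis has that name.
    std::string run(const std::string& name, const Sequence& seq) const {
        auto it = analyses_.find(name);
        return it == analyses_.end() ? std::string() : it->second->run(seq);
    }
private:
    std::map<std::string, std::unique_ptr<Analysis>> analyses_;
};
```

The first milestone can be exactly this much: one sequence type and
one analysis.  Each later piece of functionality is then a new class
derived from Analysis, and the engine and user interface stay put.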

: I've just fetched sdk.doc from ncbi.nlm.nih.gov and printed it out
: (ouch - WinWord format... this caused some problems, and it would be
: nice if a postscript file was also available).

: This is an *AWESOME* document!  (The size and scope of the ncbi
: toolkit is likewise rather awesome:).  It is largely a specification
: document for the format of the data structures that programs
: that do this sort of thing should use.

: I have a zillion questions formulating in my head about all this,
: and I hope that someone with experience will be able to point me
: in the right direction.  (I will be emailing toolbox at ncbi about
: this too).

: 1.  Is the toolbox used as the basis of real programming projects
:     for molecular biology?  

:     A corollary: are the specifications laid out by NCBI used as
:     the basis for program design work by real software developers?

Yes, though there are only a few examples currently extant.  NCBI's
Entrez is the best example, but there is also Don Gilbert's SeqPup
(sequence analysis & gopher client).  I believe MacVector now incorporates
portions of the toolkit. 

: 2.  The specifications are written in a C-like format, and the .h
:     files look as if all the tools are written in plain ANSI C,
:     with C++ portability built in (i.e., extern "C" declarations).
:     I want to write my code in C++, describing sequences as true
:     OBJECTS (OOP).

:     Has this been done before?
:     Will I be re-inventing the wheel?

You should look at Don Gilbert's DClap library for one approach to
mating the NCBI toolkit with C++ (gopher/ftp over to ftp.bio.indiana.edu).
There are a number of biological C++ class libraries running about,
including my own (molbio++).  Try searching the bionet archives (at
the source gopher.bio.net or on the IUBio gopher mentioned above) for
"class and library".  I also have a strong hunch that there is at least one
other project involving biology, C++, and the NCBI toolkit which is underway
but hasn't yet surfaced (watch this space...).
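
As a rough illustration of what "mating" a C library with C++ can look
like, here is a minimal sketch that wraps a C-style record in a class
that owns it.  The C side (RawSeq, raw_seq_new, raw_seq_free) is a
made-up stand-in written for this example; it is NOT the NCBI toolkit's
actual API, and this is not how DClap does it.  It assumes a C++11
compiler.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <new>
#include <stddef.h>
#include <string>

extern "C" {
// Hypothetical C-style interface standing in for a C toolkit.
typedef struct RawSeq {
    char*  residues;  // NUL-terminated residue string
    size_t length;    // number of residues, excluding the NUL
} RawSeq;

// Allocates a RawSeq holding a copy of `residues`; NULL on failure.
RawSeq* raw_seq_new(const char* residues) {
    RawSeq* s = (RawSeq*)std::malloc(sizeof(RawSeq));
    if (!s) return NULL;
    s->length = std::strlen(residues);
    s->residues = (char*)std::malloc(s->length + 1);
    if (!s->residues) { std::free(s); return NULL; }
    std::memcpy(s->residues, residues, s->length + 1);
    return s;
}

// Releases a RawSeq; passing NULL is allowed.
void raw_seq_free(RawSeq* s) {
    if (s) { std::free(s->residues); std::free(s); }
}
}  // extern "C"

// C++ wrapper: the object owns the C record and frees it in the
// destructor, so callers never call raw_seq_free themselves.
class SeqObject {
public:
    explicit SeqObject(const std::string& residues)
        : raw_(raw_seq_new(residues.c_str())) {
        if (!raw_) throw std::bad_alloc();
    }
    ~SeqObject() { raw_seq_free(raw_); }

    // Copying is disabled so two objects never free the same record.
    SeqObject(const SeqObject&) = delete;
    SeqObject& operator=(const SeqObject&) = delete;

    std::size_t length() const {
        assert(raw_ != nullptr);
        return raw_->length;
    }
    std::string residues() const {
        assert(raw_ != nullptr);
        return std::string(raw_->residues, raw_->length);
    }
private:
    RawSeq* raw_;
};
```

The point of the design is that the C data structures stay exactly as
the library defines them, while the C++ layer adds ownership and member
functions on top.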

And don't worry too much about reinventing the wheel.  If you are learning
something for yourself, then it is a worthwhile project.  Also, just
because it has been done before doesn't necessarily mean it was done
perfectly (nearly impossible, due to tradeoffs).  People had been lighting
the world for thousands of years, but that didn't stop T.A. Edison!

Good luck!

Keith Robison
Harvard University
Department of Cellular and Developmental Biology
Department of Genetics / HHMI

robison at mito.harvard.edu 
