I am in the early stages of developing a program that will analyse
molecular sequences (nucleotide and amino acid). What functionality
it will finally have will be largely a matter of time and resources
(this is for a third-year university project subject, but has every
possibility of growing into a post-grad honours project too).
I may be re-inventing the wheel, but that is not the point (at least
not for me)... I want to use this project to LEARN - about C++,
portability issues, algorithm development and implementation,
tuning up my researching skills, learning more about molecular
biology itself, and so on.
I am planning to use C++ (using make), and make it as portable as possible
for both DOS and UNIX (and NO windoze pleez!:) Somehow I doubt that
I will get much more done than to build a basic "data engine" and a
user-interface, but I want to design this so that just about any
sort of analytical functionality can be enabled ("plugged in" as
it were) whenever it is designed and written.
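
A minimal sketch of the "plugged in" idea, assuming a standard C++11
compiler: the data engine knows only an abstract interface, and each
analysis is a class registered under a name. Analysis, GcContent and
AnalysisRegistry are hypothetical names made up for this illustration;
they are not taken from the NCBI toolkit or any other library.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical interface that every analysis module would implement.
class Analysis {
public:
    virtual ~Analysis() = default;
    virtual std::string name() const = 0;
    // Takes the raw residues and returns a short textual report.
    virtual std::string run(const std::string& residues) const = 0;
};

// Example plug-in: GC content of a nucleotide sequence, as a whole percent.
class GcContent : public Analysis {
public:
    std::string name() const override { return "gc-content"; }
    std::string run(const std::string& residues) const override {
        std::size_t gc = 0;
        for (char c : residues) {
            if (c == 'G' || c == 'C' || c == 'g' || c == 'c') ++gc;
        }
        // Integer division: the result is truncated, not rounded.
        std::size_t percent =
            residues.empty() ? 0 : (100 * gc) / residues.size();
        return std::to_string(percent) + "%";
    }
};

// The "data engine" side: a registry that owns the plugged-in analyses.
class AnalysisRegistry {
public:
    void add(std::unique_ptr<Analysis> analysis) {
        assert(analysis != nullptr);
        std::string key = analysis->name();  // read the name before moving
        analyses_[key] = std::move(analysis);
    }
    // Returns a null pointer when no analysis has that name.
    const Analysis* find(const std::string& name) const {
        auto it = analyses_.find(name);
        return it == analyses_.end() ? nullptr : it->second.get();
    }
    std::size_t size() const { return analyses_.size(); }

private:
    std::map<std::string, std::unique_ptr<Analysis>> analyses_;
};
```

The user interface would then only need to list the registered names
and call run() on the one selected, so a new analysis means one new
class and one add() call, with no change to the engine itself.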
I have been gathering and reading literature and journals, and the
magnitude of such a project is starting to dawn on me - it's
overwhelming, but not _too_ frightening! :-)
I've just fetched sdk.doc from ncbi.nlm.nih.gov and printed it out
(ouch - WinWord format... this caused some problems, and it would be
nice if a postscript file was also available).
This is an *AWESOME* document! (The size and scope of the ncbi
toolkit is likewise rather awesome:). It is largely a specification
document for the format of the data structures that programs
that do this sort of thing should use.
I have a zillion questions formulating in my head about all this,
and I hope that someone with experience will be able to point me
in the right direction. (I will be emailing toolbox at ncbi about
1. Is the toolbox used as the basis of real programming projects
for molecular biology?
A corollary: are the specifications laid out by NCBI used as
the basis for program design work by real software developers?
2. The specifications are written in a C-like format, and the .h
files look as if all the tools are written in plain ANSI C,
with C++ portability in them (i.e., extern "C" defines).
I want to write my code in C++, describing sequences as true objects.
Has this been done before?
Will I be re-inventing the wheel?
3. If I do produce data objects for molecular sequences in C++,
are there people who would be willing to comment on and help
me do such work? Indeed, would anyone be interested in this?
(I have no problems in sharing what I do).
4. I need to get my hands on some code to produce effective user
interfaces... I would prefer not to have to develop this myself.
I'm using Xterm on a machine running UNIX, Sun ver 4.3.1 (I think),
and Borland C++ version 4.0 (I have yet to decide on the make
utility that I want to use, but it will likely be gmake <gnu>).
Can anybody point me in the right direction for such publicly
available code (easy to use, and mostly in C++)?
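
To make questions 2 and 3 more concrete, here is a rough sketch of a
sequence described as a C++ object, together with one function given
C linkage through extern "C" so that plain C code could call it. It
assumes a C++11 compiler. Everything in it is hypothetical: Sequence,
Alphabet and seq_is_valid_nucleotide are invented for illustration and
do not come from the NCBI toolkit, and the alphabets are deliberately
simplified (no full set of ambiguity codes).

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

enum class Alphabet { Nucleotide, AminoAcid };

// Hypothetical sequence object: validates and upper-cases its residues.
class Sequence {
public:
    Sequence(std::string id, Alphabet alphabet, const std::string& residues)
        : id_(std::move(id)), alphabet_(alphabet) {
        for (char c : residues) {
            char u = static_cast<char>(
                std::toupper(static_cast<unsigned char>(c)));
            if (!isValid(alphabet_, u)) {
                throw std::invalid_argument("invalid residue in " + id_);
            }
            residues_ += u;
        }
    }

    const std::string& id() const { return id_; }
    const std::string& residues() const { return residues_; }
    std::size_t length() const { return residues_.size(); }

    // Reverse complement; only meaningful for nucleotide sequences.
    Sequence reverseComplement() const {
        assert(alphabet_ == Alphabet::Nucleotide);
        std::string out;
        for (auto it = residues_.rbegin(); it != residues_.rend(); ++it) {
            out += complement(*it);
        }
        return Sequence(id_ + "_rc", alphabet_, out);
    }

private:
    // Simplified alphabets (assumption): six nucleotide codes, and the
    // twenty standard one-letter amino acid codes.
    static bool isValid(Alphabet a, char c) {
        const std::string allowed = (a == Alphabet::Nucleotide)
                                        ? "ACGTUN"
                                        : "ACDEFGHIKLMNPQRSTVWY";
        return allowed.find(c) != std::string::npos;
    }
    static char complement(char c) {
        switch (c) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'U': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            default:  return 'N';
        }
    }

    std::string id_;
    Alphabet alphabet_;
    std::string residues_;
};

// C-callable wrapper: returns 1 if the string is a valid nucleotide
// sequence under the simplified alphabet above, 0 otherwise.
// Exceptions are caught here so none can escape into C code.
extern "C" int seq_is_valid_nucleotide(const char* residues) {
    if (residues == nullptr) return 0;
    try {
        Sequence s("tmp", Alphabet::Nucleotide, residues);
        return 1;
    } catch (const std::invalid_argument&) {
        return 0;
    }
}
```

The object keeps its invariant (only valid, upper-case residues) in
the constructor, so analysis code never has to re-check its input,
while the extern "C" function shows one way C and C++ code could meet
at a narrow boundary.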
I have more questions, but this will do for now (hoping that this
posting will generate email correspondence that will keep the
"irrelevant" traffic in Usenet to a minimum:)
Many thanks in advance for any information and help.
/ _ \ Tony Nugent Griffith University Brisbane Queensland Australia \ __
\_@) \ Email: tnugent at gucis.cit.gu.edu.au  T.Nugent at sct.gu.edu.au \ (_@ \