open-source software for bioinformatics (was Re: Unix vs Linux - the movie.)

Andrew Dalke dalke at acm.org
Tue Aug 1 23:51:13 EST 2000

John S. J. Anderson wrote:
>I _was_ seriously suggesting that you should just be able to read the
>source, however. If the source is so obfuscated and poorly commented
>that someone qualified to review the rest of the paper can't figure it
>out, then the paper shouldn't be published.
>I've never actually seen the BLAST source -- I suppose I could try to
>have a look at it. I wouldn't have thought it that complex/long (100s of
>kLOC, that is). I would have guessed that most of the complexity was
>re-iterative in nature, rather than explicit in the code.

Just as background, BLAST is built from components of the NCBI toolkit,
which is quite large.  They define a lot of their own data structures,
so to thoroughly understand the algorithm you have to understand the
toolkit.  The actual BLAST code is much smaller, though I don't have
a line count handy.  We did have a problem with BLAST on a DEC Alpha
box and I tried to follow the code.  It proved to be too non-standard
for me to understand and I ended up just checking for the edge condition
which triggered the problem and special casing it.  (By non-standard
I mean the toolkit contains its own development model and it takes time
to understand that model because it isn't used by anyone else.)

>In the review situations I've been involved in, each paper was subject
>to probably 10 person-hours of effort, split across reading (and
>re-reading) the manuscript under review, tracking down and reading
>relevant existing lit, thinking about the results and claims in the
>paper, and actually writing the review. (These were molecular biology
>papers, by the way.) I don't think reviewing source code would bloat
>that time factor too much, as you're not going to be reviewing results
>as much in a bioinformatics paper.

Here's what "Code Complete" by Steve McConnell says on code reading:
  o  Code reading detects about 3.3 defects per hour
  o  Listings range from 1000 to 10,000 lines, with 4,000 being typical
  o  Two or more people read the code
  o  Reviewers read the code independently.  Estimate a rate of about
     1,000 lines a day

So if you assume a 4,000 line program then the review takes each
reviewer about 4 days, or around 24 hours at roughly 6 working hours
a day, which is over twice the 10 person-hours you estimated for a
paper review.
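
The estimate above can be written out as a quick calculation. This is
only a rough sketch: the 6 working hours per day is my assumption (it is
what makes 4 days come out to about 24 hours), and the function and
parameter names are made up for illustration.

```python
def review_hours(lines, lines_per_day=1000, hours_per_day=6, reviewers=1):
    """Estimate total person-hours needed to read `lines` lines of code.

    lines_per_day : reading rate per reviewer (the 1,000 lines/day figure)
    hours_per_day : assumed working hours per day (not from the book)
    reviewers     : number of people reading the code independently
    """
    days_per_reviewer = lines / lines_per_day
    return days_per_reviewer * hours_per_day * reviewers


if __name__ == "__main__":
    typical = review_hours(4000)                  # one reviewer, typical listing
    two_readers = review_hours(4000, reviewers=2)  # two independent readers
    print(f"one reviewer:  {typical:.0f} hours")
    print(f"two reviewers: {two_readers:.0f} hours")
```

With two people reading independently, as the list above suggests, the
total effort doubles, so the gap with a 10 person-hour paper review only
gets wider.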

                    dalke at acm.org
