open-source software for bioinformatics (was Re: Unix vs Linux - the movie.)

alrichards at my-deja.com alrichards at my-deja.com
Fri Jul 28 05:18:03 EST 2000

In article <871z0ekhyo.fsf at genehack.org>,
  "John S. J. Anderson" <jacobs+usenet at genehack.org> wrote:
> Kevin> It turns out that most users don't care about the source code,
> Kevin> so just distributing executables and documentation satisfies
> Kevin> over 90% of the users (how many of you have had to get source
> Kevin> code for BLAST?).  If you want to retain control over your
> Kevin> program (an idea that is anathema to some in the open-source
> Kevin> community), then a two-tier license agreement is often the best
> Kevin> strategy---most users get a simple license with permission to
> Kevin> use the executables, and those who really have a need for the
> Kevin> source code get a more detailed license.
> Agreed, most users don't care. Heck, most of them _don't_ want the
> code; it's just useless text files taking up disk. But, to return to
> the context this thread started in, the concept of peer review seems
> to me to require that you expose your code _at_least_ to a few outside
> reviewers. Not just the output of running the program, but the actual
> code -- because otherwise, the correctness of your results can not be
> verified.

Although I'm no longer involved with scientific software development -
the last program I wrote was in Fortran IV on an old VAX - I would like
to propose a completely contrary opinion. Here it is: programs should
not be made available at all - or, failing that, should not be released
until at least 2 years after the associated paper is published. Why? Scientists
are naturally lazy - like most human beings - and the only way to
be really sure about the quality of scientific work based on
computational methods is to _reimplement_ the algorithm. Finding
bugs in code is one thing - the program crashes when you input a
negative number, for example. Any half-competent programmer could track
down a bug like that. However, finding a subtle mistake in a complex
set of statistical codes, for example, is beyond the ability of anyone
who is not a) an expert in the scientific field and b) an expert in
programming. This combination is so rare that it cannot be relied
upon to keep checks on the quality of bioinformatics methods.

Of course having the source code is a convenience for tracking down
annoying "user interface" bugs and the like - but claiming that having
the source code is the best way to ensure scientific accuracy is, I
believe, not valid. The analogy to experimental papers is a useful one.
If I publish a paper describing a wet-lab experiment then I try to
describe the steps in as much detail as I think is necessary for
someone with suitable skills and a standard lab to replicate the whole
experiment. The onus is on people who make use of my work to try to
replicate the results and the onus is then on me to help explain what's
wrong when people find they cannot replicate the experiment. Note that
I don't expect these people to turn up on my doorstep wearing a lab
coat ready to use my lab and my reagents. The point of replication
is that someone should easily be able to replicate the experiment
elsewhere. Who knows, the reason my results were so good might be that
my buffer solutions are contaminated with silver salts? Or my lab
is close to electric power lines. The only way to discover that fact
would be to replicate the same experiment with similar but non-identical
reagents and apparatus.

Now let's take this analogy to scientific codes. Somebody describes
a new algorithm in a paper which produces excellent results. Assuming
this method is complex (e.g. the BLAST program or molecular mechanics
software) then the chances are that the paper omits some key facts that
turn out to be critical to the success of the program. How is this
going to be discovered? Not by releasing the source code, that's for
sure. It's all very well saying that you are _able_ to look at the
source code - but do you? Assuming the program as supplied produces the
expected results and does not crash, why are you going to bother poring
over several hundred thousand lines of code to compare the code with
what is described in the paper? How many times has BLAST been
reimplemented to validate the method? Answer: probably never. How
many people have picked apart the BLAST code and compared it line by
line with the algorithm described in the paper? I bet the answer to
this is close to, if not equal to, zero as well. Why would you need to?
The code is public domain and the software seems to work properly. It's
the computer equivalent of buying a molecular biology "kit".

So how could BLAST be properly validated? The authors should not release
the code - or should at least keep it secret for a minimum of
2 years. What would happen then? People would try to replicate the
method with new software to check they get the same results. This is
how science is done in other areas. Of course, if the authors of a
bioinformatics paper do not provide enough information to allow the
algorithm to be reimplemented, then that's another problem entirely -
and again the only way to properly identify that problem is for people
to try to replicate the software.

** Alan **


More information about the Bio-soft mailing list