Dear fellow netters,
Like many of you, I use BLAST at NCBI for searching sequence databanks.
Like many of you, I don't like using programs without understanding what
they do and how they do it.
Hence I've tried to read the original papers about BLAST and, in particular,
to understand how they compute the probability P(N) associated with a given
score. I must confess that I failed to fully understand it, either because
I'm just stupid or because it is not clearly written (or both). In any
case, I thought that P(N) was derived from the results of a very large
number of simulations. If this were true, then this probability should
be the same for the same hit, whatever the databank used.
A colleague of mine recently ran a BLAST search with a protein sequence
against the "non-redundant protein databank" and against Swissprot. In both
cases she got the same hit with the same score, but with different
probabilities.
With the non-redundant database P(N) was 0.84 and with Swissprot P(N) was
0.51. The segment pairs were exactly the same in both cases.
Could somebody help me understand?
Thank you,
--------------------------------------------------------------------
| Jean-Loup Risler | |
| CNRS | risler at frcgm51.bitnet |
| Centre de Genetique Moleculaire | risler at cgmvax.cgm.cnrs-gif.fr |
| 91198 Gif sur Yvette Cedex France | |
--------------------------------------------------------------------