Bill Pearson wrp at cyclops.micr.Virginia.EDU
Tue Mar 2 12:53:51 EST 1993

In article <1993Mar2.120540.1382 at gserv1.dl.ac.uk> risler at cgmvax.cgm.cnrs-gif.fr writes:

> Hence I've tried to read the original papers about BLAST and, in particular,
> I've tried to understand how they compute the probability P(N) associated
> with a given score. ...  In any
> case, I thought that P(N) was computed from the figures obtained by a very
> large number of simulations. If this was true, then this probability should
> be the same for the same hit whatever the databank used.

	The P value is calculated analytically; it is not based on
simulations.  Equation [5] of Karlin and Altschul, PNAS (1990) 87:2264,
tells us that the probability is a function of the length of the query
sequence, the length of the database sequence, and a factor, lambda,
which is calculated from the scoring matrix and the probabilities of
the residues in the query sequence and in the library.

> A colleague of mine recently searched a protein sequence with BLAST against
> the "non-redundant protein databank" and against Swissprot. She got in both
> cases the same hit with the same score, but with different probabilities.
> With the non-redundant database P(N) was 0.84 and with Swissprot P(N) was
> 0.51. The segment pairs were exactly the same in both cases.

	In BLAST, the P value is also corrected for the length of the
database.  Thus, the same alignment from two different database
searches may have different P values if the databases differ in
length or amino acid composition.
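
	To make the database-length effect concrete, here is a minimal
sketch using the commonly quoted form of the Karlin-Altschul statistics,
E = K*m*n*exp(-lambda*S) and P = 1 - exp(-E).  It is not BLAST's own
code.  The function name, the values of lambda and K, the score, and the
sequence lengths are all illustrative assumptions, and n is taken here
to be the total database length, which is a simplification.  It handles
a single segment pair only, not the multiple-segment P(N) case.

```python
import math


def karlin_altschul_p(score, query_len, db_len, lam, k):
    """Return (E, P) for one segment-pair score.

    Hypothetical helper, not part of BLAST. Assumes the commonly
    quoted form E = K * m * n * exp(-lambda * S), P = 1 - exp(-E).
    """
    expected = k * query_len * db_len * math.exp(-lam * score)
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for small E.
    p_value = -math.expm1(-expected)
    return expected, p_value


# Illustrative placeholders only. Real lambda and K depend on the
# scoring matrix and the residue frequencies, as described above.
LAMBDA = 0.32
K = 0.13
SCORE = 60
QUERY_LEN = 250

# Same score and same query, two made-up database sizes (in residues).
for name, db_len in (("smaller database", 10_000_000),
                     ("larger database", 25_000_000)):
    e, p = karlin_altschul_p(SCORE, QUERY_LEN, db_len, LAMBDA, K)
    print(f"{name}: n = {db_len}, E = {e:.3g}, P = {p:.3g}")
```

With everything else held fixed, E scales linearly with the database
length, so the larger database always gives the larger P for the same
segment pair.  A difference in amino acid composition would enter
through lambda and K, which this sketch keeps constant.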
