Efficiency of score matrices

Thu Sep 12 13:10:19 EST 1991

>        What do you mean when you say the search is 90% efficient? Does this
> mean that up to 10% of the matches in this range could be missed, and that
> an even higher proportion of matches outside the range (for example, 100%
> identity) could be missed as well?

  Concerning the definition of "efficient" in the context of score matrices,
a PAM-47 matrix is 100% efficient for the alignment of sequences separated
by 47 PAMs of evolution, because it extracts the maximum average information
possible per alignment position.  Whether this will be sufficient to pull
a particular alignment from background noise will depend on the length of
the alignment and the size of the database searched.  The same PAM-47
(nucleotide substitution) scores are 90% efficient for an alignment of
segments separated by 21 PAMs of evolution, in that they yield only 90% of
the score (in bits) that would be achieved for such an alignment using PAM-21
scores.  In this context, "efficiency" does NOT refer to the percentage of
true homologies that will be found or missed by a given scoring method.  If
all the homologies to be found in a database are very strong, then even
inefficient scores will suffice; at the same time, homologies that contain
insufficient information to rise above background noise will be missed even
by 100% efficient scores.  Using efficient scores matters only for those
"twilight" similarities that lie at the border of what can be distinguished
from chance.  For more on this subject see the paper referenced in the original
posting (JMB 219:555-565).
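
  As a rough illustration of how such an efficiency figure can be computed,
here is a minimal Python sketch.  It assumes a uniform nucleotide substitution
model (all four bases equally frequent, every substitution equally likely,
1 PAM = 1% expected change per position); that model and all function names
are assumptions made for this example, not necessarily what the cited paper
uses.

```python
import math

BASE_FREQ = 0.25  # assumption: all four nucleotides equally frequent


def pam_probs(n_pams, rate=0.01):
    """Return (p_same, p_other) after n_pams under the assumed uniform model.

    p_same  = probability a base is unchanged
    p_other = probability it has become one specific different base
    """
    # One PAM step: stay with prob 1 - rate, move to each other base with
    # prob rate / 3.  The non-trivial eigenvalue of that 4x4 matrix is:
    lam = 1.0 - 4.0 * rate / 3.0
    decay = lam ** n_pams
    return 0.25 + 0.75 * decay, 0.25 - 0.25 * decay


def scores_bits(n_pams):
    """Log-odds (match, mismatch) scores in bits for a PAM-n matrix."""
    p_same, p_other = pam_probs(n_pams)
    return math.log2(p_same / BASE_FREQ), math.log2(p_other / BASE_FREQ)


def expected_score(true_pams, scoring_pams):
    """Average score per aligned position, in bits, when sequences that are
    really true_pams apart are scored with PAM-scoring_pams scores."""
    p_same, p_other = pam_probs(true_pams)
    match, mismatch = scores_bits(scoring_pams)
    # Each base has three possible mismatch partners.
    return p_same * match + 3.0 * p_other * mismatch


def efficiency(true_pams, scoring_pams):
    """Fraction of the maximum attainable average score actually obtained."""
    return expected_score(true_pams, scoring_pams) / expected_score(
        true_pams, true_pams
    )


if __name__ == "__main__":
    print(f"PAM-47 scores on 21-PAM alignments: {efficiency(21, 47):.3f}")
    print(f"PAM-47 scores on 47-PAM alignments: {efficiency(47, 47):.3f}")
```

  Under these assumptions efficiency(21, 47) comes out close to 0.90, and
efficiency(47, 47) is exactly 1 by construction, since the denominator is the
score obtained with the matched matrix.
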
  By the way, BLAST and other heuristic database search tools will miss
certain significant similarities for algorithmic reasons unrelated
to the efficiency of the scores used (JMB 215:403-410).  The details of
each particular algorithm need to be considered to understand why this is
so, and to assess how great a problem it presents.

