A question has arisen on the net concerning why the scores +5 for a match and
-4 for a mismatch are used as a default by BLASTN, and what the consequences of
changing the default match score from +5 would be.
For a general discussion of the meaning of scoring matrices, see the paper
"Amino acid substitution matrices from an information theoretic perspective,"
J. Mol. Biol. 219:555-565 (1991). The theory therein presented can be applied
as well to nucleic acid substitution matrices, as will be discussed in the
forthcoming paper, "Improved sensitivity of nucleic acid database searches
using application specific scoring matrices," by D.J. States, W. Gish & S.F.
Altschul. Briefly, assuming a simple model of DNA evolution in which all
nucleotides and all substitutions are equally likely, scores for any given PAM
distance can be calculated just as for amino acids. (One PAM corresponds to a single
substitution per 100 nucleotides.) The scores for a given PAM distance D are
optimized for sequences diverged by that many PAMs, but are quite efficient for
a range of actual PAM distances near D. Furthermore, the average amount of
information available per position in an alignment of two sequences at distance
D is readily calculated. The total information (in bits) needed to distinguish
an alignment from chance is approximately the log base 2 of the product of the
lengths of the sequences being compared. (In a database search, one of these
sequences is the complete database.)
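As a rough illustration of this rule of thumb, the following Python sketch computes the number of bits required and the alignment length that would supply them. The function names are hypothetical, and the half bit per position used in the example is simply the figure quoted later in this note for the default scores.

```python
import math

def bits_needed(query_len, db_len):
    """Approximate bits required to distinguish an alignment from chance:
    log base 2 of the product of the two sequence lengths."""
    return math.log2(query_len * db_len)

def alignment_length(bits, bits_per_position):
    """Approximate number of aligned positions needed to accumulate `bits`
    when each position contributes `bits_per_position` on average."""
    return bits / bits_per_position

# A query of 1000 nucleotides against a database of 64,000,000 nucleotides.
required = bits_needed(1000, 64_000_000)
print(f"bits needed: {required:.1f}")
print(f"alignment length at 0.5 bits/position: {alignment_length(required, 0.5):.0f}")
```

For these inputs the sketch gives about 36 bits and an alignment length of about 72 positions.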
Given the model described above, there are only two distinct scores (match
and mismatch), and multiplying by a constant changes only the number of bits
represented by a unit score. Fixing the score for a mismatch at -4 allows a
range of PAM matrices to be selected by varying M, the score for a match, as
summarized in the following table.
     PAM       Percent    Bits per     Average information   90% efficiency
 M   distance  conserved  unit score   per position (bits)   range (PAMs)
 1      0.3      99.7       1.992             1.97               0 -   5
 2      5.3      94.9       0.968             1.63               0 -  17
 3     16.0      85.6       0.595             1.18               1 -  33
 4     30.2      75.0       0.396             0.79               8 -  49
 5     47.0      65.1       0.275             0.51              21 -  68
 6     65.0      56.5       0.196             0.32              36 -  86
 7     86.0      48.8       0.138             0.19              56 - 108
 8    109.0      42.5       0.096             0.11              79 - 131
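The first four numeric columns of the table can be approximated with a short Python sketch. It rests on assumptions not spelled out in this note: scores are log-odds scores in bits against uniform background frequencies of 1/4, and a distance of D PAMs is modelled as D applications of a one-PAM step that changes 1% of positions, split equally among the three possible substitutions. The name `table_row` is hypothetical. For each M it finds, by bisection, the fraction conserved q at which the match and mismatch log-odds scores stand in the ratio M : -4.

```python
import math

MISMATCH = -4                 # mismatch score held fixed, as in the table
PAM1_DECAY = 1 - 4 / 300      # assumed: 1% change per PAM, spread over 3 substitutions

def table_row(match_score):
    """Return (PAM distance, percent conserved, bits per unit score,
    bits per position) for scores +match_score / MISMATCH."""
    if not 0 < match_score < -3 * MISMATCH:
        raise ValueError("expected score must be negative: need 0 < M < 12")

    def imbalance(q):
        # Zero when the log-odds scores at conservation q are in the
        # ratio match_score : MISMATCH.
        match_bits = math.log2(4 * q)
        mismatch_bits = math.log2(4 * (1 - q) / 3)
        return -MISMATCH * match_bits + match_score * mismatch_bits

    lo = -MISMATCH / (match_score - MISMATCH)   # imbalance peaks here (> 0)
    hi = 1 - 1e-12                              # imbalance is very negative here
    for _ in range(100):                        # plain bisection
        mid = (lo + hi) / 2
        if imbalance(mid) > 0:
            lo = mid
        else:
            hi = mid
    q = (lo + hi) / 2

    match_bits = math.log2(4 * q)
    mismatch_bits = math.log2(4 * (1 - q) / 3)
    pam = math.log((q - 0.25) / 0.75) / math.log(PAM1_DECAY)
    info = q * match_bits + (1 - q) * mismatch_bits
    return pam, 100 * q, match_bits / match_score, info

for m in range(1, 9):
    pam, pct, bits, info = table_row(m)
    print(f"{m}  {pam:6.1f}  {pct:5.1f}  {bits:.3f}  {info:.2f}")
```

For M = 4 the sketch gives 75% conservation, about 30.2 PAMs, 0.396 bits per unit score and 0.79 bits per position, matching the corresponding row. The other rows come out close to, though not always digit-for-digit identical with, the tabulated values.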
It will be seen that M = +5 (the BLASTN default) corresponds to a PAM distance
of 47 PAMs, or sequences that are about 65% conserved when back mutations are
considered. At this distance, about half a bit of information is available per
position in an alignment of homologous sequences. In a search of a database
containing 64,000,000 nucleotides using a query sequence of length 1000, about
36 bits of information will be needed to achieve significance, corresponding to
an alignment length of about 72. PAM-47 scores are at least 90% efficient in
detecting the similarity of sequences diverged by anywhere from 21 to 68 PAMs
(82% to 55% sequence conservation), which seems like the most typical range of
similarity sought. However, by varying M as shown above, one may select other
effective PAM matrices, which are efficient for other ranges of sequence
conservation. Running BLASTN with M equal to 3, 5 and 7 (PAM matrices 16, 47
and 86), one achieves at least 90% efficiency over the whole range of PAM
distances 1 to 108.
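The efficiency figures can be explored with a similar sketch. The definition of efficiency used here is an assumption: the expected score per position, in bits, when scores tuned to one PAM distance are applied to sequences at another, divided by the information available per position at the actual distance. The evolutionary model (a 1% change per PAM, split equally among the three substitutions) and all function names are likewise illustrative rather than taken from BLASTN.

```python
import math

PAM1_DECAY = 1 - 4 / 300   # assumed: 1% change per PAM, spread over 3 substitutions

def conserved(pam):
    """Fraction of positions still identical after `pam` PAMs."""
    return 0.25 + 0.75 * PAM1_DECAY ** pam

def scores_in_bits(pam):
    """Log-odds (match, mismatch) scores, in bits, tuned to distance `pam` > 0."""
    q = conserved(pam)
    return math.log2(4 * q), math.log2(4 * (1 - q) / 3)

def bits_per_position(design_pam, actual_pam):
    """Expected score per aligned position, in bits, when scores tuned to
    `design_pam` are applied to sequences actually diverged by `actual_pam`."""
    match, mismatch = scores_in_bits(design_pam)
    q = conserved(actual_pam)
    return q * match + (1 - q) * mismatch

def efficiency(design_pam, actual_pam):
    """Assumed definition: information recovered relative to the best
    achievable at `actual_pam` (both distances must be positive)."""
    return (bits_per_position(design_pam, actual_pam)
            / bits_per_position(actual_pam, actual_pam))

# How well do PAM-47 scores (M = +5) serve at other distances?
for d in (10, 30, 47, 60, 90):
    print(d, round(efficiency(47, d), 2))
```

With this definition the 90% band for the default scores comes out close to the tabulated 21 to 68 PAMs. Taking the largest efficiency over the design distances 16, 47 and 86 at each actual distance gives a way to examine the combined coverage of the three suggested runs.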
Stephen Altschul, Warren Gish & David States
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
Bethesda, MD 20894