Not to belabor the point, but I have a few comments on Gaston Gonnet's
response to comments on his recent Science paper (lines marked > are Gonnet's).
>I should note that I received a copy of the article, as it appeared,
>only 3 days ago. Checking the above remark, I contemplated with
>horror that Fig 2 has been mislabelled in the following way: What
>is presented in Fig 2 is a Dayhoff matrix for PAM 250, it is the
>best approximation that we could compute at this time. It says
>however, "The recommended mutation matrix.." It should say "The
>recommended Dayhoff matrix...".
>
Personally, I believe this nomenclature to be extremely unfortunate. The
term "Dayhoff matrix" is nearly universally used to mean the MDM78
(mutation data scoring matrix; log-odds matrix for 250 PAMs). To call the
Gonnet et al. matrix a Dayhoff matrix is to imply that it was derived by
Dayhoff's methodology. A less connotation-loaded term would be log-odds
matrix. The term PAM-250 matrix has also been used virtually as a synonym
for the MDM78 matrix.
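
To make the distinction concrete, here is a minimal Python sketch of how a
log-odds matrix relates to a mutation (probability) matrix. Everything in it
is an illustrative assumption rather than data from Dayhoff et al. or Gonnet
et al.: the three-letter alphabet, the frequencies FREQ, the exchange rates
RATE, and all helper names are hypothetical. It assumes the usual
Dayhoff-style convention that M[i][j] is the probability that residue j is
replaced by residue i, and that the score is 10*log10(M[i][j] / f[i]).

```python
import math

LETTERS = "ABC"                      # toy alphabet (assumption)
FREQ = [0.5, 0.3, 0.2]               # toy background frequencies (assumption)
RATE = [[0.0, 1.0, 0.1],             # symmetric toy exchange rates (assumption)
        [1.0, 0.0, 0.1],
        [0.1, 0.1, 0.0]]


def one_step_matrix(freq, rate, change=0.01):
    """Column-stochastic matrix with m[i][j] = P(j is replaced by i),
    scaled so that an expected fraction `change` of residues differ per step."""
    n = len(freq)
    total = sum(freq[j] * freq[i] * rate[i][j]
                for i in range(n) for j in range(n) if i != j)
    c = change / total
    m = [[freq[i] * rate[i][j] * c if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    for j in range(n):
        # Diagonal entry makes each column sum to 1.
        m[j][j] = 1.0 - sum(m[i][j] for i in range(n) if i != j)
    return m


def matmul(x, y):
    """Plain square-matrix product."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mutation_matrix(steps):
    """Mutation matrix after `steps` applications of the one-step matrix."""
    m1 = one_step_matrix(FREQ, RATE)
    result = m1
    for _ in range(steps - 1):
        result = matmul(result, m1)
    return result


def log_odds(m, freq):
    """Log-odds scores: 10 * log10(observed probability / chance frequency)."""
    n = len(freq)
    return [[10.0 * math.log10(m[i][j] / freq[i]) for j in range(n)]
            for i in range(n)]


M250 = mutation_matrix(250)          # probabilities, all between 0 and 1
S250 = log_odds(M250, FREQ)          # scores, positive and negative
```

In this toy setup M250 contains only probabilities, while S250 has negative
entries for the rarely exchanged pairs and positive entries on the diagonal.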
>
>Now, some people have immediately recognized this as a Dayhoff matrix,
>which is good. A mutation matrix has all positive entries, is diagonally
>dominant and has no entry greater than 1. So Fig 2 is not a mutation
>matrix but a Dayhoff matrix. This matrix is the one we recommend to
>be used, together with our new deletion-penalty formula, for the N&W
>algorithm.
It seems unlikely that such a matrix has been used with the
Needleman-Wunsch algorithm. As you may recall, the NW algorithm (Needleman
and Wunsch, J. Mol. Biol. 48, 443-453, 1970) does not use an affine gap
cost, although they do suggest that the "penalty factor could be a function
of the size and/or direction of the gap". NW requires a scoring table with
all positive values since only the last row and column of the alignment
matrix are examined for the maximum score. A scoring table with negative
values is not guaranteed to give an optimum alignment with the NW
algorithm. Note that Needleman and Wunsch refer to the position containing
the maximum score as being in the first row or column, since they build the
alignment from N to C terminus; however, they mean the last row or column
calculated during the alignment. Since many sequences in the database have
locally similar segments embedded in unrelated sequences (i.e. cases of
partial homology or gene fusion), one wonders what kind of alignments would
result.
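
The point about negative scores can be illustrated with a small Python
sketch. It is a simplified NW-style recurrence, not the exact 1970
formulation: the constant penalty per gap, the function names, the two
scoring functions, and the toy sequences are all assumptions made for
illustration. Each cell holds the score of the best path ending at that
pair of residues; the sketch then compares the maximum found in the last
row or column with the maximum found anywhere in the matrix.

```python
def nw_style_matrix(a, b, score, gap=1):
    """Fill an NW-style matrix: m[i][j] is the best score of a path that
    ends by pairing a[i] with b[j]. Paths may start at any cell of the
    first row or column; each gap (of any length) costs `gap`."""
    n, m_len = len(a), len(b)
    m = [[0] * m_len for _ in range(n)]
    for i in range(n):
        for j in range(m_len):
            if i == 0 or j == 0:
                best_prev = 0                      # path starts here
            else:
                candidates = [m[i - 1][j - 1]]     # no gap
                # Gap in one sequence: come from earlier in the previous row.
                candidates += [m[i - 1][k] - gap for k in range(j - 1)]
                # Gap in the other: come from earlier in the previous column.
                candidates += [m[k][j - 1] - gap for k in range(i - 1)]
                best_prev = max(candidates)
            m[i][j] = score(a[i], b[j]) + best_prev
    return m


def edge_max(m):
    """Maximum over the last row and last column only (what NW examines)."""
    return max(max(m[-1]), max(row[-1] for row in m))


def overall_max(m):
    """Maximum over every cell of the matrix."""
    return max(max(row) for row in m)


def nonnegative_score(x, y):
    """Toy scoring table with no negative values (assumption)."""
    return 1 if x == y else 0


def signed_score(x, y):
    """Toy scoring table with negative values for mismatches (assumption)."""
    return 1 if x == y else -1


# A shared segment (WWWWW) embedded in otherwise unrelated sequences.
SEQ_A = "AAAAWWWWWCCCC"
SEQ_B = "GGGGWWWWWTTTT"
```

With nonnegative_score, edge_max and overall_max agree, because extending a
path along the diagonal never lowers its score. With signed_score, the best
cell lies in the interior, at the end of the shared segment, and the maximum
on the last row or column is lower, so examining only the edge misses it.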
I think there is another point that is being overlooked. Dayhoff et al. did
not use closely related sequences to calculate the MDM78 matrix simply
because they were unable to align distantly related sequences. A primary
reason was
to be sure that they were comparing sequences that differed only minimally
in function. Sequences that are no more than 15% different in sequence are
much less likely to adopt grossly different three-dimensional structures
than those that are, for example, 40% different. In the case of less
similar sequences you measure not only the probability that a given
single residue mutation can be accepted at a certain position, but also the
overlaying probability that the conformation of the whole segment has
changed and that only some combinations of segment sequences can fold into
active structures. For distantly related sequences you are in the position
of comparing apples and oranges: the two positions you are comparing are
likely to have different structures and functions (in the micro-structural
sense, not necessarily the enzymatic sense). It is not surprising that 250
PAM log-odds matrices extrapolated from pairs at different evolutionary
distances differ; one should be surprised if they did not. Another way of
looking at this is that as you examine greater and greater evolutionary
distances, you see more adaptive differences and fewer random (neutral or
nearly neutral) differences.
Since I was not around when the original MDM78 work was done, I don't know
how important these various considerations were in formulating the
analysis. Perhaps someone at PIR could mention some of the unpublished
background sometime -- I think it would be very interesting to know more
about the context in which this work was done. It seems to me that, for all
of its perceived faults, the analysis that produced the MDM78 matrix was
very perceptive and years ahead of its time with respect to the interaction
between sequence and structure.
----------------------------------------------------------------------------
Michael Gribskov
San Diego Supercomputer Center
gribskov at sdsc.edu
(619) 534 - 8312