Gaston H. Gonnet of Informatik, ETH (Zurich) writes,
"'significant' is related to the probability of an alignment
being derived from homology as opposed to being random. This
is measured directly by the score of the alignment when you
use Dayhoff matrices. So the highest the score the highest
I want to point out that "significance" is a subjective term. In order to
determine whether an alignment is significant you have to ask yourself
whether the number of matches is greater than what is expected for any
two random sequences; then you have to decide what level of matching you
consider to be significant. For many amino acid sequence comparisons the
significance is obvious but problems arise when we are dealing with alignments
that have large numbers of gaps and few identities. In order to make life
easier for biologists a number of computer algorithms have been exploited
in order to increase alignment "scores". Of course these programs increase
the scores of random sequences as well. The hope is that the scores of related
proteins on the verge of significance will be increased by a larger amount
thus moving these scores into the (subjective) "significance" category.
In order to inform readers about subjective criteria of "significance"
one should say what the random scores are and how they are calculated and
what cutoff point has been selected (and why). It is important that the random
sequences reflect the average amino acid composition of proteins.
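
As a rough illustration of reporting random scores alongside an observed
score, here is a minimal Python sketch. It makes several assumptions that are
mine and not part of the discussion above: the "score" is simply the number of
identical positions in an ungapped comparison, the random sequences are made
by shuffling the second sequence (which preserves that sequence's own
composition, not the average composition of all proteins), and the function
names identities, shuffled_scores and z_score are hypothetical.

```python
import random
import statistics


def identities(a, b):
    """Count identical positions in two ungapped sequences of equal length."""
    return sum(x == y for x, y in zip(a, b))


def shuffled_scores(a, b, trials=1000, seed=0):
    """Score sequence a against shuffled copies of b.

    Shuffling keeps the amino acid composition of b but destroys its order,
    so these scores estimate what unrelated sequences would give.
    """
    rng = random.Random(seed)
    residues = list(b)
    scores = []
    for _ in range(trials):
        rng.shuffle(residues)
        scores.append(identities(a, residues))
    return scores


def z_score(a, b, trials=1000, seed=0):
    """Return (observed, random mean, random sd, z) for the identity score."""
    observed = identities(a, b)
    scores = shuffled_scores(a, b, trials, seed)
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    if sd > 0:
        z = (observed - mean) / sd
    else:
        z = float("inf")
    return observed, mean, sd, z
```

Reporting the observed score, the mean and standard deviation of the random
scores, and the chosen cutoff (for example, a z value) lets the reader judge
the claim of "significance" for themselves. The cutoff itself remains a
subjective choice.
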
I don't think that it is correct to say that significance "...is measured
directly by the score of the alignments when you use the Dayhoff matrices".
A more correct statement would be: "We believe that use of the Dayhoff
matrix reflects some sort of biological reality which allows us to detect
homology which is not otherwise obvious; we believe that values above x
indicate that two sequences are homologous". I have seen several examples of
the misuse of such matrices where authors claim that two proteins are
homologous on the basis of questionable scores.
Gaston H. Gonnet also says,
"'distant' is related to how long ago or recently the two
sequences diverged. This is measured in PAM units as I
explained in a recent posting."
Evolutionary distance is actually measured in years or some other unit of
time. When comparing two sequences we can estimate the distance by examining
the degree of similarity. Conceptually, the easiest way to do this is a
direct comparison of aligned sequences. As soon as you start introducing
gaps into the alignment you have to make subjective decisions about the
value of these gaps. Whenever you start "comparing" non-identical amino acid
residues you have to make subjective decisions about the value of these
"matches". One such subjective decision is to use a Dayhoff matrix. The more
assumptions you make the greater the danger of error. We should try very hard
to remember that the output of computer programs (e.g. PAM units) is only as
good as the subjective assumptions that were made in writing the program.
(I am assuming that the program was correctly written.)
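
To make the point about gaps concrete, here is a small Python sketch of the
"direct comparison" approach. The assumptions are mine: the two sequences are
already aligned to equal length, "-" marks a gap, and the function names
p_distance and poisson_distance are hypothetical. The Poisson correction,
d = -ln(1 - p), is one common way to turn the fraction of differing sites into
an estimate of substitutions per site; it is itself a modelling assumption.

```python
import math


def p_distance(a, b, gap="-", count_gaps=False):
    """Fraction of compared sites that differ in two aligned sequences.

    count_gaps=False skips any column containing a gap.
    count_gaps=True treats a gap opposite a residue as a difference.
    Columns where both sequences have a gap are always skipped.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    diffs = 0
    for x, y in zip(a, b):
        if x == gap and y == gap:
            continue
        if x == gap or y == gap:
            if count_gaps:
                compared += 1
                diffs += 1
            continue
        compared += 1
        if x != y:
            diffs += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


def poisson_distance(p):
    """Poisson-corrected distance, d = -ln(1 - p), for 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return -math.log(1.0 - p)
```

With a = "ACDEFG-HIK" and b = "ACDQFGWHIK", ignoring the gapped column gives
one difference in nine sites, while counting the gap as a difference gives two
in ten. The same alignment yields two different distances depending on a
single decision about gaps.
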
What I would really like to see is some serious discussion about the
usefulness of gap penalties and mutation matrices. How confident can we be
that marginally significant scores actually reflect evolutionary relatedness?
Has anyone looked closely at the relationship between alignment programs
and mutation matrices? My own experience indicates that no alignment programs
are capable of aligning multiple sequences as well as an intelligent human.
Those programs that simply align pairs of sequences often produce results that
are very different from a serious multiple sequence alignment. I assume that
when constructing a Dayhoff matrix only identical amino acids are counted
in the initial alignment but that gaps are permitted. Is this correct?
Laurence A. Moran (Larry)
Dept. of Biochemistry