In an earlier article I pointed out that "significance" is a subjective term
and Gaston Gonnet responded,
"yes, I agree, but with "subjective terms" we cannot do science.
The least controversial definition of "significance" is one which
relates the probability of an homology against the (null hypothesis)
probability of a random coincidence. As the model of homology gets
more precise, or you start including information of other nature
(e.g. 3-d structure) then the probabilities may be computed
differently. But the definition remains the same."
I think that you are missing the point, Gaston. Simply relating the two
probabilities is not sufficient. In order to claim significance you also have
to choose a cutoff point, and that choice is subjective.
Furthermore you make a subjective decision when you decide how to measure
similarity in the first place. I don't necessarily agree that your decisions
and assumptions concerning these measurements are valid. This is what science
is all about.
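
To make the cutoff argument concrete, here is a minimal sketch in Python. It
is only an illustration, not Gaston's method or anyone's published test. The
two sequences are made up, the similarity measure (percent-identity style
counting over a fixed alignment) and the shuffling null model are assumptions
chosen for brevity, and the names identity, shuffle_p_value and
is_significant are hypothetical.

```python
import random

def identity(a, b):
    """Count positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b))

def shuffle_p_value(a, b, trials=2000, seed=0):
    """Estimate how often a shuffled b matches a at least as well as the real b."""
    rng = random.Random(seed)
    observed = identity(a, b)
    residues = list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(residues)
        if identity(a, residues) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)

def is_significant(p, cutoff):
    """The verdict depends on the cutoff, which the investigator chooses."""
    return p <= cutoff

# Hypothetical, pre-aligned sequences used only for illustration.
seq_a = "MKVLAAGIVGLLLAQ"
seq_b = "MKVIAAGLVALLMSQ"
p = shuffle_p_value(seq_a, seq_b)
for cutoff in (0.05, 0.01, 0.001):
    print(f"p = {p:.4f}, cutoff = {cutoff}: significant = {is_significant(p, cutoff)}")
```

The probability is computed once, yet the "significant" label can flip as the
cutoff moves. Changing the scoring function or the null model would change
the probability itself, which is the second subjective choice.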
By the way, it is not necessarily true that similar 3-D structures indicate
homology.
When I said,
"Evolutionary distance is actually measured in years or some
other unit of time. When comparing two sequences we can estimate
the distance by examining the degree of similarity."
Gaston Gonnet replied,
"beg to disagree. Evolutionary distance, as shown by Dayhoff and
many other people, is best measured in PAM units or any units of
mutation. The reason is simple, when given just the sequences, we
can estimate directly their ED, but we cannot estimate their
time-distance without considering at least 3 of the biases which
affect the relation between amount of evolution and time. These are:
(a) species reproduce at very different rates
(b) crucial proteins mutate much more slowly than less important
proteins (due to a strong natural selection)
(c) changes in the environment "force" some rapid mutations.
So it would be nice to measure time, but we can at best measure
amount of evolution (amount of change)."
I suspect that you actually agree with my statement. Would you be happy to
rephrase your response to say that "Evolutionary distance ... is best
ESTIMATED in PAM units ..."? Species diverge over time, not over PAM units!
Our calculations may or may not give a valid ESTIMATE of the time of
divergence, but we should not lose sight of the fact that they ARE estimates
that rest on many unproven assumptions.
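
A small sketch of why a PAM value is an estimate of time rather than time
itself. It assumes a constant rate of change along both lineages (the very
assumption under dispute), and the rates below are hypothetical numbers, not
measured values for any protein family. The function name
divergence_time_myr is also made up for this example.

```python
def divergence_time_myr(pam_distance, rate_pam_per_100myr):
    """Convert a PAM distance into millions of years since divergence.

    Assumes a constant rate; change accumulates along both lineages,
    hence the factor of 2.
    """
    return 100.0 * pam_distance / (2.0 * rate_pam_per_100myr)

# The same observed distance under three hypothetical rates.
pam = 50
for rate in (1.0, 5.0, 10.0):
    t = divergence_time_myr(pam, rate)
    print(f"{pam} PAM at {rate} PAM/100 Myr -> {t:.0f} Myr")
```

One measured distance maps onto very different divergence times depending on
the rate one is willing to assume.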
Allow me to make a comment about your three biases.
a) It is true that modern species reproduce at different rates
but whether or not this has much effect on sequence similarity
is still open to debate.
b) Yes, this is true. I work with the most highly conserved proteins
known in biology and they change at a snail's pace compared
to others such as the globins and cytochromes.
c) Changes in the environment cannot "force" mutations. What do you
mean by this?
I stated that the best way to detect similarity was to compare aligned
sequences directly and I pointed out that introducing gaps forces one
to select a (subjective) value for these gaps. Similarly a comparison
of non-identical residues requires a subjective decision concerning the
value of such comparisons.
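
The effect of the gap value can be sketched with a bare-bones global
alignment score (Needleman-Wunsch with a linear gap cost). This is a generic
textbook recurrence, not Gaston's model; the match, mismatch and gap values
and the two short sequences are arbitrary choices made for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score with a linear gap cost."""
    # prev[j] holds the best score for a[:i-1] aligned against b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair,   # align a[i-1] with b[j-1]
                            prev[j] + gap,        # gap in b
                            curr[j - 1] + gap))   # gap in a
        prev = curr
    return prev[-1]

# Hypothetical sequences: b is a shifted by one position.
a, b = "ACGTT", "CGTTA"
print(global_alignment_score(a, b, gap=-1))  # cheap gaps: shifting pays off
print(global_alignment_score(a, b, gap=-5))  # costly gaps: ungapped wins
```

With a cheap gap the best alignment slides one sequence past the other and
recovers four matches; with an expensive gap the best alignment is the
ungapped one with a single match. Neither the data nor the algorithm changed,
only the value assigned to a gap.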
Gaston responded,
"subjective decisions about the values of gaps is what has been
done until recently. We have now given a model under which parameters
can be computed from the available samples. I am afraid that you
tend to imply that alignment is "black magic" or "art". I disagree
strongly with this view. We should establish models, compute the
parameters for these models, verify/reject the models against reality
and move into better models when the old ones become unsuitable to
describe reality. This is the way that science makes progress, not
with "subjective measures". There are hundreds of examples of this
methodology in science."
With all due respect, I do not consider your "model" to be entirely objective.
I still believe that estimating the value of gaps is a difficult problem
that ultimately boils down to a "guesstimate".
And yes, I am implying that alignment is an "art". In fact I will go so far
as to say that I can do a multiple alignment better than any computer program!
I can certainly do a better job than many authors who publish alignments in
Nature or Cell or many other journals. This does not mean that we shouldn't
keep trying to write algorithms that will do the job perfectly; it simply
means that we have a long way to go. I tend to agree with Swofford and
Olsen, who write,
"Alignment is probably the most difficult and least understood
component of a phylogenetic analysis from sequence data....
we offer the following advice: When regions of the sequence
are so divergent that a reasonable alignment cannot be obtained
by manual methods using a sequence editor ("by eye"), those
regions should probably be eliminated from the analysis."
D.L. Swofford and G.J. Olsen, "Phylogeny Reconstruction," in
MOLECULAR SYSTEMATICS, D.M. Hillis and C. Moritz, eds., Sinauer
(That ought to stimulate Swofford to enter this debate! (-: )
Gaston, your comments about how science works seem to miss the point that
we progress by making hypotheses that are often no better than intelligent
guesses. Often they are wrong, sometimes spectacularly. I also have trouble
with your suggestion that we compare computed measures of similarity against
"reality". What is "reality" in this context? Can you give an example?
I said,
"I assume that when constructing a Dayhoff matrix only identical
amino acids are counted in the initial alignment but that gaps
are permitted. Is this correct?"
Gaston replied,
"no, you are mistaken, please read Dayhoff's original paper, the
procedure is much more sophisticated. If you would understand their
ideas, you would be much more confident in using their tools."
How interesting. When you construct a new "Dayhoff" matrix do you use the
old one to improve the alignments that form the database? If not, then what
"sophisticated" assumptions do you make that justify comparing non-identical
residues in the original alignments? Do you think that these assumptions
might affect the final matrix?
By the way, I have used the Dayhoff matrix in some of my distance
calculations. I find that it does not change the shape of the tree but it
does alter some of the distances. Since most of the variation that I see
is in regions of the protein that are not constrained, use of the matrix
is not likely to be very helpful. I also find that the original Dayhoff
matrix does not agree with the variation that I see in my alignments.
Furthermore, my impression is that the presence of sequence mistakes in my
database is a far more serious source of error than whether or not I use
a particular Dayhoff matrix. Others may find it more useful.
Laurence A. Moran (Larry)
Dept. of Biochemistry