Paul,
You are almost there. Yes, you are correct in everything you have
said. Unfortunately, we have an additional problem if we want to
estimate the number of substitutions that have occurred since two
sequences last shared a common ancestor. The problem lies in
substitutions that have occurred at the same site more than once:
superimposed substitutions, or "multiple hits".
We try to reconstruct the number of substitutions that probably occurred
by transforming the "observed" number of differences using some kind of
logarithmic transformation (although there are other methods).
What does this mean? Well, if we observe a 5% difference, then the log
transformation might say that the _real_ number of substitutions that
probably occurred was, say, 5.5% (numbers off the top of my head). If
we observe a difference of 20%, then the transformation might predict
that the _real_ number of substitutions was 30%. If the observed
difference was 75%, then the transformed value might be 250% (or an
average of 2.5 'hits' per site).
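
To make this concrete, here is a minimal Python sketch of one such log
transformation, the Jukes-Cantor correction, d = -(3/4) ln(1 - (4/3) p),
where p is the observed proportion of differing sites. The choice of
model and the function names (p_distance, jc69_distance) are assumptions
for illustration only; other models give different numbers, and the
sketch ignores gaps and ambiguous bases. Under this particular model the
corrected values will not match the illustrative figures above exactly,
and the correction is undefined once the observed difference reaches 75%.

```python
import math


def p_distance(seq_a, seq_b):
    """Observed proportion of differing sites between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def jc69_distance(p):
    """Jukes-Cantor estimate of substitutions per site from observed p."""
    # The logarithm is only defined while 1 - (4/3)p > 0, i.e. p < 0.75.
    if not 0.0 <= p < 0.75:
        raise ValueError("JC69 is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


if __name__ == "__main__":
    for p in (0.05, 0.20, 0.50, 0.70):
        print(f"observed {p:.2f} -> corrected {jc69_distance(p):.3f}")
```

With this model an observed 5% becomes roughly 5.2% and an observed 20%
becomes roughly 23%, so the correction is small for close relatives and
grows quickly as the observed difference approaches 75%.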
What have we seen? Well, the transformation generally applies a larger
and larger correction as the sequences diverge. When the sequences are
reasonably closely related, the chance of superimposed substitutions is
quite small, but as the sequences diverge, the chance of multiple
hits increases dramatically.
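
If you want to see multiple hits in action, a small simulation helps.
The sketch below is only an illustration under simplifying assumptions:
it follows a single lineage from a random ancestor, every site is
equally likely to be hit, and each hit changes the base to one of the
other three with equal probability. The function name simulate_hits and
its parameters are hypothetical, made up for this example.

```python
import random

BASES = "ACGT"


def simulate_hits(n_sites, n_hits, seed=0):
    """Apply n_hits random substitutions and return the observed
    proportion of sites that differ from the ancestor."""
    rng = random.Random(seed)
    ancestor = [rng.choice(BASES) for _ in range(n_sites)]
    descendant = list(ancestor)
    for _ in range(n_hits):
        site = rng.randrange(n_sites)
        current = descendant[site]
        # A substitution always changes the base, possibly back to the
        # ancestral state, which hides earlier hits at that site.
        descendant[site] = rng.choice([b for b in BASES if b != current])
    observed = sum(1 for a, d in zip(ancestor, descendant) if a != d)
    return observed / n_sites


if __name__ == "__main__":
    n_sites = 10000
    for hits_per_site in (0.05, 0.3, 1.0, 2.5):
        p = simulate_hits(n_sites, int(hits_per_site * n_sites))
        print(f"actual {hits_per_site:.2f} hits/site -> observed {p:.3f}")
```

The observed proportion can never exceed the actual number of hits per
site, and the gap between the two widens as more substitutions pile up
on sites that have already changed.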
Suggested reading:
Rod Page and Eddie Holmes' new book (oh, crap, I can't remember the
name). Published by Sinauer, I think.
Molecular Systematics, 2nd edition, edited by Hillis et al., published by
Sinauer Associates (particularly the phylogeny reconstruction chapter).
Hope this helps,
James
"Paul D. Roughan" wrote:
> Does anyone know what exactly sequence divergence estimates mean in
> terms of base pair mismatch? For instance, does a figure of 12%
> divergence between the same stretch of DNA in two bacterial strains mean
> that 12 percent of the bases in identical positions in the two strands,
> are different? By this criterion, two completely unrelated sequences
> should display a theoretical maximum of 75% divergence, if the sequence
> was long enough (with 4 possible bases, mismatches would occur in 3 out
> of 4 cases).
>
> Is this the method used to generate similarity estimates? Any assistance
> in this would be welcome.
>
> --
> Paul D. Roughan
--
Dr. James O. McInerney,
Dept. Biology,                      Dept. Zoology,
Natl. Univ. Ireland,                The Natural History Museum,
Maynooth,                    and    Cromwell Road,
Co. Kildare, Ireland                London SW7 5BD, UK.
Phone +353 1 708 3860               +44 171 938 9163
Fax   +353 1 708 3845               +44 171 938 9158
email james.o.mcinerney at may.ie   j.mcinerney at nhm.ac.uk