
Invariable sites question

Korbinian Strimmer strimmer at zi.biologie.uni-muenchen.de
Wed Nov 27 06:39:23 EST 1996


> 
> Not sure what it is that you say is not being considered.  DNAML takes
> as the base frequencies (default ones -- the user can put in their own
> values if they want, too) the average base frequencies over the sequences.
> Of course this weights different sequences as if they were independent,
> which they aren't.  Optimally one would instead estimate them by
> maximum likelihood.  I think PAUP* will be able to do that.  But the
> results will, I think, rarely be noticeably better that way.
> 

OK, I'll try to be more precise (sorry to those of you with a slight
aversion to maths ;-)

Let's focus exclusively on one site in a sequence alignment. This
site shows a certain pattern of nucleotides (amino acids).
For the moment let us assume that this site is variable. Then we
can compute the probability P of observing this pattern, given a tree
and a model of sequence evolution M.  M is usually a simple Markov model
with stationary frequencies Pi[x], where x is a specific nucleotide
(amino acid). If all sites in a sequence alignment are variable then
simply counting the frequencies of each nucleotide (amino acid) in the
data set gives a good (ML) estimate of Pi[x].  So far so good.
Let us now assume that the site examined is invariable.  Then
the probability K of seeing the pattern is

               |  0     if the site shows a non-constant pattern
           K = |
               |  K[x]  if the pattern consists only of nucleotide (aa) x

where K[x] is the frequency of nucleotide (amino acid) x on
invariable sites.  If the prior probability of being invariable
(for a given site) is f then the total likelihood is

    L = f K + (1-f) P
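
A minimal sketch of this mixture for a single alignment column, in
Python.  It is only an illustration, not taken from DNAML, PAUP* or any
other program: the function name site_likelihood and its arguments are
made up here, and P (the probability of the pattern under the tree and
the model M) is assumed to have been computed elsewhere.

```python
def site_likelihood(pattern, f, K, P):
    """Likelihood of one alignment column under L = f*K + (1-f)*P.

    pattern -- string with the states observed in the column, e.g. "AAAA"
    f       -- prior probability that a given site is invariable
    K       -- dict mapping each state x to K[x], its frequency on
               invariable sites
    P       -- probability of the pattern under the tree and model M
               (assumed to be supplied by the caller)
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie between 0 and 1")
    if len(set(pattern)) == 1:
        # Constant pattern: an invariable site shows state x with
        # probability K[x].
        k = K.get(pattern[0], 0.0)
    else:
        # A non-constant pattern cannot come from an invariable site.
        k = 0.0
    return f * k + (1.0 - f) * P


# Hypothetical numbers, for illustration only.
K = {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3}
print(site_likelihood("AAAA", f=0.5, K=K, P=0.1))
print(site_likelihood("AACA", f=0.5, K=K, P=0.02))
```

Keeping K as its own argument, separate from the Pi[x] that enter P,
is what lets the two sets of frequencies take different values.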

In the literature and in the implementations that I know of, one
does not distinguish between K[x] and Pi[x], though the two have
completely different meanings (and probably different values).
If there are no invariable sites then Pi[x] = actual frequencies
in the data and K[x] = 0, and the other way round if all sites
are invariable.  I agree that using the empirical Pi[x] for both
Pi[x] and K[x] probably does not have a critical influence on the
final result, but if there is a strong bias towards sites being
invariable then there might be a difference.  I think a good way might
be to count two different sets of base compositions (constant vs.
non-constant sites) to get estimates of the base compositions of
interest (invariable vs. variable sites).
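
The counting idea could be sketched roughly as follows in Python.
This is an assumption-laden illustration rather than an existing
routine: the function name split_base_composition is hypothetical, the
alignment is assumed to be a list of equal-length strings, and any
character outside the alphabet (gaps, ambiguity codes) is simply
ignored.

```python
from collections import Counter


def split_base_composition(alignment, alphabet="ACGT"):
    """Return (constant_freqs, variable_freqs) for an alignment.

    Columns in which all counted states are identical contribute to the
    first set of frequencies; all other columns contribute to the second.
    """
    constant = Counter()
    variable = Counter()
    for column in zip(*alignment):
        states = [c for c in column if c in alphabet]
        if not states:
            continue  # column holds only gaps or ambiguity codes
        target = constant if len(set(states)) == 1 else variable
        target.update(states)

    def normalise(counts):
        total = sum(counts.values())
        return {x: (counts[x] / total if total else 0.0) for x in alphabet}

    return normalise(constant), normalise(variable)


# Toy alignment, for illustration only.
const_freqs, var_freqs = split_base_composition(["AACG", "AATG", "AAGG"])
print(const_freqs)
print(var_freqs)
```

Note that a constant column is not necessarily an invariable site (a
variable site may simply not have changed), so the first set of
frequencies is only a rough stand-in for K[x].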

Korbinian



