The average number of "matches" with exactly those parameters
that arises randomly is not too difficult to figure out. The
common region comes from a portion of sequence one of length LR.
There are (L1 + 1 - LR) such regions. There are (L2 + 1 - LR)
places where it can come from the second sequence. You then
need to consider which N of the LR places have mismatches.
There are (LR choose N) ways to pick those (or perhaps (LR-2
choose N) ways if you insist that both ends have a match). Now
you must calculate the probability of a match. If we're talking
nucleotides and we assume they are equally likely then, no
matter which nucleotide is on sequence 1, there is a 25% chance
that it matches on sequence 2. There's a 75% chance of a
mismatch. Putting it all together gives
E = (L1 + 1 - LR) * (L2 + 1 - LR) * (LR choose N) * (25%)^(LR-N) * (75%)^N
This is how many matches with the above parameters you expect to
find from random sequences. If E is 1.0 or more then the fact
that you found such a "match" is pretty boring. If E is small
then you can say with some confidence that the "match" you found
is significant.
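
If you'd rather let a computer do the arithmetic, here is a rough
Python sketch of the formula above. The function name and its
arguments are made up for illustration. p_match defaults to the 25%
nucleotide figure; under the same equal-likelihood assumption you
could pass 1/20 for amino acids, though real residue frequencies are
far from uniform. The ends_must_match switch uses the (LR-2 choose N)
count instead.

```python
from math import comb


def expected_random_matches(l1, l2, lr, n, p_match=0.25, ends_must_match=False):
    """Expected number of random "matches" of length lr with n mismatches.

    l1, l2  -- lengths of the two sequences
    lr      -- length of the common region
    n       -- number of mismatches inside the region
    p_match -- chance that two random residues agree (assumed 0.25,
               i.e. four equally likely nucleotides)
    ends_must_match -- hypothetical switch: if True, count only
               mismatch patterns that leave both ends matching
    """
    if not (0 <= n <= lr <= min(l1, l2)):
        raise ValueError("need 0 <= N <= LR <= min(L1, L2)")

    # Number of places the region can start in each sequence.
    placements = (l1 + 1 - lr) * (l2 + 1 - lr)

    # Number of ways to choose which positions are mismatches.
    if ends_must_match:
        if lr < 2:
            raise ValueError("ends_must_match needs LR >= 2")
        patterns = comb(lr - 2, n)  # zero when n > lr - 2
    else:
        patterns = comb(lr, n)

    # Probability of one particular match/mismatch pattern.
    p_pattern = p_match ** (lr - n) * (1 - p_match) ** n

    return placements * patterns * p_pattern


if __name__ == "__main__":
    # Example: two 1000-base sequences, a 20-base region, 2 mismatches.
    print(expected_random_matches(1000, 1000, 20, 2))
```

Keeping p_match as a parameter is just a convenience, so the same
routine covers other alphabets or a better estimate of the match
probability.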
In article <4u6oio$5tg at mserv1.dl.ac.uk>,
Leonid A. Sadofiev <leosad at may.stud.pu.ru> wrote:
> Dear all,
>
> I can't find a good idea, how to calculate:
>
> Than I comparing two sequences (amino acid or nucleotide)
> with length L1 and L2, I get a common region with
> length LR, containing N mismatches.
> The questions are:
> What a chance to obtain such region in unrelated sequences ?
> Can I use the binomical formulas for this case ?
>
> Could any body send me the formulas to calculate this chance
> or reference for it ?
>
> Please reply to leosad at may.stud.pu.ru
>
> Thanks in advance.
> Leonid A. Sadofiev