# How to calculate ?

Lee Newberg leen at bio-3.bsd.uchicago.edu
Fri Aug 9 00:29:37 EST 1996

The average number of "matches" with exactly those parameters
that arises randomly is not too difficult to figure out.  The
common region comes from a portion of sequence one of length LR.
There are (L1 + 1 - LR) such regions.  There are (L2 + 1 - LR)
places where it can come from the second sequence.  You then
need to consider which N of the LR places have mismatches.
There are (LR choose N) ways to pick those (or perhaps (LR-2
choose N) ways if you insist that both ends have a match.)  Now
you must calculate the probability of a match.  If we're talking
nucleotides and we assume they are equally likely then, no
matter which nucleotide is on sequence 1, there is a 25% chance
that it matches on sequence 2.  There's a 75% chance of a
mismatch.  Putting it all together gives

E = (L1 + 1 - LR) * (L2 + 1 - LR) * (LR choose N) * (25%)^(LR-N) * (75%)^N

This is how many matches with the above parameters you expect to
find in random sequences.  If E is 1.0 or more then the fact
that you found such a "match" is pretty boring.  If E is much
smaller than 1 then you can say with some confidence that the
"match" you found is significant, since the chance of seeing at
least one such match at random is no larger than E.
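
For anyone who wants to plug in numbers, here is a rough Python
sketch of the formula for E given above.  The function name and the
p_match and ends_must_match parameters are made up for
illustration; p_match defaults to the 25% figure for equally likely
nucleotides, and something like 0.05 would be the analogous
assumption for 20 equally likely amino acids.

```python
from math import comb


def expected_random_matches(l1, l2, lr, n, p_match=0.25, ends_must_match=False):
    """Expected number of chance regions of length lr with exactly n mismatches.

    l1, l2  -- lengths of the two sequences
    lr      -- length of the common region
    n       -- number of mismatches inside the region
    p_match -- assumed chance that two random residues match
               (0.25 for equally likely nucleotides)
    ends_must_match -- if True, use (lr-2 choose n) instead of (lr choose n)
    """
    if not 0 <= n <= lr <= min(l1, l2):
        raise ValueError("need 0 <= n <= lr <= min(l1, l2)")
    if not 0.0 <= p_match <= 1.0:
        raise ValueError("p_match must be a probability")

    # Number of start positions for the region in each sequence.
    placements = (l1 + 1 - lr) * (l2 + 1 - lr)

    # Number of ways to choose which positions are mismatches.
    if ends_must_match:
        if lr < 2:
            raise ValueError("need lr >= 2 when both ends must match")
        arrangements = comb(lr - 2, n)
    else:
        arrangements = comb(lr, n)

    # lr - n positions match, n positions mismatch.
    return placements * arrangements * p_match ** (lr - n) * (1 - p_match) ** n


if __name__ == "__main__":
    # Hypothetical example: two 1000-base sequences, a 20-base region
    # with 2 mismatches.
    print(expected_random_matches(1000, 1000, 20, 2))
```

The sequence lengths in the example call are arbitrary.  The result
is only as good as the assumption that residues are independent and
equally likely.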

In article <4u6oio\$5tg at mserv1.dl.ac.uk>,
> Dear all,
>
> I can't work out how to calculate the following:
>
> When I compare two sequences (amino acid or nucleotide)
> with lengths L1 and L2, I get a common region of
> length LR containing N mismatches.
> The questions are:
> What is the chance of obtaining such a region in unrelated sequences?
> Can I use the binomial formula for this case?
>
> Could anybody send me the formulas to calculate this chance,
> or a reference for it?