Thierry Moreau asked:
>We have purified a cystatin inhibitor and sequenced it by Edman degradation. This sequence (about 110 residues) contains many amino acid doublets for example: GG, AA, YY, VV, MM, PP...
>I would like to know if the presence of such doublets has a special meaning due to e.g some genetic events?? or if it's a normal and common feature in proteins??
How many is many? For an average protein the likelihood that two
residues selected at random will be the same is about 0.07 (it could be
much higher than that in a protein with an unusual distribution of amino
acids, and it cannot be less than 0.05 if the protein contains only the
standard 20 types of amino acid). A protein of 110 residues has 109
pairs of adjacent residues, so we should start out by expecting around
eight doublets (109 x 0.07 = 7.6), but I wouldn't start worrying about
special causes unless the observed number was around twice that or
more, say around 15 or more.
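
The back-of-an-envelope estimate can be sketched in a few lines of Python. This is only an illustration: the function names are my own, the match probability is taken as the sum of squared residue frequencies (sampling with replacement, which is what the rough estimate assumes), and overlapping pairs are counted separately, so AAA counts as two doublets.

```python
from collections import Counter

def match_probability(seq):
    """Chance that two residues drawn at random (with replacement)
    from the composition of seq are the same: sum of squared frequencies."""
    n = len(seq)
    return sum((count / n) ** 2 for count in Counter(seq).values())

def expected_doublets(seq):
    """Rough expectation: (n - 1) adjacent pairs times the match probability."""
    return (len(seq) - 1) * match_probability(seq)

def count_doublets(seq):
    """Number of adjacent identical pairs; AAA counts as two doublets."""
    return sum(a == b for a, b in zip(seq, seq[1:]))
```

With all 20 amino acids equally frequent, match_probability gives the lower limit of 0.05 mentioned above. For 110 residues and a match probability of 0.07 the expectation is 109 x 0.07, or about 7.6.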
One could of course do the calculation in a much more rigorous way than
the back-of-an-envelope approach adopted here. To get a more accurate
idea of whether your distribution is unusual you could shuffle the
sequence of your protein by computer 1000 or so times, to obtain
1000 or so random permutations with the same amino acid composition,
and count the proportion of random permutations that contain as many
doublets as you observe, or more. That proportion estimates the
probability of seeing so many doublets by chance alone.
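
A minimal sketch of such a randomization test in Python, assuming the sequence is supplied as a string of one-letter codes. The function names, the default of 1000 shuffles and the fixed seed are my own choices for illustration, not part of any standard package.

```python
import random

def count_doublets(seq):
    """Number of adjacent identical pairs; AAA counts as two doublets."""
    return sum(a == b for a, b in zip(seq, seq[1:]))

def doublet_permutation_test(seq, n_shuffles=1000, seed=0):
    """Shuffle seq n_shuffles times and return (observed doublets,
    proportion of shuffles with at least that many doublets)."""
    rng = random.Random(seed)      # fixed seed so the run is repeatable
    observed = count_doublets(seq)
    residues = list(seq)           # shuffling keeps the composition fixed
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        if count_doublets(residues) >= observed:
            hits += 1
    return observed, hits / n_shuffles
```

Calling doublet_permutation_test on the sequence returns the observed number of doublets and the estimated proportion. A small proportion (say below 0.05) would suggest more doublets than the composition alone accounts for. With only 1000 shuffles the estimate cannot resolve proportions much below about 0.001.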
As 110 is not a big number, maybe you could include the complete
sequence in your reply.
Athel Cornish-Bowden
Email: athel at bigfoot.com
Home page: http://ir2lcb.cnrs-mrs.fr/lcbpage/athel/homepage.htm
(changed)