
Evaluating low identity scores

Brian Foley btf at t10.lanl.gov
Mon Jan 12 16:35:15 EST 1998

thorsten burmester wrote:
> ...
> relationships of proteins with only some 15 to 20% identity scores.
> ... a possible method to evaluate the
> significance of such low similarity scores would be to randomise the
> sequences of these proteins by keeping the relative amino acid
> composition. 
> ... in case this original alignment was
> significant, the new similarity/identity scores should be
> significantly lower. However, if the observed identity is just due to
> similar amino acid compositions, the scores should be similar.
> My questions:
> 1. Does this sound reasonable, and has anybody ever tried a similar
> approach before?

	I have used this approach myself, either to compare the
two sequences in question, or to take each one, randomize the
order of the amino acids, and then compare it to the entire database.
	The overall score is not the only important thing to
consider.  One must also consider which amino acids are
contributing to the score.  If you take a protein class for 
which many different sequences are known (DNA polymerases,
GTP-binding proteins, 7 transmembrane domain receptors, etc)
you will see that there are invariant amino acids, known to
be critical to the function of the protein, that are conserved
in all members of the class.  
	The probability that two proteins share 15% amino acid
identity by chance is quite high, but the probability that the
15 amino acid residues per hundred that they share are the
very same ones shared by all other members of the class
is much lower.
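
A rough way to put numbers on that contrast is sketched below. It assumes a
deliberately simplified model that is not from the original post: a fixed,
ungapped alignment of n positions in which each position matches
independently with probability p. The function names and the value p = 0.06
are illustrative assumptions. Real alignment programs optimize gaps and
offsets, which raises the chance-identity baseline well above this model, so
only the comparison between the two quantities is meaningful, not the
absolute values.

```python
from math import comb

def prob_at_least_k_matches(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): k or more identities anywhere
    among n aligned positions (simplified independence model)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def prob_specific_sites_match(k, p):
    """Probability that k pre-specified positions (for example, the
    invariant residues of a protein class) all match by chance."""
    return p ** k

if __name__ == "__main__":
    n, k, p = 100, 15, 0.06  # p is an assumed per-position match probability
    anywhere = prob_at_least_k_matches(n, k, p)
    specific = prob_specific_sites_match(k, p)
    print(f"15 or more matches anywhere in 100: {anywhere:.3g}")
    print(f"the same 15 specified sites match:  {specific:.3g}")
    print(f"ratio: {anywhere / specific:.3g}")
```

The gap between the two numbers comes largely from the C(100, 15) different
ways of choosing which 15 positions match when no particular positions are
required.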
	In a search of one protein against a database, I am
more willing to expend energy in testing the significance of 
a result if all of the high-scoring proteins have something
in common.  For example, if my unknown protein is compared to 
the translation of GenBank and the top 50 scores are all to
DNA-binding proteins from different organisms, I am excited
to look at the results even if the top score is only 15% identity.
If, on the other hand, the top ten scores are near 20%, but the
proteins have nothing in common, and different sites are
matched in each of the top ten pairs, I am not very excited.
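
One way to check whether the same sites are matched across the top hits is
sketched here. It assumes you have already extracted, for each hit, the set
of query positions that are identical in the pairwise alignment; the input
layout, the function name, and the 0.8 threshold are hypothetical choices
for illustration, not part of any particular search program's output.

```python
from collections import Counter

def recurrent_sites(matched_by_hit, min_fraction=0.8):
    """Return query positions that are identical in at least
    `min_fraction` of the hits.

    matched_by_hit maps a hit name to the set of query positions
    (any consistent numbering) that match in that pairwise alignment.
    """
    n_hits = len(matched_by_hit)
    if n_hits == 0:
        return []
    # Count, for each query position, how many hits match there.
    counts = Counter(pos for sites in matched_by_hit.values() for pos in set(sites))
    return sorted(pos for pos, c in counts.items() if c / n_hits >= min_fraction)

if __name__ == "__main__":
    # Made-up example data: three hits that all share positions 5 and 9.
    hits = {
        "hit_A": {1, 5, 9},
        "hit_B": {5, 9, 20},
        "hit_C": {5, 9, 33},
    }
    print(recurrent_sites(hits, min_fraction=1.0))
```

A long list of recurrent positions suggests the hits share a conserved core;
an empty list corresponds to the "different sites in each pair" situation.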

> 2. Do you know any program that can randomise an amino acid sequence
> as described above?

	The U of Wisconsin Genetics Computer Group (GCG) package
has a SHUFFLE program.  I am sure there are others as well.
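
If no such program is at hand, the shuffling test itself is short enough to
sketch in Python. This is not the GCG SHUFFLE program and makes no attempt
to reproduce it. The scoring below is a plain Needleman-Wunsch global
alignment with assumed match, mismatch, and gap values rather than a PAM or
BLOSUM matrix, and all function names are made up for this illustration.

```python
import random

def align_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score (Needleman-Wunsch, linear gap penalty).
    Simple identity scoring; the parameter values are assumptions."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def shuffled(seq, rng):
    """Random permutation of seq; amino acid composition is preserved."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)

def shuffle_test(a, b, n_shuffles=200, seed=0):
    """Compare the real score with scores against shuffled copies of b.
    Returns (observed score, mean shuffled score, empirical p-value)."""
    rng = random.Random(seed)
    observed = align_score(a, b)
    null = [align_score(a, shuffled(b, rng)) for _ in range(n_shuffles)]
    mean = sum(null) / n_shuffles
    # Add-one correction so the estimate is never exactly zero.
    p_value = (1 + sum(s >= observed for s in null)) / (n_shuffles + 1)
    return observed, mean, p_value

if __name__ == "__main__":
    # Made-up sequences, for demonstration only.
    seq1 = "MKTAYIAKQRQISFVKSHFSRQ"
    seq2 = "MKTAYLAKQRNISFVKAHFSRQ"
    print(shuffle_test(seq1, seq2))
```

If the observed score sits well above the shuffled scores, composition alone
does not explain the similarity; if it falls inside their range, it probably
does.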

> Thanks for your help.
> Thorsten
> --
> Thorsten Burmester - thorsten at erfurt.thur.de

|Brian T. Foley               btf at t10.lanl.gov                       |
|HIV Database                 (505) 665-1970                         |
|Los Alamos National Lab      http://hiv-web.lanl.gov/index.html     |
|Los Alamos, NM 87544  U.S.A. http://www.t10.lanl.gov/~btf/home.html |

More information about the Mol-evol mailing list
