# What is a good weighting scheme for sequence alignment statistics?

Michael Gribskov gribskov at SDSC.EDU
Mon Jun 14 12:25:54 EST 1993

```
Mark Gerstein asks:

>I have a long sequence alignment of roughly 500 sequences...
>...I want to compute statistics that weight the
>low homology sequences higher.  I thought of weighting each sequence
>in the average by 1/(homology^2).
>
>Is there a better weighting scheme? If it is published, what is the
>appropriate literature reference.

This is an important question in deriving patterns that are not biased towards
a subset of the sequences used in defining them.  In the hopes of
generating some discussion on this topic, here are my ideas:

The simplest approach is to use a set of ad hoc weights based either on
known phylogeny or on a crude clustering.  The general idea is to divide the
sequences into groups and then give each group equal weight (with the
weight for each sequence in the group equal to the group weight divided by
the number of sequences in the group).  A simple way to form the groups is
to cluster together sequences that share at least some threshold level of
sequence identity, e.g. 40%.  This is basically the approach used by
Henikoff and Henikoff (Proc.Natl.Acad.Sci.USA 89,10915-10919, 1992).

The basic idea is that if you have two sequences that are 90% identical,
you don't have two sequences' worth of information.  Indeed, 90% of the
information is common to the two sequences and 10% of each sequence is
unique, so that you have 1.1 sequences (0.9+0.1+0.1) of information.
Similarly, two sequences that are 80% identical correspond to 1.2 sequences
(0.8+0.2+0.2).

I use a program that calculates the weights by clustering the sequences at
a series of identity threshold levels (usually every 10%, i.e. 100%, 90%,
80%, etc.).  This is equivalent to building a crude sequence tree with
multifurcating branches, and is much simpler and faster than calculating an
actual phylogenetic tree.  I also take into account that completely
independent (non-homologous) sequences have some average level of identity:
I assume that sequences less than 20% identical count as full sequences.
I can make this program available if you are interested.  Both this
approach and that of Henikoff and Henikoff above are simplifications of the
idea of Felsenstein (Am.Naturalist 125,1-15, 1985) for dealing with
phylogenetically correlated data.  This approach has been applied to pairs
of sequences by Altschul et al. (J.Mol.Biol. 207,647-653, 1989) in the
context of finding the optimal sum-of-pairs multiple alignment.

Another interesting approach is that of Sibbald and Argos (J.Mol.Biol.
216,813-818, 1990), which calculates a weight for each sequence based on the
volume around the sequence in an abstract sequence space.  This volume is
determined by randomly generating sequences based on the observed residue
frequencies at each position, and for each sequence counting the number of
times it is the most similar to the random sequence.  I found this method
to work well, but the simulation is very time-consuming for large numbers
of sequences.  The weights often do not agree with my intuitive ideas,
varying in a capricious way across clusters of sequences, but the results
are sensible.
With all of these methods there is some question of how to treat gaps, but
the problem seems more severe with the Sibbald and Argos approach.

All of these approaches, I believe, will give weights that are more like
1/similarity than 1/(similarity)^2.  I hope this helps,

Michael Gribskov

--------------------------------------------------------------------------

Michael Gribskov
San Diego Supercomputer Center

gribskov at sdsc.edu
(619) 534 - 8312

```
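The group-based weighting described in the post (cluster at an identity threshold, give each group equal weight, split the group weight evenly among its members) can be sketched as follows. This is only an illustration, not the author's program: the function names `percent_identity` and `cluster_weights` are hypothetical, and both the use of single-linkage clustering and the convention of ignoring gapped columns when computing identity are assumptions that the post does not specify.

```python
from itertools import combinations


def percent_identity(a, b, gap="-"):
    """Fraction of identical residues over columns where neither sequence has a gap.

    Ignoring gapped columns is an assumption of this sketch.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def cluster_weights(seqs, threshold=0.40):
    """Weights for aligned sequences; the weights sum to 1.

    Sequences are grouped by single linkage (an assumption): two sequences
    land in the same group if a chain of pairs with identity >= threshold
    connects them. Each group gets equal weight, shared equally by its members.
    """
    n = len(seqs)
    parent = list(range(n))  # union-find forest over sequence indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if percent_identity(seqs[i], seqs[j]) >= threshold:
            parent[find(i)] = find(j)

    roots = [find(i) for i in range(n)]
    sizes = {}
    for r in roots:
        sizes[r] = sizes.get(r, 0) + 1

    n_groups = len(sizes)
    return [1.0 / (n_groups * sizes[r]) for r in roots]


if __name__ == "__main__":
    toy = ["ACDEFGHIKL", "ACDEFGHIKV", "WWWWWWWWWW"]
    print(cluster_weights(toy, threshold=0.40))
```

In the toy alignment the first two sequences are 90% identical and fall into one group, so they share that group's weight, while the unrelated third sequence keeps a whole group's weight to itself.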

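The Monte Carlo idea attributed to Sibbald and Argos in the post (generate random sequences from the residues observed at each position, and count how often each real sequence is the closest one) might look roughly like the sketch below. It is not their published implementation. The function name `monte_carlo_weights` is hypothetical, and several details are assumptions of this sketch: each position is drawn uniformly from the distinct residues seen in that column, similarity is measured by the number of mismatching columns, ties are split equally among the tied sequences, and a gap is treated as just another symbol.

```python
import random


def monte_carlo_weights(seqs, trials=10000, seed=0):
    """Estimate a weight per aligned sequence by random sampling.

    Each trial builds a random probe sequence and credits the real
    sequence(s) closest to it. The returned weights sum to 1.
    """
    rng = random.Random(seed)
    n = len(seqs)
    length = len(seqs[0])

    # Distinct symbols observed in each alignment column (gaps included,
    # which is an assumption of this sketch).
    columns = [sorted(set(s[k] for s in seqs)) for k in range(length)]

    credit = [0.0] * n
    for _ in range(trials):
        probe = [rng.choice(col) for col in columns]
        # Number of mismatching columns between the probe and each sequence.
        dists = [sum(p != c for p, c in zip(probe, s)) for s in seqs]
        best = min(dists)
        winners = [i for i, d in enumerate(dists) if d == best]
        for i in winners:
            credit[i] += 1.0 / len(winners)  # split ties equally (assumption)

    return [c / trials for c in credit]


if __name__ == "__main__":
    toy = ["AAAA", "AAAA", "CCCC"]
    print(monte_carlo_weights(toy, trials=2000))
```

With two identical sequences and one divergent one, the duplicates always tie and therefore split their share of the sampled space, so the divergent sequence ends up with the largest weight. The cost grows with the number of trials times the number of sequences times the alignment length, which is consistent with the remark in the post that the simulation is slow for large alignments.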