And now for something completely different (gap costs)

Gaston Gonnet gonnet at inf.ethz.ch
Fri Jun 26 12:04:10 EST 1992

In article <199206260442.AA22330 at Menudo.UH.EDU> Davison at UH.EDU (Dan Davison) writes:
>edwin rock said:
>> 	You're right, the insertion cost equation: a + bx does NOT
>> make much sense.
>I beg to disagree. Try reading the literature. Fitch and Smith, PNAS,
>March 1985, Optimal Sequence Alignments. 

This is the subject of another paper, we never tried to explain the
new deletion model in the Science paper.  The deletion paper is
titled "Empirical and Structural Models for Insertions and Deletions
in the Divergent Evolution of Proteins".  I will send a preprint to
anybody interested (please request by e-mail to oetting.inf.ethz.ch).
These results were presented at EMBL and more recently at NIH.

You may not know, but an insertion cost function linear in the
gap length (a+bx) corresponds to an exponential distribution in
the gap length.  It has already been observed by other researchers
(Demchuk et at.) that the exponential distribution does *not* fit
the actual data.  Maybe you do not want to be bothered with real
data, but just in case you do, here is the summary of one of the
10 PAM bands that we studied:  (my comments in ###)

PAM distance bounded between 21.7 and 29.5
50664 alignments considered, of which 42748 were within given bounds
822 connected components with a suitable match out of 2660
### out of 42748 alignments, there were 2660 "families" and only 822 were
	selected, the selection criteria is explained in the full paper ###

1733 indel gaps, 9339 deleted amino acids, 294881 aligned positions
	### for a single PAM band, this is a very significant sample size ###
indels/matching positions 1 : 170.2
deleted/matching positions 1 : 31.6

Average gap length: 5.39 +- 0.67  ### with 95% confidence ###

       Gap    number
     length   of gaps
        1      679
        2      270
        3      176
        4       98
        5       79
        6       67
        7       58
        8       47
        9       38
       10       29
       11       14
       12       16
       13       19
       14       17
       15       10
       16        7
       17       10
       18       13
       19        5
       20        7	### Now we fit the data up to here ###
Linear cost model, Simil = -27.89 - 1.400*(k-1), RMS=1.71051
Logarithmic model, Simil = -26.07 + 6.350*ln(k), RMS=0.621163
       21        7
       22        5
       23        5
       24        5
       25        5
       26        2
       27        4
       29        3
       30        3
       32        4
       33        4
       34        4
       35        3
       39        1	### Now we fit the data up to gap length 40 ###
Linear cost model, Simil = -28.70 - 0.9785*(k-1), RMS=2.59559
Logarithmic model, Simil = -25.93 + 6.597*ln(k), RMS=0.746104
       41        1
       43        2
       54        2
       56        2	### Now we fit the data up to gap length 60 ###
Linear cost model, Simil = -29.19 - 0.7832*(k-1), RMS=3.14195
Logarithmic model, Simil = -25.87 + 6.703*ln(k), RMS=0.804665
       >60      12
Once that we have probabilities of gaps, we can compute their cost
in a measure equivalent to the rest of the values of the Dayhoff
matrix alignments.  Some people may recall that the Dayhoff matrices
are normally computed to maximize the sum of 10*log(probability).  So
the right penalty of a gap is 10*log(prob-gap-of-size-k).

The lines labelled "Linear cost model" are a fit of the actual data
to an exponential distribution (which once taken the logarithms
becomes a linear cost).  The lines labelled "Logarithmic model"
assume a Zipfian probability (proportional to 1/k^theta) which give
a logarithmic penalty gap.  We propose a Zipfian gap distribution
in our paper, as a much more accurate model of deletions.

You can see two facts with the above data:  (a) that the RMS (Root
Mean Square deviation) is higher for the linear model, and deteriorates
rapidly, as the logarithmic stays about the same.  (b) and what I consider
most significant, the parameters of the fit change dramatically when
you consider larger and larger samples, a typical indication of fitting
to the wrong model.  The parameters for the logarithmic fit are very

A remarkable fact is that the gap length distribution remains roughly
constant for all PAM bands that we studied.

More tests are described in the paper.  In simple terms, what happens
is that the tail of the distribution is way too long for the rapid
initial decline.  And the rapid initial decline is supported by a very
large sample, so it must be quite accurate.

>>  Gonnet, Cohen, and Benner (1992.  Science 256:1443-1445) in their
>> "Exhaustive Matching of the Entire Protein Sequence Database" have
>> found that the probability of a gap occurring 
>> 	It's an interesting (and exciting!) paper.
>It is a piece of junk. (stronger word deleted). It contains *no*
>information that allows their work to be reproduced or checked in any
>way. An example of something, I'm not sure what, but it is NOT a good
>paper. It may be good science but you can't tell from what they wrote.
>Gonnet et al. has been discussed extensively on the info-gcg list and
>the upshot is that someone who has actually tested their "PAM" matrix
>(it's not; just shows that they do not understand what they are
>talking about, or they redefine words to mean what they want) that the
>results are, at best, no better, and often worse, than the *real* PAM
>120 and PAM250 matrices.

I am sorry, but Science forced us to cut down the number of pages
to a bare minimum.

We do not talk about "PAM matrices", we talk about mutation matrices
computed at a given PAM distance and Dayhoff matrices, again computed
at a given PAM distance.  Are you sure you understand what is what?

Our Dayhoff matrices, for any PAM distance, have been derived from
two orders of magnitude more information than most earlier efforts.
Together with the new deletion-cost model, they restore an accurate
notion to the N&W algorithm (that of finding the alignment which
maximizes the probability of having evolved from the same common
ancestor).  Many matrices in the literature and all the deletion
models I know are just recipes, with no formal backing.

We have found that our alignments, in particular for distant relations,
are better than the ones obtained with other matrices.  For closer
relations, any scoring matrix works well, even an identity matrix works!

If you are interested in understanding the theory of proper sequence
alignment, I can send you a few chapters that explain this in detail.

Gaston H. Gonnet,  Informatik, ETH, Zurich.

More information about the Bio-soft mailing list

Send comments to us at biosci-help [At] net.bio.net