Gap penalties, PAM matrices and so on

Mark Cohen cohen at cumuli.vmsmail.ethz.ch
Fri Jun 26 13:33:17 EST 1992

In article <1992Jun26.103554.12221 at gserv1.dl.ac.uk> BIONET at EARN.FRCGM51 
J.L. Risler writes:

>- what is a "liberal target score" ??
One that makes sure that no potential matches are overlooked.  We later state
that the matches found in the first round of matching were refined to remove
the questionable matches.

>- "mutation matrices ... differ, depending on whether they were derived
>  from protein pairs that are distantly homologous or from protein pairs
>  that are closely homologous". What a discovery !!
>- how can anyone align confidently protein sequences that are "distantly
>  homologous" and use the results to build a matrix ?

We did not align "distantly homologous" proteins and build a matrix from the
results.  We aligned all the proteins in the database with all the others.
Where the scores obtained (using Dayhoff's 1978 matrix) indicated that the
alignments were significant (i.e. that the probability of the alignment was
significantly higher than that of an alignment of two random sequences), these
alignments were used in the construction of the matrix.
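
A minimal sketch of that selection step, assuming each alignment is available
as a (score, sequence_a, sequence_b) tuple with "-" marking gaps.  The function
name, the tuple layout and the threshold are hypothetical and chosen only for
illustration; this is not the code used for the paper.

```python
from collections import Counter


def count_aligned_pairs(alignments, min_score):
    """Tally aligned residue pairs from alignments scoring at least min_score.

    alignments: iterable of (score, seq_a, seq_b), where seq_a and seq_b are
    equal-length aligned strings and "-" marks a gap (assumed layout).
    """
    counts = Counter()
    for score, seq_a, seq_b in alignments:
        if score < min_score:
            continue  # not significant: contributes nothing to the matrix
        for a, b in zip(seq_a, seq_b):
            if a == "-" or b == "-":
                continue  # gap positions carry no substitution information
            # Sort the pair so that (A, C) and (C, A) share one symmetric cell.
            counts[tuple(sorted((a, b)))] += 1
    return counts


example = [
    (120.0, "AC-D", "ACWD"),  # above the threshold, so it is counted
    (10.0, "AAAA", "CCCC"),   # below the threshold, so it is ignored
]
pair_counts = count_aligned_pairs(example, min_score=80.0)
```

Only the first example alignment contributes, so the low-scoring A/C pairs
never reach the counts from which a matrix would be derived.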
>- what are "distantly homologous" proteins ? Two proteins that get a low
>  score when aligned ? I bet that this is the case between *any* pair of
>  sequences. GAP or BESTFIT, for example, will always return something..
>  Or, maybe, two proteins whose score is below a "liberal target score" ?
Distantly homologous proteins are exactly that: proteins for which the
alignment score is significantly above the score of aligned random sequences,
yet not so high that they are unambiguously related.  An example might be
eubacterial/eukaryotic GAPDH and archaebacterial GAPDH.  Significant
alignment is found only in a very short section of the sequence.  Doolittle
even argued that the archaebacterial enzyme was more closely related to bovine
transhydrogenase than to other GAPDHs.

>- what is the influence of the enormous redundancy found in protein
>  databanks (hundreds of cytochromes, thousands of histones, zillions of
>  globulins, ...)
Dayhoff calculated her matrix specifically from these groups because at the
time they represented the vast majority of proteins that had been sequenced
and that could be accurately aligned.  A much more serious problem is the
immunoglobulins, which account for a far larger number of the matches found
than would be expected simply from their representation in the database.
We will in future publish the matrices calculated with, without and only for
the immunoglobulins.  The results do not change our opinion significantly.

>- the explanation for the -3/2 power concerning the probability of a gap is
>  a joke ? Seems like an insertion is made up by synthetizing an
>  oligopeptide whose ends must lie close together, then open the protein
>  where the insertion must take place, and then insert the oligopeptide ...
In effect, yes, that is exactly how an insertion seems to occur when you look
at a protein, although the oligopeptide is synthesized along with the protein
into which it is being inserted.  We understand that the insertion is made in
the gene; we simply believe that the data are explained by selection at the
protein level.  In other words, insert what you like in the gene and it will
be expressed, but if the reading frame shifts or the protein no longer
functions we never see the results.  Insertion (and deletion) of randomly
coiled loops on the surface of the protein, which do not disrupt the core
structure and thus the function, is exactly what you might expect.
The k^(-3/2) term is an experimental result.  The probability of the two ends
of a chain being close in space is dependent on the length of the chain, as
described in the paper, or you can read Flory's book on polymers.
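
To make the length dependence concrete, here is a minimal sketch that assumes
only that the probability of a gap of length k is proportional to k^(-3/2).
The normalization to a gap of length 1 and the 10*log10 score scaling are
assumptions made for illustration, not constants taken from the paper.

```python
from math import log10


def relative_gap_probability(k, exponent=-1.5):
    """Probability of a gap of length k relative to a gap of length 1.

    Assumes P(k) is proportional to k**exponent; the constant of
    proportionality cancels in the ratio and is therefore not needed.
    """
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return float(k) ** exponent


def length_penalty(k):
    """Extra cost of a gap of length k over a gap of length 1.

    Expressed in 10*log10 units (an assumed scaling for illustration).
    """
    return -10.0 * log10(relative_gap_probability(k))


# Penalties for a few gap lengths; each grows as 15*log10(k).
penalties = {k: length_penalty(k) for k in (1, 2, 10, 100)}
```

Under these assumptions a gap of length 10 costs about 15 units more than a
gap of length 1, and a gap of length 100 about 30 units more, so the cost
rises with length but ever more slowly.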
>        Well, I prefer to stop here. May I draw your attention on the paper
>by Jones, Taylor and Thornton in the last CABIOS issue ? Their aim was also
>to build an updated Dayhoff matrix. They did it, with the difference that
>their procedure is crystal clear. And that, by necessity, their matrix was
>not built with "distantly homologous proteins".
Jones et al., like us, found that the differences between the Dayhoff 1978
matrix and the recalculated matrix were largest for the least common amino
acid pairs, e.g. W-Y or W-C.  Their paper is somewhat longer than ours, hence
their more detailed explanation.

A final comment specifically for Dan Davidson.
You suggested that our gap penalty did not have a length term (this was on the
info-gcg list, which I don't have direct access to).  We gave 3 equations.  We
stated in the text that, where P is the probability of a gap of length k, then

a linear approximation being


All of them have a k term ?

Please post any comments or followups here or on bionet.general; we don't
have the info-gcg list.  I suspect that much of the adverse comment is a result
of misunderstanding.
