Gaps and PAMs

David T Jones ucbcdtj at ucl.ac.uk
Mon Jun 29 11:54:00 EST 1992

Gaston Gonnet writes:
>yes, I agree, but with "subjective terms" we cannot do science.  The
>least controversial definition of "significance" is one which relates
>the probability of an homology against the (null hypothesis) probability
>of a random coincidence.  As the model of homology gets more precise,
>or you start including information of other nature (e.g. 3-d structure)
>then the probabilities may be computed differently.  But the definition
>remains the same.

Alignments can be significant and yet be wrong. My experience of
multiple sequence alignment as a developer and as a user is that
every program I've used, modified or written will make significant
errors in alignments when reasonably remote sequences are used - say
sequences with < 45% identical residues. I always find myself either
editing automatically generated multiple alignments, or twiddling
the alignment parameters to get closer to what I believe is a better
alignment. When I have some reference structures to guide the
alignment, then things are easier - I can generate a structural
alignment and check the sequence alignment to ensure that at the
very least structurally equivalent residues match up. When I have
no structural knowledge of the sequences, and apart from perhaps
conserved functionally important residues I have no other knowledge
of which features should be aligned in the sequences, then I do alas
rather consider myself to be performing black magic - maybe grey
magic is a bit more appropriate.

I don't think alignment technology has reached a point where an
automatic procedure can take as input an entire protein sequence
databank and generate as output a complete set of accurately aligned
protein sequence families. The alignments may well be statistically
significant, and may well look highly plausible, but these observations
alone cannot guarantee the correctness of alignments.

Surely there is a danger that these inevitable misalignments between
remote sequences will produce a greater degree of error in the
calculated matrices than the databank errors, extrapolation errors
and genetic code bias you are trying (not unreasonably) to avoid?

- David Jones -
jones at bsm.bioc.ucl.ac.uk
