second posting: homology, identity.....

Charles Bailey bailey at hmivax.humgen.upenn.edu
Tue Feb 22 14:18:37 EST 1994

In article <2kcqem$8uj at news.univ-rennes1.fr>, moreaut at univ-tours.fr writes:
> When I align two related proteins with Gap or Bestfit in the GCG package,
> the % identity and % similarity vary between about 35 to 50 % identity.
> This percentage obviously depends of the gap parameters chosen for this 
> alignment. OK

Very true, and a point often missed by people who use only the 'default
parameters' when aligning.

> I would like to know what is the best thing to do to quantify the alignment
> is it better to give a range for the % identity or is it better to give
> % identity for different sections of the 2 proteins ?
> Same questions for homology and/or similarity ?

This is rather a thorny question.  The alignment metric, reported as the
'Quality' value in Gap output, is the best value to use when comparing
alignments to see which one is a better result using this algorithm with these
parameters.  Whether this means a given alignment is a better reflection of
some biological fact is something one can determine only by considering the
alignment in light of other data.  For instance, it's reasonable to conclude
that two cDNAs which encode proteins with similar functions and align well with
each other may be related by evolution.

In the Gap output, the % identity is no more than it says - the percent of
symbol pairs in the two aligned sequences whose members are identical.  The %
similarity is the percent of symbol pairs in the two aligned sequences for
which the score in the comparison table is greater than a threshold value,
which is 0.5 by default, but can be changed using the /Pair command line
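
To make the two percentages concrete, here is a minimal Python sketch of the
definitions above.  It is not GCG code: the function names, the '.' gap
symbol, the toy comparison table, and the choice to skip positions that
contain a gap are all assumptions made for illustration.  Only the 0.5
default threshold and the 'greater than' test come from the description above.

```python
GAP = "."  # assumed gap symbol; not taken from the Gap documentation

# Toy comparison table (hypothetical values, NOT a real GCG table).
TOY_TABLE = {
    ("A", "A"): 1.5, ("L", "L"): 1.5, ("I", "I"): 1.5,
    ("K", "K"): 1.5, ("R", "R"): 1.5,
    ("L", "I"): 0.8, ("K", "R"): 0.8,
}


def pair_score(table, a, b):
    """Look up a symmetric score; pairs missing from the table score 0.0."""
    return table.get((a, b), table.get((b, a), 0.0))


def percent_identity_similarity(seq_a, seq_b, table, threshold=0.5):
    """Return (% identity, % similarity) for two already-aligned sequences.

    Assumption: positions where either sequence has a gap are not counted.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != GAP and b != GAP]
    if not pairs:
        return 0.0, 0.0
    identical = sum(1 for a, b in pairs if a == b)
    similar = sum(1 for a, b in pairs if pair_score(table, a, b) > threshold)
    return 100.0 * identical / len(pairs), 100.0 * similar / len(pairs)


identity, similarity = percent_identity_similarity("ALK.A", "AIRKL", TOY_TABLE)
print(identity, similarity)
```

Note that identical pairs normally also count as similar, since a symbol's
score against itself is usually above the threshold, so % similarity comes out
at least as large as % identity.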

> I should be happy to get clear definitions on homology, similarity
> and identity percentages

identity = exact match between the two symbols at an aligned position

similarity = components of the two sequences (nucleotides, amino acids, or
whatever) are related to each other by some score, using some algorithm, which
measures the amount of change one would have to impose on one sequence to get
to the other sequence.  (Actually, this is 'difference', which is the inverse
of similarity, but the two are based on the same approach.)
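
As one simple example of a 'difference' measure in this sense, the sketch
below computes the classic edit (Levenshtein) distance: the minimum number of
single-symbol substitutions, insertions, and deletions needed to turn one
sequence into the other.  This is a swapped-in illustration, not the scoring
Gap uses; the function name and the unit costs are assumptions.

```python
def edit_distance(a, b):
    """Minimum number of single-symbol edits needed to turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1  # unit substitution cost (assumed)
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("GATTACA", "GACTATA"))
```

A distance of zero means the sequences are identical; larger values mean more
change is needed, which is why such a measure runs opposite to similarity.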

homology = components of the two sequences are biologically related by a
process such as divergent evolution.  This cannot be determined by an alignment
like the one Gap performs (so there is nothing in the Gap output which claims
to be 'homology'), though such alignments are often part of an investigation
into whether two sequences are homologous.  In practice, many people use the
term 'homologous' to mean 'similar', on the assumption that if two sequences
are similar to each other, they are probably the result of evolutionary
divergence from a common ancestor.  There is often, but not always, some truth
to this; I'll leave it to the people who really know what they're doing to
correct me if I've made mistakes here.

                    Charles Bailey

!              Computational Biology and Informatics Laboratory
!         Dept. of Genetics, Univ. of Pennsylvania School of Medicine
!              Philadelphia, PA USA 19104     Tel. (215) 573-3112
!          Internet: bailey at genetics.upenn.edu  (IN
                            My words, not theirs.

More information about the Info-gcg mailing list
