Sequence Similarity??

William R. Pearson wrp at dayhoff.med.Virginia.EDU
Thu Apr 21 08:01:25 EST 1994

In article <0097D3CC.9A84E9A0 at vms.csd.mu.edu>,
 <6566friedman at vms.csd.mu.edu> wrote:
>As a new reader of the bionet.software.gcg news group, I am looking for 
>answers to a problem that has been bothering me for some time.  Students 
>in my department often will perform homology searches (fasta or tfasta) of 
>the GenEMBL data base and pull up sequences of low, but they claim, 
>significant similarity.  Clearly, if this were a simple sampling of a 
>population, they would be expected to demonstrate "significance" to a 
>specified confidence level.  With DNA or protein sequences, we seem to 
>simply nod and wink at the comparison and say "yeah, that looks 
>I would appreciate your responses concerning how this problem has been 
>treated and some specific references dealing with this question.

	There are now some very good methods for evaluating the
statistical significance of a match.  The current version of the FASTA
package, which is available via anonymous ftp from virginia.EDU in
pub/fasta/fasta17.shar, provides prdf and prss, which generate the
distribution of scores that are obtained when a sequence is compared
to 100 - 500 random sequences of the same length and amino-acid
composition.  These scores can be used to calculate the parameters of an
extreme value distribution, which will then give you the expectation of
obtaining a score by chance.
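
	As a rough illustration of the idea above (not the actual prdf/prss
code), the following Python sketch shuffles a sequence to preserve its
length and composition, scores the query against each shuffled copy, fits
an extreme value (Gumbel) distribution to those scores, and converts an
observed score into a chance probability.  Everything named here is an
assumption made for the example: toy_score is a hypothetical stand-in for
a real FASTA or Smith-Waterman score, the two sequences are made up, and
the method-of-moments fit is just one simple way to estimate the
parameters; the programs themselves may fit them differently.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def shuffled_copies(seq, n, rng):
    """Yield n shuffled versions of seq (same length and composition)."""
    letters = list(seq)
    for _ in range(n):
        rng.shuffle(letters)
        yield "".join(letters)


def toy_score(a, b):
    """Hypothetical stand-in score: longest ungapped run of identities
    over all alignment offsets. A real analysis would use FASTA or
    Smith-Waterman scores instead."""
    best = 0
    for offset in range(-len(b) + 1, len(a)):
        run = 0
        for i in range(max(0, offset), min(len(a), offset + len(b))):
            if a[i] == b[i - offset]:
                run += 1
                best = max(best, run)
            else:
                run = 0
    return best


def fit_gumbel(scores):
    """Method-of-moments estimate of the Gumbel location (mu) and scale (beta)."""
    sd = statistics.stdev(scores)
    if sd == 0:
        raise ValueError("all shuffled scores are identical; cannot fit")
    beta = sd * math.sqrt(6) / math.pi
    mu = statistics.mean(scores) - EULER_GAMMA * beta
    return mu, beta


def chance_probability(score, mu, beta):
    """P(S >= score) for a single comparison under the fitted Gumbel."""
    return -math.expm1(-math.exp(-(score - mu) / beta))


if __name__ == "__main__":
    rng = random.Random(1994)                  # fixed seed for repeatability
    query = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # made-up sequences
    target = "MKLAYIAKQRNISFVKAHFSRQAEERLGMIEVK"
    observed = toy_score(query, target)
    null_scores = [toy_score(query, s)
                   for s in shuffled_copies(target, 200, rng)]
    mu, beta = fit_gumbel(null_scores)
    p = chance_probability(observed, mu, beta)
    print(f"observed={observed} mu={mu:.2f} beta={beta:.2f} P={p:.3g}")
```

	Multiplying the single-comparison probability by the number of
sequences searched gives an approximate expected number of chance matches
at or above that score in a database search.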

	Unfortunately, GCG has not included prdf (fasta scores) and
prss (Smith-Waterman scores) or their predecessors rdf2 and rss in
their "supported" package, but I believe that they are part of the
unsupported software.  Hopefully this will change with an upcoming version.

Bill Pearson
