In article <0097D3CC.9A84E9A0 at vms.csd.mu.edu>,
<6566friedman at vms.csd.mu.edu> wrote:
>As a new reader of the bionet.software.gcg news group, I am looking for
>answers to a problem that has been bothering me for some time. Students
>in my department often will perform homology searches (fasta or tfasta) of
>the GenEMBL data base and pull up sequences of low, but they claim,
>significant similarity. Clearly, if this were a simple sampling of a
>population, they would be expected to demonstrate "significance" to a
>specified confidence level. With DNA or protein sequences, we seem to
>simply nod and wink at the comparison and say "yeah, that looks
>homologous".
>
>I would appreciate your responses concerning how this problem has been
>treated and some specific references dealing with this question.
There are now some very good methods for evaluating the
statistical significance of a match. The current version of the FASTA
package, which is available via anonymous ftp from virginia.EDU in
pub/fasta/fasta17.shar, provides prdf and prss, which generate the
distribution of scores that are obtained when a sequence is compared
to 100-500 random sequences of the same length and amino-acid
composition. These scores can be used to calculate the parameters of an
extreme value distribution, which will then give you the expectation of
obtaining a score by chance.
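
To make the idea concrete, here is a rough sketch in Python of the
shuffle-and-fit procedure described above. It is not the prdf/prss code
and makes several assumptions of its own: the scoring function is a toy
Smith-Waterman with made-up match/mismatch/gap values, the extreme value
parameters are estimated by the method of moments (the real programs may
estimate them differently), and all function names are hypothetical.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gaps, toy scoring)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


def shuffled_scores(query, target, n_shuffles=200, rng=None):
    """Score the query against shuffled copies of the target.

    Shuffling keeps the length and residue composition of the target.
    """
    rng = rng or random.Random(0)  # fixed seed, assumed here for repeatability
    letters = list(target)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        scores.append(local_score(query, "".join(letters)))
    return scores


def fit_gumbel(scores):
    """Method-of-moments estimate of extreme value location and scale."""
    beta = statistics.stdev(scores) * math.sqrt(6) / math.pi
    if beta == 0:
        raise ValueError("all shuffled scores are identical; cannot fit")
    mu = statistics.mean(scores) - EULER_GAMMA * beta
    return mu, beta


def tail_probability(score, mu, beta):
    """P(S >= score) for a random comparison under the fitted distribution."""
    return 1.0 - math.exp(-math.exp(-(score - mu) / beta))
```

One would compute local_score() for the real pair, fit the distribution
to shuffled_scores(), and pass the real score to tail_probability().
Multiplying that probability by the number of sequences searched gives a
rough expected number of chance matches at that score or better.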
Unfortunately, GCG has not included prdf (fasta scores) and
prss (Smith-Waterman scores) or their predecessors rdf2 and rss in
their "supported" package, but I believe that they are part of the
unsupported software. Hopefully this will change with an upcoming version.
Bill Pearson