6566friedman at vms.csd.mu.edu writes:
>As a new reader of the bionet.software.gcg news group, I am looking for
>answers to a problem that has been bothering me for some time. Students
>in my department often will perform homology searches (fasta or tfasta) of
>the GenEMBL database and pull up sequences of low but, they claim,
>significant similarity. Clearly, if this were a simple sampling of a
>population, they would be expected to demonstrate "significance" to a
>specified confidence level. With DNA or protein sequences, we seem to
>simply nod and wink at the comparison and say "yeah, that looks
>homologous".
>I would appreciate your responses concerning how this problem has been
>treated and some specific references dealing with this question.
The most commonly used method for estimating statistical significance
with FASTA is shuffling: produce n (say, 100) shuffled copies of the
matched sequence, score the query against each of them, and compare
the distribution of those scores with the score for the original,
unshuffled sequence.
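
As a rough illustration of the shuffling idea, here is a minimal Python
sketch. It is not what FASTA itself does: local_score is a plain
Smith-Waterman score with made-up match/mismatch/gap values, standing
in for whatever score your search program reports, and the function
names, the default n=100, and the fixed seed are all assumptions chosen
for the example. Shuffling keeps the composition of the matched
sequence but destroys its order, so the shuffled scores show what that
composition alone would produce.

```python
import random
import statistics


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty.

    Stand-in scoring function; the parameter values are illustrative only.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def shuffle_test(query, hit, n=100, seed=0):
    """Score query against hit and against n shuffled copies of hit.

    Returns (observed score, list of shuffled scores, z-score).
    The z-score is NaN if all shuffled scores are identical.
    """
    rng = random.Random(seed)
    observed = local_score(query, hit)
    residues = list(hit)
    shuffled_scores = []
    for _ in range(n):
        rng.shuffle(residues)  # same composition, random order
        shuffled_scores.append(local_score(query, "".join(residues)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    z = (observed - mean) / sd if sd > 0 else float("nan")
    return observed, shuffled_scores, z


if __name__ == "__main__":
    query = "ACGTACGTGACCTGAAGTCCATGGA"
    hit = "ACGTACGTGACCTGTAGTCCATGCA"
    observed, scores, z = shuffle_test(query, hit)
    beaten = sum(s >= observed for s in scores)
    print(f"observed score {observed}, z = {z:.1f}")
    print(f"{beaten} of {len(scores)} shuffles scored at least as high")
```

A score many standard deviations above the shuffled mean, with no
shuffle matching it, is the kind of evidence the method looks for;
where to draw the line is a judgment call.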
One of the significant advantages of BLAST over FASTA is the built-in
statistical estimator of alignment confidence. While there are caveats
to using the P value from BLAST, it does give a good first-order estimate
of the significance.
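
For a sense of where such a P value comes from, the sketch below
applies the Karlin-Altschul formulas for ungapped local alignments:
the expected number of chance hits scoring at least S is
E = K*m*n*exp(-lambda*S), and P = 1 - exp(-E). The function name and
the numbers in the example are hypothetical; lambda and K depend on
the scoring matrix and residue composition and are not taken from any
particular BLAST run.

```python
import math


def karlin_altschul(score, m, n, lam, K):
    """Return (E, P) for a local alignment score.

    m, n : lengths of the query and of the database searched
    lam, K : Karlin-Altschul parameters (must match the scoring system)
    """
    expect = K * m * n * math.exp(-lam * score)
    # P = 1 - exp(-E); expm1 keeps precision when E is very small
    p_value = -math.expm1(-expect)
    return expect, p_value


if __name__ == "__main__":
    # Placeholder values, for illustration only
    e, p = karlin_altschul(score=80, m=250, n=10_000_000, lam=0.3, K=0.1)
    print(f"E = {e:.3g}, P = {p:.3g}")
```

For small E the two numbers are nearly equal, which is why a tiny P
can be read as "about this many hits this good expected by chance".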
The recent review in Nature Genetics by Altschul and company
(6:119-129) is a good place to look for further reading on these
subjects.
Keith Robison
Harvard University
Department of Cellular and Developmental Biology
Department of Genetics / HHMI
krobison at nucleus.harvard.edu