In article <MacMS.25746.31051.brutlag at cmgm.stanford.edu> brutlag at CMGM.STANFORD.EDU (Douglas Brutlag) writes:
>Bill,
>> Isn't FASTA with optimization identical to the Smith-Waterman? The
>optimization step in FASTDB is precisely a Smith-Waterman scoring of the top
>5,000 sequences, and hence FASTDB with optimization is a Smith-Waterman
>analysis on those sequences. ...
No. FASTA restricts its optimization to a band 32 residues wide,
whereas Smith-Waterman uses both sequences in their entirety. FASTA
with ktup=1 and optimization is about 5-10 times faster than
Smith-Waterman, reflecting the fact that the average query sequence is
about 150-300 residues long, so a 32-residue band covers only a
fraction of the full comparison. With FASTA, you can either optimize
every sequence or optimize only those with a score greater than a
threshold; either method works as well as Smith-Waterman.
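To make the band-versus-full distinction concrete, here is a minimal
sketch in Python. It is not FASTA's code: the function name
local_score, the band and diagonal parameters, and the simple
match/mismatch/linear-gap scoring are all assumptions chosen for
illustration (the real programs use a substitution matrix and their own
gap penalties). With band=None every cell is computed, as in
Smith-Waterman; with band=32 only cells within 16 residues of the
chosen diagonal are computed.

```python
def local_score(a, b, match=1, mismatch=-1, gap=-2, band=None, diagonal=0):
    """Best local alignment score of a and b (illustrative scoring only).

    band=None computes every cell (full Smith-Waterman).
    Otherwise only cells with |(j - i) - diagonal| <= band // 2 are
    computed; cells outside the band are left at 0.
    """
    half = None if band is None else band // 2
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        if half is None:
            lo, hi = 1, len(b)
        else:
            lo = max(1, i + diagonal - half)
            hi = min(len(b), i + diagonal + half)
        for j in range(lo, hi + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a cell never drops below 0.
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


query = "GATTACA"
library = "C" * 40 + "GATTACA"
full = local_score(query, library)                          # all cells
off_band = local_score(query, library, band=32)             # band on diagonal 0
on_band = local_score(query, library, band=32, diagonal=40) # band on the match
```

Here full and on_band both come out as 7, while off_band is only 1: the
band finds the same score as the full calculation only when it is
placed over the diagonal that contains the match. The work per row is
bounded by the band width rather than by the length of the second
sequence, which is where the saving comes from.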
Regarding the gold standard: I work with as many
superfamilies as I can find, with several members of each superfamily
(some randomly chosen), and I do comparisons with Smith-Waterman.
Since I am trying to find sequences that share a common ancestor (and
thus have a common structure), I think false negatives are exactly
that. There is little evidence for common structural motifs that can
be recognized by sequence comparison in the absence of a common
ancestor. Most recently, I have moved from a "criterion" that is a
fixed function of the scores of the top-scoring unrelated sequences
(the Genomics paper) to one that balances the number of high-scoring
unrelated and low-scoring related sequences. This gives the same
results, but seems esthetically more pleasing.
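One way to read a criterion that "balances" the two kinds of error is
sketched below in Python. This is an assumption about the idea, not the
procedure actually used: the function name balanced_cutoff and the
tie-breaking rule are hypothetical, and the scores in the example are
made up. The sketch picks the score cutoff at which the number of
unrelated sequences scoring at or above it is as close as possible to
the number of related sequences scoring below it.

```python
def balanced_cutoff(related, unrelated):
    """Return (cutoff, high_unrelated, low_related).

    high_unrelated: unrelated scores >= cutoff (false positives).
    low_related:    related scores < cutoff (false negatives).
    The cutoff chosen makes the two counts as equal as possible;
    ties are broken by the smaller total number of errors.
    """
    best = None
    for cutoff in sorted(set(related) | set(unrelated)):
        high_unrelated = sum(s >= cutoff for s in unrelated)
        low_related = sum(s < cutoff for s in related)
        key = (abs(high_unrelated - low_related), high_unrelated + low_related)
        if best is None or key < best[0]:
            best = (key, cutoff, high_unrelated, low_related)
    return best[1], best[2], best[3]


# Made-up scores, for illustration only.
related_scores = [120, 95, 80, 60, 40]
unrelated_scores = [70, 55, 50, 45, 30, 25]
cutoff, n_high_unrelated, n_low_related = balanced_cutoff(
    related_scores, unrelated_scores
)
```

For these made-up scores the cutoff is 60, with one unrelated sequence
above it and one related sequence below it.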
I feel pretty uncomfortable with "motifs" that result from
convergence. I prefer to focus on common ancestry. For me, that
solves many of the problems you mention.
Bill