Someone posted a query recently about how to search for a
defined site (such as a restriction enzyme recognition site)
in a biological sequence, and the one reply I saw mentioned
simple scanning algorithms. I would just like to point out
that while such algorithms are both easy to write and commonly
used (every searcher I have written has used one), a more elegant
and sometimes faster algorithm is a Finite State Machine. In brief,
a Finite State Machine (FSM) improves on a simple scanning algorithm by
knowing when it can "skip ahead".
For example, suppose you wish to look for the site "AATTTCCC" in DNA.
Suppose a stretch of DNA contained the sequence "AATTTGG". A simple
scanner would compare (| - match, * - mismatch):
try #1 try #2 try #3
AATTTCCC AATTTCCC AATTTCCC
|||||* |* *
AATTTGG ATTTGG TTTGG
Now, we humans know that once #1 has failed, you can skip past the whole
AATTTGG segment because "ATTTGG" cannot overlap "AATTTCCC" in any way.
An FSM incorporates this idea, and thus can be faster.
If you can find the journal, a very good description of FSMs can be
found in:
Tyler, EC, MR Horton, and PR Krause. 1991. Comp.&Biomed Res. 24:72-96.
A review of algorithms for molecular sequence comparison.
Another relevant reference is:
Altschul, SF, W Gish, W Miller, EW Myers, and DJ Lipman. 1990. JMB 214:1-8.
Basic local alignment search tool.
as BLAST uses an FSM in the initial stages of the database search.
Keith Robison
Harvard University
Department of Cellular & Developmental Biology
Department of Genetics / HHMI
robison at ribo.harvard.edu
P.S. Apologies for not posting this as a follow-up -- I was rushing to
get a poster ready when the original showed up and the local
news server has apparently "expired" it.