In article <920629220811.21a51350 at sds.sdsc.edu> gribskov at SDSC.EDU (Michael Gribskov) writes:
>> It seems unlikely that such a matrix has been used with the
> Needleman-Wunsch algorithm. As you may recall, the NW algorithm (Needleman
> and Wunsch, J. Mol. Biol. 48, 443-453, 1970) does not use an affine gap
> cost, although they do suggest that the "penalty factor could be a function
> of the size and/or direction of the gap". NW requires a scoring table with
> all positive values since only the last row and column of the alignment
> matrix are examined for the maximum score. A scoring table with negative
> values is not guaranteed to give an optimum alignment with the NW
> algorithm. Note that needleman and Wuncsh refer to the position containing
> the maximum score as being in the first row or column since they build the
> alignment from N to C terminus, however they mean the last row or column
> calculated during the alignment. Since many sequences in the database have
> locally similar segments embedded in unrelated sequences (i.e. cases of
> partial homology or gene fusion), one wonders what kind of alignments would
> result.
===========================================
I think you should all know that what is called "Needleman & Wunsch" is
an application of dynamic programming. Wagner et al. (JACM 1974) also found
the same algorithm and named it "string correction problem". In parallel
the engineers discover it too and called it "Viterbi's algorithm"
(I think?). This is a case of an algorithm which is likely to be
rediscovered again and again and be renamed each time. I do not blame
the authors, other than for not being familiar with combinatorial algorithms.
But all these algorithms are "dynamic programming" algorithms (known
maybe since WWII).
Dynamic programming used to match two strings with a given "cost" matrix
and a deletion penalty function finds the maximum sum of costs of any
possible alignment of the two strings.
All of these variations are possible:
(a) to anchor the matching at both ends
(b) to leave one end anchored and find the optimal second end
(c) to find the overall optimal match of any two subsequences
(d) to specify a deletion cost depedent on each amino acid
being deleted
(e) to allow a linear gap penalty function (a+bk). (This is called
Gotoh's algorithm)
(f) the entries of the "cost" matrix are arbitrary real numbers
(no sign restriction whatsoever)
(g) you can minimize instead of maximizing the total cost.
(h) recover any one of the alignments which gives the optimal cost
(i) you can use an arbitrary concave gap-penalty function at
an increase in complexity (a log(n+m) factor, Webb and Myers).
All of these variations run in time proportional to m*n where
m and n are the lengths of the two sequences, and use space
proportional to min(m,n).
All these variations find the absolute maximum (ie. no approximations).
Notice that when the log-odds matrix is used, we maximize a sum of
logarithms of conditional probabilities, hence we maximize the
product of these probabilities, hence we find a maximum likelihood
alignment. People familiar with estimation theory will recognize
that this is an unbiased estimator.
In summary, I should have not used the term Needleman & Wunsch if you
are going to take such a narrow view of the algorithm. I should have
used the term "dynamic programming" (which is what I normally use).
Gaston H. Gonnet, Informatik, ETH, Zurich.