In article <3824F286.16121ACC at wh.math.uwaterloo.ca>,
hcwang at WH.MATH.UWATERLOO.CA (Huai-chun Wang) wrote:
> I wonder whether there is software that does multiple alignment of any
> plain text files in addition to DNA or protein sequence files. I tested
> ClustalW but it can only align legal DNA bases or amino acid residues,
> and ignores all other English characters and symbols that typically
> occur in an English text. Can anyone suggest such a program for this
> new need? Thank you.
Clustal can take external alignment matrices. You could create
one of your own (perhaps a simple "identity" style matrix) which
includes any characters you would like to consider valid.
I can't remember how Clustal handles invalid characters during
input (I believe they are simply ignored) so you would probably have
to change the source for that section.
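As a rough sketch of the "identity" matrix idea, the Python below
writes a square table with one score for a match and another for a
mismatch. The function name identity_matrix, the scores 1 and 0, and
the layout (a header row of symbols, then one labelled row per symbol)
are all assumptions for illustration; check the Clustal documentation
for the exact matrix file format it expects.

```python
def identity_matrix(alphabet, match=1, mismatch=0):
    """Return an identity-style scoring table as text (hypothetical layout)."""
    # Drop duplicate symbols while keeping their original order.
    symbols = list(dict.fromkeys(alphabet))
    # Column width: widest score plus one separating space.
    width = max(len(str(match)), len(str(mismatch))) + 1
    # Header row: a blank label column, then every symbol.
    rows = [" " + "".join(s.rjust(width) for s in symbols)]
    for a in symbols:
        # Score is `match` on the diagonal and `mismatch` elsewhere.
        cells = "".join(
            str(match if a == b else mismatch).rjust(width) for b in symbols
        )
        rows.append(a + cells)
    return "\n".join(rows)


if __name__ == "__main__":
    print(identity_matrix("ABCDEFGHIJKLMNOPQRSTUVWXYZ"))
```

Whitespace symbols would make this layout ambiguous, so they would
need to be mapped to some visible placeholder first.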
The standard UNIX text utility "diff" might prove useful in this
situation, but it is line-based and compares only two files at a time,
so you would have to place each "sequence" in a separate file with one
character per line. (I have never tried this approach.)
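The same character-by-character comparison can be sketched without
temporary files by using Python's difflib module in place of "diff"
(a swapped-in tool, not the UNIX utility itself). This is pairwise
only, and the function name align_pair and the "-" gap symbol are
assumptions for illustration.

```python
import difflib


def align_pair(a, b, gap="-"):
    """Pad two strings with gaps so matching characters line up."""
    # autojunk=False stops difflib from ignoring very frequent characters.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    out_a, out_b = [], []
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        seg_a, seg_b = a[i1:i2], b[j1:j2]
        # Pad the shorter segment so both outputs stay the same length.
        n = max(len(seg_a), len(seg_b))
        out_a.append(seg_a.ljust(n, gap))
        out_b.append(seg_b.ljust(n, gap))
    return "".join(out_a), "".join(out_b)


if __name__ == "__main__":
    top, bottom = align_pair("abcd", "abd")
    print(top)
    print(bottom)
```

If "-" can occur in the text itself, a different gap symbol should be
passed in so that gaps remain distinguishable.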
If the degree of "homology" is low you could try programs which
look for statistically significant groups of characters.
Such programs can operate without pre-determined matrices.
I have used MEME from SDSC to identify motifs in groups of
My personal preference would be to strip the sequences of invalid
characters (sed script) and then run Clustal with an identity
matrix. This would give an indication of similarity but would
not tell you how good the punctuation is in the individual files.
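The stripping step might look something like the following, written in
Python rather than sed. Treating only the ASCII letters as valid is an
assumption, as is the function name strip_invalid; adjust the character
class to whatever the alignment program and matrix actually accept.

```python
import re


def strip_invalid(text, valid="A-Za-z"):
    """Remove every character that is not in the `valid` character class."""
    # `valid` is spliced into a negated regex class, e.g. [^A-Za-z].
    return re.sub(f"[^{valid}]", "", text)


if __name__ == "__main__":
    print(strip_invalid("Hello, world! It's 1999."))
```

Note that this also removes spaces and digits, so word boundaries are
lost along with the punctuation.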
Bernard P. Murray, PhD
bpmurray at cgl . ucsf . edu
Department of Cellular & Molecular Pharmacology, UCSF