short patterns

David Mathog mathog at caltech.edu
Mon May 6 10:49:01 EST 2002

bakheet wrote:
> Hello Everybody,
>  I am writing to you to seek your help in solving my problem.  I have two
> sets of sequences each is 77 sequences (ranging from 40 b.p to 5000 b.p.),
> and I am trying to find out a short pattern that is found in set 1 but not
> in set 2 (or the other way) .  Does HMMR  do the job ? If not which program
> can help knowing that  I am getting errors if I  align them. Could you
> please help in solving this problem.

If by "short pattern" you include fixed sequences, such as ACCGGT then
you can use the EMBOSS program wordcount.  It will list all 6-tuples
(for instance) which you could then sort and compare for differences using
a spreadsheet.  Odds are your pattern isn't that well defined but
if it's nearly that well defined you might be able to spot it.  You'll
be amazingly lucky if your pattern turns out to be a simple fixed tuple though!
(Sorry, I can't provide you with a similar GCG solution.  I do have one,
but under the GCG license terms I cannot distribute the GCG-based
program I wrote that has similar functionality.)
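
For anyone who would rather script the comparison than use a spreadsheet, here is a minimal Python sketch of the same idea. It is not wordcount and not the GCG program mentioned above. The function names and the toy sequences are made up for illustration, and it assumes you have already read your two sets into lists of strings. Note that it counts how many sequences in each set contain a given tuple at least once, rather than total occurrences, since that seemed closer to "found in set 1 but not in set 2".

```python
from collections import Counter


def kmer_presence(seqs, k=6):
    """Count, for each k-tuple, how many sequences contain it at least once."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        # Use a set so a tuple is counted once per sequence, not once per hit.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def differential_kmers(set1, set2, k=6):
    """Return (tuple, n_in_set1, n_in_set2), largest difference first."""
    c1 = kmer_presence(set1, k)
    c2 = kmer_presence(set2, k)
    rows = [(w, c1[w], c2[w]) for w in set(c1) | set(c2)]
    # Sort by absolute difference, then alphabetically for a stable order.
    return sorted(rows, key=lambda r: (-abs(r[1] - r[2]), r[0]))


# Hypothetical toy data; replace with your own two sets of sequences.
set1 = ["TTACCGGTAA", "GGACCGGTCC"]
set2 = ["TTAAGGCCAA", "GGTTAACCGG"]

for word, n1, n2 in differential_kmers(set1, set2)[:5]:
    print(word, n1, n2)
```

Tuples near the top of the list with a high count in one set and zero in the other are the candidates worth a closer look.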

MEME may work if you're using a (recent) version of GCG that includes it.
Unless you know a priori that the signal is always in a fixed orientation, and
you have all sequences in that orientation, you're going to have to do some
extra work.  To search both strands one must specify -TWOStrands and that requires
that you also use -ONEEXactly.  Unfortunately -ONEEXactly requires that this
pattern really be present once in every DNA sequence.  Odds are exceedingly poor
that that will be true for any real pattern (unless you already knew what it
was, in which case you wouldn't have posted here!).  So you'll probably have to work
around this by making a palindrome out of each of your sequences (use
REVERSE and then append the complement to the original).  Then you can use
the -ZEROORMore model on the palindrome based data set.  You could also
just use the database of forward and reverse sequences separately, but
then odds are high that at least 50% of your database sequences will not
have the pattern. Anyway, the -ZEROORMore model will most likely
reflect the data you have.
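
If you would rather build the palindrome data set outside of GCG, the step can be sketched in a few lines of Python. This is only an illustration of the trick described above, not a GCG or MEME feature. The function names are made up, it assumes plain A/C/G/T sequences held as strings, and the optional spacer argument is my own addition for anyone who wants to separate the two halves.

```python
# Translation table for complementing DNA; other characters pass through as-is.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def both_strands(seq, spacer=""):
    """Append the reverse complement to the original sequence.

    The result carries both orientations in one record, so a pattern on
    either strand of the original shows up somewhere in the forward strand
    of the combined sequence.
    """
    return seq + spacer + reverse_complement(seq)


# Hypothetical example sequence.
print(both_strands("AAGC"))
```

Keep in mind that every pattern will then appear at least twice in a combined sequence that contains it (once per half), which is another reason to use the -ZEROORMore model on this data set.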

If you don't have MEME try the NCBI's Gibbs program which is in some ways
similar to it.   I've never used Gibbs for nucleic acids though and
I'm not sure from looking at the documentation if it will consider
both strands of a sequence, so you may need to use the palindrome trick
here as well.


David Mathog
mathog at caltech.edu

More information about the Info-gcg mailing list
