Question about Mulfold (M. Zuker et al.'s)

Eddy Sean sre at al.cam.ac.uk
Tue Dec 13 08:13:36 EST 1994

In article <ehom1-071294143620 at mac04.parrishd.swarthmore.edu> ehom1 at cc.swarthmore.edu (Erik Forbes Y. Hom '95) writes:
  >... I, however, have a
  >sequence of ~4000 that I want to fold to see if the start codon or SD
  >domain could potentially interact with bases more distant than the
  >neighboring ~150 bases...
  >domains on this interaction).  Is there any way I could do this without
  >constructing some chimeric sequence between the initiation region and the
  >distant region of interest (I don't think the results I would get from this
  >would be very realistic)?  Do you think it is reasonable
  >(thermodynamically) for such a distant region (say ~500 bases away) to
  >base pair with the initiation region?  (I can't, for certain, justify one
  >or the other because I can't do the folding!).  What are the limitations
  >for the UNIX version??  Help!  Thanks for your time!

First off, RNA folding programs are not accurate (moreover, their
accuracy falls off rapidly with sequence length); so if you do do
this, view your results with extreme caution (i.e., as a guide to
further experiment or wild speculation, not as gospel truth).
Thermodynamics-based RNA folding programs generally predict a large
number of very different alternative structures within 10% of the
global energy minimum.  Between the error bars on the thermodynamic
parameters, the approximations made in doing the fold prediction, and
what we don't know about RNA folding, it is not possible to
distinguish between these alternatives without experimental or
comparative sequence data.

That said, UNIX versions of the Zuker program should be able to fold
4000 bases without too much trouble if you have a lot of memory and
don't mind waiting for a while. The algorithm runs in time
proportional to N^3, where N is the length of the sequence, and in
memory proportional to N^2. Though I'm not familiar with the internals
of the GCG Zuker implementation (it may have limits imposed on N, for
instance), I'd guess that you'll need something like 96 MB of RAM on your
machine (a high-end UNIX box) -- I think the suboptimal algorithm
keeps three NxN matrices of 2-byte short ints, but I could be off by
two-fold either way. As for time, I'm not sure, but I'd guess between
10 and 100 hours on a workstation, based on the supercomputer
benchmarks I've seen for the Zuker algorithm (which range from 1 to 3
hours for a 4000-mer).
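
To make the arithmetic behind those guesses explicit, here is a minimal
Python sketch. The function names are made up for illustration, and the
layout (three full NxN matrices of 2-byte integers) is only the guess
stated above, not a confirmed detail of any Zuker implementation. The
timing helper assumes nothing beyond pure N^3 scaling from a known
benchmark.

```python
def dp_memory_bytes(n, matrices=3, bytes_per_cell=2):
    """Bytes for `matrices` full N x N tables of fixed-width integers.

    The defaults (3 matrices, 2-byte cells) are assumptions taken from
    the guess in the text, not from any program's source code.
    """
    return matrices * n * n * bytes_per_cell


def scaled_hours(known_hours, known_n, n):
    """Extrapolate a run time from one benchmark, assuming cost ~ N^3."""
    return known_hours * (n / known_n) ** 3


if __name__ == "__main__":
    n = 4000
    mb = dp_memory_bytes(n) / 1e6
    # "Off by two-fold either way" gives a plausible range around the guess.
    print(f"N = {n}: about {mb:.0f} MB (range {mb / 2:.0f} to {mb * 2:.0f} MB)")
    # Doubling the sequence length multiplies the run time by 2^3 = 8.
    print(f"Time factor for 2N versus N: {scaled_hours(1.0, n, 2 * n):.0f}x")
```

For N = 4000 the memory estimate comes to 3 x 4000 x 4000 x 2 =
96,000,000 bytes. The same scaling says a sequence twice as long needs
four times the memory and eight times the run time.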

And yes, long-distance pairings are reasonable. Look at the consensus
secondary structures for ribosomal RNA, for instance.  I'd guess there
probably needs to be a lot of structure in between, so you don't pay
some horrendous entropic cost for tying down the ends of a big stretch
of sequence.

- Sean Eddy
- MRC Laboratory of Molecular Biology, Cambridge, England
- sre at mrc-lmb.cam.ac.uk

More information about the Info-gcg mailing list
