Scripted Directed Assembly Problem

jkb at mrc-lmb.cam.ac.uk
Tue Apr 2 08:07:08 EST 2002

In <3CA4EBA8.4D9ADB9 at genome.wi.mit.edu> major <major at genome.wi.mit.edu> writes:

> I'm having a problem getting a VERY large directed assembly to build
> automatically.
> We use Staden.2000.0 currently.
> I have 73,574 reads which comprise 27 contigs in an assembly.  When I
> run the directed assembly graphically from gap4, the gapDB is built, but
> painfully slowly (I've never let it run to completion on this large data

Indeed, it used to be very slow. However, it turned out that nearly all the
time was being spent working out which readings already existed in the
database. This was rather easy to solve.

Consequently the 2001.0 release is now MUCH faster!

Also note that if you set the maximum percentage mismatch for directed
assembly to zero then the consensus is not continuously recomputed (a zero
percentage mismatch is used as a marker for skipping the match check). This
is only relevant if all your alignments have been performed by an external
assembly engine.

> set).  When I use a modified assemble4
> script (http://www-genome.wi.mit.edu/personal/major/assemble4), I get
> this error:

Ideally you ought to use stash now and use "load_package gap" instead of
gap4sh. (Gap4sh has gone from the latest release.)

> Processing number 7995: G59P61559FC1.T0
> Fri 29 Mar 14:27:21 2002 SYSMSG : No such file or directory [2]
> Fri 29 Mar 14:27:21 2002 ERROR  : invalid type [1001]
> Fri 29 Mar 14:27:21 2002 COMMENT: reading record 0
> Fri 29 Mar 14:27:21 2002 FILE   : gap-io.c:171

Record 0 is the main 'database structure'. It is often read, but typically
errors reading this imply that an attempt was made to read a non-existent
record. Zero is used in fields containing record numbers (e.g. sequence,
reading name, annotations, etc) to indicate that no record exists.

Why this happens for you I do not know. Perhaps it's a bug (in our code)
caused by a missing field in an experiment file.

> *Note* when run on assemblies with < 8000 reads, this builds a valid
> gapDB with no problems.
> When running the directed assembly via the gap4 GUI, I start gap4 with
> -maxseq 2100000 -maxdb 100000, then create a new DB and start the
> Directed assembly.  I've let it work to read number 20,000 before
> quitting the program (very, very slow).

This does sound like a maxdb issue then. 8000 is indeed the default maxdb
value. Within a script simply do:

global maxseq maxdb
set maxseq 2100000
set maxdb   100000
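
If you would rather not hard-code maxdb, something along these lines might
do as a rough sketch. It is plain Tcl only and uses no gap4 commands. It
assumes that maxdb has to cover the number of readings plus contigs; the
helper names (count_reads, choose_maxdb), the file-of-filenames input and
the 25% headroom are all made up for illustration and are not part of the
package.

```tcl
# Hypothetical helper: count the non-empty lines in a file of filenames.
proc count_reads {fofn} {
    set fd [open $fofn r]
    set n 0
    while {[gets $fd line] >= 0} {
        if {[string length [string trim $line]] > 0} {
            incr n
        }
    }
    close $fd
    return $n
}

# Hypothetical helper: pick a maxdb above readings + contigs.
proc choose_maxdb {nreads ncontigs} {
    set need [expr {$nreads + $ncontigs}]
    # 25% headroom is an arbitrary assumption, not a gap4 rule.
    set padded [expr {int(ceil($need * 1.25))}]
    # Never go below the 8000 default mentioned above.
    return [expr {$padded > 8000 ? $padded : 8000}]
}

global maxseq maxdb
set maxseq 2100000
# 73574 reads and 27 contigs are the figures quoted in the question.
set maxdb [choose_maxdb 73574 27]
```

With a real file of reading names you would pass [count_reads $fofn] as the
first argument instead of the literal count.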

James Bonfield (jkb at mrc-lmb.cam.ac.uk)   Fax: (+44) 01223 213556
Medical Research Council - Laboratory of Molecular Biology,
Hills Road, Cambridge, CB2 2QH, England.
Also see Staden Package WWW site at http://www.mrc-lmb.cam.ac.uk/pubseq/
