DNA sequence assembly - does it make sense to display traces?

mathog at seqaxp.bio.caltech.edu
Fri Oct 20 19:38:03 EST 1995

We've got Xbap on a Unix machine here, and DNAStar and Sequencher running
on a couple of Macs in other labs.  There are other similar programs
around. All of them keep (some variation of) the DNA trace information
around for reference during final editing.

There are reasons why one would like to eliminate trace storage.  The trace
information typically takes up 20 to 100 times the storage space of the
sequence from that particular run, and in a big sequencing project, the
need to have all of these traces around results in a requirement for an
exceedingly large amount of online storage.  This is because there may be 5
to 10 sequences overlapping in each region.  (Some methods use fewer.)  One
extreme case concerns a 900 kb region, where the person assembling it
claims to be occupying 3 gigabytes of disk space (thankfully on somebody
else's machine!)  In this particular effort, the disk storage per final
output base apparently varies somewhat from region to region, but runs 
between 3000 and 5000 bytes per base.  (Other large scale sequencers please
chime in with your particular data.)  95 to 99 percent of this is trace data.

The question is, does keeping all of these traces around really make sense?

Naively, the response is yes, of course you need the trace so that you can
figure out which of several traces is correct if there is a conflict.

But wait, what does it mean to "figure out which trace is correct"?  I've 
watched a bunch of people doing this, and they all do it pretty much the
same way.  They put up the traces in question ONLY when they get a mismatch
in the sequences, and make a judgement of the quality of one trace versus
another.  When it comes down to "good trace" versus "poor trace" they
(reasonably) consider the problem resolved in favor of the data in the
former.  When it comes down to "poor" versus "poor" they *may* decide one
is better than the other, but there is considerably less confidence in the
final call at that position. 

If all the users are doing with that trace information is classifying one
trace as better than the other, shouldn't they be able to do that at the
beginning, and just store that classification?  Whatever the classification
method, it should only require a couple of bytes per base per trace to hold
the result - which would result in significant disk space savings. 
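To put rough numbers on that saving, here is a back-of-envelope sketch in
Python.  The 900 kb, 3 gigabyte, and 5 to 10 fold overlap figures come from
the example above; reading "3 gigabytes" as decimal, and charging one byte for
the called base plus two for its score, are my assumptions.

```python
# Rough storage estimate.  Constants are from the post or labeled assumptions.
REGION_BASES = 900_000        # the 900 kb region mentioned above
DISK_BYTES = 3_000_000_000    # "3 gigabytes", read as decimal GB (assumption)
CALL_BYTES = 1                # one byte per called base (assumption)
SCORE_BYTES = 2               # "a couple of bytes per base per trace"


def bytes_per_final_base_with_traces():
    """Disk bytes per finished base when all traces stay online."""
    return DISK_BYTES / REGION_BASES


def bytes_per_final_base_scores_only(coverage):
    """Disk bytes per finished base if only calls and scores are kept."""
    return coverage * (CALL_BYTES + SCORE_BYTES)


print(f"with traces: about {bytes_per_final_base_with_traces():.0f} bytes/base")
for coverage in (5, 10):
    per_base = bytes_per_final_base_scores_only(coverage)
    print(f"scores only, {coverage}x overlap: {per_base} bytes/base")
```

Under these assumptions the scores-only figure is tens of bytes per final
base, against a few thousand with the traces kept online.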

So, back to the storage problem.  First pass - put the traces through some
scoring algorithm that assigns each called base a "cleanliness score". Then
remove the trace information from the disk.  (Yes, after archiving it!). 
Throw the sequences into the assembly process.  The consensus base called
is determined by a nonlinear weighting of the "cleanliness scores" - a
hundred awful traces don't hold the same information as one clean one. In the end
there will be bases that aren't determined, but that will be because all of
the traces are in some sense "poor" - and they would likely not have been
determined, or determined correctly,  by a human assembler either. This
method will also throw out some regions where all the traces were poor but
the bases called happened to agree.  Most human assemblers don't do that,
but probably they should. Note also that at least the final base calling
here is fully automated - no humans have to make quality calls on anything
at any point. 
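As one possible illustration of such a nonlinear weighting (a sketch only,
not a tested method): take each score to lie between 0 and 1, raise it to a
power before summing per base, and refuse to call a base unless the winner
has enough total weight and a clear margin.  The function name, the exponent,
and both thresholds are made up for the example.

```python
from collections import defaultdict


def consensus_base(calls, exponent=4.0, min_weight=0.5, min_margin=2.0):
    """Pick a consensus base for one position.

    calls: list of (base, score) pairs, score in [0, 1] (hypothetical scale).
    Returns the winning base, or "N" when the evidence is too weak.
    """
    if not calls:
        return "N"
    weights = defaultdict(float)
    for base, score in calls:
        # The power makes poor scores count for very little, so many bad
        # traces cannot add up to one clean one.
        weights[base] += score ** exponent
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    best_base, best_weight = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if best_weight < min_weight:
        return "N"  # every trace is poor, even if they happen to agree
    if runner_up > 0 and best_weight / runner_up < min_margin:
        return "N"  # unresolved conflict between comparable traces
    return best_base


print(consensus_base([("A", 0.95)]))                      # one clean trace
print(consensus_base([("C", 0.2)] * 100))                 # a hundred awful ones
print(consensus_base([("A", 0.95)] + [("C", 0.3)] * 3))   # good versus poor
```

With these particular numbers the single clean trace is called, the hundred
awful ones are left undetermined, and the good trace beats three poor ones
that disagree with it.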

The hard part of this is obviously putting together the scoring method that
mimics the judgements of a human assembler.  Presumably this is within the
realm of rule-based systems or neural nets.  In any case, with the large 
sequencing projects that are available now, it should be relatively easy to
test the results of the various scoring and weighting methods against the
sequence produced by the human assemblers.  Explicitly, assume that the
contig assembly order is correct.  So: 

  1.  Put all traces used through the scoring method under test, generate
      "scored" bases.
  2.  Output final bases based on "score weighting" scheme under test and
      the contig information from the reference assembly.
  3.  Calculate mismatches against the known sequence.

This should run pretty quickly and scale well - the number of operations
being roughly proportional to the total assembled sequence length.
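The three steps above might be sketched like this.  Everything here is a
simplifying assumption for illustration: reads are placed on the reference by
a plain offset with no gaps, the "trace" is whatever the scoring function
under test knows how to read, and the toy scoring and calling functions at
the bottom merely stand in for the real methods being compared.

```python
def evaluate(reads, reference, score_fn, call_fn):
    """Return (called sequence, mismatch count) for one scoring/calling pair.

    reads: list of dicts with "offset", "bases", "trace" (hypothetical layout).
    score_fn(read) -> one score per base; call_fn(column) -> base or "N".
    """
    columns = [[] for _ in reference]
    for read in reads:
        scores = score_fn(read)                    # step 1: score the trace
        for i, (base, score) in enumerate(zip(read["bases"], scores)):
            pos = read["offset"] + i               # placement from reference assembly
            if 0 <= pos < len(reference):
                columns[pos].append((base, score))
    called = "".join(call_fn(col) for col in columns)   # step 2: call bases
    mismatches = sum(1 for c, r in zip(called, reference) if c != r)  # step 3
    return called, mismatches


def toy_score(read):
    """Stand-in scorer: the toy 'trace' already holds per-base scores."""
    return read["trace"]


def best_score_call(column):
    """Stand-in caller: take the base with the single highest score."""
    if not column:
        return "N"
    return max(column, key=lambda pair: pair[1])[0]


reference = "ACGT"
reads = [
    {"offset": 0, "bases": "ACGA", "trace": [0.9, 0.9, 0.9, 0.2]},
    {"offset": 1, "bases": "CGT", "trace": [0.8, 0.7, 0.9]},
]
called, mismatches = evaluate(reads, reference, toy_score, best_score_call)
print(called, mismatches)
```

Each base of each read is visited once, so the work grows with the total
amount of read sequence.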

Anybody out there doing this???


David Mathog
mathog at seqaxp.bio.caltech.edu
Manager, sequence analysis facility, biology division, Caltech 
