DNA sequence assembly - does it make sense to display traces?

Tim Littlejohn tim at megasun.BCH.UMontreal.CA
Wed Oct 25 05:45:49 EST 1995

In article <469fdb$p89 at gap.cco.caltech.edu> mathog at seqaxp.bio.caltech.edu writes:
>We've got Xbap on a Unix machine here, and DNAstart and Sequencher running
>on a couple of Macs in other labs.  There are other similar programs
>around. All of them keep (some variation of) the DNA trace information
>around for reference during final editing.
>So, back to the storage problem.  First pass - put the traces through some
>scoring algorithm that assigns each called base a "cleanliness score". Then
>remove the trace information from the disk.  (Yes, after archiving it!). 
>Throw the sequences into the assembly process.  The consensus base 
>called is a nonlinear weighting of the "cleanliness scores" - a hundred
>awful traces doesn't hold the same information as one clean one.

In the pregap process in the Staden package, each base can be given a
"quality assignment" which is, as I understand it, exactly what you are
referring to. This quality information is stored as a matrix in the Staden
experiment file format that gives (as a %) the confidence in each base.
"Quality" can then be used to identify potential problems in the contig
editor by simply adjusting the quality cutoff.
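
To illustrate the cutoff idea, here is a minimal sketch in Python. It is not
Staden code: the function name and the per-base confidence list are made up
for the example, and I am assuming confidences are plain percentages.

```python
def low_quality_positions(confidences, cutoff):
    """Return 0-based positions whose confidence (in %) is below the cutoff.

    `confidences` is a hypothetical list with one percentage per called base.
    """
    return [i for i, c in enumerate(confidences) if c < cutoff]


if __name__ == "__main__":
    # Made-up confidences for a nine-base read.
    conf = [98, 95, 60, 99, 10, 97, 45, 92, 99]
    # Raising the cutoff flags more positions for a look at the trace.
    print(low_quality_positions(conf, 50))
    print(low_quality_positions(conf, 70))
```

Raising the cutoff simply widens the set of positions a user would be asked
to check against the trace.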

This system is used primarily to identify regions that the user may want
to double check at the trace level. I don't think there is any new function
to "vote" against "low-quality" sites where they differ from "high-quality"
ones in gap/xgap/gap4. I guess it would be relatively trivial for the
developers to add this. I for one would want any such automatically converted
bases marked with a special tag type, however!
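
A rough sketch of such a "vote" for one aligned column, in Python. Nothing
here is taken from gap/xgap/gap4: the weighting function (log-odds relative
to a random guess among four bases), the default cutoff of 50, and all names
are assumptions chosen only to show the shape of the idea, including the
list of overruled calls that would need tagging.

```python
import math
from collections import defaultdict


def weight(conf_pct):
    """Hypothetical nonlinear weight for one base call.

    Log-odds of the call being right, relative to a 25% random guess.
    Calls at or below chance contribute nothing.
    """
    p = min(max(conf_pct / 100.0, 0.0), 0.999)
    if p <= 0.25:
        return 0.0
    return math.log(p / (1.0 - p)) - math.log(0.25 / 0.75)


def vote_column(calls, cutoff=50):
    """Pick a consensus base for one column of (base, confidence %) calls.

    Returns (consensus, overruled), where `overruled` lists the indices of
    low-quality calls that disagree with the consensus. Those are the calls
    one would want marked with a special tag rather than silently changed.
    """
    if not calls:
        raise ValueError("empty column")
    totals = defaultdict(float)
    for base, conf in calls:
        totals[base] += weight(conf)
    # Ties go to whichever base was seen first in the column.
    consensus = max(totals, key=totals.get)
    overruled = [i for i, (base, conf) in enumerate(calls)
                 if base != consensus and conf < cutoff]
    return consensus, overruled


if __name__ == "__main__":
    # One clean "A" against three poor "G" calls (made-up numbers).
    print(vote_column([("A", 99), ("G", 30), ("G", 28), ("G", 35)]))
```

High-quality calls that disagree with the consensus are deliberately not
listed as overruled; a real conflict like that still deserves a look at the
traces.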

I can still hear users crying for the original data (traces) if an approach
like this is used, however. Rather than try to avoid the problem of storing
trace data completely, I would suggest storing traces only in areas that
are highly likely to be used for confirming a sequence. This of course would
require prior knowledge of the assembly, so the archiving of non-interesting
(i.e. unambiguous) traces would have to be left to the assembly program. 

The assembly process would then have to cope with concepts of strand coverage
as well as quality (i.e. only archive traces that contain no ambiguities in
their un-clipped regions and that span both strands). Once ambiguities are
resolved, traces could be archived too. The contig editor could handle this,
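
The archiving rule above might be sketched like this in Python. This is one
reading of it, not an existing assembler feature: I assume "span both
strands" means every position of a read's un-clipped region is also covered
by at least one read on the opposite strand, and the `Read` fields and
function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Read:
    name: str
    strand: str      # "+" or "-"
    start: int       # un-clipped region in contig coordinates, inclusive
    end: int         # exclusive
    ambiguous: bool  # True if any un-clipped base is ambiguous


def archivable(reads):
    """Names of reads whose traces could be moved off-line.

    A read qualifies if its un-clipped region has no ambiguities and is
    fully covered by reads on the opposite strand.
    """
    covered = {"+": set(), "-": set()}
    for r in reads:
        covered[r.strand].update(range(r.start, r.end))
    names = []
    for r in reads:
        opposite = "-" if r.strand == "+" else "+"
        both_strands = all(p in covered[opposite]
                           for p in range(r.start, r.end))
        if not r.ambiguous and both_strands:
            names.append(r.name)
    return names


if __name__ == "__main__":
    reads = [
        Read("f1", "+", 0, 100, False),
        Read("r1", "-", 0, 100, False),
        Read("f2", "+", 100, 200, False),  # no opposite-strand coverage
        Read("r2", "-", 300, 400, True),   # still has an ambiguity
    ]
    print(archivable(reads))
```

Re-running the check after an ambiguity has been resolved would let the
remaining traces be archived in turn.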



Tim Littlejohn- Organelle Genome Megasequencing Program (OGMP)

Snail Mail: Departement de biochimie        Phone: (514) 343-6111, x5149
            Universite de Montreal          Fax:   (514) 343-2210 
            C.P. 6128, succursale Centre-ville
            Montreal (Quebec), H3C 3J7

More information about the Bio-soft mailing list
