Dear all,
I am really pleased to announce that the MIRA 2.9.15 sequence assembler is
available: http://chevreux.org/projects_mira.html
This is the second version of the MIRA assembler that is accessible to a
broad public and which is able to assemble (de-novo and mapping) sequences
gained through Roche instruments like the GS20 and GS FLX.
Compared to 2.9.8, MIRA 2.9.15 got faster, less memory hungry and more
accurate when working with 454 data. Additionally the extensive examples
and a tutorial on how to start 454 or 454 / Sanger hybrid assemblies have
been reworked.
The 454 assembler, Newbler, is comparatively light on memory and quite fast
(embarrassingly so). However, there are a few things that might count in
favour of MIRA:
- MIRA also uses and assembles repetitive areas
- MIRA can correctly disambiguate repeats based on error pattern analysis.
One base difference is enough for this.
- MIRA does not cut reads into parts and scatter those parts all over
different contigs.
- MIRA allows hybrid assemblies in which discrepancies between sequencing
methods are readily tagged for visual inspection
- MIRA builds fewer contigs, which are longer and cover more of the target
genome
From the example assembly of S. pneumoniae TIGR4 with data available from the
NCBI:
Reference genome: 2160842 bases (GenBank: AE005672.2; GI:85720550)
                          MIRA      454 pub*  Newbler (1.1.02.15)
                          -------   -------   -------
# Contigs >= 500 bases    109       218       264
Bases in contigs >= 500   2141384   2016795   2003320
N50                       39183     14589     12074
N90                       11597     4525      3875
N95                       7660      2882      2562
MIRA has at most half the number of contigs, and judged by N50 these are
~3 times longer than the ones from Newbler. In addition, MIRA included
~125 kb of repetitive sequence which Newbler left out.
Note: "454 pub" is the data set referenced in the Nature article by Margulies
et al. (GenBank: AAGY02000000, GI:110677268), which was assembled with an
early version of Newbler. Newbler 1.1.02.15 is the current version from the
Roche Off-Instruments package from June 2007.
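For readers unfamiliar with the N50, N90 and N95 rows in the table above,
here is a minimal sketch of how such values are commonly computed from a list
of contig lengths. It is not taken from MIRA or Newbler. The function name
`nx` is made up for illustration, and it assumes the percentage is taken
relative to the sum of the contig lengths passed in (not the reference genome
size), which this announcement does not specify.

```python
def nx(lengths, x):
    """Return the Nx value of an assembly.

    Nx is the contig length L such that contigs of length >= L together
    contain at least x percent of all bases in `lengths`.
    Assumption: the total is the sum of the given contig lengths.
    """
    ordered = sorted(lengths, reverse=True)  # longest contigs first
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating point rounding issues.
        if running * 100 >= total * x:
            return length
    raise ValueError("no contigs given")


# Toy example with five contigs (30 bases in total).
toy_contigs = [10, 8, 6, 4, 2]
n50 = nx(toy_contigs, 50)
n90 = nx(toy_contigs, 90)
```

A higher N50 at a similar number of assembled bases means the same sequence
is spread over fewer, longer contigs.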
MIRA is furthermore able to perform true hybrid sequence assembly. That is,
instead of assembling the consensus of 454 data with Sanger reads, MIRA
assembles the 454 reads themselves together with Sanger reads. An example of
how this looks when assembled against a backbone is shown at
http://chevreux.org/mira_ex_454sanger.html, where one can also see how a
hybrid strategy helps to overcome the sequencing errors that are typical of
either technology.
As known from the earliest MIRA versions since 1999 (see
http://www.bioinfo.de/isb/gcb99/talks/chevreux/), the repeat resolving
algorithms are able to (more or less cleanly) separate reads from different
locations as long as there is at least 1 base differentiating the reads of
the different repeat copies. This should alleviate the repeat problem
somewhat.
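To illustrate the idea of separating repeat copies by a single differing
base, here is a toy sketch. It is NOT MIRA's actual algorithm. It assumes the
reads are already aligned and of equal length, and that a difference counts
as a real repeat marker (rather than a sequencing error) when each of two
bases in a column is seen in at least `min_support` reads. The function name
and the threshold are invented for this example.

```python
from collections import Counter, defaultdict


def split_by_discriminating_column(reads, min_support=2):
    """Split pre-aligned, equal-length reads at the first column where at
    least two different bases are each supported by >= min_support reads.

    Returns (column_index, {base: [reads]}), or (None, {}) if no such
    column exists. Toy illustration only, not MIRA's method.
    """
    if not reads:
        return None, {}
    for col in range(len(reads[0])):
        counts = Counter(read[col] for read in reads)
        supported = [base for base, n in counts.items() if n >= min_support]
        if len(supported) >= 2:
            groups = defaultdict(list)
            for read in reads:
                groups[read[col]].append(read)
            return col, dict(groups)
    return None, {}


# Five reads from two repeat copies that differ in one base (column 3).
reads = ["ACGTAC", "ACGTAC", "ACGAAC", "ACGAAC", "ACGTAC"]
column, groups = split_by_discriminating_column(reads)
```

A difference seen in only one read stays below the threshold and is treated
as a likely sequencing error, so the reads are not split.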
Please note that 2.9.15 is still in development, though, and not all of its
algorithms are fully optimised. Therefore, MIRA 2.9.15 should NOT be used
for production assemblies but rather as a testing version to gather feedback
from parties interested in hybrid assembly strategies. Also, one needs a
fast machine and quite a lot of memory. As a rough estimate: per 100 MB of
raw data one needs some 4-5 GB RAM, ~15 GB disk and ~21 hrs of computation
time.
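As a quick way to apply the rough estimate above to one's own data set, here
is a small sketch. It assumes the requirements scale linearly with the amount
of raw data, which is only an approximation and is not guaranteed by MIRA.
The function name is invented for this example.

```python
def estimate_resources(raw_data_mb):
    """Scale the rough per-100-MB figures from this announcement.

    Assumption: linear scaling with raw data size.
    Returns RAM as a (low, high) range in GB, disk in GB and time in hours.
    """
    scale = raw_data_mb / 100.0
    return {
        "ram_gb": (4 * scale, 5 * scale),  # "some 4-5 GB RAM"
        "disk_gb": 15 * scale,             # "~15 GB disk"
        "hours": 21 * scale,               # "~21 hrs"
    }


# Example: a project with 250 MB of raw data.
estimate = estimate_resources(250)
```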
Versions of MIRA 2.9.15 for 64bit and 32bit Linux machines can be downloaded
from http://chevreux.org/mira_downloads.html
Regards,
Bastien