In article <mailman.111.1182964648.11350.comp-bio from net.bio.net>,
Chris Hoffmann <hoffmanc from mail.med.upenn.edu> wrote:
>I was wondering about DNADIST, from the PHYLIP package.
>I am conducting a big sequencing project and there will be several phases. I
>would like to construct a distance matrix using DNADIST with an initial
>dataset and later on only add more sequences to the set. But I didn't want
>to have to re-run the program with all the sequences again. Is there a way
>to only insert the new data into the matrix?
>For example:
>initially I want to calculate the distances among the sequences in
>group A;
>then when I get group of sequences B, calculate the distances within
>sequences in group B;
>and calculate the distances between sequences in group A and B without
>having to re-calculate the distances for group A again.
>This is a simple example, I am actually likely to have 5 or more sets of
>sequences, ranging from 5000 to 20000 sequences per group (perhaps more).
>I realize I may have to adapt the code (another issue entirely) but what I
>am concerned about is whether the methods used by DNADIST give reliable
>results if I calculate them in this fashion.
1. Dnadist will not add new distances to an existing matrix; it recomputes
all of the distances each time it is run.
2. In any case, for the F84 distances the formulas use the base frequencies
found (empirically) in the input sequences. If you add more input
sequences you will most likely have slightly altered empirical frequencies,
so you would want to recompute the original distances anyway.
3. I suspect our formulas can compute this many distances, but
4. With 20,000 sequences there are 400,000,000 distances in all (20,000 x
20,000) which, if each is about 10 bytes long, makes a table 4 GB in size.
That is too big to use. You ought therefore to reconsider your motivation
for doing this.
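
Points 2 and 4 can be checked with a few lines of Python. This is only a
rough sketch and is not taken from the Dnadist source: the function names
and the toy sequences are made up for illustration, and the 10 bytes per
entry is the same rough guess used above.

```python
from collections import Counter

def base_frequencies(seqs):
    """Empirical A, C, G, T frequencies pooled over all sequences.
    (Hypothetical helper; not Dnadist's own code.)"""
    counts = Counter()
    for s in seqs:
        # Count only unambiguous bases; gaps and ambiguity codes are skipped.
        counts.update(b for b in s.upper() if b in "ACGT")
    total = sum(counts.values())
    return {b: counts[b] / total for b in "ACGT"}

def full_table_bytes(n_seqs, bytes_per_entry=10):
    """Approximate size of a full square n x n distance table."""
    return n_seqs * n_seqs * bytes_per_entry

# Made-up toy data standing in for sequence groups A and B.
group_a = ["ACGTACGT", "ACGTACGA"]
group_b = ["GGGCGGGC"]

freqs_a = base_frequencies(group_a)             # frequencies from A alone
freqs_ab = base_frequencies(group_a + group_b)  # frequencies after adding B

print("A only:", freqs_a)
print("A + B: ", freqs_ab)
print("table for 20,000 sequences:", full_table_bytes(20000) / 1e9, "GB")
```

Because the pooled frequencies change once group B is added, distances
among the group A sequences that were computed from the old frequencies
would no longer match what a run on the combined data set would give.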
I have posted this rather than emailing to the original poster because
it might be educational for others using our programs.
----
Joe Felsenstein joe from removethispart.gs.washington.edu
Department of Genome Sciences and Department of Biology,
University of Washington, Box 355065, Seattle, WA 98195-5065 USA