Robert Rumpf asked about bootstrapping phylogenetic trees using
the PHYLIP package....
The PHYLIP package allows you to bootstrap trees constructed using
any of the tree construction methods (including Fitch-Margoliash,
Neighbor-Joining and Parsimony). The basic idea is that you create
many shuffled (resampled) datasets by running SEQBOOT, then run the
tree construction method of your choice on them, and finally run the
CONSENSE program to get the final bootstrapped tree.
e.g. to bootstrap a Neighbor-Joining tree derived from nucleic acid
sequences, run
    SEQBOOT, then DNADIST and NEIGHBOR, then CONSENSE
e.g. to bootstrap a Parsimony tree derived from nucleic acid sequences,
run
    SEQBOOT, then DNAPARS, then CONSENSE
e.g. to bootstrap a Neighbor-Joining tree derived from protein sequences,
run
    SEQBOOT, then PROTDIST and NEIGHBOR, then CONSENSE
e.g. to bootstrap a Parsimony tree derived from protein sequences, run
    SEQBOOT, then PROTPARS, then CONSENSE
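
The four recipes above differ only in the middle step. As a minimal
sketch of that mapping (the function name phylip_pipeline and its two
arguments are illustrative assumptions, not part of PHYLIP), it could be
written as a shell function:

```shell
#!/bin/sh
# Hypothetical helper: prints the PHYLIP programs to run, in order,
# for a given data type (dna|protein) and method (nj|parsimony).
phylip_pipeline() {
    datatype=$1
    method=$2
    case "$datatype/$method" in
        dna/nj)            middle="dnadist neighbor" ;;
        dna/parsimony)     middle="dnapars" ;;
        protein/nj)        middle="protdist neighbor" ;;
        protein/parsimony) middle="protpars" ;;
        *) echo "unknown combination: $datatype/$method" >&2; return 1 ;;
    esac
    # SEQBOOT always comes first and CONSENSE always comes last.
    echo "seqboot $middle consense"
}

phylip_pipeline protein nj    # prints: seqboot protdist neighbor consense
```

This only prints the order of programs; it does not run them or answer
their menus.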
The PHYLIP documentation suggests that the sequential running of several
programs could be automated, and indeed Tim Littlejohn (tim at bch.montreal.ca)
has already provided some elegant sample scripts (available by gopher-ing onto
megasun.bch.umontreal.ca) for Unix systems.
I have some much uglier scripts for both Unix and VMS that require hand
editing before running. However, they allow options to be selected for
the PHYLIP programs (I believe Tim Littlejohn's scripts use the default
values). My scripts could be improved drastically (see the example below,
which is for Unix and bootstraps a Neighbor-Joining tree based on protein
sequences), but maybe someone has already done something like this?
My scripts also try to deal with 2 other problems:
(1) The shuffled files produced by SEQBOOT can be very large,
e.g. a 4k PHYLIP input file shuffled 2000 times will be
8Mb! It seems sensible to put this in a scratch area.
(2) Using the scratch area, however, creates the problem that your
job may interfere with other PHYLIP jobs, because PHYLIP
programs create files with fixed names like "outfile". These other
jobs may (a) belong to other users, or (b) be other PHYLIP jobs of your
own that are running at the same time. My example script (clumsily)
solves this by using a directory on the scratch area for each user. Each
user also has several subdirectories (e.g. dir1, dir2, etc.). dir1 might
be used for DNA Parsimony PHYLIP jobs, dir2 for DNA Neighbor-Joining
PHYLIP jobs, and so on. So as long as a user runs only one class of job
at a time there will be no problem!
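
Both points can be sketched in shell. This is an illustration under
stated assumptions, not the script that follows: the function names
estimate_scratch_kb and make_job_dir are hypothetical, and the second one
relies on "mktemp -d" (which may not exist on every Unix) to give each
job its own directory, instead of one fixed directory per class of job.

```shell
#!/bin/sh
# Hypothetical helper: rough size (in kilobytes) of the SEQBOOT output,
# taken as input size times number of shuffles.
estimate_scratch_kb() {
    input_kb=$1
    replicates=$2
    echo $(( input_kb * replicates ))
}

# Hypothetical helper: create a private directory for one job under the
# given scratch area, so that "outfile" and "treefile" written by
# different jobs cannot collide. Prints the path of the new directory.
make_job_dir() {
    scratch_root=$1
    mktemp -d "$scratch_root/phylip.XXXXXX"
}

estimate_scratch_kb 4 2000        # prints 8000, i.e. about 8Mb
jobdir=$(make_job_dir "${TMPDIR:-/tmp}") || exit 1
echo "intermediate files would go in $jobdir"
rmdir "$jobdir"                   # tidy up the empty demo directory
```

With a directory per job, two jobs of the same class could run at the
same time, at the cost of having to remove the directory afterwards.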
The example script uses /scratch/username/dir3 to "park" the
intermediate files. PNJB = "protein Neighbor-Joining Bootstrap".
Apologies for the scruffy script. Hopefully it is of use.
Frank Wright
SASS, University of Edinburgh,
J.C.M.B. room 3610, Kings Buildings,
Edinburgh EH9 3JZ, Scotland, U.K.
frank at sass.sari.ac.uk
#==========================================================
# PNJB.batch 1.1 (c) Frank Wright 13th Feb 1993
# (writes "scratch" files to /scratch/username/dir3)
#
# YOU MUST HAVE CREATED "/scratch/username/dir3" beforehand!
#
# Bootstrapping PROTDIST & NEIGHBOR (Neighbor Joining)
# (Kimura distance; 100 bootstrap trials)
#
# Things to alter:
#
# (0) Change all references to "username" to your own username!
#
# (1) Check the comment lines beginning # <<<< - I have
# tried to remind you what is required.
#
# (2) The number of bootstrap trials (shuffles). Remember
# that this number should be the same for the SEQBOOT
# and PROTDIST and NEIGHBOR programs!
#
#==========================================================
# <<<< move to directory required...
# copy file to scratch area, and "cd" to your scratch dir..
#==========================================================
cd
cd myworkdir
cp hbbpep.phy /scratch/username/dir3/infile1
cd /scratch/username/dir3
#==========================================================
# first make shuffled "multiple data sets" with SEQBOOT ...
#==========================================================
# <<<< After next line: replace with own options if reqd.
seqboot << END1
infile1
123
R
100
2
Y
END1
#==========================================================
# now run PROTDIST on shuffled "multiple data sets"...
#==========================================================
\rm infile1
mv outfile infile2
# <<<< After next line: replace with own options if required.
protdist << END2
infile2
P
M
100
2
Y
END2
#==========================================================
# Tidy disk space & run NEIGHBOR-NJ on "multiple" dists...
#==========================================================
\rm infile2
mv outfile infile3
neighbor << END3
infile3
M
100
O
1
2
3
Y
END3
#==========================================================
# Tidy disk space & run CONSENSE to get bootstrapped tree..
#==========================================================
\rm infile3
\rm outfile
mv treefile infile4
consense << END4
infile4
O
1
R
2
Y
END4
#==========================================================
# Tidy disk space.
# <<<< Exit to own dir.
# <<<< give meaningful names to output files.
#==========================================================
\rm infile4
cd
cd myworkdir
mv /scratch/username/dir3/outfile PNJB.out
mv /scratch/username/dir3/treefile PNJB.tree
#==========================================================
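
The hand editing listed under "Things to alter" could be reduced by
keeping the number of bootstrap trials in one variable. A minimal sketch,
assuming the menu answers are exactly those used in the script above
(they depend on the menus of the installed PHYLIP version); the variable
NREPS and the three function names are hypothetical:

```shell
#!/bin/sh
# Number of bootstrap trials, set once and shared by all three programs.
NREPS=100

# Each function prints the answers that the script above feeds to the
# corresponding program, with the number of trials substituted.
seqboot_answers() {
    # $1 = input file name, $2 = the value given on the next line
    #      (123 in the script above)
    printf '%s\n' "$1" "$2" R "$NREPS" 2 Y
}
protdist_answers() {
    # $1 = input file name
    printf '%s\n' "$1" P M "$NREPS" 2 Y
}
neighbor_answers() {
    # $1 = input file name
    printf '%s\n' "$1" M "$NREPS" O 1 2 3 Y
}

# Show what would be sent. To run for real one might pipe it instead,
# e.g.  seqboot_answers infile1 123 | seqboot
seqboot_answers infile1 123
```

Changing NREPS then changes the value sent to SEQBOOT, PROTDIST and
NEIGHBOR together, which is the consistency that point (2) of the header
asks for.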