Brian Osborne (bosborne at NATURE.BERKELEY.EDU) wrote:
: I can remember that someone, somewhere wrote a program in
: Perl to extract individual sequences from multiple fasta files. Does
: anyone know where I can find this work?
Here's a script, "splitfasta", to do it.
------------------------ cut here ----------------------------------------
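#!/usr/local/bin/perl
# if a shell (not perl) is running this file, re-exec it under perl;
# single quotes keep perl from interpolating $0 and $@ itself
eval 'exec /usr/local/bin/perl -S $0 ${1+"$@"}'
    if $running_under_some_shell;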
#
# split a fasta file into separate sequence files
#
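undef $/;       # no input record separator: read entire input in one slurp
$seqs = <>;     # read input, assigning to single string
# split before every '>' that starts a line, so a '>' inside a
# description line does not cut a record short; drop any text
# that precedes the first header
@seqs = grep { /^>/ } split(/^(?=>)/m, $seqs);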
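for (@seqs) {
    # only allow characters A-Z, a-z, 0-9, '_', '-', and '.' in names;
    # change if you're more liberal
    if (/^> *([\w\-.]+)/) {
        # name is scoped to this record, so a record with an unrecognised
        # header can never overwrite the previous record's file
        my $seq_name = $1;
        open(my $out, '>', $seq_name)
            or die "couldn't open $seq_name for writing: $!";
        print $out $_;
        close($out)
            or die "couldn't close $seq_name: $!";
    }
    else {
        warn "couldn't recognise the sequence name in \n$_";
    }
}
__END__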
------------------------ cut here ----------------------------------------
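If you want to check how records get split and named before letting anything write files, here is a minimal sketch that does the same two steps on an in-memory string. The helper names split_fasta_records and record_name are made up for this illustration, and the input is invented, not real sequence data. It splits before each '>' that starts a line, which is one way to arrive at the per-sequence records the script works on.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: break a multi-FASTA string into one string per record.
# The split happens before every '>' that begins a line; any text ahead of
# the first header is discarded by the grep.
sub split_fasta_records {
    my ($text) = @_;
    return grep { /^>/ } split(/^(?=>)/m, $text);
}

# Hypothetical helper: pull a file name out of a record's header line.
# Returns undef when the header has no name made of A-Z, a-z, 0-9, '_', '-', '.'.
sub record_name {
    my ($record) = @_;
    return $record =~ /^> *([\w\-.]+)/ ? $1 : undef;
}

# Invented example input: two named records and one with an empty header.
my $fasta = ">seq1 first example\nACGTACGT\nGGCC\n>seq2\nTTTT\n>\nAAAA\n";

for my $record (split_fasta_records($fasta)) {
    my $name = record_name($record);
    if (defined $name) {
        print "would write file: $name\n";
    }
    else {
        print "no usable name, record would be skipped\n";
    }
}
```

With this input the first two records are named seq1 and seq2, and the third is reported as skipped because its header line is empty.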
____________________________________________________________
Will Fischer
Biology Department wfischer at indiana.edu
Jordan Hall http://www.bio.indiana.edu/~wfischer
Indiana University Lab: 812-855-2549
Bloomington, Indiana 47405 USA FAX: 812-855-6705