John Ladasky wrote:
> To everyone who has responded to my query:
>> Thank you!
>> I have a lot to think about. Java, Perl, Python, and Tcl all seem to
> have a critical mass of advocates within the bioinformatics community.
> I am also finding class libraries on the Web in several of these
> languages, which may help to carry the water for me on my first
> project.
>> I may be able to stick with my initial, less-than-fully-informed
> language choice. The BioJava library has a GenBank file class,
> although I can't yet tell how comprehensive it is. You can put a lot
> of stuff into a GenBank file, however it's not the SAME stuff in each
> file. I'm looking to strip exons out of eukaryotic genome sequence
> files, so I'll need access to the information in the CDS field. I can
> probably restrain my own inquiries to files that have only one CDS
> contained within them -- but a truly robust computer program would be
> able to extract multiple CDS's from a single file, if they were
> present, and keep track of each gene separately. Whew! Maybe another
> time.
Don't constrain - it's not really that hard to do.
With Bioperl 1.2.
use Bio::SeqIO;
# sequence input parser
my $input = new Bio::SeqIO(-format => 'genbank',
-file => 'myfile.gb');
# output sequence writer
my $outuput = new Bio::SeqIO(-format => 'fasta',
-file => '>exons.fa');
my $Genecounter = 1;
while( my $seq = $input->next_seq ) { # iterate through each seq in file
# get all the CDS features from the sequence
my @CDS = grep { $_->primary_tag eq 'CDS' } $seq->top_SeqFeatures();
foreach my $cds ( @CDS ) {
# get just the sequence that is annotated as CDS
my $spliceseq = $cds->spliced_seq();
# a little bit of work to get the genename out
# if it is annotated
my ($genename);
if( $cds->has_tag('gene') ) {
($genename) = $cds->each_tag_value('gene');
} elsif( $cds->has_tag('note') ) {
($genename) = $cds->each_tag_value('note');
} else {
# apply some other criteria here if you prefer
($genename) = "Gene".$Genecounter++;
}
$spliceseq->display_id($genename); # set the sequence name
$out->write_seq($spliceseq);
}
}
BioJava can also do this I'm sure - Checkout "BioJava in Anger"
http://bioconf.otago.ac.nz/biojava
>> Perl looks useful. But its readability appears to be subject to the
> whims of the programmer, just like in C. Shortcuts abound. You can
> write a single line of code that does everything but the laundry.
> Good luck trying to figure out what it means when you go back to read
> it a month later!
>> Python looks very easy. Almost too easy. Is it underpowered? I'm
> still trying to find out. There's a BioPython web page... it can't be
> completely useless.
>> Tcl looks reasonable, from what little I've seen so far. Some of the
> links I've tried to follow which would describe Tcl bioinformatics
> tools are broken. Still searching.
>> Again, thanks!
>> --
> John J. Ladasky Jr., Ph.D.
> Department of Biology
> Johns Hopkins University
> Baltimore MD 21218
> USA
> Earth