Thanks from biologist choosing new programming language

Jason Stajich jason.stajich at NOSPAMBOTS-duke.edu
Mon Feb 10 06:16:08 EST 2003

John Ladasky wrote:
> To everyone who has responded to my query:
> Thank you!
> I have a lot to think about.  Java, Perl, Python, and Tcl all seem to
> have a critical mass of advocates within the bioinformatics community.
>  I am also finding class libraries on the Web in several of these
> languages, which may help to carry the water for me on my first
> project.
> I may be able to stick with my initial, less-than-fully-informed
> language choice.  The BioJava library has a GenBank file class,
> although I can't yet tell how comprehensive it is.  You can put a lot
> of stuff into a GenBank file, however it's not the SAME stuff in each
> file.  I'm looking to strip exons out of eukaryotic genome sequence
> files, so I'll need access to the information in the CDS field.  I can
> probably restrain my own inquiries to files that have only one CDS
> contained within them -- but a truly robust computer program would be
> able to extract multiple CDS's from a single file, if they were
> present, and keep track of each gene separately.  Whew!  Maybe another
> time.

Don't constrain - it's not really that hard to do.

With Bioperl 1.2.

use Bio::SeqIO;

# sequence input parser
my $input = new Bio::SeqIO(-format => 'genbank',
                            -file   => 'myfile.gb');
# output sequence writer
my $outuput = new Bio::SeqIO(-format => 'fasta',
                              -file   => '>exons.fa');
my $Genecounter = 1;
while( my $seq = $input->next_seq ) { # iterate through each seq in file
   # get all the CDS features from the sequence
   my @CDS = grep { $_->primary_tag eq 'CDS' } $seq->top_SeqFeatures();
   foreach my $cds ( @CDS ) {
      # get just the sequence that is annotated as CDS
      my $spliceseq = $cds->spliced_seq();

      # a little bit of work to get the genename out
      # if it is annotated
      my ($genename);
      if( $cds->has_tag('gene') ) {
        ($genename) = $cds->each_tag_value('gene');
      } elsif( $cds->has_tag('note') ) {
        ($genename) = $cds->each_tag_value('note');
      } else {
	# apply some other criteria here if you prefer
         ($genename) = "Gene".$Genecounter++;
      $spliceseq->display_id($genename); # set the sequence name

BioJava can also do this I'm sure - Checkout "BioJava in Anger"

> Perl looks useful.  But its readability appears to be subject to the
> whims of the programmer, just like in C.  Shortcuts abound.  You can
> write a single line of code that does everything but the laundry. 
> Good luck trying to figure out what it means when you go back to read
> it a month later!
> Python looks very easy.  Almost too easy.  Is it underpowered?  I'm
> still trying to find out.  There's a BioPython web page... it can't be
> completely useless.
> Tcl looks reasonable, from what little I've seen so far.  Some of the
> links I've tried to follow which would describe Tcl bioinformatics
> tools are broken.  Still searching.
> Again, thanks!
> --
> John J. Ladasky Jr., Ph.D.
> Department of Biology
> Johns Hopkins University
> Baltimore MD 21218
> Earth

More information about the Bio-soft mailing list

Send comments to us at biosci-help [At] net.bio.net