
GenBank v 132 errors & genbanktogcg

Mike Cherry cherry at genome.stanford.edu
Wed Nov 13 18:25:18 EST 2002


Hello,

There are four lines in the GenBank v 132 files released on November 7th 
that cause the genbanktogcg program to skip entries.

gbpln5.seq     EMEMTDNA
gbpri23.seq    HUMALUL1A
gbvrl3.seq     MAARNA33
gbvrt2.seq     XLXK81A1

In the plant file (gbpln5.seq) the offending line causes genbanktogcg to
skip more than 35,000 entries.  In the primate, viral and vertebrate files
only the one entry is skipped.

Look for lines that are longer than 85 characters; example Perl code is below.

I just edited the files to change the wrapping, so all lines are less 
than 86 characters.  If the files are too big for your editor you can 
use the UNIX "split" command to divide the file into many smaller files.
I used something like "split -400000 filename" to give files of
400,000 lines to edit.
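
The split, edit and rejoin round trip might look roughly like the sketch below. It assumes a POSIX shell; the function names, the ".part." prefix and the output file name are made up for illustration, and "split -l 400000" is the standard spelling of the "-400000" shorthand.

```shell
# Hypothetical helper names; only "split" and "cat" are standard tools.

# Break file $1 into pieces of $2 lines each: $1.part.aa, $1.part.ab, ...
split_for_editing() {
    split -l "$2" "$1" "$1.part."
}

# Concatenate the (edited) pieces of $1 back together into $2.
# The shell expands the glob in sorted order, which matches split's
# aa, ab, ac, ... suffix order.
rejoin_pieces() {
    cat "$1".part.* > "$2"
}

# Usage (file names are examples):
#   split_for_editing gbpln5.seq 400000
#   ... edit gbpln5.seq.part.* ...
#   rejoin_pieces gbpln5.seq gbpln5.fixed.seq
```

With the default two-letter suffixes, split can produce at most 676 pieces, which is plenty at 400,000 lines per piece.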

-Mike

P.S.  This has been reported to the GCG support folks at Accelrys.  They 
have been in contact with NCBI about the GenBank format problems.


---
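#!/usr/bin/perl

# Print every input line longer than 85 characters, prefixed by its length.
# Give the GenBank flat files as arguments, or pipe them in on standard input.
use strict;
use warnings;

while ( my $line = <> ) {
     chomp $line;    # strip only the newline; chop would remove any last character
     my $len = length($line);
     if ($len > 85) {
         print "$len =\t$line\n";
     }
}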
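
When the lines have to be found again in an editor, it can help to report the file name and line number as well. A minimal sketch using awk from a POSIX shell follows; find_long_lines is a hypothetical helper name, and the 85-character limit is the same one used in the Perl script above.

```shell
# Print "file:line: length" for every line longer than $1 characters
# in the files given as the remaining arguments.
# find_long_lines is a hypothetical helper name.
find_long_lines() {
    max=$1
    shift
    awk -v max="$max" 'length($0) > max { print FILENAME ":" FNR ": " length($0) }' "$@"
}

# Usage (file names are the ones listed in this message):
#   find_long_lines 85 gbpln5.seq gbpri23.seq gbvrl3.seq gbvrt2.seq
```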



