Hello,
There are four lines in the GenBank v 132 files released on November 7th
that cause the genbanktogcg program to skip entries.
gbpln5.seq EMEMTDNA
gbpri23.seq HUMALUL1A
gbvrl3.seq MAARNA33
gbvrt2.seq XLXK81A1
In the plant file the offending line causes genbanktogcg to skip more
than 35,000 entries. For the primate, viral and vertebrate files only
the one entry is skipped.
Look for lines that are longer than 85 characters; example Perl code is below.
I just edited the files to change the wrapping, so all lines are less
than 86 characters. If the files are too big for your editor you can
use the UNIX "split" command to divide the file into many smaller files.
I used something like "split -l 400000 filename" to give files of
400,000 lines to edit.
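
A minimal sketch of the split-and-rejoin step described above, assuming a POSIX shell and the standard "split" and "cat" utilities. The function names split_for_edit and rejoin_parts, and the ".part." and ".fixed" suffixes, are made up for illustration and are not part of GCG or GenBank.

```shell
# split_for_edit FILE [LINES]
# Split FILE into pieces of LINES lines each (default 400000).
# Pieces are named FILE.part.aa, FILE.part.ab, and so on.
split_for_edit() {
    split -l "${2:-400000}" "$1" "$1.part."
}

# rejoin_parts FILE
# Concatenate the (edited) pieces back into FILE.fixed.
# The shell expands the glob in sorted order, so aa, ab, ... stay in sequence.
rejoin_parts() {
    cat "$1".part.* > "$1.fixed"
}
```

Typical use would be "split_for_edit gbpln5.seq", then edit the piece that holds the long line, then "rejoin_parts gbpln5.seq" and run genbanktogcg on gbpln5.seq.fixed.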
-Mike
P.S. This has been reported to the GCG support folks at Accelrys. They
have been in contact with NCBI about the GenBank format problems.
---
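#!/usr/bin/perl
# Report every input line longer than 85 characters, with its length.
use strict;
use warnings;
while ( my $line = <> ) {
    chomp $line;    # strip only the newline; chop would eat any last character
    my $len = length($line);
    if ($len > 85) {
        print "$len =\t$line\n";
    }
}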
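
For very large files it can also help to know where each long line sits, so you know which piece to open after splitting. The following is only a sketch of the same check, assuming a POSIX shell with awk; the function name find_long_lines is hypothetical.

```shell
# find_long_lines FILE...
# Print "file:line_number:length" for every line longer than 85 characters.
find_long_lines() {
    awk 'length($0) > 85 { print FILENAME ":" FNR ":" length($0) }' "$@"
}
```

For example, "find_long_lines gbpln5.seq gbpri23.seq gbvrl3.seq gbvrt2.seq" lists one line of output per offending line. With 400,000-line pieces, a hit at line N falls in piece number N divided by 400,000, rounded up.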