
Warning, LMFLCHR12 in Genbank 121 breaks (older?) GENBANKTOGCG

David Mathog mathog at seqaxp.bio.caltech.edu
Fri Jan 19 15:36:18 EST 2001


I don't know who else is still using GCG 8.1 (or GCG in general; this group
has been very quiet lately...), or whether this problem exists in more recent
GCG versions, but in any case here is a bug to watch out for.

The GCG program GENBANKTOGCG (v8.1) is hard-coded with a limit of 2Mb of
sequence (DBMAXSEQLEN in dbdefs.h) and 20k lines of header information
(MAXHEADROOM in genbanktogcg.c) for handling entries from Genbank flat
files.  BOTH of these limits were exceeded by one or another entry in
Genbank 121 in the HTG24 division.  Thanks to our friends in the
Leishmania sequencing project there were several HUGE entries in this
release, and only these entries triggered the bug.  For instance, the
LMFLCHR12 entry exceeded both limits.  Because the genbanktogcg program
had NO bounds checking whatsoever on either of the two affected arrays,
it wrote past the end of them when it hit these entries, corrupted its
own memory, and gave erroneous results.
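
To make the failure mode concrete, here is a minimal sketch of the kind of
check that was missing.  It is NOT the GCG code: HeaderTable, header_add,
MAX_LINES and LINE_LEN are names and sizes I made up for illustration, with
MAX_LINES standing in for MAXHEADROOM.  The point is only that the store
function refuses to write once the fixed-size table is full.

```c
#include <stdio.h>
#include <string.h>

#define MAX_LINES 4   /* hypothetical stand-in for MAXHEADROOM */
#define LINE_LEN  82  /* hypothetical size of one stored line */

/* Fixed-size table of header lines, similar in spirit to HeadBuff. */
typedef struct {
    char lines[MAX_LINES][LINE_LEN];
    int  count;
} HeaderTable;

/* Store one header line.
   Returns 0 on success, -1 if the table is already full.
   Overlong lines are truncated instead of running past their row. */
int header_add(HeaderTable *t, const char *line)
{
    if (t->count >= MAX_LINES)
        return -1;  /* refuse to write past the end of the table */
    snprintf(t->lines[t->count], LINE_LEN, "%s", line);
    t->count++;
    return 0;
}
```

A caller would test the return value after every line and stop with an error
message on -1, rather than carrying on with a corrupted table.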

Following my signature you'll find a "diff" of the original and modified
genbanktogcg.c files.  Essentially I just set it to blow up when it hits
either condition, since the limits are compile-time constants and the
program has to be recompiled with larger values whenever this sort of
problem is found.  I also raised the limits to 30k lines and 10Mb of
sequence.
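
Since the limits can only be changed by recompiling, it may be useful to find
out ahead of time whether a new release contains entries that exceed them.
The following is a rough sketch of such a pre-scan, not anything shipped with
GCG.  ScanResult and scan_flatfile are hypothetical names.  It assumes the
usual flat file layout (an entry starts with a LOCUS line, the sequence
follows the ORIGIN line, and "//" ends the entry) and lines shorter than the
512 byte read buffer.  Its header count runs from LOCUS through ORIGIN
inclusive, which may differ by a line or two from what GENBANKTOGCG counts.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Largest header line count and sequence length seen in one scan. */
typedef struct {
    long max_header_lines;
    long max_seq_len;
} ScanResult;

/* Scan a Genbank flat file stream and report the largest entry sizes.
   Bases are counted as the alphabetic characters on the lines between
   ORIGIN and the terminating "//". */
ScanResult scan_flatfile(FILE *fp)
{
    ScanResult r = {0, 0};
    char line[512];
    long hlines = 0, bases = 0;
    int in_entry = 0, in_seq = 0;

    while (fgets(line, sizeof line, fp)) {
        if (strncmp(line, "LOCUS", 5) == 0) {
            /* start of a new entry: reset the per-entry counters */
            in_entry = 1;
            in_seq = 0;
            hlines = 0;
            bases = 0;
        }
        if (!in_entry)
            continue;  /* skip any text before the first entry */

        if (strncmp(line, "//", 2) == 0) {
            /* end of entry: keep the maxima */
            if (hlines > r.max_header_lines) r.max_header_lines = hlines;
            if (bases > r.max_seq_len) r.max_seq_len = bases;
            in_entry = 0;
            in_seq = 0;
        } else if (in_seq) {
            for (const char *p = line; *p; p++)
                if (isalpha((unsigned char)*p))
                    bases++;
        } else {
            hlines++;
            if (strncmp(line, "ORIGIN", 6) == 0)
                in_seq = 1;
        }
    }
    return r;
}
```

Comparing the two maxima against MAXHEADROOM and DBMAXSEQLEN before running
the conversion would show whether a rebuild is needed first.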

Regards,

David Mathog
mathog at seqaxp.bio.caltech.edu
Manager, sequence analysis facility, biology division, Caltech 

$ diff genbanktogcg.c .c_dist

************
File PRGDISK:[GCG.GCGSOURCE.SOURCE]GENBANKTOGCG.C;15
   24   *
   25   * 19-JAN-2001. David Mathog.  Program did not check most buffers for overrun,
   26   * and large Genbank sequences caused overflow and corruption.  Added checks
   27   * for max number of header lines and DBMAXSEQLEN and put in messages to
   28   * expand those if the values were exceeded.  Did not put in checks for WpComment
   29   * because these other two checks _should_ make that array safe
   30   *
   31   *
   32   ******************************************************************************/
******
File PRGDISK:[GCG.GCGSOURCE.SOURCE]GENBANKTOGCG.C_DIST;1
   24   ******************************************************************************/
************
************
File PRGDISK:[GCG.GCGSOURCE.SOURCE]GENBANKTOGCG.C;15
   72   /* MATHOG. increased from 20000 to 30000 lines */
   73   #define MAXHEADROOM 30000       /* max lines of GenBank heading */
   74   
******
File PRGDISK:[GCG.GCGSOURCE.SOURCE]GENBANKTOGCG.C_DIST;1
   64   #define MAXHEADROOM 20000       /* max lines of GenBank heading */
   65   
************
************
File PRGDISK:[GCG.GCGSOURCE.SOURCE]GENBANKTOGCG.C;15
  426   /*  MATHOG.  Original code did not check for buffer overrun.
  427       Wimp out and detect it, blow up, and tell user to increase
  428       headroom.  This one messes up if there are exactly MAXHEADROOM
  429       header lines, but that's an indication that it should be increased
  430       too.
  431     
  432   
  433       if(!strncmp(HeadBuff[HLines++], "ORIGIN", 6))
  434   */
  435       if(HLines >= MAXHEADROOM-2){
  436       /* header is too large, blow up now */
  437         WriteF("\n\\b *** FATAL ERROR in \"%.2n\", \n", InFile);
  438         WriteF(" An entry has more than %d header lines!!!! ***\n\n",
  439         MAXHEADROOM);
  440         WriteF(" Modify genbanktogcg.c to increase MAXHEADROOM and try again\n");
  441         GCGExit(EXITBAD);
  442       }
  443       if(!strncmp(HeadBuff[HLines++], "ORIGIN", 6))
******
File PRGDISK:[GCG.GCGSOURCE.SOURCE]GENBANKTOGCG.C_DIST;1
  417       if(!strncmp(HeadBuff[HLines++], "ORIGIN", 6))
************
File PRGDISK:[GCG.GCGSOURCE.SOURCE]GENBANKTOGCG.C;15
  676         
  677       /* MATHOG.  check length of string + j and blow out if
  678          it's bigger than DBMAXSEQLEN */
  679          
  680       if(strlen(string) + j > DBMAXSEQLEN){
  681         WriteF("\n\\b *** FATAL ERROR in \"%.2n\", \n", InFile);
  682         WriteF(" An entry is larger than DBMAXSEQLEN (%d) bases !!!! ***\n\n",
  683         DBMAXSEQLEN);
  684         WriteF(" Modify dbdefs.h to increase DBMAXSEQLEN and try again\n");
  685         GCGExit(EXITBAD);
  686       }
  687       
  688       for(i = 0; string[i]; i++) {
******
File PRGDISK:[GCG.GCGSOURCE.SOURCE]GENBANKTOGCG.C_DIST;1
  650       for(i = 0; string[i]; i++) {
************

Number of difference sections found: 5
Number of difference records found: 40

DIFFERENCES /IGNORE=()/MERGED=1/OUTPUT=USRDISK:[USERS.MATHOG]KILLME.TXT;2-
    PRGDISK:[GCG.GCGSOURCE.SOURCE]GENBANKTOGCG.C;15-
    PRGDISK:[GCG.GCGSOURCE.SOURCE]GENBANKTOGCG.C_DIST;1





