I don't know who else is still using GCG 8.1 (or GCG in general, this group
has been very quiet lately...) or if this problem exists in more recent
GCG versions, but in any case, here is a bug to watch out for.
The GCG program GENBANKTOGCG (v8.1) was hard coded to a limit of 2Mb of
sequence (DBMAXSEQLEN in dbdefs.h) and 20k lines of header information
(MAXHEADROOM in genbanktogcg.c) for handling entries from Genbank flat
files. BOTH of these limits were violated by one or another entry in
Genbank 121 in the HTG24 division. Thanks to our friends in the
Leishmania sequencing project there were several HUGE files in this
release, and it was only these entries that triggered this bug. For
instance, the LMFLCHR12 entry violated both limits. Because genbanktogcg
had NO bounds checking whatsoever on either of the two affected arrays, it
wrote past the end of them when it hit these entries, corrupted its own
memory, and gave erroneous results.
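For anyone who wants to see the failure mode in isolation, here is a minimal sketch of a fixed-size header line store with the bounds check that was missing. It is not the GCG code: HeaderStore, header_store_add, MAX_HEAD_LINES and MAX_LINE_LEN are hypothetical names, and the limits are deliberately tiny so the full condition is easy to reach.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical limits, deliberately tiny. MAX_HEAD_LINES plays the
   role that MAXHEADROOM plays in genbanktogcg.c. */
#define MAX_HEAD_LINES 4
#define MAX_LINE_LEN   128

/* Hypothetical header store; NOT the actual layout of GCG's HeadBuff. */
typedef struct {
    char   lines[MAX_HEAD_LINES][MAX_LINE_LEN];
    size_t count;
} HeaderStore;

/* Copy one header line into the store.
   Returns 0 on success, or -1 if the store is already full or the
   line is too long for a slot. On failure the store is not modified. */
static int header_store_add(HeaderStore *hs, const char *line)
{
    if (hs->count >= MAX_HEAD_LINES)   /* the check the original lacked */
        return -1;
    if (strlen(line) >= MAX_LINE_LEN)  /* leave room for the trailing NUL */
        return -1;
    strcpy(hs->lines[hs->count], line);
    hs->count++;
    return 0;
}
```

Without the first test, the fifth call would write into whatever happens to follow the array, which is the sort of silent corruption described above.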
Following my signature you'll find a "diff" of the original and modified
genbanktogcg.c files. Essentially I just set it to blow up with a fatal
error message when it hits either condition, since both limits are
compile-time constants and the program has to be recompiled anyway when
this sort of problem is found. I also raised the limits to 30k lines
(MAXHEADROOM) and 10Mb of sequence (DBMAXSEQLEN, which lives in dbdefs.h
and so does not appear in the diff).
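The sequence side of the fix follows the same "check before you copy" pattern. The sketch below is a stand-alone illustration, not the patched GCG routine: seq_append and SEQ_CAPACITY are hypothetical names (SEQ_CAPACITY stands in for DBMAXSEQLEN), and it returns an error code where the real patch prints a message and exits.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical capacity, deliberately tiny; stands in for DBMAXSEQLEN. */
#define SEQ_CAPACITY 16

/* Append chunk to seq, which currently holds *used bases.
   seq must have room for SEQ_CAPACITY + 1 chars (bases plus NUL),
   and *used must not exceed SEQ_CAPACITY on entry.
   Returns 0 on success, or -1 if the chunk would not fit, in which
   case seq and *used are left untouched. */
static int seq_append(char *seq, size_t *used, const char *chunk)
{
    size_t n = strlen(chunk);

    /* Same test as "*used + n > SEQ_CAPACITY", written so that the
       addition cannot wrap around. */
    if (n > SEQ_CAPACITY - *used)
        return -1;

    memcpy(seq + *used, chunk, n);
    *used += n;
    seq[*used] = '\0';
    return 0;
}
```

A caller that gets -1 back can then report which limit to raise and stop, rather than carrying on with a damaged buffer.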
Regards,
David Mathog
mathog at seqaxp.bio.caltech.edu
Manager, sequence analysis facility, biology division, Caltech
$ diff genbanktogcg.c .c_dist
************
File PRGDISK:[GCG.GCGSOURCE.SOURCE]GENBANKTOGCG.C;15
24 *
25 * 19-JAN-2001. David Mathog. Program did not check most buffers for overrun,
26 * and large Genbank sequences caused overflow and corruption. Added checks
27 * for max number of header lines and DBMAXSEQLEN and put in messages to
28 * expand those if the values were exceeded. Did not put in checks for WpComment
29 * because these other two checks _should_ make that array safe
30 *
31 *
32 ******************************************************************************/
******
File PRGDISK:[GCG.GCGSOURCE.SOURCE]GENBANKTOGCG.C_DIST;1
24 ******************************************************************************/
************
************
File PRGDISK:[GCG.GCGSOURCE.SOURCE]GENBANKTOGCG.C;15
72 /* MATHOG. increased from 20000 to 30000 lines */
73 #define MAXHEADROOM 30000 /* max lines of GenBank heading */
74
******
File PRGDISK:[GCG.GCGSOURCE.SOURCE]GENBANKTOGCG.C_DIST;1
64 #define MAXHEADROOM 20000 /* max lines of GenBank heading */
65
************
************
File PRGDISK:[GCG.GCGSOURCE.SOURCE]GENBANKTOGCG.C;15
426 /* MATHOG. Original code did not check for buffer overrun.
427 Wimp out and detect it, blow up, and tell user to increase
428 headroom. This one messes up if there are exactly MAXHEADROOM
429 header lines, but that's an indication that it should be increased
430 too.
431
432
433 if(!strncmp(HeadBuff[HLines++], "ORIGIN", 6))
434 */
435 if(HLines >= MAXHEADROOM-2){
436 /* header is too large, blow up now */
437 WriteF("\n\\b *** FATAL ERROR in \"%.2n\", \n", InFile);
438 WriteF(" An entry has more than %d header lines!!!! ***\n\n",
439 MAXHEADROOM);
440 WriteF(" Modify genbanktogcg.c to increase MAXHEADROOM and try again\n");
441 GCGExit(EXITBAD);
442 }
443 if(!strncmp(HeadBuff[HLines++], "ORIGIN", 6))
******
File PRGDISK:[GCG.GCGSOURCE.SOURCE]GENBANKTOGCG.C_DIST;1
417 if(!strncmp(HeadBuff[HLines++], "ORIGIN", 6))
************
************
File PRGDISK:[GCG.GCGSOURCE.SOURCE]GENBANKTOGCG.C;15
676
677 /* MATHOG. check length of string + j and blow out if
678 it's bigger than DBMAXSEQLEN */
679
680 if(strlen(string) + j > DBMAXSEQLEN){
681 WriteF("\n\\b *** FATAL ERROR in \"%.2n\", \n", InFile);
682 WriteF(" An entry is larger than DBMAXSEQLEN (%d) bases !!!! ***\n\n",
683 DBMAXSEQLEN);
684 WriteF(" Modify dbdefs.h to increase DBMAXSEQLEN and try again\n");
685 GCGExit(EXITBAD);
686 }
687
688 for(i = 0; string[i]; i++) {
******
File PRGDISK:[GCG.GCGSOURCE.SOURCE]GENBANKTOGCG.C_DIST;1
650 for(i = 0; string[i]; i++) {
************
Number of difference sections found: 5
Number of difference records found: 40
DIFFERENCES /IGNORE=()/MERGED=1/OUTPUT=USRDISK:[USERS.MATHOG]KILLME.TXT;2-
PRGDISK:[GCG.GCGSOURCE.SOURCE]GENBANKTOGCG.C;15-
PRGDISK:[GCG.GCGSOURCE.SOURCE]GENBANKTOGCG.C_DIST;1