IUBio

2BIT format?

Sean Eddy eddy at wrasse.wustl.edu
Thu Feb 19 14:08:47 EST 1998


friard at area.ba.cnr.it (Olivier Friard) writes:
> contains 2 types of files (.SEQ and .REF). In the .SEQ file, they are 2 
> types of sequences: the first one type is classic readable ASCII but 
> the second one is 2BIT unreadable format. Does someone know how read this 
> 2BIT format?

I got a parser working for 2BIT after a painful couple of days. The
basic idea is that GCG 2BIT packs four nucleotides into an eight-bit
byte as two-bit values. If the sequence has IUPAC ambiguity codes, it
can't be packed into 2 bits, and the whole sequence stays as ASCII.
Whether a sequence has been compressed to 2BIT or left as ASCII is
noted in the description line. The resulting files are thus mixed
ASCII/binary files and have to be read in fairly carefully. You have
to deduce exactly how many binary chars to read before you start
parsing in ASCII again.

My parser's mostly undocumented, and I don't have enough time to go
through and document it. But in case it helps I'll attach my C code
for the two crucial functions, in a form very close to being
compatible with Don Gilbert's excellent readseq package. Hopefully
between that and GCG Fortran source you can derive enough info to grok
the essentials of their 2BIT encoding.

-------

static void
readGCGdata(struct ReadSeqVars *V)
{
  int   binary = FALSE;         /* whether data are binary or not */
  int   blen;                   /* length of binary sequence */
  
                                /* first line contains ">>>>" followed by name */
  if (Strparse(">>>>([^ ]+) .+2BIT +Len: ([0-9]+)", V->sbuffer, 2) == 0) 
    {
      binary = TRUE;
      SetSeqinfoString(V->sqinfo, sqd_parse[1], SQINFO_NAME);
      blen = atoi(sqd_parse[2]);
    } 
  else if (Strparse(">>>>([^ ]+) .+ASCII +Len: [0-9]+", V->sbuffer, 1) == 0) 
    SetSeqinfoString(V->sqinfo, sqd_parse[1], SQINFO_NAME);
  else 
    Die("bogus GCGdata format? %s", V->sbuffer);

                                /* second line contains free text description */
  getline(V);
  SetSeqinfoString(V->sqinfo, V->sbuffer, SQINFO_DESC);

  if (binary) {
    /* allocate for blen characters +3... (allow for 3 bytes of slop) */
    /* note we allocate enough space for the *uncompressed* sequence  */ 
    if (blen >= V->maxseq) {
      V->maxseq = blen;
      if ((V->seq = (char *) realloc (V->seq, sizeof(char)*(V->maxseq+4)))==NULL)
        Die("malloc failed");
    }
                               /* read (blen+3)/4 bytes from file */
    if (fread(V->seq, sizeof(char), (blen+3)/4, V->f) < (size_t) ((blen+3)/4))
      Die("fread failed");
    V->seqlen = blen;
                               /* convert binary string to ASCII seq string */
    GCGBinaryToSequence(V->seq, blen);
  }
  else readLoop(0, endGCGdata, V);
  
  while (!(feof(V->f) || ((*V->sbuffer != 0) && (*V->sbuffer == '>'))))
    getline(V);
}


/* Function: GCGBinaryToSequence()
 * 
 * Purpose:  Convert a GCG 2BIT binary string to DNA sequence.
 *           0 = C  1 = T  2 = A  3 = G
 *           4 nts/byte
 *           Converts right to left so we can do conversion in place.
 *           
 * Args:     seq  - binary sequence. Converted in place to DNA.
 *           len  - length of DNA. binary is (len+3)/4 bytes
 */
int
GCGBinaryToSequence(char *seq, int len)
{
  int   bpos;                   /* position in binary   */
  int   spos;                   /* position in sequence */
  char  twobit;
  int   i;

  for (bpos = (len-1)/4; bpos >= 0; bpos--) 
    {
      twobit = seq[bpos];
      spos   = bpos*4;

      for (i = 3; i >= 0; i--) 
        {
          switch (twobit & 0x3) {
          case 0: seq[spos+i] = 'C'; break;
          case 1: seq[spos+i] = 'T'; break;
          case 2: seq[spos+i] = 'A'; break;
          case 3: seq[spos+i] = 'G'; break;
          }
          twobit = twobit >> 2;
        }
    }
  seq[len] = '\0';
  return 1;
}




-- 
- Sean Eddy; Dept. of Genetics, WashU School of Medicine, St. Louis



More information about the Info-gcg mailing list

Send comments to us at biosci-help [At] net.bio.net