Creating a GCG dbase

Rodrigo Lopez rodrigol at biotek.uio.no
Thu May 25 15:13:21 EST 1995

In article <3pv9se$l6g at mserv1.dl.ac.uk>,
   "Adrian Jenkins" <ajenkins at chalsig.nibsc.ac.uk> wrote:
>Hi folks,
>sorry to ask this question, but if I wanted to convert a sequence dbase into 
>gcg format, how would I go about it?
>A starting point (gopher, www etc.) of where I can get the software/information 
>is all I'm after.
>The sequence dbase I'm after is the Los Alamos HIV sequence dbase (protein +
>DNA), available in either EMBL or GenBank format.
>I will be using gcg on an SG Indigo server.
>Thanking you in advance

(Please note that the following instructions assume you are well versed in the
use of the GCG package and understand the procedure and its implications. They
also assume you are using GCG 7.x or GCG 8.0 under some UNIX variant.)

Since the databases already exist in EMBL (i.e. dat file) and GenBank format 
(i.e. seq file), why not use embltogcg or genbanktogcg? You will get 
access to these programs at your site if you run:

% gcg
% gcgsupport

Now, this will work for the DNA databases without doubt. For the protein ones
it might. The AA one-letter codes in the protein databases will be treated 
as ambiguity codes and produce non-binary GCG libraries. You will have to test 
the protein database by fetching a few entries and making certain that they 
all contain the GCG separator line with 'TYPE: P' and NOT 'TYPE: N' on it.

Before you can do this you will have to define appropriate library names using
the 'name' program. Something like this should work:

(I don't know what the names of the distribution files are, but let's assume
they are hiv-n.dat for the DNA and hiv-p.dat for the protein sequence database.)

% gcg
% gcgsupport
% embltogcg hiv-n.dat -dir=/usr/databases/hiv -def 
% embltogcg hiv-p.dat -dir=/usr/databases/hiv -def
% seqcat /usr/databases/hiv/*.seq
% name -s -q hiv-nrootdir /usr/databases     ! where you keep the databases
% name -s -q hiv-ndir     hiv-nrootdir:hiv    ! where you have the hiv db
% name -s -q hiv-n        hiv-ndir:hiv-n     ! where you have hiv-n.seq 
% name -s -q hiv-prootdir /usr/databases
% name -s -q hiv-pdir     hiv-prootdir:hiv
% name -s -q hiv-p        hiv-pdir:hiv-p
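As a reading aid for the 'name' definitions above, here is a minimal Python
sketch of how the chain of names appears intended to expand, going by the
comments on those lines. It is only an illustration: the table is typed in by
hand, the expand() helper is hypothetical, and this is not how GCG itself
resolves names.

```python
# Hand-copied from the 'name' commands above; nothing here is read from GCG.
LOGICALS = {
    "hiv-nrootdir": "/usr/databases",    # where you keep the databases
    "hiv-ndir": "hiv-nrootdir:hiv",      # where you have the hiv db
    "hiv-n": "hiv-ndir:hiv-n",           # where you have hiv-n.seq
}


def expand(name, table):
    """Expand a 'prefix:rest' definition by resolving the prefix recursively.

    Assumption: the part before the colon is itself a defined name and the
    part after it is a path component below that location.
    """
    value = table[name]
    if ":" not in value:
        return value
    prefix, rest = value.split(":", 1)
    if prefix not in table:
        return value
    return expand(prefix, table) + "/" + rest


if __name__ == "__main__":
    print(expand("hiv-n", LOGICALS))  # /usr/databases/hiv/hiv-n
```

The protein names (hiv-prootdir, hiv-pdir, hiv-p) follow the same pattern.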

Now you could try:

% names hiv-n:*
% names hiv-p:*

and then fetch some entries. Check the GCG separator line on a few protein 
sequences. If you see some of these with 'TYPE: N', you can use reformat:

% reformat *.hiv-p -pro

to make sure you get the right TYPE and avoid some programs crashing on
you. Hope you don't hit this feature.
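If you have fetched more than a handful of protein entries, checking the
separator lines by eye gets tedious. The following is a minimal Python sketch
of that check, not part of GCG. It assumes each fetched entry is a separate
text file, that the separator line is the first line ending in '..', and that
it carries a 'Type: X' field (matched case-insensitively, with or without a
space after the colon). The function names are made up for this example.

```python
import re
import sys

# Assumption: the type field looks like "Type: P" or "TYPE:N" on the separator line.
TYPE_FIELD = re.compile(r"type:\s*([A-Za-z])", re.IGNORECASE)


def separator_type(text):
    """Return the type letter from the first line ending in '..', else None."""
    for line in text.splitlines():
        if line.rstrip().endswith(".."):
            match = TYPE_FIELD.search(line)
            return match.group(1).upper() if match else None
    return None


def wrong_type_files(paths, expected="P"):
    """Return the paths whose separator line does not carry the expected type."""
    bad = []
    for path in paths:
        with open(path, errors="replace") as handle:
            if separator_type(handle.read()) != expected:
                bad.append(path)
    return bad


if __name__ == "__main__":
    # Print every file given on the command line that is not TYPE: P.
    for path in wrong_type_files(sys.argv[1:]):
        print(path)
```

Run it with the fetched entry files as arguments; any file it prints is a
candidate for the reformat step above. A documentation line that happens to end
in '..' before the real separator would fool this check, so treat its output as
a hint rather than proof.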

