In article <3pv9se$l6g at mserv1.dl.ac.uk>,
"Adrian Jenkins" <ajenkins at chalsig.nibsc.ac.uk> wrote:
>sorry to ask this question, but if I wanted to convert a sequence dbase into
>gcg format, how would I go about it?
>A stating point (gopher www etc) of where i can get the software/information
>all im after.
>The sequence dbase Im after is the Los alamos HIV sequence dbase (protein +
>DNA) available in either Embl of Genbank format.
>I will be using gcg on an SG indigo server.
>Thanking you in advance
(Please note the following instructions assume you are well versed in the use
of the GCG package and understand the procedure and its implications. They also
assume you are USING GCG 7.x or GCG 8.0 under some UNIX variant.)
Since the databases already exist in EMBL (i.e. dat file) and GenBank format
(i.e. seq file), why not use embltogcg or genbanktogcg? You will get
access to these programs at your site if you:
Now, this will work for the DNA databases without doubt. For the protein ones
it might. The AA one-letter codes in the protein databases can be treated
as ambiguity codes and produce non-binary GCG libraries. You will have to test
the protein database by fetching a few entries and making certain that they
all contain the GCG separator line with 'TYPE: P' and NOT 'TYPE: N' on it.
Before you can do this you will have to make appropriate library names using
the 'name' program. Something like this should work:
(I don't know what the distribution files are called, but let's assume they
are: hiv-n.dat for the DNA and hiv-p.dat for the protein sequence database.)
% embltogcg hiv-n.dat -dir=/usr/databases/hiv -def
% embltogcg hiv-p.dat -dir=/usr/databases/hiv -def
% seqcat /usr/databases/hiv/*.seq
% name -s -q hiv-nrootdir /usr/databases ! where you keep the databases
% name -s -q hiv-ndir hiv-nrootdir:hiv ! where you have the hiv db
% name -s -q hiv-n hiv-ndir:hiv-n ! where you have hiv-n.seq
% name -s -q hiv-prootdir /usr/databases
% name -s -q hiv-pdir hiv-prootdir:hiv
% name -s -q hiv-p hiv-pdir:hiv-p
Now you could try:
% names hiv-n:*
% names hiv-p:*
and then fetch some entries. Check the GCG separator line on a few protein
sequences. If you see some of these with 'TYPE: N', you can use reformat:
% reformat *.hiv-p -pro
to make sure you get the right TYPE and avoid some programs crashing on
you. Hope you don't hit this feature.
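
If you have fetched more than a handful of protein entries, checking each
separator line by eye gets tedious. Here is a rough sketch of a shell helper
that does the check with plain grep. It is not part of GCG; the function name
list_type_n is made up for this example, and it assumes the separator is the
first line ending in ".." and that it carries 'TYPE: N' or 'TYPE: P' (matched
without regard to case) as described above.

```shell
#!/bin/sh
# list_type_n: print the names of fetched GCG files whose separator line
# is marked as nucleotide. Hypothetical helper, not a GCG program.
# Assumption: the separator is the first line ending in ".." and contains
# "Type: N" or "Type: P" (any letter case).
list_type_n() {
    for f in "$@"; do
        # Pick the first line ending in "..", then look for Type: N on it.
        if grep '\.\.[[:space:]]*$' "$f" | head -n 1 \
            | grep -iq 'type:[[:space:]]*n'; then
            printf '%s\n' "$f"
        fi
    done
}
```

Something like 'list_type_n *.hiv-p' in the directory holding the fetched
entries should then list only the files that still need the reformat step.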