
GCG at the HGMP-RC

Miss. V Wood vwood at hgmp.mrc.ac.uk
Mon Mar 3 09:54:24 EST 1997


This document includes:


1. INSTALLING NEW SITE-SPECIFIC DATABASES FOR GCG
	As a complete novice installing GCG, I have encountered various
	challenges sourcing and setting up site-specific databases at
	the UK HGMP-RC.
	This document gives an overview of the most up-to-date
	locations of these additional databases and the problems
	encountered, which will hopefully save others some time.
	Most of the information is in the manual, but not
	necessarily in the order required for new database installation.
	We run a Unix version of GCG on Solaris 5.5.1.  EGCG 9 is not
	currently installed, and may require additional data files.
	If anybody knows of more up-to-date versions of any
	data files, or quicker ways to achieve these results, please feel
	free to correct me.


i)SOURCES OF THE SITE SPECIFIC DATABASES MAINTAINED AT HGMP

ii) INSTALLATION PROCEDURE

2. CHANGES TO DATABASE LOGICAL NAME
	We have also made changes to the database logical name 
	nomenclature of our copy of version 9 which other sites may wish 
	to implement. 

3. GCG DATAFILE UPDATING

	
================================================================

1. INSTALLING NEW SITE-SPECIFIC DATABASES

================================================================


i) SOURCES OF THE SITE SPECIFIC DATABASES MAINTAINED AT HGMP
----------------------------------------------------------------

TREMBL
======
SITE   	ftp://ftp.ebi.ac.uk/pub/databases/trembl/remtrembl/*.dat.Z
        ftp://ftp.ebi.ac.uk/pub/databases/trembl/sptrembl/*.dat.Z
VERSION	2.0
DATE    25/2/97
FILES REQUIRED  As above


EPD
===
SITE	ftp://ftp.nig.ac.jp/pub/db/epd/	
VERSION	48.0	
DATE    01/9/96
FILES REQUIRED  gcg.blk  renamed to epd.dat


OWL
===
SITE	 ftp://ftp.seqnet.dl.ac.uk/pub/database/owl/	
VERSION  29.1
DATE     12 Jan 1997
FILES REQUIRED  owl.nam.Z,owl.ref.Z, owl.seq.Z


NRL3D
=====
SITE	ftp://ftp.ebi.ac.uk/pub/databases/nrl3d/nrl_3d.*
VERSION	20.0
DATE    30/9/95
FILES REQUIRED  nrl_3d.nam, nrl_3d.ref, nrl_3d.seq


KABAT
Does anyone know of a compatible version of this data?



ii) INSTALLATION PROCEDURE
--------------------------

1. Retrieving the data
----------------------

Create a directory for each new database in your GCG 9 data area,
e.g.  /data/gcg9/gcgowl  etc.
Transfer the relevant data release into this directory by ftp.
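
The directory setup can be scripted. The following is a minimal sketch,
assuming a POSIX shell; the function name make_db_dirs is hypothetical and
not part of GCG, and the directory names follow the /data/gcg9/gcg<database>
pattern used in this document.

```shell
#!/bin/sh
# make_db_dirs ROOT NAME...  (hypothetical helper, not part of GCG)
# Creates one data directory per new database under ROOT.
make_db_dirs() {
    root=$1
    shift
    for db in "$@"; do
        mkdir -p "$root/$db" || return 1
    done
}

# Example call, using the paths from this document:
#   make_db_dirs /data/gcg9 gcgtrembl gcgepd gcgnrl gcgowl
```

Files fetched with a .Z suffix still need to be uncompressed in place
before they are formatted.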

Initialise gcg and the support environment.


2. Defining the logical names and directory path
------------------------------------------------

Define your database in the file gcgdbconfigure/dblogicals.
This file defines the database logical names and the directory path
to the data as created in 1.

e.g.
TremblDir       /data/gcg9/gcgtrembl
EpdDir          /data/gcg9/gcgepd
NRLDir          /data/gcg9/gcgnrl
OwlDir          /data/gcg9/gcgowl

Run the GCG command 'newdblogicals'. This establishes the logical
names defined in dblogicals.
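
It can help to confirm that every directory named in dblogicals really
exists before the logical names are relied on. The sketch below assumes
that each entry is a logical name followed by a directory path, separated
by whitespace, and that lines beginning with "!" are comments; the real
file may hold other kinds of line, so treat this as an illustration. The
function name check_dblogicals is hypothetical.

```shell
#!/bin/sh
# check_dblogicals FILE  (hypothetical helper, not part of GCG)
# Assumes each entry is "LogicalName  /directory/path" separated by
# whitespace, and that lines beginning with "!" are comments.
# Prints one line for every logical name whose directory is missing.
check_dblogicals() {
    awk 'NF >= 2 && $1 !~ /^!/ { print $1, $2 }' "$1" |
    while read -r name dir; do
        [ -d "$dir" ] || echo "$name: no such directory: $dir"
    done
}

# Example call:
#   check_dblogicals gcgdbconfigure/dblogicals
```

No output means that every listed directory was found.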

Testing:
Check the logical names have been set up correctly with the gcg 
command 'name'

e.g. 'name TremblDir' will show the directory path for that logical name


3. Defining the individual database divisions
---------------------------------------------

Edit the file gcgdbconfigure/dbnames.map
This file maps each database distribution file to its data library
names and location. Every database division must have an entry in this
file.


e.g.

 Flat           Directory       Library         Short   Database
 File           Logical         Logical         Logical Release
 Name           Name            Name            Name    Name
_______         _________       _______         _______ ________ !  ..
.
.
.
sptrfun         TremblDir       sptr_fun        sptr_fun        SPTREMBL !
sptrhum         TremblDir       sptr_hum        sptr_hum        SPTREMBL !
sptrinv         TremblDir       sptr_inv        sptr_inv        SPTREMBL !
sptrmam         TremblDir       sptr_mam        sptr_mam        SPTREMBL !
sptrmhc         TremblDir       sptr_mhc        sptr_mhc        SPTREMBL !
sptrorg         TremblDir       sptr_org        sptr_org        SPTREMBL !
sptrphg         TremblDir       sptr_phg        sptr_phg        SPTREMBL !
sptrpln         TremblDir       sptr_pln        sptr_pln        SPTREMBL !
sptrpro         TremblDir       sptr_pro        sptr_pro        SPTREMBL !
sptrrod         TremblDir       sptr_rod        sptr_rod        SPTREMBL !
sptrvrl         TremblDir       sptr_vrl        sptr_vrl        SPTREMBL !
sptrvrt         TremblDir       sptr_vrt        sptr_vrt        SPTREMBL !

remtrimmuno     TremblDir       remtr_immuno    remtr_immuno    REMTREMBL
remtrpatent     TremblDir       remtr_patent    remtr_patent    REMTREMBL
remtrpseudo     TremblDir       remtr_pseudo    remtr_pseudo    REMTREMBL
remtrsmalls     TremblDir       remtr_smalls    remtr_smalls    REMTREMBL
remtrsynth      TremblDir       remtr_synth     remtr_synth     REMTREMBL

epd             EpdDir          epd             epd             EPD

nrl_3d          NRLDir          nrl_3d          nrl             NRL

owl             OwlDir          owl             owl             OWL

 
 ****REMEMBER THE CARRIAGE RETURN AFTER YOUR FINAL ENTRY*****

Note that the flat file names for trembl entries have been prefixed
by sptr and remtr.  This is because the actual basenames for these
files are the same as the EMBL base names.

When the GCG formatting utilities are run (see below) they are
supplied with the relevant flat-file names, and obtain the relevant
data by locating the appropriate flat-file entry in dbnames.map.
This would obviously cause problems where flat-file names are
duplicated.
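
A quick way to spot duplicated flat-file names is sketched below. It
assumes that the flat-file name is the first whitespace-separated column
of dbnames.map and that the names are compared exactly as written; the
function name find_duplicate_flatfiles is hypothetical.

```shell
#!/bin/sh
# find_duplicate_flatfiles FILE  (hypothetical helper, not part of GCG)
# Assumes the flat-file name is the first whitespace-separated column and
# skips blank lines and lines starting with "!", "." or "_".
# Prints each flat-file name that occurs more than once.
find_duplicate_flatfiles() {
    awk 'NF > 0 && $1 !~ /^[!._]/ { count[$1]++ }
         END { for (name in count) if (count[name] > 1) print name }' "$1" | sort
}

# Example call:
#   find_duplicate_flatfiles gcgdbconfigure/dbnames.map
```

An empty result means that no flat-file name is used twice.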

To avoid this problem, we have prefixed the division names in 
dbnames.map with sptr and remtr. These are then referenced by a 
directory of pointers in:

/data/gcg9/gcgtrembl/pointers

In this directory, create a symbolic link to each 'real' database
division, linking it to its 'new name' in dbnames.map.

e.g.
ln -s /data/gcg9/gcgtrembl/fun.dat sptr_fun.dat
etc.

Then in the next section when you supply the raw data to the database
formatting utility, you will submit the list of pointers as the input
file instead.
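
Creating the links one at a time is tedious, so the step can be looped.
The sketch below is an illustration only: the function name link_divisions
is hypothetical, and it assumes that the sptrembl and remtrembl files have
been unpacked into separate source directories so that each set can be
given its own prefix. Pass an absolute source path so that the links
resolve from anywhere.

```shell
#!/bin/sh
# link_divisions SRC_DIR PTR_DIR PREFIX  (hypothetical helper, not part of GCG)
# For every SRC_DIR/*.dat, creates PTR_DIR/PREFIX<name>.dat as a symbolic
# link to it, e.g. fun.dat -> sptr_fun.dat when PREFIX is "sptr_".
link_divisions() {
    src_dir=$1
    ptr_dir=$2
    prefix=$3
    mkdir -p "$ptr_dir" || return 1
    for f in "$src_dir"/*.dat; do
        [ -f "$f" ] || continue          # skip if the glob matched nothing
        ln -sf "$f" "$ptr_dir/$prefix$(basename "$f")" || return 1
    done
}

# Example call (the sptrembl subdirectory is an assumption):
#   link_divisions /data/gcg9/gcgtrembl/sptrembl /data/gcg9/gcgtrembl/pointers sptr_
```

The link names must agree with whatever names the formatting utility
looks up in dbnames.map, so check the prefix against your own entries.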

Run 'newdbfiles' to establish the logical names specified in dbnames.map
(and farm.configure, detailed later).

Testing:
You can check these entries with:

'name (database or synonym)'  AND  'namels (directory logical name)'

e.g.
'name epd'   will return:
EpdDir:epd
'namels epddir'
will show the directory contents


4. Formatting the databases for GCG
-----------------------------------

If a database is in a standard format, use the database utilities,
e.g. embltogcg, to create the GCG indices.

If the .seq, .ref and .header files are available, 'dbindex' will
complete the indexing.

The command line parameters for each of the above databases are:

embltogcg -protein -in=/data/gcg9/gcgtrembl/pointers/*.dat -rel=2.0
 -day=25 -month=2 -year=1997

embltogcg -in=/data/gcg9/gcgepd/epd.dat -rel=48.0 -day=01 -month=9
 -year=1996
	
dbindex -in=/data/gcg9/gcgowl/owl.seq -default

dbindex -in=/data/gcg9/gcgnrl/nrl_3d.seq -default
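
The -day, -month and -year values are easy to mistype, so it may be worth
deriving them from the release date in one place. The sketch below only
prints a command line and does not run it; the function name embltogcg_cmd
is hypothetical, and it uses only the embltogcg options shown above.

```shell
#!/bin/sh
# embltogcg_cmd INPUT RELEASE DD/MM/YYYY  (hypothetical helper, not part of GCG)
# Prints, but does not run, an embltogcg command line built from a release
# date, using only the options shown in this document.
embltogcg_cmd() {
    input=$1
    release=$2
    date=$3
    day=${date%%/*}        # text before the first "/"
    rest=${date#*/}        # text after the first "/"
    month=${rest%%/*}
    year=${rest#*/}
    printf 'embltogcg -in=%s -rel=%s -day=%s -month=%s -year=%s\n' \
        "$input" "$release" "$day" "$month" "$year"
}

# Example call:
#   embltogcg_cmd /data/gcg9/gcgepd/epd.dat 48.0 01/9/1996
```

For a protein database such as TREMBL, -protein still has to be added by
hand to the printed command.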


Testing:
'names nrl:* -def'
etc.
will show what matching entries are available.


5. farmediting
--------------

Farms group together several data libraries so they can be searched
as a unit. Users can use the farm name or a synonym so they can search
all divisions of a farm.
 
farmedit was used to edit farm.configure and the farm files in 
dbconfigfiles/

Databases added:
trembl
remtrembl
sptrembl
 
Databases described unambiguously in dbnames.map, i.e. those with only
one division, must NOT have a farm file.

Run 'newdbfiles' to establish the logical names for the new farms.


6. To make your additional databases available to Seqlab
-------------------------------------------------------

Edit the file seqlab.dbs
- this will add your new databases to the seqlab database options.

*** Make sure you are editing the site-specific copy of seqlab.dbs
in gcgdbconfigure, and check that the modifications show up in seqlab.

e.g.

 ..
Swissprot
Sw_new          
PIR
GenEMBL
  New           
.
.           
  GSS           
TAGS
GenEMBLminus    
Trembl          Added
  SpTrembl      Added
  RemTrembl     Added
GenPept         Added
EPD             Added
NRL             Added
OWL             Added


Testing: 
Check these are working as follows:
Run seqlab (as an ordinary user NOT su)

Check these show in the seqlab browser.

Select 'File' from the main window
	'Add sequences from..'
	 'Databases'

Click on the chosen database in the database browser window
select 'show matching entries'.

The relevant entries should appear in the database browser window.

7. Blast Configuration
----------------------

At the HGMP-RC we provide a native blast service for many databases.
We allow users to use GCG blast, but use these native indices rather
than provide additional indices for GCG, to save disc space and CPU
resources.
This means that the blast databases may be more up to date than the 
other GCG sequence databases.
To do this: 

edit blast.ldbs to point to the local blast directories


8. Versions file
----------------
Edit the versions file in gcgcore/scriptdoc/versions.txt
to provide details of your additional databases.

9. Seqcat
---------
Seqcat creates files of definitions for use by stringsearch. You need
to run seqcat for each database installed.

e.g.

'seqcat databasename'
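
With several databases installed, the seqcat runs can be looped. This is a
sketch under assumptions: the function name run_seqcat_all and the SEQCAT
variable are hypothetical, the database names in the example call are
guesses at the names your site uses, and seqcat may prompt for further
input that this loop does not supply.

```shell
#!/bin/sh
# run_seqcat_all NAME...  (hypothetical helper, not part of GCG)
# Runs "seqcat NAME" for each database name and stops at the first failure.
# The SEQCAT variable is an assumption added here so that the command can be
# replaced (e.g. by echo) for a dry run.
run_seqcat_all() {
    for db in "$@"; do
        "${SEQCAT:-seqcat}" "$db" || {
            echo "seqcat failed for $db" >&2
            return 1
        }
    done
}

# Example call (database names are assumptions):
#   run_seqcat_all trembl epd nrl owl
```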

10. SRS indexing
----------------

No SRS indexing has been completed for these databases at present. 


================================================================

2. CHANGES TO DATABASE LOGICAL NAME

================================================================

We are concerned that the default definitions of GenBank and EMBL have
had ESTs, STSs and now GSSs removed. Although we appreciate
the need to be able to search the data without these sections,
we feel strongly that this sets a trap for the unwary user who
assumes that they are searching the databases in their entirety.

Our concerns have been justified by the number of queries from users
who haven't found expected matches.

We have therefore changed the nomenclature for version 9 so that all
the ESTs, STSs and GSSs are added back into GENBANK and EMBL.
We have removed genbankplus as a synonym, and replaced the reduced
databases by genbankminus and emblminus.

This change requires a minimum of edits:

Using farmedit, all the tags were added to genbank, embl and genembl.
New farms genbankminus, emblminus and genemblminus were created
without tags.
Farmedit was used to remove genbankplus and emblplus.

seqlab.dbs also needs to be edited to add genemblminus, and move the 
tags into the main genembl section.

================================================================

3. GCG DATAFILE UPDATING

================================================================

	
At the HGMP-RC we aim to keep the datafiles for GCG as up to date
as possible.
Below are the sources of the files which have been updated for GCG 9.

DATAFILE:               Rebase
NAMES COMMAND:          data:enz*
GCGFILE:                enzyme.dat
GCGDIR:                 gcg9/gcgcore/data/rundata/
CURRENT RELEASE:        v608    Jun95
NEW RELEASE:        	v702    Feb97
SOURCE SITE:            ftp://ftp.ebi.ac.uk/pub/databases/rebase
SOURCE FILE:            rebase.gcg
REFORMATTING:           none required but paste the old header to the 
			top of the file and rename as enzyme.dat
			update versions.txt
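
The header step can be sketched as follows. It assumes that the header of
the old enzyme.dat ends at the first line containing "..", which is how GCG
data files normally separate their heading from the data; check this by
eye first, since an earlier line containing ".." would cut the header
short. The function name prepend_old_header is hypothetical.

```shell
#!/bin/sh
# prepend_old_header OLD NEW OUT  (hypothetical helper, not part of GCG)
# Assumes the header of OLD ends at the first line containing "..".
# Writes that header followed by the whole of NEW to OUT.
prepend_old_header() {
    old=$1
    new=$2
    out=$3
    grep -q '\.\.' "$old" || { echo "no '..' line in $old" >&2; return 1; }
    {
        awk '{ print } /\.\./ { exit }' "$old"
        cat "$new"
    } > "$out"
}

# Example call (write to a new file, then move it into place):
#   prepend_old_header enzyme.dat rebase.gcg enzyme.dat.new
```

OUT must be a different file from OLD, otherwise the old header is
destroyed before it is read.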


GCGFILE:                enzrefs.txt  and enzsources.txt
GCGDIR:                 gcg/gcgcore/data/moredata/
CURRENT RELEASE:        v gcgref 608 jul31 96 v commdata 608 jul31 96
SOURCE SITE:            ?
SOURCE FILE:            ?
REFORMATTING:
 
I have not been able to find new versions of the above files.


GCG is supplied with only 3 codon usage tables:
mus.cod         rat.cod         template.cod
45 additional codon tables are available at the EBI.
It should be noted that many of these are made from relatively few
sequences, and we aim to replace them with the CUTG tables from
http://www.dna.go.jp/, which will require conversion to GCG format.


DATAFILE:               Codon usage tables
NAMES COMMAND:          data:*cod
GCGFILE:                *.cod
GCGDIR:                 gcg/gcgcore/data/traindata/
CURRENT RELEASE:        All available tables in gcg format at the EBI
SOURCE SITE:            ftp://ftp.ebi.ac.uk/pub/databases/codonusage/
SOURCE FILE:            *.cod
REFORMATTING:           none
 
GCG was supplied with the current Prosite release,
but we monitor the relevant ftp site closely for new releases.

DATAFILE:               prosite
NAMES COMMAND:          names data:prosite*
GCGFILE:                prosite.patterns
GCGDIR:                 /gcg9/gcgcore/data/rundata/
CURRENT RELEASE:        Release13        Nov95
SOURCE SITE:            ftp://ftp.ebi.ac.uk/pub/databases/prosite/
SOURCE FILE:            prosite.dat
REFORMATTING:           prositetogcg


-- 
_______________________________________________________________
Valerie Wood			Tel	: +44-1223 49 4533
UK HGMP Resource Centre		E-mail	: vwood at hgmp.mrc.ac.uk
Hinxton				
Cambridge
CB10 1SB
_______________________________________________________________



More information about the Info-gcg mailing list
