This document includes:
1. INSTALLING NEW SITE-SPECIFIC DATABASES FOR GCG
As a complete novice installing GCG, I encountered various
challenges in sourcing and setting up site-specific databases at
the UK HGMP-RC.
This document gives an overview of the most up-to-date
locations of these additional databases and of the problems
encountered, which will hopefully save others some time.
Most of the information is in the manual, but not
necessarily in the order required for new database installation.
We run a Unix version of GCG on Solaris 5.5.1. EGCG 9 is not
currently installed, and may require additional data files.
If anybody knows of more up-to-date versions of any
data files, or quicker ways to achieve these results, please feel
free to correct me.
i) SOURCES OF THE SITE SPECIFIC DATABASES MAINTAINED AT HGMP
ii) INSTALLATION PROCEDURE
2. CHANGES TO DATABASE LOGICAL NAME
We have also made changes to the database logical name
nomenclature of our copy of version 9 which other sites may wish
to implement.
3. GCG DATAFILE UPDATING
================================================================
1. INSTALLING NEW SITE-SPECIFIC DATABASES
================================================================
i) SOURCES OF THE SITE SPECIFIC DATABASES MAINTAINED AT HGMP
----------------------------------------------------------------
TREMBL
======
SITE ftp://ftp.ebi.ac.uk/pub/databases/trembl/remtrembl/*.dat.Z
     ftp://ftp.ebi.ac.uk/pub/databases/trembl/sptrembl/*.dat.Z
VERSION 2.0
DATE 25/2/97
FILES REQUIRED As above
EPD
===
SITE ftp://ftp.nig.ac.jp/pub/db/epd/
VERSION 48.0
DATE 01/9/96
FILES REQUIRED gcg.blk renamed to epd.dat
OWL
===
SITE ftp://ftp.seqnet.dl.ac.uk/pub/database/owl/
VERSION 29.1
DATE 12 Jan 1997
FILES REQUIRED owl.nam.Z, owl.ref.Z, owl.seq.Z
NRL3D
=====
SITE ftp://ftp.ebi.ac.uk/pub/databases/nrl3d/nrl_3d.*
VERSION 20.0
DATE 30/9/95
FILES REQUIRED nrl_3d.nam, nrl_3d.ref, nrl_3d.seq
KABAT
=====
Does anyone know of a compatible version of this data?
ii) PROCEDURE
-------------
1. Retrieving the data
----------------------
Create a directory for each new database in your GCG 9 data area,
e.g. /data/gcg9/gcgowl etc.
ftp the relevant data release into this directory.
Initialise GCG and the support environment.
2. Defining the logical names and directory path
------------------------------------------------
Define your database in the file gcgdbconfigure/dblogicals.
This file defines the database logical names and the directory path
to the data as created in 1.
e.g.
TremblDir /data/gcg9/gcgtrembl
EpdDir /data/gcg9/gcgepd
NRLDir /data/gcg9/gcgnrl
OwlDir /data/gcg9/gcgowl
Run the GCG command 'newdblogicals'. This establishes the logical names
defined in dblogicals.
Testing:
Check the logical names have been set up correctly with the gcg
command 'name'
e.g. 'name TremblDir' -will show the directory logical path
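Before running 'newdblogicals' it may help to confirm that every
directory named in the logical-name list actually exists. The following
is only a sketch: it assumes the "LogicalName /path" pairs have been
copied into a plain text file of their own (the real dblogicals file may
contain other lines), and the function name check_logical_dirs is
hypothetical, not part of GCG.

```shell
# check_logical_dirs FILE
#   FILE holds "LogicalName /path" pairs, one per line (assumed layout).
#   Prints every logical name whose directory is missing and
#   returns 1 if any directory is missing, 0 otherwise.
check_logical_dirs() {
    missing=0
    while read -r name dir rest; do
        # skip blank lines and lines starting with the "!" comment marker
        case $name in ''|'!'*) continue ;; esac
        if [ ! -d "$dir" ]; then
            echo "$name: $dir does not exist"
            missing=1
        fi
    done < "$1"
    return $missing
}
```

For example, 'check_logical_dirs mylogicals.txt' (a hypothetical file
name) prints one line per missing directory and nothing if all exist.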
3. Defining the individual database divisions
---------------------------------------------
Edit the file gcgdbconfigure/dbnames.map.
This file maps each database distribution file to its data library
names and location. Every database division must have an entry in this
file.
e.g.
Flat         Directory  Library       Short         Database
File         Logical    Logical       Logical       Release
Name         Name       Name          Name          Name
___________  _________  ____________  ____________  _________  ! ..
.
.
.
sptrfun      TremblDir  sptr_fun      sptr_fun      SPTREMBL   !
sptrhum      TremblDir  sptr_hum      sptr_hum      SPTREMBL   !
sptrinv      TremblDir  sptr_inv      sptr_inv      SPTREMBL   !
sptrmam      TremblDir  sptr_mam      sptr_mam      SPTREMBL   !
sptrmhc      TremblDir  sptr_mhc      sptr_mhc      SPTREMBL   !
sptrorg      TremblDir  sptr_org      sptr_org      SPTREMBL   !
sptrphg      TremblDir  sptr_phg      sptr_phg      SPTREMBL   !
sptrpln      TremblDir  sptr_pln      sptr_pln      SPTREMBL   !
sptrpro      TremblDir  sptr_pro      sptr_pro      SPTREMBL   !
sptrrod      TremblDir  sptr_rod      sptr_rod      SPTREMBL   !
sptrvrl      TremblDir  sptr_vrl      sptr_vrl      SPTREMBL   !
sptrvrt      TremblDir  sptr_vrt      sptr_vrt      SPTREMBL   !
remtrimmuno  TremblDir  remtr_immuno  remtr_immuno  REMTREMBL
remtrpatent  TremblDir  remtr_patent  remtr_patent  REMTREMBL
remtrpseudo  TremblDir  remtr_pseudo  remtr_pseudo  REMTREMBL
remtrsmalls  TremblDir  remtr_smalls  remtr_smalls  REMTREMBL
remtrsynth   TremblDir  remtr_synth   remtr_synth   REMTREMBL
epd          EpdDir     epd           epd           EPD
nrl_3d       NRLDir     nrl_3d        nrl           NRL
owl          OwlDir     owl           owl           OWL
****REMEMBER THE CARRIAGE RETURN AFTER YOUR FINAL ENTRY*****
Note that the flat file names for trembl entries have been prefixed
by sptr and remtr. This is because the actual basenames for these
files are the same as the EMBL base names.
When the GCG formatting utilities are run (see below), they are
supplied with the flat-file names and obtain the relevant data by
locating the matching flat-file entry in dbnames.map. This would
obviously cause problems where flat-file names are duplicated.
To avoid this problem, we have prefixed the division names in
dbnames.map with sptr and remtr. These are then referenced by a
directory of pointers in:
/data/gcg9/gcgtrembl/pointers
In this directory, create a symbolic link to each 'real' database
division, linking it to its 'new name' in dbnames.map.
e.g.
ln -s /data/gcg9/gcgtrembl/fun.dat sptr_fun.dat
etc.
Then in the next section when you supply the raw data to the database
formatting utility, you will submit the list of pointers as the input
file instead.
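Creating the links one at a time is tedious, so the 'ln -s' step above
could be looped over the division names. This is a minimal sketch, not
the script used at HGMP: the function name make_pointers and its
argument order are hypothetical, and it assumes every division file is
named <division>.dat and that absolute paths are given.

```shell
# make_pointers SRC_DIR POINTER_DIR PREFIX DIVISION...
#   SRC_DIR      absolute path of the directory holding the real
#                division files (fun.dat, hum.dat, ...)
#   POINTER_DIR  directory that receives the symbolic links
#   PREFIX       prefix for the link names, e.g. sptr_ or remtr_
#   DIVISION...  division base names without the .dat suffix
make_pointers() {
    src_dir=$1
    pointer_dir=$2
    prefix=$3
    shift 3
    mkdir -p "$pointer_dir" || return 1
    for div in "$@"; do
        # refuse to create a dangling link
        if [ ! -f "$src_dir/$div.dat" ]; then
            echo "missing: $src_dir/$div.dat" >&2
            return 1
        fi
        # -f replaces a link left over from an earlier release
        ln -sf "$src_dir/$div.dat" "$pointer_dir/$prefix$div.dat"
    done
}
```

With the paths used in this document, a call might look like:
make_pointers /data/gcg9/gcgtrembl /data/gcg9/gcgtrembl/pointers sptr_ fun hum inv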
Run 'newdbfiles' to establish the logical names specified in dbnames.map
(and farm.configure, detailed later).
Testing:
You can check these entries with:
'name (database or synonym)' AND 'namels (directory logical name)'
e.g.
'name epd' will return:
EpdDir:epd
'namels epddir'
will show the directory contents
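Two of the pitfalls mentioned for dbnames.map, duplicated flat-file
names and a missing carriage return after the final entry, can also be
checked mechanically. The sketch below is an assumption-laden helper,
not a GCG utility: the name check_dbnames_map is hypothetical, and it
assumes that the entries start after the first line containing "..",
that the flat-file name is the first whitespace-separated field, and
that lines starting with "!" are comments.

```shell
# check_dbnames_map FILE
#   Reports a missing final newline and any flat-file name (field 1)
#   that occurs more than once. Returns 1 if a problem was found.
check_dbnames_map() {
    map=$1
    status=0
    # $(...) strips a trailing newline, so a non-empty result means
    # the last byte of the file is not a newline.
    if [ -s "$map" ] && [ -n "$(tail -c 1 "$map")" ]; then
        echo "no newline after the final entry in $map"
        status=1
    fi
    dups=$(awk '
        !in_body { if (index($0, "..")) in_body = 1; next }  # skip header
        NF == 0 || $1 ~ /^!/ { next }                        # blank/comment
        seen[$1]++ == 1 { print $1 }                         # 2nd occurrence
    ' "$map")
    if [ -n "$dups" ]; then
        echo "duplicate flat-file names:"
        echo "$dups"
        status=1
    fi
    return $status
}
```

Running 'check_dbnames_map gcgdbconfigure/dbnames.map' prints nothing
when neither problem is present.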
4. Formatting the databases for GCG
-----------------------------------
If a database is in a standard format, use the database utilities,
e.g. embltogcg, to create the GCG indices.
If the .seq, .ref and .header files are available, 'dbindex' will
complete the indexing.
The command line parameters for each of the above databases are:
embltogcg -protein -in=/data/gcg9/gcgtrembl/pointers/*.dat -rel=2.0 \
    -day=25 -month=2 -year=1997
embltogcg -in=/data/gcg9/gcgepd/epd.dat -rel=48.0 -day=01 -month=9 \
    -year=1996
dbindex -in=/data/gcg9/gcgowl/owl.seq -default
dbindex -in=/data/gcg9/gcgnrl/nrl_3d.seq -default
Testing:
'names nrl:* -def'
etc.
will show what matching entries are available.
5. farmediting
--------------
Farms group together several data libraries so they can be searched
as a unit. Users can use the farm name or a synonym so they can search
all divisions of a farm.
farmedit was used to edit farm.configure and the farm files in
dbconfigfiles/
Databases added:
trembl
remtrembl
sptrembl
Databases described unambiguously in dbnames.map, i.e. those with only
one division, must NOT have a farm file.
Run 'newdbfiles' to create the logical names defined in farm.configure.
6. To make your additional databases available to Seqlab
-------------------------------------------------------
Edit the file seqlab.dbs.
This will add your new databases to the Seqlab database options.
*** Make sure you are editing the site-specific copy of seqlab.dbs
in gcgdbconfigure, and check that the modifications show up in Seqlab.
e.g.
..
Swissprot
Sw_new
PIR
GenEMBL
New
.
.
GSS
TAGS
GenEMBLminus
Trembl Added
SpTrembl Added
RemTrembl Added
GenPept Added
EPD Added
NRL Added
OWL Added
Testing:
Check these are working as follows:
Run seqlab (as an ordinary user NOT su)
Check these show in the seqlab browser.
Select 'File' from the main window
'Add sequences from..'
'Databases'
Click on the chosen database in the database browser window
select 'show matching entries'.
The relevant entries should appear in the database browser window.
7. Blast Configuration
----------------------
At the HGMP-RC we provide a native blast service for many databases.
We allow users to use GCG blast, but use these native indices rather
than provide additional indices for GCG, to save disc space and CPU
resources.
This means that the blast databases may be more up to date than the
other GCG sequence databases.
To do this:
edit blast.ldbs to point to the local blast directories
8. Versions file
----------------
Edit the versions file in gcgcore/scriptdoc/versions.txt
to provide details of your additional databases.
9. Seqcat
---------
Seqcat creates files of definitions for use by stringsearch. You need
to run seqcat for each database installed.
e.g.
'seqcat databasename'
10. SRS indexing
----------------
No SRS indexing has been completed for these databases at present.
================================================================
2. CHANGES TO DATABASE LOGICAL NAME
================================================================
We are concerned that the default definitions of GenBank and EMBL have
had ESTs, STSs and now GSSs removed. Although we appreciate
the need to be able to search the data without these sections,
we feel strongly that this sets a trap for the unwary user who
assumes that they are searching the databases in their entirety.
Our concerns have been justified by the number of queries from users
who haven't found expected matches.
We have therefore changed the nomenclature for version 9 so that all
the ESTs, STSs and GSSs are added back into GENBANK and EMBL.
We have removed genbankplus as a synonym, and replaced the reduced
databases with genbankminus and emblminus.
This change requires a minimum of edits:
Using farmedit, all the tags were added to genbank, embl and genembl.
New farms genbankminus, emblminus and genemblminus were created
without tags.
Farmedit was used to remove genbankplus and emblplus.
seqlab.dbs also needs to be edited to add genemblminus and to move the
tags into the main genembl section.
================================================================
3. GCG DATAFILE UPDATING
================================================================
At the HGMP-RC we aim to keep the datafiles for gcg as up to date
as possible.
Below are the sources of the files which have been updated for gcg9.
DATAFILE: Rebase
NAMES COMMAND: data:enz*
GCGFILE: enzyme.dat
GCGDIR: gcg9/gcgcore/data/rundata/
CURRENT RELEASE: v608 Jun95
NEW RELEASE: v702 Feb97
SOURCE SITE: ftp://ftp.ebi.ac.uk/pub/databases/rebase
SOURCE FILE: rebase.gcg
REFORMATTING: none required but paste the old header to the
top of the file and rename as enzyme.dat
update versions.txt
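The "paste the old header" step for Rebase could be scripted roughly as
follows. This is a sketch under stated assumptions: it assumes the
header of the old enzyme.dat ends at the first line containing "..",
that rebase.gcg carries no header of its own, and that the output file
is different from both inputs. The function name build_enzyme_dat is
hypothetical.

```shell
# build_enzyme_dat OLD_ENZYME_DAT NEW_REBASE_GCG OUTPUT
#   Writes the header of OLD_ENZYME_DAT (everything up to and including
#   the first line containing "..") followed by the whole of
#   NEW_REBASE_GCG to OUTPUT.
build_enzyme_dat() {
    old=$1
    new=$2
    out=$3
    # without a ".." line there is no header boundary to cut at
    if ! grep -q '\.\.' "$old"; then
        echo "no '..' header terminator found in $old" >&2
        return 1
    fi
    # sed prints up to the first ".." line and quits; cat appends the data
    { sed '/\.\./q' "$old"; cat "$new"; } > "$out"
}
```

For example: build_enzyme_dat enzyme.dat.old rebase.gcg enzyme.dat
(where enzyme.dat.old is a saved copy of the previous release). It is
worth inspecting the result by eye, since the release details in the
copied header will still describe the old version.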
GCGFILE: enzrefs.txt and enzsources.txt
GCGDIR: gcg/gcgcore/data/moredata/
CURRENT RELEASE: v gcgref 608 jul31 96 v commdata 608 jul31 96
SOURCE SITE: ?
SOURCE FILE: ?
REFORMATTING:
I have not been able to find new versions of the above files
GCG is supplied with only 3 codon usage tables:
mus.cod rat.cod template.cod
45 additional codon tables are available at the EBI.
It should be noted that many of these were made from relatively few
sequences, and we aim to replace them with the CUTG tables from
http://www.dna.go.jp/, which will require conversion to GCG format.
DATAFILE: Codon usage tables
NAMES COMMAND: data:*cod
GCGFILE: *.cod
GCGDIR: gcg/gcgcore/data/traindata/
CURRENT RELEASE: All available tables in gcg format at the EBI
SOURCE SITE: ftp://ftp.ebi.ac.uk/pub/databases/codonusage/
SOURCE FILE: *.cod
REFORMATTING: none
GCG was supplied with the current Prosite release,
but the relevant ftp site is closely monitored for new releases.
DATAFILE: prosite
NAMES COMMAND: names data:prosite*
GCGFILE: prosite.patterns
GCGDIR: /gcg9/gcgcore/data/rundata/
CURRENT RELEASE: Release13 Nov95
SOURCE SITE: ftp://ftp.ebi.ac.uk/pub/databases/prosite/
SOURCE FILE: prosite.dat
REFORMATTING: prositetogcg
--
_______________________________________________________________
Valerie Wood Tel : +44-1223 49 4533
UK HGMP Resource Centre E-mail : vwood at hgmp.mrc.ac.uk
Hinxton
Cambridge
CB10 1SB
_______________________________________________________________