Procedure file for reformatting REPBASE to GCG (OpenVMS)

mathog at seqaxp.bio.caltech.edu
Tue Jul 2 15:04:38 EST 1996

Hi all,

Following my signature you will find a DCL procedure that will reformat
REPBASE into GCG format and set up a farm (REPBASE), logicals, and so forth.
Obviously it won't run on any Unix variant, but it would be pretty
straightforward to translate it to csh, sh, perl, or whatever.  Probably
the most useful parts of the GCG formatted databases are
REF_H/I/M/P/R/V/SIMPLE, which are respectively reference sequences for
Human, Invertebrate, Plant, Rodent, Vertebrate, and Simple repeats.  These
can be used to find and remove repetitive elements from a query sequence,
using the program SAD (SearchAndDestroy, see next post), prior to
searching.  Doing so generally eliminates Line1, Alu, etc. from the hit
list and brings up the homology to whatever other gene you are actually
looking for.
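
For anyone attempting the Unix translation, here is a minimal Python sketch
of the bookkeeping the procedure does for each entry in its section list:
every "file,logical" line adds one name to the farm file and one ASSIGN line
for sitelogicals.com, and a line with an empty file field only switches the
input directory.  The function name build_outputs, the [.REF] directory, and
the *.ref file names in the usage lines are made up for illustration.  The
conversion itself (embltogcg) is not attempted, and skipping blank lines is
an assumption of the sketch.

```python
def build_outputs(section_lines):
    """Return (farm_lines, logicals_lines, jobs) for 'file,logical' entries."""
    farm = [
        "Farm for REPBASE pieces, this farm accessed via name REPBASE",
        "..",
    ]
    logicals = [
        "$! logicals and symbols for REPBASE access",
        '$ ASSIGN/NOLOG "@GenRunData:Repbase.Farm" Repbase',
    ]
    jobs = []         # (directory, file, logical) triples still to convert
    directory = None  # current input directory, set by ",<directory>" lines

    for raw in section_lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no information
        file1, _, logic = line.partition(",")
        if file1 == "":
            # An empty file field means "change to this directory".
            directory = logic
            continue
        logicals.append(f"$ ASSIGN/NOLOG REPBASEDIR:{logic} {logic}")
        farm.append(logic)
        jobs.append((directory, file1, logic))
    return farm, logicals, jobs


if __name__ == "__main__":
    # Hypothetical entries; the real list ships inside the procedure.
    sample = [",[.REF]", "humrep.ref,REF_H", "simple.ref,REF_SIMPLE"]
    farm, logicals, jobs = build_outputs(sample)
    print("\n".join(farm))
    print("\n".join(logicals))
    print(jobs)
```

Keeping the function free of file system calls makes it easy to check the
generated lines before anything is written or converted.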

General instructions:

1.  Obtain REPBASE from the NCBI, decompress and untar it.

2.  Run this batch job:

     $ submit/log -
        /param=("dbsdisk:[DATABASES.REPBASE]", -
        "/REL=1.0/MON=12/YEAR=1995") -

    Where the first parameter is the TOP of the repbase directory tree,
    and the second is the standard release information.

    This will create a repbase.farm file and place it in the appropriate
    directory (the procedure copies it to GENRUNDATA:).

4.  Append add_to_sitelogicals.com (produced by the procedure and left in
    the top directory of the repbase directory tree) to sitelogicals.com.
    Note: if a prior version of REPBASE was already installed, just check
    that the logical names are the same and reconcile any differences.

5.  Check to see that all processing completed normally and that all 
    databases created are valid.
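
As a rough illustration of the two parameters in step 2, the Python sketch
below derives the GCG subdirectory from the first parameter the same way the
procedure does (remove the first "]" and append ".GCG]") and splits the
release string into its qualifiers.  The function names gcg_directory and
parse_release are hypothetical, and the qualifier parsing is an assumption
added for checking input; the procedure itself only trims the string.

```python
def gcg_directory(top):
    """Mirror the DCL expression P1 - "]" + ".GCG]".

    DCL string subtraction removes only the first occurrence, hence count=1.
    """
    return top.replace("]", "", 1) + ".GCG]"


def parse_release(p2):
    """Split a string like "/REL=1.0/MON=12/YEAR=1995" into a dict."""
    qualifiers = {}
    for part in p2.strip().split("/"):
        if not part:
            continue  # skip the empty piece before the leading "/"
        name, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"qualifier without a value: {part!r}")
        qualifiers[name.strip().upper()] = value.strip()
    return qualifiers


if __name__ == "__main__":
    print(gcg_directory("dbsdisk:[DATABASES.REPBASE]"))
    print(parse_release("/REL=1.0/MON=12/YEAR=1995"))
```

Checking the parameters this way before submitting the job is cheap compared
with discovering a typo after the conversion has run.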


David Mathog
mathog at seqaxp.bio.caltech.edu
Manager, sequence analysis facility, biology division, Caltech 
$! make_gcg_datasets
$! 16-JUN-1996, David Mathog, Biology Division, Caltech
$! This procedure builds a set of REPBASE GCG datasets from the REPBASE distribution
$! which is in EMBL format.
$! P1 is the TOP directory of the REPBASE distribution
$!    The logical REPBASEDIR will point to the [.GCG] directory
$!    which will be created under that.
$! P2 is "/REL=6.0/Year=96/Month=8"
$! Update the sections.txt list below for other releases.
$! Here is a typical submit command for this procedure:
$! $ submit/que=ftp_axpsys/log -
$!      /param=("dbsdisk:[DATABASES.REPBASE]", -
$!              "/REL=1.0/MON=12/YEAR=1995") -
$! WARNING: if you interrupt this procedure, be sure to find the current
$! *.temp file and rename it back; otherwise the next run will not find
$! one of the files it wants!
$! Go to the appropriate directory; P1 must contain a DISK specification
$ if(P1 .nes. "")
$ then
$   set default 'P1'
$   repstring = P1 - "]" + ".GCG]"
$   gcgdir = f$search("gcg.dir")
$   if(gcgdir .eqs. "")then create/directory [.GCG]
$   set file/prot=w:re GCG.DIR
$   define repbasedir 'repstring'
$ else
$   write sys$Output "Specify P1 as the name of the REPBASE main directory"
$   exit
$ endif
$ if(P2 .eqs. "")
$ then
$   write sys$Output "Specify P2 as /REL=6.0/Year=96/Month=8"
$   exit
$ else
$   version = f$edit(P2,"TRIM,COMPRESS")
$ endif
$ create sections.txt
$ gcgsupport
$! make datasets out of each of them - this will take up a lot of space...
$ open/read ifil: sections.txt
$ open/write ffil: repbase.farm
$ write ffil: "Farm for REPBASE pieces, this farm accessed via name REPBASE"
$ write ffil: ".."
$ open/write sfil: add_to_sitelogicals.com
$ write sfil: "$! logicals and symbols for REPBASE access"
$ write sfil: "$ ASSIGN/NOLOG ""@GenRunData:Repbase.Farm"" Repbase"
$ top:
$   read/end=leave/error=leave ifil: string
$   file1  = f$element(0,",",string)
$   logic  = f$element(1,",",string)
$   if(file1 .eqs. "")
$   then
$     set def 'logic'
$     goto top
$   endif
$   write sys$Output "Now processing ''file1', logical = ''logic'"
$!  pieces for sitelogicals
$ write sfil: "$ ASSIGN/NOLOG REPBASEDIR:''logic' ''logic'"
$!  pieces for REPBASE farm
$ write ffil: logic
$! now make the dataset
$ maybe = file1
$ file = f$search(maybe)
$ if(file .nes. "")
$ then 
$! silly embltogcg program doesn't let you specify the output
$! file name, so temporarily rename the input file, then change
$! the name back when done with it, then rename all of the
$! files produced
$   rename 'file1' 'logic'.temp
$   embltogcg/directory=repbasedir/ln='LOGIC'/sn='LOGIC' - 
$   rename 'logic'.temp   'file1'
$ else
$   write sys$Output "Warning: file = ''file1' not processed"
$   write sys$Output "Reason:  ''maybe' not found"
$ endif
$ goto top
$ leave:
$ close ifil:
$ close sfil:
$ close ffil:
$! seqcat the lot of them
$ seqcat/infile=repbasedir:*.seq
$ copy repbase.farm genrundata:
$ set file/prot=w:re genrundata:repbase.farm
$ set file/prot=w:re [.GCG]*.*
$ set file/prot=w:rwed sections.txt
$ delete sections.txt.
$ type sys$Input

     Clean up tasks:

     1.  Append  add_to_sitelogicals.com to sitelogicals.com.  Note if you
         already had a prior version of REPBASE installed just check that 
         the names are the same and reconcile any differences.

     2.  Check to see that all processing completed normally and that all 
         databases created are valid.
