SAGITTARIUS PIR-41 (30 June 1994) variant
*****************************************
SAGITTARIUS PIR is a highly compact variant of the original PIR database,
designed to help individual researchers and software developers use sequence
database information without large storage requirements. It contains PIR
information in a custom compressed form and an interface written in C that
gives fast direct access to the stored information without decompressing the
corresponding files in full. Starting with this PIR-41 version, the same
databank files and the same interface C-file can be used without modification
on both PC-compatibles and UNIX V computers (a Mac interface version is
forthcoming). The interface supports all standard PIR Request Network
queries (i.e. get the databank SEQ number by entry; for a given databank SEQ
number, get specified information such as name, organism(s), keyword(s),
sequence, sequence features with coordinates etc.). Unlike the PIR Request
Network, SAGITTARIUS PIR lets you call up PIR information directly from your
C program, even on a personal computer that is not connected to any network.
In addition, all numerical information, such as intron placement and feature
coordinates, is passed to the calling program as numeric values rather than
text strings, which simplifies (sub)sequence manipulation.
For even greater compactness and flexibility, SAGITTARIUS PIR is organized
as separate file sets, each holding database information of one independent
type (i.e. sequences, entry indexes, organisms etc.). On any particular
computer, the user can easily change the installed configuration of PIR
information as needed, without affecting retrieval of the other types of
stored information. For example, the file set that contains the protein
sequences themselves and their PIR entry indexes (including reverse index
arrays) takes less than 12Mb of disk space.
For PC-compatibles (a version for UNIX V/X Window is forthcoming), a dialog
shell is available that supports all standard PIR Request Network queries
plus homology searches, alignments etc.
SAGITTARIUS PIR is distributed freely (all databank file sets, interface
C-file, test/example program C-file, PC-shell executable) via anonymous FTP.
The file sets and interface can be used in, or included with, any
commercially distributed package without restriction. Consultations and
advanced interface variants (currently used to support fast, efficient
database manipulation in other SAGITTARIUS family packages) are available
from the developers upon request.
At present, the SAGITTARIUS PIR databank stores the following original
information types (fields) of the PIR database in custom compressed form:
- database entry index
- accession number(s)
- other (non-PIR) database crossreference(s)
- protein name
- organism name(s)
- alternative protein name(s)
- keyword(s)
- superfamily name(s)
- gene name(s)
- map position(s)
- unusual start codon(s)
- intron(s) placement
- literature reference(s), including for each:
    - journal or citation
    - author(s)
    - title
    - free-format comment
- sequence feature(s)
- free-format comment
- protein sequence itself
For PIR-41, all bank files together take 33+ Mb on hard disk (20+ Mb in
ZIP-compressed form). Each original database information field (i.e.
sequences, organisms, names, keywords etc.) is stored in a separate file set,
which allows the user to configure reduced bank variants simply by not
unpacking the unneeded information files. For example, leaving out the
literature references reduces the bank to only 23 Mb. The core variant of
the databank files (the minimal configuration supported by the available
PC-shell) includes only indexes and sequences. Any more complete
configuration can be produced simply by unpacking the corresponding file
sets from the distribution.
List of distribution files and their decompressed contents
**********************************************************
-------------------------------------------------------------------------------
ZipFile      ZipSize  Content Description
------------------------------------------------------\/ Core config part \/
CORE      11,105,308  Entry indexes + sequences themselves
------------------------------------------------------\/ User-variable part \/
NAME         991,927  Sequence names
ORGANISM     488,638  Organisms
KEYWORD      270,868  Keywords
S_FAMILY     156,485  Superfamily classifications
CROSSREF     284,114  Other database crossreferences
FEATURE      850,677  Sequence features
GENE_MAP      25,803  Genetic map positions
ALT_NAME     174,754  Sequence alternative names
GENE         163,812  Genes
CODON          7,543  Unusual start codons
ACC_CODE   1,394,839  PIR accession codes
COMMENT      256,560  Sequence comments
INTRON        30,991  Intron(s) placement
REF_JOU      897,568  References core : the references themselves
REF_AUTH   1,593,198  Ref. extension : reference authors
REF_TITL   1,840,525  Ref. extension : reference titles
REF_COMM      92,982  Ref. extension : reference comments
------------------------------------------------------\/ Dialog shell for PC \/
PC_SHELL     146,967  PC-executable + two MAP-files (to \PIR)
------------------------------------------------------\/ Interface \/
INTERFAC      15,383  Interface and test program, C-files, PRJ file
-------------------------------------------------------------------------------
SAGITTARIUS PIR Data Bank Shell
*******************************
SAGITTARIUS PIR Automated Sequence Bank is a dialog shell for manipulating
the compressed sequence database information. It is intended for
MS DOS/Windows PC-compatibles with a hard disk optimizer installed
(a disk cache such as Smartdrive, Hyperdisk, Ncache etc.). A 386 or 486 is
recommended; 86 and 286 machines will be significantly slower but are still
usable. The recommended minimum memory allocation for the disk optimizer is
512 Kb; the shell remains usable, though significantly slower, even without
any optimizer.
The dialog data shell supports the following main operations:
- selection of sequences into the bank buffer by
    - dictionary-defined record for a specified information
      field (name, source, keyword, feature etc.)
    - user-defined context in a specified information
      field (name, source, keyword, feature etc.)
    - set of dictionary-defined records for different information
      fields (source, keyword, superfamily etc.)
    - SEQ perfect or imperfect homology with a user-defined short sequence
- storing and retrieving buffer content (SEQ bank numbers and indexes)
  between sessions
- output of user-specified (buffer) SEQ data to disk files
- fast SEQ homology searches (for a user-defined SEQ of no more than
  50-100 positions, only 1 hour with the full PIR bank on a 486/33)
- fast subregion-sensitive pairwise alignments (user-defined
  sequence with buffer SEQs or the full bank)
- easy data access from user programs (C) as support for
  application development
SAGITTARIUS data bank files are usually updated with the currently available
PIR database information by the distributors only (2 to 4 times a year).
The distribution includes ready-to-use information files, the interface
and executables, all in compressed form.
-----------------------------------------------------------------------------
SAGITTARIUS PIR is available by anonymous FTP from:
FTP.SCRI.FSU.EDU, directory /pub/genetics/pir/
SAGITTARIUS PIR is also available by anonymous FTP from some
of the well-known bio-servers (IUBIO etc.).
----------------------------------------------------------------------------
Installation on UNIX V system
*****************************
All decompressed SAGITTARIUS databank files must be placed in a single
directory, whose name must be correctly specified in the interface
C-file (first strings). Placing the 'X' symbol in the first position of
the corresponding text string (instead of '/') forces the interface to carry
out a formal check for the presence of the databank files.
Installation on PC
******************
If you plan to use the PC-shell, all decompressed SAGITTARIUS databank files
must be placed in the directory \PIR on a single logical drive (any drive
will do). Otherwise, all decompressed SAGITTARIUS databank files must be
placed in a single directory, whose name must be correctly specified in
the interface C-file (first strings). Placing the 'X' symbol in the first
position of the corresponding text string (instead of a drive letter) forces
the interface to 1) search for the corresponding directory on all
available logical drives and 2) carry out a formal check for the presence of
the databank files.
The bank executable BANK.EXE may be placed in any directory on any logical
drive. Important: the .MAP-files contained in the same PC-shell compressed
file must be moved into the \PIR directory where the other databank files
are placed. It is highly recommended to run BANK.EXE from a directory
(and/or logical drive) other than the data location, to avoid accidental
damage to the bank file structure. The bank is designed for access to data
on a file server and can find the \PIR directory (and test it for correct
data configuration) on any logical drive.
----------------------------------------------------------------------------
SAGITTARIUS PIR is FREE DOMAIN software.
This package (with compressed data files) may be redistributed
freely without limitation, but only free of charge and for
non-commercial use. No changes to the data files and/or executables
are allowed.
You may include the compressed SAGITTARIUS datafiles in your application
packages freely, even for commercial use.
--------------------------------------------------------------
For HELPFUL comments and discussions please contact
Dr. Victor B. Strelets (strelets at scri.fsu.edu)
Computational Genetics and Biophysics,
Supercomputer Computations Research Institute,
FSU B-186, Tallahassee, FL 32306-4052, USA
---------------------------------------------------------------
To verify your copy, you may use the following information about the
SAGITTARIUS PIR distribution files (each archive is followed by the files
it contains):
CORE     ZIP  11,105,308
  SEQ0     BAN   299,040
  SEQ      BAN 9,613,474
  DIC      BAN   330,888
  DIC2     BAN   106,496
  IND0     BAN   299,040
  IND2     BAN   301,056
  IND      BAN   598,080
  0IND     REV   299,040
  1IND     REV   299,040
NAME     ZIP     991,927
  NAM0     BAN   299,040
  NAM2     BAN   180,224
  NAM      BAN   929,160
  0NAM     REV   179,196
  1NAM     REV   299,040
ORGANISM ZIP     488,638
  SOU0     BAN   299,040
  SOU1     BAN    78,468
  SOU2     BAN    24,576
  SOU      BAN   105,816
  0SOU     REV    23,972
  1SOU     REV   461,716
KEYWORD  ZIP     270,868
  KW0      BAN   298,936
  KW1      BAN    98,916
  KW2      BAN     6,144
  KW       BAN    20,536
  0KW      REV     5,560
  1KW      REV   270,868
GENE     ZIP     163,812
  GENE0    BAN   286,272
  GENE1    BAN    37,176
  GENE2    BAN    30,720
  GENE     BAN    75,168
  0GENE    REV    29,748
  1GENE    REV    50,200
ALT_NAME ZIP     174,754
  ANAM0    BAN   298,892
  ANAM1    BAN    35,376
  ANAM2    BAN    28,672
  ANAM     BAN   116,104
  0ANAM    REV    28,560
  1ANAM    REV    48,004
S_FAMILY ZIP     156,485
  SFAM0    BAN   293,424
  SFAM1    BAN    26,116
  SFAM2    BAN    14,336
  SFAM     BAN    58,840
  0SFAM    REV    13,520
  1SFAM    REV   123,060
ACC_CODE ZIP   1,394,839
  AC0      BAN   299,040
  AC1      BAN   334,516
  AC2      BAN   333,824
  AC       BAN   665,760
  0AC      REV   332,880
  1AC      REV   332,880
CODON    ZIP       7,543
  CDN      BAN       192
  CDN0     BAN   216,476
  CDN1     BAN       148
  CDN2     BAN     2,048
  0CDN     REV        96
  1CDN     REV     3,352
FEATURE  ZIP     850,677
  FT0      BAN   298,672
  FT1      BAN   222,132
  FT2      BAN    47,104
  FT       BAN   259,144
  FTN0     BAN   298,672
  FTN1     BAN   222,132
  FTN      BAN   904,984
  0FT      REV    46,272
  1FT      REV   214,004
GENE_MAP ZIP      25,803
  MAP0     BAN   277,236
  MAP      BAN    29,480
COMMENT  ZIP     256,560
  CC0      BAN   283,604
  CC       BAN   468,496
CROSSREF ZIP     284,114
  CR0      BAN   299,040
  CR       BAN   362,176
INTRON   ZIP      30,991
  INTR0    BAN   207,764
  INTR1    BAN    39,084
REF_JOU  ZIP     897,568
  REF0     BAN   299,040
  REF1     BAN   403,020
  REF      BAN 1,109,376
REF_AUTH ZIP   1,593,198
  AUT0     BAN   299,040
  AUT1     BAN   403,020
  AUT      BAN 2,419,248
REF_TITL ZIP   1,840,525
  TITLE0   BAN   299,040
  TITLE1   BAN   365,264
  TITLE    BAN 2,978,768
REF_COMM ZIP      92,982
  REFCOM0  BAN   299,020
  REFCOM1  BAN    49,324
  REFCOM   BAN   105,032
PC_SHELL ZIP     146,967
  BANK     EXE   468,640
  BANK1    MAP   151,200
  BANK2    MAP   151,200
-----------------------------------------------------------------
Standard disclaimer:
Author(s) will in no way be held liable for any loss of profit or
any other commercial damage, including but not limited to special,
incidental, consequential or other damages, arising from use of this
package. You may use it only with the understanding that
you do so at your own risk and that your use of the software
and datafiles constitutes your agreement to this disclaimer.