Announcements of the Protein Identification Resource
Network Request Service
Highlights
1. PIR Release 34 and NRL_3D Release 10
2. Updated Distribution Information
3. Database Standardization Efforts
4. Internet Addresses for Anonymous FTP and Network Request Service
5. FASTA Searches for NRL_3D Only
6. Network Request Service Command Summary
Announcements
1. PIR Release 34 and NRL_3D Release 10
As of 30 September, Release 34 of the PIR databases and Release 10 of the
NRL_3D database (corresponding to Brookhaven Protein Data Bank Release 61)
are available through the PIR On-line system and Network Request Server.
Distribution of the tape and CD-ROMs of the new release will begin shortly.
Database    Release    Sequences     Residues
PIR1        34.00         10,550    3,591,370
PIR2        34.00         16,188    4,330,190
PIR3        34.00         18,162    5,284,017
NRL_3D      10.00          1,457      244,804
Growth of the PIR databases is documented in the file DBGROWTH.LIS available
through the Network Request Server. The following files are also available
through the Server:
the list of superfamilies in PIR1 is in SUPERFAM.LIS,
the list of keywords in PIR1 and PIR2 is in KEYWORDS.LIS,
the list of features in PIR1 and PIR2 is in FEATURES.LIS.
2. Updated Distribution Information
The databases and programs of the PIR are distributed on magnetic tape and
on TK50 and TK70 cartridges in VAX/VMS format and in ASCII card image format.
The protein databases are updated and distributed on a quarterly basis; the
sequence analysis software package is updated irregularly. The prices listed
are per release and are subject to change. Tapes may be ordered on a one-time
or on a standing order basis.
The PIR-International Protein Sequence Database ($250) contains substantially
sequenced proteins and sequences translated from nucleic acid sequences.
The database is divided into three data sets categorized by the degree of
annotation in the sequence entries. The sequences in the PIR1 data set (and
some of the PIR2 data set) have been annotated to identify post-translational
modifications, active sites, signal sequences, disulfide bonds, etc. The PIR3
data set contains minimal entries that have not yet been examined by
scientific staff. The datatape also contains the NRL_3D database of sequence
information extracted from the Brookhaven Protein Data Bank.
The VAX/VMS format of the protein sequence datatape contains the PSQ
(Protein Sequence Query) and the NAQ (Nucleic Acid Query) retrieval programs
and programs for creating user databases. As a service to our users, the PIR
also includes the files required to use the PIR database with the GCG software.
The ATLAS multidatabase retrieval program is available on CD-ROM ($100) along
with the PIR-International Protein Sequence Database, the ALN protein alignment
database, the NRL_3D database, the PATCHX database, and the GenBank Genetic
Sequence Databank. The ATLAS program is currently designed to run on PC/DOS
and VAX/VMS systems. Support for UNIX and Mac will be added.
The PATCHX database ($250) is produced by MIPS at the Max Planck Institute
for Biochemistry, Martinsried, Germany. The PATCHX database includes all
protein sequences (not identical with or contained in sequences from PIR1,
PIR2 and PIR3) from the following databases:
MIPSOwn (MIPS preliminary entries),
PIRMOD (MIPS/PIR preliminary entries),
MIPSH (MIPS yeast entries),
NRL_3D (Brookhaven Data Bank sequences),
MIPSTrn (MIPS preliminary translations),
EMTrans (EMBL translation by F. Pfeiffer),
SwissProt,
GenPept (GenBank(R) translation by Los Alamos Nat. Lab.),
Kabat,
and PSeqIP.
All sequences that are IDENTICAL within or between databases are presented
only ONCE. Sequences completely contained within others have also been removed.
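The two PATCHX filtering rules (identical sequences presented once, contained sequences removed) can be illustrated with a short sketch. This is not the MIPS procedure, which is not described here. The function name, the (identifier, sequence) input shape, and the optional `reference` argument standing in for PIR1, PIR2 and PIR3 are all assumptions made for illustration.

```python
def nonredundant(entries, reference=()):
    """Drop exact duplicates and sequences contained in another sequence.

    entries   -- iterable of (identifier, sequence) pairs (hypothetical shape)
    reference -- sequences that also count as "containers" but are never
                 returned, e.g. the contents of PIR1, PIR2 and PIR3
    """
    # Rule 1: identical sequences are presented only once (first one wins).
    unique = {}
    for ident, seq in entries:
        unique.setdefault(seq, ident)

    # Rule 2: visit longest sequences first, so that every possible
    # container of a sequence has been seen before the sequence itself.
    covering = list(reference)
    kept = []
    for seq in sorted(unique, key=len, reverse=True):
        if any(seq in other for other in covering):
            continue  # identical to, or contained in, a kept/reference sequence
        covering.append(seq)
        kept.append((unique[seq], seq))
    return kept


entries = [
    ("A", "MKTAYIAKQR"),
    ("B", "MKTAYIAKQR"),  # identical to A
    ("C", "TAYIA"),       # contained in A
    ("D", "GGGSSS"),      # contained in the reference sequence
]
print(nonredundant(entries, reference=["XXGGGSSSXX"]))
```

The containment test is a plain substring check and compares every pair, which is adequate for a demonstration but would be slow on databases of the size listed in section 1.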
The NBRF-PIR Sequence Analysis Software tape ($200) contains programs designed
to run on a VAX computer operating under VMS version 5. All programs are
written in VAX-11 Fortran (a superset of ANSI Fortran 77), with the exception
of the Lipman-Pearson programs (FASTA, RDF), which are written in VAX-11 C.
Included are:
database searching programs (SEARCH, ISEARCH, FASTA);
global similarity programs (ALIGN, IALIGN);
local similarity programs (RELATE & DOTMATRIX);
and prediction programs (PRPLOT & CHOFAS - from the IDEAS package).
More information about the databases, sequence analysis programs, tapes,
on-line services, custom services or prices can be obtained by contacting:
Kathryn E. Sidman
Protein Identification Resource
National Biomedical Research Foundation
3900 Reservoir Road, NW
Washington, DC 20007
Phone: (202) 687-2121
FAX: (202) 687-1662
E-mail: PIRMAIL at GUNBRF.BITNET
3. Database Standardization Efforts
The combined staffs of the PIR-International have been engaged in a vigorous
effort to standardize the keyword and features records occurring in the PIR1
and PIR2 databases. Previous efforts to standardize the species and reference
records and the title records for enzymes had been very successful. The
standardization effort progressed by:
(1) determining the complete variety of information that existed in those
records,
(2) formulating rules for which forms were acceptable and which were not,
(3) imposing those rules by correcting the non-compliant entries and
introducing additional checking procedures during the data entry process.
The success of this standardization effort for the keyword records can be
judged from these results: in Release 30 there were 1614 different keywords
with 63% of those keywords appearing in fewer than 4 entries; in Release 34
there are 1037 different keywords and 40% of those keywords appear in fewer
than 4 entries. The following table provides a more complete breakdown.
Frequency of Keywords
Frequency      Different Keywords
in Entries     Rel. 30    Rel. 34
>400                 7         12
201-400             10         24
101-200             19         42
51-100              38         58
26-50               61         61
13-25              103        105
7-12               131        135
4-6                218        185
2-3                395        208
1                  632        207
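The totals and percentages quoted in the text can be rechecked from the table. The sketch below copies the counts from the table; the variable and function names are illustrative only.

```python
# Number of different keywords per frequency bin, in table order
# (>400 down to 1), copied from the table above.
REL30 = [7, 10, 19, 38, 61, 103, 131, 218, 395, 632]
REL34 = [12, 24, 42, 58, 61, 105, 135, 185, 208, 207]


def rare_share(counts):
    """Fraction of keywords appearing in fewer than 4 entries.

    The last two bins ("2-3" and "1") are exactly the keywords
    that occur in fewer than 4 entries.
    """
    return sum(counts[-2:]) / sum(counts)


for label, counts in (("Release 30", REL30), ("Release 34", REL34)):
    print(f"{label}: {sum(counts)} keywords, "
          f"{rare_share(counts):.1%} in fewer than 4 entries")
```

The column sums reproduce the 1614 and 1037 keywords quoted above, and the shares come out at about 63.6% and 40.0%, in line with the 63% and 40% given in the text.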
4. Internet Addresses for Anonymous FTP and Network Request Service
During September the PIR Network Request Service was made available through
the National Biomedical Research Foundation's Internet address. For users
on BITNET the address remains FILESERV at GUNBRF. For users on Internet and
other networks with gateways to Internet the preferred address is now
FILESERV at NBRF.Georgetown.Edu.
The last part of this announcement provides a synopsis of instructions
for using this database query and FASTA sequence search service.
Each PIR release and its accompanying NRL_3D release are available for
anonymous FTP from the UH Gene-Server, ftp.bchs.uh.edu, IP address 129.7.2.43.
The login is "anonymous" and the password is your e-mail address. The files
are kept in pub/gene-server/pir/pir_relXX/{ascii,vms}. "XX" is the release
number. All files are stored as Unix 16-bit compressed files and the file
names end in .Z (e.g. pir.1.dat.Z) as a reminder.
The "ascii" directory contains the CODATA format files, and the "vms"
directory the NBRF format files and indices in VMS format. Note that two of
the files required by GCG V.7.X are not included; those can be generated by
GCG-supplied utilities.
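As an illustration of the directory layout just described, the sketch below builds the path for a given release and format and shows one way to retrieve a file by anonymous FTP with Python's standard ftplib. The host and base directory are taken from this announcement. The function names are hypothetical, and writing "XX" as the plain release number (34 giving pir_rel34) is an assumption.

```python
import posixpath
from ftplib import FTP

HOST = "ftp.bchs.uh.edu"          # from this announcement
BASE_DIR = "pub/gene-server/pir"  # from this announcement
FORMATS = ("ascii", "vms")        # CODATA files / NBRF files and VMS indices


def release_dir(release, fmt):
    """Return the server directory for one release and one format."""
    if fmt not in FORMATS:
        raise ValueError(f"format must be one of {FORMATS}, got {fmt!r}")
    # Assumption: "XX" is the undecorated release number.
    return posixpath.join(BASE_DIR, f"pir_rel{release}", fmt)


def fetch(release, fmt, name, email):
    """Download one .Z file into the current directory by anonymous FTP."""
    with FTP(HOST) as ftp:
        # Login is "anonymous"; the password is your e-mail address.
        ftp.login(user="anonymous", passwd=email)
        ftp.cwd(release_dir(release, fmt))
        with open(name, "wb") as out:
            ftp.retrbinary(f"RETR {name}", out.write)


print(release_dir(34, "ascii"))
```

A call such as fetch(34, "ascii", "pir.1.dat.Z", "user@example.org") would attempt the transfer; which directory holds that particular file is not stated here, so the arguments are placeholders. The downloaded file still has to be uncompressed.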
Uncompress utilities are available for non-Unix systems:
the DOS archive sites have a file "cmprs430.zip";
the Info-Mac archives have "maccompress-32.hqx";
and various VMS archives have "lhzcomp.exe" or "decompress.exe".
"decompress.exe" is also available in pub/gene-server/pir, with a sample
(but non-working) .CLD file.
Questions about the FTP server can be directed to Dan Davison, davison at uh.edu.
Our thanks to Bill Pearson and Dan Davison for their efforts in providing FTP
access to the PIR databases.
5. FASTA Searches for NRL_3D Only
Some users have asked to run FASTA sequence searches only against the
sequences with known 3-dimensional structures, that is, the sequences
extracted from the Brookhaven Protein Data Bank in NRL_3D. Normally our
FASTA searches are done against all the protein databases, PIR1, PIR2, PIR3,
the non-redundant PATCHX (described in the August announcement and in part 2
above) and NRL_3D. Now when the command
USE BASES NRL_3D
is used before a SEARCH command, only the NRL_3D database will be used for
the FASTA search. Otherwise, all the protein databases will be used.
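The ordering rule can be sketched as a small helper that assembles the text of a request message. Only the USE BASES NRL_3D line and its position before the SEARCH command come from this announcement. The helper name is hypothetical, and the SEARCH lines are passed in unchanged by the caller because their full syntax is not given here.

```python
def build_request(search_lines, nrl3d_only=False):
    """Assemble the body of a request to the server.

    search_lines -- the caller's SEARCH command lines, passed through
                    unchanged (their syntax is not covered here)
    nrl3d_only   -- if True, restrict the FASTA search to NRL_3D
    """
    lines = []
    if nrl3d_only:
        # Must come before the SEARCH command to take effect.
        lines.append("USE BASES NRL_3D")
    lines.extend(search_lines)
    return "\n".join(lines) + "\n"


# "SEARCH" alone is a placeholder, not a complete command.
print(build_request(["SEARCH"], nrl3d_only=True))
```

With nrl3d_only left at False, the message carries no USE BASES line and the search runs against all the protein databases, as before.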
Thanks to Ada Prochnicka-Chalufour at the Pasteur Institute for her helpful
suggestion and her hospitality this spring.
6. PIR Network Request Service Command Summary
The National Biomedical Research Foundation Protein Identification Resource
network request service is a full-function fileserver and database query
system. It has been operating since August 1990 and is capable of handling
database queries, sequence searches and sequence submissions, in addition to
fileserver requests. To use this server, request commands should be sent to
FILESERV at GUNBRF on BITNET. The FILESERVer recognizes the following commands
sent either in a mail message, or (if the sender is on BITNET) in command
messages or in a file:
Command      Action
-------      -----------------------------------------------
ACCESSION    list entry codes and titles by accession number
AND          combine QUERY commands with Boolean AND
AUTHOR       list entry codes and titles by author
BASES        list accessible databases
CROSS        list PIR entry codes and titles corresponding to
             a particular nucleic sequence database entry
DEPOSIT      deposit entry for database submission
END DEPOSIT  terminate deposit entry
FEATURE      list entry codes and titles by feature table entry
GENE         list entry codes and titles for a gene name
GET          return entry by entry code
HELP         return HELP instructions
HOST         list entry codes and titles by host species
INDEX        list SENDable files
JOURNAL      list entry codes and titles by journal citation
KEYWORD      list entry codes and titles by keyword
MEMBER       list alignments containing entry code as a member
NOT