Announcements of the Protein Information Resource
2 May 1994
Contents
1. PIR-International Protein Sequence Database Release 40.00
2. Summary of Database Developments in Release 40.00
3. The March 1994 ATLAS of Protein and Genomic Sequences CD-ROM
4. The Complex Carbohydrate Structure Database and CarbBank
Announcements
1. PIR-International Protein Sequence Database Release 40.00
Release 40.00 of the PIR-International database, Release 14.00 of the NRL_3D
Database (corresponding to Brookhaven Protein Data Bank Release 65), and
Release 5.1 of the PIR ALN Database of Protein Sequence Alignments are now
available through the PIR On-line System and the PIR Network Request Server.
The PIR1, PIR2, PIR3, and NRL_3D databases are distributed on tape, and those
databases plus the ALN Database are distributed on CD-ROM.
Database  Release  Sequences   Residues  Description
PIR1      40.00       12,227  4,454,283  Classified and Annotated Entries
PIR2      40.00       34,147  9,362,019  Annotated Entries
PIR3      40.00       21,049  5,930,994  Unverified Entries
NRL_3D    14.00        2,722    484,598  Protein Sequences in Brookhaven PDB
ALN        5.1         1,133 Entries     Protein Sequence Alignments
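
As a quick cross-check of the release figures, the short Python sketch below
sums the sequence and residue counts for the three sections of the Protein
Sequence Database (PIR1, PIR2, and PIR3). The counts are copied from the
release table; the names RELEASE_40 and totals are illustrative only and are
not part of any PIR software.

```python
# Release 40.00 figures copied from the release table: (sequences, residues).
RELEASE_40 = {
    "PIR1": (12_227, 4_454_283),
    "PIR2": (34_147, 9_362_019),
    "PIR3": (21_049, 5_930_994),
}


def totals(sections):
    """Return (total sequences, total residues) over all sections."""
    total_sequences = sum(seqs for seqs, _ in sections.values())
    total_residues = sum(residues for _, residues in sections.values())
    return total_sequences, total_residues


total_sequences, total_residues = totals(RELEASE_40)
mean_length = total_residues / total_sequences  # average residues per entry

print(f"{total_sequences:,} sequences, {total_residues:,} residues")
print(f"mean entry length: {mean_length:.1f} residues")
```

Together the three sections come to 67,423 sequences and 19,747,296 residues,
an average of just under 293 residues per entry.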
The NRL_3D Database contains protein sequences extracted from the Brookhaven
Protein Data Bank (PDB) coordinate data files. Introduced by the PIR in 1990
as an interface between the Protein Sequence Database and the PDB, it was the
first sequence database providing access to the PDB data via computerized
sequence searching and comparison methods. The ALN Database contains multiple
sequence alignments of selected protein sequences from the PIR-International
Protein Sequence Database.
Growth of the PIR databases is documented in the file DBGROWTH.LIS available
through the PIR Network Request Server. The following files are also available
through the Server:
PADD.LIS PIR1 and PIR2 entries added since Release 39.00
PREV.LIS PIR1 and PIR2 entries with revised sequences since Release 39.00
SPECIES.LIS species recorded in PIR1 and PIR2
SUPERFAM.LIS superfamilies recorded in PIR1 and PIR2
KEYWORDS.LIS keywords employed in PIR1 and PIR2
FEATURES.LIS features catalogued in PIR1 and PIR2
JOURNALS.LIS recognized journal abbreviations
ALNBASE.LIS a description of the ALN Database
ALNTITLE.LIS titles in the ALN Database
NRLTITLE.LIS titles in the NRL_3D Database
To obtain these and other files from the PIR Network Request Server, requests
should be sent to:
FILESERV at GUNBRF.BITNET or
FILESERV at NBRF.Georgetown.Edu
2. Summary of Database Developments in Release 40.00
The enhanced NBRF format was introduced with Release 39.00. These format
enhancements were undertaken to
(1) improve the coverage, accuracy, and completeness of the PIR-International
Protein Sequence Database,
(2) provide additional data fields and define them more precisely so that
conversions to other formats or database systems (RDBMS or OODBMS) can
be accomplished more easily,
(3) make the overall presentation more uniform for human readability and
easier to parse by computer, to facilitate automatic checking for correct
format, syntax, and vocabulary within the database, and
(4) make the two flat file distribution formats of the PIR-International,
the NBRF format and the CODATA format, more completely interconvertible
without any degradation of information.
Because we realized that planned changes could cause software problems if our
users were not given advance notice, we set up a developers mailing list and
began issuing the PIR Technical Development Bulletin. The fourth Bulletin
documented the changes that would be introduced with the enhanced NBRF Format
in Release 39.00. It is available in the file PIRTECH.LIS, which can be sent by
the PIR Network Request Server or picked up by anonymous FTP from the UH
Gene-Server, ftp.bchs.uh.edu, IP address 129.7.2.43. This electronic bulletin
provides detailed specifications of the database format and serves as an "early
warning system" for software developers and others who are concerned about
changes in the format and standards for the PIR databases. If you are
interested in the technical aspects of these database changes and would like to
be placed on the mailing list for the Technical Bulletin, send a brief
electronic mail note to
POSTMAST at GUNBRF.BITNET or
POSTMASTER at NBRF.Georgetown.Edu.
Descriptions of the CODATA Exchange Format and of PIR feature annotations can
be obtained from the PIR Network Request Server in the files CXFSD.LIS and
FEATDOC.LIS respectively.
3. The March 1994 ATLAS of Protein and Genomic Sequences CD-ROM
The new release of the ATLAS of Protein and Genomic Sequences CD-ROM is now
available for distribution.
The ATLAS Information Retrieval program provides direct and simultaneous
retrieval from the databases included on the CD-ROM or on mounted secondary
CD-ROMs. In this release of the ATLAS CD-ROM, versions of the ATLAS program are
provided for these operating systems:
PC-DOS,
VAX/VMS,
OpenVMS Alpha AXP,
DEC OSF/1 Alpha AXP,
DEC ULTRIX (RISC),
SunOS,
SGI/IRIX, and
Macintosh
The ATLAS program provides a user-friendly environment where entries from
selected databases can be linked dynamically for simultaneous retrieval on
biological annotations and bibliographic information, such as protein names,
superfamily names, homology domains, organism names, gene names, keywords,
feature descriptions, authors' names, etc. The ATLAS program also enables
selected sets of sequences to be searched directly either for exact subsequences
or for patterns. A complete and comprehensive Installation and User's Guide is
provided on the CD-ROM and the ATLAS program itself contains an integrated help
facility.
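
The two kinds of sequence search mentioned above, exact subsequence and
pattern, can be illustrated with a short Python sketch. This is not the ATLAS
implementation, and ATLAS's own pattern syntax is not described here; the
sketch uses Python regular expressions instead. The entry identifiers and
sequences are invented for illustration.

```python
import re


def find_exact(sequences, subsequence):
    """Map entry id -> list of 0-based start positions of an exact subsequence."""
    hits = {}
    for entry_id, seq in sequences.items():
        positions = []
        start = seq.find(subsequence)
        while start != -1:
            positions.append(start)
            # Step forward by one so overlapping occurrences are also found.
            start = seq.find(subsequence, start + 1)
        if positions:
            hits[entry_id] = positions
    return hits


def find_pattern(sequences, pattern):
    """Map entry id -> list of (start, matched text) for a regex pattern."""
    # The lookahead wrapper lets overlapping matches be reported as well.
    regex = re.compile("(?=(" + pattern + "))")
    hits = {}
    for entry_id, seq in sequences.items():
        matches = [(m.start(), m.group(1)) for m in regex.finditer(seq)]
        if matches:
            hits[entry_id] = matches
    return hits


# Hypothetical entries; these are not taken from the PIR databases.
sequences = {
    "EXAMPLE1": "MKTAYIAKQRQISFVKSHFSRQ",
    "EXAMPLE2": "GSHMNGTKASNGSA",
}

exact_hits = find_exact(sequences, "SH")
# N, any residue but P, then S or T, then any residue but P.
pattern_hits = find_pattern(sequences, "N[^P][ST][^P]")
print(exact_hits)
print(pattern_hits)
```

An exact search needs only string matching, whereas a pattern search lets a
single query stand for a whole family of related subsequences.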
The ATLAS CD-ROM contains specially configured versions of the FASTA programs
that allow the protein sequence databases on the CD-ROM to be searched by
sequence directly. These programs will execute on PC-DOS, VAX/VMS, and DEC
ULTRIX systems.
The ATLAS CD-ROM includes:
- PIR1, PIR2, PIR3, NRL_3D, and ALN data sets
- release 39.06 of the MIPS PATCHX data set
- release 2.1 of the JIPID ECOLI (Escherichia coli) Nucleic Acid Sequence
Database
- release 81.0 of the NCBI-GenBank Genetic Sequence Databank GBNEW data set
- indexes for release 81.0 of the NCBI-GenBank Genetic Sequence Databank
- release 8 of Complex Carbohydrate Structure Database
The MIPS PATCHX data set has been assembled from a collection of other public
domain protein sequence databases. When used in conjunction with the MIPS
PATCHX data set, the Protein Sequence Database provides the most complete
collection of protein sequence data currently available in the public domain.
The ECOLI Nucleic Acid Sequence Database compiled by scientists at JIPID and
NBRF is a comprehensive, nonredundant, fully merged (all recognized contigs are
assembled into single sequence segments), and annotated database containing
sequence information from the GenBank, EMBL, and NBRF nucleic acid sequence
databases, plus information entered directly from published reports. Protein
coding regions are annotated in the feature tables, as are additional features
such as promoter regions, Shine-Dalgarno sequences, and transcription
termination sequences. The protein coding regions are directly cross-referenced
to the PIR-International Protein Sequence Database and features are formatted
to allow direct translation by computer. Overlapping sequences are merged and
ordered by map position. When their orientation is known, sequence segments are
represented in the same direction (the plus strand). Genetic map positions are
directly correlated with the Kohara physical map using an algorithm developed
by Kunisawa and coworkers that compares restriction fragment lengths, directly
incorporating information on restriction site distances while avoiding site
inversion problems.
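
The idea of merging overlapping sequence segments and ordering them by map
position can be sketched in Python as a simple interval merge. This is only an
illustration under simplifying assumptions: segments are reduced to
hypothetical (start, end) map coordinates, whereas the actual database assembly
works on the sequences themselves. It is also not the algorithm of Kunisawa and
coworkers for correlating genetic and physical maps.

```python
def merge_segments(segments):
    """Merge overlapping (start, end) segments; return them ordered by start.

    Coordinates are hypothetical map positions with start <= end.
    Segments that overlap or touch are combined into one contiguous segment.
    """
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            # Overlaps (or abuts) the previous segment: extend it.
            previous_start, previous_end = merged[-1]
            merged[-1] = (previous_start, max(previous_end, end))
        else:
            merged.append((start, end))
    return merged


# Invented coordinates, given out of order on purpose.
segments = [(500, 900), (100, 300), (250, 400), (900, 950)]
contigs = merge_segments(segments)
print(contigs)
```

Sorting first means that each segment has to be compared only with the most
recently merged one, so the result is both nonredundant and ordered by position.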
Because of its size, the complete GenBank Sequence Databank can no longer be
included on the ATLAS CD-ROM. All of the GBNEW dataset is provided, and the
LOCUS and TITLE information is available for the 14 other datasets. Index files
for the NCBI-GenBank Genetic Sequence Databank release 81.0 are also provided,
so that on VAX/VMS and MS-DOS systems with multiple CD-ROM drives the ATLAS
program can access the NCBI-GenBank Sequence Databank mounted on a secondary
CD-ROM drive.
Through the cooperation of CarbBank, the Complex Carbohydrate Structure
Database (CCSD) and its associated CarbBank software are now included on the
ATLAS of Protein and Genomic Sequences CD-ROM. The ATLAS CD-ROM includes
documentation and an Installation Manual and Tutorial for CarbBank. The ATLAS
program cannot access the CCSD. The CCSD and CarbBank are discussed in more
detail in the next section.
Orders for the ATLAS CD-ROM are accepted, WITHOUT PREPAYMENT, on institutional
purchase orders, by FAX or E-mail. For further information in the US and the
Americas, please contact:
Kathryn Sidman, Technical Services Coordinator
Protein Information Resource (PIR)
National Biomedical Research Foundation (NBRF)
3900 Reservoir Rd., NW
Washington DC 20007
FAX: (202) 687-1662
phone: (202) 687-2121
E-mail: PIRMAIL at nbrf.georgetown.edu
PIRMAIL at gunbrf.bitnet
In Europe contact:
Martinsried Institute for Protein Sequences (MIPS)
Max-Planck-Institute for Biochemistry
8033 Martinsried, Germany
FAX: 49 89 8578 2655
phone: 49 89 8578 2657
E-mail: mewes at ehpmic.mips.biochem.mpg.de
In Asia and Oceania contact:
Japan International Protein Information Database (JIPID)
Science University of Tokyo
2669 Yamazaki, Noda 278 Japan
FAX: 81 47 122 1544
phone: 81 48 124 1501
E-mail: Tsugita at JPNSUT31.BITNET
4. The Complex Carbohydrate Structure Database and CarbBank
This release of the ATLAS CD-ROM includes the Complex Carbohydrate Structure
Database (CCSD) release 8 and CarbBank version 2.5. The CCSD is a database that
contains complex carbohydrate structures and associated text information
derived from scientific publications. The database has a flat file format.
Structural abbreviations and nomenclature are similar to those found in the
journal Carbohydrate Research. CarbBank is the computer management system for
CCSD database files. CarbBank runs on IBM-compatible microcomputers under PC-DOS
or MS-DOS and has a menu-driven user interface. CarbBank has an Editor
that allows you to create or modify database records and a Searcher that will
let you find records based on Search Criteria that you supply. A Report
generation facility allows the user to create a variety of reports on the
contents of databases, and an Interchange module allows CarbBank to view
reports and to exchange records among ASCII text files, a CarbBank-specific
version of the CCSD, and an ASN.1 version of the CCSD.
The CarbBank program cannot operate from a floppy diskette, from a CD-ROM, or
from a write-protected disk. There are other minimum system and hardware
requirements. Please consult the CarbBank documentation or contact CarbBank
before attempting to install this software on your PC.
For information about CarbBank contact:
Dana Smith
CarbBank/CCSD Manager
114 W. Magnolia St.
Suite 305
Bellingham, WA 98225, USA
Phone: (206) 733-7183
FAX: (206) 733-7283
EMail: Internet: 76424.1122 at Compuserve.Com
------------------------------------------------------------------------
Inquiries about how to obtain the PIR-International Protein Sequence Database:
Ms. Katie Sidman
PIR Technical Services Coordinator
National Biomedical Research Foundation
3900 Reservoir Road NW
Washington DC 20007
Phone: (202) 687-2121
FAX: (202) 687-1662
EMail: PIRMAIL at nbrf.georgetown.edu
------------------------------------------------------------------------
Dr. Winona C. Barker, Director
Protein Information Resource
National Biomedical Research Foundation
Washington DC 20007
BARKER at nbrf.georgetown.edu