Announcements of the Protein Information Resource
Network Request Service
Highlights
1. Hints for Retrieving Sequence Database Entries
2. PIR Network Request Service Command Summary
1. Hints for Retrieving Sequence Database Entries
The first thing to appreciate about the sequence databases is that the
most commonly sought information for every entry is contained in the title
field. The title field contains the protein name, the source organism and, if
the protein is an enzyme, the EC number. The keyword field, on the other hand,
does not necessarily duplicate what is in the title. It is designed to provide
ancillary retrieval information that is not conveyed in a protein name, such
as
* disease or resistance states associated with the protein
    ACQUIRED IMMUNE DEFICIENCY SYNDROME or CYANATE RESISTANCE
* metabolic roles or pathways
    CALCIUM TRANSPORT or PENTOSE PHOSPHATE PATHWAY
* posttranslational modification processes
    HYDROXYLATION
* tissues, cell types or subcellular components that are the origin of or
  the targets of the protein
    HEART, LEUKOCYTE or MITOCHONDRIAL MATRIX
* structural characteristics
    TRIMER or ZINC FINGER
* larger classification schemes into which the protein may fall
    SERINE PROTEASE or STRUCTURAL PROTEIN
Most searches are for information contained in the title field. The most
common reason a keyword search fails is that a protein name was supplied;
protein names are found in the title, not in the keyword list. A list of
the keywords found in the current public distribution release of the PIR can be
obtained by using the command
SEND KEYWORDS
The keywords used by the PIR correspond closely to the MESH terms of the
National Library of Medicine.
When only one field is being searched, all the words that follow the field name
must be found in the same entry for there to be a "hit". This means that all
the words on one command line form a logical AND; a QUERY that repeats the same
field connected by AND is unnecessary. Furthermore, since the title combines
both the protein name and the source organism, the title and species can be
searched in a single command; for example,
QUERY
TITLE ALPHA
AND
TITLE HEMOGLOBIN
AND
SPECIES HUMAN
END QUERY
can be simply combined as
TITLE HUMAN ALPHA HEMOGLOBIN
On the other hand, an OR operation is equivalent to combining the results
of several separate searches; for example,
QUERY
TITLE HUMAN ALPHA HEMOGLOBIN
OR
TITLE HUMAN DELTA HEMOGLOBIN
END QUERY
would achieve the same result as the two separate TITLE searches.
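
The AND and OR behavior described above can be pictured as set operations on
entry codes. The following Python sketch is only an illustration: the entry
codes and titles are invented, and simple substring matching is assumed in
place of the server's real index.

```python
# Hypothetical entry codes and titles, invented for illustration only.
TITLES = {
    "E1": "HEMOGLOBIN ALPHA CHAIN - HUMAN",
    "E2": "HEMOGLOBIN DELTA CHAIN - HUMAN",
    "E3": "HEMOGLOBIN ALPHA CHAIN - MOUSE",
}


def title_search(words, titles=TITLES):
    """Return the codes whose title contains every word (an implicit AND).

    Assumption: plain substring matching stands in for the server's index.
    """
    wanted = words.upper().split()
    return {code for code, title in titles.items()
            if all(word in title.upper() for word in wanted)}


# One command line with three words is already a logical AND ...
combined = title_search("HUMAN ALPHA HEMOGLOBIN")
# ... and gives the same hits as intersecting three one-word searches.
separate = (title_search("ALPHA") & title_search("HEMOGLOBIN")
            & title_search("HUMAN"))

# OR is the union of the results of two separate searches.
either = (title_search("HUMAN ALPHA HEMOGLOBIN")
          | title_search("HUMAN DELTA HEMOGLOBIN"))
```

In this toy data the single-line search and the three-way intersection both
select only the human alpha chain, and the union adds the human delta chain.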
The Boolean operators must be placed on separate lines, not on the same line
as another command; for example,
TITLE CYTOCHROME AND P450
will fail because only entries with the character string "AND" in the title
along with "CYTOCHROME" and "P450" will hit.
TITLE CYTOCHROME P450
means "search for titles containing both strings 'CYTOCHROME' and 'P450'
in either order". Double quotation marks can be used to change the meaning
slightly:
TITLE "CYTOCHROME P450"
means "search for titles containing the string 'CYTOCHROME P450'". The
double quotation marks must be used when some part of the search string
is fewer than 3 characters long; for example,
TITLE "CYTOCHROME C"
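
The quoting rule can be applied mechanically when requests are generated by a
program. The helper below is a hypothetical sketch, not part of the PIR
software: it quotes the search string whenever any word in it is shorter than
3 characters, and it uppercases the text only to follow the convention of the
examples in this bulletin.

```python
def title_command(search_string):
    """Build a TITLE command line, quoting it when the rule requires.

    Hypothetical helper: quotes are added if any word has fewer than
    3 characters, as in the "CYTOCHROME C" example.
    """
    words = search_string.upper().split()
    text = " ".join(words)
    if any(len(word) < 3 for word in words):
        return f'TITLE "{text}"'
    return f"TITLE {text}"


short_word = title_command("cytochrome c")
long_words = title_command("cytochrome P450")
```

Note that quoting also changes the meaning from "both words in either order"
to "this exact string", so the helper quotes only when it has to.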
The Boolean NOT command can be used most effectively to remove entries
with names that are extensions of some shorter name of interest; for example,
QUERY
TITLE "CYTOCHROME C "
NOT
TITLE OXIDASE
NOT
TITLE REDUCTASE
END QUERY
will eliminate nearly everything but cytochrome c from the resulting list.
(Because the indexing scheme used by the retrieval program lumps together all
the nonalphanumeric characters, the space appearing after the "C" and before
the double quotation mark eliminates entries like "cytochrome c2" but not
"cytochrome c'" from the list.)
One very inappropriate type of request is the following.
GENE CONCANAVALIN
KEYWORD CONCANAVALIN
FEATURE CONCANAVALIN
TITLE CONCANAVALIN
SEARCH CONCANAVALIN
Specifically, "concanavalin" is not a gene name, so it will not be found
in the gene field. The word "concanavalin" is plain text, not a sequence,
so it should not appear after the SEARCH command --- only actual sequences
should appear after a SEARCH command. While "concanavalin" might possibly
appear in the keyword or feature fields, its use there would be very
specialized and not indicative of a concanavalin entry. The only command
that makes any sense is
TITLE CONCANAVALIN
The biggest problem comes when the SEARCH command is used in that way. The
futile FASTA search this generates wastes shared computer resources that can
be used by others much more fruitfully. The FASTA program has been modified
to recognize some occurrences of plain text and print a warning.
The USE command is used to restrict searches to particular databases or to
entries added or modified within a particular time period. Such restrictions
apply to all subsequent search commands in the same request and need not be
used only in queries.
After a successful search, the GET command should be used to retrieve the
actual text of an entry. The format of the GET command is either
GET database:code
or simply
GET code
There are no spaces around the colon and only one code may follow each GET
command.
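
When a search returns several entry codes, one GET line is needed for each.
The following Python sketch formats such lines according to the two rules
just given; the function names and the entry codes are placeholders invented
for illustration.

```python
def get_command(code, database=None):
    """Format one GET line: no spaces around the colon, one code per line."""
    parts = [code] if database is None else [database, code]
    for part in parts:
        if not part or any(ch.isspace() for ch in part) or ":" in part:
            raise ValueError(f"invalid database or code: {part!r}")
    return "GET " + ":".join(parts)


def get_commands(codes, database=None):
    """Return one GET line for every code in the list."""
    return [get_command(code, database) for code in codes]


# CODE1 and CODE2 are placeholder entry codes, not real entries.
lines = get_commands(["CODE1", "CODE2"], database="PIR1")
```

A value such as "CODE1 CODE2" is rejected rather than silently sent, since
only one code may follow each GET command.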
There are a few special considerations to keep in mind when using the NRL_3D
database of sequence information extracted from the Brookhaven Protein Data
Bank. Only these fields in NRL_3D are indexed and can be searched through the
PIR Server: TITLE, SPECIES, FEATURE and the sequence. At this time the TITLE
field consists of the COMPND records from the Brookhaven Protein Data Bank file
as well as the species. In most cases your search will be for something in
this TITLE or name field. For example, after an initial
USE BASES NRL_3D
the command
TITLE MYOGLOBIN
will retrieve a list of all the myoglobin sequences in the PDB and
SPECIES MOUSE
will retrieve a list of all the mouse sequences. The SPECIES field is not 100%
accurate for the NRL_3D because of some eccentricities in the SOURCE records of
the PDB used to construct it. Although there is a KEYWORD field in NRL_3D
entries, it is constructed directly from the PDB HEADER record and is not
indexed.
With release 10.00 of NRL_3D the PIR will cease converting all of each PDB
release. Instead only new and modified entries will be converted; the NRL_3D
entries will gradually be modified to standardize spelling, capitalization,
nomenclature, taxonomy and keywords. With this standardization the KEYWORD
field will become more meaningful and probably be indexed within the coming
year.
2. PIR Network Request Service Command Summary
The National Biomedical Research Foundation Protein Information Resource
network request service is a full-function fileserver and database query
system. Operating since August 1990, it is capable of handling database
queries, sequence searches and sequence submissions, in addition to
fileserver requests. To use this server, request commands should be sent to
FILESERV at GUNBRF on BITNET or FILESERV at NBRF.Georgetown.EDU on Internet.
The server recognizes the following commands sent either in a mail message,
or (if the sender is on BITNET) in a command message or a file:
Command       Action
-----------   ------------------------------------------------
ACCESSION     list entry codes and titles by accession number
AND           combine QUERY commands with Boolean AND
AUTHOR        list entry codes and titles by author
BASES         list accessible databases
CROSS         list PIR entry codes and titles corresponding to
              a particular nucleic sequence database entry
DEPOSIT       deposit entry for database submission
END DEPOSIT   terminate deposit entry
FEATURE       list entry codes and titles by feature table entry
GENE          list entry codes and titles for a gene name
GET           return entry by entry code
HELP          return HELP instructions
HOST          list entry codes and titles by host species
INDEX         list SENDable files
JOURNAL       list entry codes and titles by journal citation
KEYWORD       list entry codes and titles by keyword
MEMBER        list alignments containing entry code as a member
NOT           combine QUERY commands with Boolean NOT
OR            combine QUERY commands with Boolean OR
QUERY         begin collecting QUERY commands
END QUERY     terminate collecting commands and execute QUERY
QUIT          ignore the remaining text (E-mail signature blocks)
RETURN        change return address for gateway mail
SEARCH        search for matching sequences by FASTA procedure
END SEARCH    terminate sequence for searching
SEND          send file
SPECIES       list entry codes and titles by species
SUGGEST       leave suggestion or correction for PIR staff
END SUGGEST   terminate suggestion text
SUPERFAMILY   list entry codes and titles by superfamily name
TAXONOMY      report taxonomy for scientific or common name
TITLE         list entry codes and titles by title
USE           set databases, dates or formats to use in limited searches
Multiple commands can be sent with one command on each line of a mail message
or file. Commands should NOT be sent on the Subject line of a mail message.
Receipt of command messages and files will be acknowledged immediately. Mail
messages will be acknowledged by return mail.
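
A request of this kind can be composed with the Python standard library, as
in the sketch below. It is an illustration only: it builds the message and
does not send it, the sender address is a placeholder, and the server address
is the Internet address given above with "at" written as "@".

```python
from email.message import EmailMessage

# Internet address from this bulletin, with "at" written as "@".
SERVER_ADDRESS = "FILESERV@NBRF.Georgetown.EDU"


def build_request(commands, sender):
    """Build a mail message with one server command per body line."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = SERVER_ADDRESS
    # No Subject header is set: commands must not go on the Subject line.
    msg.set_content("\n".join(commands) + "\n")
    return msg


# user@example.org is a placeholder sender address.
request = build_request(["USE BASES NRL_3D", "TITLE MYOGLOBIN"],
                        "user@example.org")
```

The resulting message could then be handed to any mail transport; that step
depends on the local system and is left out here.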
For help in using any of the commands, send a request of the form
HELP topic
for example
HELP SEARCH
In addition to the commands, help instructions are also available on the
following topics:
Custom_Services
Databases
FTP
Gateway_Access
Help_en_Espanol
Help_en_francais
Hints
IBM-VM_BITNET
On-Line_Access
PIR_Distribution
VAX-VMS_BITNET
Because of network gateway communication protocols, there are limitations on
requests sent through gateways. Users not on BITNET or INTERNET who access the
server through local or network gateways should read and carefully follow these
instructions before sending requests. Only mail message requests (not command
messages or files) can be sent through gateways. Because addresses posted on
gateway mail do not always work for the return, before you send requests
through network gateways it is strongly recommended that you first contact Dr.
John S. Garavelli (POSTMAST at GUNBRF on BITNET, POSTMASTER at NBRF.Georgetown.EDU on
Internet). We will confirm a return address for you and may instruct you to
use the RETURN command to ensure that your request output will reach you. It
is not usually necessary to do this if you are on BITNET or INTERNET, unless
your system employs a local remailer or your mail program applies a
nonstandard return address (for example a personal name on the FROM: line).
The BITNET network and the network gateways impose strict limits on file size.
Poorly posed database queries may result in output so extensive that it could
not be returned by network mail. Therefore, an output limit of 1000 lines for
each command and 3000 lines for each request is imposed by the PIR server.
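
The two limits can be expressed as a simple check. In the Python sketch
below, the list of line counts is hypothetical input (the server does not
report such estimates in advance), and it is assumed that output of exactly
1000 or 3000 lines is still allowed.

```python
COMMAND_LIMIT = 1000   # lines of output allowed for each command
REQUEST_LIMIT = 3000   # lines of output allowed for each request


def within_limits(lines_per_command):
    """Return True if the expected output fits both server limits.

    Assumption: the limits are inclusive.
    """
    return (all(n <= COMMAND_LIMIT for n in lines_per_command)
            and sum(lines_per_command) <= REQUEST_LIMIT)


fits = within_limits([1000, 1000, 1000])
one_too_long = within_limits([1001])
too_many = within_limits([900, 900, 900, 900])
```

The last case shows that a request can exceed the total even when every
individual command stays under its own limit.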
The DEPOSIT and QUERY commands, and the SEARCH and SUGGEST commands (in their
multiline form) must be followed by their respective END commands after the
text appearing on the intervening lines. The DEPOSIT command requires, and the
SEARCH command optionally uses, parameters that appear on the same line as the
command. Because these four commands are so complex, users should obtain and
carefully read the help instructions before attempting to use them.
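
For the QUERY command, the required layout (each Boolean operator on its own
line, and a closing END QUERY) can be generated rather than typed by hand.
The function below is a hypothetical Python sketch and is no substitute for
the help instructions.

```python
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def build_query(first_command, *steps):
    """Return the lines of a QUERY block.

    Each step is an (operator, command) pair; the operator is written on a
    line of its own, and END QUERY terminates the block.
    """
    lines = ["QUERY", first_command]
    for operator, command in steps:
        if operator not in BOOLEAN_OPERATORS:
            raise ValueError(f"unknown operator: {operator!r}")
        lines.extend([operator, command])
    lines.append("END QUERY")
    return lines


query = build_query('TITLE "CYTOCHROME C "',
                    ("NOT", "TITLE OXIDASE"),
                    ("NOT", "TITLE REDUCTASE"))
```

Joining the returned lines with newlines reproduces the cytochrome c example
from the hints in section 1.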
The databases available through the PIR Network Server and their abbreviations
for code specification are as follows:
Abbreviation  Database                              Update Schedule
PIR1          PIR Annotated and Classified Entries  quarterly
PIR2          PIR Preliminary Entries               approximately monthly
PIR3          PIR Unverified Entries                weekly
ALN           PIR Alignment Entries                 semiannually
NRL_3D        Brookhaven Data Bank Sequences        quarterly
PATCHX        MIPS PIR-Supplementary Database       quarterly
N             NBRF Nucleic
GB            GenBank (TM)                          as received
GBSUP         GenBank (TM)                          as received
GBNEW         GenBank (TM) New Entries              weekly
EMBL          EMBL                                  as received
EMBLSUP       EMBL                                  as received
In the FASTA output of the SEARCH command the abbreviation for PATCHX is
shortened to PATX and NRL_3D is shortened to NR3D; the longer abbreviation
should be used to retrieve an entry with the GET command. Not all commands
work with all databases; please read the information returned by the command
HELP DATABASES.
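
When GET commands are prepared from FASTA output, the shortened abbreviations
have to be expanded first. A minimal Python sketch of that step follows; the
function name and the entry code are placeholders, and only the two
shortenings mentioned above are handled.

```python
# Shortened abbreviations used in FASTA output and their full forms.
FASTA_TO_GET = {"PATX": "PATCHX", "NR3D": "NRL_3D"}


def get_line_from_fasta(abbreviation, code):
    """Return a GET line using the full database abbreviation."""
    name = abbreviation.upper()
    database = FASTA_TO_GET.get(name, name)
    return f"GET {database}:{code}"


# CODE1 is a placeholder entry code.
expanded = get_line_from_fasta("NR3D", "CODE1")
unchanged = get_line_from_fasta("PIR1", "CODE1")
```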
------------------------------------------------------------------------
Dr. John S. Garavelli
Database Coordinator
Protein Information Resource
National Biomedical Research Foundation
Washington, DC 20007
POSTMASTER at GUNBRF.BITNET
POSTMASTER at NBRF.Georgetown.Edu