
Announcements of PIR Network Request Service

POSTMASTER at NBRF.GEORGETOWN.EDU
Tue Nov 3 14:48:23 EST 1992


               Announcements of the Protein Information Resource
                            Network Request Service

Highlights
1. Hints for Retrieving Sequence Database Entries
2. PIR Network Request Service Command Summary


1. Hints for Retrieving Sequence Database Entries

The very first thing to appreciate about the sequence databases is that the
most commonly sought information for every entry is contained in the title
field.  The title field contains the protein name, the source organism and the
EC number if it's an enzyme.  On the other hand, the keyword field contains
information that does not necessarily duplicate what is in the title.  What the
keyword field is designed to do is provide ancillary retrieval information that
is not conveyed in a protein name, such information as
  * disease or resistance states associated with the protein
    ACQUIRED IMMUNE DEFICIENCY SYNDROME or CYANATE RESISTANCE
  * metabolic roles or pathways
    CALCIUM TRANSPORT or PENTOSE PHOSPHATE PATHWAY
  * posttranslational modification processes
    HYDROXYLATION
  * tissues, cell types or subcellular components that are the origin of or
    the targets of the protein
    HEART, LEUKOCYTE or MITOCHONDRIAL MATRIX
  * structural characteristics
    TRIMER or ZINC FINGER
  * larger classification schemes the protein may fall in
    SERINE PROTEASE or STRUCTURAL PROTEIN
Most searches are for information contained in the title field.  The most
common reason a keyword search fails is that a protein name was used as the
search term; protein names are found in the title, not in the keyword list.  A
list of the keywords found in the current public distribution release of the
PIR can be obtained by using the command
   SEND KEYWORDS
The keywords used by the PIR correspond closely to the MESH terms of the
National Library of Medicine.  

When only one field is being searched, all the words that follow the field name
must be found in the same entry for there to be a "hit".  This means that all
the words on one command line form a logical AND; a QUERY that repeats the same
field connected by AND is unnecessary.  Furthermore, since the title combines
both the protein name and the source organism, the title and species can be
searched in a single command; for example,
   QUERY
   TITLE ALPHA
   AND
   TITLE HEMOGLOBIN
   AND
   SPECIES HUMAN
   END QUERY
can be simply combined as
   TITLE HUMAN ALPHA HEMOGLOBIN
OR operations, on the other hand, are equivalent to combining the results
of several separate searches; for example,
   QUERY
   TITLE HUMAN ALPHA HEMOGLOBIN
   OR
   TITLE HUMAN DELTA HEMOGLOBIN
   END QUERY
would achieve the same result as the two separate TITLE searches.
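
The AND and OR rules above can be illustrated with a toy model.  This is a
minimal sketch, not the PIR server's code: the entry codes and titles are
invented sample data, and matching on whole words is an assumption about how
the server compares search terms with titles.

```python
# Toy model of the matching rules described above; NOT the PIR server's code.
# The entry codes and titles are invented sample data.
TITLES = {
    "E1": "HEMOGLOBIN ALPHA CHAIN - HUMAN",
    "E2": "HEMOGLOBIN DELTA CHAIN - HUMAN",
    "E3": "HEMOGLOBIN ALPHA CHAIN - MOUSE",
}


def title_search(words):
    """Return codes of entries whose title contains every word (implicit AND)."""
    wanted = words.upper().split()
    return {
        code
        for code, title in TITLES.items()
        # Assumption: whole-word matching against the title.
        if all(word in title.split() for word in wanted)
    }


# All words on one command line form a logical AND.
alpha = title_search("HUMAN ALPHA HEMOGLOBIN")
# The same result, written as three single-word searches joined by AND.
alpha_long = (
    title_search("ALPHA") & title_search("HEMOGLOBIN") & title_search("HUMAN")
)
# OR is the union of the results of two separate searches.
delta = title_search("HUMAN DELTA HEMOGLOBIN")
either = alpha | delta
```

In this model the one-line search and the three-part AND query select the
same entries, and the OR query returns the combined hits of the two separate
TITLE searches.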

The Boolean operators must be placed on separate lines and not on the line
with another command; for example,
   TITLE CYTOCHROME AND P450
will fail because only entries with the character string "AND" in the title
along with "CYTOCHROME" and "P450" will hit.  The command
   TITLE CYTOCHROME P450
means "search for titles containing both strings 'CYTOCHROME' and 'P450'
in either order".  Double quotation marks can be used to change the meaning
slightly:
   TITLE "CYTOCHROME P450"
means "search for titles containing the string 'CYTOCHROME P450'".  Double
quotation marks must also be used when any part of the search string is
shorter than 3 characters; for example,
   TITLE "CYTOCHROME C"
The Boolean NOT command can be used most effectively to remove entries
with names that are extensions of some shorter name of interest; for example,
   QUERY
   TITLE "CYTOCHROME C "
   NOT
   TITLE OXIDASE
   NOT
   TITLE REDUCTASE
   END QUERY
will eliminate nearly everything but cytochrome c from the resulting list.
(Because the indexing scheme used by the retrieval program lumps together all
the nonalphanumeric characters, the space appearing after the "C" and before
the double quotation mark eliminates entries like "cytochrome c2" but not
"cytochrome c'" from the list.)
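
As a rough sketch of the quoting and NOT rules above, the hypothetical helpers
below assemble the command text on the user's side.  The function names are
invented for illustration.  Converting to upper case only mirrors the examples
in this announcement and is not a documented server requirement.  Note that
adding quotation marks also turns the search into an exact-string search.

```python
def title_command(term, phrase=False):
    """Build a TITLE command, quoting the term when the rules above require it."""
    words = term.split()
    # Quotes are needed when any part of the search string is under 3 characters.
    needs_quotes = phrase or any(len(word) < 3 for word in words)
    text = term.upper()  # trailing spaces inside the term are preserved
    return f'TITLE "{text}"' if needs_quotes else f"TITLE {text}"


def exclude_query(base_command, excluded_words):
    """Wrap a base command in a QUERY that removes hits via Boolean NOT."""
    lines = ["QUERY", base_command]
    for word in excluded_words:
        # Each Boolean operator goes on its own line.
        lines += ["NOT", title_command(word)]
    lines.append("END QUERY")
    return "\n".join(lines)


request = exclude_query(title_command("cytochrome c "), ["oxidase", "reductase"])
```

Here `request` holds the same seven lines as the cytochrome c query shown
above, including the space before the closing quotation mark.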

One particularly inappropriate type of request is the following:
   GENE CONCANAVALIN
   KEYWORD CONCANAVALIN
   FEATURE CONCANAVALIN
   TITLE CONCANAVALIN
   SEARCH CONCANAVALIN
Specifically, "concanavalin" is not a gene name, so it will not be found
in the gene field.  The word "concanavalin" is plain text, not a sequence,
so it should not appear after the SEARCH command --- only actual sequences
should appear after a SEARCH command.  While "concanavalin" might possibly
appear in the keyword or feature fields, its use there would be very
specialized and not indicative of a concanavalin entry.  The only command
that makes any sense is
  TITLE CONCANAVALIN
The biggest problem arises when the SEARCH command is used in this way.  The
futile FASTA search it generates wastes shared computer resources that others
could use much more fruitfully.  The FASTA program has been modified to
recognize some occurrences of plain text and print a warning.

The USE command is used to restrict searches to particular databases or to
entries added or modified within a particular time period.  Such restrictions
apply to all subsequent search commands in the same request and need not be
used only in queries.

After a successful search, the GET command should be used to retrieve the
actual text of an entry.  The format of the GET command is either
   GET database:code
 or simply
   GET code
There are no spaces around the colon and only one code may follow each GET
command.
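
A minimal sketch of the GET format just described, assuming a hypothetical
helper `get_commands` on the user's side.  The entry codes in the usage line
are placeholders, not real PIR codes.

```python
def get_commands(codes, database=None):
    """Return one GET line per entry code, optionally as database:code."""
    commands = []
    for code in codes:
        # Only one code may follow each GET, and no spaces surround the colon.
        if not code or ":" in code or any(ch.isspace() for ch in code):
            raise ValueError(f"not a single entry code: {code!r}")
        commands.append(f"GET {database}:{code}" if database else f"GET {code}")
    return commands


lines = get_commands(["CODE1", "CODE2"], database="PIR1")
```

Retrieving several entries therefore means sending several GET lines in the
same request, one per code.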

There are a few special considerations to keep in mind when using the NRL_3D
database of sequence information extracted from the Brookhaven Protein Data
Bank.  Only these fields in NRL_3D are indexed and can be searched through the
PIR Server:  TITLE, SPECIES, FEATURE and the sequence.  At this time the TITLE
field consists of the COMPND records from the Brookhaven Protein Data Bank file
as well as the species.  In most cases your search will be for something in
this TITLE or name field.  For example, after an initial
   USE BASES NRL_3D
the command
   TITLE MYOGLOBIN
will retrieve a list of all the myoglobin sequences in the PDB and
   SPECIES MOUSE
will retrieve a list of all the mouse sequences.  The SPECIES field is not 100%
accurate for the NRL_3D because of some eccentricities in the SOURCE records of
the PDB used to construct it.  Although there is a KEYWORD field in NRL_3D
entries, it is constructed directly from the PDB HEADER record and is not
indexed.
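
The NRL_3D restrictions above could be checked before a request is sent.  The
sketch below is illustrative only: the helper name is invented, it covers
just the three indexed text fields (sequence searches with SEARCH are left
out), and it expects non-empty command lines.

```python
# Text fields that are indexed for NRL_3D, per the paragraph above.
NRL_3D_INDEXED_FIELDS = {"TITLE", "SPECIES", "FEATURE"}


def nrl_3d_request(commands):
    """Prefix field searches with USE BASES NRL_3D, rejecting unindexed fields."""
    for line in commands:
        field = line.split()[0].upper()
        if field not in NRL_3D_INDEXED_FIELDS:
            raise ValueError(f"{field} is not indexed for NRL_3D")
    return "\n".join(["USE BASES NRL_3D", *commands])


request = nrl_3d_request(["TITLE MYOGLOBIN", "SPECIES MOUSE"])
```

A KEYWORD search passed to this helper raises an error, which reflects the
fact that the NRL_3D KEYWORD field is not indexed.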

With release 10.00 of NRL_3D the PIR will cease converting all of each PDB 
release.  Instead only new and modified entries will be converted; the NRL_3D
entries will gradually be modified to standardize spelling, capitalization,
nomenclature, taxonomy and keywords.  With this standardization the KEYWORD
field will become more meaningful and probably be indexed within the coming
year.


2. PIR Network Request Service Command Summary

The National Biomedical Research Foundation Protein Information Resource
network request service is a full-function fileserver and database query
system.  In operation since August 1990, it can handle database queries,
sequence searches and sequence submissions in addition to fileserver
requests.  To use this server, request commands should be sent to
FILESERV at GUNBRF on BITNET or FILESERV at NBRF.Georgetown.EDU on Internet.
The server recognizes the following commands sent either in a mail message,
or (if the sender is on BITNET) in a command message or a file:

  Command        Action
  -------        -----------------------------------------------
  ACCESSION      list entry codes and titles by accession number
  AND            combine QUERY commands with Boolean AND
  AUTHOR         list entry codes and titles by author
  BASES          list accessible databases
  CROSS          list PIR entry codes and titles corresponding to
                   a particular nucleic sequence database entry
  DEPOSIT        deposit entry for database submission
    END DEPOSIT  terminate deposit entry
  FEATURE        list entry codes and titles by feature table entry
  GENE           list entry codes and titles for a gene name
  GET            return entry by entry code
  HELP           return HELP instructions
  HOST           list entry codes and titles by host species
  INDEX          list SENDable files
  JOURNAL        list entry codes and titles by journal citation
  KEYWORD        list entry codes and titles by keyword
  MEMBER         list alignments containing entry code as a member
  NOT            combine QUERY commands with Boolean NOT
  OR             combine QUERY commands with Boolean OR
  QUERY          begin collecting QUERY commands
    END QUERY    terminate collecting commands and execute QUERY
  QUIT           ignore the remaining text (E-mail signature blocks)
  RETURN         change return address for gateway mail
  SEARCH         search for matching sequences by FASTA procedure
    END SEARCH   terminate sequence for searching
  SEND           send file
  SPECIES        list entry codes and titles by species
  SUGGEST        leave suggestion or correction for PIR staff
    END SUGGEST  terminate suggestion text
  SUPERFAMILY    list entry codes and titles by superfamily name
  TAXONOMY       report taxonomy for scientific or common name
  TITLE          list entry codes and titles by title
  USE            set databases, dates or formats to use in limited searches

Multiple commands can be sent with one command on each line of a mail message
or file.  Commands should NOT be sent on the Subject line of a mail message.
Receipt of command messages and files will be acknowledged immediately.  Mail
messages will be acknowledged by return mail.
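
As an illustration of the mail format described above, the following sketch
builds (but does not send) a request message with Python's standard `email`
package.  The sender address is a placeholder, the helper name is invented,
and the recipient is the Internet address given earlier written with "@".
Ending the body with QUIT follows the command summary, which says QUIT makes
the server ignore remaining text such as signature blocks.

```python
from email.message import EmailMessage

# Internet address of the server as given in this announcement.
SERVER_ADDRESS = "FILESERV@NBRF.Georgetown.EDU"


def build_request(sender, commands, add_quit=True):
    """Build a mail message with one command per body line and no Subject."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = SERVER_ADDRESS
    # Commands must not go on the Subject line, so no Subject header is set.
    body_lines = list(commands)
    if add_quit:
        body_lines.append("QUIT")  # server ignores anything after this line
    message.set_content("\n".join(body_lines) + "\n")
    return message


message = build_request("user@example.org", ["HELP SEARCH", "SEND KEYWORDS"])
```

The resulting message can be handed to whatever mail transport is in use;
sending is deliberately left out of the sketch.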

For help in using any of the commands, send a request of the form
  HELP topic
for example
  HELP SEARCH

In addition to the commands, help instructions are also available on the
following topics:
  Custom_Services
  Databases
  FTP
  Gateway_Access
  Help_en_Espanol
  Help_en_francais
  Hints
  IBM-VM_BITNET
  On-Line_Access
  PIR_Distribution
  VAX-VMS_BITNET

Because of network gateway communication protocols, there are limitations on
requests sent through gateways.  Users not on BITNET or INTERNET who access the
server through local or network gateways should read and carefully follow these
instructions before sending requests.  Only mail message requests (not command
messages or files) can be sent through gateways.  Because addresses posted on
gateway mail do not always work for the return, before you send requests
through network gateways it is strongly recommended that you first contact Dr.
John S. Garavelli (POSTMAST at GUNBRF on BITNET, POSTMASTER at NBRF.Georgetown.EDU on
Internet).  We will confirm a return address for you and may instruct you to
use the RETURN command to ensure that your request output will reach you.  It
is not usually necessary to do this if you are on BITNET or INTERNET, unless
your system employs a local remailer or your mail program applies a
nonstandard return address (for example a personal name on the FROM: line).

The BITNET network and the network gateways impose strict limits on file size.
Poorly posed database queries may result in output so extensive that it could
not be returned by network mail.  Therefore, an output limit of 1000 lines for
each command and 3000 lines for each request is imposed by the PIR server.
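
The two limits can be expressed as a simple check.  This sketch only restates
the numbers above; what the server actually does with output that exceeds a
limit is not described here, so the function merely reports whether a set of
expected output sizes would fit.

```python
PER_COMMAND_LIMIT = 1000  # output lines allowed for each command
PER_REQUEST_LIMIT = 3000  # output lines allowed for the whole request


def within_limits(lines_per_command):
    """Return True if every command and the request total fit the limits."""
    return (
        all(count <= PER_COMMAND_LIMIT for count in lines_per_command)
        and sum(lines_per_command) <= PER_REQUEST_LIMIT
    )


fits = within_limits([900, 900, 900])          # 2700 lines in total
too_long = within_limits([1000, 1000, 1000, 1])  # 3001 lines in total
```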

The DEPOSIT and QUERY commands, and the SEARCH and SUGGEST commands (in their
multiline form) must be followed by their respective END commands after the
text appearing on the intervening lines.  The DEPOSIT command requires, and the
SEARCH command optionally uses, parameters that appear on the same line as the
command.  Because these four commands are so complex, users should obtain and
carefully read the help instructions before attempting to use them.
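
For QUERY blocks, two of the rules in this announcement lend themselves to a
quick check before sending: every QUERY needs an END QUERY, and Boolean
operators must sit on their own lines.  The checker below is a hypothetical
client-side sketch, not part of the server.  It skips quoted search strings
and does not cover DEPOSIT, SEARCH or SUGGEST.

```python
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def check_query_lines(lines):
    """Return a list of problems found in the command lines of a request."""
    problems = []
    in_query = False
    for number, raw in enumerate(lines, start=1):
        words = [word.upper() for word in raw.split()]
        if not words:
            continue
        if words == ["QUERY"]:
            in_query = True
        elif words == ["END", "QUERY"]:
            in_query = False
        elif len(words) > 1 and BOOLEAN_OPERATORS & set(words) and '"' not in raw:
            # e.g. TITLE CYTOCHROME AND P450 would search for the string "AND".
            problems.append(
                f"line {number}: Boolean operator on the same line as other text"
            )
    if in_query:
        problems.append("QUERY without END QUERY")
    return problems


good = check_query_lines(
    ["QUERY", "TITLE HUMAN ALPHA HEMOGLOBIN", "OR",
     "TITLE HUMAN DELTA HEMOGLOBIN", "END QUERY"]
)
bad = check_query_lines(["QUERY", "TITLE CYTOCHROME AND P450"])
```

A line that really is meant to find the word "AND" in a title would also be
flagged, so the output is best treated as a warning rather than an error.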

The databases available through the PIR Network Server and their abbreviations
for code specification are as follows:
  Abbreviation  Database                              Update Schedule
  PIR1          PIR Annotated and Classified Entries  quarterly
  PIR2          PIR Preliminary Entries               approximately monthly
  PIR3          PIR Unverified Entries                weekly
  ALN           PIR Alignment Entries                 semiannually
  NRL_3D        Brookhaven Data Bank Sequences        quarterly
  PATCHX        MIPS PIR-Supplementary Database       quarterly
  N             NBRF Nucleic
  GB            GenBank (TM)                          as received
  GBSUP         GenBank (TM)                          as received
  GBNEW         GenBank (TM) New Entries              weekly
  EMBL          EMBL                                  as received
  EMBLSUP       EMBL                                  as received
In the FASTA output of the SEARCH command the abbreviation for PATCHX is
shortened to PATX and NRL_3D is shortened to NR3D; the longer abbreviation
should be used to retrieve an entry with the GET command.  Not all commands
work with all databases; please read the information returned by the command
HELP DATABASES.
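
The abbreviation difference between FASTA output and GET can be handled with
a small lookup.  This is a sketch with an invented helper name.  It assumes
the database abbreviation and entry code have already been read from the
SEARCH output, whose exact layout is not described here, and the code in the
usage line is a placeholder.

```python
# Shortened abbreviations in FASTA output and the forms GET expects.
FASTA_TO_GET = {"PATX": "PATCHX", "NR3D": "NRL_3D"}


def get_from_fasta(database, code):
    """Build a GET command from a database abbreviation seen in FASTA output."""
    name = database.upper()
    # Abbreviations not listed above are passed through unchanged.
    return f"GET {FASTA_TO_GET.get(name, name)}:{code}"


command = get_from_fasta("NR3D", "CODE1")
```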
------------------------------------------------------------------------
                                 Dr. John S. Garavelli
                                 Database Coordinator
                                 Protein Information Resource
                                 National Biomedical Research Foundation
                                 Washington, DC  20007
                                 POSTMASTER at GUNBRF.BITNET
                                 POSTMASTER at NBRF.Georgetown.Edu


