PIR-International Accession Number Policy
With the recent discussion on GenBank-BB concerning the accession number policy
of GenBank, we thought it would be a good time to restate the policy of the
PIR-International Protein Sequence Database.
Like GenBank, we maintain both accession numbers and entry identification codes
(the equivalent of the GenBank LOCUS). Our policy differs in that we currently
assign accession numbers ONLY to sequences as reported in the literature or as
submitted to the database; these serve as unique identifiers of the reported
sequences. Moreover, the entry identification code refers to the entry, NOT to
the sequence. As there is one sequence per entry, the difference is largely one
of principle; however, it does have practical implications. The entry
identification codes are unique in the database, but may be subject to change
from release to release.
We have always considered our role to be primarily to represent sequences in
their `biological form' and secondarily to reflect the experimental literature.
Therefore, we strive to merge all sequence reports concerning the same protein
molecule. Here our policy differs significantly from that of GenBank. Part of
the reason can be found in the data themselves.
Protein sequences have well defined boundaries and only in exceptional cases do
they exceed a few thousand residues in length. The average length for proteins
represented in the database is between 300 and 400 residues (including
polyproteins). Moreover, protein sequences are highly conserved (and generally
identical) among different strains of the same species and it is reasonable to
group the sequence data at this level, unless the data indicate otherwise,
i.e., consistent differences are noted among different strains. These
properties make decisions about merging entries straightforward and
difficulties with excessive lengths are not encountered. That not all of the
sequence data have yet been appropriately merged is a question of resources,
not intent.
In addition, we have developed a mechanism to present simultaneously both the
`merged' and `unmerged' data in an unambiguous and nonredundant manner. Several
years ago we introduced a concise syntax for representing differences among
sequences that allows the originally reported sequence to be derived from the
sequence explicitly represented in the entry. This syntax appears in the entry
in the `residues' field. Such data appear in an entry as follows:
   REFERENCE
      #Authors           Scogin R., Richardson M., Boulter D.
      #Journal           Arch. Biochem. Biophys. (1972) 150:489-492
      #Reference-number  A00053
      #Accession         A00053
      #Molecule-type     protein
      #Residues          1-97,'Q',99-111 <SCO>
The notation "1-97,'Q',99-111" indicates that the first 97 residues should be
extracted from the sequence shown in the entry; Gln should be appended,
followed by residues 99 to 111. In other words, the reported sequence differs
from that shown in the entry in having Gln at position 98. In the above
example, the reported sequence is identified by accession number A00053. All of
the information in this REFERENCE section refers to the experimental report.
This example was taken from entry CCTO; the label SCO on the residues line
uniquely identifies the residue specification within this entry; therefore, the
notation CCTO->SCO provides an alternate method for uniquely addressing the
reported sequence within the database (this address, however, is subject to
change from release to release).
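As an illustration only (this is not the XQS software, nor the full syntax
defined in CXFSD-1091), the derivation described above can be sketched in
Python. The sketch assumes that a residues specification consists solely of
comma-separated items of the two kinds seen in the example: a range "N-M"
taken from the entry sequence, and a quoted literal such as 'Q' to be
appended. The function names split_label and apply_residues_spec, and the
short test sequence, are hypothetical.

```python
import re


def split_label(field):
    """Split a residues field into (specification, label).

    The label is the text inside a trailing <...>, or None if absent.
    """
    match = re.fullmatch(r"\s*(.+?)\s*(?:<([^<>]+)>)?\s*", field)
    if match is None:
        raise ValueError("empty residues field")
    return match.group(1), match.group(2)


def apply_residues_spec(spec, entry_sequence):
    """Derive the reported sequence from the sequence shown in the entry.

    Assumption: only two item kinds are handled,
      N-M    residues N through M of the entry (1-based, inclusive)
      'XYZ'  literal residues appended as written
    Anything else is rejected rather than guessed at.
    """
    parts = []
    for item in spec.split(","):
        item = item.strip()
        literal = re.fullmatch(r"'([A-Za-z]+)'", item)
        span = re.fullmatch(r"(\d+)-(\d+)", item)
        if literal:
            parts.append(literal.group(1))
        elif span:
            start, end = int(span.group(1)), int(span.group(2))
            if not 1 <= start <= end <= len(entry_sequence):
                raise ValueError(f"range {item} lies outside the entry sequence")
            parts.append(entry_sequence[start - 1:end])
        else:
            raise ValueError(f"unsupported item in residues specification: {item!r}")
    return "".join(parts)


# Made-up 10-residue sequence, NOT the sequence of entry CCTO.
entry = "ACDEFGHIKL"
reported = apply_residues_spec("1-3,'Q',5-10", entry)
# Residue 4 (E) of the entry is replaced by Q in the reported sequence.
print(reported)

spec, label = split_label("1-97,'Q',99-111 <SCO>")
print(spec, label)
```

Rejecting unrecognized items, instead of skipping them, keeps the sketch from
silently producing a wrong sequence when it meets a construct of the full
syntax that it does not cover.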
These conventions allow us to distribute the information in a nonredundant
form, while allowing the unmerged `views' to be implicitly recovered. We have
developed software as part of our XQS program that automatically recovers this
information. We are in the process of developing portable transformation
functions that will be publicly distributed in the future. In the meantime,
anyone may develop their own methods if they wish and we would be happy to
assist. The syntax of the residues specification is described in Backus-Naur
Form and the semantics are fully defined in the PIR document CXFSD-1091 CODATA
Exchange Format Specification; this document is distributed on the ASCII tapes
and may be retrieved from the PIR Network Server (send "SEND CXFSD" in the body
of a message to FILESERV at GUNBRF.bitnet to obtain this document; send "HELP" in
the body of a message for further information concerning the PIR Network
Server).
These changes have been introduced gradually over the last three years. Prior to
this, discrepancies between reports were described in an unstructured way. We
are in the process of recovering the originally reported sequences from older
entries. In the meantime, not all of the original data are accessible by this
method. Moreover, these developments reflect a change in our policy concerning
accession numbers.
Until two years ago the accession number was assigned to the entry as a whole
and served as a token that allowed the entry to be recovered in future releases
irrespective of changes in the entry identification code. If one selects any of
the accession numbers from the entry, this role is still satisfied. Given the
dynamic nature of the entries, any attempt to assign an unchanging identity to
the entry as a whole is fraught with difficulty. Under the new policy,
permanent identities (accession numbers) are assigned to the sequence reports,
which are stable. The composite entry is best identified by the sum of its
parts, i.e., the collection of accession numbers. Please note that, in
contrast to GenBank, we have no `primary accession number'; all carry equal
weight.
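A minimal sketch of how a user might exploit this, assuming each accession
number is listed in exactly one entry per release (an assumption of the
sketch, not a statement of policy). Only CCTO and A00053 are taken from the
example above; the second accession number, the renamed entry code, and the
function name build_accession_index are invented for illustration.

```python
def build_accession_index(release):
    """Map every accession number in a release to the code of its entry.

    `release` maps entry identification code -> set of accession numbers.
    Assumes each accession number is listed in exactly one entry per release.
    """
    index = {}
    for entry_code, accessions in release.items():
        for accession in accessions:
            if accession in index:
                raise ValueError(f"{accession} is listed in more than one entry")
            index[accession] = entry_code
    return index


# Hypothetical data: X99999 and CCTO2 are placeholders, not real identifiers.
release_old = {"CCTO": {"A00053", "X99999"}}
release_new = {"CCTO2": {"A00053", "X99999"}}

old_index = build_accession_index(release_old)
new_index = build_accession_index(release_new)

# Any accession number from the entry recovers it in either release,
# even though the entry identification code has changed.
for accession in sorted(release_old["CCTO"]):
    print(accession, old_index[accession], "->", new_index[accession])

# The composite entry is identified by its full set of accession numbers.
entry_identity = frozenset(release_new["CCTO2"])
```

Because no accession number is primary, the index treats all of them alike;
a lookup by any one of them reaches the same entry.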
As a hold-over from the old accession number policy, all accession numbers that
were assigned to the entry as a whole prior to the change in policy are listed
on the main accession number line at the top of the entry. For convenience, the
accession numbers associated with specific reports are also duplicated on this
line.
We appreciate this opportunity to alleviate any confusion that may have arisen
when this policy on PIR accession numbers was changed, and welcome discussion,
suggestions, criticisms, etc., concerning policies or other aspects of the
database operation.
------------------------------------------------------------------------
David George
George at GUNBRF.bitnet
John S. Garavelli
POSTMASTER at GUNBRF.bitnet
Protein Identification Resource
National Biomedical Research Foundation
Georgetown University Medical Center
Washington, DC 20007