Sequences from PDB via Entrez

Steve Bryant bryant at ray.nlm.nih.gov
Wed Feb 16 17:39:24 EST 1994

I've seen messages on this board in the past concerning access to sequences
derived from the Brookhaven Protein Data Bank.  I've recently finished putting 
the latest PDB sequences into the Entrez database distributed by NCBI, and I
thought I'd take the opportunity to remind folks that Entrez can be an easy way
to retrieve a sequence from PDB.  

PDB-derived sequences can be identified within Entrez by using the keyword 
"pdb-structure".  This will find either all PDB-derived protein sequences or 
all PDB-derived nucleic acid sequences, depending on which category one 
selects.  Particular sequences within these groups may be found by PDB id-code
(the "accession number" in Entrez), or by looking for protein names and the
like in "text terms".  The PDB-derived entries contain "text terms" derived
from PDB COMPOUND and SOURCE records, as well as from other PDB record types.  One can
also find PDB-derived sequences by searching for descriptive names in the 
Medline abstracts included with Entrez.  About 90% of the citations in PDB are 
linked to the corresponding Medline citation, and if you can find the paper 
that reported a structure, you can then ask for the associated sequence.  
Sequences may be written out of Entrez in different formats, including FASTA 
sequence files.  
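
For anyone who wants to process a FASTA file saved from Entrez in a script,
here is a minimal Python sketch of a FASTA reader.  It assumes only the
generic FASTA layout (a ">" header line followed by sequence lines).  The
function name read_fasta and the example records are made up for
illustration; the headers Entrez actually writes may look different.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-format lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record input; not real Entrez output.
example = """>example_1 hypothetical protein chain
MKTAYIAK
QRQISFVK
>example_2 hypothetical second chain
GSHMLE
"""
records = list(read_fasta(example.splitlines()))
for header, sequence in records:
    print(header, len(sequence))
```

To read a saved file instead, pass an open file object, e.g.
read_fasta(open("saved.fasta")), where "saved.fasta" is a placeholder name.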

The PDB sequence reports in Entrez combine information provided on PDB ATOM
and/or HETATM records with the explicit sequence given on SEQRES.  (In about 1%
of cases ATOM/HETATM and SEQRES cannot be linked unambiguously, due to missing
data or inconsistencies.  In these cases biopolymer sequences are derived from
ATOM records.)  Because of this linking, the sequence reports contain fairly
rich annotation, including secondary structure, disulfide bonds, bonds to
nonpolymer groups, and descriptions of modified biopolymer residues.  They also
contain the residue numbers assigned by PDB on ATOM/HETATM records, so that one
can unambiguously identify the coordinates in the PDB file that go with each
residue in the sequence.
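
To give a feel for what linking SEQRES to ATOM/HETATM involves, here is a
rough Python sketch.  It is not the procedure used to build the Entrez
reports.  It assumes the fixed-column PDB record layout (chain id in column
12 of SEQRES; residue name, chain id, residue number and insertion code in
columns 18-27 of ATOM/HETATM), and it uses a deliberately naive rule: the
observed residues must match one contiguous stretch of SEQRES.  All function
names and the four-residue example are hypothetical.

```python
def seqres_line(serial, chain, names):
    """Build a fixed-column SEQRES record for the example data."""
    return "SEQRES {:>3} {:1} {:>4}  {}".format(
        serial, chain, len(names), " ".join(names))


def atom_line(serial, atom, res_name, chain, res_seq):
    """Build a fixed-column ATOM record (coordinates are dummy zeros)."""
    return "{:<6}{:>5} {:<4} {:>3} {:1}{:>4}    {:8.3f}{:8.3f}{:8.3f}".format(
        "ATOM", serial, atom, res_name, chain, res_seq, 0.0, 0.0, 0.0)


def seqres_by_chain(pdb_lines):
    """Collect three-letter residue names from SEQRES records, keyed by chain."""
    chains = {}
    for line in pdb_lines:
        if line.startswith("SEQRES"):
            chains.setdefault(line[11], []).extend(line[19:70].split())
    return chains


def observed_residues(pdb_lines, chain):
    """List (number, insertion code, name) once per residue with coordinates."""
    residues = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and line[21] == chain:
            entry = (int(line[22:26]), line[26].strip(), line[17:20].strip())
            # Consecutive atom records of one residue repeat the same entry.
            if not residues or residues[-1] != entry:
                residues.append(entry)
    return residues


def link_numbering(seqres, observed):
    """Naive link: first offset where observed names match SEQRES contiguously.

    Returns (name, PDB residue number or None) per SEQRES position, or None
    when no contiguous match exists (treated here as "cannot be linked").
    """
    names = [name for _, _, name in observed]
    for offset in range(len(seqres) - len(names) + 1):
        if seqres[offset:offset + len(names)] == names:
            numbers = [None] * len(seqres)
            for i, (number, icode, _) in enumerate(observed):
                numbers[offset + i] = f"{number}{icode}"
            return list(zip(seqres, numbers))
    return None


# Hypothetical chain A: four residues on SEQRES, the first has no coordinates.
lines = [
    seqres_line(1, "A", ["GLY", "SER", "ALA", "CYS"]),
    atom_line(1, " N  ", "SER", "A", 2),
    atom_line(2, " CA ", "SER", "A", 2),
    atom_line(3, " N  ", "ALA", "A", 3),
    atom_line(4, " N  ", "CYS", "A", 4),
]
seqres = seqres_by_chain(lines)["A"]
observed = observed_residues(lines, "A")
linked = link_numbering(seqres, observed)
print(linked)
```

Real files need more care than this: chain breaks, waters and ligands on
HETATM records, and modified residues all defeat a simple contiguous match.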

PDB-derived sequence reports in Entrez are derived automatically from PDB
files, and I update the collection with each new release of PDB.  Network
Entrez version 9.0 came out on February 10, and its database includes all 
polypeptide and nucleic acid sequences on the "October, 1993" Brookhaven CD, 
which I received in late January.

Steve Bryant

More information about the Bio-soft mailing list