In article <6l7588$j4n$1 at netnews.upenn.edu>,
Al Wang <alwang at blue.seas.upenn.edu> wrote:
>Hi,
>>Does anyone know what's a good way to take a Brookhaven PDB file, extract
>the sequence information, and save it in SwissPROT format? If it can
>also extract the secondary structure features, even better.
I can't help exactly, but if it gets you half way there, here is my
perl script that extracts the sequence information from a PDB file and
creates a FASTA format sequence file from it:
#!/usr/local/bin/perl
# pdb2fasta (C) T J R Cutts 1998
# This script extracts the SEQRES records from a PDB file, and writes out
# separate FASTA format sequence files, one for each strand in the PDB
# file. The title line of the FASTA file is constructed from the COMPND
# records of the PDB file, and the strand letter.
%lookup=(
'A' => 'A',
'C' => 'C',
'G' => 'G',
'T' => 'T',
'U' => 'U',
'ALA' => 'A',
'ASX' => 'B',
'CYS' => 'C',
'ASP' => 'D',
'GLU' => 'E',
'PHE' => 'F',
'GLY' => 'G',
'HIS' => 'H',
'ILE' => 'I',
'LYS' => 'K',
'LEU' => 'L',
'MET' => 'M',
'ASN' => 'N',
'PRO' => 'P',
'GLN' => 'Q',
'ARG' => 'R',
'SER' => 'S',
'THR' => 'T',
'VAL' => 'V',
'TRP' => 'W',
'XXX' => 'X',
'TYR' => 'Y',
'GLX' => 'Z',
'...' => '.',
'END' => '*'
);
die "Usage: pdb2fasta <pdbfile> <outfile>\nA separate output file will be created for each strand in the PDB file.\n"
unless ($#ARGV==1);
@pdb=();
open(P, "<$ARGV[0]") || die "Could not open $ARGV[0]";
# Put the file, stripped of its carriage returns, in an array.
foreach (<P>)
{
chomp;
push(@pdb, $_);
}
close(P);
%strand = ();
# Extract the SEQRES and COMPND records
@seqres=grep(/^SEQRES/, @pdb);
@cmpnd=grep(/^COMPND/, @pdb);
# Create a suitable FASTA title line from the COMPND records
$cmpnd='>'.join(' ', @cmpnd);
$cmpnd =~ s/COMPND//g;
$cmpnd =~ s/\s+/ /g;
foreach $line (@seqres)
{
# Split the SEQRES line from positions 19-70 into an array of residues
@residues = split(/\s+/, substr($line, 19, 51));
# For each of these, uppercase the residue, look up its single letter
# equivalent in the hash at the top of this script, and then append
# that single letter to this strand's variable for output.
foreach $res (@residues)
{
$res = uc($res);
$strand{substr($line, 11, 1)} .= $lookup{$res};
}
}
# For each strand discovered, create a FASTA format file printed nicely
# with 60 residues per line.
foreach $key (keys %strand)
{
$filename = ($key eq " ") ? "${ARGV[1]}.tfa" : "${ARGV[1]}_$key.tfa";
open (O, ">$filename") ||
die "Could not open $filename for writing";
print O "$cmpnd - strand $key\n";
for ($n = 0 ; $n < length($strand{$key}); $n+=60)
{
print O substr($strand{$key}, $n, 60)."\n";
}
close O;
}
# That's it!
--
--------------------------------------------------------------------------
Dr T J R Cutts Tel: +44 1223 333596
Dept. of Biochemistry, 80 Tennis Court Rd.
Cambridge, CB2 1GA, UK