extract protein

Joe White joe.white at WMICH.EDU
Thu May 7 12:35:47 EST 1998


In response to Tonu Margus's question, the following script works with
GenBank flat files that have the feature key "CDS" in the features list.  I
wrote this script to extract the translational start and stop points from a
GenBank file and translate the DNA sequence from that file into an amino acid
sequence.

#!/bin/csh -f
#
#	script: translate.fil         	author: Joseph A. White    10/96
#
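# Script to extract the translational start and stop sites from a DNA
# sequence file, and translate the file, giving it a filename extension of
# .pep.
#
# Usage: translate.fil <filename>
#
	set file = $1
# The location is read from columns 22-40 of the line containing "CDS".
# The cut fields below imply a location written as "start. .end".
	set begin = ` grep 'CDS' $file | cut -c22-40 | cut -f1 -d"." `
	set end = ` grep 'CDS' $file | cut -c22-40 | cut -f2 -d" " | cut -c2-7 `
	echo $begin, $end
	translate $file -default -beg=$begin -end=$end -out=$file.pep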

To use the script, enter the following command:

translate.fil  <filename>

The script takes a single sequence file name.  It writes the translation to
a file with the same name plus the extension ".pep".
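
To see what the two cut pipelines pull out of a feature line, they can be
tried on a made-up example.  The sketch below is written in POSIX sh rather
than csh so that each pipeline can be wrapped in a function.  The helper names
(feature_line, cds_begin, cds_end) and the sample coordinates are
hypothetical.  It assumes the location starts in column 22 and is written as
"start. .end", which is the form the cut fields appear to expect.

```shell
# Build a feature-table line: key starting in column 6, location in column 22.
feature_line() { printf '%-5s%-16s%s\n' '' "$1" "$2"; }

# The same two pipelines the script uses, applied to a line passed as $1.
cds_begin() { printf '%s\n' "$1" | cut -c22-40 | cut -f1 -d"."; }
cds_end()   { printf '%s\n' "$1" | cut -c22-40 | cut -f2 -d" " | cut -c2-7; }

line=$(feature_line CDS '123. .456')   # hypothetical sample entry
echo "begin=$(cds_begin "$line") end=$(cds_end "$line")"
```

With the sample line this prints begin=123 end=456.  If the location were
written as "123..456" with no space, the second pipeline would return
"23..45", so the field choices may need adjusting along with the columns.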

There are two problems that could occur in using this script:
1.      If the coding sequence ("CDS") is listed as a series of joined exon
coding parts of a sequence file, the script will fail to translate the
correct parts of the sequence.
2.      If the start and stop base numbers are not in columns 22-40 of
the line containing "CDS", the script will translate the wrong region of the
DNA sequence.

The second problem can be fixed by altering the columns that cut uses to
extract the positions.  The first problem would take much more work to
solve.
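
One possible way around the fixed columns is to split the "CDS" line on
whitespace instead of character positions.  The sketch below (POSIX sh with
awk, not csh) does that and also lists the segments of a joined location.
The function name cds_segments is hypothetical.  It assumes the whole
location sits on the "CDS" line itself, and it only reports coordinates;
assembling the exons before translation is still the extra work mentioned
above.

```shell
# Print one "start end" pair per segment of the first CDS feature in file $1.
# Exit status: 0 = ok, 1 = no CDS line found, 2 = location form not handled
# (complement(...), partial ends, or a location continued on the next line).
cds_segments() {
    awk '
        $1 == "CDS" {
            seen = 1
            loc = ""
            for (i = 2; i <= NF; i++) loc = loc $i   # "12. .345" becomes "12..345"
            sub(/^join\(/, "", loc)                  # strip a join( ... ) wrapper
            sub(/\)$/, "", loc)
            n = split(loc, seg, ",")
            for (k = 1; k <= n; k++)
                if (seg[k] !~ /^[0-9]+\.\.[0-9]+$/) bad = 1
            if (!bad)
                for (k = 1; k <= n; k++) {
                    split(seg[k], p, /\.\./)
                    print p[1], p[2]
                }
            exit
        }
        END { if (!seen) exit 1; if (bad) exit 2 }
    ' "$1"
}
```

For a simple location the single pair printed could supply the values for
-beg and -end; a nonzero exit status marks a file that needs attention by
hand.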

A script that accepts a file of sequence file names and translates each file
is shown below.

#!/bin/csh -f
#
#	script: translate.seqfiles	author: Joseph A. White    10/96
#
# Script to extract the translational start and stop sites from a group of DNA
# sequence files, and translate each file, giving it a filename extension of
# .pep.
#
foreach file (`cat $1`)
	echo $file
	set begin = ` grep 'CDS' $file | cut -c22-40 | cut -f1 -d"." `
	set end = ` grep 'CDS' $file | cut -c22-40 | cut -f2 -d" " | cut -c2-7 `
	echo $begin, $end   
	translate $file -default -beg=$begin -end=$end -out=$file.pep 
	set lines = `grep -c '*' $file.pep`
	if ($lines > 1) then
		echo $file.pep $lines >> check.pep
		echo "$file has not been translated properly."
	else
		echo $file.pep >> tfiles.pep
	endif
end

To use the script, enter the following command:

translate.seqfiles  <file_of_sequence_names>

The script takes a file that lists sequence file names (a list with a single
name also works).  It produces a series of files with the extension ".pep",
and appends the name of each file it has translated to "tfiles.pep".  If it
finds that a file has been incorrectly translated, i.e. it detects stop
codons within the translated sequence, it records that file in "check.pep".
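
The check in the script counts lines that contain "*", so several stops on
one line count only once.  A possible variation, sketched here in POSIX sh
rather than csh, counts the "*" characters themselves.  The function name
count_stops is hypothetical, and the sketch assumes that "*" appears in the
.pep file only as a stop symbol, which may not hold if the file carries a
header containing that character.

```shell
# Print the number of '*' characters in the file named by $1.
count_stops() {
    tr -cd '*' < "$1" | wc -c | tr -d ' '
}
```

A count greater than 1 would then play the role of the $lines > 1 test in the
script.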

The script is prone to the same problems as the first.

Joe White

At 12:30 PM 5/7/98 GMT, you wrote:
>Hi,
>Is there program in GCG or EGCG what can extract protein seq 
>from NH seq annotation?
>If yes in EGCG then  from wher can I downloud it?
>
>Tonu Margus
>
>
Joe White

e-mail: 		joe.white at wmich.edu
snailmail:		Dept. of Chemistry
		Western Michigan University
		Kalamazoo, MI  49008
phone:		(616) 387-2895
fax:		(616) 387-2909




More information about the Info-gcg mailing list
