On Wed, 20 Nov 1996, Stephane Vuilleumier wrote:
> I was wondering whether it is possible to translate a set of N large DNA
> sequence files (say, a prokaryotic genome) into all 6 reading frames in one
> command line. I have the DNA sequence files in GCG format and a file with
> the name of these files.
> The rationale for doing this is I feel the sequence annotations which I think
> are used in the trembl protein database (which takes some time to update
> anyway) might miss some subtle things such as translational coupling,
> frameshifts and, yes, sequencing errors introducing stop codons.
> What I would do next is build a dataset with these 6N protein translations
(stuff deleted)
Why not just use TFASTA, which does essentially the same thing:
"on-the-fly" translations of all 6 frames and protein-protein
comparisons? The difference is that TFASTA doesn't need, nor does it
save, the translations in a database.
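
For anyone who does want the translations on disk, here is a minimal Python
sketch of a six-frame translation. It is not how GCG or TFASTA does it; it
assumes the standard genetic code, plain A/C/G/T input, and hypothetical
function names (six_frames, translate). Codons containing any other
character come out as "X", and stop codons as "*".

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frames("ATGGCCTAA").items():
        print(label, protein)
```

Looping this over the file of file names would give the 6N translations,
once each GCG file has been reduced to its bare sequence.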
I have found that fastapep.cmp doesn't work very well for TFASTA:
it gives lots of alignments to obviously closed reading frames. However,
adding a penalty for alignments that contain a stop codon, and reducing
the score for certain matches (e.g. Cys-Cys and Trp-Trp), works much
better and brings significant alignments up out of the "noise". TFASTA
does indeed show sequencing errors in real reading frames: they show up
as the same file listed twice, in two different reading frames.
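
To illustrate the kind of matrix change described above, here is a rough
Python sketch. It does not read or write the real fastapep.cmp format, and
every number in it is a made-up placeholder, not the values I actually use.
The function names (adjust_matrix, ungapped_score) are hypothetical too.

```python
def adjust_matrix(scores, residues, stop_penalty=-10, reduced=None):
    """Return a copy of a pair-score dict with a stop-codon penalty applied.

    scores:       dict mapping (residue, residue) -> score
    residues:     iterable of residue letters used in the matrix
    stop_penalty: score given to any pair involving '*' (placeholder value)
    reduced:      dict of pairs whose score should be lowered, e.g. Cys-Cys
    """
    adjusted = dict(scores)
    for r in list(residues) + ["*"]:
        adjusted[(r, "*")] = stop_penalty
        adjusted[("*", r)] = stop_penalty
    for (a, b), value in (reduced or {}).items():
        adjusted[(a, b)] = value
        adjusted[(b, a)] = value
    return adjusted


def ungapped_score(query, target, scores, default=0):
    """Sum pair scores over two equal-length, already aligned strings."""
    return sum(scores.get((q, t), default) for q, t in zip(query, target))


if __name__ == "__main__":
    # Toy matrix with placeholder scores for three residues only.
    toy = {("A", "A"): 2, ("C", "C"): 12, ("W", "W"): 17}
    adj = adjust_matrix(toy, "ACW", reduced={("C", "C"): 6, ("W", "W"): 8})
    print(ungapped_score("ACW", "A*W", toy))  # stop codon costs nothing
    print(ungapped_score("ACW", "A*W", adj))  # stop codon is penalized
```

With the toy numbers, a translated segment that runs through a stop codon
drops from 19 to 0, which is the effect that pushes closed reading frames
down the list.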
******************************************************************************
Paul H. Roy Phone: +1 418 654 2705
Departement de biochimie,FSG FAX: +1 418 654 2715
Universite Laval E-mail: proy at rsvs.ulaval.ca
Quebec, QC G1K 7P4
CANADA
******************************************************************************