How to handle selenocysteine in alignments?

Peter Rice pmr at ebi.ac.uk
Tue Apr 22 03:47:27 EST 2003

Gordon D. Pusch wrote:
> I have recently found evidence that BLAST and FASTA do not properly handle
> the official IUPAC single-letter-code 'U' for selenocysteine, presumably
> because it does not appear in either the PAM or BLOSUM matrices (although
> I have not been able to rule out hard-coding as a cause). 
> Are substitution matrices available that include scores for selenocysteine?
> If not, what is the least harmful way of handling the selenocysteine character?
> Should it be changed to the code 'X' for an unknown amino acid?  Or should
> it be changed to the code for another amino acid with similar chemical and
> physical properties?  Would it be acceptable to change it to the extremely 
> rare but still 'legal' character 'Z' for glutamine?  Any other suggestions?

I am interested in this issue for EMBOSS.

It appears that a common approach is to treat 'U' as 'C'. This could 
mean converting 'U' to 'C' internally, or duplicating the 'C' scores as 
'U' for a matrix that does not include 'U'.
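
As a minimal sketch of those two options (not taken from EMBOSS, BLAST or 
FASTA), the Python below either rewrites 'U' as 'C' before alignment or 
copies the 'C' scores to 'U' in a matrix held as a dictionary of residue 
pairs. The function names and the two-residue toy matrix are hypothetical 
and are there only to illustrate the idea.

```python
def map_u_to_c(seq):
    """Option 1: convert 'U' to 'C' before alignment, preserving case."""
    return seq.translate(str.maketrans("Uu", "Cc"))


def add_u_as_c(matrix):
    """Option 2: return a copy of a pair-score dict with 'U' scored as 'C'.

    The matrix is assumed to be a dict keyed by (residue, residue) tuples
    that holds both orderings of every pair.
    """
    extended = dict(matrix)  # leave the caller's matrix untouched
    for (a, b), score in matrix.items():
        if a == "C":
            extended[("U", b)] = score
        if b == "C":
            extended[(a, "U")] = score
    # 'U' aligned with 'U' is scored like 'C' aligned with 'C'.
    if ("C", "C") in matrix:
        extended[("U", "U")] = matrix[("C", "C")]
    return extended


# Toy two-residue matrix for illustration only; a real one would be read
# from a PAM or BLOSUM file.
toy = {("A", "A"): 4, ("A", "C"): 0, ("C", "A"): 0, ("C", "C"): 9}
extended = add_u_as_c(toy)
```

The first option loses the information that the residue was 'U' unless the 
original sequence is kept for output; the second keeps the sequence intact 
and changes only the scoring.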

If there are no acceptable scores for 'U' then 'X' would be an 
alternative, although the implementations of some algorithms may have 

The use of 'Z' for glutamate/glutamine and 'B' for aspartate/asparagine 
goes back to the days of protein sequencing with an amino acid analyser. 
Hydrolysing all the amide bonds and then counting the molecular ratios 
resulted in asparagine being hydrolysed to aspartate and glutamine to 
glutamate so a code was needed to represent the resulting ambiguity.

For protein sequences derived from a DNA sequence these codes are 
usually not seen, though I have come across SNPs that translate ('RAC' 
for example translates as 'B' because AAC is asparagine and GAC is 
aspartate).

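The 'RAC' case can be worked through in a short Python sketch: expand the 
ambiguous codon into its concrete codons, translate each with the standard 
genetic code, and collapse the result to 'B' or 'Z' when the residues 
match one of those pairs. The function name and the fallback to 'X' for 
any other mixture are assumptions made for illustration, not a 
description of how EMBOSS or any other package behaves. Only DNA codes 
are handled.

```python
from itertools import product

# Standard genetic code, listed in the usual TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): AMINO[i]
    for i, codon in enumerate(product(BASES, repeat=3))
}

# IUPAC nucleotide ambiguity codes (DNA only).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def translate_ambiguous(codon):
    """Translate one codon that may contain IUPAC ambiguity codes."""
    options = [IUPAC[base] for base in codon.upper()]
    residues = {CODON_TABLE["".join(c)] for c in product(*options)}
    if len(residues) == 1:
        return residues.pop()      # every expansion gives the same residue
    if residues == {"N", "D"}:
        return "B"                 # asparagine or aspartate
    if residues == {"Q", "E"}:
        return "Z"                 # glutamine or glutamate
    return "X"                     # assumed fallback for other mixtures


rac = translate_ambiguous("RAC")
saa = translate_ambiguous("SAA")
```

Here 'RAC' expands to AAC and GAC and so yields 'B', while 'SAA' expands 
to CAA and GAA and yields 'Z'.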

Peter Rice
