displaying alignments (thanks)

Fri Jan 7 10:03:44 EST 1994

Happy New Year to all Bio-Soft/Bionet.Software participants!

Regarding present and future programs for automated illustrative 
shading of multiple sequence alignments, Michael Baron wrote:

> As it happens, a number of people communicated that they *had* wanted 
> to do what I described, but none of the programs available did it 
> (because, of course, they were designed to do something else). As Mark 
> Reboul suggested, it might be worth a little discussion about what 
> kind of rules we want an alignment *display* program to follow when it 
> makes a decision as to what residues to highlight.

Based on recent involvements with this issue, I have identified the 
shading scheme desired by some of my users. They want a shading 
program which merely automates the highlighting of majority column 
occupations. That the numbers in some mutation score matrix 
analytically justify the choice of a particular amino acid as one 
column's consensus element is of no concern to them. They don't want 
the shading program doing analysis. Their need for analysis vanished 
as soon as the alignment was generated. (Of course, users at other 
sites may have different needs!)

Here are the simple shading rules which my users would like to have 
applied to mutual alignments of multiple amino acid sequences. The 
following would apply to each column of an alignment:

o	If more than half of the elements are one amino acid, then 
	shade those all in black. Call that amino acid the "majority 
	occupier" in the column.

o	In cases where the above has occurred, if any of the other 
	elements fall in the same amino acid family as that majority 
	occupier, then shade those other family members in gray.

o	In cases where there is no majority occupier, if more than 
	half of the column elements all fall into the same a.a. 
	family, then shade all of those family members in gray.	

That's it. In this very straightforward scheme, two shading colors 
are required: black and gray.
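
To make the three rules concrete, here is a minimal Python sketch of how one column might be shaded. It is only an illustration, not code from any existing program. The function name shade_column and the output codes 'B', 'g' and ' ' are my own inventions. The family table is the GCG Simplify grouping quoted further down in this post. I have assumed that gap characters count toward the size of the column but are never shaded; another site might reasonably decide otherwise.

```python
from collections import Counter

# Default families: the GCG Simplify grouping quoted later in this post.
FAMILIES = ["PAGST", "QNEDBZ", "HKR", "LIVM", "FYW", "C"]
FAMILY_OF = {aa: i for i, fam in enumerate(FAMILIES) for aa in fam}
GAPS = "-."  # assumption: gaps count toward column size, never shaded


def shade_column(column):
    """Return one shade per element: 'B' (black), 'g' (gray), ' ' (none)."""
    column = column.upper()
    half = len(column) / 2
    residues = [aa for aa in column if aa not in GAPS]
    shades = [" "] * len(column)
    if not residues:
        return "".join(shades)

    # Rule 1: one amino acid occupies more than half of the column.
    top, top_count = Counter(residues).most_common(1)[0]
    if top_count > half:
        top_family = FAMILY_OF.get(top)
        for i, aa in enumerate(column):
            if aa == top:
                shades[i] = "B"
            # Rule 2: other members of the majority occupier's family.
            elif top_family is not None and FAMILY_OF.get(aa) == top_family:
                shades[i] = "g"
        return "".join(shades)

    # Rule 3: no majority occupier, but one family holds more than half.
    fam_counts = Counter(FAMILY_OF[aa] for aa in residues if aa in FAMILY_OF)
    if fam_counts:
        fam, fam_count = fam_counts.most_common(1)[0]
        if fam_count > half:
            for i, aa in enumerate(column):
                if FAMILY_OF.get(aa) == fam:
                    shades[i] = "g"
    return "".join(shades)
```

With this sketch, shade_column("IIIIILVC") returns "BBBBBgg " (five black I's, gray L and V, unshaded C), and shade_column("IIIILVCD") returns "gggggg  " because no single residue exceeds half but the L,I,V,M family does.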

Now, where I said "more than half" above, there's a question of 
whether that should mean precisely greater than half, or greater 
than OR EQUAL to half. If the latter, troubling problems can arise. 
For example, what do you do if you have an alignment of 8 sequences, 
and in one column there are 4 I's and 4 L's -- which do you shade 
black and which do you shade gray? Do we need to specify one more 
rule, causing all 8 to be shaded gray in such a case?

Keeping the interpretation of "more than half" as precisely greater 
than half avoids such questions, but also means that a column with 
C,D,E,G,I,I,I,I will have nothing shaded, which is not necessarily 
what a user wants. The "perfect" shading program ought to have an 
option allowing the user to switch from the default .gt. half rule 
to the .ge. rule.
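
The difference between the two interpretations can be seen in a few lines of Python. This is a sketch only; the function name majority_occupiers and its inclusive flag are hypothetical, chosen here to stand for the .gt./.ge. switch.

```python
from collections import Counter


def majority_occupiers(column, inclusive=False):
    """Return the residues that reach the majority threshold in a column.

    inclusive=False is the strict ".gt. half" rule; inclusive=True is the
    ".ge. half" rule, under which two residues can qualify at once.
    """
    counts = Counter(column.upper())
    total = len(column)
    # Compare 2*count with total so everything stays in integer arithmetic.
    if inclusive:
        return sorted(aa for aa, n in counts.items() if 2 * n >= total)
    return sorted(aa for aa, n in counts.items() if 2 * n > total)
```

For the column C,D,E,G,I,I,I,I the strict rule returns an empty list and the .ge. rule returns ['I']. For 4 I's and 4 L's the .ge. rule returns ['I', 'L'], which is exactly the tie that would need an extra rule to resolve.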

My users care very much about the a.a. family groupings implicit in 
shading decisions. In fact, they want that relation to be explicit, 
and independent of the numbers in the mutation matrix (which were 
used in constructing the m.s.a.). They would like the user to have 
control over the family groupings used in shading decisions 
(according to my proposed rules above). It would be useful if the 
program had built into it a default a.a. family grouping, 2 or 3 
alternate "standard" groupings (selectable at the user-prompt 
level), and an option allowing the user to specify his/her own 
custom a.a. family assignments (presumably via a text file 
conforming to some simple format). I note that Kay Hofmann's 
BoxShade program, along with its *.SIM file, can support asymmetric 
family membership assignments, which is a useful capability to 
preserve in the shading program next to be developed or perfected.

The default family grouping might be the one already used in GCG's 
GenRunData:Simplify.Txt [text below from GenHelp Simplify 
Description output] --

	A  =  P,A,G,S,T    (neutral, weakly hydrophobic)
	D  =  Q,N,E,D,B,Z  (hydrophilic, acid amine)
	H  =  H,K,R        (hydrophilic, basic)
	I  =  L,I,V,M      (hydrophobic)
	F  =  F,Y,W        (hydrophobic, aromatic)
	C  =  C            (cross-link forming)

(grouping due to Jimenez & Martinez at UCSF). An alternate standard 
grouping might be the following one from Branden & Tooze, 
_Introduction to Protein Structure_, 1991, pages 6-7, which seems 
very different --

	A,V,F,P,M,I,L      (hydrophobic)
	D,E,K,R            (charged)
	S,T,Y,H,C,N,Q,W    (polar)
	G                  (glycine)

(I don't know where the ambiguous codes B and Z fit into this one.)
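
To show how differently the two groupings behave, here is a small Python sketch that holds both as selectable tables. The names GROUPINGS, family_index and same_family, and the keys "gcg_simplify" and "branden_tooze", are hypothetical. As an assumption, a code that has no family in the chosen grouping (such as B or Z in the second table) is treated as matching nothing.

```python
# Both tables are transcribed from the lists quoted above.
GROUPINGS = {
    "gcg_simplify": ["PAGST", "QNEDBZ", "HKR", "LIVM", "FYW", "C"],
    "branden_tooze": ["AVFPMIL", "DEKR", "STYHCNQW", "G"],
}


def family_index(grouping_name):
    """Map each residue code to the index of its family in a grouping."""
    return {aa: i
            for i, family in enumerate(GROUPINGS[grouping_name])
            for aa in family}


def same_family(a, b, grouping_name):
    """True if both residues are assigned to the same family.

    Assumption: a residue with no family in this grouping matches nothing.
    """
    index = family_index(grouping_name)
    fa, fb = index.get(a.upper()), index.get(b.upper())
    return fa is not None and fa == fb
```

Under the first table D and K are in different families while F and Y are together; under the second table it is the other way around, so the same column could be shaded quite differently depending on which grouping the user selects.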

My comments above are specific to amino acid sequence alignments. A 
shading program may need to obey a different rationale in handling 
nucleotide sequence alignments. Black would still highlight majority 
occupations. Gray might be used in connection with ambiguous 
nucleotides: highlighting ambiguous matches to a majority 
(unambiguous) element, or highlighting a column's majority of 
matching/consistent ambiguous nucleotides, etc. (It sounds 
potentially more complicated than handling amino acids -- may not be 
possible to work out a fully consistent scheme.)

Positive reactions, negative t(h)rashing, or other comments?

	Mark Reboul
	Columbia-Presbyterian Cancer Center Computing Facility
	mark at cuccfa.ccc.columbia.edu

P.S. -- Kay Hofmann, creator of BoxShade, raised his own interesting 
	questions about all this back in '91 in the Info-GCG forum. 
	For those who may not have seen that, I excerpt the relevant 
	part of his posting below.


Date:         Sat, 9 Nov 1991 19:40:20 GMT
From:         Kay Hofmann <KHOFMANN at CIPVAX.BIOLAN.UNI-KOELN.DE>
Subject:      Re: PRETTYBOX v1.1

	.	[initial text omitted by Mark Reboul]
 Problem 1: Think of an alignment having at a certain position two sequences
            with a Glu (E) and, say, five sequences having a hydrophobic
            residue (I,L,F,V,M). Now any program has to decide if the two
            identical E's should be marked, leaving the rest unlabeled or
            if the 5 similar residues should override the identity. This
            problem has to be extended to more complex cases involving
            a weighting of similar vs. identical residues
 Problem 2: (related to 1) How should similarity be treated in
            general? One obvious method would be using Dayhoff-type scores
            which would in turn solve the 'identity vs. similarity' problem.
            The problem with those scores is that identities of frequently
            occurring residues have lower scores than exchanges between
            less frequent amino acids. Strict use of these tables would lead
            to diagrams where His-Gln would be marked while any number of
            Ala at the same position are not recognized.
 Problem 3: (unsolvable)
            If you have at a sequence position 5x Gly and 5x His only one
            of these identities can be marked, the other one has to remain
            blank. Depending on the purpose of your multiple alignment, this
            can lead to very annoying effects if the 'wrong' group is labeled
            by the program. Example:
      seq1  GRTEILV
      seq2  GRTEILV
      seq3  AAAIWWW
      seq4  QQQILLL
      seq5  LLLIQQQ
            you see how the consensus of seq1 and seq2 would be interrupted
            because the (unrelated) seq3,seq4 and seq5 all have an Ile at
            position 4.
	.	[more text omitted here by Reboul]
Kay Oliver Hofmann                        Tel. ++49 201 478 6980
Institut fuer Biochemie (med. Fak.)       FAX  ++49 201 478 6979
Universitaet Koeln
Joseph Stelzmann Str. 52            INTERNET:
D-5000 Koeln 41                     KHOFMANN at cipvax.biolan.uni-koeln.de
