Happy New Year to all Bio-Soft/Bionet.Software participants!
Regarding present and future programs for automated illustrative
shading of multiple sequence alignments, Michael Baron wrote:
> As it happens, a number of people communicated that they *had* wanted
> to do what I described, but none of the programs available did it
> (because, of course, they were designed to do something else). As Mark
> Reboul suggested, it might be worth a little discussion about what
> kind of rules we want an alignment *display* program to follow when it
> makes a decision as to what residues to highlight.
Based on recent involvements with this issue, I have identified the
shading scheme desired by some of my users. They want a shading
program which merely automates the highlighting of majority column
occupations. That the numbers in some mutation score matrix
analytically justify the choice of a particular amino acid as one
column's consensus element is of no concern to them. They don't want
the shading program doing analysis. Their need for analysis vanished
as soon as the alignment was generated. (Of course, users at other
sites may have different needs!)
Here are the simple shading rules which my users would like to have
applied to mutual alignments of multiple amino acid sequences. The
following would apply to each column of an alignment:
  o If more than half of the elements are one amino acid, then
    shade those all in black. Call that amino acid the "majority
    occupier" in the column.

  o In cases where the above has occurred, if any of the other
    elements fall in the same amino acid family as that majority
    occupier, then shade those other family members in gray.

  o In cases where there is no majority occupier, if more than
    half of the column elements all fall into the same a.a.
    family, then shade all of those family members in gray.
That's it. In this very straightforward scheme, two shading colors
are required: black and gray.
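
A minimal sketch of the three rules above, in Python. The names
shade_column and FAMILY_OF are illustrative only, and the family
grouping used is the GCG Simplify one quoted later in this post. One
assumption not stated in the rules: gap characters (or any other
non-amino-acid symbol) are never shaded, but they still count toward
the size of the column.

```python
from collections import Counter

# Assumed grouping: the GCG Simplify families quoted later in this post.
FAMILIES = ["PAGST", "QNEDBZ", "HKR", "LIVM", "FYW", "C"]
FAMILY_OF = {aa: i for i, fam in enumerate(FAMILIES) for aa in fam}


def shade_column(column):
    """Return one shade per column element: 'black', 'gray' or None."""
    n = len(column)

    # Rule 1: a single amino acid occupying strictly more than half the column.
    counts = Counter(aa for aa in column if aa in FAMILY_OF)
    majority = next((aa for aa, c in counts.items() if 2 * c > n), None)
    if majority is not None:
        fam = FAMILY_OF[majority]
        # Rule 2: other members of the majority occupier's family go gray.
        return ["black" if aa == majority
                else "gray" if FAMILY_OF.get(aa) == fam
                else None
                for aa in column]

    # Rule 3: no majority occupier, but one family fills more than half.
    fam_counts = Counter(FAMILY_OF[aa] for aa in column if aa in FAMILY_OF)
    major_fam = next((f for f, c in fam_counts.items() if 2 * c > n), None)
    return ["gray" if major_fam is not None and FAMILY_OF.get(aa) == major_fam
            else None
            for aa in column]


print(shade_column("IIIIILDE"))  # five black I's, gray L, D and E unshaded
print(shade_column("IIIILVDE"))  # no majority occupier; I, L, V all gray
```

Because at most one amino acid (and at most one family) can exceed
half of a column, the order in which candidates are examined does not
matter under the strict rule.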
Now, where I said "more than half" above, there's a question of
whether that should mean precisely greater than half, or greater
than OR EQUAL to half. If the latter, troubling problems can arise.
For example, what do you do if you have an alignment of 8 sequences,
and in one column there are 4 I's and 4 L's -- which do you shade
black and which do you shade gray? Do we need to specify one more
rule, causing all 8 to be shaded gray in such a case?
Keeping the interpretation of "more than half" as precisely greater
than half avoids such questions, but also means that a column with
C,D,E,G,I,I,I,I will have nothing shaded, which is not necessarily
what a user wants. The "perfect" shading program ought to have an
option allowing the user to switch from the default .gt. half rule
to the .ge. rule.
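
The difference between the two interpretations can be sketched as
follows. The function name majority_candidates and its inclusive flag
are hypothetical, and the column is assumed to contain only residue
letters. The point is that the .ge. rule can return two candidates,
which is exactly the 4 I's / 4 L's situation described above.

```python
from collections import Counter


def majority_candidates(column, inclusive=False):
    """Residues occupying more than half of the column (.gt. rule), or
    at least half when inclusive=True (.ge. rule)."""
    n = len(column)
    counts = Counter(column)
    if inclusive:
        return sorted(aa for aa, c in counts.items() if 2 * c >= n)
    return sorted(aa for aa, c in counts.items() if 2 * c > n)


print(majority_candidates("CDEGIIII"))                  # []
print(majority_candidates("CDEGIIII", inclusive=True))  # ['I']
print(majority_candidates("IIIILLLL", inclusive=True))  # ['I', 'L']
```

A program offering the .ge. option would need some extra rule for the
case where two candidates come back, such as shading both gray.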
My users care very much about the a.a. family groupings implicit in
shading decisions. In fact, they want that relation to be explicit,
and independent of the numbers in the mutation matrix (which were
used in constructing the m.s.a.). They would like the user to have
control over the family groupings used in shading decisions
(according to my proposed rules above). It would be useful if the
program had built into it a default a.a. family grouping, 2 or 3
alternate "standard" groupings (selectable at the user-prompt
level), and an option allowing the user to specify his/her own
custom a.a. family assignments (presumably via a text file
conforming to some simple format). I note that Kay Hofmann's
BoxShade program, along with its *.SIM file, can support asymmetric
family membership assignments, which is a useful capability to
preserve in the shading program next to be developed or perfected.
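
As an illustration of what a user-supplied family file with
asymmetric membership might look like, here is a sketch of a parser
for a purely hypothetical format (it is NOT BoxShade's actual *.SIM
format, which is not reproduced here). Each line reads
"X = A, B, C", meaning: when X is the majority occupier, shade A, B
and C gray. The names parse_similarity and is_similar are made up for
this example.

```python
def parse_similarity(text):
    """Parse the hypothetical 'X = A, B, C' format into a dict of sets."""
    similar = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        key, _, members = line.partition("=")
        similar[key.strip().upper()] = {
            m.strip().upper() for m in members.split(",") if m.strip()
        }
    return similar


def is_similar(similar, majority, other):
    """True if 'other' should be shaded gray when 'majority' is the occupier."""
    return other in similar.get(majority, set())


EXAMPLE = """\
# hypothetical custom assignments
I = L, V, M
V = I, L
"""

table = parse_similarity(EXAMPLE)
print(is_similar(table, "I", "M"))  # True
print(is_similar(table, "M", "I"))  # False: M has no entry, so the relation is asymmetric
```

Storing the relation as "majority residue -> set of gray residues"
rather than as disjoint families is what allows the asymmetry.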
The default family grouping might be the one already used in GCG's
GenRunData:Simplify.Txt [text below from GenHelp Simplify
Description output] --
A = P,A,G,S,T (neutral, weakly hydrophobic)
D = Q,N,E,D,B,Z (hydrophilic, acid amine)
H = H,K,R (hydrophilic, basic)
I = L,I,V,M (hydrophobic)
F = F,Y,W (hydrophobic, aromatic)
C = C (cross-link forming)
(grouping due to Jimenez & Martinez at UCSF). An alternate standard
grouping might be the following one from Branden & Tooze,
_Introduction to Protein Structure_, 1991, pages 6-7, which seems
very different --
A,V,F,P,M,I,L (hydrophobic)
D,E,K,R (charged)
S,T,Y,H,C,N,Q,W (polar)
G (glycine)
(I don't know where the ambiguous codes B and Z fit into this one.)
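
To show how much the two groupings above disagree, they can be written
down as data and queried. This is only a sketch: the dictionary and
function names are invented for the example, and residues absent from
a grouping (such as B and Z in the second one) are simply treated as
belonging to no family.

```python
# The two groupings quoted above, keyed by a descriptive family name.
GCG_SIMPLIFY = {
    "neutral, weakly hydrophobic": "PAGST",
    "hydrophilic, acid amine": "QNEDBZ",
    "hydrophilic, basic": "HKR",
    "hydrophobic": "LIVM",
    "hydrophobic, aromatic": "FYW",
    "cross-link forming": "C",
}
BRANDEN_TOOZE = {
    "hydrophobic": "AVFPMIL",
    "charged": "DEKR",
    "polar": "STYHCNQW",
    "glycine": "G",
}


def same_family(grouping, a, b):
    """True if residues a and b fall in the same family of the grouping."""
    lookup = {aa: name for name, members in grouping.items() for aa in members}
    return a in lookup and lookup.get(a) == lookup.get(b)


for pair in ("AS", "DK", "IL"):
    a, b = pair
    print(pair, same_family(GCG_SIMPLIFY, a, b), same_family(BRANDEN_TOOZE, a, b))
# AS True False
# DK False True
# IL True True
```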
My comments above are specific to amino acid sequence alignments. A
shading program may need to obey a different rationale in handling
nucleotide sequence alignments. Black would still highlight majority
occupations. Gray might be used in connection with ambiguous
nucleotides: highlighting ambiguous matches to a majority
(unambiguous) element, or highlighting a column's majority of
matching/consistent ambiguous nucleotides, etc. (It sounds
potentially more complicated than handling amino acids -- may not be
possible to work out a fully consistent scheme.)
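
One of the nucleotide possibilities mentioned above (gray for
ambiguous codes that are compatible with an unambiguous majority base)
could be sketched like this. The table is the standard IUPAC
nucleotide ambiguity code set; the function name and the decision to
use the strict "more than half" rule are assumptions carried over from
the amino acid scheme. The other case, a majority made up of mutually
consistent ambiguous codes, is not attempted here.

```python
from collections import Counter

# Standard IUPAC nucleotide codes and the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def shade_nucleotide_column(column):
    """Black for an unambiguous majority base, gray for ambiguous codes
    that could match it, None otherwise."""
    n = len(column)
    counts = Counter(b for b in column if b in "ACGT")
    majority = next((b for b, c in counts.items() if 2 * c > n), None)
    if majority is None:
        return [None] * n
    return ["black" if b == majority
            else "gray" if majority in IUPAC.get(b, "")
            else None
            for b in column]


print(shade_nucleotide_column("AAAARNC"))
# four black A's, then gray R and N (both can stand for A), C unshaded
```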
Positive reactions, negative t(h)rashing, or other comments?
Mark Reboul
Columbia-Presbyterian Cancer Center Computing Facility
mark at cuccfa.ccc.columbia.edu
P.S. -- Kay Hofmann, creator of BoxShade, raised his own interesting
questions about all this back in '91 in the Info-GCG forum.
For those who may not have seen that, I excerpt the relevant
part of his posting below.
===============================================================================
Date: Sat, 9 Nov 1991 19:40:20 GMT
From: Kay Hofmann <KHOFMANN at CIPVAX.BIOLAN.UNI-KOELN.DE>
Subject: Re: PRETTYBOX v1.1
.
. [initial text omitted by Mark Reboul]
.
Problem 1: Think of an alignment having at a certain position two sequences
with a Glu (E) and, say, five sequences having a hydrophobic
residue (I,L,F,V,M). Now any program has to decide if the two
identical E's should be marked, leaving the rest unlabeled or
if the 5 similar residues should override the identity. This
problem has to be extended to more complex cases involving
a weighting of similar vs. identical residues.
Problem 2: (related to 1) How should similarity be treated in
general? One obvious method would be using Dayhoff-type scores
which would in turn solve the 'identity vs. similarity' problem.
The problem with those scores is that identities of frequently
occurring residues have lower scores than exchanges between
less frequent amino acids. Strict use of these tables would lead
to diagrams where His-Gln would be marked while any number of
Ala at the same position are not recognized.
Problem 3: (unsolvable)
If you have at a sequence position 5x Gly and 5x His only one
of these identities can be marked, the other one has to remain
blank. Depending on the purpose of your multiple alignment, this
can lead to very annoying effects if the 'wrong' group is labeled
by the program. Example:
seq1 GRTEILV
seq2 GRTEILV
seq3 AAAIWWW
seq4 QQQILLL
seq5 LLLIQQQ
you see how the consensus of seq1 and seq2 would be interrupted
because the (unrelated) seq3,seq4 and seq5 all have an Ile at
position 4.
.
. [more text omitted here by Reboul]
.
------------------------------------------------------------------------
Kay Oliver Hofmann Tel. ++49 201 478 6980
Institut fuer Biochemie (med. Fak.) FAX ++49 201 478 6979
Universitaet Koeln
Joseph Stelzmann Str. 52 INTERNET:
D-5000 Koeln 41 KHOFMANN at cipvax.biolan.uni-koeln.de
------------------------------------------------------------------------