Saludos Netlandos,
I recently asked for pointers or comments about representing the
relationships among amino acids in different ways. (My apologies for
cannibalizing text from previous direct replies and the extreme latency in
summarizing).
Perhaps I was being a little too vague in my original posting, but I
wanted as many "alternatives" as possible. I was actually looking for 2
things.
I was originally looking for a way of coding substitution matrix
probabilities (originally Dayhoff's PAM matrix, but willing to consider
others) in a more convenient, compact way for attempting to do sequence
comparisons and alignments (more or less the same thing) using digital
signal processing techniques. (Any comments, helpfully derogatory or
otherwise on such an attempt are also welcome.)
The second is related - a way to show in 3-D, using color,
transparency, and/or "time-smearing" if necessary to address other
dimensions, some way of representing as many of the characteristics of the
amino acids as possible. It occurred to me that it would be a great
teaching tool to be able to have as much information coded into a single
image as possible (shades of Edward Tufte), and so the impetus to check the
net to see how others had approached the problem.
I guess the ideal representation for the sequence analysis would include
a 3D (or 4D or 5D, real or imaginary) vector for each amino acid, the
coordinates representing various information about its biochemical nature
and substitution frequency (perhaps under different conditions or in
different types of protein). That's the ideal. I'll be settling for
something less.
So, in (just slightly over) 25 words 8), that's what I was fishing for.
The fish:
--------------
Tom (alternatively "Cassandra" or "what good is it if my computer can't
parse it?!" 8-) Schneider of NCI (toms at ncifcrf.gov) has approached the
problem with his "sequence logos"; (parsable) reference follows:
@article{Schneider.Stephens.Logo,
author = "T. D. Schneider
and R. M. Stephens",
title = "Sequence Logos: A New Way to Display Consensus Sequences",
journal = "Nucl. Acids Res.",
volume = "18",
pages = "6097-6100",
year = "1990"}
This paper describes a new way to show consensus sequences by changing
the size and other attributes of the letters in a consensus sequence to
show which residues are represented most. It's a bit startling to view,
but it does get the point across and it is certainly easier to pick out
homologies than looking at the usual stacked letters of slightly different
shading. It is a somewhat "lossy" representation, in that it does not
allow you to reconstruct the original sequences, but in many cases that's
not a disadvantage.
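As I understand the paper, the height of each stack of letters is the information content of that position in bits, and each letter's share of the stack is proportional to its frequency. A rough sketch of that calculation follows; it is my own simplification, it leaves out the small-sample correction term the authors apply, and the function name logo_heights is hypothetical.

```python
import math
from collections import Counter


def logo_heights(column, alphabet_size=4):
    """Letter heights in bits for one aligned column (no small-sample correction).

    alphabet_size is 4 for nucleic acids; use 20 for proteins.
    """
    counts = Counter(column)
    n = len(column)
    freqs = {sym: c / n for sym, c in counts.items()}
    # Shannon uncertainty of the observed symbols at this position.
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    # Total stack height: maximum possible uncertainty minus observed uncertainty.
    info = math.log2(alphabet_size) - entropy
    return {sym: f * info for sym, f in freqs.items()}
```

A fully conserved column such as "AAAA" gives one letter two bits tall, while a column with all four bases equally represented gives a stack of zero height, which is why the conserved positions jump out of the picture.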
He writes:
"To see what they are like, you can look into my anonymous ftp archive at
ncifcrf.gov in pub/delila. There is a README with general directions, but the
file globin.logo.Z is in PostScript ready to dump to your printer. (.Z means
it is compressed. Remember to transfer in binary mode and then uncompress
under unix. Tell me if you have trouble with this.) The logo paper itself is
also in the archive, as logo.bbl.Z and logo.tex.Z. I could put the PostScript
version there if you can't use LaTeX. The four figures are: globin.logo.Z
lambcro.logo.Z ribo.logo.Z t7.logo.Z ."
Cliff Pickover (cliff at watson.ibm.com) has some wonderful books out
(Christmas time is coming up - leave a few hints around) that deal with
alternative representations of data ("Computers and the Imagination",
"Computers, Pattern, Chaos, and Beauty" [obviously referring to my desk],
and most recently "Mazes for the Mind: Computers and the Unexpected") and
he recently wrote a paper that tangentially dealt with this problem
(non-parsably, DNA and protein tetragrams: biological sequences as
tetrahedral movements, J Molecular Graphics 10(March):2-16, 1992) by trying
to visualize sequences as 3D movements. For proteins, he tried coding the
amino acids as the standard polar/nonpolar/ charged/noncharged as the 4
vectors but also as a dodecahedron (20 points). This latter attempt is
closer to what I had in mind, but it still does not correctly address the
substitution problem. Distorting the regular dodecahedron brings it closer
to what I had in mind and coloration could bring it still closer, but it's
still not perfect.
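For the flavor of the 4-vector version, here is a small sketch of a sequence drawn as a walk in 3D. It is my own illustration rather than Cliff's code: the four unit steps point at the vertices of a regular tetrahedron, and the assignment of a base to each vertex is an arbitrary assumption. The same idea applies to proteins once each residue is put into one of four classes.

```python
# Four directions pointing at the vertices of a regular tetrahedron.
# Which base goes with which vertex is an arbitrary assumption here.
DIRECTIONS = {
    "A": (1, 1, 1),
    "C": (1, -1, -1),
    "G": (-1, 1, -1),
    "T": (-1, -1, 1),
}


def tetrahedral_walk(seq):
    """Return the list of 3D points visited, starting at the origin."""
    x = y = z = 0
    path = [(x, y, z)]
    for base in seq.upper():
        dx, dy, dz = DIRECTIONS[base]
        x, y, z = x + dx, y + dy, z + dz
        path.append((x, y, z))
    return path
```

Because the four directions sum to zero, a stretch with balanced composition wanders back toward where it started, while a compositionally biased stretch heads off in one direction; feeding the path to any 3D line plotter gives the picture.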
As usual, though, Cliff's approach is very good - we've got access to
all these tremendously powerful, 3D-capable workstations, but very few
techniques have been presented that make use of them to help reduce the
data to a more understandable image. His idea of 3D sequence alignments
is a nice step in bringing another level of analysis to sequence bashing.
Cliff is editing another book coming out Real Soon Now, "The Visual
Display of Biological Information" (has a Tufte-ian ring to it doesn't
it?), which deals with a number of ways to represent and analyze sequence
data (including a bit on Tom's Sequence Logos - see above). Watch for it.
Craig Livingstone (cdl at biochemistry.oxford.ac.uk) suggested a paper about
different ways to picture the relationships among amino acids. (William R
Taylor (1986) The classification of amino acid conservation. J.
Theoretical Biology 119:205-218) It's a bit old, but I'm ashamed that I
hadn't seen it before - it's one of those rare papers that's a truly _fun_
read - well written and informative, with a number of models presented,
warts and all. Also some very nice graphical depictions of said
relationships, including some Venn diagrams that will be going up on my
wall. Very highly recommended!
This was close to what I had in mind - different coding schemes to show
individual relationships between aas.
Taylor also has a chapter in the book "Nucleic Acid and Protein
Sequence Analysis - A Practical Approach", Edited by M. J. Bishop and C. J.
Rawlings, (IRL Press Oxford & Washington DC), 1987 that differs in topic,
but includes some of the same information (thanks to Peter Floriani
(florianp at cs.rpi.edu) for the pointer).
Andre Lipinski (andre at xtliris.csu.McMaster.CA) and Calvin Harley
(charley at mcmail.cis.mcmaster.ca) have "designed a set of rational graphic
representations for presenting protein primary structure ... The
implementation is via a Fortran77 program that produces a PostScript
output, in colour of the sequence as a linear string of icons representing
each different aa." They have also submitted the program to GCG for
inclusion into their package.
Andre writes:
"Our system is simply 2-dimensional giving information ranging from just the
gross character of a sidechain (charged, hydrophilic and hydrophobic and
sulphur containing) using four primary colours for each (red, blue, green
and yellow) to form the body of a cube. To represent other aspects of the
residue, a corner of the cube is coloured in the lesser character of the
residue. We furthered this by changing the basic shape to give an indication
of the size, and to a lesser degree, the shape of the side chain. Small
residues have the bottom corner cut off, medium sized have the cube shape,
big residues have a cube
plus a triangle added to the bottom, side-chains with some partial ring
structure or long chain are a rectangle (resting on the short side) with
one corner rounded off, really big residues like whole ring structures have
a big appendage stuck to the bottom etc... This system has little degeneracy.
The point was to show what it is like, not re-name it. A third method is a
compromise for situations where abstractions will not suffice. It is simply
a stylised representation of the structure of the side chain."
They were good enough to send me color Postscript examples of their
work. The examples are in 3 formats, small, medium, and large. While in
color (using the SGI PS Viewer) even the smallest "characters" were easy to
differentiate, in B+W, the small one is a bit ...small and the
characteristics are hard to make out with the shading scheme they've
implemented. The large size is useful, however, because they very nicely
iconify the actual structure of the amino acid. I apologize for the
somewhat base suggestion that they would be very useful as an Amino Acid
'font' to use in sequence alignments or in structural diagrams - much more
useful than filled circles or stacks of text.
I've asked them to put some examples up for anonymous FTP - write to
them direct.
Larry Hunter (hunter at nlm.nih.gov) responded with an entire paper
(Postscript Format) which also hit very close to the mark. His thesis
(correct me if I'm horribly off the mark) is that amino acids are usually
represented by a very limited number of bits for machine learning and
analysis. One reason for this is the usual/easy way of representing
characters in modern, general-purpose compilers (1 byte/character,
referencing the character and conveying no other information about it).
Certainly, if more information could be coded into the original
representation of the amino acid, you would have to do less processing
to derive relationships among the different aas or proteins.
Larry proposes to use much longer (48 bit) bitstrings to represent amino
acids, encoding in that bitstring the Atom/Orbital/Hydrogen (AOH)
configuration of each. In this way, a substantial amount of the
bio-physico-chemical information about that amino acid is carried in its
representation. While 48 bits is considerably longer than 8 (the standard
length for a char), it is not unmanageably so, fitting nicely into 1.5
32-bit words, the standard data chunk of modern 32-bit processors.
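A quick sketch of the bookkeeping, to show that a 48-bit residue is no harder to handle than a char. This is my own illustration and says nothing about Larry's actual bit layout: the helpers (pack, hamming, split_words) are hypothetical names, and any bit patterns you feed them are placeholders, not real AOH encodings.

```python
WIDTH = 48  # bits per residue, as in the proposal described above


def pack(bits):
    """Turn a string of 48 '0'/'1' characters into an integer."""
    if len(bits) != WIDTH or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 48 binary digits")
    return int(bits, 2)


def hamming(a, b):
    """Number of bit positions in which two packed residues differ."""
    return bin(a ^ b).count("1")


def split_words(value):
    """Split a 48-bit value into a full 32-bit word and a 16-bit half word."""
    return value >> 16, value & 0xFFFF
```

The point of the longer string is that a comparison such as hamming() can return a graded answer, whereas two one-byte names are either the same letter or they are not.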
He has carried out analyses using this representation in a neural
network-based prediction of 'missing amino acids' and has found that it is
always better than the simple naming representation and usually better than
another method that uses 10-bit strings to represent aas (similar to the
one described in Holley and Karplus 1989).
You can get the relevant files by anonymous ftp from the host
lhc.nlm.nih.gov, in the directory /pub/amino-acid-rep.
Gaston Gonnet (gonnet at inf.ethz.ch) contributed two ways of visualizing
this kind of data: He writes:
"We have been looking at representation of amino acids such that
distance between them relates to likelihood of inter-mutation.
Let me explain this in a bit more of detail. From a sufficiently
large set of alignments (or from a Dayhoff matrix) you can compute
the probability of amino acid i mutating into amino acid j for
every i and j. This is known as a mutation matrix in the
standard jargon. Now you make the analogy that distance between
amino acids is inversely proportional to the probability of
mutation. I.e. double the distance means half as likely to mutate
into each other.
Hence, "close" amino acids will likely mutate into each other,
"distant" amino acids are unlikely to mutate into each other.
Close amino acids must have similar properties for protein function.
To represent this information we have two alternatives. The
first is to represent them as an unrooted tree. The amino
acids are at the leaves and the distance between amino acids
is measured by the length of all the branches that link the
two together. This is very much like phylogenetic tree
construction, except that it is done for similarities, not for
common ancestry.
The second method is to use euclidean distances and represent
the 20 points in space. Two-dimensional placements give a rather
primitive approximation, but are much easier to see than 3D (not
to mention 4D, 5D, etc.).
In both cases, for the unrooted tree and for the 2D placement,
the constructions are approximations. It is generally impossible
to find a tree or points in 2D which satisfy all 20*(20-1)/2=190
distance constraints.
This message includes the postscript for the tree and for the 2D
placement. This can be displayed in most bitmapped terminals
and on postscript printers. The postscript files are separated
by dashed lines. Enjoy it!
Gaston H. Gonnet, Informatik, ETH Zurich."
<<< In the interest of saving bandwidth, the PS files are not included here
<<< but can be obtained from our anonymous FTP server (salk-sc2.sdsc.edu)
<<< in the top directory as GONNET.PS (contains both files).
He also responded to my mention of representing the relationships in
multiple dimensions with a note to the effect that his group has already
tried something along these lines and that (happily), _most_ of the
information coded in such a 20D analysis can also be encoded in a 3D
analysis. I look forward to reading the paper.
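The reason a 2D or 3D placement can only be approximate comes down to counting, and the arithmetic is easy to sketch. This is my own back-of-the-envelope version, not Gaston's method: the function names are hypothetical, and the count assumes the distances can be realized by points in ordinary Euclidean space at all, which is not guaranteed for distances derived from a mutation matrix.

```python
def pair_count(n):
    """Number of pairwise distances among n points."""
    return n * (n - 1) // 2


def free_parameters(n, k):
    """Coordinates of n points in k dimensions, minus rigid motions.

    Translations remove k parameters and rotations remove k*(k-1)/2,
    for a total of k*(k+1)/2.
    """
    return n * k - k * (k + 1) // 2


def distance(p, scale=1.0):
    """Distance inversely proportional to mutation probability p.

    scale is an arbitrary constant of proportionality (an assumption).
    """
    return scale / p
```

With 20 amino acids there are pair_count(20) = 190 distances to satisfy, but only free_parameters(20, 2) = 37 adjustable numbers in 2D and free_parameters(20, 3) = 54 in 3D. The two counts do not meet until 19 dimensions, so anything lower has to be a best fit, and it is good news if three dimensions capture most of it.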
Thanks to all who responded! If you have any advice or further
comments, I'd love to hear them. Standard disclaimers apply.
Cheers and Happy Holidays
or
So long and thanks for the fish,
Harry (If you can't do it right, do it now) Mangalam
Harry Mangalam Vox:(619) 453-4100, x250
Dept of Biocomputing Fax:(619) 552-1546
The Salk Institute 1' mangalam at salk-sc2.sdsc.edu
10010 N Torrey Pines Rd 2' hjm at salk-sgi.sdsc.edu
La Jolla CA 92037 3' mangalam at salk.bitnet