IUBio Biosequences .. Software .. Molbio soft .. Network News .. FTP

Sandia Labs Workshop (Final Program)

scistra at cs.sandia.gov scistra at cs.sandia.gov
Wed Jun 8 21:29:04 EST 1994

       Sandia National Labs Workshop on          _/_/_/   _/    _/  _/
                                                _/       _/_/  _/  _/
       COMPUTATIONAL MOLECULAR BIOLOGY         _/_/_/   _/ _/ _/  _/
                                                  _/   _/  _/_/  _/
            June 20-21, 24, 1994             _/_/_/   _/    _/  _/_/_/_/

               Albuquerque, NM

               (Final Program)

This document has the following sections:

        1. Final Program of the Workshop

        2. Workshop Information (Workshop Locations, Hotels, Directions)

        3. Abstracts of the Lectures

        4. About the Speakers

             I. FINAL PROGRAM

====== Monday June 20, University of New Mexico ============

   8:30-9:30 Reception and registration


  Nat Goodman (Whitehead Institute/MIT Center for Genome Research)
    Distinguished Lecture: Genomic problems to entice the computer scientist:
                           databases, maps, logic, and graphs
   9:30-10:30 Lecture 1
  11:00-12:00 Lecture 2

  Pavel Pevzner (Department of Computer Science, Penn State University)
    Genome rearrangements

   2:00-2:40  Lecture 1
   3:00-3:40  Lecture 2


  4:00-4:30  David Greenberg (Sandia National Laboratories)
    Algorithmic strategies for mapping chimeric clones

  4:30-5:00  Emanuael Knill (Los Alamos National Laboratory)
    Pooling strategies for unique sequence screening

  5:00-6:00  Discussions, Nat Goodman, Chair

===== Tuesday June 21, University of New Mexico ============


  Dan Gusfield (Department of Computer Science, U. C. Davis)
    Tutorial on biomolecular sequence analysis: algorithms

  8:30-9:30   Lecture 1
  9:45-10:45  Lecture 2

  Stephen Altschul (National Center for Biotechnology Information, NIH)
    Tutorial on biomolecular sequence analysis: statistics

  11:00-12:00 Lecture 1
   2:00-3:00  Lecture 2

  Bonnie Berger (Department of Mathematics, MIT)
    The mathematics of virus shell assembly

  3:15-4:15   Lecture


  Nat Goodman (Whitehead Institute/MIT Center for Genome Research)
    Research directions at the MIT Center for Genome Research

  4:45-5:45   Lecture 3

  5:45-6:45   Discussions, Dan Gusfield, Chair

===== Thursday June 23 =======================================

  7:00 Dinner with Professor David Botstein

===== Friday June 24, Ramada Hotel Classic ===================


  David Botstein (Department of Genetics, Stanford University)
    Distinguished Lecture: A biologist's view of open problems in
    Computational Molecular Biology


  Stephen Altschul (National Center for Biotechnology Information, NIH)
    Approaches to local multiple sequence alignment


  Dan Gusfield (Department of Computer Science, U. C. Davis)
    What's new in sequence alignment and search algorithms


  Craig Benham (Mount Sinai School of Medicine)
    New search strategies for DNA regulatory regions



The first two days of the workshop (June 20-21) are organized in
collaboration with the University of New Mexico:

  Department of Biology, Mary Anne Nelson, Coordinator,
  Department of Computer Science, Bernard Moret, Coordinator,
  Department of Cell Biology, School of Medicine, and
  UNM Cancer Research and Treatment Center, David Bear, Coordinator.

There is no registration fee for these two days and the workshop is open
to the public.  The workshop will take place at the University of New
Mexico Building for Continuing Education, Ballrooms B&C. (505) 277-2931

The minisymposium of the third day (June 24) is part of the
SIAM Conference on Discrete Mathematics, Albuquerque, NM, June 22-25
1994 and requires registration at the SIAM Conference.
SIAM registration information: (215) 382-9800, 800-447-7426 or
meetings at siam.org (Preregistration deadline: June 8, 1994)
The Ramada Classic Hotel is located on the northeast corner of Louisiana
and Menaul. (505) 881-000, (800) 252-7772

The Continuing Education Center of the University of New Mexico is
located at 1634 University Boulevard NE.  University Boulevard is a
major north-south street in Albuquerque; the Center is on the east
side of the street, north of Indian School Road and just south of the
Interstate (I-40) underpass.

To reach the Center from the airport:
take either the Girard or Yale exit from the airport.  Turn left
(west) on Gibson.  Then turn right (north) on University which is the
first traffic light after Yale.  The center will be on the right
past Indian School Road (approximately 7 traffic lights away).

To reach the Center from I-25 (If coming from either direction of
I-40, get on I-25 South): take the Lomas Boulevard exit.  On reaching
Lomas from the Interstate exit, turn east.  If you are coming from
I-25 South, this will be a left turn.  If you are coming from I-25
North this will be a right turn.  Turn left at University Boulevard
(first major intersection, the second or third traffic light depending
upon the exit taken).  The center is on the right after Indian
School Road.

Three hotels are within walking distance (0.3 mi) of the Center,
although the roadside does not have sidewalks.  These are the Hilton Inn
((505)884-2500), the Holiday Inn Midtown ((505)884-2511), and the Comfort Inn
Midtown ((505)881-3210), all situated around the intersection of University
and Menaul Boulevards, just north of the Center along University Boulevard.

The University of New Mexico has standing arrangements with several hotels
in town; attendees to the SNL/UNM minisymposium on computational molecular
biology qualify for these rates---mention UNM's "preferred rate" when booking.
These hotels include the Hilton Inn ($59/69 for single/double), the Holiday
Inn Midtown ($42/52), but not the Comfort Inn.  Other hotels with special
rates include the Plaza Inn (243-5693, $37/37), very close to UNM's main
campus (about 1 mile southwest of the Continuing Education Center), and
several hotels close to the airport (about 2 miles directly south of UNM's
main campus): Best Western Airport Inn (242-7022, $39/49), Best Western
Fred Harvey (843-7000, $54/64), and La Quinta Inn (243-5500, $46/54).

For participants staying at the Ramada Hotel Classic, driving down to the
Continuing Education Center is recommended, although city buses can take you
there. Simply drive west on Menaul (the east-west street fronting the hotel)
until you reach University (about 3.5 mi), then turn left (south) onto
University and drive 0.3 mile, at which point the Conitinuing Education Center
will be on your left.

There are many restaurants on Central Avenue (old Route 66, which forms
the southern edge of UNM's main campus) between Carlisle and downtown.
Recommended are restaurants in the Nob Hill area (Central west of Carlisle,
including Scalo's, Il Vicino, The Double Rainbow, The Monte Vista Fire Station,
and numerous cafes), around campus (better for lunch perhaps, with various
ethnic foods) south of Central, and on Central west of University Boulevard
(Cafe Oceana, The Artichoke Cafe).

Albuquerque has a bus system, but renting a car is highly recommended.
Parking is easily available at the Continuing Education Center (although
difficult to find on main campus) and around the restaurants mentioned above.


The Workshop Proceedings will be published after the workshop in the
Springer-Verlag Lecture Notes in Computer Science series.

Workshop Organizing Committee:

Sorin Istrail, Chair, (Sandia National Laboratories),
James Orlin    (MIT and Whitehead Institute),
Michael Sipser (MIT and Sandia National Laboratories).

For more information call Sorin Istrail at (505) 845-7612
or (505) 845-7397 or send email to scistra at cs.sandia.gov



          Approaches to Local Multiple Sequence Alignment

                        Stephen F. Altschul

           National Center for Biotechnology Information
                    National Library of Medicine
                    National Institutes of Health
                           Bethesda, MD

DNA and protein molecules may be described abstractly as strings of
nucleotides or amino acids.  Based upon common structure or function,
a collection of DNA or protein sequences may be suspected to share
some common, local pattern.  An important question is how to define,
locate, and assess the significance of such patterns.  No known attack
on this problem is both sensitive to subtle biological relationships,
as well as fast and completely rigorous.  This talk will describe the
attributes an ideal algorithm would possess, and then describe some of
the compromises that have been made in order to construct tools that
are practical for the working biologist.

               Tutorial on Biomolecular Sequence Analysis:
Scoring Systems and Statistics for Protein and Nucleic Acid Sequence Comparison

                        Stephen F. Altschul

           National Center for Biotechnology Information
                    National Library of Medicine
                    National Institutes of Health
                           Bethesda, MD

Methods for searching protein sequence databases have become important tools
for the molecular biologist.  Because distantly related proteins may share
only isolated regions of similarity, e.g. in the vicinity of an active site,
such methods usually seek "local alignments" of segments from the query and
database sequences.  These alignments are generally assessed by means of an
amino acid "substitution matrix" that assigns scores to aligning every pair
of amino acids.  Over the years much effort has been devoted to defining,
analyzing and refining such matrices, with the hope of finding the one best
suited to distinguishing biological relationships from similarities due
merely to chance.

While a wide variety of rationales have been advanced for various scoring
systems, recent statistical results show that all matrices may be seen in a
common light.  Specifically, any substitution matrix is implicitly a log-odds
matrix, optimized for a certain set of amino acid pair "target frequencies".
With proper scaling, the scores in such a matrix may be viewed as bits of
information, or evidence, for the hypothesis of relatedness over that of
chance similarity.  In order to rise above background noise, the score of
an alignment needs to exceed the number of bits required to specify the
starting positions of the alignment's two segments.

Since the choice of a substitution matrix reduces to the choice of appropriate
amino acid pair target frequencies, how can these frequencies be specified?
Given a model of molecular evolution, such as that proposed by Dayhoff and
coworkers, one may calculate the expected frequencies at any given evolutionary
distance.  The popular "PAM-250" substitution matrix, for example, is derived
in exactly this manner.  Alternative approaches to calculating target
frequencies, such as that used to construct the "BLOSUM" matrices, have
recently been described.

In a given database search, database sequences related to the query may be
removed by any evolutionary distance, and short but strong similarities may
be just as interesting as long but weak ones.  However, it is not known a
priori what evolutionary distance will best characterize the similarities to
be found.  Therefore using scoring systems tailored to different evolutionary
distances can be a fruitful strategy.  In general, three matrices (e.g. the
PAM-30, 120, and 250 matrices) are sufficient to "cover the waterfront" of
recognizable similarities; further matrices should be of only marginal utility.

The theory can be extended to DNA sequence comparison, and appropriate scoring
systems may be derived.  For coding sequences, it may be shown that protein
comparison is generally more sensitive to distant relationships than is direct
DNA sequence comparison.

          New Search Strategies for DNA Regulatory Regions

                             Craig Benham

                   Department of Biomathematical Sciences
                      Mount Sinai School of Medicine
                            New York, NY

Control regions within DNA sequences regulate gene  expression and other
biological processes. Existing algorithms to find these regions search for
characteristic sequence signatures. They commonly find many candidate sites,
few of which are active. New methods determine which candidate sites have
physical-chemical attributes needed for activity. The propensity of DNA
strands to separate under stress correlates closely with specific activities.
The special methods needed to analyze DNA sequences for this non-local
attribute will be described.

                 The Mathematics of Virus Shell Assembly

                           Bonnie Berger

                      Department of Mathematics
                           Cambridge, MA

    (Joint work with Peter Shor, Lisa Tucker-Kellogg, and Jonathan King)

Viruses have shells made of repeated protein subunits surrounding their genetic
information.  Many viruses, including polio, herpes, and AIDS, have
icosahedral-shaped shells.  But how do these simple proteins build into such
complex shapes?  In the talk, I will describe surprisingly simple sets of local
rules that explain the self-assembly of nearly all icosahedral viruses,
including two whose structures have puzzled researchers.  With these local
rules, we can simulate the assembly process computationally, and design a
"toolkit" that will allow biologists to study virus shell assembly on a
computer screen.

    A Biologist's View of Open Problems in Computational Molecular Biology

                             David Botstein

                         Department of Genetics
                     Stanford University Medical Center
                              Stanford, CA


                   Genomic Problems to Entice the Computer Scientist:
                           Databases, Maps, Logic, and Graphs

                                    Nat Goodman

                   Whitehead Institute/MIT Center for Genome Research
                                   Cambridge, MA

This talk will present a sampling of systems-oriented computer science
problems that arise in genome studies.  The problems are all "open" and
could serve as a research entree into the field.  The problems to be
covered are

     1.  Sequence databases:  data modelling and DBMS requirements
     2.  Physical map construction:  simulation studies and heuristic solutions
     3.  Laboratory information systems:  client/server meets object-oriented
     4.  Building genome information systems from re-usable components

The talk is aimed at computer scientists with little background in biology.
 The principal theme is that there are lots of good CS problems in this
field.  But, these problems won't jump out at you; to find good new
problems, you have to become immersed in the problem domain.

               Research directions at the MIT Center for Genome Research

                                    Nat Goodman

                   Whitehead Institute/MIT Center for Genome Research
                                   Cambridge, MA

               Abstract: TBA

                Algorithmic Strategies for Mapping Chimeric Clones

                                David Greenberg
                          Sandia National Laboratories
                                 Albuquerque, NM

                          (joint work with Sorin Istrail)

We will present several techniques for analyzing STS probe-clone
incidence data.  Our techniques are designed to identify and correct
errors due to multiple insert chimerism and due to internal deletions.
The core of our approach is to use optimization functions which
correlate to the expected structure of chimeric data.  We further
refine the technique by combining multiple functions to yield
confidence information.  We will show the results of applying our
techniques to simulated data and to actual genomic data.

             Tutorial on biomolecular sequence analysis: Algorithms

                             Dan Gusfield

                      Department of Computer Science
                      University of California, Davis
                               Davis, CA

The area of algorithms for bio-sequence analysis is huge, but relies on  a few
general ideas related to dynamic programming, pattern matching, and tree
building.  In this tutorial I will introduce the basic techniques and mention
the most central applications (sequence database searching is the most mature
application), but I will develop the techniques in depth using some lesser
known and more recent applications such as cDNA matching, PCR primer
selection, alignment to a fixed tree, efficient search for approximate
palindromes and repeats, efficient search for exon shuffling and trans

              What's new in sequence alignment and search algorithms

                             Dan Gusfield

                      Department of Computer Science
                      University of California, Davis
                               Davis, CA

In this talk I will give an overview of new results in string algorithms
motivated by problems in molecular biology and how the algorithms on strings
relate to sequence problems on more complex structures, and in applications
allowing more complex sequence mutations.

           Pooling strategies for unique sequence screening

                            Emanuel Knill
                    Center for Human Genome Studies
                    Los Alamos National Laboratory
                           Los Alamos, NM

             (Joint work with Bill Bruno, David Balding and David Torney)

Unique sequence screening is an instance of the group testing problem. It
is used to locate the clones in a library of genomic clones which contain a
given unique site. This is required both for compiling maps and for locating
sequences obtained by other means. This application of group testing has
several features which distinguish it from most other uses which have
been studied. The first is that the cost of constructing pools and the
advantages gained from parallelism encourage the use of non-adaptive
methods. The second is that the number of screenings is usually at least
of the order of the number of clones in the library. Finally, in the
mapping application, the goal is not to find all clones which contain a
given site on each screening, but to find as many as necessary. Additional
ones can be inferred at a later stage of the screening process. I will
discuss the various criteria that are used for evaluating and optimizing
screening strategies and give an overview of some of the strategies
used. We have recently begun to make use of probabilistic constructions
for experimental screening of two libraries with about $1200$ clones each.
Our experience with these libraries will be discussed. I will explain the
computational problems involved in the development of general screening
strategies and the interpretation of data which results from screening.

                         Genome rearrangements

                           Pavel A. Pevzner

             Department of Computer Science and Engineering
                    The Pennsylvania State University
                           University Park, PA

Sequence comparison in molecular biology is in the beginning of a major
paradigm shift - a shift from *gene* comparison based on local mutations
to *genome* comparison based on global rearrangements (i.e. inversions and
transpositions of fragments). With the advent of large-scale DNA mapping and
sequencing, the number of genome rearrangement problems is rapidly growing
in different areas, including evolution of chloroplast and mitochondrial
genomes, virology and Drosophila genetics.  On the other hand, there are
almost no computer science results allowing a biologist to analyze genome

I discuss algorithms for *sorting by reversals and transpositions* to analyze
genome rearrangements and describe applications of these techniques to study
plant and viral evolution. I also formulate some open combinatorial problems
on evolution of mammalian chromosomes.  Finally I discuss Gollan's conjecture
about the reversal diameter of the symmetric group (motivated by genome
rearrangements), Ukkonen's conjecture on the equivalent word transformations
(motivated by DNA sequencing by hybridization), and the Schmitt-Waterman
conjecture about cassette transformations (motivated by DNA physical mapping).
Surprisingly enough, similar combinatorial techniques (alternating cycles in
edge-coloured graphs) prove all three conjectures and provide a clue for
algorithms in such diverse areas of computational biology as genome
rearrangements, sequencing by hybridization, and double digest physical mapping.

The work on sorting by reversals/transpositions and genome rearrangements
in plant organelles is a joint project with Vineet Bafna (Penn State).
The work on genome rearrangements in herpes  viruses is a joint project
with Colombe Chappey (NIH), Sridhar Hannenhalli (Penn State) and  Eugene
Koonin (NIH).

                     IV. ABOUT THE SPEAKERS

* Craig Benham is Professor of Biomathematical Sciences at Mount Sinai School
of Medicine. His research focuses on investigating how stresses imposed on DNA
affect its structure, and how these structural modifications in turn regulate
physiological activities such as initiation or termination of transcription and

* Bonnie Berger is an Assistant Professor of Mathematics at the Massachusetts
Institute of Technology. Her research focuses on computational biology as well
as on parallel algorithms and architectures.

* Professor David Botstein is chairman of the Department of Genetics at
the Stanford University School of Medicine.  His research in genetics
and molecular biology has helped revolutionize both fields.  His early
work on the bacteriophage P22 led to better understanding of DNA
function.  His work on protein secretion and the use of localized
random mutagenesis technologies has led to new insights in the interaction
of protein structure and function.  The search for human disease genes was
revolutionized by the mapping strategy of Botstein and collaborators which
employed restriction fragment length polymorphisms (RFLPs).  This new
genomic technology has led to the discovery of many disease genes
including Huntington's disease.  One of his current projects is the
Stanford Yeast Genome Project.

Dr. Botstein has been elected to the US National Academy of Sciences
and the Institute of Medicine.  He has been awarded the Eli Lilly
Award in Microbiology, the Genetics Society of America Medal, the
Allen Award of the American Society of Human Genetics, and the
Rosenstiel Award.

       "The students were looking forward especially to the encounter with
        Botstein. The MIT biologist was gaining a reputation somewhat akin to
        that of a rock star among young scientists... Moreover, his ad-lib
        monologues -- which could be triggered by anything from a simple,
        mundane question to the delivery of a full-blown scientific paper
        -- often were brilliant, for Botstein was one of those rare
        "conceptualizers,"  a scientist who could take a single fact and
        in a matter of minutes show his fascinated listeners how it fit into
        a much larger picture of scientific enterprise. Few walked away from
        an encounter with Botstein without a stimulating new appreciation of
        their own research."
        (From the book ``Genome'' by J. E. Bishop and M. Waldholtz,
         (Simon & Schuster 1990)

* Dr. David Greenberg is a Senior Member of the Technical Staff in the
Department of Algorithms and Discrete Mathematics at Sandia National
Laboratories. His research focuses on genomic mapping as well as on parallel
algorithms and architectures. He is part of the Sandia/Intel team that
earlier this year broke the existing world record for the MP Linpack
performance benchmark, returning the title to U.S. The new world record
obtained on the Sandia's Paragon -- the world's highest-performance
supercomputer -- is 143.4 double-precision gigaflops (billion floating point
operations per second).

* Dan Gusfield is Professor of Computer Science at University of
California, Davis. His research focuses on combinatorial optimization,
string algorithms and computational molecular biology.

* Dr. Nat Goodman is Senior Research Scientist at the Whitehead Institute
for Biomedical Research and Associate Director of the Whitehead
Institute/MIT  Center for Genome Research where he directs an informatics
research and development group supporting the Center's large-scale genome
mapping projects. He is a computer scientist with extensive background in
academia and industry.

Dr. Goodman is known in academia for his work on distributed database
systems, transaction management, and query processing.  His industrial
experience includes stints at several computer start-up companies; he was
a founder of Kendall Square Research Corporation, which manufactures
parallel super-computers, and is presently acting-President of Marble
Associates, Inc., a consulting firm specializing in the design and
implementation of business information systems.  Dr. Goodman has served on
numerous advisory bodies for the Human Genome Project including the Joint
Informatics Task Force.

* Emanuel (Manny) Knill is at the Los Alamos National Laboratory. His research
focuses on combinatorics, logic and applications to genomic mapping.

* Pavel Pevzner is Professor of Computer Science at the Pennsylvania State
University. His research focuses on computational molecular biology.

More information about the Comp-bio mailing list

Send comments to us at biosci-help [At] net.bio.net