A Philosophy for GenBank

Tom Schneider toms at fcs260c2.ncifcrf.gov
Mon Jul 6 18:01:43 EST 1992


% This is a LaTeX document
\documentstyle[12pt]{article}
\sloppy

\newcommand{\theversion}{{version = 1.01 of philgen.tex 1992 July 6 }}

%\textheight 9in % this works
%\topmargin -0.5in % -0.5 would shift the whole thing up
%\headheight 0in
%\headsep 0in
%\textwidth 6in
%\oddsidemargin 0in

\title{ A Philosophy for GenBank }

\author{
Thomas D. Schneider
\thanks{National Cancer Institute,
        Frederick Cancer Research and Development Center,
        Laboratory of Mathematical Biology,
        P. O. Box B,
        Frederick, MD 21702-1201.
        Internet address: toms at ncifcrf.gov. } }

\date{\theversion}

\begin{document}
%\pagestyle{empty} % removes page numbers

\hbadness 4500 % avoids silly messages about underfull boxes
%\pagestyle{myheadings}
%\markright{Running Title}
\bibliographystyle{unsrt}
\maketitle

This is a draft version of a philosophy to guide design of the international
genetic sequence databases.  My purpose is to present fundamental ideas for
design of the database, because many problems with the current databases stem
from a lack of definition.  This document does not describe the computer systems
or languages which would be used to create the database.  We are concerned here
with what should be in the database and how it should be stored.  Implementation
is a separate issue.\footnote{
Comments welcome, please post directly on the net in bionet.general.}

\vspace{12pt}
\begin{center}
\noindent
\framebox{
\parbox{5.0in}{
{\bf Principle 1:  A user should not be required to extract the original
paper(s) in order to work with a sequence.}
}
}
\end{center}
Reasoning:  When one is manipulating 2000 sequence fragments, it becomes
impractical to look up every paper.  Even if one could obtain and read
all the relevant papers today, the amount of data doubles every 1.5 years,
so it soon becomes impossible to keep up.

\vspace{12pt}
{\bf OBJECT DEFINITION:
An object is a region of genetic material with distinct sequence
properties. }  According to Principle 1, data defining and associated
with objects must be complete.  New information about the object
must be continuously added to the database to keep it up to date,
even if no new sequence information has been produced.
The current implementation of GenBank emphasizes sequences
and neglects new information about the sequences---there is no reliable
mechanism for capturing information not associated with the original
sequence.

\begin{enumerate}
\item
{\em EVERY OBJECT HAS A TYPE.}
A rigidly controlled list of types must exist.  New types are added
as new biological features are defined.  The ``misc\verb|_|feature'' of
the current implementation is not
an acceptable type.

The types must form a logical, non-overlapping set of definitions
of biological objects.  Each type must be defined and carefully
distinguished from related types so that objects are not given the
wrong type.  For example, it is not appropriate to indicate in the database
that an mRNA is an exon.  The mRNA is a complete processed RNA,
while an exon is a portion of the RNA.

\item
{\em EVERY OBJECT HAS A NAME.}
For a computer to find and manipulate an object, the object must have
a name (or ``label'')
which is unique within a certain scope.  Names begin with the
species and are followed by the genetic locus, and then by the specific
object.  Example:  {\em E. coli, lac, Z}---which would have type {\em gene}.
The species must be carried with the object, since we wish to be able
to perform manipulations which create chimeric sequences.
Names allow one to define genetic locations RELATIVE to an object.
This has the great advantage that the coordinates of the object
may shift, but one's instructions would not be affected.
For example, after specifying the object
mentioned above, we can then say that we are interested in the
region from $-60$ to $+40$ around the ribosome binding site.
This particular sequence has been unaltered for many years, and
will remain so even as the entire sequence of {\em E. coli} is being
completed.  In contrast, an absolute coordinate ({\em e.g.} 32432) is
useless as soon as sequences are merged.
The most useful name for an object is its standard genetic name.
Although these may change as standards become better, they are more
stable than anything else.  A list of synonyms could be associated
with each object.  By this means one may take an old list of
objects and use it (for the most part) in a new database.  The old
LOCUS name fails to allow this.
The ``note'' in the current implementation does not satisfy this requirement
because the data in it cannot be obtained precisely with an algorithm.
Notes should almost never be used.

\item
{\em Every object which represents a change of the sequence has
that change recorded in a COMPUTER MANIPULATABLE FORMAT.}
Without a precise algorithm for how to change the object, programs which
perform large statistical analyses of the database cannot be built.
Once again, ``note'' fails to satisfy this requirement.

\item
{\em NO OBJECT IS EVER DUPLICATED.}
Duplication (redundancy) in a database leads to inconsistency when
part of the database is corrected but the other part is not.
It also wastes space.

\item
{\em EVERY OBJECT ALWAYS HAS A RECORD OF THE KIND OF EVIDENCE IN SUPPORT OF IT.}
Objects are defined by experiment, and in general should not be defined
merely by examining a sequence with a computer program.  Such programs are a
reflection of the current thinking about what is in sequences, and are
therefore subjective.  Highly significant repeat structures and guesses
based on computer searches should be recorded with a clear indication of
the program which was used.  These guesses should be removed when experimental
data become available.  Guesses are useful as predictions, but unless they
are marked as such they interfere with statistical analysis of the
database.

\item
{\em EVERY OBJECT ALWAYS HAS A REFERENCE.}
This allows one to locate the original experimental data when this
is required.

\item
{\em EVERY OBJECT HAS A NATURAL LOCATION.}
Ultimately, this is given by a chromosome name and a coordinate on the
chromosome.
\end{enumerate}
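
The requirements above can be sketched in code.  The following Python
fragment is only an illustration, not part of any GenBank implementation:
the class \verb|GeneticObject|, its field names, the list
\verb|ALLOWED_TYPES|, the function \verb|relative_region| and all
coordinates are assumptions made up for this example.  It shows a record
that carries a controlled type, a name built from species, locus and
object, the kind of evidence, a reference and a natural location.  It also
shows how a region given RELATIVE to an object survives a shift of the
absolute coordinates, assuming a forward orientation with offset 0 at the
first base of the object.

\begin{verbatim}
```python
from dataclasses import dataclass, replace

# Hypothetical controlled vocabulary; a real list would come from the
# database's defining document.
ALLOWED_TYPES = frozenset({"gene", "mRNA", "exon", "ribosome binding site"})


@dataclass(frozen=True)
class GeneticObject:
    species: str        # e.g. "E. coli"
    locus: str          # e.g. "lac"
    label: str          # the specific object within the locus
    obj_type: str       # must be a member of ALLOWED_TYPES
    chromosome: str     # natural location: chromosome name ...
    start: int          # ... and coordinates on that chromosome
    end: int
    evidence: str       # kind of evidence, e.g. "experimental"
    reference: str      # citation key for the supporting reference
    synonyms: tuple = ()

    def __post_init__(self):
        # Reject anything outside the controlled list of types.
        if self.obj_type not in ALLOWED_TYPES:
            raise ValueError(f"unknown object type: {self.obj_type!r}")

    @property
    def name(self):
        # Names begin with the species, then the locus, then the object.
        return (self.species, self.locus, self.label)


def relative_region(obj, first_offset, last_offset):
    """Turn offsets relative to the object's first base into absolute ones."""
    return (obj.start + first_offset, obj.start + last_offset)


# All values below are placeholders chosen for illustration.
rbs = GeneticObject(
    species="E. coli", locus="lac", label="Z ribosome binding site",
    obj_type="ribosome binding site", chromosome="chromosome",
    start=1000, end=1010, evidence="experimental",
    reference="placeholder-citation-key",
)

# After sequences are merged the absolute coordinates shift ...
shifted = replace(rbs, start=6000, end=6010)

# ... but the relative instruction "-60 to +40" is unchanged.
print(relative_region(rbs, -60, 40))      # (940, 1040)
print(relative_region(shifted, -60, 40))  # (5940, 6040)
```
\end{verbatim}

In this sketch an attempt to create an object of type
\verb|misc_feature| raises an error, because that type is not in the
controlled list.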

\vspace{12pt}
\begin{center}
\noindent
\framebox{
\parbox{5.0in}{
{\bf Principle 2:  Subsets of the database have the same form as the
original database.}
} }
\end{center}
This allows one to use the same tools on the subset
as on the original set.  See \cite{Schneider1982,Schneider1984}.
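
A minimal Python sketch of this principle follows.  It is an assumption
made for illustration, not the design used in the cited work: the class
\verb|Database| and its methods are hypothetical, and the entries are toy
strings.  The point is only that taking a subset returns an object of the
same form, so any tool written for the whole database also runs on the
subset.

\begin{verbatim}
```python
class Database:
    """A toy database: a mapping from entry name to sequence string."""

    def __init__(self, entries):
        self.entries = dict(entries)

    def subset(self, predicate):
        """Return a new Database holding only the entries that pass."""
        kept = {name: seq for name, seq in self.entries.items()
                if predicate(name, seq)}
        return Database(kept)

    def total_length(self):
        """An example tool; it works on any Database, whole or subset."""
        return sum(len(seq) for seq in self.entries.values())


db = Database({"a": "ACGT", "b": "GG", "c": "TTTAA"})
long_entries = db.subset(lambda name, seq: len(seq) > 2)

print(db.total_length())            # 11
print(long_entries.total_length())  # 9
```
\end{verbatim}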


\vspace{12pt}
{\bf SEQUENCE DEFINITION:
A sequence is a series of nucleotides or amino acids represented
by alphabetic symbols.  Alignment gaps are allowed, and are symbolized
by a dash (-).}
Each sequence has associated with it:
\begin{enumerate}
\item
A coordinate system.
This follows from Principle 2 because
when one is working with a partial fragment of a sequence, it is useful
to have the original numbering system maintained.
Therefore the original database should have a coordinate system.
(This may be implicitly 1 to n.)
When a sequence is derived from other sequences, each portion may
have its own coordinate.  Coordinates may be listed in the form:
\begin{center}
C(1 100)(120 150)(15 1)(150 200)
\end{center}
This indicates that a circular (C) sequence
is being described (it could also be Linear or Repeat).  The sequence
given $5'$ to $3'$ starts at base 1 and proceeds to 100.
There is a gap in the numbering ({\em e.g.}
from a deletion or a chimeric construction),
followed by a segment numbered from 120 up to 150,
then an inserted segment numbered from 15 {\em down to} 1.
Finally the circle
is closed with sequences running from 150 to 200.  With this method, mutant
sequences can be created which still have nearly the same numbering as
the original sequence.  This facilitates comparison between sequences.
In this scheme,
the coordinate of each base must be defined with two numbers: 2 at 14 could be a notation
to indicate the second segment, base 14, of the example given above.
\item
An aligned base.
When a set of sequences is extracted and aligned, a single base must
be designated as the aligned base.  By the second principle,
this must be in the original database (or it may be implicitly the first base).
\item
A species designation list.  More than one species may be required
to describe artificial constructions.  In this case each piece must
be identified by the coordinates.
\item
{\em A map location}, in standard genetic coordinates.  This includes
the orientation of the sequence (if known).
\item
A list of associated objects for the sequence.
Every object is associated with its originating species
(or inorganic synthesis), but there can be a large number of objects
on a nucleic acid.  To avoid duplication and redundancy, objects are associated
with a single sequence.
\item
{\em References}: A list of associated references for the sequence and
associated objects.
\end{enumerate}
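
The coordinate notation described in the first item is regular enough to
be read by a program.  The Python sketch below is an illustration under
stated assumptions: the function names \verb|parse_coordinates| and
\verb|locate| are invented here, and the letters L and R for Linear and
Repeat are guesses, since only C appears in the example.  It parses the
example string and maps a position in the stored sequence to a segment
number and a base number, which are the two numbers needed to identify a
base.

\begin{verbatim}
```python
import re

# Only "C" appears in the text; "L" and "R" are assumed abbreviations.
TOPOLOGY = {"C": "circular", "L": "linear", "R": "repeat"}
SEGMENT = r"\(\s*(-?\d+)\s+(-?\d+)\s*\)"


def parse_coordinates(text):
    """Parse e.g. 'C(1 100)(120 150)' into (topology, list of segments)."""
    text = text.strip()
    if not text or text[0] not in TOPOLOGY:
        raise ValueError("missing or unknown topology letter")
    body = text[1:]
    pairs = re.findall(SEGMENT, body)
    # Anything left after removing valid segments and spaces is an error.
    leftover = re.sub(SEGMENT + r"|\s", "", body)
    if leftover or not pairs:
        raise ValueError("malformed segment list")
    return TOPOLOGY[text[0]], [(int(a), int(b)) for a, b in pairs]


def locate(segments, position):
    """Map a 1-based position in the stored sequence to
    (segment number, base number within that segment's numbering)."""
    if position < 1:
        raise IndexError("position must be 1 or greater")
    remaining = position
    for index, (first, last) in enumerate(segments, start=1):
        length = abs(last - first) + 1
        if remaining <= length:
            step = 1 if last >= first else -1   # numbering may run down
            return index, first + step * (remaining - 1)
        remaining -= length
    raise IndexError("position is beyond the end of the sequence")


topology, segments = parse_coordinates("C(1 100)(120 150)(15 1)(150 200)")
total = sum(abs(last - first) + 1 for first, last in segments)

print(topology)               # circular
print(total)                  # 197
print(locate(segments, 101))  # (2, 120)
print(locate(segments, 132))  # (3, 15)
```
\end{verbatim}

For the example string the stored sequence has 197 bases.  Position 132
is the first base of the third segment and carries the number 15, which
shows why a base number alone is ambiguous: the number 15 also occurs in
the first segment.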

\vspace{12pt}
{\bf REFERENCE DEFINITION:
A reference is a citation to the primary scientific literature.}
The reference must include:  authors, title, journal, volume, pages, year.
When the data are not in the literature, the name, address, phone and email
address of the originator, or other information available, should be stored.
A practical modern output format for these data is the BibTeX format.
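
As a small illustration, the required fields could be written out as a
BibTeX entry by a function like the Python sketch below.  The function
name \verb|to_bibtex|, the citation key and the field values are
placeholders invented for this example; they do not describe a real paper
or an existing tool.

\begin{verbatim}
```python
def to_bibtex(key, authors, title, journal, volume, pages, year):
    """Format the required reference fields as a BibTeX @article entry."""
    fields = [
        ("author", " and ".join(authors)),
        ("title", title),
        ("journal", journal),
        ("volume", str(volume)),
        ("pages", pages),
        ("year", str(year)),
    ]
    lines = [f"@article{{{key},"]
    lines += [f"  {name} = {{{value}}}," for name, value in fields]
    lines.append("}")
    return "\n".join(lines)


# Placeholder values only; this is not a real citation.
entry = to_bibtex(
    key="Placeholder1992",
    authors=["A. Author", "B. Author"],
    title="An Example Title",
    journal="An Example Journal",
    volume=1,
    pages="1--10",
    year=1992,
)
print(entry)
```
\end{verbatim}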

\vspace{12pt}
\begin{center} 
\noindent 
\framebox{ 
\parbox{5.0in}{ 
{\bf Principle 3: Scientists are responsible for the quality of the
sequence data, not the database staff.}
} } 
\end{center}
It is the responsibility of each person who submitted a sequence
to be sure that the data in the database are complete and accurate.
It is impossible for the databases to ensure this.
Furthermore,
as the scientist learns more about a sequence, they are obliged to enter
the new data in the form of objects.  This is a duty.  To neglect this
duty means that the scientist's work is unavailable to other
scientists.  Objects that are not entered into the database
should not be considered discoveries.

\vspace{12pt}
\begin{center}  
\noindent  
\framebox{  
\parbox{5.0in}{  
{\bf Principle 4: The database staff is responsible for the quality of the
sequence database structure.}
} }
\end{center} 
A set of documents is required:
\begin{enumerate}
\item
{\em Philosophy.}
The philosophy of the database (as exemplified
by this document) is required to let everyone know WHY the database has
been structured as it is.  This allows people to challenge the fundamental
ideas implied by the current implementation.
\item
{\em Definition.}
This document defines the kinds of objects in the database in complete
detail.  Syntax may be defined in languages like BNF.
Allowed types are defined.
\item
{\em Implementation.}
Limitations on the lengths of names and other implementation-dependent
decisions are described in this document.  ASN.1 is an excellent
implementation language.  Such an implementation does not, of course,
address the philosophy or definitions issues.
\item
{\em Examples.}
Examples of every object should be given so that scientists can see
how to express themselves in the database.
\item
{\em Tutorial.}
This document is for scientists.
It describes how to create and maintain the database.
\end{enumerate}

In 1991 several groups found that they could not parse the database.
In 1992 I discovered duplicated features in the database.
These incidents
would not have happened if the documents described
above had been created and a series of {\bf check programs} had been written.
Such check
programs must be written from the defining document, {\em not\/} from the current
implementation of the database.  A test of the defining document is
that it allows anybody to write the check program.
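
A check program of this kind can be very small.  The Python sketch below
is an assumption made for illustration, not an existing GenBank tool: the
function \verb|check_features|, the dictionary keys and the list
\verb|ALLOWED_TYPES| are hypothetical stand-ins for what a defining
document would specify.  It tests two of the requirements stated earlier,
that every object has a type from the controlled list and that no object
is duplicated.

\begin{verbatim}
```python
from collections import Counter

# Stand-in for the list of types given in the defining document.
ALLOWED_TYPES = frozenset({"gene", "mRNA", "exon"})


def check_features(features):
    """Return a list of problems found; an empty list means no problems."""
    problems = []

    # Every object must have a type from the controlled list.
    for index, feat in enumerate(features):
        if feat["type"] not in ALLOWED_TYPES:
            problems.append(
                f"feature {index}: type {feat['type']!r} is not defined")

    # No object may appear more than once.
    counts = Counter(
        (feat["type"], feat["name"], feat["start"], feat["end"])
        for feat in features)
    for key, count in counts.items():
        if count > 1:
            problems.append(f"feature {key} appears {count} times")

    return problems


# Made-up records for illustration.
features = [
    {"type": "gene", "name": "lacZ", "start": 100, "end": 400},
    {"type": "gene", "name": "lacZ", "start": 100, "end": 400},
    {"type": "misc_feature", "name": "unknown", "start": 10, "end": 20},
]
for problem in check_features(features):
    print(problem)
```
\end{verbatim}

Because the checks are written from the stated rules rather than from the
contents of the database, the same program can be rerun on every release.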

\vspace{12pt}
\begin{center}
\noindent
\framebox{
\parbox{5.0in}{
{\bf Principle 6:
Anyone using the database is responsible for reporting errors,
however small.
} } }
\end{center}
If an error is not reported, it will, of course, remain in the
database where it may affect the quality of science done with the database.

\vspace{12pt}
\begin{center}
\noindent
\framebox{
\parbox{5.0in}{
{\bf Principle 7: Release of sequence or object data into the database
has higher priority than physical publication for
determining precedence.
} } }
\end{center}
The reference data items in the
database establish precedence.

``That is if you were the first to clone and
sequence a particular gene, the next person who sequences the homologue
would find your sequence already in the database, and would be obliged
to cite your entry, even if it had not yet been
published.''---Brian Fristensky.

One consequence of sticking to this principle would be a massive improvement
of the quality of data in the database.

The other consequence is that we will abandon paper technology.

\vspace{12pt}
ACKNOWLEDGEMENTS:
Brian Fristensky suggested the priority principle for GenBank publication.

\bibliography{all}
\end{document}


