On `A Philosophy for GenBank'
I would like to commend Thomas Schneider for bringing attention to some
critical issues. Unless the architecture and the nature of the information
represented in a database are concisely and comprehensively defined, the
database itself will be of limited value. I would like to follow up by
examining some of the issues raised in the document. Please do not misinterpret
my comments and/or criticisms; it is precisely because Tom has given such
careful thought to the document, and because it was so well conceived, that it
is worth examining in more detail.
First off, I suggest that the title be reconsidered. The primary objective is
to reduce biological knowledge to computer-readable form; this is an entirely
practical matter. It has little to do with one's personal philosophy. We should
take care to avoid standing on principles that have very little practical
significance. There are many models for representing information. Successful
designs are distinguished because they are well thought out and internally
consistent and, perhaps most importantly, because they are realizable. The `best'
idealistic models are of little value if they cannot be successfully
implemented. It is 1992 and it is high time that we stop designing `vaporbases'
(databases whose implementation plans are so nebulous that they never actually
appear).
I suggest a more appropriate title would be something to the effect: `A
Conceptual Model of GenBank.' We at the PIR-International Protein Sequence
Database have given serious thought to the issue of database documentation in
regard to the Protein Sequence Database and have arrived at a similar
three-tiered approach to database documentation. Our general strategy differs
only slightly from that outlined by Tom. On the top tier, we have conceived of
a `procedures manual.' Interestingly enough, our procedures manual shares many
attributes of Tom's Philosophy. It gives a conceptual definition and "WHY the
database has been structured as it is." We have focused on specific procedural
decisions concerning which representation strategies are to be employed in
specific instances. It appears that this is simply a different way of phrasing
Tom's ideas via specific examples.
I will focus the rest of this discussion on Principles 3, 4, and 7. In general,
I am in agreement with them; however, the document appears to have neatly
sidestepped several more fundamental issues and thus these principles have not
been elucidated in sufficient detail. The first issue concerns whether or not to
merge information reported by different scientists. Put another way, what is
the nature of the `objects' being represented? First, I presume Tom will agree
that the sequence data in and of themselves are of only limited value if no
other information is associated with them, i.e., if a database search turns up
a matching sequence of which nothing is known other than the fact that someone
has determined it, what do we do with this information? Is it the sequence of
the same molecule (perhaps containing some inconsistencies) or that of a homolog
of unknown nature? So I will restrict the discussion to sequence objects that
have reasonably well defined properties associated with them.
Given the discussion in the document I suspect that, although it was not
explicitly stated, Tom is in favour of defining objects as `biological'
objects, i.e., the sequence (or feature) as it exists in nature as opposed to
the sequence as reported by an individual research scientist or an individual
group of scientists. Applying the redundancy property to such objects
necessitates merging overlapping information and representing a single
composite entity.
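To make the notion of a composite entity concrete, the following is a minimal
sketch in Python. It is not a description of how GenBank or the
PIR-International database actually merge entries. The names Report, Composite,
and merge_reports, the assumption that all reports share one coordinate system,
and the use of `X' to mark a disagreement are hypothetical and chosen only for
illustration.

```python
from dataclasses import dataclass


@dataclass
class Report:
    """One independent report of a sequence (hypothetical structure)."""
    source: str    # label for the reporting scientist or citation
    start: int     # 0-based offset on an assumed shared coordinate system
    residues: str  # the sequence as reported


@dataclass
class Composite:
    """A single merged entity built from overlapping reports."""
    start: int
    residues: str
    sources: list
    conflicts: list  # (position, {source: residue}) for each disagreement


def merge_reports(reports):
    """Merge overlapping reports, recording disagreements instead of hiding them."""
    start = min(r.start for r in reports)
    end = max(r.start + len(r.residues) for r in reports)
    merged, conflicts = [], []
    for pos in range(start, end):
        # Collect what each report says about this position, if it covers it.
        seen = {
            r.source: r.residues[pos - r.start]
            for r in reports
            if r.start <= pos < r.start + len(r.residues)
        }
        distinct = set(seen.values())
        if not seen:
            merged.append("-")  # position covered by no report
        elif len(distinct) == 1:
            merged.append(distinct.pop())
        else:
            merged.append("X")  # assumed placeholder for an unresolved residue
            conflicts.append((pos, seen))
    return Composite(start, "".join(merged), [r.source for r in reports], conflicts)


# Two made-up reports that overlap and disagree at one position.
a = Report("report A", 0, "MKTAYIAK")
b = Report("report B", 4, "YIGKQR")
composite = merge_reports([a, b])
print(composite.residues)   # MKTAYIXKQR
print(composite.conflicts)  # [(6, {'report A': 'A', 'report B': 'G'})]
```

Even in this toy case the merge cannot settle the disagreement at position 6 on
its own; someone must decide which report is more definitive, and that is the
organizational question taken up below.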
The NCBI's GenInfo `backbone' database has taken on the role of archiving
individual reports of sequence and sequence-related data. There is no question
that this activity is essential. It is difficult to realize any model if it is
not built on a sound foundation; the archival data provide this foundation. My
understanding is that the Los Alamos group is fostering a more dynamic model
where scientists update their own contributions continually. Again I believe
that these efforts are valuable and fully support them. The essential question,
however, is whether or not these activities will be sufficient. Should these
data be correlated and consolidated into a nonredundant reflection of current
biological understanding? If so, will the GenBank project be part of these
developments or is it envisioned as simply an extension of the GenInfo
backbone, perhaps incorporating more information but essentially collecting
independent reports? If the latter is the case, then I have no argument with
Principles 3 and 4 as stated.
On the other hand, if a biologically nonredundant data collection is
envisioned, the project takes on a fundamentally different nature and the
various roles played by contributing scientists versus database `scientific'
staff must be reassessed. Try to imagine tens of thousands of scientists
modifying each other's data as `new' information is uncovered. The results of
experimental biology are notoriously ambiguous and error-prone. Experiments
typically do not directly measure the properties (objects) of interest; rather,
these are inferred from a number of lines of reasoning. As such, decisions
concerning which pieces of evidence are more fundamental and definitive are
not always clear cut.
The problem of managing a project of this type is not one of computer science
and certainly not one of biology; it is a problem in `organizational behavior.'
Organizational behavior, that is, the theory of organizing the interactions of
individuals involved in medium- to large-scale enterprises, is an emerging
field. While there is little agreement on appropriate models for such
organizations, two principles are widely accepted: 1) the most promising
enterprises often fail miserably because of faulty organizational structures;
and 2) the greater the level of interaction among various parties, the more
critical the organizational structure is to the success of the enterprise. In
the annotated and classified section of the current PIR-International database
nearly 25% of the sequence entries contain contributions from more than one
independent report. This is certainly a significant underestimate of the degree
of overlap and the situation with the nucleic acid sequence data will not prove
to be significantly different. Moreover, as more is learned about these
macromolecules the amount of overlap will increase dramatically. Organization
of the independent contributions from large numbers of biological scientists in
a project of this type is a nontrivial problem.
With this in mind, I would like to restate Principle 4: THE PRIMARY ROLE OF THE
DATABASE STAFF IS TO PROVIDE THE ORGANIZATIONAL STRUCTURE OF THE DATABASE. This
includes the physical organization of the data themselves as originally stated
but also includes providing a well thought out plan concerning how contributing
scientists will interact with the database and for developing procedures for
ensuring that these efforts are effectively coordinated.
Successful scientific databases have generally viewed the role of the database
staff as analogous to that of editors of a review article, i.e., the database
staff serve as arbitrators and ensure that the data are represented in a
consistent manner. Note that this does not necessarily imply that the database
staff directly review the data themselves; however, it does imply that they are
actively involved in the process.