dbEST AS A RESOURCE FOR GENE DISCOVERY
The number of public cDNA sequences ("Expressed Sequence Tags" or
ESTs) recently exceeded the 50,000 mark* and it was of interest to assess
the usefulness of this resource for gene discovery. We therefore compiled a
list of 32 human disease genes that had been cloned as of August 1994 by
either the positional cloning or positional candidate methods (1) and
performed sequence homology searching (2) against dbEST, the database
of expressed sequence tags (3). Thirty-eight percent of these human genes
had exact and often multiple matches in dbEST and an additional 47%
were represented by homologs in other organisms.** Only five human
disease genes had no convincing matches with ESTs. Thus for 85% of the
human disease genes positionally cloned to date, there is a homologous
partial cDNA sequence in the public domain.
These results underscore the utility of "single pass," tag/survey
cDNA sequencing (4) and demonstrate that much valuable information is
already present in the public databases if one knows how to find it (2).
These results also underscore the value of "model organisms" for
accelerating progress in the identification of human genes by homology -
an explicit goal of the U.S. Genome Program (5). If one is searching for
exons in human genomic DNA, a statistically significant match to a
cDNA, whether it be from humans, nematodes, rice, maize or yeast, is the
best proof (apart from an experiment) that an exon has been found.
dbEST may be searched using the BLAST (2) e-mail or network
services and full reports on individual ESTs may be obtained via NCBI's
retrieve e-mail server (6). The capability of retrieving ESTs based on their
chromosome assignment and map location has recently been
implemented. Instructions on submitting new sequence and mapping
data are available (6). World Wide Web access is also provided at
http://www.ncbi.nlm.nih.gov/. An NCSA Mosaic interface (7) allows
complex (Boolean) queries of dbEST to be performed.
Mark S. Boguski, Carolyn M. Tolstoshev
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
8600 Rockville Pike,
Bethesda, MD 20894, USA
Douglas E. Bassett, Jr.
Johns Hopkins University
School of Medicine,
725 North Wolfe Street,
Baltimore, MD 21205, USA
*dbEST release 2.27 contained 50,214 DNA sequences from 22 different
organisms. Information on the current release is available via the
World Wide Web at http://www.ncbi.nlm.nih.gov/dbEST/index.html.
**A detailed summary of these homologies with dbEST sequences is
available in PostScript, GIF and HTML formats on the dbEST Home Page
at the URL specified above. We thank Dan Jacobson for instructing us on
how to provide the HTML links to OMIM entries (McKusick, V. Online
Mendelian Inheritance in Man. The Johns Hopkins University,
References and Notes
1. A. Ballabio, Nature Genet. 3, 277-279 (1993).
2. S. F. Altschul, M. S. Boguski, W. Gish, J. C. Wootton, Nature Genet.
6, 119-129 (1994). The TBLASTN program is essential for EST homology
searching. TBLASTN takes a protein query sequence and compares it
against conceptual translations of DNA sequences in all six reading
frames. This is much more sensitive than nucleotide vs. nucleotide
comparisons for detecting more distant, cross-phylum relationships (D.J.
States, S.F. Altschul, Methods 3, 66-70 (1991)). Indeed most of the
homologs representing inexact matches would not have been detected by
searching GenBank for nucleotide sequence similarities alone.
3. M. S. Boguski, T. M. J. Lowe, C. M. Tolstoshev, Nature Genet. 4,
332-333 (1993). Although all dbEST sequences are also present in the EST
Division of GenBank (D. Benson, D.J. Lipman, J. Ostell, Nucl. Acids Res.
21, 2963-2965 (1993)), dbEST contains additional value-added annotation
such as the latest homologies, mapping data and contact information for
obtaining physical DNA clones. Note that in addition to cDNA data, dbEST
contains some genomic sequences that have been isolated by exon
"trapping" or "amplification" (e.g. A.J. Buckler, et al. Proc. Natl. Acad.
Sci. USA 88, 4005-4009 (1991)).
4. M. D. Adams, et al., Science 252, 1651-1656 (1991); A. S. Khan, et al.,
Nature Genet. 2, 180-185 (1992); K. Okubo, et al., Nature Genet. 2, 173-179
(1992); R. Waterston, et al., Nature Genet. 1, 114-123 (1992).
5. F. Collins, D. Galas, Science 262, 43-46 (1993).
6. The e-mail address for BLAST is blast@ncbi.nlm.nih.gov and the
address for database records is retrieve@ncbi.nlm.nih.gov. To receive
documentation, send a message containing the word 'help' (unquoted) in
the body of the message. For specific information on dbEST, place the
instruction 'datalib dbest' (unquoted) on a line preceding 'help'. For
information on the BLAST network service, send e-mail to
blast-help@ncbi.nlm.nih.gov. For information on submitting data, send
e-mail to info@ncbi.nlm.nih.gov. For other questions, telephone
301-496-2475 and ask for the service desk.
7. B.R. Schatz, J.B. Hardin, Science 265, 895-901 (1994).
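
Note 2 states that TBLASTN compares a protein query against conceptual
translations of DNA sequences in all six reading frames. The sketch below
illustrates only that translation step, assuming the standard genetic code.
It is not the BLAST implementation and performs no alignment or scoring;
the function names (reverse_complement, translate, six_frame_translations)
are hypothetical and chosen for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"   # TTT .. TGG
         "LLLLPPPPHHQQRRRR"   # CTT .. CGG
         "IIIMTTTTNNKKSSRR"   # ATT .. AGG
         "VVVVAAAADDEEGGGG")  # GTT .. GGG
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Translation table used to complement each base.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons give 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to conceptual translations."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Calling six_frame_translations on a single-pass EST read yields six
candidate peptide strings. A protein query would then be compared against
each of them, which is why a match can be found even when the reading
frame and strand of the EST are unknown.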