################################################################
Finding all GenBank entries for a given human chromosome.
################################################################
########
Summary:
########
We have found that it is necessary to search both GDB and GenBank,
because each search finds many entries that the other misses. For human
chromosome 22, a search of GenBank annotations (described below) finds
279 entries, and a search of GDB (described below) finds 301 entries.
Only 147 entries are in the overlap (intersection), and 286 are not,
making a total of 433 unique entries.
301 GDB
279 GenBank
147 Overlap (intersection)
286 Not in overlap (132 in GenBank not in GDB, 154 in GDB not in GenBank)
433 Total human chromosome 22 (union of GDB and GenBank searches)
The results for all chromosomes are included at the end of this document.
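
The arithmetic behind the chromosome 22 figures can be checked with a few
lines of Python. This is only a sketch of the bookkeeping; the variable
names are ours, and the counts are the ones quoted in the summary.

```python
# Counts reported in the summary for human chromosome 22.
genbank = 279   # entries found by the GenBank annotation search
gdb = 301       # entries found by the GDB search
overlap = 147   # entries found by both searches (intersection)

genbank_only = genbank - overlap           # in GenBank search, not in GDB
gdb_only = gdb - overlap                   # in GDB search, not in GenBank
not_in_overlap = genbank_only + gdb_only   # found by exactly one search
total = genbank + gdb - overlap            # size of the union

print(genbank_only, gdb_only, not_in_overlap, total)
```

The same relations (union = GenBank + GDB - intersection) hold for every
chromosome in the tables at the end of this document.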
###########
Discussion:
###########
Although we don't claim 100% accuracy, and expect some false positives and
negatives, the results are still striking. For most chromosomes, the
overlap between the GenBank annotations as found with DNA WorkBench
and the GDB mapping information is less than 50%. Similarly, a significant
fraction of the entries is missed by relying on only one or the other method.
See the complete results below.
Assuming our results are accurate, then clearly there is room for better
integration between GDB and GenBank chromosome location information.
Our methods may be of some use to the database maintainers
as a tool to enhance the integration.
We are now providing this search as a command "chromosome" in DNA WorkBench.
The compute-intensive nature of the search led us to precompute
the chromosome information for humans and store it in a fast-access index.
Thus a query completes in seconds. The index is updated regularly by an
automated process.
###################
Details of searches
###################
#######
GenBank
#######
To search for the GenBank annotations, we use the program DNA WorkBench
with the following commands:
(DNA WorkBench is available as an Internet service, and as software for
Unix and Mac (and PC) by anonymous ftp at
cbil.humgen.upenn.edu:/pub/dnaworkbench)
#
# DNA WorkBench script to get list of human genes on chromosome 22 in GenBank
#
database genbank
text "chromosome 22\D" gball gbnew
text \/map="22\D gball gbnew
text \/chromosome="22\D gball gbnew
# the next one works okay for human chromosome 22, but may generate some
# false positives for other chromosomes
text \b22[pq][^a-z] gball gbnew
union 1 2 3 4
organism sapiens
intersection 5 6
This search (plus the one on GDB) may now be executed simply as
#
# DNA WorkBench script for human chromosomes using precomputed index
# (N.B. need most recent version of software, Unix available 2/9/94, Mac soon)
#
chromosome 22
In general, this seems to work quite well. However, we have found at least
one GenBank entry that is on chromosome 22 and is not found by this search
(nor by the GDB search below); there may well be others. We do not seem
to be getting any false positives for chromosome 22, but the same search
applied to another chromosome may produce some. The number of human
GenBank entries (about 50000) precludes our checking all results for all
chromosomes. We are eager to learn of such cases, so that we can improve
our search procedure.
The program DNA WorkBench does a complete search (in parallel) of all the
GenBank entries. The text being searched for can
be specified as a "regular expression". The "wildcards" used here are:
    \/      to represent /
    \D      to represent a non-digit
    [pq]    to represent p or q
    [^a-z]  to represent a non-letter (the search is case insensitive)
Thus, the DNA WorkBench search of GenBank is a search for GenBank
annotations specifying the chromosome location.
(See the "help regular" command in DNA WorkBench for details.)
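
For readers without access to DNA WorkBench, the four text patterns can be
approximated with Python's "re" module. This is a sketch under assumptions:
we assume the DNA WorkBench dialect treats \D, \b, and the bracket
expressions as Python does, the function name is hypothetical, and the
sample annotation lines are made up for illustration.

```python
import re

# Python approximations of the four search patterns. The DNA WorkBench
# dialect may differ in details; see "help regular" for the real rules.
PATTERNS = [
    r'chromosome 22\D',
    r'/map="22\D',
    r'/chromosome="22\D',
    r'\b22[pq][^a-z]',
]
# IGNORECASE mirrors the case-insensitive search, so [^a-z] also
# excludes upper-case letters.
COMPILED = [re.compile(p, re.IGNORECASE) for p in PATTERNS]

def mentions_chromosome_22(annotation):
    """Return True if any of the four patterns matches the text."""
    return any(rx.search(annotation) for rx in COMPILED)

# Made-up annotation lines, for illustration only.
samples = [
    '/map="22q11.2"',
    '/chromosome="22"',
    'located on chromosome 22, band q12',
    '/map="2q22"',
    '/chromosome="220"',
]
for text in samples:
    print(text, mentions_chromosome_22(text))
```

The trailing \D in the first three patterns is what keeps a hypothetical
value such as "220" from matching a search for chromosome 22.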
#######
GDB
#######
To search GDB, we use the following Sybase command (by Barbara Eckman):
$sql_cmd= "select distinct locus_symbol, genbank_ref
from locus, object_genbank_eref, locus_cyto_location
where locus.locus_id = locus_cyto_location.locus_id and
locus.locus_id = object_genbank_eref.object_id and
object_class_key = 1 and
loc_cyto_chrom_num = '22'
UNION
select distinct locus_symbol, genbank_ref
from probe, probe_locus_iref, locus, object_genbank_eref,
locus_cyto_location
where probe.probe_id = probe_locus_iref.probe_id and
probe_locus_iref.locus_id = locus.locus_id and
locus.locus_id = locus_cyto_location.locus_id and
probe.probe_id = object_genbank_eref.object_id and
object_class_key = 2 and
loc_cyto_chrom_num = '22'
order by locus_symbol desc";
We then read in the list of accession numbers obtained from this search,
and compare it with the results of the GenBank search:
#
# DNA WorkBench script continued
# N.B. only latest version supports "gdb" command
#
gdb chromosome 22
union 7 8
#this prints out a list of one-line headers:
headers 1-$
#this writes the list of one-line headers to a file:
headers 1-$ write chromo_22_headers
#see also "help searches" for writing list of accession numbers
(Again, the new "chromosome 22" command does all this much faster.)
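
The comparison of the two accession lists amounts to ordinary set
arithmetic. The following Python sketch shows the idea outside of
DNA WorkBench; the function name is ours, and the accession numbers are
placeholders, not real results of either search.

```python
def compare_accessions(genbank_hits, gdb_hits):
    """Compare two lists of accession numbers using set operations."""
    a, b = set(genbank_hits), set(gdb_hits)
    return {
        "union": a | b,          # found by either search
        "intersect": a & b,      # found by both searches
        "genbank_only": a - b,   # found in GenBank search, not in GDB
        "gdb_only": b - a,       # found in GDB search, not in GenBank
    }

# Placeholder accession numbers, for illustration only.
genbank_hits = ["X00001", "X00002", "X00003"]
gdb_hits = ["X00002", "X00003", "X00004", "X00005"]

result = compare_accessions(genbank_hits, gdb_hits)
for name, members in result.items():
    print(name, len(members), sorted(members))
```

Using sets also removes duplicate accession numbers, which matters if the
same entry is reported more than once by a search.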
########################################################
Here are the results for all human chromosomes.
The first three lines are the result of running the "chromosome" command
(e.g. "chromosome 22").
The third (union) line shows the total GenBank entries for the
chromosome as found by GDB and GenBank (as per DNA WorkBench) searches.
The fourth (intersection) line shows the size of the overlap.
The fifth line shows GenBank (DNA WorkBench) results that are not in GDB;
the sixth line shows GDB results that are not in GenBank (DNA WorkBench).
Size Number Command
_____________________________________
986 1 chromosome 1 (in genbank)
1356 2 chromosome 1 (in gdb)
1688 3 union $ $-1
654 4 intersect 1 2
332 5 difference 1 2
702 6 difference 2 1
Size Number Command
_____________________________________
789 1 chromosome 2 (in genbank)
909 2 chromosome 2 (in gdb)
1221 3 union $ $-1
477 4 intersect 1 2
312 5 difference 1 2
432 6 difference 2 1
Size Number Command
_____________________________________
435 1 chromosome 3 (in genbank)
581 2 chromosome 3 (in gdb)
713 3 union $ $-1
303 4 intersect 1 2
132 5 difference 1 2
278 6 difference 2 1
Size Number Command
_____________________________________
1376 1 chromosome 4 (in genbank)
1366 2 chromosome 4 (in gdb)
1625 3 union $ $-1
1117 4 intersect 1 2
259 5 difference 1 2
249 6 difference 2 1
Size Number Command
_____________________________________
409 1 chromosome 5 (in genbank)
478 2 chromosome 5 (in gdb)
628 3 union $ $-1
259 4 intersect 1 2
150 5 difference 1 2
219 6 difference 2 1
Size Number Command
_____________________________________
1015 1 chromosome 6 (in genbank)
1058 2 chromosome 6 (in gdb)
1507 3 union $ $-1
566 4 intersect 1 2
449 5 difference 1 2
492 6 difference 2 1
Size Number Command
_____________________________________
1024 1 chromosome 7 (in genbank)
1115 2 chromosome 7 (in gdb)
1455 3 union $ $-1
684 4 intersect 1 2
340 5 difference 1 2
431 6 difference 2 1
Size Number Command
_____________________________________
396 1 chromosome 8 (in genbank)
495 2 chromosome 8 (in gdb)
583 3 union $ $-1
308 4 intersect 1 2
88 5 difference 1 2
187 6 difference 2 1
Size Number Command
_____________________________________
483 1 chromosome 9 (in genbank)
419 2 chromosome 9 (in gdb)
698 3 union $ $-1
204 4 intersect 1 2
279 5 difference 1 2
215 6 difference 2 1
Size Number Command
_____________________________________
401 1 chromosome 10 (in genbank)
522 2 chromosome 10 (in gdb)
629 3 union $ $-1
294 4 intersect 1 2
107 5 difference 1 2
228 6 difference 2 1
Size Number Command
_____________________________________
598 1 chromosome 11 (in genbank)
699 2 chromosome 11 (in gdb)
963 3 union $ $-1
334 4 intersect 1 2
264 5 difference 1 2
365 6 difference 2 1
Size Number Command
_____________________________________
410 1 chromosome 12 (in genbank)
584 2 chromosome 12 (in gdb)
689 3 union $ $-1
305 4 intersect 1 2
105 5 difference 1 2
279 6 difference 2 1
Size Number Command
_____________________________________
270 1 chromosome 13 (in genbank)
271 2 chromosome 13 (in gdb)
356 3 union $ $-1
185 4 intersect 1 2
85 5 difference 1 2
86 6 difference 2 1
Size Number Command
_____________________________________
1018 1 chromosome 14 (in genbank)
1060 2 chromosome 14 (in gdb)
1545 3 union $ $-1
533 4 intersect 1 2
485 5 difference 1 2
527 6 difference 2 1
Size Number Command
_____________________________________
205 1 chromosome 15 (in genbank)
263 2 chromosome 15 (in gdb)
345 3 union $ $-1
123 4 intersect 1 2
82 5 difference 1 2
140 6 difference 2 1
Size Number Command
_____________________________________
250 1 chromosome 16 (in genbank)
317 2 chromosome 16 (in gdb)
408 3 union $ $-1
159 4 intersect 1 2
91 5 difference 1 2
158 6 difference 2 1
Size Number Command
_____________________________________
608 1 chromosome 17 (in genbank)
541 2 chromosome 17 (in gdb)
876 3 union $ $-1
273 4 intersect 1 2
335 5 difference 1 2
268 6 difference 2 1
Size Number Command
_____________________________________
198 1 chromosome 18 (in genbank)
192 2 chromosome 18 (in gdb)
252 3 union $ $-1
138 4 intersect 1 2
60 5 difference 1 2
54 6 difference 2 1
Size Number Command
_____________________________________
497 1 chromosome 19 (in genbank)
630 2 chromosome 19 (in gdb)
879 3 union $ $-1
248 4 intersect 1 2
249 5 difference 1 2
382 6 difference 2 1
Size Number Command
_____________________________________
178 1 chromosome 20 (in genbank)
273 2 chromosome 20 (in gdb)
306 3 union $ $-1
145 4 intersect 1 2
33 5 difference 1 2
128 6 difference 2 1
Size Number Command
_____________________________________
485 1 chromosome 21 (in genbank)
365 2 chromosome 21 (in gdb)
558 3 union $ $-1
292 4 intersect 1 2
193 5 difference 1 2
73 6 difference 2 1
Size Number Command
_____________________________________
279 1 chromosome 22 (in genbank)
301 2 chromosome 22 (in gdb)
433 3 union $ $-1
147 4 intersect 1 2
132 5 difference 1 2
154 6 difference 2 1
Size Number Command
_____________________________________
1030 1 chromosome X (in genbank)
860 2 chromosome X (in gdb)
1362 3 union $ $-1
528 4 intersect 1 2
502 5 difference 1 2
332 6 difference 2 1
Size Number Command
_____________________________________
68 1 chromosome Y (in genbank)
35 2 chromosome Y (in gdb)
89 3 union $ $-1
14 4 intersect 1 2
54 5 difference 1 2
21 6 difference 2 1
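
The overlap for each chromosome can be expressed as a fraction of the
union. The Python sketch below does this for the counts in the tables
above. Measuring overlap as intersection divided by union is our
assumption here; other denominators (for example, the GenBank count
alone) would give different percentages.

```python
# (GenBank hits, GDB hits, intersection) copied from the tables above.
COUNTS = {
    "1": (986, 1356, 654),
    "2": (789, 909, 477),
    "3": (435, 581, 303),
    "4": (1376, 1366, 1117),
    "5": (409, 478, 259),
    "6": (1015, 1058, 566),
    "7": (1024, 1115, 684),
    "8": (396, 495, 308),
    "9": (483, 419, 204),
    "10": (401, 522, 294),
    "11": (598, 699, 334),
    "12": (410, 584, 305),
    "13": (270, 271, 185),
    "14": (1018, 1060, 533),
    "15": (205, 263, 123),
    "16": (250, 317, 159),
    "17": (608, 541, 273),
    "18": (198, 192, 138),
    "19": (497, 630, 248),
    "20": (178, 273, 145),
    "21": (485, 365, 292),
    "22": (279, 301, 147),
    "X": (1030, 860, 528),
    "Y": (68, 35, 14),
}

def overlap_fraction(genbank, gdb, both):
    """Intersection divided by union (an assumed definition of overlap)."""
    union = genbank + gdb - both
    return both / union

for chrom, row in COUNTS.items():
    print(f"chromosome {chrom:>2}: {100 * overlap_fraction(*row):5.1f}% overlap")

below_half = [c for c, row in COUNTS.items() if overlap_fraction(*row) < 0.5]
print(len(below_half), "of", len(COUNTS), "chromosomes are below 50%")
```

Under this definition, only chromosomes 4, 8, 13, 18 and 21 show an
overlap above 50%.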
======================================================================
James Tisdall
Departments of Genetics and Computer and Information Science
Computational Biology and Informatics Laboratory
Human Genome Project for Chromosome 22,
University of Pennsylvania and Children's Hospital of Philadelphia
tisdall at cbil.humgen.upenn.edu 215-573-3113
======================================================================