GenBank entries by chromosome

James Tisdall tisdall at amalthea.humgen.upenn.edu
Wed Feb 9 22:39:29 EST 1994


################################################################
   Finding all GenBank entries for a given human chromosome.
################################################################

########
Summary:
########

We've found that it is necessary to search both GDB and GenBank,
since the two searches return substantially different sets of entries.
For human chromosome 22, we find 279 entries with a search of GenBank
annotations and 301 entries with a search of GDB (both searches are
described below).  Only 147 entries are in the overlap (intersection),
and 286 are found by just one of the two searches, making a total of
433 unique entries.

301	GDB
279	GenBank
147	Overlap (intersection)
286	Not in overlap  (132 in GenBank not in GDB, 154 in GDB not in GenBank)
433	Total human chromosome 22 (union of GDB and GenBank searches)
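
The totals above follow from simple counting (inclusion-exclusion).
As a quick check, here is a small Python sketch that uses only the
chromosome 22 counts reported above; the variable names are ours and
are purely illustrative.

```python
# Counts reported above for human chromosome 22.
gdb = 301
genbank = 279
overlap = 147  # entries found by both searches

# Entries found by exactly one of the two searches.
genbank_only = genbank - overlap
gdb_only = gdb - overlap
not_in_overlap = genbank_only + gdb_only

# Size of the union, by inclusion-exclusion.
total = gdb + genbank - overlap

print(genbank_only, gdb_only, not_in_overlap, total)
```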

The results for all chromosomes are included at the end of this document.


###########
Discussion:
###########

Although we don't claim 100% accuracy, and expect some false positives and
false negatives, the results are still striking.  For most chromosomes, the
overlap between the GenBank annotations as found with DNA WorkBench and the
GDB mapping information is less than 50%.  Relying on only one or the other
method therefore misses a significant fraction of the entries.
See the complete results below.

Assuming our results are accurate, then clearly there is room for better
integration between GDB and GenBank chromosome location information.
Perhaps our methods may be of some use to the database maintainers,
as a tool to enhance the integration.

We are now providing this search as a command "chromosome" in DNA WorkBench.
The compute-intensive nature of the search led us to precompute
the chromosome information for humans and store it in a fast-access index.
Thus a query completes in seconds.  The index is updated regularly
and automatically.
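
To illustrate the general idea of a precomputed index (this is not the
DNA WorkBench implementation, whose index format is not described here),
the Python sketch below groups the results of a slow search by chromosome
once, so that a later query is a single lookup.  The accession identifiers,
the function names build_index and chromosome_query, and the in-memory
dictionary are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical (accession, chromosome) pairs produced by a slow full search.
# The identifiers are placeholders, not real GenBank accession numbers.
slow_search_results = [
    ("ACC0001", "22"),
    ("ACC0002", "22"),
    ("ACC0003", "X"),
]


def build_index(pairs):
    """Group accessions by chromosome once, ahead of query time."""
    index = defaultdict(set)
    for accession, chromosome in pairs:
        index[chromosome].add(accession)
    return dict(index)


def chromosome_query(index, chromosome):
    """Answer a query with a single dictionary lookup."""
    return sorted(index.get(chromosome, ()))


index = build_index(slow_search_results)
print(chromosome_query(index, "22"))
```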

###################
Details of searches
###################

  #######
  GenBank
  #######
To search for the GenBank annotations, we use the program DNA WorkBench
with the following commands:
(DNA WorkBench is available as an internet service, and as software for
 Unix and Mac (and PC) by anonymous ftp from
 cbil.humgen.upenn.edu:/pub/dnaworkbench)

#
# DNA WorkBench script to get list of human genes on chromosome 22 in GenBank
#
database genbank
text "chromosome 22\D" gball gbnew
text \/map="22\D gball gbnew
text \/chromosome="22\D gball gbnew
# the next one works okay for human chromosome 22, but may generate some
# false positives for other chromosomes 
text \b22[pq][^a-z] gball gbnew
union 1 2 3 4
organism sapiens
intersection 5 6

This search (plus the one on GDB) may now be executed simply as:
#
# DNA WorkBench script for human chromosomes using precomputed index
# (N.B. need most recent version of software, Unix available 2/9/94, Mac soon)
#
chromosome 22


In general, this seems to work quite well; however, we have found at least
one GenBank entry that is on chromosome 22 and not found by this search
(nor by the GDB search below); there may well be others.  We do not seem
to be getting any false positives for chromosome 22, but the same search
applied to another chromosome may produce some.  The number of human
GenBank entries (about 50,000) precludes our checking all results for all
chromosomes.  We are eager to learn of such cases, so as to improve our
search procedure.

The program DNA WorkBench does a complete search (in parallel) of all the
GenBank entries.  The text being searched for can
be specified as a "regular expression".  The "wildcards" used here are 

\/ to represent /
\D to represent a non-digit
[pq] to represent p or q
[^a-z] to represent a non-letter.  (Case insensitive).

Thus, the DNA WorkBench search of GenBank is a search for GenBank
annotations specifying the chromosome location.
(See the "help regular" command in DNA WorkBench for details.)
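
For readers without DNA WorkBench, the four patterns can be approximated
with Python's re module, as in the sketch below.  We assume that Python's
\D, \b and character classes behave like the DNA WorkBench "wildcards"
described above; the two dialects may differ in detail.  The function name
mentions_chromosome_22 and the sample annotation strings are made up for
illustration.

```python
import re

# Python translations of the four DNA WorkBench patterns for chromosome 22.
# Assumption: Python's regular expression syntax is close enough to the
# DNA WorkBench dialect for these patterns to mean the same thing.
PATTERNS = [
    r'chromosome 22\D',
    r'/map="22\D',
    r'/chromosome="22\D',
    r'\b22[pq][^a-z]',
]
# The DNA WorkBench search is case insensitive, hence re.IGNORECASE.
COMPILED = [re.compile(p, re.IGNORECASE) for p in PATTERNS]


def mentions_chromosome_22(annotation):
    """Return True if any of the four patterns occurs in the text."""
    return any(rx.search(annotation) for rx in COMPILED)


# Made-up annotation strings, for illustration only.
for text in ['/map="22q11.2"', '/chromosome="22"',
             '/map="2p13"', '/chromosome="222"']:
    print(text, mentions_chromosome_22(text))
```

The first two sample strings match and the last two do not: "2p13" never
contains "22", and in "222" the text "22" is followed by a digit.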


  #######
    GDB
  #######
To search GDB, we use the following Sybase command (by Barbara Eckman):

    $sql_cmd= "select distinct locus_symbol, genbank_ref
	from locus, object_genbank_eref, locus_cyto_location
	where locus.locus_id = locus_cyto_location.locus_id and
	locus.locus_id = object_genbank_eref.object_id and
	object_class_key = 1 and
	loc_cyto_chrom_num = '22'
    UNION
	select distinct locus_symbol, genbank_ref
	from probe, probe_locus_iref, locus, object_genbank_eref,
	locus_cyto_location
	where probe.probe_id = probe_locus_iref.probe_id and
	probe_locus_iref.locus_id = locus.locus_id and
	locus.locus_id = locus_cyto_location.locus_id and
	probe.probe_id = object_genbank_eref.object_id and
	object_class_key = 2 and
	loc_cyto_chrom_num = '22'
     order by locus_symbol desc";

We then read in the list of accession numbers obtained from this search,
and compare it with the results of the GenBank search:
#
# DNA WorkBench script continued
#  N.B. only latest version supports "gdb" command
#
gdb chromosome 22
union 7 8
#this prints out a list of one-line headers:
headers 1-$
#this writes the list of one-line headers to a file:
headers 1-$ write chromo_22_headers
#see also "help searches" for writing list of accession numbers

(Again, the new "chromosome 22" command does all this much faster.)
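
The comparison itself is ordinary set arithmetic on accession numbers.
The Python sketch below shows the four operations reported in the results
tables (union, intersect, and the two differences).  The accession
identifiers are placeholders invented for the example, not real GenBank
accession numbers.

```python
# Hypothetical accession lists standing in for the two search results.
genbank_hits = {"ACC0001", "ACC0002", "ACC0003", "ACC0004"}
gdb_hits = {"ACC0003", "ACC0004", "ACC0005"}

union = genbank_hits | gdb_hits          # found by either search
intersection = genbank_hits & gdb_hits   # found by both searches
genbank_only = genbank_hits - gdb_hits   # in GenBank search, not in GDB
gdb_only = gdb_hits - genbank_hits       # in GDB search, not in GenBank

for name, result in [("union", union),
                     ("intersect", intersection),
                     ("genbank only", genbank_only),
                     ("gdb only", gdb_only)]:
    print(f"{len(result):7d}  {name}  {sorted(result)}")
```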


########################################################
Here are the results for all human chromosomes.

The first three lines of each table are the result of running the
 "chromosome" command (e.g. "chromosome 22"): the GenBank (DNA WorkBench)
 result, the GDB result, and their union.
The third (union) line shows the total GenBank entries for the
 chromosome as found by GDB and GenBank (as per DNA WorkBench) searches.
The fourth (intersection) line shows the size of the overlap.
The fifth line shows GenBank (DNA WorkBench) results that are not in GDB;
the sixth line shows GDB results that are not in GenBank (DNA WorkBench).
           
   Size  Number  Command
_____________________________________
    986       1  chromosome 1 (in genbank)
   1356       2  chromosome 1 (in gdb)
   1688       3  union  $ $-1
    654       4  intersect  1 2
    332       5  difference  1 2
    702       6  difference  2 1
           
   Size  Number  Command
_____________________________________
    789       1  chromosome 2 (in genbank)
    909       2  chromosome 2 (in gdb)
   1221       3  union  $ $-1
    477       4  intersect  1 2
    312       5  difference  1 2
    432       6  difference  2 1
           
   Size  Number  Command
_____________________________________
    435       1  chromosome 3 (in genbank)
    581       2  chromosome 3 (in gdb)
    713       3  union  $ $-1
    303       4  intersect  1 2
    132       5  difference  1 2
    278       6  difference  2 1
           
   Size  Number  Command
_____________________________________
   1376       1  chromosome 4 (in genbank)
   1366       2  chromosome 4 (in gdb)
   1625       3  union  $ $-1
   1117       4  intersect  1 2
    259       5  difference  1 2
    249       6  difference  2 1
           
   Size  Number  Command
_____________________________________
    409       1  chromosome 5 (in genbank)
    478       2  chromosome 5 (in gdb)
    628       3  union  $ $-1
    259       4  intersect  1 2
    150       5  difference  1 2
    219       6  difference  2 1
           
   Size  Number  Command
_____________________________________
   1015       1  chromosome 6 (in genbank)
   1058       2  chromosome 6 (in gdb)
   1507       3  union  $ $-1
    566       4  intersect  1 2
    449       5  difference  1 2
    492       6  difference  2 1
           
   Size  Number  Command
_____________________________________
   1024       1  chromosome 7 (in genbank)
   1115       2  chromosome 7 (in gdb)
   1455       3  union  $ $-1
    684       4  intersect  1 2
    340       5  difference  1 2
    431       6  difference  2 1
           
   Size  Number  Command
_____________________________________
    396       1  chromosome 8 (in genbank)
    495       2  chromosome 8 (in gdb)
    583       3  union  $ $-1
    308       4  intersect  1 2
     88       5  difference  1 2
    187       6  difference  2 1
           
   Size  Number  Command
_____________________________________
    483       1  chromosome 9 (in genbank)
    419       2  chromosome 9 (in gdb)
    698       3  union  $ $-1
    204       4  intersect  1 2
    279       5  difference  1 2
    215       6  difference  2 1
           
   Size  Number  Command
_____________________________________
    401       1  chromosome 10 (in genbank)
    522       2  chromosome 10 (in gdb)
    629       3  union  $ $-1
    294       4  intersect  1 2
    107       5  difference  1 2
    228       6  difference  2 1
           
   Size  Number  Command
_____________________________________
    598       1  chromosome 11 (in genbank)
    699       2  chromosome 11 (in gdb)
    963       3  union  $ $-1
    334       4  intersect  1 2
    264       5  difference  1 2
    365       6  difference  2 1
           
   Size  Number  Command
_____________________________________
    410       1  chromosome 12 (in genbank)
    584       2  chromosome 12 (in gdb)
    689       3  union  $ $-1
    305       4  intersect  1 2
    105       5  difference  1 2
    279       6  difference  2 1
           
   Size  Number  Command
_____________________________________
    270       1  chromosome 13 (in genbank)
    271       2  chromosome 13 (in gdb)
    356       3  union  $ $-1
    185       4  intersect  1 2
     85       5  difference  1 2
     86       6  difference  2 1
           
   Size  Number  Command
_____________________________________
   1018       1  chromosome 14 (in genbank)
   1060       2  chromosome 14 (in gdb)
   1545       3  union  $ $-1
    533       4  intersect  1 2
    485       5  difference  1 2
    527       6  difference  2 1
           
   Size  Number  Command
_____________________________________
    205       1  chromosome 15 (in genbank)
    263       2  chromosome 15 (in gdb)
    345       3  union  $ $-1
    123       4  intersect  1 2
     82       5  difference  1 2
    140       6  difference  2 1
           
   Size  Number  Command
_____________________________________
    250       1  chromosome 16 (in genbank)
    317       2  chromosome 16 (in gdb)
    408       3  union  $ $-1
    159       4  intersect  1 2
     91       5  difference  1 2
    158       6  difference  2 1
           
   Size  Number  Command
_____________________________________
    608       1  chromosome 17 (in genbank)
    541       2  chromosome 17 (in gdb)
    876       3  union  $ $-1
    273       4  intersect  1 2
    335       5  difference  1 2
    268       6  difference  2 1
           
   Size  Number  Command
_____________________________________
    198       1  chromosome 18 (in genbank)
    192       2  chromosome 18 (in gdb)
    252       3  union  $ $-1
    138       4  intersect  1 2
     60       5  difference  1 2
     54       6  difference  2 1
           
   Size  Number  Command
_____________________________________
    497       1  chromosome 19 (in genbank)
    630       2  chromosome 19 (in gdb)
    879       3  union  $ $-1
    248       4  intersect  1 2
    249       5  difference  1 2
    382       6  difference  2 1
           
   Size  Number  Command
_____________________________________
    178       1  chromosome 20 (in genbank)
    273       2  chromosome 20 (in gdb)
    306       3  union  $ $-1
    145       4  intersect  1 2
     33       5  difference  1 2
    128       6  difference  2 1
           
   Size  Number  Command
_____________________________________
    485       1  chromosome 21 (in genbank)
    365       2  chromosome 21 (in gdb)
    558       3  union  $ $-1
    292       4  intersect  1 2
    193       5  difference  1 2
     73       6  difference  2 1
           
   Size  Number  Command
_____________________________________
    279       1  chromosome 22 (in genbank)
    301       2  chromosome 22 (in gdb)
    433       3  union  $ $-1
    147       4  intersect  1 2
    132       5  difference  1 2
    154       6  difference  2 1
           
   Size  Number  Command
_____________________________________
   1030       1  chromosome X (in genbank)
    860       2  chromosome X (in gdb)
   1362       3  union  $ $-1
    528       4  intersect  1 2
    502       5  difference  1 2
    332       6  difference  2 1
           
   Size  Number  Command
_____________________________________
     68       1  chromosome Y (in genbank)
     35       2  chromosome Y (in gdb)
     89       3  union  $ $-1
     14       4  intersect  1 2
     54       5  difference  1 2
     21       6  difference  2 1
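
As a rough way to read these tables, the Python sketch below computes the
overlap as a fraction of the union for a few chromosomes, using sizes
copied from the tables above.  Measuring overlap as intersection divided
by union is our assumption for this example; other definitions (for
instance, relative to the GenBank or GDB result alone) give different
percentages.

```python
# (genbank, gdb, intersection) sizes copied from the tables above.
counts = {
    "1": (986, 1356, 654),
    "4": (1376, 1366, 1117),
    "22": (279, 301, 147),
    "Y": (68, 35, 14),
}


def overlap_fraction(genbank, gdb, both):
    """Intersection size divided by union size (an assumed definition)."""
    union = genbank + gdb - both
    return both / union


for chrom, (genbank, gdb, both) in counts.items():
    print(f"chromosome {chrom}: {overlap_fraction(genbank, gdb, both):.1%}")
```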


======================================================================
James Tisdall
Departments of Genetics and Computer and Information Science
Computational Biology and Informatics Laboratory
Human Genome Project for Chromosome 22,
University of Pennsylvania and Children's Hospital of Philadelphia

tisdall at cbil.humgen.upenn.edu    215-573-3113
======================================================================



