Hello,
On Monday I asked the net the following:
: Is there a file, in Gopherspace or wherever, that keeps the statistics
: of DNA and Protein Databases with regard to species? It would be interesting
: to know what percentage of the E. coli genome has been sequenced, what
: percent of Genbank is human DNA, etc.
I received two replies. The first was from Amos Bairoch (thanks!) regarding
SwissProt. He pointed out that the appendix to the SwissProt release notes
contains statistics on the database, including a table of the most
represented species.
Here is just a sample:
SwissProt database of Protein Sequences
A.2.2 Table of the most represented species
Number  Frequency  Species
     1       2454  Human
     2       2222  Escherichia coli
     3       1439  Mouse
     4       1339  Rat
     5       1220  Baker's yeast (Saccharomyces cerevisiae)
     6        634  Bovine
     7        560  Fruit fly (Drosophila melanogaster)
     8        477  Chicken
     9        454  Bacillus subtilis
    10        362  African clawed frog (Xenopus laevis)
    11        340  Salmonella typhimurium
    12        333  Rabbit
    13        298  Pig
    14        251  Vaccinia virus (strain Copenhagen)
    15        222  Maize
    16        193  Human cytomegalovirus (strain AD169)
    17        177  Arabidopsis thaliana (Mouse-ear cress)
              177  Rice
    19        176  Vaccinia virus (strain WR)
    20        167  Bacteriophage T4
    21        161  Pea
    22        159  Tobacco
              159  Wheat
    24        151  Pseudomonas aeruginosa
    25        142  Caenorhabditis elegans
    26        141  Fission yeast (Schizosaccharomyces pombe)
    27        133  Barley
    28        129  Staphylococcus aureus
    29        127  Spinach
    30        125  Soybean
    31        123  Sheep
    32        122  Slime mold (Dictyostelium discoideum)
    33        119  Marchantia polymorpha (Liverwort)
    34        118  Rhodobacter capsulatus
    35        115  Dog
    36        113  Pseudomonas putida
    37        110  Neurospora crassa
              110  Klebsiella pneumoniae
The second reply was from Dennis Benson of GenBank (thanks!), who told me that
each GenBank release includes a file (gbrel.txt) listing the number of entries
and bases for the top twenty organisms (excluding chloroplast and mitochondrial
sequences). Here is that table from release 78:
Entries     Bases  Species
  36990  28328775  Homo sapiens
  11115  10665461  Mus musculus
   4427   6634841  Rattus norvegicus
   2347   5371333  Saccharomyces cerevisiae
   2606   4571085  Escherichia coli
   2246   4391333  Drosophila melanogaster
   5123   4139634  Caenorhabditis elegans
   1710   2228362  Gallus gallus
   1392   1759777  Bos taurus
   2351   1639151  Arabidopsis thaliana
   3270   1503383  Human immunodeficiency virus type 1
   1021   1399704  Xenopus laevis
    972   1371412  Oryctolagus cuniculus
    519    970769  Bacillus subtilis
    771    907555  Influenza virus type A
   1254    873268  Plasmodium falciparum
   1522    864290  Oryza sativa
    525    859881  Zea mays
    354    689647  Schizosaccharomyces pombe
    509    685265  Sus scrofa
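
For anyone who wants rough percentages out of the release 78 numbers, here is a minimal Python sketch. Note the big caveat: the total size of the release is not quoted in this post, so the shares below are relative to the sum of the twenty listed organisms only, not to all of GenBank. The names parse_rows and shares are my own, not anything from the GenBank distribution.

```python
# Rows copied from the release 78 table above: entries, bases, species.
ROWS = """\
36990 28328775 Homo sapiens
11115 10665461 Mus musculus
4427 6634841 Rattus norvegicus
2347 5371333 Saccharomyces cerevisiae
2606 4571085 Escherichia coli
2246 4391333 Drosophila melanogaster
5123 4139634 Caenorhabditis elegans
1710 2228362 Gallus gallus
1392 1759777 Bos taurus
2351 1639151 Arabidopsis thaliana
3270 1503383 Human immunodeficiency virus type 1
1021 1399704 Xenopus laevis
972 1371412 Oryctolagus cuniculus
519 970769 Bacillus subtilis
771 907555 Influenza virus type A
1254 873268 Plasmodium falciparum
1522 864290 Oryza sativa
525 859881 Zea mays
354 689647 Schizosaccharomyces pombe
509 685265 Sus scrofa
"""


def parse_rows(text):
    """Return a list of (species, entries, bases) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split off the two numeric columns; the rest is the species name,
        # which may itself contain spaces.
        entries, bases, species = line.split(None, 2)
        rows.append((species.strip(), int(entries), int(bases)))
    return rows


def shares(rows):
    """Percentage of bases per species, relative to the listed rows only."""
    total = sum(bases for _, _, bases in rows)
    return {species: 100.0 * bases / total for species, _, bases in rows}


if __name__ == "__main__":
    for species, pct in shares(parse_rows(ROWS)).items():
        print(f"{pct:6.2f}%  {species}")
```

To get true percentages of the whole database, replace the computed total in shares with the overall base count given in gbrel.txt for the release in question.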