Robin Matlib, a graduate student at Washington University School of
Medicine, working with Dr. Gary Stormo, and David Kulp, working with Dr.
David Haussler, have begun to modify and train an algorithm to provide
an effective gene-finder for the green alga Chlamydomonas reinhardtii
from genomic sequence data. Robin has collected from public data a set
of curated genes, with both DNA and mRNA records from GenBank, that were
used to train the program. The training was validated by training with
subsets of the panel and testing with the genes left out of the training
set; that is, a jack-knife cross-validation was performed. The following
data are for predictions of an exon.
The sensitivity is 78%. This is the percentage of actual exons that the
program gets completely right. This value of 78% includes predictions
that may be wrong by a single nucleotide. The percentage of totally
missed exons is 12%.
The specificity is 83%. This is the percentage of predicted exons that
are actual exons. The algorithm overpredicted exons, so that 17% of the
predicted exons were not actual exons. These predictions may be wrong by
a single nucleotide. Only 5% of the exons were missed in their entirety.
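The exon-level figures above can be illustrated with a minimal sketch.
It assumes exons are given as (start, end) coordinate pairs and that a
prediction counts as right when both boundaries fall within a chosen
tolerance. The function name, the tolerance parameter, and the toy
coordinates are hypothetical and are not taken from the program itself.
Note that "specificity" is used here in the sense given above: the
fraction of predicted exons that are actual exons.

```python
def exon_level_stats(actual, predicted, tolerance=0):
    """Exon-level accuracy for lists of (start, end) coordinate pairs.

    Illustrative only: the matching rule (both boundaries within
    `tolerance` nucleotides) is an assumption, not the program's own rule.
    """
    if not actual or not predicted:
        raise ValueError("need at least one actual and one predicted exon")

    def matches(a, b):
        # Both boundaries agree to within the tolerance.
        return abs(a[0] - b[0]) <= tolerance and abs(a[1] - b[1]) <= tolerance

    def overlaps(a, b):
        # Closed intervals overlap when each starts before the other ends.
        return a[0] <= b[1] and b[0] <= a[1]

    found = sum(any(matches(a, p) for p in predicted) for a in actual)
    correct = sum(any(matches(p, a) for a in actual) for p in predicted)
    missed = sum(not any(overlaps(a, p) for p in predicted) for a in actual)
    wrong = sum(not any(overlaps(p, a) for a in actual) for p in predicted)

    return {
        "sensitivity": found / len(actual),       # actual exons predicted right
        "specificity": correct / len(predicted),  # predicted exons that are real
        "missed": missed / len(actual),           # actual exons with no overlap
        "wrong": wrong / len(predicted),          # predictions with no overlap
    }


# Toy coordinates, invented for illustration only.
actual_exons = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted_exons = [(100, 200), (300, 400), (501, 600), (900, 950)]

strict = exon_level_stats(actual_exons, predicted_exons)
relaxed = exon_level_stats(actual_exons, predicted_exons, tolerance=1)
```

With these toy inputs, the third prediction is off by one nucleotide, so
it counts as right only when a tolerance of 1 is allowed.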
When predictions over the entire gene are performed, the test statistics
are as follows.
The sensitivity is 44%. This is the percentage of actual genes that the
program gets completely right.
The specificity is 38%. This is the percentage of predicted genes that
are correct.
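The jack-knife validation described above can be sketched as follows.
This is a minimal illustration that assumes one gene is held out per
round (leave-one-out); the text does not say how many genes were held
out at a time, and the function name and gene labels are hypothetical.

```python
def jackknife_splits(genes):
    """Yield (training_set, held_out_gene) pairs, one per gene.

    Assumes leave-one-out: each gene is held out exactly once while
    the remaining genes form the training set.
    """
    genes = list(genes)
    for i, held_out in enumerate(genes):
        training = genes[:i] + genes[i + 1:]
        yield training, held_out


# Placeholder labels, not real gene names.
panel = ["gene1", "gene2", "gene3"]
splits = list(jackknife_splits(panel))
```

In each round the program would be trained on the training set and its
predictions scored against the held-out gene; the per-round scores are
then pooled into the overall statistics.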
Unfortunately, the UCSC computers that this program runs on are
overloaded and the file system is very slow. On top of that, Genie is not
particularly speedy either! Users should limit the sequence input size to
less than 100K.
Susan Dutcher (dutcher at genetics.wustl.edu)