Inverse document frequency (IDF) weights are used to calculate the relative
importance of indexing terms based on term distribution. When IDF weights are
applied to segmented, overlapping, fixed-length n-grams derived from genomic
libraries, the resulting retrieval system returns sequences substantially
faster than other widely used methods, with comparable accuracy when mutation
rates are moderate. Because the database is accessed through an index, retrieval
speed depends primarily on the size of the query and the number of sequences
actually retrieved rather than on the size of the database. [1]
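
As a rough illustration of the idea, the sketch below indexes overlapping
n-grams and scores a query by summing IDF weights. It is not the code from the
paper: the function names are hypothetical, and the weight log(N / df) is the
common textbook form of IDF, assumed here for illustration only.

```python
import math
from collections import Counter

def ngrams(seq, n):
    """Return the overlapping n-grams of length n in seq."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def idf_weights(sequences, n):
    """Map each n-gram to log(N / df); df counts sequences containing it."""
    df = Counter()
    for seq in sequences:
        df.update(set(ngrams(seq, n)))
    total = len(sequences)
    return {gram: math.log(total / count) for gram, count in df.items()}

def build_index(sequences, n):
    """Inverted index: n-gram -> list of ids of sequences containing it."""
    index = {}
    for seq_id, seq in enumerate(sequences):
        for gram in set(ngrams(seq, n)):
            index.setdefault(gram, []).append(seq_id)
    return index

def search(query, index, weights, n):
    """Score only sequences that share an n-gram with the query."""
    scores = Counter()
    for gram in set(ngrams(query, n)):
        for seq_id in index.get(gram, ()):
            scores[seq_id] += weights[gram]
    return scores.most_common()

# Toy library (made-up sequences, not GenBank data).
library = ["ACGTACGT", "ACGTTTTT", "GGGGCCCC"]
weights = idf_weights(library, 4)
index = build_index(library, 4)
results = search("CGTACG", index, weights, 4)
```

The lookup touches only the index entries for the query's n-grams, which is why
the cost follows the query size and the number of hits rather than the size of
the library.
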
Application of clustering techniques to large genomic libraries is difficult due
to the time required to compute the large number of required pairwise
similarities. However, an IDF-based system makes cluster construction feasible.
An example based on the GenBank Bacteria (gbbct* Aug 18, 2005) files is at:
http://www.cs.uni.edu/~okane/source/IDF/Bacteria-Clusters.gz
Of the 213,388 DNA sequences in the data set, those sequences with pairwise
similarities that exceeded a threshold of 80% were identified, sorted according
to strength of similarity, and submitted to a single-link clustering procedure.
A total of 22,709 clusters were identified with an average of 41.8 sequences
per cluster.
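
The clustering step can be sketched as follows, assuming the pairwise
similarities are already available as (id, id, similarity) tuples. This is a
generic single-link procedure built on a union-find structure, not the
distributed source code; the function name and the sample pairs are
hypothetical.

```python
def single_link_clusters(pairs, threshold):
    """Group ids linked by any chain of pairs whose similarity exceeds threshold."""
    parent = {}

    def find(x):
        # Locate the cluster representative, halving the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Keep pairs above the threshold, strongest similarity first.
    strong = sorted((p for p in pairs if p[2] > threshold),
                    key=lambda p: p[2], reverse=True)
    for a, b, _ in strong:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for item in parent:
        clusters.setdefault(find(item), set()).add(item)
    return list(clusters.values())

# Made-up similarities for illustration only.
pairs = [("a", "b", 0.90), ("b", "c", 0.85), ("c", "d", 0.50), ("e", "f", 0.95)]
clusters = single_link_clusters(pairs, 0.80)
```

In this sketch a sequence with no pair above the threshold, such as "d", does
not appear in any cluster.
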
Copies of the source code (GPL/LGPL License) are at:
http://www.cs.uni.edu/~okane/source/IDF/
Reference:
1. O'Kane, K.C., The Effect of Inverse Document Frequency Weights on Indexed
Sequence Retrieval, Online Journal of Bioinformatics, Volume 6 (2) 162-173, 2005.
--
Kevin C. O'Kane, Ph.D.
Professor of Computer Science
University of Northern Iowa
Cedar Falls, IA 50614-0507
(319) 273 7322 (Office + Voice Mail)
(319) 266 4131 (Iowa)
(508) 778 9485 (Massachusetts)
http://www.cs.uni.edu/~okane
okane at cs.uni.edu <--- preferred