[Bio-software] Inverse Document Frequency based cluster construction

Kevin O'Kane okane at cs.uni.edu
Tue Oct 11 20:04:54 EST 2005

Inverse document frequency (IDF) weights measure the relative importance of 
indexing terms based on how the terms are distributed across a collection.  
When IDF weights are applied to segmented, overlapping, fixed-length n-grams 
derived from genomic libraries to build a retrieval system, sequences are 
retrieved substantially faster than with other widely used methods, and with 
comparable accuracy when mutation rates are moderate.  Because the database 
access method is indexed, retrieval speed depends primarily on the size of the 
query and the number of sequences actually retrieved rather than on the size 
of the database. [1]
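
As a rough illustration of the idea (not the code from the distribution 
mentioned below), the following Python sketch computes IDF weights over 
overlapping n-grams.  It assumes the textbook weight log(N / df), where N is 
the number of sequences and df is the number of sequences containing the 
n-gram; the weighting and segmentation actually used in [1] may differ.  The 
function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter


def ngrams(seq, n):
    """Return the set of distinct overlapping n-grams in seq."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def idf_weights(sequences, n):
    """Map each n-gram to log(N / df), an assumed textbook IDF form.

    N is the number of sequences; df is the number of sequences that
    contain the n-gram at least once.
    """
    doc_freq = Counter()
    for seq in sequences:
        doc_freq.update(ngrams(seq, n))
    total = len(sequences)
    return {gram: math.log(total / df) for gram, df in doc_freq.items()}


def score(query, target, weights, n):
    """Sum the IDF weights of the n-grams shared by query and target."""
    shared = ngrams(query, n) & ngrams(target, n)
    return sum(weights.get(gram, 0.0) for gram in shared)


# Toy data, for illustration only.
library = ["ACGTAC", "ACGGAC", "TTTTAC"]
weights = idf_weights(library, 2)
```

In this toy library the n-gram "AC" occurs in every sequence and so receives 
weight zero, while "GT" occurs in only one and receives the largest weight.  
Rare n-grams therefore dominate the score, which is what lets an index skip 
terms that discriminate poorly.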

Applying clustering techniques to large genomic libraries is difficult because 
of the time required to compute the large number of pairwise similarities.  
However, an IDF-based system makes cluster construction feasible.  An example 
based on the GenBank Bacteria (gbbct* Aug 18, 2005) files is at:


Of the 213,388 DNA sequences in the data set, those sequences with pairwise 
similarities that exceeded a threshold of 80% were identified, sorted according 
to strength of similarity, and submitted to a single-link clustering procedure. 
A total of 22,709 clusters were identified with an average of 41.8 sequences 
per cluster.
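
The clustering step described above can be sketched in Python as follows.  
This is a minimal sketch under stated assumptions, not the procedure from the 
distributed source: it assumes the pairwise similarities are already 
available as (id_a, id_b, similarity) tuples with similarity expressed as a 
fraction, and it uses a union-find structure to merge linked sequences.  The 
function name and the sample identifiers are hypothetical.

```python
def single_link_clusters(pairs, threshold):
    """Group ids into single-link clusters.

    pairs is an iterable of (id_a, id_b, similarity) tuples.  Only pairs
    whose similarity exceeds threshold are used as links.
    """
    parent = {}

    def find(x):
        # Locate the cluster representative, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Keep pairs above the threshold, strongest similarity first.
    kept = sorted(
        (p for p in pairs if p[2] > threshold),
        key=lambda p: p[2],
        reverse=True,
    )
    for a, b, _ in kept:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())


# Toy similarities, for illustration only.
pairs = [
    ("s1", "s2", 0.95),
    ("s2", "s3", 0.85),
    ("s4", "s5", 0.90),
    ("s3", "s4", 0.50),  # below the 0.80 threshold, so no link
]
clusters = single_link_clusters(pairs, 0.80)
```

Here s1, s2 and s3 end up in one cluster and s4 and s5 in another, because 
the only pair joining the two groups falls below the threshold.  With a 
single fixed threshold the final clusters do not depend on the order in which 
pairs are processed; the sort is kept only to mirror the description above.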

Copies of the source code (GPL/LGPL License) are at:


1. O'Kane, K.C., The Effect of Inverse Document Frequency Weights on Indexed 
Sequence Retrieval, Online Journal of Bioinformatics, Volume 6 (2), 162-173, 
2005.

Kevin C. O'Kane, Ph.D.
Professor of Computer Science
University of Northern Iowa
Cedar Falls, IA 50614-0507
(319) 273 7322 (Office + Voice Mail)
(319) 266 4131 (Iowa)
(508) 778 9485 (Massachusetts)
okane at cs.uni.edu <--- preferred
