[Bio-software] Text indexing and retrieval from Genbank files

Kevin O'Kane okane at cs.uni.edu
Sun Sep 11 10:48:17 EST 2005

Demo and source code (GPL license) are at:


MARBL (Mumps Analysis and Retrieval from Bioinformatics Libraries)

See O'Kane, K.C.; & Lockner, M.J, Indexing genomic sequence libraries, 
Information Processing & Management, Volume 41, Issue 2 , March 2005, 
Pages 265-274)

Kevin C. O'Kane
University of Northern Iowa
Cedar Falls, IA 50614-0507
okane at cs.uni.edu

Overview and Background

This system, MARBL (Mumps Analysis and Retrieval from Bioinformatics Libraries), 
is an implementation and toolkit to integrate multiple, very large, genomic, 
data bases into a unified data repository through open-source components and 
provide fast, web-based access to the results. This system differs from existing 
genomic text retrieval systems in that:

   1. it derives the index terms from the text rather than manual assignment;

   2. it is fully built upon open-source components;

   3. it is coded in an extended implementation language that supports multi-dimensional data bases; 

   4. it can index in a matter of hours the largest genomic data sets. 

While natural language text indexing can be approached from the perspective of assignment to 
pre-existing categories and hierarchies such as the National Library of Medicine MeSH (Medical 
Subject Headings) (Hazel, 1997), derivative indexing is better able to adapt to changes in a 
rapidly evolving discipline as the terms are dynamically extracted directly from the source 
material rather than waiting for manual analysis. Existing keyword based genomic retrieval 
systems are primarily based on assignment indexing whereas the approach taken here is based 
on derivative indexing, where both queries and documents are encoded into a common intermediate 
representation and metrics are developed to calculate the coefficients of similarity between 
queries and documents. Documents are ranked according to their computed similarity to the 
query and presented to the user in rank order. Several systems employing this and related 
models have been implemented such as Smart (Salton, 1968, 1971, 1983, 1988), Instruct 
(Wade, 1988), Cansearch (Pollitt, 1987) and Plexus (Vickery, 1987a, a987b). More recently, 
these approaches have been used to index Internet web pages and provide collaborative 
filtered recommendations regarding similar texts to book buyers at Amazon.com (Linden, 2003).

In this system, genomic accessions are represented by vectors that reflect accession content 
through descriptors derived from the source text by analysis of word usage (Salton,1968, 1971, 
1983, 1988; Willett, 1985; Crouch, 1988). This approach can be further enhanced identifying 
clusters of similar documents (El-Hamdouchi et al.,1988, 1989). Likewise, term-term co-occurrence 
matrices can be developed that can be used to identify similar or related terms and these 
can be automatically included into queries to enhance recall or to identify term clusters. 
Other techniques based on terms and queries have also been explored (Salton, 1988; Williams, 1983).

The vector model is rooted in the construction of document vectors that consisting of the 
weights of each term in each document. Taken collectively, the document vectors constitute a 
document-term matrix whose rows are document vectors. A document-term matrix can have millions 
of rows, more than 22 million in GenBank case, and thousands of columns (terms), more than 
500,000 in GenBank. This yields a matrix with potentially trillions of possible elements which 
must be quickly addressable not by numeric indices but also by text keys. Additionally, to 
enhance retrieval speed, an inverted matrix of the same size is needed which doubles the storage 
requirements. Fortunately, however, both matrices are very sparse.

Evaluation of retrieval effectiveness from a data set of this size is clearly difficult as there 
are few benchmarks against which to compare the results.  However, NCBI distributes a file of 
keyword phrases with GenBank  gbkey.idx (502,549,211  bytes in 2004).  This file contains submission 
author assigned keyword phrases and associated accession identifiers.  Of the 48,023 unique keys 
in gbkey.idx (after removal of special characters and words less than three characters in length), 
26,814 keys were the same as the keys selected by MARBL.  The 21,209 keys that differed were, for 
the most part, words of very high or low frequency that the system rejected due to preset thresholds.  
Alternatively, the MARBL system identified and retained a highly specific 501,614 terms, 
many of which were specific codes used to identify genes.  

We compared MARBL with BLAST by entering the nucleotide sequence of a Bacillus anthrtacis 
bacteriophage that was of interest to a local researcher.  BLAST retrieved 24 accessions.
The highest scoring accession was the correct answer, while the remainder were noise. When 
we entered the phrase anthracis & bacteriophage to the MARBL retrieval package, only one 
accession was retrieved, the same one that received the highest score from BLAST.  BLAST 
took 29 seconds, MARBL retrieval took 10 seconds.  It should be noted, however, that BLAST 
searches are not based on keywords but on genomic sequences.

Copies of all software in source code form are available from: http://www.cs.uni.edu/~okane.


Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman 
(1997). "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res., 25, 

American National Standards Institute, Inc. (1995). ANSI/MDC X11.4.1995 Information Systems . Programming Languages - M, 
American National Standards Institute, 11 West 42nd Street, New York, New York 10036.

Barker, W.C., et. al. (1999). The PIR-International Sequence Database, Nucleic Acids Research, 27(1) 39-43.

Barnett, G.O. & Greenes, R.A. (1970). High level programming languages, Computers and Biomedical Research, 3, 488.497.

Bowie, J & and Barnett, G. O. (1976). MUMPS . an economical and efficient time.sharing language for information management, 
Computer Programs in Biomedicine, 6, 11.21.

Crouch, C.J. (1988). An analysis of approximate versus exact discrimination values, Information processing and Management, 
24(1), 5.16.

El.Hamdouchi, A.; & Willet, P. (1988). An improved algorithm for the calculation of exact term discrimination values, 
Information Processing and Management, 24(1) 17.22.

El.Hamdouchi, A.; and Willet, P. (1989). Comparison of hierarchic and agglomerative clustering methods for document 
retrieval, The Computer Journal, 32(3) 220.227.

Hazel, P. (1997), Perl Compatible Regular Expression Library, Cambridge: University of Cambridge Computing Service.

Linden, G., & Smith, B. (2003). Amazon.com Recommendations: Item-to-Item Collaborative Filtering, IEEE Distributed Systems 
Online, http://dsonline.computer.org/0301/d/w1lind.htm

O'Kane, K.C. (1980). An RT.11 single user standard MUMPS interpreter, MUMPS Users' Group Quarterly, 10, 5.6.

O'Kane, K.C. (1983). A portable hybrid MUMPS development system host, Proc IEEE Computer Society 7th International Computer 
Software Applications Conference, 60.65.

O'Kane, K.C. (1992). A language for implementing information retrieval software, Online Review, 16(3) 127-137.  

O'Kane, K.C. (1999). An M Compiler for Internet server applications, M Computing, pp 11-17, 7(1) 11-17.

Pearson, W. R. (2000). Flexible sequence similarity searching with the FASTA3 program package Methods Mol.Biol., 132, 185-219

Pollitt, S. (1987). Cansearch: an expert systems approach to document retrieval, Information Processing and Management, 
23(2), 119.138.

Salton, G. (1968). Automatic Information Organization and Retrieval, New York: McGraw.Hill Book Company.

Salton, Gerard; and McGill, Michael J.; Introduction to Modern Information Retrieval, McGraw.Hill Book Company (New York, 

Salton, G. (1988). Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Reading 
Massachusetts: Addison.Wesley.

Salton, G. (editor) (1971). The SMART Retrieval System: Experiments in Automatic Document Processing, Englewood Cliffs, New 
Jersey: Prentice.Hall.

Shpaer, E. G., Robinson, M, et al. (1996). Sensitivity and Selectivity in Protein Similarity Searches: Comparison of 
Smith-Waterman in Hardware, Genomics, 38, 179-191.

Sleepycat Software, Inc. (2001). Berkeley DB, Indianapolis, IN: New Riders, ISBN 0-7357-1064-3

Smith, T. F., & Waterman, M.S. (1981) Identification of common molecular subsequence, J. Mol. Biol., 147(1), 195-197.

Thure, E. & Argos, P. (1993a). SRS an indexing and retrieval tool for flat file data libraries. Comput. Appl. Biosci., 9, 

Thure, E. & Argos, P. (1993b). Transforming a set of biological flat file libraries to a fast access network. Appl. Biosci., 
9, 59-64.

U.S. National Library of Medicine (2003), 2003 MeSH, Annotated Alphabetic List, 8600 Rockville Pike, Bethesda, MD 20894, NTIS 
Order number: PB2003-96480. (http://www.nlm.nih.gov/mesh/meshhome.html)

Vickery, A.; and Brooks, H.M. (1987a). PLEXUS: an expert system for referral, Information Processing and Management, 23, 

Vickery, A.; & Brooks, H.M. (1987b). A reference referral system using expert system techniques, Journal of Documentation, 
43, 1.23.

Wade, S.J.; and Willett, P. (1988). INSTRUCT: a teaching package for experimental methods in information retrieval. Part III. 
Browsing, clustering and query expansion, Program, 22, 44.61.

Willett, P. (1985). An algorithm for the classification of exact term discrimination values, Information Processing and 
Management, 21(3), 225.232.

Williams, J.H. (1963). A discriminant method for automatically classifying documents, Proc 1963 Fall Joint Computer 
Conference, 161.166.

Wu, C.H., et. al. (2003). The Protein Information Resource, Nucleic Acids Res., 31(1), 345-347.

September 2005 

More information about the Bio-soft mailing list

Send comments to us at biosci-help [At] net.bio.net