Dear Netters,
Some of my clustering and database searching programs have been
described in a recent publication (see below) but some others
have not been announced before. The programs cover a wide range
of applications and should be of interest to many people. I wrote
all the programs on a Sun but I tried to keep the code portable;
all input and output is ASCII apart from index storage.
The copyright on all my code rests with the British MRC but I
expect that they would be happy to see more academic users. I
would be particularly interested to hear from anyone willing to
investigate the programs' portability.
"Clustering cDNA sequences", 1992, Parsons, J.D., Brenner, S. and
Bishop, M.J., Comput. Applic. Biosci., Vol 8, pp 461-466
Jeremy Parsons
==========================================================================
The ICAtools
------------
A set of programs has been written to quantify the similarities
between large numbers of DNA sequences. Using this information,
similar sequences are clustered together at the rate of thousands
of sequences per day. The cluster structure information is kept
in a small, space-efficient index file which ensures that disk
requirements are negligible. Index files are used to create
selective views and summaries of the entire sequence dataset of
interest. These summaries can form a useful overview analysis of
the data produced by any large-scale DNA sequencing project. The
ICAtools are useful for:
i) finding novel sequence families;
ii) database searching;
iii) linker and vector screening;
iv) determining the point of effective exhaustion of cDNA libraries;
v) sequence overlap detection as a precursor to contig building.
Linker and vector screening can normally be performed using a
database searching program such as BLAST or a specialised program
like Roger Staden's Vep, but these methods rely on knowing the
exact sequence of such artifacts. This information may be
confused or unavailable because of an experimenter's
administrative mistakes and protocol errors or because of
commercial secrecy. The ICAtools do not need any guiding
information to find the over-represented sequence segments that
characterise cloning artifacts. Thus, the programs have the
ability to find "features" that their users didn't know they were
looking for.
When used for database searching, one of the tools, ICAass, is
more sensitive than BLAST, though slower, and is faster than
FASTA for batches of sequences.
Together, the ICAtools are a useful and flexible package for both
the data-mining and the quality control of large DNA sequencing
projects. They have been used at the NCBI in the USA and in the
UK where they feature in the HGMPRC Computing Facility's menus.
ICAtool
ICAtool is a jack-of-all-trades cDNA clustering program. The
basis of the program is a FASTA-like algorithm which is used to
compare pairs of sequences. A full dynamic-programming algorithm
was implemented but then abandoned because it was unnecessarily
sensitive and slow. As an aid to performance, the results of
previously calculated comparisons are used to guide the choice of
which sequences are subsequently compared. This gives the program
a best-case computational complexity of order 'n', where n is the
number of sequences being clustered.
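
To illustrate how earlier results can cut down the number of later
comparisons, here is a minimal Python sketch. It is not ICAtool's
actual algorithm. It assumes a greedy scheme in which each new
sequence is compared only against one representative per existing
cluster, and it uses a shared k-mer count as a stand-in for the
FASTA-like comparison. The names kmers, similar and greedy_cluster
and the parameters k and min_shared are all hypothetical.

```python
def kmers(seq, k=8):
    """Return the set of all length-k words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similar(a, b, k=8, min_shared=3):
    """Hypothetical stand-in for a pairwise comparison: shared k-mers."""
    return len(kmers(a, k) & kmers(b, k)) >= min_shared


def greedy_cluster(seqs, k=8, min_shared=3):
    """Put each sequence in the first cluster whose representative it matches.

    Returns (clusters, comparisons), where clusters is a list of lists of
    indexes into seqs and comparisons is the number of pairwise tests made.
    """
    reps = []       # index of the representative sequence of each cluster
    clusters = []   # members of each cluster, as indexes into seqs
    comparisons = 0
    for i, s in enumerate(seqs):
        for c, r in enumerate(reps):
            comparisons += 1
            if similar(s, seqs[r], k, min_shared):
                clusters[c].append(i)
                break
        else:
            # No representative matched, so start a new cluster.
            reps.append(i)
            clusters.append([i])
    return clusters, comparisons
```

Under these assumptions, if every sequence matches the first
representative then n sequences need only n-1 comparisons, which is
the order 'n' best case. If nothing matches, the cost grows to
n(n-1)/2.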
In addition to clustering similar sequences together, ICAtool can
perform a rapid, focussed database search. In query mode, a
pre-prepared cluster index file is used to allow the searching
process to spend a disproportionately large amount of its time
comparing the query sequence against those indexed sequences to
which the query is most similar. ICAtool, by using file-pointers
rather than creating yet another sequence format, allows the
simultaneous use of 5 different, existing formats and also uses
negligible disk space by avoiding unnecessary information
duplication.
n2tool
The program n2tool is similar to ICAtool because it performs DNA
clustering and shares the ICAtool cluster index file format. The
programs differ in three ways:
i) n2tool cannot be used for querying;
ii) n2tool's pairwise comparison algorithm is more BLAST-like than
FASTA-like;
iii) n2tool is guaranteed to compare all the submitted sequences
against each other.
Using datasets typical in our laboratory (thousands of ~300 bp
fragments) n2tool is quicker than ICAtool and produces more
concise clustering. n2tool is the only program used for clustering
genomic data because its clustering algorithm is less affected by
multi-domain repetitive sequences.
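
The all-against-all strategy can be sketched in a few lines of
Python. This is an illustration under stated assumptions and not
n2tool's code: two sequences are linked when they share at least one
exact word of length w (a crude stand-in for a BLAST-like word hit),
and linked sequences are merged with a union-find structure. The
names words and all_pairs_cluster and the default w=11 are
hypothetical.

```python
from itertools import combinations


def words(seq, w=11):
    """Return the set of all exact length-w words in seq."""
    return {seq[i:i + w] for i in range(len(seq) - w + 1)}


def all_pairs_cluster(seqs, w=11):
    """Compare every pair of sequences and merge those sharing a word."""
    parent = list(range(len(seqs)))

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    word_sets = [words(s, w) for s in seqs]
    for i, j in combinations(range(len(seqs)), 2):
        if word_sets[i] & word_sets[j]:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Every one of the n(n-1)/2 pairs is tested, so no potential match is
skipped, at the price of quadratic work in the number of sequences.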
Both n2tool and ICAtool can incrementally expand their indexes to
allow extra sequences to be added at any time; this is achieved
at minimal cost by not repeating previous calculations. All the
programs share the same concise index structure.
ICAass
There are some clustering applications for which ICAtool and
n2tool are inappropriate because they use local-similarity
comparison algorithms. When clustering, the program ICAass uses a
novel global-similarity algorithm which determines whether one
sequence is an approximate subsequence of another. ICAass has
been used to cluster a size-sorted EMBL DNA database and was able
to shrink the database files by up to 50% by removing all
approximate subsequences.
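
The idea of shrinking a size-sorted database can be sketched as
follows. This is not the ICAass algorithm, which is not described
here. As a simplifying assumption, "approximate subsequence" is taken
to mean an ungapped match with at most max_mismatches substitutions,
found by a plain sliding window. The names approx_contains and
remove_redundant and the parameter max_mismatches are hypothetical.

```python
def approx_contains(long_seq, short_seq, max_mismatches=1):
    """True if short_seq matches some ungapped window of long_seq
    with at most max_mismatches substitutions (no insertions/deletions)."""
    n, m = len(long_seq), len(short_seq)
    for start in range(n - m + 1):
        window = long_seq[start:start + m]
        mismatches = sum(1 for a, b in zip(window, short_seq) if a != b)
        if mismatches <= max_mismatches:
            return True
    return False


def remove_redundant(seqs, max_mismatches=1):
    """Keep only sequences not approximately contained in a longer kept one."""
    kept = []
    # Longest first, so a sequence is only tested against longer-or-equal ones.
    for s in sorted(seqs, key=len, reverse=True):
        if not any(approx_contains(k, s, max_mismatches) for k in kept):
            kept.append(s)
    return kept
```

Processing the longest sequences first means each sequence is
tested only against candidates that could contain it, which is
presumably why a size-sorted database is convenient.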
In addition to shrinking databases, ICAass can be used to query
indexes, which it does very quickly using a local-similarity
algorithm and without the need for any specially formatted
databases.
ICAprint and ICAstats
ICAprint and ICAstats are a pair of programs that can be used
together to display how sequences have been clustered together.
ICAprint has many options that allow selected subsets of
sequences to be printed out. This allows, for example, an easy
selection of those sequences which didn't match any others or the
selection of single example sequences, one from each cluster.
ICAstats takes the output from ICAprint and produces an overview
of cluster sizes and some related statistics. ICAstats is
particularly useful to groups sequencing cDNA libraries because
the program uses a Poisson model to predict the number of as yet
unfound sequences left in the current library.
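
The exact Poisson model used by ICAstats is not spelled out here. As
an illustration only, the Python sketch below fits a zero-truncated
Poisson distribution to the observed cluster sizes, under the strong
assumption that every distinct sequence in the library is equally
abundant. The mean observed cluster size is set equal to
lambda / (1 - exp(-lambda)), lambda is found by bisection, and the
number of unseen sequences is then clusters / (exp(lambda) - 1). The
function name estimate_unseen is hypothetical.

```python
import math


def estimate_unseen(cluster_sizes):
    """Estimate (lambda, unseen) from a list of observed cluster sizes,
    assuming a zero-truncated Poisson with equal abundance per sequence."""
    c = len(cluster_sizes)          # distinct sequences seen so far
    n = sum(cluster_sizes)          # total sequences read
    mean = n / c
    if mean <= 1.0:
        raise ValueError("all clusters are singletons; estimate is unbounded")
    # Solve lam / (1 - exp(-lam)) = mean; the root lies in (0, mean).
    lo, hi = 1e-9, mean
    for _ in range(100):
        mid = (lo + hi) / 2
        if mid / (1 - math.exp(-mid)) < mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    # Expected number of sequences with zero reads so far.
    unseen = c / math.expm1(lam)
    return lam, unseen
```

Real cDNA libraries have very uneven abundances, so an
equal-abundance estimate like this one should be read as a rough
lower bound rather than a prediction.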
ICAmatches
ICAprint can clearly show which sequences have been clustered
together but the task of explaining why is left to ICAmatches. The
explanation necessarily involves showing some form of alignment
but the traditional multiple alignment style would be too verbose
and, by only marking conserved bases, uninformative. In every
cluster there is one type example sequence chosen by the
clustering program. ICAmatches creates a novel style of multiple
alignment by printing, underneath a listing of the type sequence,
the cumulative frequencies of those other sequences in the cluster
that match to that windowed region of the example sequence. This
allows a quick estimation of where and why all the sequences in
any cluster were put together. This is the best tool for
identifying unknown vector or linker sequences.
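
The style of display can be sketched in Python as below. This is a
guess at the general idea and not the ICAmatches output format. It
assumes that each other sequence in the cluster has already been
reduced to a half-open (start, end) interval on the type sequence,
and it prints one count per fixed-width window. The names
window_counts and render and the default window of 10 are
hypothetical.

```python
def window_counts(seq_len, matches, window=10):
    """Count, for each window of the type sequence, how many match
    intervals overlap it. Intervals are half-open: [start, end)."""
    n_windows = (seq_len + window - 1) // window
    counts = [0] * n_windows
    for start, end in matches:
        if end <= start:
            continue  # ignore empty intervals
        first = start // window
        last = (min(end, seq_len) - 1) // window
        for w in range(first, last + 1):
            counts[w] += 1
    return counts


def render(type_seq, matches, window=10):
    """Return the type sequence in blocks with the counts printed beneath."""
    counts = window_counts(len(type_seq), matches, window)
    seq_line = " ".join(type_seq[i:i + window]
                        for i in range(0, len(type_seq), window))
    count_line = " ".join(str(c).ljust(window) for c in counts)
    return seq_line + "\n" + count_line.rstrip()
```

A window whose count is far higher than its neighbours is a
candidate for an over-represented segment such as vector or linker.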