Determining/Removing Similar Sequences

Alan Williams Alan at Avocado.UCR.edu
Mon Dec 14 17:37:20 EST 1998

I am doing some analysis of the prevalence and position
of short sequence motifs in large sequence datasets, using 
the GenBank flat files.  The particular areas that I am 
interested in are non-coding regions proximal to coding regions 
(i.e., NTR & UTR). How are others handling similar sequences in 
the dataset?  My guess is that sets of sequences with greater 
than X% identity should be reduced to one sequence to prevent
biasing the statistical analysis.  So some specific questions:

(1)  When dealing with 1000 to 20,000 sequences, is it necessary to remove
     nearly identical sequences? In your experience does it make a 
     difference, or would just reporting the degree of near identity in
     the dataset be sufficient?
(2)  How would you go about determining the degree of nearly identical
     sequences in a dataset? (To report along with the analysis.)
(3)  What would a good cutoff value be for defining "nearly identical"?
(4)  What software is freely available to do this sort of determination
     of near identity and pruning?
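
To make the "greater than X% identity" idea concrete, here is a minimal
sketch of greedy pruning. It rests on several assumptions that are not
settled above: identity is defined as 1 - (edit distance / length of the
longer sequence), the longest sequence of each redundant set is the one
kept, and the cutoff of 0.85 and the four toy sequences are purely
illustrative. The function names are hypothetical. The all-against-all
comparison is quadratic, so this only shows the logic; it is not meant
for 20,000 real sequences.

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


def identity(a, b):
    """Assumed definition: 1 - edit distance / length of the longer sequence."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def prune(seqs, cutoff):
    """Keep a sequence only if its identity to every kept sequence is <= cutoff."""
    kept = []
    # Longest first, so the representative of a redundant set is its longest member.
    for s in sorted(seqs, key=len, reverse=True):
        if all(identity(s, k) <= cutoff for k in kept):
            kept.append(s)
    return kept


# Toy data (illustrative only): one exact duplicate, one single-base variant.
seqs = ["ACGTACGTAC", "ACGTACGTAT", "TTTTGGGGCC", "ACGTACGTAC"]
kept = prune(seqs, cutoff=0.85)
redundancy = 1 - len(kept) / len(seqs)  # fraction of sequences removed
print(kept, redundancy)
```

The redundancy value is one possible number to report alongside the
analysis (question 2); rerunning with several cutoffs would show how
sensitive the dataset size is to the choice in question 3.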


