I am doing some analysis regarding the prevalence and position
of short sequence motifs from large sequence datasets using
the genbank flat files. The particular areas that I am
interested in are non-coding regions proximal to coding regions
(i.e. NTR & UTR). How are others handling similar sequences in
the dataset? My guess is that sets of sequences with greater
than X% identity should be reduced to one sequence to prevent
biasing the statistical analysis. So some specific questions:
(1) When dealing with 1,000 to 20,000 sequences, is it necessary to remove
nearly identical sequences? In your experience does it make a
difference, or would just reporting the degree of near identity in
the dataset be sufficient?
(2) How would you go about determining the degree of nearly identical
sequences in a dataset? (To report along with the analysis.)
(3) What would a good cutoff value be for defining "nearly identical"?
(4) What software is freely available to do this sort of determination
of near identity and pruning?
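
To make the idea concrete, here is a rough sketch of the kind of pruning I
have in mind. It is only an illustration under several assumptions: the
function names are made up, the 0.9 cutoff is an arbitrary placeholder for
the X% above, and difflib's ratio is a crude stand-in for a real
alignment-based percent identity. The greedy loop compares each sequence
against every representative kept so far, so it would be slow for the upper
end of the dataset sizes mentioned.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Crude similarity in [0, 1]; NOT a true alignment-based percent identity."""
    # autojunk=False stops difflib from ignoring frequent characters,
    # which would otherwise distort results on long nucleotide strings.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def prune_near_identical(seqs, cutoff=0.9):
    """Greedy pruning: keep a sequence only if it scores below the cutoff
    against every representative already kept (cutoff is a placeholder)."""
    kept = []
    for s in seqs:
        if all(identity(s, r) < cutoff for r in kept):
            kept.append(s)
    return kept


def redundancy_fraction(seqs, cutoff=0.9):
    """Fraction of sequences that pruning would remove; one possible
    number to report alongside the analysis."""
    if not seqs:
        return 0.0
    return 1 - len(prune_near_identical(seqs, cutoff)) / len(seqs)
```

Note that the result of greedy pruning depends on the order of the input, so
which member of a near-identical set is kept as the representative is
arbitrary in this sketch.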
Thanks,
Alan
************************************************************************
Alan Williams (finger alan at avocado.ucr.edu for pgp public key)
------------------------------------------------------------------------
University of California, Riverside
Dept. of Botany and Plant Sciences
Alan at Avocado.UCR.edu
"Where observation is concerned, chance favors the prepared mind."
-- Louis Pasteur
************************************************************************