In the course of studying repetitive sequences from a statistical point of
view, I frequently encountered sequence stretches that originate from the
multiple cloning site and adjacent regions of pUC/pBR-type vectors and the
like. To get an estimate of the amount of 'vector contamination' in EMBL and
GenBank, I ran a FASTA search against the EMBL primate section using the whole
pUC19 sequence as a probe and got more than 20 suspect scores.
A closer look at the top scorers revealed that most of them show their
homology in the region adjacent to the multiple cloning site. In at least one
of the sequence entries found, the pBR-part is annotated.
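
A rough sketch of how such a screen could be approximated without a full
FASTA run. This is not the FASTA algorithm: it only reports exact k-mers that
a database entry shares with a vector sequence and merges overlapping hits
into stretches. The function name shared_stretches, the choice of k, and the
toy sequences are assumptions made for illustration; the toy "vector" is
made up and is not pUC19.

```python
def shared_stretches(vector, entry, k=12):
    """Return (start, end) ranges in `entry` covered by exact k-mers
    that also occur somewhere in `vector` (end is exclusive)."""
    vector = vector.upper()
    entry = entry.upper()
    # Index every k-mer of the vector for constant-time lookup.
    vector_kmers = {vector[i:i + k] for i in range(len(vector) - k + 1)}
    stretches = []
    for i in range(len(entry) - k + 1):
        if entry[i:i + k] in vector_kmers:
            start, end = i, i + k
            if stretches and start <= stretches[-1][1]:
                # Overlaps or touches the previous hit: extend it.
                stretches[-1] = (stretches[-1][0], end)
            else:
                stretches.append((start, end))
    return stretches


# Toy data (hypothetical): 20 bases of the "vector" embedded in an entry.
toy_vector = "ACGTTGCAAGGCTTAACCGGATCGATCGTTAGC"
toy_entry = "CCCCCCCCCC" + toy_vector[5:25] + "GGGGGGGGGG"

print(shared_stretches(toy_vector, toy_entry, k=8))  # [(10, 30)]
```

Exact matching misses stretches with mismatches or gaps, which a real
similarity search would still pick up, and it checks one strand only, so this
is at best a quick first pass.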
I am not sure that the occurrence of vector sequences in primate genes is
desirable, and I think these stretches should perhaps be removed in future
releases. But if someone finds arguments for retaining them... feel free to
Here are my favourite pUC-containers (in EMBL notation):