Database of repetitive sequences?

Mon Mar 23 22:13:38 EST 1992

Bruce Roe suggests:

/1. Run Strings:
/        Search for the keyword "repeat" and search the GenEMBL
/        database with the output set to GENEMBL.STRINGS
/2. Run DataSet
/        To create the GCG data library from the set of sequences
/        in GCG format obtained as output from STRINGS
/        Assemble DATASET from what sequence(s) ?  @genembl.strings
/        What should I call the data library ?  repeats
/3. Sit back and watch all the work get done for you.


	This is easier and more difficult than you indicate. Easier
because you don't need to create a dataset (GCG's implementation of fasta
will accept "@genembl.strings" [or even *.seq] as the name of the database
to search). More difficult because it is *NEVER* a good idea to pull a
bunch of sequences out of GenBank and assume you have what you want (the
descriptions of them just aren't very good). Unless it's something I don't
care about (in which case I don't do it anyway :-) ) I always look at each
sequence to make sure it is what I want. Also, in this particular instance,
I don't need a zillion examples of Alu repeats. Furthermore, by looking at
the sequences you get a good idea of which repeats weren't found by
Strings*earch because your keyword "repeat" wasn't present. At the very least
I would search for "REPETITIVE" as well.
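
To illustrate the keyword point, here is a minimal sketch in Python. It is not how the GCG programs work internally; the function name find_candidates and the three entries are hypothetical, made up purely for illustration. It shows that a second keyword picks up a description the first one misses, and it only produces a candidate list that still has to be checked by eye.

```python
import re


def find_candidates(records, keywords):
    """Return (name, description) pairs whose description contains
    any of the keywords, ignoring case."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [(name, desc) for name, desc in records if pattern.search(desc)]


# Hypothetical entries, invented for this example (not real database records).
records = [
    ("ENTRY1", "Example Alu repeat, clone A"),
    ("ENTRY2", "Example highly REPETITIVE element"),
    ("ENTRY3", "Example kinase mRNA, complete cds"),
]

only_repeat = find_candidates(records, ["repeat"])
both = find_candidates(records, ["repeat", "repetitive"])

# Entries that the single keyword "repeat" would have missed.
missed = [r for r in both if r not in only_repeat]

# Candidates only: each one still needs to be looked at individually.
for name, desc in both:
    print(name, desc)
```

The match is a plain substring test on the description, so it is only as good as the descriptions themselves, which is exactly the problem.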

Steve Clark

clark at salk-sc2.sdsc.edu  (Internet)
clark at salk               (Bitnet)
