Bruce Roe suggests:
/1. Run Strings:
/
/ Search for the keyword "repeat" and search the GenEMBL
/ database with the output set to GENEMBL.STRINGS
/
/2. Run DataSet
/
/ To create the GCG data library from the set of sequences
/ in GCG format obtained as output from STRINGS
/
/ Assemble DATASET from what sequence(s) ? @genembl.strings
/
/ What should I call the data library ? repeats
/
/3. Sit back and watch all the work get done for you.
Bruce,
This is easier and more difficult than you indicate. Easier
because you don't need to create a dataset (GCG's implementation of fasta
will accept "@genembl.strings" [or even *.seq] as the name of the database
to search). More difficult because it is *NEVER* a good idea to pull a
bunch of sequences out of GenBank and assume you have what you want (the
descriptions of them just aren't very good). Unless it's something I don't
care about (in which case I don't do it anyway :-) ), I always look at each
sequence to make sure it is what I want. Also, in this particular instance,
I don't need a zillion examples of Alu repeats. Furthermore, by looking at
the sequences you get a good idea of which repeats weren't found by
Strings*earch because your keyword "repeat" wasn't present. At the very least
I would look for REPETITIVE as well.
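
To illustrate why a single keyword misses entries, here is a minimal sketch in Python. It is not GCG code and does not reproduce how Strings works internally. The function name match_keywords and the three entries are made up for illustration, and I am assuming a plain case-insensitive substring match against each description line.

```python
def match_keywords(entries, keywords):
    """Return entries whose description contains any keyword (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    hits = []
    for name, description in entries:
        text = description.lower()
        if any(k in text for k in wanted):
            hits.append((name, description))
    return hits


# Made-up entries for illustration only; these are not real GenEMBL records.
entries = [
    ("SEQ_A", "Hypothetical Alu repeat sequence"),
    ("SEQ_B", "Hypothetical highly REPETITIVE DNA element"),
    ("SEQ_C", "Hypothetical kinase mRNA"),
]

only_repeat = match_keywords(entries, ["repeat"])
both = match_keywords(entries, ["repeat", "repetitive"])

# "repetitive" does not contain the substring "repeat", so SEQ_B is
# only picked up when the second keyword is added.
print([name for name, _ in only_repeat])
print([name for name, _ in both])
```

Even with both keywords, the hits still have to be checked by eye, since a matching description does not guarantee the sequence is what you want.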
Steve Clark
clark at salk-sc2.sdsc.edu (Internet)
clark at salk (Bitnet)