removing ALUs and Vector sequences from text sequence files

Andy Law Andy.Law at bbsrc.ac.uk
Mon Jan 27 05:06:41 EST 1997

In article <5cg2q2$rs6 at mserv1.dl.ac.uk>, Rifat Hamoudi <rifat at icr.ac.uk> wrote:

 >  Hi,
 >          I used VEP and REP to remove vector and ALU repeat sequences 
 >  from the sequence text files. I would like to clarify a few points.
 >  Does VEP remove vector sequence for you? It seems to create a file 
 >  with extension corresponding to vector name and the original file 
 >  have nothing in it. How can one remove partial vector sequence?
 >  REP creates a filename with extension .ALU, when I look at the 
 >  contents of that file I found it to be the same as the original 
 >  except for a head indicating a number e.g. 5 100. Does this mean that 
 >  sequence 5 to 100 is ALU and should be ignored? Is there anyway using
 >  Staden software, that I can remove ALU sequences from text files 
 >  automatically? If so how?


I use Staden to scan for and remove vector sequences (but not ALU, since
my sequences are chicken sequences). Note that I also use ABI377 sequence
files as my starting point, so I may be way off the ball here, but anyway...

The core of the Staden package seems to be the experiment file. This holds
all the information pertaining to a particular sequence, i.e. clone name,
cloning vector and so on.

Provided that you have given vepe the correct information, it should scan
your sequence for vector (depending on the flags that you have set) and
write the results back into the experiment file with an identifying tag.
Let's assume that you have a sequence that is 650 bases long. The first 62
bases are sequencing vector, as are the last 50. The 5'-most 200 bases of
the sequenced insert are cloning vector, so we have the following:
     1         62               262                       600       650
     ............__________________--------------------------..........

where '.....' is sequencing vector
      '_____' is cloning vector
and   '-----' is 'good' sequence

In this case, after running vepe you should find the following in the
experiment file:

SL    62
SR    600
CS    63..262

which tells you just what you need to know: SL and SR mark the left and
right sequencing vector boundaries, and CS gives the range of cloning
vector (check the 'formats' section of the manual for the details).
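
As a rough illustration of reading those records back, here is a minimal
Python sketch (not part of the Staden package). It assumes each record is a
two-letter code followed by whitespace and a value, as in the three lines
above; the function name parse_clip_records is made up for this example.

```python
def parse_clip_records(text):
    """Collect SL, SR and CS records from experiment-file text.

    Assumption: one record per line, a two-letter code, then the value.
    Any other record type is ignored.
    """
    records = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts[0], parts[1].strip()
        if key in ("SL", "SR"):
            # Single position
            records[key] = int(value)
        elif key == "CS":
            # Range written as start..end
            start, end = value.split("..")
            records[key] = (int(start), int(end))
    return records


example = "SL    62\nSR    600\nCS    63..262\n"
clips = parse_clip_records(example)
print(clips)
```

The result is a small dictionary with the two positions and the cloning
vector range, which is enough to work out where the usable sequence lies.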

You can then use a Staden program called extract_seq to pull out just the
good bits (command line flag is -good_only)
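
To show what 'pulling out the good bits' amounts to, the following Python
sketch clips a sequence by hand. It is not how extract_seq is implemented.
It assumes 1-based inclusive coordinates, that the cloning vector directly
follows the left sequencing vector as in the 650-base example, and that
base SR is the last good base; check the manual for how the real tools
treat the boundaries. The name good_region is hypothetical.

```python
def good_region(seq, sl, sr, cs=None):
    """Return the part of seq left after clipping vector.

    Assumptions (not taken from the Staden sources):
      sl -- last base (1-based) of the left sequencing vector
      sr -- last good base (1-based) before the right sequencing vector
      cs -- optional (start, end) of cloning vector, 1-based inclusive;
            only used if it starts at or before the base following sl
    """
    left = sl
    if cs is not None and cs[0] <= left + 1:
        # Cloning vector butts up against the left clip: extend the clip
        left = max(left, cs[1])
    # 1-based base left+1 is Python index left; the slice end is exclusive
    return seq[left:sr]


# Toy 650-base read: 62 vector, 200 cloning vector, 338 good, 50 vector
read = "N" * 62 + "C" * 200 + "G" * 338 + "N" * 50
good = good_region(read, 62, 600, (63, 262))
print(len(good))
```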

All of this also assumes that you are using pregap to do the processing. If
you aren't, you may be missing an important step, and switching to pregap
may save you some time. I have successfully modified pregap and
incorporated it into a system that automatically scans a directory of
sequence files (usually an ABI377 gel's worth), identifies vector and poor
sequence, pulls out the good stuff, reformats it into GCG, submits BLAST
searches, scans the results, submits GRAIL searches and does a CpG
analysis. Anything seems to be possible :o)
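
The batch part of such a system can be sketched in Python as below. This is
only an outline under stated assumptions, not the modified pregap: the
function run_pipeline, the '*.exp' file pattern and the stand-in steps are
all hypothetical, and each real step (vector scan, BLAST submission and so
on) would replace a placeholder.

```python
import tempfile
from pathlib import Path


def run_pipeline(directory, steps, pattern="*.exp"):
    """Apply each step, in order, to every matching file in a directory.

    steps is a list of functions that each take and return a string.
    Returns a dict mapping file name to the final result.
    """
    results = {}
    for path in sorted(Path(directory).glob(pattern)):
        data = path.read_text()
        for step in steps:
            data = step(data)
        results[path.name] = data
    return results


# Demo with throwaway files and two trivial stand-in steps
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.exp").write_text("acgt\n")
    Path(tmp, "b.exp").write_text("ttga\n")
    Path(tmp, "notes.txt").write_text("ignored\n")
    demo = run_pipeline(tmp, [str.strip, str.upper])

print(demo)
```

Keeping each stage as a plain function makes it easy to add or drop a step
without touching the loop that walks the directory.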


Andy Law
( Andy.Law at bbsrc.ac.uk )
( Big Nose in Edinburgh )
