IUBio Biosequences .. Software .. Molbio soft .. Network News .. FTP

SAD (SearchAndDestroy)

mathog at seqaxp.bio.caltech.edu mathog at seqaxp.bio.caltech.edu
Tue Jul 2 15:45:32 EST 1996


Hi,

There is a program that comes with REPBASE called CENSOR that can be
used to remove repetitive sequences found in the REPBASE database from
sequences.  Unfortunately, I was not able to build it on OpenVMS using
DEC C and DEC C++.  So, since one of my users wanted a solution fast, and
the port didn't look straightforward, I wrote a GCG program to handle this
task. 

SAD (SearchAndDestroy) takes a query sequence, compares it to each sequence
in a list and/or database (here parts of repbase) and obliterates matching
regions in the final output sequence.  That is, if it finds an Alu, in the
query sequence, it writes "." all over the ALU when it writes out the
final, modified, Query sequence. 

Here is an example where all human repetitive sequences are stripped as
are any simple repetitive sequeces (CTCTCTCT for instance).

$ create strip_human.list
..
REF_H:*
REF_SIMPLE:*
$ sad/infile=query.seq/infile2=@strip_human.list/outfile=query.sad

(.sad files are just sequence files, but since they typically look like
swiss cheese they have a different extension.)

SAD is vaguely FASTA like - it makes tuples, finds the highest scoring
diagonals by matching tuples, then scores diagonals.  Unlike FASTA, it
starts with the highest scoring diagonal and searches down until MISScut
diagonals past the last that had a match.  A match is a score above
CUToff within a WINDOW.  Defaults are tuples of size 6, cutoff =23,
window=30, MISScut=100, matrix=pileupdna.cmp.  The default fill character
is "." but you can change that (typically, to N).  The output file lists
all regions that were excised for each element. (These regions may be
overlapping, but it doesn't matter, since overwriting 5 times with "." is
the same as overwriting once.)  Here is part of an output file: 

SAD processed sequence file on July  2, 1996 13:32
   Infile1 = Killme.Sq
   Infile2 = @Strip_Human.List
   Matrix  = GenRunData:Pileupdna.Cmp

Deleted regions vs. screening sequence
    4016 -     4286  + REF_H:ALU
    5884 -     6106  + REF_H:ALU
    6108 -     6163  + REF_H:ALU
    7182 -     7477  + REF_H:ALU
   12327 -    12625  + REF_H:ALU
   13118 -    13387  + REF_H:ALU
    2822 -     2883  + REF_H:MIR
    2699 -     2737  - REF_H:MIR2
   11799 -    11903  + REF_H:L1
   11913 -    11990  + REF_H:L1
   15686 -    15771  + REF_H:L1
   12838 -    12869  - REF_H:L1
   15860 -    21888  - REF_H:L1
<snip>
   13407 -    13475  - REF_SIMPLE:GAAAA at 2
   18121 -    18159  + REF_SIMPLE:CAAAAA at 2
    7467 -     7504  - REF_SIMPLE:CAAAAA at 2
   12613 -    12647  - REF_SIMPLE:GAAAAA at 2
   13403 -    13468  - REF_SIMPLE:GAAAAA at 2
    6152 -     6199  - REF_SIMPLE:TAAAAA at 2
    6576 -     6605  - REF_SIMPLE:TAAAAA at 2
    6695 -     6727  - REF_SIMPLE:TAAAAA at 2


  Summary

   Test Sequences:                247
   Using Nmer:                      6
          Window:                  30
          Cutoff:                  23
          MissCut:                100
          fill character:           .
   Bases:
      Starting obliterated:         0
      Final obliterated:        10629
      Total:                    21888


After the signature you will find the files:

  sad.for
  sad.cmd         move to genrundoc
  sad.txt         move to genrundoc

Build on OpenVMS is via:

  $ for/extend/nolis sad.for
  $ sharelink sad

and define symbols.  Unix folks are on your own, but the equivalents should
be known to you.  Please report bugs to the address below.

Regards,

David Mathog
mathog at seqaxp.bio.caltech.edu
Manager, sequence analysis facility, biology division, Caltech 
***SAD.FOR*****************************************************************
!***  SAD = SearchAndDestroy **********************************************************
!*  
!*  Purpose:
!*      Search a query sequence with one or more test sequences and excise
!*      from the query any regions of similarity.  The query sequence
!*      is converted to strand specific tuples, as is the test sequence
!*      (both strands, but only one at a time).  Both lists are sorted and
!*      then searched/merged to produce a list of high scoring diagonals.
!*      That list is then sorted, and it is searched, highest first, for
!*      regions of similarity (nongapped) scoring above CUToff in a WINdow.
!*      The search terminates after MISScut diagonals without a hit.
!*
!*			Copyright 1996 David Mathog (except on GCG parts!)
!*
!*      Must be compiled with /extend as some lines are >72 characters
!*
!*******************************************************************************

	Program SAD

	Implicit None
        Include 'GenInclude:Sequence.inc'
	Record /Sequence/ SQQ
	Record /Sequence/ SQT
	integer*4 MAXMER
	parameter (MAXMER=16)
	structure / duo /	
	   integer*4  nmer
	   integer*4  pos
	end structure
	record / duo / ListQ(MAXSEQLEN)
	record / duo / ListT(MAXSEQLEN)
	record / duo / scratch(MAXSEQLEN)
	record / duo / diagonals(-MAXSEQLEN:MAXSEQLEN)
	character*1 fate(MAXSEQLEN+1)
	character*1 fill(100),cfill
	integer   speedQ(MAXSEQLEN)
	integer   speedT(MAXSEQLEN)
	integer   fastalpha(0:255)
	integer   cmptable(32,32)
	integer   cmpfile
	character*1 tablename(256)
	integer*4 LTlen,LQlen
	integer*4 count,precount
	logical  degenerate
	integer  alphatonum   !external function
	logical  getdatafile  !external function
C
	Integer Begin, End, TotLen, Check
	Integer Window, cutoff,misscut
	Integer OutFile,TmpFile
	Character WildName(256), InFName(256), OutFName(256)
	Character MakeFName(256)
	Character NameBuff1(256)
	Character NameBuff2(256)
	Integer Nsize, pos
	Character OutLine(256)
	Character Text(132)
	Logical Wild,Full
	Logical Interact, Monitor, Summary, Commas/.TRUE./
	Logical Protein, Interrupt
	Logical Statistics, Random,TooShort
	Integer NSeqs
	integer*4 i,j,istat
	Real Seconds

	Logical IsWild, IsProtein, SQNext, VMSInterrupt
	Logical CLRetBool, CLGetWildFName, CLGetOldFName, CLGetNewFName, CLNoInteract
	Logical CLGetInt, CLGETBOOL, CLGetStr
  	Integer GetString, StrFind,Str_len
	Integer SortFile
	Real CPUTime

C
	Call GCGInit('SAD')
	Call Doc()
	Call CLComCheck()
c
	Interact = .not.CLNoInteract()
	Monitor = Interact
	Summary = Interact
	Call CLGetBool('FULL', Full)
	Call CLGetBool('MONitor', Monitor)
	Call CLGetBool('SUMmary', Summary)
	Call CLGetBool('STATistics', Statistics)
c
	if(.not. CLGetStr('FILl', fill))then
	   cfill = '.'
	else
	   cfill = fill(1)
	end if
c
	do i = 0,255
	  fastalpha(i) = alphatonum(char(i))
	end do
c
c
c
!* identify input File1

        If ( .not.CLGetOldFName('INfile1', 1, NameBuff1)  ) Then
          If ( .not.CLNoInteract() ) then
	    Call WriteF('\n%s SAD of what sequence(s) ?  ', Text)
	    If ( GetString(NameBuff1).eq.0 ) Stop ' '
	  End If
        End If ! Block created by IfAnd for UNIX
	If ( NameBuff1(1).eq.Char(0) ) Call WriteF('\n\b *** ERROR, '//
     &	  'SAD requires one or more input files! ***\n\s') 
	Call CapFields(NameBuff1)			!CVMS

!* make up an credible output file name

	Call StrCopy(OutFName, NameBuff1)
	Call NewFileType(OutFName, '.sad')

!* get the user to accept this name

        If ( .not.CLGetNewFName('OUTfile1', 2, OutFName)  ) Then
          If ( .not.CLNoInteract() ) then
	    Call WriteF('\n What should I call the output file (* %s'//
     &      ' *) ?  ', OutFName)
	    Call GetString(OutFName)
	  End If
        End If ! Block created by IfAnd for UNIX


!* open output file 
	Call OpenFile(OutFile, OutFName, 'wb')
c
c
c
!* identify and process input File2

        If ( .not.CLGetWildFName('INfile2', 1, WildName)  ) Then
          If ( .not.CLNoInteract() ) then
	    Call WriteF('\n%s SAD of what sequence(s) ?  ', Text)
	    If ( GetString(WildName).eq.0 ) Stop ' '
	  End If
        End If ! Block created by IfAnd for UNIX
	If ( WildName(1).eq.Char(0) ) Call WriteF('\n\b *** ERROR, '//
     &	  'SAD requires one or more input files! ***\n\s') 
	Wild = IsWild(WildName)
	Call StrCopy(NameBuff2, WildName)
	Call CapFields(NameBuff2)			!CVMS
C
C  Make sure the Nmer size is something reasonable! It defaults to 6
C
	Nsize = 6
	call CLGetInt('NMEr', 1,MaxMer,Nsize)
!	if(.not. CLGetInt('NMEr', 1,MaxMer,Nsize))then
!	Do while(Nsize.lt.1 .or. Nsize.gt.MAXMER)
!          If ( .not.CLNoInteract() ) then
!	    Call WriteF('\n Please specify an Nmer value (1 to %2d,inclusive) ',maxmer)
!	    Call GetInt(Nsize)
!	  else
!	   stop 'Fatal Error - specify a length with the /Nmer switch!'
!	  end if
!	end do
!	end if
C
C  Window and cutoff values - defaults to window = 30, cutoff =23
C
	window = 30
	cutoff = 23
	call  CLGetInt('WINdow', 1,MAXSEQLEN,window)
	call  CLGetInt('CUToff', 1,Window,CUToff)
!	if(.not. CLGetInt('WINdow', 1,MAXSEQLEN,window))then
!	Do while(Window.lt.1 .or. Window.gt.MAXSEQLEN)
!          If ( .not.CLNoInteract() ) then
!	    Call WriteF('\n Please specify an window value (1 to %2d,inclusive) ',MAXSEQLEN)
!	    Call GetInt(Window)
!	  else
!	   stop 'Fatal Error - specify a window length with the /WINdow switch!'
!	  end if
!	end do
!	end if
!
!	if(.not. CLGetInt('CUToff', 1,Window,CUToff))then
!	Do while(CUToff.lt.1 .or. CUToff.gt.WINDOW)
!          If ( .not.CLNoInteract() ) then
!	    Call WriteF('\n Please specify an CUToff value (1 to %2d,inclusive) ',Window)
!	    Call GetInt(CUToff)
!	  else
!	   stop 'Fatal Error - specify a CUToff length with the /CUToff switch!'
!	  end if
!	end do
!	end if
C
C  MISSCUT value (number of diagonals to examine, sorting down from the
C  top, AFTER the last one that stomped something)
C
	MISScut = 100
	call  CLGetInt('MISScut', 1,MAXSEQLEN,MISScut)
c
c	Leave this as a variable, just in case in some future version
c	strand specific searching is required.
c
	degenerate = .false.
C
C	Process infile1
C
	if(.not. SQNext(NameBuff1, SQQ))then
	   stop 'Fatal Error - infile1 not found or invalid'
	else
	   if(SQQ.type .eq. SQPROTEIN)stop 
	1     'SAD - Fatal error - infile must be DNA'
	   do i = 1,SQQ.len
	     speedQ(i) = alphatonum(SQQ.seq(i))
	   end do
	   call fwritef(outfile,
	1    'SAD processed sequence file on %t\n')
	   call fwritef(outfile,
	1    '   Infile1 = %s\n',NameBuff1)
	   call fwritef(outfile,
	1    '   Infile2 = %s\n',WildName)
	   call StrToUpper(SQQ.seq)
	   precount = 0
	   i = 1
	   do while(SQQ.seq(i) .ne. char(0))
	     if(SQQ.seq(i) .eq. cfill)precount = precount + 1
	     i = i + 1
	   end do
	   call SpewNmers(SQQ.seq,1,SQQ.len,Nsize
	1 ,scratch,degenerate,ListQ,LQlen)
	   call binsort(ListQ,scratch,LQLEN,0)
	   call strcopy(fate,SQQ.seq)	    
	end if
c
c	get the comparison table, it defaults to the one shown
c	unless /IDEntities is on the command line, then it is a pure
c	diagonal matrix of 100's.
c
	if(clgetbool('IDEntities'))then
	  do i = 1,32
	    do j = 1,32
	       if(i .eq. j)then
	          cmptable(i,j) = 100
	       else
	          cmptable(i,j) = 0
	       end if
	    end do
	  end do
	  call fwritef(outfile,
	1    '   Matrix  = None - Identities/n')
	else
	  call StrCopy(TableName, 'pileupdna.cmp')
          If ( GetDataFile(CmpFile, TableName) ) then
            If ( .not.CLNoInteract() ) Call WriteF('\n ***'//
     &      ' I read your local data file "%s". ***\n\b', TableName)
          End If
          Call ReadCmpTab(CmpFile, CmpTable)
	  call fwritef(outfile,
	1    '   Matrix  = %s\n',TableName)
	end if
c
	call fwritef(outfile,
	1    '\nDeleted regions vs. screening sequence\n')

!* now do each test sequence

	NSeqs = 0
	Interrupt = .FALSE.
c
c

!(mks 12-July-1990) Replaced DBNextSeq with SQNext

	Do While ( SQNext(WildName, SQT))
	  Call StrCopy(InFName, SQT.Name)
	  NSeqs = NSeqs + 1
	  If ( NSeqs.eq.1 ) then
	    Check = SQT.Check
	    Protein = IsProtein(SQT.Seq)
	    Call VMSSetTrap(.TRUE.)
	  End If
	  If ( Wild ) then
            Begin = 1
	    End = SQT.Len
	  End If
c
c	convert the test sequence to Nmers and sort them
c
	   call StrToUpper(SQT.seq)
c
	   call StrCopy(MakeFName,' + '//char(0))
	   call StrConcat(MakeFName,InFName)
	   call SpewNmers(SQT.seq,1,SQT.len,Nsize
	1 ,scratch,degenerate,ListT,LTlen)
	   call binsort(ListT,scratch,LTLEN,0)
	   call mergescore(Outfile,Diagonals,cfill
	1    ,scratch,MakeFName
	1    ,SQQ.seq,SQT.seq,ListQ,ListT
	1    ,LQLen,LTLen,SQQ.len,SQT.len,fate
	1    ,window,cutoff,misscut,nsize
	1    ,speedQ,speedT,fastalpha,cmptable)

	   call RevSeq(SQT.seq,1,SQT.len)
	   call StrCopy(MakeFName,' - '//char(0))
	   call StrConcat(MakeFName,InFName)
	   call SpewNmers(SQT.seq,1,SQT.len,Nsize
	1 ,scratch,degenerate,ListT,LTlen)
	   call binsort(ListT,scratch,LTLEN,0)
	   call mergescore(Outfile,Diagonals,cfill
	1    ,scratch,MakeFName
	1    ,SQQ.seq,SQT.seq,ListQ,ListT
	1    ,LQLen,LTLEN,SQQ.len,SQT.len,fate
	1    ,window,cutoff,misscut,nsize
	1    ,speedQ,speedT,fastalpha,cmptable)
c
	  Call BaseName(InFName)
	  If (  Wild .and. Monitor ) Call WriteF('\n %s', InFName)
	  If ( VMSInterrupt() ) then
	    Interrupt = .TRUE.
	    Go to 1000
	  End If
	End Do
c
c	Write the summary information into the file and also to the screen
c	if SUMMARY is on
c
	call strcopy(SQQ.seq,fate)
c
	count = 0
	i = 1
	do while(SQQ.seq(i) .ne. char(0))
	   if(SQQ.seq(i) .eq. cfill)count = count + 1
	   i = i + 1
	end do
c
c	Last piece goes to the screen/file
c
	call fwritef(outfile,
	1    '\n\n  Summary\n\n')
	call fwritef(outfile,
	1    '   Test Sequences:           %8d\n',nseqs)
	call fwritef(outfile,
	1    '   Using Nmer:               %8d\n',nsize)
	call fwritef(outfile,
	1    '          Window:            %8d\n',window)
	call fwritef(outfile,
	1    '          Cutoff:            %8d\n',cutoff)
	call fwritef(outfile,
	1    '          MissCut:           %8d\n',misscut)
	call fwritef(outfile,
	1    '          fill character:           %c\n',cfill)
	call fwritef(outfile,
	1    '   Bases:\n')
	call fwritef(outfile,
	1    '      Starting obliterated:  %8d\n',precount)
	call fwritef(outfile,
	1    '      Final obliterated:     %8d\n',count)
	call fwritef(outfile,
	1    '      Total:                 %8d\n\n',SQQ.len)
c
	call SQWrite(outfile,SQQ)
c
	if(summary)then
	call writef(
	1    '\n\n  Summary\n\n')
	call writef(
	1    '   Test Sequences:           %8d\n',nseqs)
	call writef(
	1    '   Using Nmer:               %8d\n',nsize)
	call writef(
	1    '          Window:            %8d\n',window)
	call writef(
	1    '          Cutoff:            %8d\n',cutoff)
	call writef(
	1    '          MissCut:           %8d\n',misscut)
	call writef(
	1    '          fill character:           %c\n',cfill)
	call writef(
	1    '   Bases:\n')
	call writef(
	1    '      Starting obliterated:  %8d\n',precount)
	call writef(
	1    '      Final obliterated:     %8d\n',count)
	call writef(
	1    '      Total:                 %8d\n\n',SQQ.len)
	end if
C
	Call VMSSpringTrap()

!        if (.true.) then
!		stop 'TERMINATED AFTER COMPUTATION LOOP'
!        end if

	If ( NSeqs.eq.0 ) then
	  Call WriteF('\n\b *** No sequences in "%s"! ***\n', WildName)
	  If ( ClNoInteract() ) Stop ' '
	End If

!* make up a heading for the output file now

1000	continue

!* write out that data
!* print a little summary

	If ( Interrupt ) then
	  Call WriteF(' INTERRUPTED!!!!!\n\b')
	Else if ( NSeqs.gt.50 .and. Summary ) then
	  Call WriteF('\b')
	End If
	
        Call WriteF('\o\n\n SAD complete:\n\n'//

     &	    '      Sequences: %D\n'//
     &	    '       CPU time: %T\n\n'//
     &	    '    Output file: %n\n',

     &	    NSeqs,  OutFile)

!* and do the account cleanup

	Call Account('SAD')
	If ( Interact ) Call WriteF('\n')
	End ! of RunNMer main block


	
!***  SpewNmers ******************************************************************
!*
!*	packs Nmer into a binary form and writes it with sequence number
!*      to an array
!*
!*******************************************************************************

	Subroutine SpewNmers(Strand,Begin,End,Nsize
	1 ,useit,degenerate,holdem,ListLen)
	
	implicit none
        Include 'GenInclude:Sequence.inc'

	Character Strand(*)
	integer useit(MAXSEQLEN)
	integer holdem(2,MAXSEQLEN)
	Integer Begin, End, ListLen
	Integer Nsize !size of the nmer
	logical degenerate

	character buffer(8)
	character dbuffer(8)
	integer*4 jishft,jior,jiand

	integer*4 i,j,k,kb,m
	integer*4 whatIam,mask	!32 bit counter for seqs of up to length 16
	integer*4 dwhatIam,dmask !ditto, but for reverse strand
	integer*4 dA,dG,dC,dT
	integer*4 ick,temp
	integer*4 calcsize
	logical   uint_lt

	integer alphatonum	!this is an external integer function
c
c	define an unsigned integer lt function
c
	uint_lt(i,j) = (
	1               (i .ge. 0) .and.                 !0xxxxxx
	1               (
	1                (
	1                 (j .ge. 0) .and. (i .lt. j)   !0xxxxxx < 0yyyyy
	1                )
	1                .or.
	1                .not. (j .ge. 0)             !0xxxxxx < Fyyyyy
	1               )
	1              )
	1
	1               .or.
	1
	1              (
	1               (.not. (i .ge. 0)) .and.           !Fxxxxxx
	1               (
	1                 .not. (j .ge. 0) .and. (i .lt. j)   !Fxx < Fyy 
	1               )
	1              )

c
c	NOTES:  strand(*) is an array of character*1 values
c	this next part extracts the part which runs from 
c	from BEGIN to END, this range is CALCSIZE
c
c	convert alpha to numeric for faster computation
c	no checks on what comes through
c
c
	j=0
	calcsize=end-begin+1
	do i=begin,end
	     j=j+1
	     useit(j)=alphatonum(strand(i))
	end do	
c
c	all A's, but it won't matter, see below...
c
	whatIam=0
c
c	Set up the mask.  This will zero above the count area and also the
c	bottom two bits (the next letter to go in)
c	also setup dmask, which is used for the opposite strand.
c	For nsize = 16, Mask -> 1111111111111100
c	For nsize = 16,dMask -> 0011111111111111
c
	Mask=0
	do k=1,Nsize-1
	   dMask = mask
	   mask=jior(mask,3)	!bits 11
	   mask=jishft(mask,2)
	end do
	dMask = jior(dmask,3)
	Mask=jiand(Mask,'FFFFFFFC'x)
c
c	set up dA,dG,dC,dT, complements of what is
c	used on the forward strand
c
	dA = jishft(3,2*(nsize-1))  !A->T = 11
	dG = jishft(2,2*(nsize-1))  !G->C = 10
	dC = jishft(1,2*(nsize-1))  !C->G = 01
	dT = 0                !T->A = 00
c
c	Now start adding them up
c
c	ick is a counter of the number of VALID nucleotides that have
c	been loaded into whatIam. If ick is >= nsize then that region
c	may be written.  It gets reset to 0 on sequence start, and when
c	any non AGCT are encountered.
c
	ListLen = 0
	ick=0
	kb = -Nsize + 1
	do k = 1,calcsize
	   kb = kb + 1
	   whatIam=jishft(whatIam,2)
	   whatIam=jiand(whatIam,Mask)  !clear the lowest 4 bits and above N
	   dwhatIam=jishft(dwhatIam,-2)
	   dwhatIam=jiand(dwhatIam,dMask) !Ditto for the reverse
	   ick = ick + 1                !indicate that a valid letter added
	   if(useit(k).eq.1)then	!A  = bits 00
	      dwhatIam = jior(dwhatIam,dA)	
	   else if(useit(k).eq.7)then	!G  = bits 01 
	      dwhatIam = jior(dwhatIam,dG)	
	      whatIam = jior(whatIam,1)
	   else if(useit(k).eq.3)then	!C  = bits 10
	      dwhatIam = jior(dwhatIam,dC)	
	      whatIam = jior(whatIam,2)
	   else if(useit(k).eq.20)then	!T  = bits 11
C	dT is 0, so no point "OR"ing that in.
C	      dwhatIam = jior(dwhatIam,dT)
	      whatIam = jior(whatIam,3)
	   else				!bad, must be an N or something
	      ick=0                     !set valid letters back to 0
	   end if
c
c	If DEGENERATE is set, then output whichever is smaller (unsigned),
c	 whatIam or dwhatIam.  Otherwise, just output whatIam.
c
	   if(ick.ge.nsize)then
	     ListLen = ListLen + 1
	     if(degenerate)then
	        if(uint_lt(whatIam,dwhatIam))then
	          holdem(1,ListLen)=whatIam
	        else
	          holdem(1,ListLen)=dwhatIam
	        end if
	     else
	        holdem(1,ListLen)=whatIam
	     end if
	     holdem(2,ListLen)=kb
	   end if
	end do
c
c
	Return
	End ! of GetStat

c
c	Binsort, mode = 0, sort on first (nmer), mode=1,
c	sort on second (pos)
c
	subroutine binsort(scratch,toss,nsum,mode)
	implicit none
	integer*4 nsum,mode
	structure /duo/
	   integer*4 nmer
	   integer*4 pos
	end structure
	record /duo/ scratch(nsum)
	record /duo/ toss(nsum)
	record /duo/ temp
c
	integer*4 ff,fe,fp,ef,ee,ep	!f=front, e=end, p = pointer
	integer*4 out,size,i,j
	logical special,maintest
	logical   uint_lt
c
c	define an unsigned integer lt function
c
	uint_lt(i,j) = (
	1               (i .ge. 0) .and.                 !0xxxxxx
	1               (
	1                (
	1                 (j .ge. 0) .and. (i .lt. j)   !0xxxxxx < 0yyyyy
	1                )
	1                .or.
	1                .not. (j .ge. 0)             !0xxxxxx < Fyyyyy
	1               )
	1              )
	1
	1               .or.
	1
	1              (
	1               (.not. (i .ge. 0)) .and.           !Fxxxxxx
	1               (
	1                 .not. (j .ge. 0) .and. (i .lt. j)   !Fxx < Fyy 
	1               )
	1              )

	
	if(nsum .lt. 2)return
	size = 1
	do while (size .le. nsum)
	ff = 1
	fe = ff + size - 1

	do while (fe .lt. nsum)
	   ef = fe + 1
	   ee = ef + size - 1
	   if(ee .gt. nsum) ee = nsum
	   fp = ff
	   ep = ef
	   out = ff
	   do while(out .le. ee)
	     if(mode .eq. 0)then
	        maintest = uint_lt(scratch(fp).nmer,scratch(ep).nmer)
	     else
	        maintest = scratch(fp).pos .lt. scratch(ep).pos
	     end if
	     if(maintest) then
	        toss(out) = scratch(ep)
	        out = out + 1
                if(ep .lt. ee)then
                   ep = ep + 1
                else                !move the rest of the f*  block
                   do while (fp .le. fe)
	              toss(out) = scratch(fp)
	              out = out + 1
                      fp = fp + 1
	           end do
                end if
	     else
	        toss(out) = scratch(fp)
	        out = out + 1
                if(fp .lt. fe)then
                   fp = fp + 1
                else                !move the rest of the e*  block
                   do while (ep .le. ee)
	              toss(out) = scratch(ep)
	              out = out + 1
                      ep = ep + 1
	           end do
                end if
	     end if
	   end do
	   ff = ee + 1
	   fe = ff + size -1
	end do
c
c	Write the toss area back over the input file in
c	preparation for the next step, or from exit from the routine
c	The reason it only goes to "out" is that their may be an "odd"
c	block on the end with no partner to swap with at some iterations.
c
	do out = 1, out - 1
	  scratch(out) = toss(out)
	end do
c
c	set up for the next iteration
c
	size =size + size
	end do
c
c	yes, this is dumb, and the sort should just be rewritten, but
c	the sort is Greatest->least and we need the opposite, so just swap
c
	j= nsum
	do i =1,nsum/2
 	  temp=scratch(i)
	  scratch(i)=scratch(j)
	  scratch(j)=temp
	  j= j - 1
	end do
c
	return
	end

	subroutine mergescore(Outfile,Diagonals,cfill,
	1    scratch,TestName,
	1    Qseq,TSeq,ListQ,ListT,
	1    LQlen, LTlen, Qlen,Tlen,
	1    fate,window,ecutoff,misscut,nsize
	1    ,speedQ,speedT,fastalpha,cmptable)
	implicit none
        Include 'GenInclude:Sequence.inc'
	integer*4 outfile,Qlen,Tlen,LQlen,LTlen,window,ecutoff,nsize
	character*1 TestName(*),cfill
	character*1 Qseq(*),TSeq(*)
	structure /duo/
	  integer*4 nmer
	  integer*4 pos
	end structure
	record /duo/ ListQ(LQlen)
	record /duo/ ListT(LTlen)
	record /duo/ scratch(MAXSEQLEN)
	record /duo/ Diagonals(-MAXSEQLEN:MAXSEQLEN)
	character*1 fate(Qlen)
	integer   speedQ(QLEN)
	integer   speedT(TLEN)
	integer   fastalpha(0:255)
	integer   cmptable(32,32)
	integer   alphatonum
c
	integer*4 cutoff
	integer*4 i,j,k
	integer*4 Qi,Ti,Qstart,Tstart
	integer*4 process,diag,Qpos
	integer*4 count,lastout
	logical   ok
	logical   missed
	integer*4 misses,misscut
	logical   uint_lt
	
c
c	define an unsigned integer lt function
c
	uint_lt(i,j) = (
	1               (i .ge. 0) .and.                 !0xxxxxx
	1               (
	1                (
	1                 (j .ge. 0) .and. (i .lt. j)   !0xxxxxx < 0yyyyy
	1                )
	1                .or.
	1                .not. (j .ge. 0)             !0xxxxxx < Fyyyyy
	1               )
	1              )
	1
	1               .or.
	1
	1              (
	1               (.not. (i .ge. 0)) .and.           !Fxxxxxx
	1               (
	1                 .not. (j .ge. 0) .and. (i .lt. j)   !Fxx < Fyy 
	1               )
	1              )

c
c	initialize the diagonals
c
	do i = -Tlen + 1, Qlen - 1
	  diagonals(i).nmer = 0
	  diagonals(i).pos  = i
	end do
c
c	cmptable has values that have been multiplied by 100, so
c	multiply cutoff by that too.
c
	cutoff = ecutoff * 100
c
c
c	convert T character sequence to integer sequence,
c	Q will have already been converted
c
	do i = 1,Tlen
	   speedT(i) = alphatonum(Tseq(i))
	end do
c
c	Fill the diagonals with tuple hits
c
	i = 1
	j = 1
	do while(i .le. LQlen .and. j .le. LTlen)
	  if(  uint_lt(  listQ(i).nmer , listT(j).nmer   )  )then
	    i = i + 1
	  else if(listQ(i).nmer .eq. listT(j).nmer)then
	    k = j
	    process = listQ(i).nmer
	    Qpos = listQ(i).pos
	    do while( process .eq. listT(k).nmer  .and.
	1             k .le. LTlen)
	      diag = Qpos - listT(k).pos
	      if(Qpos .ne. 0 .and. listT(j).pos .ne. 0)then
	        diagonals(diag).nmer = diagonals(diag).nmer + 1
	      end if
	      k = k + 1
	    end do
	    i = i + 1
	  else
	    j = j + 1
	  end if
	end do
c
c	All the diagonals are filled.  Sort them
c
	call binsort(diagonals(-Tlen + 1),scratch,Qlen + Tlen - 1,0)
c
c	Process successive diagonals until one is found that does NOT
c	have a region that satisfies WINDOW/CUTOFF.  It only checks
c	identities, cases have been set above.
c
	ok = .true.
	misses = 0
	k = Qlen - 1
	do while(ok)
	   diag = diagonals(k).pos
	   missed = .true.
	   if(k .lt. 1 - Tlen .or. misses .gt. misscut)then
	     ok = .false.
	   else
	     Qstart = max(diag + 1,1)
	     Qi     = Qstart
	     Tstart = max(1 - diag,1)
	     Ti     = Tstart
	     count = 0
	     lastout = 0
	     do while(Qi .le. Qlen .and. Ti .le. Tlen)
!	       if(Qseq(Qi) .eq. TSeq(Ti))count = count + 1
	       count = count + cmptable(speedQ(Qi),speedT(Ti))
	       if(qi - qstart .gt. window)count = count - 
	1          cmptable(speedQ(Qi-window),speedT(Ti-window))
	       if(count .ge. cutoff)then
	          do i = max(lastout + 1, Qi - window + 1),Qi
	            fate(i)='!'
	  	  end do
	          lastout = Qi
	          missed = .false.
	        end if
	        Qi = Qi + 1
	        Ti = Ti + 1
	     end do
	     if(missed)then
	       misses = misses + 1
	     else
	       misses = 0
	     end if
c
	  end if
	  k = k - 1
	end do
!
!   This is the original version, which only used identities.  It
!   was slightly faster, but couldn't handle consensus sequences well.
!c
!c	Process successive diagonals until one is found that does NOT
!c	have a region that satisfies WINDOW/CUTOFF.  It only checks
!c	identities, cases have been set above.
!c
!	ok = .true.
!	misses = 0
!	k = Qlen - 1
!	do while(ok)
!	   diag = diagonals(k).pos
!	   missed = .true.
!	   if(k .lt. 1 - Tlen .or. misses .gt. misscut)then
!	     ok = .false.
!	   else
!	     Qstart = max(diag + 1,1)
!	     Qi     = Qstart
!	     Tstart = max(1 - diag,1)
!	     Ti     = Tstart
!	     count = 0
!	     lastout = 0
!	     do while(Qi .le. Qlen .and. Ti .le. Tlen)
!	       if(Qseq(Qi) .eq. TSeq(Ti))count = count + 1
!	       if(qi - qstart .gt. window)then
!	           if(Qseq(Qi - window) .eq. TSeq(Ti - window))
!	1            count = count - 1
!	       end if
!	       if(count .ge. cutoff)then
!	          do i = max(lastout + 1, Qi - window + 1),Qi
!	            fate(i)='!'
!	  	  end do
!	          lastout = Qi
!	          missed = .false.
!	        end if
!	        Qi = Qi + 1
!	        Ti = Ti + 1
!	     end do
!	     if(missed)then
!	       misses = misses + 1
!	     else
!	       misses = 0
!	     end if
!c
!	  end if
!	  k = k - 1
!	end do
c
c
c	find all runs of "!" and convert them to cfill (usually = "."), also emit the
c	regions stepped on:
c
	i = 1
	j = 0
	do while(i .le. Qlen)
	  if(fate(i) .eq. '!')then
	    if(j .eq. 0)j = i
	    fate(i) = cfill
	  else
	    if(j .ne. 0)then
	      call fwritef(outfile,'%8d - %8d %s\n',j,i-1,TestName)
	      j = 0
	    end if
	  end if
	  i = i + 1
	end do
	if(j .ne. 0)call
	1      fwritef(outfile,'%8d - %8d %s\n',j,qlen,TestName)
c
	return
	end

c
c	remove all copies of any duplicates.
c
	subroutine squeezelist(list,llen,squeeze)
	implicit none
	integer*4 llen, squeeze
	structure / duo /
	   integer*4 nmer
	   integer*4 pos
	end structure
	record / duo / list(llen)
c
	integer*4 i,j
	logical   scan
c
	i = 1
	j = 2
	scan = .false.
	do while(j .le. llen)
	  if(list(i).nmer .eq. list(j).nmer)then
	    scan = .true.
	  else
	    if(scan)then
	      list(i) = list(j)
	      scan = .false.
	    else
	      i = i + 1
	      list(i) = list(j)
	    end if
	  end if
	  j = j + 1
	end do
	squeeze = i
	return
	end
***SAD.CMD*****************************************************************
Syntax: $ SAD /INfile=query.seq/Infile2=test.seq

Required Parameters:

/INFile=Query.seq               Sequence to process
/INFile2=test.seq               Test sequence(s) to remove from the Query
[/OUTfile=query.sad]            Output sequence file name

Optional Parameters:

/DATA=PileUpDNA.Cmp             scoring matrix for nucleic acids
/IDEntities                     ignore matrix, score only A,G,C,T identities
/NMER=6                         Length of the tuple to use
/WINdow=30                      Number of bases in scoring window
/CUToff=23                      Windows scoring above cutoff are overwritten
/MISScut=100                    Number of diagonals to examine after the
                                   last one containing a hit
/FILl="N"                       Overwrite with this character rather than "."
/SUMMARY                        Summarize files and results to screen

***SAD.TXT*****************************************************************
^*SAD\* (SearchAndDestroy)  Processes a Query sequence against one
or more test sequences and overwrites those areas in the Query with
significant similarity to one or more regions in the test sequences.
Useful for obliterating repetitive sequences or vector before doing
a database search.  It is similar to FASTA in that it selects
diagonals by processing the list of all Nmers in both sequences, and
then overwrites any sequence window on that diagonal that has at least
"cutoff" identical bases. The user may optionally change the overwrite
character from "." to something else, typically "N". 



More information about the Info-gcg mailing list

Send comments to us at biosci-help [At] net.bio.net