searching for sequences homologous to fragments

Tue Nov 9 17:42:23 EST 1993

Peter -

> We have come across an interesting problem and wondered if anyone
> had any insights or alternative strategies.
> One of our users has several short polypeptide sequences believed to be
> from the same gene. He wants to search protein sequence databases
> to find sequences that are homologous to all of these fragments. The
> user anticipates 30-40% sequence similarity overall between the
> fragments and their hits.
> Searches have been done with the individual fragment sequences,
> which have been useful, but the user wishes to combine them all in
> one search. I would expect that others will want to be doing similar
> in the future, so a simple but effective strategy would be very useful.
> Thoughts so far:
> 1) Concatenate sequence files together and run against the databases
>    using the blast program.
>     -lose information however, because you know that you likely have gaps
>      between the fragments.
> 2) Run fasta with the different permutations of the arrangement of the
>    fragments, each in a different file, with a low gap penalty.
>    -fine for a small number of fragments, but the number of permutations
>     soon increases.
> 3) A recursive searching of the sequence databases:
>     First fragment -get top 500 hits
>                    -make a database with the hits
>     Second fragment-get top 50 hits
>                    -make another database with the hits.
>     Third fragment -get top 5 hits
>     The database could be made by editing the output from the
>     sequence search of the first fragment produced by fasta (or Blast)
>     and producing a file of (database) sequence names. 

  Here is another alternative. It is similar to your strategies #1 and #2, but
should address your concern about gaps.

  You don't mention in your query whether you have access to the GCG package,
but since your question is posted to the Info-GCG group, I'll assume that you
do.  I have a program (TriPatternGen) that utilizes the excellent GCG program 
FindPatterns to search for triplets of sequence motifs.  TriPatternGen is
written in VAX/VMS Pascal, and _is_, without apology, VMS dependent.  Source 
and/or executable versions of the program are available if you are interested. 

  FindPatterns will search either protein or nucleic acid sequence databases
with a query sequence (generally short), and allows ambiguous patterns and 
mismatches. See the GCG-FindPatterns manual page for additional information 
on specifying patterns.

  TriPatternGen takes a set of three files with motifs, and combines them to
produce a pattern file suitable as input to FindPatterns.  In addition, 
TriPatternGen can insert a variable spacer between the motifs, by utilizing
the FindPatterns pattern specification syntax "(X){I,J}". For instance:

if the "PreMotif" file contains the following 3 patterns:

ABC
CDE
FGHI

the "MidMotif" file contains the following 2 patterns:

MNOP
QRS

and the "PostMotif" file contains the following 2 patterns:

TUV
WXYZ

with a constant intervening pattern of (X){1,5} between the pre and mid motifs
and a constant intervening pattern of (X){2,3} between the mid and post motifs.

The following pattern.dat file would be created:

A Motif pattern file created by TriPatternGen, for
use in conjunction with the GCG FindPatterns Program, for example:


Name             Offset  Pattern                       Overhang  Documentation  ..
Motif_1_&_1_&_1  1       ABC(X){1,5}MNOP(X){2,3}TUV    0         !
Motif_1_&_1_&_2  1       ABC(X){1,5}MNOP(X){2,3}WXYZ   0         !
Motif_1_&_2_&_1  1       ABC(X){1,5}QRS(X){2,3}TUV     0         !
Motif_1_&_2_&_2  1       ABC(X){1,5}QRS(X){2,3}WXYZ    0         !
Motif_2_&_1_&_1  1       CDE(X){1,5}MNOP(X){2,3}TUV    0         !
Motif_2_&_1_&_2  1       CDE(X){1,5}MNOP(X){2,3}WXYZ   0         !
Motif_2_&_2_&_1  1       CDE(X){1,5}QRS(X){2,3}TUV     0         !
Motif_2_&_2_&_2  1       CDE(X){1,5}QRS(X){2,3}WXYZ    0         !
Motif_3_&_1_&_1  1       FGHI(X){1,5}MNOP(X){2,3}TUV   0         !
Motif_3_&_1_&_2  1       FGHI(X){1,5}MNOP(X){2,3}WXYZ  0         !
Motif_3_&_2_&_1  1       FGHI(X){1,5}QRS(X){2,3}TUV    0         !
Motif_3_&_2_&_2  1       FGHI(X){1,5}QRS(X){2,3}WXYZ   0         !
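
The combination step is a Cartesian product over the three motif lists, with
the pre motif varying slowest and the post motif fastest. A minimal Python
sketch of that step follows. It is not the TriPatternGen source (which is
VAX/VMS Pascal); the names tri_patterns and format_line and the column widths
are illustrative assumptions.

```python
from itertools import product


def tri_patterns(pre, mid, post, spacer1, spacer2):
    """Yield (name, pattern) pairs for every pre/mid/post combination."""
    # Number motifs from 1 so the names follow the Motif_i_&_j_&_k scheme.
    indexed = [list(enumerate(motifs, start=1)) for motifs in (pre, mid, post)]
    for (i, a), (j, b), (k, c) in product(*indexed):
        yield f"Motif_{i}_&_{j}_&_{k}", f"{a}{spacer1}{b}{spacer2}{c}"


def format_line(name, pattern, offset=1, overhang=0):
    """Format one data line: name, offset, pattern, overhang, documentation.

    The column widths are arbitrary (an assumption for readability only).
    """
    return f"{name:<17}{offset:<8}{pattern:<30}{overhang:<10}!"


pre = ["ABC", "CDE", "FGHI"]
mid = ["MNOP", "QRS"]
post = ["TUV", "WXYZ"]

rows = list(tri_patterns(pre, mid, post, "(X){1,5}", "(X){2,3}"))
lines = [format_line(name, pattern) for name, pattern in rows]
print("\n".join(lines))
```

With the three example files this yields 3 x 2 x 2 = 12 patterns, in the same
order as the listing.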

Notes:  The maximum allowed length of a pattern is 175 characters, although
        the pattern specification can address much longer sequences (on
        the order of 350,000 symbols).

        The maximum number of patterns allowed by FindPatterns is 2,000.

        FindPatterns uses the "(X){I,J}" syntax to indicate a pattern
        (i.e. "X") repeated a minimum of "I" times and a maximum of "J" times.

        The Pre, Mid, and Post motif files can be the same file.

        Use "" to indicate a null intervening pattern.
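
Because the output holds one pattern per combination, the product of the three
file sizes has to stay within the 2,000-pattern limit, and the longest motif
from each file plus both spacers has to fit in 175 characters. The Python
sketch below checks both before a run. The function name check_limits is
hypothetical, and it assumes the 175-character limit counts the pattern text
exactly as written, spacers included.

```python
MAX_PATTERN_LENGTH = 175  # limit quoted in the notes
MAX_PATTERNS = 2000       # limit quoted in the notes


def check_limits(pre, mid, post, spacer1, spacer2):
    """Return a list of problems found; an empty list means none."""
    problems = []

    # One output pattern is written per (pre, mid, post) combination.
    total = len(pre) * len(mid) * len(post)
    if total > MAX_PATTERNS:
        problems.append(f"{total} combinations exceed {MAX_PATTERNS} patterns")

    # The longest combined pattern uses the longest motif from each file.
    # Assumption: the limit applies to the pattern text as written.
    if pre and mid and post:
        longest = (max(map(len, pre)) + len(spacer1)
                   + max(map(len, mid)) + len(spacer2)
                   + max(map(len, post)))
        if longest > MAX_PATTERN_LENGTH:
            problems.append(
                f"longest pattern is {longest} characters, "
                f"over the limit of {MAX_PATTERN_LENGTH}"
            )
    return problems


example = check_limits(["ABC", "CDE", "FGHI"], ["MNOP", "QRS"],
                       ["TUV", "WXYZ"], "(X){1,5}", "(X){2,3}")
print(example or "within limits")
```

For instance, three files of 20 motifs each would give 8,000 combinations and
would need to be split across several runs.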

TriPatternGen requires:

Input files:   Three files of motifs, one motif per line, max number: 2000;
               Default names are PreMotif.dat, MidMotif.dat and PostMotif.dat

Output file:   A FindPatterns compatible file. Default file name is Pattern.dat


      First, install TriPatternGen as a foreign VMS command:

$ PtGen :== $Sys$login:TriPatternGen.Exe

      Next, run the program:

$ PtGen File_Of_Motifs File_Of_Motifs File_Of_Motifs Output_Filename -
_$ Fixed_Pattern1 Fixed_Pattern2 

      for example:

$ PtGen PreMotif.dat MidMotif.dat PostMotif.dat Pattern.dat (n){1,7} (N){0,6}

      Now run the GCG-FindPatterns program:

$ FindPatterns/Data=Pattern.dat/Mismatch=Number_of_Mismatches_Allowed database:*

The FindPatterns output will show matching candidates from the database(s).
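
Before committing to a database run, a generated pattern can be sanity-checked
against a sequence that is known to contain the motifs. The Python sketch
below translates the subset of the syntax used in this message (literal
residues plus "(X){I,J}" spacers) into a Python regular expression. It is an
approximation, not a FindPatterns emulator: it assumes X stands for any single
residue, it ignores the /Mismatch option, and the function name to_regex is
hypothetical.

```python
import re

# Matches a spacer written as (X){I,J}, capturing I and J.
SPACER = re.compile(r"\(X\)\{(\d+),(\d+)\}")


def to_regex(pattern):
    """Translate literal residues plus (X){I,J} spacers into a Python regex."""
    parts = []
    pos = 0
    for m in SPACER.finditer(pattern):
        # Literal residues before the spacer are matched exactly.
        parts.append(re.escape(pattern[pos:m.start()]))
        # Assumption: X means "any residue", so the spacer becomes .{I,J}
        parts.append(".{%s,%s}" % (m.group(1), m.group(2)))
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts))


rx = to_regex("ABC(X){1,5}MNOP(X){2,3}TUV")
hit = rx.search("GGABCDDMNOPEETUVGG")    # spacers of length 2 and 2
miss = rx.search("GGABCMNOPEETUVGG")     # first spacer is empty, so no match
print(hit.group() if hit else None, miss)
```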

   Let me know if you would like to have a copy, or if you would be interested
in a test run of the program on our machines here.  

-Mark Gunnell
Mark A. Gunnell                   | Internet: gunnell at ncifcrf.gov
Sci. Applications Analyst         | Bitnet:   gunnell%ncifcrf.gov at cunyvm.bitnet
Biomedical Supercomputer Center   | Phone:   (301) 846-5779
PRI/DynCorp                       | FAX:     (301) 846-5762
NCI-FCRDC                         |
PO Box B, Bldg 430                |
Frederick, MD 21702-1201  USA     |
