Peter -
> We have come across an interesting problem and wondered if anyone
> had any insights or alternative strategies.
>
> One of our users has several short polypeptide sequences believed to be
> from the same gene. He wants to search protein sequence databases
> to find sequences that are homologous to all of these fragments. The
> user anticipates 30-40% sequence similarity overall between the
> fragments and their hits.
>
> Searches have been done with the individual fragment sequences,
> which have been useful, but the user wishes to combine them all in
> one search. I would expect that others will want to do something similar
> in the future, so a simple but effective strategy would be very useful.
>
> Thoughts so far:
>
> 1) Concatenate the sequence files and run them against the databases
> using the blast program.
> - This loses information, however, because you know that you likely have
> gaps between the fragments.
>
> 2) Run fasta with the different permutations of the arrangement of the
> fragments, each in a different file, with a low gap penalty.
> - Fine for a small number of fragments, but the number of permutations
> soon increases.
>
> 3) A recursive search of the sequence databases:
>
> First fragment  - get top 500 hits
>                 - make a database with the hits
> Second fragment - get top 50 hits
>                 - make another database with the hits
> Third fragment  - get top 5 hits
>
> The database could be made by editing the output from the
> sequence search of the first fragment produced by fasta (or Blast)
> and producing a file of (database) sequence names.
Here is another alternative. It is similar to your strategies #1 and #2, but
should address your concern about gaps between the fragments.
You don't mention in your query whether you have access to the GCG package,
but since your question is posted to the Info-GCG group, I'll assume that you
do. I have a program (TriPatternGen) that utilizes the excellent GCG program
FindPatterns to search for triplets of sequence motifs. TriPatternGen is
written in VAX/VMS Pascal, and _is_, without apology, VMS dependent. Source
and/or executable versions of the program are available if you are interested.
FindPatterns will search either protein or nucleic acid sequence databases
with a query sequence (generally short), and allows ambiguous patterns and
mismatches. See the GCG-FindPatterns manual page for additional information
on specifying patterns.
TriPatternGen takes a set of three files with motifs, and combines them to
produce a pattern file suitable as input to FindPatterns. In addition,
TriPatternGen can insert a variable spacer between the motifs, by utilizing
the FindPatterns pattern specification syntax "(X){I,J}". For instance,
if the "PreMotif" file contains the following 3 patterns:

    ABC
    CDE
    FGHI

the "MidMotif" file contains the following 2 patterns:

    MNOP
    QRS

and the "PostMotif" file contains the following 2 patterns:

    TUV
    WXYZ

with a constant intervening pattern of (X){1,5} between the pre and mid
motifs, and a constant intervening pattern of (X){2,3} between the mid and
post motifs, then the following Pattern.dat file would be created:
=============================================================================
PATTERN.DAT
=============================================================================
A Motif pattern file created by TriPatternGen, for
use in conjunction with the GCG FindPatterns Program, for example:
$Findpatterns/Data=Pattern.dat
Name             Offset  Pattern                       Overhang  Documentation ..
Motif_1_&_1_&_1  1       ABC(X){1,5}MNOP(X){2,3}TUV    0         !
Motif_1_&_1_&_2  1       ABC(X){1,5}MNOP(X){2,3}WXYZ   0         !
Motif_1_&_2_&_1  1       ABC(X){1,5}QRS(X){2,3}TUV     0         !
Motif_1_&_2_&_2  1       ABC(X){1,5}QRS(X){2,3}WXYZ    0         !
Motif_2_&_1_&_1  1       CDE(X){1,5}MNOP(X){2,3}TUV    0         !
Motif_2_&_1_&_2  1       CDE(X){1,5}MNOP(X){2,3}WXYZ   0         !
Motif_2_&_2_&_1  1       CDE(X){1,5}QRS(X){2,3}TUV     0         !
Motif_2_&_2_&_2  1       CDE(X){1,5}QRS(X){2,3}WXYZ    0         !
Motif_3_&_1_&_1  1       FGHI(X){1,5}MNOP(X){2,3}TUV   0         !
Motif_3_&_1_&_2  1       FGHI(X){1,5}MNOP(X){2,3}WXYZ  0         !
Motif_3_&_2_&_1  1       FGHI(X){1,5}QRS(X){2,3}TUV    0         !
Motif_3_&_2_&_2  1       FGHI(X){1,5}QRS(X){2,3}WXYZ   0         !
=============================================================================
Notes:
  - The maximum allowed length of a pattern is 175 characters, although
    the pattern specification can address much longer sequences (on
    the order of 350,000 symbols).
  - The maximum number of patterns allowed by FindPatterns is 2,000.
  - FindPatterns uses the "(X){I,J}" syntax to indicate a pattern
    (i.e. "X") repeated a minimum of "I" times and a maximum of "J" times.
  - The Pre, Mid and Post motif files can be the same file.
  - Use "" to indicate a null intervening pattern.
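
For readers without a VMS system, the combination step described above can be sketched in a few lines of Python. This is not the TriPatternGen source (which is VAX/VMS Pascal); it is only an illustration of the same idea, and the function names tri_patterns and format_rows are made up for the example. The two limits are the ones quoted in the notes above.

```python
from itertools import product

MAX_PATTERN_LEN = 175   # pattern length limit quoted in the notes above
MAX_PATTERNS = 2000     # pattern count limit quoted in the notes above


def tri_patterns(pre, mid, post, spacer1="", spacer2=""):
    """Return (name, pattern) pairs for every pre/mid/post combination.

    spacer1 goes between the pre and mid motifs, spacer2 between the mid
    and post motifs; an empty string means no intervening pattern.
    """
    rows = []
    for (i, a), (j, b), (k, c) in product(
        enumerate(pre, 1), enumerate(mid, 1), enumerate(post, 1)
    ):
        pattern = f"{a}{spacer1}{b}{spacer2}{c}"
        if len(pattern) > MAX_PATTERN_LEN:
            raise ValueError(f"pattern too long: {pattern}")
        rows.append((f"Motif_{i}_&_{j}_&_{k}", pattern))
    if len(rows) > MAX_PATTERNS:
        raise ValueError(f"{len(rows)} patterns exceeds {MAX_PATTERNS}")
    return rows


def format_rows(rows):
    """Format rows as 'Name Offset Pattern Overhang Documentation' lines."""
    return [f"{name} 1 {pattern} 0 !" for name, pattern in rows]


if __name__ == "__main__":
    rows = tri_patterns(
        ["ABC", "CDE", "FGHI"], ["MNOP", "QRS"], ["TUV", "WXYZ"],
        "(X){1,5}", "(X){2,3}",
    )
    print("\n".join(format_rows(rows)))
```

The number of generated patterns is the product of the three file sizes (3 x 2 x 2 = 12 in the example), so the 2,000-pattern limit is reached quickly when the motif files grow.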
TriPatternGen requires:
  Input files: Three files of motifs, one motif per line, max number: 2000.
               Default names are PreMotif.dat, MidMotif.dat and PostMotif.dat.
  Output file: A FindPatterns-compatible file. Default file name is Pattern.dat.
Usage:

First, install TriPatternGen as a foreign VMS command:

    $ PtGen :== $Sys$login:TriPatternGen.Exe

Next, run the program:

    $ PtGen File_Of_Motifs File_Of_Motifs File_Of_Motifs Output_Filename -
    _$ Fixed_Pattern1 Fixed_Pattern2

for example:

    $ PtGen PreMotif.dat MidMotif.dat PostMotif.dat Pattern.dat (n){1,7} (N){0,6}

Now run the GCG-FindPatterns program:

    $Findpatterns/Data=Pattern.dat/Mismatch=Number_of_Mismatches_Allowed database:*

The FindPatterns output will show matching candidates from the database(s)
searched.
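
To make the variable-spacer semantics concrete, here is a rough Python sketch that treats "(X){I,J}" as "any I to J symbols" and tests a combined pattern against a toy sequence. It is only an approximation for illustration: it does not model FindPatterns' mismatch handling or its other ambiguity symbols, and the helper names to_regex and find_hits are invented for this example.

```python
import re


def to_regex(pattern):
    """Translate '(X){I,J}' spacers into regex '.{I,J}'; the rest is literal."""
    parts = re.split(r"(\(X\)\{\d+,\d+\})", pattern)
    out = []
    for part in parts:
        m = re.fullmatch(r"\(X\)\{(\d+),(\d+)\}", part)
        if m:
            # Spacer: any symbol, repeated between I and J times.
            out.append(".{%s,%s}" % (m.group(1), m.group(2)))
        else:
            # Motif text is matched exactly (no mismatches in this sketch).
            out.append(re.escape(part))
    return re.compile("".join(out))


def find_hits(pattern, sequence):
    """Return (1-based start, matched text) for non-overlapping matches."""
    regex = to_regex(pattern)
    return [(m.start() + 1, m.group(0)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(find_hits("ABC(X){1,5}MNOP(X){2,3}TUV", "KKABCLLMNOPGGTUVKK"))
```

With the toy sequence above, the motifs ABC, MNOP and TUV are separated by two symbols each, so both spacers are satisfied and a single hit is reported. A sequence with no symbols between ABC and MNOP would not match, because the first spacer requires at least one.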
Let me know if you would like to have a copy, or if you would be interested
in a test run of the program on our machines here.
-Mark Gunnell
-------------------------------------------------------------------------------
Mark A. Gunnell | Internet: gunnell at ncifcrf.gov
Sci. Applications Analyst | Bitnet: gunnell%ncifcrf.gov at cunyvm.bitnet
Biomedical Supercomputer Center | Phone: (301) 846-5779
PRI/DynCorp | FAX: (301) 846-5762
NCI-FCRDC |
PO Box B, Bldg 430 |
Frederick, MD 21702-1201 USA |
-------------------------------------------------------------------------------