Hi again,
> I have a question about searching for a short nucleotide consensus
> sequence. The sequence is only 10 nucleotides long and represents a
> protein binding site with in a promoter.Specifically, I want to search
> the E. coli genome for this consensus sequence. I have tried a fasta
> and blastn search in GCG without success. Is it possible to search for
> such a short consensus in GCG databases such as - [nr n Non-redundant
> GenBank+EMBL+DDBJ+PDB sequences]?
I wanted to send all the responses I received to the question I posted above, for
anybody who is interested. All the responses I got are pasted in below (Numbered
[I] through [V]). Thank you for all your help! Findpatterns worked great. I
searched all bacterial sequences in Genbank/EMBL and found 40 hits that matched
my 10 nucleotide consensus perfeclty. Now all I have to do is figure out if the
matches are biologically relevant ;- ).
I also followed a suggestion one GCGer gave to try a fasta search with a word
size (k-tuple) of one instead of six but still did not have get any results back.
For anybody interested I pasted in a captured session file below of a fasta
search of the banks bacterial sequences with a 10 nucleotide sequence. This
particular file is a copy of my search with a k-tuple size of six but I got
similar results with a k-tuple size of one too. I have two additional questions
for anyone who can help me.
1) What is an E value? the default setting used in my fasta search was 2.0 (see
below). I'd like to learn how to adjust that paramater if that is worthwhile to
learn about.
2) The results from my fasta search said "1919 scores saved that exceeded 71"
but no best scores were listed. Are the 1919 scores saved available for review?
It seemed odd to me that scores were saved but no "best scores" were listed in my
results file.
Any way thank you very much for you help to get me this far. Any further
comments will be icing on the cake.
sincerely,
Nate Weyand
Nate_Weyand at hlthsci.med.utah.edu
THE RESPONSES:
******************************************************************************
[I] FASTA should work just fine, but use a ktup=1 instead of the default ktup=6
for DNA.
Bill Pearson
[II] Dear Nathan,
What exactly do you mean "without success"? 10 nucleotides
however is too short for a BLAST search. Have you tried
findpatterns, which is a program that can scan a database
for an exact match to a sequence or an arbitrary number of
mismatches? This program can work on GEP:* which is the
latest nonredundant database in the GCG release. However
if your site updates them more frequently than the bimonthly
GCG release it would be that much more recent. Another
program that you might wish to try is Ssearch (full Smith
Waterman Search) which is available from William Pearson as
part of his Fasta release.I hope that this is of some help.
Best wishes,
Rich
[III] Hi,
If you can put the consensus in the form of a pattern or set of patterns then
the GCG "findpatterns" program works pretty well, though the speed does not
compare to BLAST.
Findpatterns works best if you can define a sequence pattern that is specific,
but includes all of the "real" binding sites. It does not allow positional
weighting, so the best thing to give it is fully conserved residues/IUPAC
specifications and allowed spacings between conserved groups. In addition, if
there are correlated dinucleotide variations they are best searched as separate
patterns instead of via a single generalization of the pattern.
EG:
CNGGATNA{5,7}TNATCCNG
CNGGTANA{5,7}TNTAGGNG
is much better than:
CNGGNNNA{5,7}TNNNCCNG
Findpatterns also lets you allow mismatches, but in general for short,
degenerate patterns without positional weighting this will just decrease your
signal to noise ratio. Better to explicitly search additional patterns.
Findpatterns patterns are specified in the same data file format as restriction
enzymes for "MAP", etc. Do "fetch pattern.dat" to get an example file you can
modify with your own patterns. Do "genhelp findpatterns" to see a description
of the program. The pattern syntax is described under "defining patterns" and
making and using your own pattern files are described under "pattern file" and
"local data files". Good luck,
Mike Lonetto
[IV] I'm off site this week so can't check the GCG documentation. But I'm sure
that you will discover that you can 'tweak' FINDPATTERNS to do what you are
looking for. I don't know the switches you need to specify, but the
documentation should help with that. Findpatterns is specifically designed
for looking for short sequence matches to either a single sequence or
sequences in a database.
E. Victoria Porter, Ph.D
[V] Fasta and blastn probably won't work on such short sequences. One
possibility is FINDPATTERNS permitting, say, 2 mismatches (there will
probably be too many "hits" with 3). However, if you have several of
these sequences and if the contribution of the individual bases is unequal
(e.g. the T1, A2 and T6 in the -10 box are more conserved than the others)
then you can align the known sequences (make individual sequence files and
then use PILEUP; or enter them directly using LINEUP). You can then use
PROFILEMAKE and then PROFILESEARCH. These latter two programs are meantfor
proteins but can be gotten to work with DNA sequences - don't forget to use
-MATRix=profiledna.cmp with PROFILEMAKE. I have used these
programs successfully to hunt for consensus sequences at the ends of
"59-base elements" in integrons (see Fig. 7 of Collis and Hall, Molecular
Microbiology 6: 2875-2885 (1992).
Paul H. Roy
MY CAPTURED FASTA SESSION:
******************************************************************************
$ fasta GATCMOTIF.;1 /bat
FastA does a Pearson and Lipman search for similarity between a query
sequence and a group of sequences of the same type (nucleic acid or
protein). For nucleotide searches, FastA may be more sensitive than BLAST.
Begin (* 1 *) ?
End (* 10 *) ?
Search for query in what sequence(s) (* GenEMBL:* *) ? Bacterial:* *
What word size (* 6 *) ? 9
Word size must be in the range from 1 to 6.
What is Word size (* 6 *) ?
Don't show scores whose E() value exceeds: (* 2.0 *):
What should I call the output file (* Gatcmotif.Fasta *) ?
** fasta will run as a batch or at job.
** fasta was submitted using the command:
" SUBMIT/NOPRINT/NOTIFY "
Job FASTA_325312_1 (queue SYS$BATCH, entry 182) started on SYS$BATCH
$ type GATCMOTIF.FASTA;1
!!SEQUENCE_LIST 1.0
(Nucleotide) FASTA of: Gatcmotif. from: 1 to: 10 June 18, 1997 13:25
GATC motif
TO: Bacterial:* Sequences: 33,599 Symbols: 77,340,372 Word Size: 6
Databases searched:
EMBL, Release 50.0, Released on 18Feb97, Formatted on 17Apr1997
GenBank, Release 100.0, Released on 7Apr97, Formatted on 16Apr1997
Searching with both strands of the query.
Scoring matrix: GenRunData:Fastadna.Cmp
Constant pamfactor used
Gap creation penalty: 16 Gap extension penalty: 4
Results sorted and z-values calculated from opt score
1919 scores saved that exceeded 71
20596 optimizations performed
Joining threshold: 45, optimization threshold: 30, opt. width: 16
The best scores are: init1 initn opt z-sc E(41501)..
\\End of List
! CPU time used:
! Database scan: 0:01:36.3
! Post-scan processing: 0:00:00.1
! Total CPU time: 0:01:36.5
! Output File: Gatcmotif.Fasta