IUBio

Nucleotide consensus search summary & 2 ?s

Mindi Robinson Mindi.Robinson at m.cc.utah.edu
Fri Jul 4 23:30:57 EST 1997


Hi again,

>   I have a question about searching for a short nucleotide consensus
> sequence.  The sequence is only 10 nucleotides long and represents a
> protein binding site with in a promoter.Specifically, I want to search
> the E. coli genome for this consensus sequence.  I have tried a fasta
> and blastn search in GCG without success. Is it possible to search for
> such a short consensus in GCG databases such as - [nr n Non-redundant
> GenBank+EMBL+DDBJ+PDB sequences]?


I wanted to send all the responses I received to the question I posted above, for 
anybody who is interested.  All the responses I got are pasted in below (Numbered 
[I] through [V]).  Thank you for all your help!  Findpatterns worked great.  I 
searched all bacterial sequences in Genbank/EMBL and found 40 hits that matched 
my 10 nucleotide consensus perfeclty.  Now all I have to do is figure out if the 
matches are biologically relevant ;- ).

  I also followed a suggestion one GCGer gave to try a fasta search with a word 
size (k-tuple) of one instead of six but still did not have get any results back.  
For anybody interested I pasted in a captured session file below of a fasta 
search of the banks bacterial sequences with a 10 nucleotide sequence.  This 
particular file is a copy of my search with a k-tuple size of six but I got 
similar results with a k-tuple size of one too.  I have two additional questions 
for anyone who can help me.  

1)  What is an E value?  the default setting used in my fasta search was 2.0 (see 
below).  I'd like to learn how to adjust that paramater if that is worthwhile to 
learn about.

2)  The results from my fasta search said "1919 scores saved that exceeded 71" 
but no best scores were listed.  Are the 1919 scores saved available for review?  
It seemed odd to me that scores were saved but no "best scores" were listed in my 
results file.

Any way thank you very much for you help to get me this far.  Any further 
comments will be icing on the cake.

sincerely,
Nate Weyand  

Nate_Weyand at hlthsci.med.utah.edu

THE RESPONSES: 
******************************************************************************
[I] FASTA should work just fine, but use a ktup=1 instead of the default ktup=6
for DNA.

Bill Pearson

[II] Dear Nathan,

	What exactly do you mean "without success"? 10 nucleotides
however is too short for a BLAST search. Have you tried
findpatterns, which is  a program that can scan a database
for an exact match to a sequence or an arbitrary number of 
mismatches? This program can work on GEP:* which is the 
latest nonredundant database in the GCG release. However 
if your site updates them more frequently than the bimonthly
GCG release it would be that much more recent. Another
program that you might wish to try is Ssearch (full Smith 
Waterman Search) which is available from William Pearson as 
part of his Fasta release.I hope that this is of some help.
Best wishes,
Rich

[III] Hi,

If you can put the consensus in the form of a pattern or set of patterns then 
the GCG "findpatterns" program works pretty well, though the speed does not 
compare to BLAST.  

Findpatterns works best if you can define a sequence pattern that is specific, 
but includes all of the "real" binding sites.  It does not allow positional 
weighting, so the best thing to give it is fully conserved residues/IUPAC 
specifications and allowed spacings between conserved groups.  In addition, if 
there are correlated dinucleotide variations they are best searched as separate 
patterns instead of via a single generalization of the pattern.

EG:

CNGGATNA{5,7}TNATCCNG
CNGGTANA{5,7}TNTAGGNG

is much better than:

CNGGNNNA{5,7}TNNNCCNG 

Findpatterns also lets you allow mismatches, but in general for short, 
degenerate patterns without positional weighting this will just decrease your 
signal to noise ratio.  Better to explicitly search additional patterns.

Findpatterns patterns are specified in the same data file format as restriction 
enzymes for "MAP", etc.  Do "fetch pattern.dat" to get an example file you can 
modify with your own patterns.   Do "genhelp findpatterns" to see a description 
of the program.  The pattern syntax is described under "defining patterns" and 
making and using your own pattern files are described under "pattern file" and  
"local data files".  Good luck,

Mike Lonetto

[IV] I'm off site this week so can't check the GCG documentation. But I'm sure
that you will discover that you can 'tweak' FINDPATTERNS to do what you are
looking for. I don't know the switches you need to specify, but the
documentation should help with that. Findpatterns is specifically designed
for looking for short sequence matches to either a single sequence or
sequences in a database.

E. Victoria Porter, Ph.D

[V] Fasta and blastn probably won't work on such short sequences.  One
possibility is FINDPATTERNS permitting, say, 2 mismatches (there will
probably be too many "hits" with 3).  However, if you have several of
these sequences and if the contribution of the individual bases is unequal
(e.g. the T1, A2 and T6 in the -10 box are more conserved than the others)
then you can align the known sequences (make individual sequence files and
then use PILEUP; or enter them directly using LINEUP).  You can then use
PROFILEMAKE and then PROFILESEARCH.  These latter two programs are meantfor 
proteins but can be gotten to work with DNA sequences - don't forget to use  
-MATRix=profiledna.cmp  with PROFILEMAKE.  I have used these
programs successfully to hunt for consensus sequences at the ends of
"59-base elements" in integrons (see Fig. 7 of Collis and Hall, Molecular
Microbiology 6: 2875-2885 (1992).

 Paul H. Roy 


MY CAPTURED FASTA SESSION:
******************************************************************************
$ fasta GATCMOTIF.;1 /bat

FastA does a Pearson and Lipman search for similarity between a query
sequence and a group of sequences of the same type (nucleic acid or
protein). For nucleotide searches, FastA may be more sensitive than BLAST.

                  Begin (* 1 *) ?
                End (*    10 *) ?

 Search for query in what sequence(s) (* GenEMBL:* *) ?  Bacterial:* *

 What word size (* 6 *) ?  9
 Word size must be in the range from 1 to 6.
 What is Word size (* 6 *) ?

 Don't show scores whose E() value exceeds: (* 2.0 *):

 What should I call the output file (* Gatcmotif.Fasta *) ?

 ** fasta will run as a batch or at job.

 ** fasta was submitted using the command:
    "  SUBMIT/NOPRINT/NOTIFY  "

Job FASTA_325312_1 (queue SYS$BATCH, entry 182) started on SYS$BATCH


$ type GATCMOTIF.FASTA;1
!!SEQUENCE_LIST 1.0


(Nucleotide) FASTA of: Gatcmotif.  from: 1 to: 10  June 18, 1997 13:25

GATC motif

 TO: Bacterial:*  Sequences:     33,599  Symbols: 77,340,372  Word Size: 6

 Databases searched:
   EMBL, Release 50.0, Released on 18Feb97, Formatted on 17Apr1997
   GenBank, Release 100.0, Released on 7Apr97, Formatted on 16Apr1997

 Searching with both strands of the query.
 Scoring matrix: GenRunData:Fastadna.Cmp
 Constant pamfactor used
 Gap creation penalty: 16      Gap extension penalty: 4

Results sorted and z-values calculated from opt score
 1919 scores saved that exceeded 71
 20596 optimizations performed
 Joining threshold: 45, optimization threshold: 30, opt. width: 16

The best scores are:                    init1 initn   opt    z-sc E(41501)..

\\End of List

! CPU time used:
!        Database scan:  0:01:36.3
! Post-scan processing:  0:00:00.1
!       Total CPU time:  0:01:36.5
! Output File: Gatcmotif.Fasta



More information about the Info-gcg mailing list

Send comments to us at biosci-help [At] net.bio.net