To whom it may concern,
An E. coli pili promoter I study has two GATC sites that are differentially
methylated in vivo by deoxyadenosine methylase, resulting in specific
methylation patterns. The two GATC sites are spaced 103 base pairs apart
(that is the gAtc-to-gAtc distance, counting from A to A within each GATC;
see below for a schematic depiction, if interested). We have experimental
data indicating that the sequences around these GATCs are important for
regulation of transcription from our promoter. Recently I have used GCG's
FindPatterns (with patterns based on sequences around the GATCs) to locate
over 15 different pili operons in E. coli that are similar but not identical
to the one we study.
One pattern I used was
I have been using GCG's PileUp to align the regulatory regions (nucleotide
sequences of 250 base pairs) of all 15 pili operons. I have used
PlotSimilarity and ProfileMake to find other shared regions of similarity.
All 15 that I have found have strong similarity around the GATC sites
(due to the way I searched for them), and a subset share considerable
similarity around the -35 and -10 regions of the promoter.
My goal is to analyze the 15 operons to gain an understanding of which
bases are most important (least variant).
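
To make "least variant" concrete, here is a minimal sketch in Python (not
part of the GCG package) of one common way to score it: count the bases in
each column of an existing alignment and report the most frequent base plus
an information value in bits (2 bits = invariant, 0 bits = all four bases
equally frequent). The function name column_conservation and the toy
sequences are hypothetical, made up for illustration only; the sketch
assumes the sequences are already aligned, of equal length, and contain only
A, C, G, T and gap characters.

```python
import math
from collections import Counter


def column_conservation(aligned_seqs):
    """Return a (most_frequent_base, information_bits) pair per column.

    Assumes pre-aligned sequences of equal length containing only
    A/C/G/T plus '-' or '.' as gap characters.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    results = []
    for i in range(length):
        # Collect the bases in this column, skipping gaps.
        column = [s[i].upper() for s in aligned_seqs if s[i] not in "-."]
        if not column:
            results.append(("-", 0.0))
            continue
        counts = Counter(column)
        total = len(column)
        # Shannon entropy of the observed base frequencies.
        entropy = -sum((n / total) * math.log2(n / total)
                       for n in counts.values())
        # 2 bits is the maximum entropy for four bases.
        info = 2.0 - entropy
        # Ties are broken alphabetically so the result is deterministic.
        consensus = max(sorted(counts), key=counts.get)
        results.append((consensus, info))
    return results


# Toy alignment (made-up sequences, NOT real pili operon data).
toy_alignment = [
    "TTGATCAA",
    "TCGATCAT",
    "TAGATCAG",
    "TGGATCAC",
]

for pos, (base, bits) in enumerate(column_conservation(toy_alignment), start=1):
    print(pos, base, round(bits, 2))
```

With only 15 sequences the raw values will overstate conservation somewhat,
so they are probably best read as a ranking of columns rather than as
absolute numbers.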
My questions for those who know more than I:
1. I would like to make consensus sequences for the regions around each
GATC and for the promoter (-35 and -10 regions). How should I make consensus
sequences for regions that share similarity? I have been using 15
sequences, each 250 bases in length, to create my alignments. I fear this is
too much and that I should split the region into smaller segments for
construction of a consensus. However, I don't know how many bases around
each GATC I should include when I generate a consensus for each region of
2. How do I decide which of the 15 pili operons to include in the analysis
to make consensus sequences? None of the sequences are identical, but some
are more closely related than others. Only eight of the 15 pili operons
clearly have E. coli -35 and -10 promoter regions. So should I eliminate
the remaining ones from any programs I use to generate a consensus around the
Schematic: The expression states associated with methylation patterns at the
GATC sites are indicated.
distance in bp (from start site of transcription)
            154                 52         +1
OFF STATE   GATC . . 103 bp . . GATC . . . -> OFF (no transcription)
ON STATE    GATC . . 103 bp . . GATC . . . -> ON (transcription occurs)
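
The spacing in the schematic can also be checked directly in a candidate
sequence. The following is a rough Python sketch, not what FindPatterns does
internally: it lists every pair of GATC sites whose A-to-A distance equals a
given spacing (this is the same as the start-to-start distance, since the A
sits at the same offset in each site). The function name gatc_pairs and the
toy sequence are hypothetical.

```python
def gatc_pairs(seq, spacing=103, motif="GATC"):
    """Return 0-based start positions of motif pairs `spacing` bases apart."""
    seq = seq.upper()
    # Start positions of every occurrence of the motif (overlaps allowed).
    starts = [i for i in range(len(seq) - len(motif) + 1)
              if seq.startswith(motif, i)]
    start_set = set(starts)
    # Keep each site that has a partner exactly `spacing` bases downstream.
    return [(s, s + spacing) for s in starts if s + spacing in start_set]


# Toy sequence (made up): two GATC sites separated by 99 filler bases,
# so the second site starts 4 + 99 = 103 bases after the first.
toy_seq = "GATC" + "T" * 99 + "GATC" + "TTT"
print(gatc_pairs(toy_seq))
```

Positions are reported as 0-based offsets into the sequence that is passed
in, so they would need converting to coordinates relative to the +1 site.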
Thanks in advance for any suggestions!