In article <01GLMO0VRQX800036Y at CINVESMX.BITNET> FVEGA at CINVESMX.BITNET writes:
... mailer routing messages deleted...
> Dear Netters,
>
> I have a problem in database searching that I hope someone out
> there could help me. I am interested in locating genes of E. coli that
> end with the UGA stop codon and which partially overlaps the AUG start
> codon of the following gene. That is, I am looking for a AUGA pattern,
> but just those cases that indeed are gene overlapings.
>
> I used PatternSearch from the GCG Package, but as most can imagine
> the background of all AUGA sub-sequences that does not correspond to real
> gene overlapings is enormously high. I dont dare to inspect this output
> to locate by hand (looking at each GenBank entry) the real overlapings..
>
> Does someone knows a software that could do this for me?
>
> Many thanks,
>
> Francisco M. De La Vega
> Department of Genetics and Molecular Biology
> CINVESTAV-IPN, Mexico City, Mexico.
> E-Mail: FVEGA at CINVESMX.Bitnet
The XYLEM package can get you part of the way. Just as an experiment, I
took the following steps:
1) Create a list of E. coli sequences.
Since XYLEM creates index files with GenBank LOCUS names in the order
they appear in the files, all names for a given species are grouped
together in the index. By pulling out the block of index lines for E. coli,
we get a list of all E. coli sequences. (I did this with a single
command using the vi editor.) This file is called ECO.nam. The first
10 lines are shown below:
ECO16S23S X12420 74316 48682
ECO1721DNA X61367 74356 48691
ECO21SUL1 X15371 74461 48841
ECO2MIN X55034 74515 48865
ECO3926PA X14236 75158 49244
ECO42RNA X01895 75216 49286
ECO5388 V00252 75238 49289
ECO571MR M74821 75267 49305
ECO5CPDB X54008 75341 49384
ECO5ERNAA M16640 75382 49393
etc......
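For anyone who would rather not do the extraction by hand in vi, here is a
rough Python sketch of the same idea. It assumes the index is a plain text
file with the LOCUS name as the first whitespace-separated field, and that
the E. coli names all begin with ECO, as they do in the lines above. The
function name and the non-E. coli sample line are made up for illustration;
none of this is part of XYLEM.

```python
def select_by_prefix(lines, prefix="ECO"):
    """Return the index lines whose first field (LOCUS name) starts with prefix."""
    selected = []
    for line in lines:
        fields = line.split()
        # Skip blank lines; keep lines whose LOCUS name carries the prefix.
        if fields and fields[0].startswith(prefix):
            selected.append(line.rstrip("\n"))
    return selected


# Small in-memory demonstration. The first line is an invented placeholder
# standing in for an entry from some other species.
sample = [
    "XXXFAKE1 A00000 1 2",
    "ECO16S23S X12420 74316 48682",
    "ECO1721DNA X61367 74356 48691",
]
for entry in select_by_prefix(sample):
    print(entry)
```

To use it on a real index, pass the open file object instead of the sample
list and redirect the printed lines to ECO.nam.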
2) This list can now be used as input for the FEATURES program, which
will extract all protein coding sequences (CDS) as shown in the user
menu below:
___________________________________________________________________
FEATURES - Version 15 May 92
___________________________________________________________________
Features: CDS
Entries: ECO.nam
Database: /home/psgendb/GenBank/gbbct
___________________________________________________________________
Parameter Description Value
-------------------------------------------------------------------
1).................... FEATURES TO EXTRACT ....................> f
f:Type a feature at the keyboard
F:Read a list of features from a file
2)....................ENTRIES TO BE PROCESSED (choose one).....> N
Keyboard input - n:name a:accession # e:expression
File input - N:name(s) A:accession #(s) E:expression(s)
3)....................WHERE TO GET IT .........................> u
u:User-defined database subset g:complete GenBank database
4)....................WHERE TO SEND IT ........................> a
s:Each feature to a separate file a:All output to same file
---------------------------------------------------------------
Type number of your choice or 0 to continue:
0
Messages will be written to ECO.msg
Final sequence output will be written to ECO.out
Expressions will be written to ECO.exp
Extracting features...
When the program finishes, there are four files in our directory:
-rw------- 1 psgendb 95507 Jun 26 10:17 ECO.exp
-rw------- 1 psgendb 1248705 Jun 26 10:19 ECO.msg
-rw------- 1 psgendb 72770 Jun 26 10:03 ECO.nam
-rw------- 1 psgendb 2374474 Jun 26 10:19 ECO.out
Since ECO.out contains the DNA sequence for each CDS, it is quite
straightforward to look for all sequences beginning with atga. You
could write a fairly simple program that searches the .out file and
writes a new namefile with the names of those sequences beginning with
atga. You can almost do this with grep, since
egrep -n '^atga' ECO.out > atga.out
writes a file containing numbered output of all lines starting with atga:
109:atgacaaagttgcagccgaatacagtgatccgtgccgccctggacctgtt
179:atgagccagcaagtcattattttcgataccacattgcgcgacggtgaaca
185:atgactcattccacggcaatggattctgtttttatcagaacccgtatctt
210:atgatgcattgcataccgtgggtggtattgatcatgtattagttcgtcat
225:atgaccgaacgacgaacaatctggcaaagtactgcccaaatgccactgtt
291:atgatggaaaactataaacatactacggtgctgctggatgaagccgttaa
311:atgatcagcagagtgacagaagctctaagcaaagttaaaggatcgatggg
320:atgaaagcagcggcgaaaacgcagaaaccaaaacgtcaggaagaacatgc
388:atgattagcgtaacccttagccaacttaccgacattctcaacggtgaact
516:atgaatacacaacaattggcaaaactgcgttccatcgtgcccgaaatgcg
etc...
There are 1319 such lines in ECO.out. While most of these are probably the
beginning of a CDS, some will be lines from the interior of a sequence that
happen to start with atga, so it is best to have a program eliminate the
false positives for you.
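As a rough idea of what such a program might look like, here is a short
Python sketch. It ASSUMES that each CDS in the .out file is a header line
beginning with ">" followed by one or more lines of sequence; I have not
shown that layout above, so adjust the header test if your file differs.
The function name and the demo records are made up for illustration.

```python
def names_starting_with(lines, motif="atga"):
    """Return names of records whose sequence begins with motif.

    ASSUMPTION: a record is a '>' header line followed by sequence lines.
    Only the first len(motif) bases of each record are examined, so lines
    inside a sequence that happen to start with the motif are ignored.
    """
    names = []
    name = None   # name of the record currently being read
    start = ""    # first bases collected so far for that record
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new record begins: decide about the previous one first.
            if name is not None and start == motif:
                names.append(name)
            fields = line[1:].split()
            name = fields[0] if fields else ""
            start = ""
        elif name is not None and len(start) < len(motif):
            # Collect bases until we have as many as the motif is long.
            start += line.lower()[: len(motif) - len(start)]
    if name is not None and start == motif:
        names.append(name)
    return names


# Invented demo records: geneB has an interior line starting with atga,
# which a plain grep would report but this function does not.
demo = [
    ">geneA", "atgacaaagttg", "atgatt",
    ">geneB", "ttgaccgaacga", "atgaaa",
]
print(names_starting_with(demo))  # prints ['geneA']
```

This only tells you which CDSs begin with atga. Deciding whether the tga
really is the stop codon of an upstream gene still requires comparing the
coordinates of neighbouring CDSs.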