Hi,
There is a program called PROSEARCH which searches the PROSITE "database of
motifs" for motifs that match your sequences. I have used this program on a Unix
system. You can also access this database directly on the WWW at:
http://expasy.hcuge.ch/sprot/prosite.html
The help file for the program follows; I hope it clarifies things further:
Prosearch reads in a file containing one or more protein sequences and
searches for patterns, described sites, or structures in the Prosite
Database compiled by Amos Bairoch. The output is sent to the named
file.
The input protein sequence file can be in any reasonable format.
The output is a table of sites, followed by the relevant sections from
the Prosite Database.
The Prosite Database is updated with every release of the SwissProt
database (about every three months).
INTRODUCTION
Over the past year or so Amos Bairoch (bairoch at earn.cgecmu51)
has released a number of versions of his Prosite database. This is a
database of patterns which have been associated with particular
enzymatic activities or structures. One example is the well-known pattern
for N-linked glycosylation, Asn-Xxx-Ser/Thr.
Amos has compiled a database that consists of references about
each pattern, validity of the patterns, occurrences, and a host of other
details. This database is of general use, and has been used by Amos in
his PC/Gene Suite of programs for analysis of DNA and Protein sequences.
I wanted to use this database on a Unix machine and be able to
ask the question, "Which of these patterns occur in sequence X?"
This is the second release of Prosearch. It completely
supersedes the first version, adding one important bug fix and support for
VMS, MS-DOS, and UNIX. Also, by using ReadSeq, a fine program from Don
Gilbert <gilbertd at silver.ucs.indiana.edu>, more protein data formats are
accessible.
IMPLEMENTATION
Most patterns can be expressed as regular expressions. For
example, the pattern '^P', when used with the Unix utility grep, matches
any line in the input that begins with a 'P'.
I translated all but 1 of the 337 patterns in Prosite to Unix
style regular expressions and wrote a simple searching program to search
a protein sequence for their occurrence. The pattern I did not
translate was PS00003, which is Tyrosine Sulfation. There is
no clean pattern for this modification.
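As an illustration of the translation step, here is a minimal sketch in
Python. It is not the Awk code that Prosearch itself uses, and the function
names prosite_to_regex and find_sites are made up for this example. It
assumes only a subset of the Prosite pattern syntax: 'x' for any residue,
[..] for allowed residues, {..} for excluded residues, and repeat counts
such as x(2) or x(2,4).

```python
import re


def prosite_to_regex(pattern):
    """Translate a Prosite-style pattern, e.g. 'N-{P}-[ST]-{P}', to a regex.

    Hypothetical helper: covers only x, [..], {..} and (n) / (n,m) repeats.
    """
    pieces = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Separate the residue part from an optional repeat count.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"      # excluded residues
        # [..] groups and single letters are already valid regex syntax.
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        pieces.append(core)
    return "".join(pieces)


def find_sites(pattern, sequence):
    """Return (start, end, matched_text) for every occurrence, 1-based.

    A lookahead is used so that overlapping occurrences are all reported.
    """
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(1) + 1, m.end(1), m.group(1))
            for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("[ST]-x(2)-[DE]"))
    print(find_sites("N-{P}-[ST]-{P}", "MNKSAQ"))
```

The positions returned are 1-based and inclusive, which is an assumption
made here for readability, not a statement about how Prosearch counts.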
The program is written in the Awk language, and runs on machines
which have either Nawk from AT&T, Gawk from the Free Software
Foundation, or one of several versions of Awk which run on MSDOS
compatibles.
INPUT FILES
Input files are any protein sequence files in an unstructured
format. AWK will accept the input on any number of lines of any length
(I've tried protein sequences up to 2500 amino acids on one line with
no problem). Each ASCII character will be interpreted as an amino acid,
and all letters must be capitalized. With 'readseq' any of a number of
formats can be used.
GCG-format files are accepted as input sequences.
Sequences with no sort of header are NOT accepted. If you have a raw sequence,
then add a comment line, as below:
>pep23a from my library C1
aggiiplmma
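For anyone preparing input by script, the header-plus-sequence layout above
can be read with something like the following Python sketch. The function
name read_sequences is hypothetical, and converting the residues to upper
case is an assumption made here (the help text asks for capital letters);
it is not a description of what Prosearch or readseq do internally.

```python
def read_sequences(lines):
    """Collect (header, sequence) pairs from '>'-style input lines.

    Raises ValueError if sequence data appears before any header line,
    mirroring the rule that raw, header-less sequences are not accepted.
    """
    records = []
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if name is not None:          # close the previous record
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is None:
            raise ValueError("sequence data found before any '>' header line")
        else:
            chunks.append(line.upper())   # assumption: normalise to capitals
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = [">pep23a from my library C1", "aggiiplmma"]
    for header, seq in read_sequences(example):
        print(header, seq)
```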
OUTPUT
There are two possible forms of output. The "short" form is a
table of accession numbers, positions in the sequence and short names
for patterns. The "long" form is the same except that the relevant
sections from the Prosite Database are also printed. At the HGMP-RC you
will be given the long form.
Here is an example of the output for E. coli chloramphenicol
acetyltransferase III:
Prosite Database -- Release 5.0 of April 1990 Copyright: Amos Bairoch
ProSearch Software -- Release 1.1 -- Copyright: Lee Kolakowski
The following patterns are in < ct.pep >:
Access#  From->To  Name                  Doc#
_______  ________  ____________________  _________
PS00001  2->6      ASN_GLYCOSYLATION     PDOC00001
PS00005  31->34    PKC_PHOSPHO_SITE      PDOC00005
PS00006  4->8      CK2_PHOSPHO_SITE      PDOC00006
PS00006  32->36    CK2_PHOSPHO_SITE      PDOC00006
PS00006  102->106  CK2_PHOSPHO_SITE      PDOC00006
PS00006  113->117  CK2_PHOSPHO_SITE      PDOC00006
PS00100  178->184  CAT                   PDOC00093
PS00100  204->210  CAT                   PDOC00093
{PDOC00001}
{PS00001; ASN_GLYCOSYLATION}
{BEGIN}
************************
* N-glycosylation site *
************************
It has been known for a long time [1] that potential N-glycosylation sites are
specific to the consensus sequence Asn-Xaa-Ser/Thr. It must be noted that the
presence of the consensus tripeptide is not sufficient to conclude that an
asparagine residue is glycosylated, due to the fact that the folding of the
protein plays an important role in the regulation of N-glycosylation [2]. A
recent study [3] has shown that the presence of a proline either between the
Asn and the Ser/Thr or C-terminal to the Ser/Thr will completely suppress
N-glycosylation.
-Consensus pattern: N-{P}-[ST]-{P}
[N is the glycosylation site]
-Last update: June 1988 / First entry.
[ 1] Marshall R.D.
Annu. Rev. Biochem. 41:673-702(1972).
[ 2] Pless D.D., Lennarz W.J.
Proc. Natl. Acad. Sci. U.S.A. 74:134-138(1977).
[ 3] Bause E.
Biochem. J. 209:331-336(1983).
{END}
{PDOC00005}
{PS00005; PKC_PHOSPHO_SITE}
{BEGIN}
*****************************************
* Protein kinase C phosphorylation site *
*****************************************
In vivo, protein kinase C exhibits a preference for the phosphorylation of
serine or threonine residues close to a C-terminal basic residue [1,2]. The
presence of additional basic residues at the N- or C-terminal of the target
amino acid enhances the Vmax and Km of the phosphorylation reaction.
-Consensus pattern: [ST]-x-[RK]
[S or T is the phosphorylation site]
-Last update: June 1988 / First entry.
[ 1] Woodgett J.R., Gould K.L., Hunter T.
Eur. J. Biochem. 161:177-184(1986).
[ 2] Kishimoto A., Nishiyama K., Nakanishi H., Uratsuji Y., Nomura H.,
Takeyama Y., Nishizuka Y.
J. Biol. Chem. 260:12492-12499(1985).
{END}
{PDOC00006}
{PS00006; CK2_PHOSPHO_SITE}
{BEGIN}
*****************************************
* Casein kinase II phosphorylation site *
*****************************************
Casein kinase II (CK-2) is a protein serine/threonine kinase that has activity
independent of cyclic nucleotides and of calcium. This enzyme phosphorylates
many different proteins. The substrate specificity of this enzyme [1,2] can
be summarized as follows:
(1) Under comparable conditions Ser is favoured over Thr.
(2) An acidic residue (either Asp or Glu) must be present three residues
to the C-terminal of the phosphate acceptor site.
(3) Additional acidic residues in positions +1, +2, +4 and +5 increase the
phosphorylation rate. Most physiological substrates have at least one
acidic residue in these positions.
(4) Asp is preferred to Glu as the provider of acidic determinants.
(5) A basic residue to the N-terminal of the acceptor site decreases the
phosphorylation rate, while an acidic one will increase it.
-Consensus pattern: [ST]-x(2)-[DE]
[S or T is the phosphorylation site]
-Note: this pattern is found in all of the known physiological substrates
except in the high mobility group protein 14, where an alanine replaces the
acidic residue in position +3. However, the phosphorylation rate of this
substrate is very low.
-Last update: January 1989 / First entry.
[ 1] Marin O., Meggio F., Marchiori F., Borin G., Pinna L.A.
Eur. J. Biochem. 160:239-244(1986).
[ 2] Kuenzel E.A., Mulligan J.A., Sommercorn J., Krebs E.G.
J. Biol. Chem. 262:9136-9140(1987).
{END}
{PDOC00093}
{PS00100; CAT}
{BEGIN}
*************************************************
* Chloramphenicol acetyltransferase active site *
*************************************************
Chloramphenicol acetyltransferase (CAT) (EC 2.3.1.28) catalyzes the acetyl-CoA
dependent acetylation of the antibiotic chloramphenicol [1], an inhibitor of
prokaryotic peptidyltransferase activity. Acetylation of chloramphenicol by
CAT inactivates the antibiotic. A histidine residue plays a central role in
the catalytic mechanism of the enzyme. We use a conserved hexapeptide sequence
around the catalytic residue as a signature pattern for this type of enzyme.
-Consensus pattern: H-H-x-V-C-D
[The second H is the active site residue]
-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in SWISS-PROT: NONE.
-Last update: January 1989 / First entry.
[ 1] Murray I.A., Hawkins A.R., Keyte J.W., Shaw W.V.
Biochem. J. 252:173-179(1988).
{END}
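The long-form output above is regular enough to be read back by a script.
The following Python sketch assumes the layout is exactly as in the example:
table rows of the form "PSnnnnn from->to NAME PDOCnnnnn", followed by
documentation sections that open with a {PDOCnnnnn} line and close with
{END}. The function name parse_long_output is invented for this example
and is not part of Prosearch.

```python
import re

# One table row: accession, start->end, short name, documentation id.
ROW = re.compile(r"^(PS\d{5})\s+(\d+)->(\d+)\s+(\S+)\s+(PDOC\d{5})$")
DOC_START = re.compile(r"^\{(PDOC\d{5})\}$")


def parse_long_output(text):
    """Split long-form output into table hits and documentation sections.

    Returns (hits, docs): hits is a list of
    (accession, start, end, name, doc_id) tuples, and docs maps each
    doc_id to the lines found between its {PDOCnnnnn} line and {END}.
    """
    hits, docs = [], {}
    doc_id, body = None, []
    for line in text.splitlines():
        stripped = line.strip()
        start_doc = DOC_START.match(stripped)
        row = ROW.match(stripped)
        if start_doc:
            doc_id, body = start_doc.group(1), []
        elif stripped == "{END}" and doc_id is not None:
            docs[doc_id] = "\n".join(body)
            doc_id = None
        elif doc_id is not None:
            body.append(line)             # text inside a documentation section
        elif row:
            acc, start, end, name, doc = row.groups()
            hits.append((acc, int(start), int(end), name, doc))
    return hits, docs


if __name__ == "__main__":
    sample = (
        "PS00100 178->184 CAT PDOC00093\n"
        "{PDOC00093}\n"
        "{PS00100; CAT}\n"
        "{BEGIN}\n"
        "-Consensus pattern: H-H-x-V-C-D\n"
        "{END}\n"
    )
    hits, docs = parse_long_output(sample)
    print(hits)
    print(sorted(docs))
```

Note that the {PSnnnnn; NAME} and {BEGIN} marker lines are kept as part of
each section's text in this sketch; strip them if you only want the prose.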
NOTICES
This code is covered by the Free Software Foundation's GNU
General Public License.
Frank Kolakowski
======================================================================
|lfk at athena.mit.edu || Lee F. Kolakowski |
|lfk at eastman2.mit.edu || M.I.T. |
|kolakowski at wccf.mit.edu || Dept of Chemistry |
|lfk at mbio.med.upenn.edu || Room 18-506 |
|lfk at hx.lcs.mit.edu || 77 Massachusetts Ave.|
|AT&T: 1-617-253-1866 || Cambridge, MA 02139 |
======================================================================
GOOD LUCK
Hassan