Elias Lolis wrote:
> I am interested in determining the frequency of proline (relative to
> other amino acids) to be present at the amino terminus. Does anyone
> know of a program that can do this or whether such a list has been
> compiled? It will be important for the program to distinguish between
> entire polypeptides and mature, processed forms of proteins. Thanks.
May I suggest something?
First of all, let's extract the mature, processed chains from the SwissProt
databank. If I'm not mistaken, these are annotated in the FT (feature table)
field of SwissProt entries as CHAIN, so it should be possible to extract
them with SRS as follows:
getz -fosn -pos -l swissprot '[Sequence-Features:CHAIN]' > chains.list
Now, you should have a File Of Sequence Names (FOSN) chains.list for all
CHAIN features in SwissProt.
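If SRS is not at hand, the same Begin/End information could in principle be read straight from a SwissProt flat file. The following is only a minimal sketch, not part of the procedure above: it assumes the classic fixed-column FT layout ("FT   CHAIN   begin   end"), the function name chain_ranges is mine, and the example entry is made up for illustration.

```python
import re

# Assumption: classic SwissProt feature-table layout, e.g.
#   "FT   CHAIN        25    110       description"
# Uncertain positions such as "<1" or "?" do not match and are skipped.
CHAIN_RE = re.compile(r"^FT   CHAIN\s+(\d+)\s+(\d+)")

def chain_ranges(entry_text):
    """Return (begin, end) pairs, 1-based inclusive, for each CHAIN feature."""
    ranges = []
    for line in entry_text.splitlines():
        match = CHAIN_RE.match(line)
        if match:
            ranges.append((int(match.group(1)), int(match.group(2))))
    return ranges

# Made-up entry fragment, for illustration only.
example_entry = """\
ID   TEST_HUMAN     STANDARD;      PRT;   110 AA.
FT   SIGNAL        1     24
FT   CHAIN        25    110       EXAMPLE PROTEIN.
"""

print(chain_ranges(example_entry))
```

Each (begin, end) pair plays the same role as the Begin: and End: specifications in chains.list.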
Then you can use a modified version of FindPatterns from the GCG package.
It has to be modified to take into account the Begin: and End:
specifications from the chains.list file. The modification is very easy:
you only have to add the following at line 370 of a copy of the original
findpatterns.f program (I called it efindpatterns.f):
Call SQMove (Sq,Sq.Begin,Sq.End)
and then compile the new version.
You now have a version of FindPatterns which is able to search the chains
in chains.list for a Proline in the first position:
efindpatterns -PAT='<P' @chains.list -nomonitor -out=NPro.find
The file NPro.find now contains all occurrences of a Proline at the
N-terminus of chains found in SwissProt. FYI, I found 438 such occurrences
in the 13,470 chains in SwissProt. Of course, this misses quite a lot of
sequences which do not have the CHAIN keyword in their FT field.
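Since the original question asks about proline relative to the other amino acids, it may help to tally the first residue of every chain rather than search for P alone. Here is a rough sketch of that idea, assuming you already have each parent sequence and the 1-based Begin position of its chain; the function names and the toy sequences are mine, not real SwissProt data.

```python
from collections import Counter

def n_terminal_counts(chains):
    """Count the first residue of each chain.

    chains: iterable of (sequence, begin), with begin 1-based as in
    the CHAIN feature of the parent entry.
    """
    counts = Counter()
    for sequence, begin in chains:
        if 1 <= begin <= len(sequence):
            counts[sequence[begin - 1]] += 1
    return counts

def relative_frequencies(counts):
    """Convert raw counts into fractions of all chains examined."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {residue: n / total for residue, n in counts.items()}

# Toy data, for illustration only.
toy_chains = [("MKTPLLA", 4), ("MPAGG", 2), ("MSSAQ", 2), ("MAPK", 1)]

counts = n_terminal_counts(toy_chains)
for residue, fraction in sorted(relative_frequencies(counts).items()):
    print(residue, counts[residue], round(fraction, 3))
```

Run over all chains, this would give the N-terminal frequency of every residue in one pass, so proline can be compared directly with the others.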
A search on the whole of SwissProt (59159 sequences) yields 661 Prolines
at the N-terminus, 28 of these being sequences already examined as
chains.
As there are 11328 sequences in SwissProt with CHAIN in the FT field,
we examined 59159 + 13470 - 11328 = 61301 different sequences, yielding
661 + 438 - 28 = 1071 N-terminal Prolines.
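For anyone who wants to recheck the bookkeeping, the few lines below simply redo the arithmetic with the numbers quoted above and turn it into a fraction. The variable names are mine, and the resulting percentage is only as good as the assumptions behind the counts.

```python
# Counts quoted above.
whole_sequences = 59159      # all SwissProt sequences searched
chains = 13470               # CHAIN features searched
entries_with_chain = 11328   # sequences that carry a CHAIN feature

pro_whole = 661              # N-terminal Pro in whole sequences
pro_chains = 438             # N-terminal Pro in chains
pro_both = 28                # counted in both searches

# Avoid counting twice what was examined in both searches.
total_examined = whole_sequences + chains - entries_with_chain
total_prolines = pro_whole + pro_chains - pro_both

fraction = total_prolines / total_examined
print(total_examined, total_prolines, f"{100 * fraction:.2f}%")
```

With these numbers, N-terminal Proline comes out at roughly 1.75% of the sequences examined.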
Does this sound right to you? Or did I overlook something?
This assumes, of course, that the absence of CHAIN in the FT field is not
due to a lack of information or to misannotation.
Regards,
Fred
Frédéric PLEWNIAK
Bioinformatics
I.G.B.M.C.
Strasbourg - France