frequencies of amino acids at N-terminus

Frederic PLEWNIAK plewniak at igbmc.u-strasbg.fr
Fri Nov 22 11:00:41 EST 1996

Elias Lolis wrote:
> I am interested in determining the frequency of proline (relative to
> other amino acids) to be present at the amino terminus.  Does anyone
> know of a program that can do this or whether such a list has been
> compiled?  It will be important for the program to distinguish between
> entire polypeptides and mature, processed forms of proteins.  Thanks.
May I suggest something?
First of all, let's extract the mature processed chains from the
SwissProt databank. If I'm not mistaken, these are annotated in the FT
field of SwissProt as CHAIN, so it should be possible to extract them
with SRS as follows:

   getz -fosn -pos -l swissprot '[Sequence-Features:CHAIN]' >

Now, you should have a File Of Sequence Names (FOSN) chains.list for all
CHAIN features in SwissProt.

Then you can use a modified version of FindPatterns from the GCG package.
It has to be modified in order to take into account the Begin: and End:
specifications from the chains.list file. This modification is very easy
to perform: you only have to add the following at line 370 of a copy of
the original findpatterns.f program (I called it efindpatterns.f):

	Call SQMove (Sq,Sq.Begin,Sq.End)

and then compile the new version.

You now have a version of FindPatterns which is able to search the chains
listed in chains.list for a Proline in the first position:

    efindpatterns -PAT='<P' @chains.list -nomonitor -out=NPro.find
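
For anyone without GCG at hand, the anchored pattern '<P' boils down to
testing the first residue of each chain, and tallying that residue for
every chain gives the frequency of proline relative to the other amino
acids. A minimal Python sketch, assuming the chains are already available
as (sequence, begin) pairs with 1-based positions; the function name
n_terminal_counts and the toy sequences are made up for illustration and
are not part of GCG or SRS:

```python
from collections import Counter


def n_terminal_counts(chains):
    """Tally the residue found at the first position of each chain.

    `chains` is an iterable of (sequence, begin) pairs, where `begin` is
    the 1-based start of the mature chain within the full sequence.
    """
    counts = Counter()
    for sequence, begin in chains:
        # Skip entries whose start position falls outside the sequence.
        if 1 <= begin <= len(sequence):
            counts[sequence[begin - 1].upper()] += 1
    return counts


# Made-up toy data, for illustration only (not SwissProt entries).
chains = [
    ("MKTLLPAVG", 6),  # mature chain starts at P
    ("MPSTAAG", 2),    # chain starts at P, right after the first residue
    ("MASKLG", 1),     # chain starts at M
]

counts = n_terminal_counts(chains)
total = sum(counts.values())
for residue, n in counts.most_common():
    print(f"{residue}: {n}/{total} = {n / total:.2f}")
```

Counting every residue rather than proline alone gives the whole
distribution in one pass, which is what the original question asks for.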

The file NPro.find now contains all occurrences of a Proline at the
N-terminus of chains found in SwissProt. FYI, I found 438 such
occurrences in the 13470 chains in SwissProt. Of course, we are thus
missing quite a lot of sequences which do not have the CHAIN keyword in
their FT field.
A search on the whole of SwissProt (59159 sequences) yields 661 Prolines
at the N-terminus, 28 of these being sequences already examined as
chains. As there are 11328 sequences in SwissProt with CHAIN in the FT
field, we examined 59159 + 13470 - 11328 = 61301 different sequences,
yielding 661 + 438 - 28 = 1071 N-terminal Prolines.
Does this sound right to you? Or did I overlook something?
This assumes, of course, that the absence of CHAIN in the FT field is
due neither to a lack of information nor to misannotation.
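
The two totals are an inclusion-exclusion count: add the results of the
two searches, then subtract what was counted twice. A short Python check
of the arithmetic, using the figures quoted in this message (13470
chains, as in the sum); the variable names are only labels chosen here:

```python
# Figures quoted in the message.
all_sequences = 59159       # sequences in the whole-SwissProt search
chains = 13470              # CHAIN features searched
entries_with_chain = 11328  # entries carrying a CHAIN feature

pro_whole = 661             # N-terminal Pro found in whole sequences
pro_chains = 438            # N-terminal Pro found in chains
pro_both = 28               # hits already examined as chains

# Inclusion-exclusion: sum both searches, remove the overlap.
examined = all_sequences + chains - entries_with_chain
n_term_pro = pro_whole + pro_chains - pro_both

print(examined, n_term_pro)
# Ratio of the two totals, i.e. the share of examined sequences that
# start with Pro under the assumptions stated in the message.
print(f"{n_term_pro / examined:.4f}")
```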

Strasbourg - France

More information about the Info-gcg mailing list
