ProfileScan questions

Michael Gribskov gribskov at sdsc.edu
Thu Aug 13 13:20:12 EST 1992

The library control file (motifs.fil) used by PROFILESCAN is strange for 
historical reasons.  The development of PROFILESCAN occurred before we 
knew what we do now about the score distribution, especially about the 
systematic dependence of score on length.  Therefore the original version 
of PROFILESCAN used absolute scores (corresponding to orig scores 
reported by PROFILESEARCH) as cutoffs.  Because the distribution was 
clearly not normal (when unnormalized) no provision was made for sigma 
cutoffs.  After the current normalization procedure for length 
dependence was developed, PROFILESCAN was updated to take the 
normalization into account.  To do this, it needs to know the 
coefficients A, B, and C needed to normalize comparisons to the profile.
The cutoffs, however, are now in terms of the normalized scores (not the 
Z score).  I plan to convert this whole system to something more 
rational where you will simply enter /sigma=5.0, for example, on the 
command line to get all matches with Z>5.0, and not worry about the 
cutoffs, but unfortunately I have not done it yet [same old story <:^(  ]

The instructions below cover both validation of the profile and 
installing the profile in the library.  The essential item is #4, but I 
recommend the rest of the steps to confirm the sensitivity and 
specificity of the profile.  Please feel free to e-mail or call if you have 
questions.

Michael Gribskov
San Diego Supercomputer Center
gribskov at sdsc.edu
(619) 534-8312

Instructions for installing validated profiles in PROFILESCAN library file

1) Generate a profile (PROFILEMAKE) and perform a database search 
(PROFILESEARCH).  Examine the results and confirm that all sequences that 
should have the motif in question have standard scores (Z scores) above 

2) Align all of the top scoring sequences with the profile (PROFILEGAP)
and confirm that the alignment is the same as the alignment used to
generate the profile (or at least acceptable).  If some of the top
scoring sequences appear highly unlikely to contain the motif of
interest, try to determine from the alignment what feature of the
profile is causing the artifactually high score and adjust (edit) the
profile to exclude this sequence more effectively.  Changes in the
default gap opening and extension parameters may also be necessary to
optimally separate related and unrelated sequences. 

3) Return to 1) and repeat the generation of the profile and database 
search until all related sequences are detected with standard scores 
above 6.0, and no unrelated sequences have scores above 6.0.  Such a 
profile is both sensitive and specific and thus may be considered 
validated.
4) Enter the statistics from the normalization section of the output from 
PROFILESEARCH into the profile library control file (motifs.fil).  The 
gap and length penalties used for the searches, the three parameters A, 
B, and C used in the normalization, and the mean and standard deviation 
of the normalized score distribution are recorded in the library control 
file.  In addition, the "high" and "interesting" threshold values are 
specified.  These values are recorded in terms of the normalized score 
(before conversion to standard scores) and can be calculated by 

	threshold = sigma_cutoff * sigma + ave_score

where sigma_cutoff is the desired Z score cutoff value and sigma and 
ave_score are the standard deviation and mean of the normalized score 
distribution (reported in the top section of the output from 
PROFILESEARCH).  The thresholds are generally set such that the high 
threshold includes all of the known related sequences, and none of the 
unrelated sequences.  Typically, the "high" threshold corresponds to a 
standard score of about 6 and the "interesting" threshold to a standard 
score of 4.5 to 5.

5) Note to users of GCG versions:  The calculation of the A, B, and C 
parameters requires at least one database search.  For a quick test, you 
can shortcut this by running PROFILESEARCH interactively and aborting 
the search after about 1000 sequences.  This is generally enough for 
reasonable normalization.  It seems to work best using the PIR database, 
and will obviously fail if the superfamily represented by the profile is 
heavily overrepresented in the first 1000 sequences.  For PIR, I worry mainly 
about cytochromes.  This approach doesn't seem to work as well for 
SwissProt, probably due to some peculiarity of the sequence order.  In 
any case, be sure to check the correlation coefficient for the curve fit 
as this is the simplest indication that there is a sampling problem 
(preferably r>0.9; r>0.85 is acceptable for a quick test).  GCG will allow 
other more sophisticated approaches using FOSNs.
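
As a rough illustration of the arithmetic in steps 4 and 5, here is a small 
Python sketch.  It is not part of PROFILESCAN or GCG; the function names and 
the numbers for ave_score and sigma are made up for the example, and you 
would substitute the values reported in your own PROFILESEARCH output.  It 
converts a desired Z score cutoff into a normalized-score threshold using 
the formula in step 4, inverts that formula, and applies the rule of thumb 
for the correlation coefficient from step 5.

```python
def threshold_from_z(sigma_cutoff, ave_score, sigma):
    """Step 4 formula: threshold = sigma_cutoff * sigma + ave_score."""
    return sigma_cutoff * sigma + ave_score


def z_from_normalized(norm_score, ave_score, sigma):
    """Inverse of the step 4 formula: Z score for a normalized score."""
    return (norm_score - ave_score) / sigma


def fit_quality(r):
    """Rule of thumb from step 5 for the curve-fit correlation r."""
    if r > 0.9:
        return "preferred"
    if r > 0.85:
        return "acceptable for a quick test"
    return "possible sampling problem"


# Hypothetical values, NOT taken from any real PROFILESEARCH run.
ave_score = 40.0   # mean of the normalized score distribution
sigma = 2.5        # standard deviation of the normalized score distribution

high = threshold_from_z(6.0, ave_score, sigma)         # "high" at Z = 6
interesting = threshold_from_z(4.5, ave_score, sigma)  # "interesting" at Z = 4.5

print("high threshold:       ", high)
print("interesting threshold:", interesting)
print("curve fit with r=0.87:", fit_quality(0.87))
```

The two threshold values are what would go into motifs.fil under these 
assumed statistics; the Z score cutoffs of 6 and 4.5 are simply the typical 
values mentioned in step 4 and can be changed to suit the profile.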

Good Luck
