The library control file (motifs.fil) used by PROFILESCAN is strange for
historical reasons. The development of PROFILESCAN occurred before we
knew what we do now about the score distribution, especially about the
systematic dependence of score on length. Therefore, the original version
of PROFILESCAN used absolute scores (corresponding to the orig scores
reported by PROFILESEARCH) as cutoffs. Because the distribution was
clearly not normal (when unnormalized), no provision was made for sigma
cutoffs. After the current normalization procedure for length
dependence was developed, PROFILESCAN was updated to take the
normalization into account. To do this, it needs to know the
coefficients A, B, and C needed to normalize comparisons to the profile.
The cutoffs, however, are now in terms of the normalized scores (not the
Z score). I plan to convert this whole system to something more
rational where you will simply enter /sigma=5.0, for example, on the
command line to get all matches with Z>5.0, and not worry about the
cutoffs, but unfortunately I have not done it yet [same old story <:^( ].

The instructions below cover both validation of the profile and
installing the profile in the library. The essential item is #4, but I
recommend the rest of the steps to confirm the sensitivity and
specificity of the profile. Please feel free to e-mail or call if you have
questions.
Michael Gribskov
San Diego Supercomputer Center
gribskov at sdsc.edu
(619) 534-8312
--------------------------------------------------------------------------------
Instructions for installing validated profiles in PROFILESCAN library file
1) Generate a profile (PROFILEMAKE) and perform a database search
(PROFILESEARCH). Examine the results and confirm that all sequences that
should have the motif in question have standard scores (Z scores) above
6.0.
2) Align all of the top scoring sequences with the profile (PROFILEGAP)
and confirm that the alignment is the same as the alignment used to
generate the profile (or at least acceptable). If some of the top
scoring sequences appear highly unlikely to contain the motif of
interest, try to determine from the alignment what feature of the
profile is causing the artifactually high score and adjust (edit) the
profile to exclude this sequence more effectively. Changes in the
default gap opening and extension parameters may also be necessary to
optimally separate related and unrelated sequences.
3) Return to 1) and repeat the generation of the profile and database
search until all related sequences are detected with standard scores
above 6.0, and no unrelated sequences have scores above 6.0. Such a
profile is both sensitive and specific and thus may be considered
validated.
4) Enter the statistics from the normalization section of the output from
PROFILESEARCH into the profile library control file (motifs.fil). The
gap and length penalties used for the searches, the three parameters A,
B, and C used in the normalization, and the mean and standard deviation
of the normalized score distribution are recorded in the library control
file. In addition, the "high" and "interesting" threshold values are
specified. These values are recorded in terms of the normalized score
(before conversion to standard scores) and can be calculated by
    threshold = sigma_cutoff * sigma + ave_score
where sigma_cutoff is the desired Z score cutoff value and sigma and
ave_score are the standard deviation and mean of the normalized score
distribution (reported in the top section of the output from
PROFILESEARCH). The thresholds are generally set such that the high
threshold includes all of the known related sequences, and none of the
unrelated sequences. Typically, the "high" threshold corresponds to a
standard score of about 6 and the "interesting" threshold to a standard
score of 4.5 to 5.
5) Note to users of GCG versions: The calculation of the A, B, and C
parameters requires at least one database search. For a quick test, you
can shortcut this by running PROFILESEARCH interactively and aborting
the search after about 1000 sequences. This is generally enough for
reasonable normalization. It seems to work best using the PIR database,
and will obviously fail if the superfamily represented by the profile is
heavily overrepresented in the first 1000 sequences. For PIR, I worry mainly
about cytochromes. This approach doesn't seem to work as well for
SwissProt, probably due to some peculiarity of the sequence order. In
any case, be sure to check the correlation coefficient for the curve fit,
as this is the simplest indication that there is a sampling problem
(preferably r>0.9; r>0.85 is acceptable for a quick test). GCG will allow
other more sophisticated approaches using FOSNs.
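
The arithmetic in steps 3) and 4) can be sketched in a few lines of Python.
This is only an illustration, not part of PROFILESCAN or PROFILESEARCH: the
function names are made up for this note, and the sigma and ave_score values
in the example are hypothetical placeholders for the numbers reported in the
normalization section of your own PROFILESEARCH output.

```python
def threshold(sigma_cutoff, sigma, ave_score):
    """Normalized-score cutoff for a desired Z score (formula from step 4)."""
    return sigma_cutoff * sigma + ave_score


def z_score(normalized_score, sigma, ave_score):
    """Inverse of threshold(): convert a normalized score back to a Z score."""
    return (normalized_score - ave_score) / sigma


def is_validated(related_z, unrelated_z, cutoff=6.0):
    """Step 3 criterion: every related sequence scores above the cutoff
    and no unrelated sequence does."""
    return (all(z > cutoff for z in related_z)
            and all(z <= cutoff for z in unrelated_z))


if __name__ == "__main__":
    # Hypothetical values; take the real ones from the PROFILESEARCH output.
    sigma = 1.5
    ave_score = 10.0

    high = threshold(6.0, sigma, ave_score)         # "high" cutoff, Z = 6
    interesting = threshold(4.5, sigma, ave_score)  # "interesting" cutoff, Z = 4.5
    print("high threshold:       ", high)
    print("interesting threshold:", interesting)
```

The two threshold values computed this way are the ones entered in
motifs.fil, since the file expects normalized scores rather than Z scores.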
Good Luck