Tim Cutts (tjrc1 at mole.bio.cam.ac.uk) wrote:
>I have had a number of queries from users on my system about what
>appears (to them) to be major differences between the output from pileup
>in GCG 8 and GCG 9.
>This seems to be due to the new default scoring matrix; does anyone
>know what the rationale was behind this change, and why does it
>produce such different answers to the previous GCG version? This
>seems to have confused a lot of users. Of course I can tell them to
>use -matrix=oldpep.cmp if they want the same results as GCG 8, but how
>should they determine which is the appropriate scoring matrix to use?
Dramatic changes have been made to the scoring matrices used
by GCG programs starting with version 9.0 of the Wisconsin Package.
1) Format change leads to rescaled matrices
In version 8.1, and before, each program in GCG had a corresponding
scoring matrix. These matrices were filled with floating point (real
number) values, and the protein matrices were based on the PAM250
matrix.
Starting with version 9.0 of the package, the matrices are converted
to integer values and the values are ten times greater. Thus a value
of 1.0 in swgapdna.cmp in previous software versions has been
converted and rescaled to a value of 10 in swgapdna.cmp in version
9.0. This was done in order to make the matrices provided by the
Wisconsin Package more similar to scoring matrices provided by others.
The change in magnitude of the values in the scoring matrices,
however, necessitated a change in the magnitude of the default gap
penalties. These changes are documented in the Version 9.0 User Release
Notes section that concerns package-wide enhancements: New Scoring
Matrices. You can view this on-line with the GCG command:
genhelp whats_new_90
This format and rescaling change is particularly apparent for nucleotide
matrices, where the matrices are unchanged beyond the reformatting and
rescaling.
(NOTE: A copy of the "old" protein matrix, rescaled and in
the new format, is available, but we do not recommend its use.
It is no longer the default matrix.)
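As a rough illustration of the rescaling described in point 1, here is a
minimal Python sketch. It is not GCG code and does not read or write the
.cmp file format. Apart from the 1.0 -> 10 example mentioned above, the
matrix entries and gap penalties are hypothetical values chosen only for
the demonstration, and the function name rescale is made up.

```python
def rescale(scores, factor=10):
    """Multiply every score by `factor` and round to the nearest integer."""
    return {pair: int(round(value * factor)) for pair, value in scores.items()}


# Hypothetical floating-point entries in the style of a pre-9.0 matrix.
# Only the 1.0 -> 10 case comes from the text; the rest are made up.
old_matrix = {("A", "A"): 1.0, ("A", "C"): -0.9, ("A", "G"): 0.0}

# Hypothetical old-style gap penalties (not the real GCG defaults).
old_gap_creation = 5.0
old_gap_extension = 0.3

new_matrix = rescale(old_matrix)
new_gap_creation = int(round(old_gap_creation * 10))
new_gap_extension = int(round(old_gap_extension * 10))

print(new_matrix)
print(new_gap_creation, new_gap_extension)
```

If the matrix and the gap penalties are all multiplied by the same factor,
every alignment score is multiplied by that factor too, so the best-scoring
alignment stays the same. Rescaling on its own therefore should not change
results; the switch to a different protein matrix does.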
2) BLOSUM62 is the new protein matrix default for most programs
Starting with version 9.0, the default matrix used for most programs
has changed. For protein alignments, most programs now use BLOSUM62
as the default scoring matrix (FastA uses BLOSUM50).
Even if we hadn't made a change in the format of the matrices
we would still have changed the default protein scoring matrix.
The BLOSUM62 matrix (now used by all GCG programs except FastA)
is the matrix most accepted in the scientific literature, and has
long been the default matrix used by the BLAST program. For more
information on this matrix I highly recommend the paper
Henikoff, S. and Henikoff, J. G. (1992).
Amino acid substitution matrices from protein blocks.
Proc. Natl. Acad. Sci. USA 89: 10915-10919.
It is not surprising that your results are different when using
the new BLOSUM62 scoring matrix. We believe that the results
with the new matrices are more valid scientifically. You might
also want to experiment with the gap creation and extension penalties,
since the ideal ones to use can be different for each alignment.
Regardless of the matrix and penalties used, it is always a good
idea to visually inspect the alignment to make sure that it makes
sense to you.
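To see why the gap creation and extension penalties matter, here is a
small Python sketch that scores one fixed pairwise alignment under two
penalty settings. It is only an illustration under stated assumptions: the
gap cost is taken to be creation + extension * length (a common affine
convention, assumed here rather than taken from the GCG documentation),
"." is assumed to be the gap character, and the match/mismatch scores and
penalty values are hypothetical, not real matrix entries or GCG defaults.

```python
def score_alignment(row_a, row_b, pair_score, gap_creation, gap_extension,
                    gap_char="."):
    """Score two aligned rows with an affine gap penalty.

    Each run of gap characters in one row costs
    gap_creation + gap_extension * (length of the run).
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    gap_row = None  # which row the current gap run is in, if any
    for a, b in zip(row_a, row_b):
        if a == gap_char or b == gap_char:
            current = "a" if a == gap_char else "b"
            if gap_row != current:
                total -= gap_creation  # a new gap run starts here
                gap_row = current
            total -= gap_extension     # charged for every gap position
        else:
            total += pair_score(a, b)
            gap_row = None
    return total


def toy_score(a, b):
    """Hypothetical scores: 4 for identical residues, -1 otherwise."""
    return 4 if a == b else -1


row_a = "ACDEF"
row_b = "AC..F"

harsh = score_alignment(row_a, row_b, toy_score, gap_creation=12, gap_extension=4)
mild = score_alignment(row_a, row_b, toy_score, gap_creation=2, gap_extension=1)
print(harsh, mild)
```

The same alignment gets a negative score with the harsher penalties and a
positive one with the milder penalties, so a program choosing among
candidate alignments may prefer a different one as the penalties change.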
Regards,
Lynn Miller
---------------------------------------------------------------------------
Lynn Miller || phone: (608) 231-5200
Technical Support Coordinator || fax: (608) 231-5202
Genetics Computer Group, Inc. || e-mail: help at gcg.com
575 Science Drive || e-mail: Lynn.Miller at gcg.com
Madison, WI 53711-1060 USA || e-mail: miller at gcg.com
|| WWW: http://www.gcg.com
---------------------------------------------------------------------------