Reformatting of databases for BLAST

Warren Gish gish at host.nlm.nih.gov
Fri Oct 11 12:06:21 EST 1991

Subject: Re: Reformatting of databases for BLAST

In article <9110041530.AA01811 at genbank.bio.net> MACRIDES at WFEB2.BITNET
(Foteos Macrides) writes:
... stuff deleted ...
>I suppose what is embedded in
>this question is another about the appropriate uses and "abuses" of FastA's
>init scores versus blast's probability values w.r.t. the sizes of the
>databases searched.

With respect to BLAST...

When the probabilities reported by the BLAST programs must be compared
between searches against different databases, the database size can be
normalized across all searches using the programs' Z parameter.  If the
searches to be compared have already been performed, the reported
Expect values can be normalized in direct proportion to the ratio of the
database sizes--the larger the database, the larger the Expect value.
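
As a minimal sketch of that after-the-fact rescaling, the helper below
multiplies a reported Expect value by the ratio of a common reference
database size to the size actually searched.  The function name, the
argument names, and the numbers are hypothetical and do not come from the
BLAST programs; the only assumption is that both sizes are given in the
same units (for example, total residues).

```python
def rescale_expect(expect, searched_size, reference_size):
    """Rescale an Expect value to a common reference database size.

    Expect is treated as directly proportional to database size, so the
    value is multiplied by reference_size / searched_size.
    """
    if searched_size <= 0 or reference_size <= 0:
        raise ValueError("database sizes must be positive")
    return expect * reference_size / searched_size


# Hypothetical example: a hit found in a 5,000,000-residue database,
# re-expressed as if a 20,000,000-residue database had been searched.
# The database is four times larger, so the Expect value grows fourfold.
print(rescale_expect(0.02, searched_size=5_000_000, reference_size=20_000_000))
```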

To assess the significance of alignments in a manner independent of
database size, scores can be converted to bits, a measure of
informativeness that is less dependent on the specific scoring system used
(Karlin and Altschul, 1990; Altschul, 1991).  Which PAM matrix (e.g.,
PAM-120 or PAM-250) was used in a search remains important to the
statistics, but the scale of the matrix becomes virtually irrelevant*.

In their output, the BLAST programs report the factor, Lambda, needed to
convert scores to bits:

    bits = score * Lambda / ln(2)

where ln(2) is the natural logarithm of 2.
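
The conversion can be sketched in a few lines of Python.  The function
name is hypothetical, and the Lambda value in the example is purely
illustrative (it is what one would expect for a log-odds matrix scaled in
half-bit units); in practice Lambda should be read from the BLAST output
for the search in question.

```python
import math


def score_to_bits(score, lam):
    """Convert a raw alignment score to bits: score * Lambda / ln(2)."""
    return score * lam / math.log(2)


# Illustrative Lambda for a matrix scaled in half-bit units (an assumption,
# not a value taken from any particular BLAST run).
half_bit_lambda = math.log(2) / 2

# A raw score of 60 under such a matrix corresponds to about 30 bits.
print(score_to_bits(60, half_bit_lambda))
```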

Whatever the database size or scoring system employed, the Expect value
is approximated by the relation:

    Expect ~  KMN/(2**b)

where M and N are the lengths of the database and query sequence, b
is the number of bits associated with the alignment score, and K is
one of the parameters described by Karlin and Altschul (1990) and
reported by the BLAST programs.
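
A minimal sketch of this approximation follows.  The function and
argument names are hypothetical, and the K, M, N, and bit values in the
example are made up for illustration; K should be taken from the BLAST
output.  Note that each additional bit halves the Expect value, while
doubling the database length doubles it.

```python
def expect_from_bits(k, m, n, bits):
    """Approximate Expect as K*M*N / 2**b.

    k    -- the Karlin-Altschul K parameter reported by BLAST
    m, n -- lengths of the database and of the query sequence
    bits -- alignment score expressed in bits
    """
    return k * m * n / 2.0 ** bits


# Hypothetical values, for illustration only.
k, m, n = 0.1, 10_000_000, 250
for b in (30, 40, 50):
    print(b, expect_from_bits(k, m, n, b))
```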

Version 1.2 of the BLAST programs reports both alignment scores and bits
and is posted for anonymous ftp on ncbi.nlm.nih.gov in /pub/blast
(with support routines in /pub/ncbi, /pub/gish, and /pub/dfa).

Warren Gish
National Center for Biotechnology Information / National Library of Medicine

*The informativeness reported for alignments is described as being
"virtually" independent of the scale of the PAM matrix, because some
precision is lost when real-valued, log-odds PAM scores are rounded
to nearest integers for subsequent use by the BLAST programs.

Karlin and Altschul (1990).  Methods for assessing the statistical
significance of molecular sequence features by using general scoring
schemes.  Proc. Natl. Acad. Sci. USA 87:2264-2268.

Altschul (1991).  Amino acid substitution matrices from an information
theoretic perspective.  J. Mol. Biol. 219:555-565.

  Warren Gish                           phone:  (301) 496-2475, ext. 64
  Staff Fellow                          FAX:  (301) 480-9241
  National Center                       Internet:  gish at ncbi.nlm.nih.gov
     for Biotechnology Information
