Hi Netters,
I have received many requests for our paper
"A comparison of seven protein database search programs",
so I thought it better to post it here on Bio-soft,
as it is a short one. Thanks to BINARY and Bio-line for allowing
the paper to be posted before publication. Please don't
send any more reprint requests.
Why seven? Well, those were the ones we could hook
up to the same database, so e-mail servers couldn't be
included.
In presenting our paper comparing database search programs
I invite people to carry out performance comparisons
in other areas of bioinformatics, as this kind of work is
a bit thin on the ground. I know most academic programmers
tend not to be too keen on comparison work and I understand
the reasons for this (we plan to discuss this as part
of a future document). However,
(1) biologists aren't doing their best research if they
aren't using the most efficient and effective applications
from bioinformatics. So it is important to do and publish
these tests.
(2) Comparison should also promote improvement in
applications, as well as support the core pure science
research in bioinformatics.
By the way, our next discussion document should be out soon,
about improving the impact of bioinformatics in biology.
Duncan Rouch
School of Biological Sciences, University of Birmingham, UK
------------------------------------------------------------
A comparison of seven protein database search programs*
-------------------------------------------------------
BINARY (1994) 6:17-18.
*This version is as published in BINARY, but you will have to see the
journal to see the figure; I've put a table in to stand in for it.
See the Appendix for information on obtaining BINARY or the
Bioline version.
Duncan A. Rouch, Nigel L. Brown and Alan J. Bleasby1
School of Biological Sciences, The University of Birmingham,
Birmingham B15 2TT, U.K.
Electronic mail: D.A.Rouch at uk.ac.bham, N.L.Brown at uk.ac.bham
1 SEQNET, SERC Daresbury Laboratory, Daresbury, Warrington WA4
4AD, U.K.
Electronic mail: A.Bleasby at uk.ac.daresbury
Address for correspondence:
Dr D.A. Rouch
School of Biological Sciences
University of Birmingham
Edgbaston
Birmingham B15 2TT
UK
Telephone: (021) 414 6551
FAX: (021) 414 6557
When a contiguous gene reading frame of unknown function
is identified in a nucleotide sequence, the next step is
usually to search for proteins homologous to the translation
product. We have attempted to determine which of a range of
programs are best suited to such an initial general
comparison of a protein sequence against a database. Seven
programs were used: Wordsearch (1), FASTA (2), GBLASTA (3),
BLASTP (3), BLAST3 (4), SWEEP (5) and PROWL (J.K. Crook and
J.F. Collins, unpublished). All programs were configured to
search the PIR23 protein database (all sections, 6), and
were executed with default parameters as far as possible in
order to most closely approximate the way these programs are
used in practice by most molecular biologists. The query
sequence used was Human b-globin (PIR23, entry HBHU). The
sequence was used both complete and as contiguous
derivatives of a third of the total length. A globin was
chosen as the probe due to both the recurrence of globins in
the database and the range in the degree of pairwise
similarity amongst these. Furthermore, the identities of
homologous sequences within PIR23 can be established
independently, by scanning the list of names; there are 499
globin family sequences in PIR23 from a total of 14,372
sequences.
In order to compare the results from different algorithms
a new, program-independent, evaluation method was required
since the scoring systems of most of the programs are
unique. The ability of the programs to detect homologous
globin sequences was measured as follows. Result lists from
database searches, ordered by score, were scanned downwards
with a window of 10 sequences until the number of globins in
the window fell to 5. The number of globins in the result
list above and including the last globin in the window was
then determined. Finally, the globin count was converted to
a percentage of the total number of globins in the database.
This method might have given biassed results if different
programs embedded non-homologous (non-globin) sequences in
different ways, in regions where there was a high density of
homologous sequences. However, empirical tests indicated
this effect to be negligible. The method thus allowed an
objective evaluation of how well each program can detect
homologous sequences.
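
The scanning procedure described above can be sketched in Python as
follows. This is only an illustrative reading of the method, not the
code used for the paper: the function name globin_recovery, the
one-rank step size, and the treatment of "fell to 5" as "five or
fewer" are assumptions, and the example hit list is made up.

```python
def globin_recovery(hits, total_globins, window=10, threshold=5):
    """Percentage of database globins recovered from a ranked hit list.

    hits          -- list of booleans in score order (True = globin)
    total_globins -- number of globins in the database (499 for PIR23)
    window, threshold -- 10 and 5 in the text; the step of one rank
                         per move is an assumption
    """
    start = 0
    # Slide the window down the list until it holds `threshold`
    # or fewer globins, or until it reaches the end of the list.
    while (start + window < len(hits)
           and sum(hits[start:start + window]) > threshold):
        start += 1
    stop = min(start + window, len(hits))
    # Position of the last globin inside the stopping window
    # (falls back to the rank just above the window if it has none).
    last = max((i for i in range(start, stop) if hits[i]),
               default=start - 1)
    # Count every globin at or above that position.
    found = sum(hits[:last + 1])
    return 100.0 * found / total_globins


# Hypothetical ranked list: 12 globins, 10 non-globins, 3 late globins.
example_hits = [True] * 12 + [False] * 10 + [True] * 3
print(globin_recovery(example_hits, total_globins=15))
```

In this made-up list the three globins ranked below the run of
non-globins are not counted, which is the intended effect of the
window: hits buried among unrelated sequences would not be noticed
by a user reading the list from the top.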
Using complete human b-globin as the query sequence, the
programs showed a range of efficiency in detecting other
globins (Table 1). The top three programs by this method
showed similar globin recoveries: PROWL (90.8%),
SWEEP (90.0%) and BLAST3 (90.6%). The other programs gave
scores between 73.5% and 87.8%. When the shortened b-globin
sequences were used as probes there was a drop of
approximately 20% in globin detection for most programs.
This is consistent with the length dependence of the scoring
techniques used by the programs. The top three programs
were the same as with the first test (PROWL 72.5%, SWEEP
70.3%, BLAST3 72.5%). Of these three programs, BLAST3 has
the limitation that there must be at least two homologous
sequences in the database for homology to be found, as it
depends on 3-way alignments. The drop in globin detection
for shorter sequences was most pronounced for Wordsearch, a
program from the UWGCG package (1). In summary, this
method suggests that of the programs tested, for general
protein database searching, PROWL (Prosrch), SWEEP and
BLAST3 are the best programs to choose.
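
As a quick check on the size of the drop, the figures reported in
Table 1 can be compared program by program. The sketch below takes
the drop as the absolute difference in percentage points, which is
one possible reading of "approximately 20%"; the variable names are
illustrative only.

```python
# Detection figures copied from Table 1:
# (complete b-globin %, shortened b-globin %)
table1 = {
    "PROWL": (90.8, 72.5),
    "SWEEP": (90.0, 70.3),
    "BLAST3": (90.6, 72.5),
    "BLASTP": (87.8, 63.5),
    "GBLASTA": (86.6, 65.3),
    "FASTA": (76.0, 60.1),
    "WORDSEARCH": (73.6, 30.9),
}

# Drop in detection, in percentage points, when the query is shortened.
drops = {name: round(full - short, 1)
         for name, (full, short) in table1.items()}

# Program most affected by shortening the query.
most_affected = max(drops, key=drops.get)

for name, drop in drops.items():
    print(f"{name:<11}{drop:5.1f}")
print("largest drop:", most_affected)
```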
Table 1. Performance of database search programs in globin detection.
_____________________________________________________________
Program        Detection of       Detection of
               b-globin (%)       shortened b-globin (%)
-------------------------------------------------------------
PROWL              90.8               72.5
SWEEP              90.0               70.3
BLAST3             90.6               72.5
BLASTP             87.8               63.5
GBLASTA            86.6               65.3
FASTA              76.0               60.1
WORDSEARCH         73.6               30.9
_____________________________________________________________
Table 1, Performance of database search programs in
globin detection. Programs evaluated were PROWL 0.1
(PR), SWEEP 1.0 (SW), BLAST3 * (BL3), BLASTP *
(BLP), GBLASTA * (GBL), FASTA 1.0 (FA) and
Wordsearch 7.0 (WO): *, version as at 9/1992.
Although not yet distributed, PROWL is equivalent to
Prosrch (7), which is accessible on the SEQNET node at
Edinburgh, U.K. Percentage detection of globin
family sequences in PIR23 is shown for query
sequences, human b-globin (light hatching) and
shortened b-globin derivatives (heavy hatching): in
the latter case each third of the globin sequence was
queried independently, and the three results
averaged. All programs were run to give pairwise
alignments with default parameters, except to give
extended result lists (BLAST-type programs were
executed with S=35, R=1.0, and L=105, where
applicable).
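
The shortened-query figures come from querying each third of the
sequence separately and averaging the three results. A minimal
sketch of that bookkeeping is given below. The helper names, the
way a length not divisible by three is divided, and the example
sequence and percentages are all assumptions for illustration; they
are not taken from the paper.

```python
def split_into_thirds(sequence):
    """Cut a sequence into three contiguous, non-overlapping parts.

    When the length is not a multiple of three, the extra residues
    go to the later parts (an arbitrary choice for this sketch).
    """
    n = len(sequence)
    cuts = [0, n // 3, 2 * n // 3, n]
    return [sequence[cuts[i]:cuts[i + 1]] for i in range(3)]


def average_recovery(recoveries):
    """Mean of the per-fragment detection percentages."""
    return sum(recoveries) / len(recoveries)


# Hypothetical 10-residue sequence, not a real globin.
parts = split_into_thirds("ABCDEFGHIJ")
print(parts)

# Hypothetical detection percentages for the three fragments.
print(average_recovery([70.0, 72.0, 74.0]))
```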
ACKNOWLEDGEMENTS
----------------
We thank James Crook (for making PROWL available) and
Academic Computing Service staff at Birmingham. This work
was supported by the Science and Engineering Research
Council (CCP11) and Medical Research Council (Grant
G.9025236CB to N.L.B.).
References
----------
1. Devereux, J., Haeberli, P., and Smithies, O. (1984) A
comprehensive set of sequence analysis programs for the
VAX. Nucl. Acids Res. 12, 387-395.
2. Pearson, W.R., and Lipman, D.J. (1988) Improved tools
for biological sequence comparison. Proc. Natl. Acad.
Sci. USA 85, 2444-2448.
3. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and
Lipman, D.J. (1990) Basic local alignment search tool. J.
Mol. Biol. 215, 403-410.
4. Altschul, S.F., and Lipman, D.J. (1990) Protein
database searches for multiple alignments. Proc. Natl.
Acad. Sci. USA 87, 5509-5513.
5. Akrigg, D., Bleasby, A.J., Dix, N.I.M., Findlay,
J.B.C., North, A.C.T., Parry-Smith, D., Wooton, J.C.,
Blundell, T.L., Gardner, S.P., Hayes, F., Islam, S.,
Sternberg, M.J.E., Thornton, J.M., Tickle, I.J., and
Murray-Rust, P. (1988) A protein sequence/structure
database. Nature 335, 745-746.
6. George, D.G., Barker, W.C., and Hunt, L.T. (1986) The
protein identification resource (PIR). Nucl. Acids Res.
14, 11-15.
7. Coulson, A.F.W., Collins, J.F., and Lyall, A. (1987)
Protein and nucleic-acid sequence database searching - a
suitable case for parallel processing. Computer J. 30,
420-424.
Appendix: BINARY and Bioline information
-----------------------------------------
Binary is an international journal which publishes a broad range
of articles related to all aspects of computing as applied to
microbiology.
SUBSCRIPTION INFORMATION: 6 Issues per annum. Submissions and
subscription information from the editorial office at the
School of Pure & Applied Biology
University of Wales
College of Cardiff
PO Box 915
Cardiff CF1 3TL, UK
Tel: 0222 874000 x 5743/4974;
fax: 0222 874305;
email: sabjbe at uk.ac.cardiff.thor
BINARY- Computing in microbiology, whose contents list appears
regularly in the BIO-JRNL newsgroup, is now available in an
electronic format, downloadable from the Base de Dados
Tropical (BDT), Brazil.
Abstracts and summaries of papers in BINARY are all available free
of charge. The system is easy to use since it is available
through the increasingly familiar gopher system on the Internet.
Instructions for use are provided under the option "Instructions for
using Bioline Publications" on the main menu.
For more information, please email to
BIO at BIOSTRAT.DEMON.CO.UK
or mail/fax to:
Bioline Publications
Stainfield House
Stainfield
Bourne
Lincs PE10 0RS, UK
Fax: +44 778 570175
Tel: +44 778 570618