*Release* PairWise and SearchWise 1.0 (long)

ewan birney birney at molbiol.ox.ac.uk
Wed Feb 1 18:39:43 EST 1995


           PairWise and SearchWise 1.0
           ***************************

	SearchWise and PairWise are part of a new sequence analysis 
package for comparisons of protein sequences or protein profiles against 
DNA sequences which is ROBUST towards errors in the DNA sequence,
in particular frame shifting errors. There are two main programs
in this package. A one-to-one alignment program called PairWise and
a database searching program called SearchWise.

	This information can be read on the WWW at the URL

http://www.molbiol.ox.ac.uk/www/users/birney/wise/description.html

Algorithm

	The core of both programs is a simple extension of the
dynamic matrix routines used in sequence analysis in which a protein
profile is compared to all three frames of a DNA sequence simultaneously
allowing for frame shifting errors. The reverse frames are done in a similar
manner.
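
	As a rough illustration only (this is not the PairWise code, and
the scores, penalties and function names below are hypothetical), a
frame-tolerant comparison can be sketched as a local alignment in which
a residue is matched against the three bases ending at any DNA position,
so all three frames are scored in one pass, and a penalised step of one
or two bases moves between frames. A plain residue string and a
match/mismatch score stand in for a real profile and comparison matrix.

```python
from itertools import product

# Standard genetic code, codons enumerated in t, c, a, g order.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_score(residue, codon, match=5, mismatch=-4, stop=-10):
    """Toy score for one residue against one codon (hypothetical values)."""
    aa = CODON.get(codon.lower(), "X")  # unknown bases never match
    if aa == "*":
        return stop
    return match if aa == residue else mismatch


def frame_tolerant_score(protein, dna, gap=-8, frameshift=-12):
    """Best local score of protein vs DNA, allowing frame shifts.

    best[i][j] is the best score of an alignment that ends having used
    protein[:i] and dna[:j]. Because j can be any base position, the
    three forward frames are all covered by the same matrix.
    """
    n, m = len(protein), len(dna)
    best = [[0] * (m + 1) for _ in range(n + 1)]
    top = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cands = [
                0,                             # start a new local alignment
                best[i - 1][j] + gap,          # residue with no codon
                best[i][j - 1] + frameshift,   # skip one base (frame shift)
            ]
            if j >= 2:
                cands.append(best[i][j - 2] + frameshift)  # skip two bases
            if j >= 3:
                cands.append(best[i][j - 3] + gap)         # codon with no residue
                cands.append(best[i - 1][j - 3]
                             + codon_score(protein[i - 1], dna[j - 3:j]))
            best[i][j] = max(cands)
            top = max(top, best[i][j])
    return top


protein = "MKTAYIAKQR"
clean = "atgaaaactgcttatattgctaaacaacgt"
shifted = clean[:15] + "g" + clean[15:]   # one spurious base mid-sequence
print(frame_tolerant_score(protein, clean))
print(frame_tolerant_score(protein, shifted))
```

With the spurious base the alignment still spans the whole protein and
pays a single frame penalty, instead of being cut in half at the error.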

	A protein profile can be trivially derived from a protein sequence
and a comparison matrix, and both programs allow on-the-fly loading
of a single protein sequence for the query.
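
	A minimal sketch of that derivation, assuming a profile is simply
one row of scores per position: each position of the sequence takes the
comparison-matrix row of its residue. The three-letter matrix below is
a made-up toy, not a real comparison matrix, and the function name is
hypothetical.

```python
# Toy comparison matrix over a three-letter alphabet (made-up values).
TOY_MATRIX = {
    "A": {"A": 4, "G": 0, "K": -1},
    "G": {"A": 0, "G": 6, "K": -2},
    "K": {"A": -1, "G": -2, "K": 5},
}


def profile_from_sequence(sequence, matrix):
    """Return one score row per position, copied from the matrix row
    belonging to the residue found at that position."""
    return [dict(matrix[residue]) for residue in sequence]


profile = profile_from_sequence("GAK", TOY_MATRIX)
for position, row in enumerate(profile, start=1):
    print(position, row)
```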

	Although the algorithm is effectively the equivalent of a 
six frame translation of the DNA sequence, there is neither explicit 
translation nor complementation of the sequence, making it a very
efficient method.
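
	How the programs avoid translation and complementation is not
described here. One possible way to get the same effect, shown purely
as an assumption, is to index codons in place and keep a second lookup
table that maps each forward-strand triplet to the residue encoded on
the opposite strand. All names below are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in t, c, a, g order.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
FORWARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# For every forward triplet, the residue read from the reverse strand.
# Built once, so the sequence itself is never complemented or reversed.
COMPLEMENT = str.maketrans("acgt", "tgca")
REVERSE = {codon: FORWARD[codon.translate(COMPLEMENT)[::-1]]
           for codon in FORWARD}


def residues_on_both_strands(dna, start):
    """Residues encoded by dna[start:start+3] on the forward and reverse
    strands, found by table lookup on the original sequence."""
    triplet = dna[start:start + 3].lower()
    return FORWARD[triplet], REVERSE[triplet]


print(residues_on_both_strands("cat", 0))
```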

	There are some subtle differences in the coding of the 
dynamic matrix which depart from the more "standard" Smith-Waterman
coding. Email me for more info.

	For completeness, standard profile to protein sequence/database
routines have also been included in the programs.

Uses

	The fact that these programs are robust towards frame shifting
errors makes them ideal for many different aspects of sequence analysis.

	o Comparing a determined protein sequence or better still, a
protein profile against a DNA database and aligning those sequences
without worrying about errors in the DNA database. I had originally
envisaged the main benefit of this being in searches against EST
databases, but trials have shown that there are a depressing number
of sequencing errors in the coding sequences of full cDNA sequences too.
This program allows confident searching of DNA databases.

	o Comparing protein sequence or protein profile against DNA
sequences (or a DNA database) allowing the sequence to put arbitrary
length gaps into the aligned DNA. This allows a comparison to "jump"
introns, and therefore one can search a database containing genomic
DNA without knowing any exon/intron boundaries, and still reasonably
expect an alignment spread across three or four exons to be returned
in the high scores. Note that this is an entirely different strategy 
from trying to predict exon/intron boundaries.

	o Comparing a recently determined DNA sequence against a protein
database, and not having to be confident in the precise DNA sequence
(especially frame shifting errors). By using programs such as
PairWise and SearchWise one might hope that people will realise their
sequencing errors by noticing homologies which require a frame shift
in their sequenced DNA.

	Generally we have found it very useful to be able to use
"dirty" DNA sequence data (either in the databases or as sequences) 
without having to be always concerned about frame shifts.

	Some examples of PairWise output are given at the end of this
message.


Programs
	
	PairWise is a one-on-one alignment program, like GCG's BESTFIT
or GAP.

	SearchWise is the database comparison program, like FASTA
or PROFILESEARCH.

	PairWise has a number of features which make it easy to use:

	o A menu driven user interface (for simple linefeed terminals)
	o PairWise can be linked (at compile time) to GCG8, so that
sequences can be pulled straight out of GCG8 databases if needed. PairWise can 
also run as a stand-alone program.
	o There are up to five different parameters to be set for each
alignment, and the numbers are very different from usual alignment programs.
PairWise has a variety of 'rule-of-thumb' default parameters for 
different sequence comparisons to make it easier to find the best parameters.
	o PairWise also includes standard protein seq<->protein seq or
protein profile<->protein seq routines.
	o Quite nice outputs.
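
	The five parameters can be read off the headers of the two
example outputs at the end of this message. The sketch below records
those values and, as an assumption that this announcement does not
confirm, charges a gap of a given length with the conventional affine
formula (opening penalty plus extension for each further position). The
class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignmentParameters:
    gap: int
    gap_extension: int
    frame_penalty: int
    frame_extension: int
    stop_codon: int


# Values copied from the headers of the two example outputs.
CDNA_EXAMPLE = AlignmentParameters(2200, 200, 1870, 1320, 500)
GENOMIC_EXAMPLE = AlignmentParameters(2400, 400, 1800, 0, 500)


def affine_cost(opening, extension, length):
    """Assumed affine cost: opening + extension * (length - 1)."""
    if length <= 0:
        return 0
    return opening + extension * (length - 1)


# A three-position gap under the cDNA settings.
print(affine_cost(CDNA_EXAMPLE.gap, CDNA_EXAMPLE.gap_extension, 3))
# A 500-position frame gap under the genomic settings.
print(affine_cost(GENOMIC_EXAMPLE.frame_penalty,
                  GENOMIC_EXAMPLE.frame_extension, 500))
```

Under this assumed formula a frame extension of 0, as in the genomic
example, makes the cost of a long jump independent of its length.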

	SearchWise is run entirely from the command line, but there
is a menu-driven program (Opensearch) to submit SearchWise runs as batch
jobs under VMS or as nice nohup background processes on UNIX machines.

	o SearchWise can use the following database formats:
	
	GCG-binary
	GCG-ascii
	Fasta (Pearson) format
	EMBL .dat formats

	These formats can be mixed and matched to any extent, as long
as the database remains either all protein or all DNA.
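
	For readers unfamiliar with the simplest of these, Fasta (Pearson)
format is a ">" header line followed by one or more lines of sequence.
A minimal reader might look like the sketch below; it is not the
SearchWise reader, and the function name is hypothetical.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of text lines.

    The name is the first word after '>'; sequence lines are joined
    until the next header. Blank lines are ignored.
    """
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].strip()
            name = header.split(None, 1)[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


example = [
    ">seq1 first test entry",
    "atgaaaact",
    "gcttat",
    "",
    ">seq2",
    "MKTAY",
]
records = list(read_fasta(example))
print(records)
```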

	Opensearch is the menu-driven program which comes included
in the package to submit SearchWise jobs.

Bad Features.

	The main practical difficulty is finding the correct parameters
for a particular alignment; these are hard to get used to, especially
for new users.

	Currently the on-the-fly alignments in SearchWise are at best FRAGILE. 
They shouldn't be used at the moment.

	SearchWise is CPU-hungry: the full comparison of a 100-residue
profile against the full EMBL database on a small but otherwise idle DEC
Alpha takes about a day and a half. As most people (blindly) just chuck in their
favourite 450aa protein against the EMBL database it usually clogs the
machines up. I am looking at fine-grain and coarse-grain parallelisation
methods at the moment.
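
	To put a number on that, assume (as a rough rule for dynamic
matrix methods, not a measured result) that run time grows in proportion
to the query length for a fixed database and machine. Scaling the figure
quoted above gives an estimate for the 450aa case. The function name is
hypothetical.

```python
# Reference figure quoted above: 100 residues take about 1.5 days.
REFERENCE_LENGTH = 100
REFERENCE_DAYS = 1.5


def estimated_days(query_length):
    """Linear extrapolation from the reference figure (an assumption)."""
    return REFERENCE_DAYS * query_length / REFERENCE_LENGTH


print(estimated_days(450))
```

On that assumption a 450aa query would occupy the same machine for
nearly a week.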

	For most sites it is probably only feasible to run the EMBL 
EST and protein databases on SearchWise. SearchWise should be used
as a final stage of a search strategy which would start off by using faster
methods such as BLAST, BLITZ or FASTA. The full benefit of using SearchWise is
only seen when good protein profiles are produced.

	In terms of the algorithm, it is probably better to use a full hidden 
Markov model for the protein to DNA sequence comparison as then exon/intron 
boundary information could be incorporated. 


Compiling and Platform specifics.

	PairWise and SearchWise have been compiled on the following platforms.

	o DECAlpha running OpenVMS
	o DECAlpha running OSF/1
	o SGI running IRIX 5.2
	o Sun4 running Solaris 2.3

Executables (unlinked to GCG8) are provided for these platforms.

No MS-Windows or Mac versions are currently available.

The package can be found by anonymous ftp at sable.ox.ac.uk
in the directory /pub/users/ba97001/wise/

Look at the file README...

It is probably better to look first at the WWW site at

http://www.molbiol.ox.ac.uk/www/users/birney/wise/topwise.html


Thanks

	SearchWise and PairWise have benefitted hugely from the
beta-testing by other people over the last couple of months. The
mechanics of getting easy-to-use, platform-independent code are
horrendous, and I certainly cannot claim that these programs are in
a finished state. However I do think that they are useful for a 
broad range of applications.

	In particular I would like to thank Kay Hoffmann at 
ISREC in Switzerland and Toby Gibson at EMBL for the massive amount
of help (and patience) they both provided over the beta testing period.

	This project was started at Cold Spring Harbor (under Adrian
Krainer) and then moved to EMBL (under Toby Gibson), where most of the
coding took place. Julie Thompson at EMBL wrote many of the UNIX specific and 
config routines. 

	At Oxford Liz Cowe has been principal in getting my programs
to behave well with GCG8, and Jasper Rees has been my most picky
beta tester as well as giving me time and space on the Oxford computers.

Publications

Currently two publications feature SearchWise results.

Gibson, T. J., Hyvoenen, M., Birney, E., Musacchio, A. and Saraste, M. (1994) 
   PH domain: the first anniversary. Trends Biochem. Sci., 19, 349-353.

Aasland, R., Gibson, T. J. and Stewart, A. F. (1995) The PHD-finger: 
   implications for chromatin-mediated transcriptional regulation.
   Trends Biochem. Sci., February issue.


Any others would be gratefully received.

ewan birney

birney at molbiol.ox.ac.uk
	
---------------------------------------------------------------------------
 *Error in cDNA example*

PairWise 1.0 Output. Written by Ewan Birney (birney at molbiol.ox.ac.uk).
Profile file:	rrmcut.p62p
Sequence file	em:mm14648

Sat Jan 28 09:37:39 1995

Gap 2200, Gap extension 200

Frame penalty 1870, Frame extension 1320, Stop codon 500
Alignment of
Profile rrmcut.p62p vs
Sequence mm14648 in file From GCG database

Score 33572
Aligned Ranges:
1-99 (profile)
214-464 (sequence)

Showing forward strand

                    * * **     *    **  *   *    *                   
rrmcut.p62      1  KLFVGNLgnPPDTTEEELRELFSQFGEIESVKvmrdesqvrddhlnrqap
Translate          SLKVDNL..TYRTSPDTLRRVFEKYGRVGDVYIPRD.............P
EmRod:mm14    214  tcaggac  atcatcgacacgtgatgcggggtaccg             c
                   ctataat  cagcccactggttaaaggtgatatcga             c
                   ccggccg  ccccgcccggcccgacgccccgctggc             g


                          *  ** *     ** *  *  * *  ** ***  *        
rrmcut.p62     51  ktgksrGF^^FVTFESEEDAEKAIEALNGKVIGGRVLRVKAAQKkeerqk
Translate          YTKESRIF^^FVRFHDKPHAEDAMDAMDGAVLDGRELRVQLARYGRPPDS
EmRod:mm14    319  taagtcatGCtgctcgaccggggaggaggggcggcgccgccgctgcccgt
                   acaacgtt  ttgtaaacacaactactagcttaggatgtatcgaggccac
                   ccggcgcc  ccgcccggccaccgccgcgggccccgggggggccccgccg


----------------------------------------------------------------------------

   * Genomic Example *
PairWise 1.0 Output. Written by Ewan Birney (birney at molbiol.ox.ac.uk).
Profile file:	hnrnp.prf
Sequence file	hnrnpa1.gen

Wed Feb  1 23:33:15 1995

Gap 2400, Gap extension 400

Frame penalty 1800, Frame extension 0, Stop codon 500
Alignment of
Profile hnrnp.prf vs
Sequence X12671 in file hnrnpa1.gen

Score 128374
Aligned Ranges:
1-179 (profile)
1383-2455 (sequence)

Showing forward strand

                    ************* * ***  *  ******   * **            
hnrnp.prf       1  TEPEQLRKLFIGGLDFRTTDDGLKAHFEQWGNIVDVVV^+++++++++++
Translate          KEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVV^-----------
X12671       1383  agcgccaactaggtatgaaggacaactgctgacagtggAGATTTGGAAGG
                   aacaatgatttggtgtaccaagtggataaggctcagtt            
                   agcaggggcctaggctaattgcggcttgagagcgctga            


                                                                     
hnrnp.prf      39  ++++++++++++++++++++++++++++++++++++++++++++++++++
Translate          --------------------------------------------------
X12671       1509  GACAAAGCAGTAAAACAGCCGATTTCCTTGGCTTATCTTGGTGCAGTCTT
                                                                     
                                                                     


                                                                     
hnrnp.prf      39  ++++++++++++++++++++++++++++++++++++++++++++++++++
Translate          --------------------------------------------------
X12671       1559  CTCCGAATGCTTATGAAAGTAGTTAATAGCATTATAGTTAGAGCTTTGTT
                                                                     
                                                                     


                                                                     
hnrnp.prf      39  ++++++++++++++++++++++++++++++++++++++++++++++++++
Translate          --------------------------------------------------
X12671       1609  GGCAAAGGAACGTCCTGCTTTGATTTTAAAAGCTAACCTCTTAAATCTAA
                                                                     
                                                                     


                                                                     
hnrnp.prf      39  ++++++++++++++++++++++++++++++++++++++++++++++++++
Translate          --------------------------------------------------
X12671       1659  GGGTAGTGGGAAACTGGACGAACTTTTTATAAAAGGCTGGTGTAAAGTTT
                                                                     
                                                                     


                                                                     
hnrnp.prf      39  ++++++++++++++++++++++++++++++++++++++++++++++++++
Translate          --------------------------------------------------
X12671       1709  CCTATTGCCCTATTCAAAGTTAAAATAACAAAAGCTTTTGCGGTCAGACT
                                                                     
                                                                     


                                                        ** ********* 
hnrnp.prf      39  ++++++++++++++++++++++++++++++++++++KDPKTKRSRGFGFI
Translate          ------------------------------------RDPNTKRSRGFGFV
X12671       1759  TTGTGTTACATAAATTAACACTGTTCTCAGGTAATGagcaaactagtgtg
                                                       gacacagcggtgtt
                                                       ataccgctgctgtc


                   **     ** * ****** *** *******                    
hnrnp.prf      53  TYSQSYMVDNAQNARPHKIDGRTVEPKRAV^+++++++++++++++++++
Translate          TYATVEEVDAAMNARPHKVDGRVVEPKRAV^-------------------
X12671       1837  atgagggggggaagaccagggagggcaaggTCCAGAGAAGTGAGTGGGTT
                   cacctaatacctacgcaataggttacagct                    
                   atctggggtatgtagacggtaatgaagatc                    


                                                                     
hnrnp.prf      83  ++++++++++++++++++++++++++++++++++++++++++++++++++
Translate          --------------------------------------------------
X12671       1947  TTTTTTCTTCTTCTTCTTAAACTTACTTGGATATGTGCTGCTATGAACTT
                                                                     
                                                                     


                                                                     
hnrnp.prf      83  ++++++++++++++++++++++++++++++++++++++++++++++++++
Translate          --------------------------------------------------
X12671       1997  AAGATTCGGGAGTTTTCTAAACTTACCAAAATTTTTTATTCGAGTATAGG
                                                                     
                                                                     


                                                * *  **** **** * *  *
hnrnp.prf      83  +++++++++++++++++++RQEI....DSPEAGATVKKLFVGGLKDDHDE
Translate          -------------------WFFSY*DSQRPGAHLTVKKIFVGGIKEDTEE
X12671       2047  CTTTGCTAATCTAAACCTAttttttgtcacggctagaaatgggaaggagg
                                      gttcaaacagcgcatctaatttggtaaacaa
                                      gtcctgttaaatccatgagatttctaactaa


                     ** ** * * *   ** **   ********* *** * ****      
hnrnp.prf     110  ECLREYFKQFGQIVSVEIVTDKDTGKKRGFAFVEFDDYDPVDKI^+++++
Translate          HHLRDYFEQYGKIEVIEIMTDRGSGKKRGFAFVTFDDHDSVDKI^-----
X12671       2159  cccagttgctgaaggagaaagcgagaaagtgtgatggcgtggaaGTCAGT
                   aatgaataaagatattattcaggggaaggtcttctaaaactaat      
                   tcaatttagtaatagtacgtcactcgagctctactcctccgtgt      


                                      *                              
hnrnp.prf     154  +++++++++++++++++++L^^++++++++++++++++++++++++++++
Translate          -------------------L^^----------------------------
X12671       2297  AAGTATCAGATAGTGGCATtGTAAGGGTTCCACAATCTGTATGGCATTCT
                                      t                              
                                      a                              


                                                               * *   
hnrnp.prf     155  ++++++++++++++++++++++++++++++++++++++++++++KTHSIK
Translate          --------------------------------------------KYHTVN
X12671       2349  AAACCCTGATACCATGTTGTATCTATGTTTTTTTTTTAGTTCAGatcaga
                                                               aaacta
                                                               acttgt


                   * *  * **  ** *
hnrnp.prf     161  GKNVDVKKAIAKQDM
Translate          GHNCEVRKALSKQEM
X12671       2411  gcatggaagctacga
                   gaagatgactcaaat
                   ccctataacgagagg






