
BLAST search with large sequences

Joe Ryan jfryan at NHGRI.NIH.GOV
Wed May 12 08:25:34 EST 1999


> Are there any publicly available solutions to doing BLAST searches with 
> large sequences, then viewing and manipulating the results?
(the full original post follows my reply)

Hello Gary,

I have considered this problem as well, and I began work on a Perl program
that would split a sequence into overlapping pieces, BLAST each piece, and
reassemble the results into a single BLAST report (you are welcome to the
code).  However, that work was put on hold, partly because I questioned
whether splitting, BLASTing, and reassembling would be any more sensitive
than BLASTing the whole sequence directly.
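
To make the splitting step concrete, here is a minimal sketch in Python.
It is not the Perl program mentioned above.  The function names are
hypothetical, and the 50 Kb piece length and 1 Kb overlap are simply the
figures suggested in the original post.  Each piece is returned with its
0-based offset, so that 1-based coordinates reported for a piece can be
shifted back onto the full query.

```python
def split_with_overlap(seq, chunk_len=50_000, overlap=1_000):
    """Yield (offset, piece) pairs covering seq.

    Consecutive pieces share `overlap` bases; `offset` is the 0-based
    position of the piece within the full sequence.
    """
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must be >= 0 and smaller than chunk_len")
    step = chunk_len - overlap
    start = 0
    while True:
        yield start, seq[start:start + chunk_len]
        if start + chunk_len >= len(seq):
            break  # this piece reached the end of the sequence
        start += step


def to_query_coords(offset, start, end):
    """Shift 1-based coordinates within a piece onto the full query."""
    return offset + start, offset + end


if __name__ == "__main__":
    query = "ACGT" * 30_000  # a 120 Kb toy sequence
    for offset, piece in split_with_overlap(query):
        print(offset, len(piece))
```

Each piece would then be written out and searched separately.  A match
that extends beyond the overlap on either side of a boundary will still
be cut in two, so the overlap needs to be at least as long as the matches
you care about.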

I have not compared the results extensively, but here is my gut feeling
on the situation.

Both PowerBLAST and NCBI BLAST are phenomenal programs.

PowerBLAST will handle very large sequences and has a lot of other 
nice features such as masking and taxonomy filters.  
The downside to PowerBLAST is that it runs over the network, so you are
limited by bandwidth and by demand on the server.  You are also limited to
the databases available at NCBI (no local databases).
I would recommend reading the PowerBLAST paper (PMID: 9199938), which
explains the problem you are describing and their approach in great
detail.

The latest local NCBI BLAST seems to handle very large sequences quite well.
The limiting factor is memory; I have been told that there is no
upper limit on query size.  I trust the results of the local
BLAST a little more than those of PowerBLAST because the local BLAST uses
the latest gapped-alignment algorithms from NCBI, and it is continually
being updated and maintained.

I would be interested in hearing any other results or conclusions others
have reached.  I have also CCed blast-help at ncbi.nlm.nih.gov, who can
perhaps address the question as well.

(initial post follows this mail)

Joe
--
Joseph Ryan
Programmer
National Human Genome Research Institute


> It is starting to become common for people to want to do BLAST searches with
> sequences of 200 Kb and upwards. 
>  
> There are then problems with memory, time to do the search and many strong 
> matches forcing interesting weak matches in other regions of the query 
> sequence off the bottom of the list of output alignments.
>  
> I am interested in how other sites are approaching this problem. 
>  
> My initial thoughts on this are that the query sequence should be split 
> into lengths of maybe 50 Kb with an overlap of maybe 1 Kb.  The results 
> can then be processed to produce a composite MSPcrunch format file which 
> can be searched with existing scripts.  Display scripts can then 
> reintegrate the alignment results from two or more output files in the 
> region of interest.
>  
> Are there any publicly available solutions to doing BLAST searches with 
> large sequences, then viewing and manipulating the results?
>  
> Has anyone found ways to do BLAST searches with large sequences without 
> splitting them?
>  
> What other problems (and solutions) do people encounter with large sequence 
> searches?
> 
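
The reintegration step described in the quoted post could look roughly
like the Python sketch below.  The Hit record and its fields are
hypothetical and are not the MSPcrunch format; parsing of the real
search output is left out.  Hits from each piece are shifted by that
piece's offset into full-query coordinates, and a hit lying wholly
inside an overlap, which both neighbouring pieces report, is kept once.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical hit record; coordinates are 1-based."""
    subject: str
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    score: float


def merge_chunk_hits(chunk_hits):
    """Combine hits from several pieces into one list.

    chunk_hits is an iterable of (offset, hits) pairs, where offset is
    the 0-based start of the piece within the full query.
    """
    seen = set()
    merged = []
    for offset, hits in chunk_hits:
        for hit in hits:
            shifted = hit._replace(q_start=hit.q_start + offset,
                                   q_end=hit.q_end + offset)
            if shifted not in seen:  # drop duplicates from overlaps
                seen.add(shifted)
                merged.append(shifted)
    # Order along the query; stronger hits first at equal positions.
    merged.sort(key=lambda h: (h.q_start, -h.score))
    return merged
```

This only removes exact duplicates.  A match cut in two at a piece
boundary would need extra logic to be joined back together, and scores
that depend on query length may differ slightly between a piece and the
full sequence.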



