
Formatting EMBL full release for blast

Rodrigo Lopez rls at ebi.ac.uk
Mon Aug 20 06:12:22 EST 2001


There are several ways of achieving this, and the right one depends on 
what you want to do. First, identify which blast you are using. There are 
two: NCBI-blast and WU-Blast. The former uses a program called formatdb, 
as Keith pointed out. The latter uses pressdb or xdformat, depending on 
which version you use. With either of these blast distros you don't need 
one large blast database for EMBL. Instead, create an individual blast 
database for each of the files in the release, then create alias or 'farm' 
files to search groups of these. 
Information regarding how to use NCBI's formatdb to create multivolume 
databanks can be found at:


Here you will see how to create a .nal (for nucleic acid db's) or .pal 
(for protein sequence db's) file to suit your needs wrt EMBL. These 'farm' 
files are what you need. The entire human division of EMBL currently 
comprises 8 files. Create one fasta file for each of them (not one large 
fasta file for all of them!). Then run formatdb on each of these, naming 
the resulting databases hum1, hum2, hum3 and so on with the -n parameter 
of formatdb. Then create a human.nal file which will contain:

TITLE human
DBLIST hum1 hum2 hum3 hum4 hum5 hum6 hum7 hum8

To use these with blast you would type:

% blastall -p blastn -d human -i myseq.na ....
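
The NCBI steps above can be sketched as a small shell script. This is only 
an illustration under assumptions: the input names hum1.fasta ... hum8.fasta 
are hypothetical, and the functions format_volumes and write_alias are 
helpers made up for this example. Only formatdb's -i, -p and -n options are 
used.

```shell
#!/bin/sh
# Sketch: one blast database per FASTA file, plus an alias ('farm') file.
# Assumption: hum1.fasta ... hum8.fasta (hypothetical names) are in the
# current directory and formatdb is on the PATH.

# Run formatdb once per FASTA file.
# -p F marks the input as nucleotide; -n sets the database base name.
format_volumes() {
    for i in 1 2 3 4 5 6 7 8; do
        formatdb -i "hum$i.fasta" -p F -n "hum$i"
    done
}

# Write NAME.nal, grouping the listed volumes under one database name.
write_alias() {
    name=$1
    shift
    {
        printf 'TITLE %s\n' "$name"
        printf 'DBLIST %s\n' "$*"
    } > "$name.nal"
}

# format_volumes    # uncomment once the FASTA files and formatdb are in place
write_alias human hum1 hum2 hum3 hum4 hum5 hum6 hum7 hum8
```

Running it writes human.nal with the two lines shown earlier, after which 
-d human searches all eight volumes.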


In the case of WU-Blast (version 2.x): this version supports virtual 
databanks, which can be referred to as groups from the command line. 
Suppose you create one databank from each of 8 individual fasta files, one 
per humX.dat file. The fasta files can be created with, for example, 
EMBOSS's seqretall (see: http://www.emboss.org/ - please note that there 
are also blast formatting utilities in EMBOSS such as dbiblast for 
producing WU-BLAST/NCBI-BLAST style indices :-)). You can then refer to 
these databanks using WU-BLAST in the following way:

% blastn "hum1 hum2 hum3 hum4 ..." myseq.na ....
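
A rough sketch of the WU-BLAST side, again under assumptions: the file names 
hum1.fasta ... hum8.fasta are hypothetical, db_list and format_wu_volumes 
are helper names invented here, and the xdformat options (-n for nucleotide 
input, -o for the output database name) should be checked against the usage 
message of your own version. Since the EMBL human division is nucleotide 
data, the commented search line uses blastn with a nucleotide query.

```shell
#!/bin/sh
# Sketch: one WU-BLAST databank per FASTA file, searched together as a group.
# Assumption: hum1.fasta ... hum8.fasta (hypothetical names) exist and
# xdformat is on the PATH.

# Format each FASTA file into its own nucleotide databank.
format_wu_volumes() {
    for i in 1 2 3 4 5 6 7 8; do
        xdformat -n -o "hum$i" "hum$i.fasta"
    done
}

# Print the space-separated databank list hum1 ... humN.
db_list() {
    n=$1
    list=""
    i=1
    while [ "$i" -le "$n" ]; do
        list="${list:+$list }hum$i"
        i=$((i + 1))
    done
    printf '%s\n' "$list"
}

# format_wu_volumes                    # uncomment once the inputs are in place
# blastn "$(db_list 8)" myseq.na       # the quotes keep the list as one argument
```

The quoting matters: the whole list has to reach the blast program as a 
single argument, which is what the double quotes around $(db_list 8) do.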

Hope this helps,



krb at sanger.ac.uk (Keith Bradnam) wrote in 
<Pine.OSF.4.21.0108201008350.20284-100000 at caldy.sanger.ac.uk>:

>On 16 Aug 2001, Bent Nagstrup Terp wrote:
>> Hi!
>> Could anybody please tell me how I get from having downloaded all the
>> .dat.gz's in the full release plus the cumulative update, to having a
>> "blastable" database?
>First you need to convert your sequences from EMBL format into a format
>suitable for creating BLAST databases.  E.g. FASTA format.
>When you have all your sequences in one file, you can use the formatdb
>program (which comes with BLAST) to convert them into a BLAST
>database.  But if you are planning to create one BLAST database containing
>everything in EMBL then this will be very, very, big.
>~  Keith Bradnam - WormBase group: http://wormbase.sanger.ac.uk/
>~  The Sanger Centre, Wellcome Trust Genome Campus
>~  Hinxton, Cambridge, CB10 1SA, UK.  Tel (01223) 497516

More information about the Embl-db mailing list
