Reformatting of databases for BLAST

David Kristofferson kristoff at genbank.bio.net
Sun Oct 6 14:09:50 EST 1991

        Nucleic and protein databases must be reformatted for use with blastn
and blastp, respectively.  What are the considerations and consequences of
maintaining two sets of nucleic and protein blast databases, one set for the
latest quarterly releases, and another set for new sequences that is updated
(reformatted) at daily or weekly intervals?  I suppose embedded in this
question is another one, about the appropriate uses and "abuses" of FastA's
init scores versus blast's probability values with respect to the sizes of
the databases searched.
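
To make the database-size dependence concrete, here is a minimal sketch
using the standard Karlin-Altschul form, in which the expected number of
chance matches scoring at least S is K*m*n*exp(-lambda*S) and the
probability of seeing at least one is 1 - exp(-E).  The values of K and
lambda, the query length, and the database sizes are illustrative
assumptions, not figures from any actual server, and the server may not
compute its values in exactly this way.

```python
import math


def poisson_p_value(score, query_len, db_len, K, lam):
    """Probability of at least one chance match scoring >= score.

    Karlin-Altschul form: E = K * m * n * exp(-lambda * S), P = 1 - exp(-E).
    K and lam depend on the scoring system; callers must supply them.
    """
    expected = K * query_len * db_len * math.exp(-lam * score)
    # -expm1(-E) equals 1 - exp(-E) but stays accurate when E is tiny.
    return -math.expm1(-expected)


# Hypothetical numbers: the same alignment score searched against a full
# release and against a small "new sequences" database 1/100 the size.
K_ASSUMED, LAMBDA_ASSUMED = 0.1, 0.3
p_full = poisson_p_value(100, 300, 50_000_000, K_ASSUMED, LAMBDA_ASSUMED)
p_new = poisson_p_value(100, 300, 500_000, K_ASSUMED, LAMBDA_ASSUMED)
print(p_full, p_new)
```

For small values the probability scales almost linearly with database
length, so the same hit looks roughly 100 times more significant in the
smaller database than in the full one.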



	The problem of affecting BLAST's probability values is a real
one.  We decided last week that, to avoid it, we would have to combine
the new GenBank and EMBL data with the latest quarterly release, remove
duplicates (caused by updates of existing entries occurring in the new
data), and reindex each entire database for BLAST every night.
Fortunately (or so I've been told), this reindexing does not take a
long time at present.  The alternative is to keep two separate
databases at the expense of affecting the probability values.
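
The combine-and-deduplicate step might be sketched roughly as follows.
This is only an illustration: it assumes that duplicates are recognized by
accession number and that the newer entry should win, neither of which is
spelled out above, and the function name and sample records are
hypothetical.

```python
def merge_release_with_updates(release, updates):
    """Combine a quarterly release with new data, one entry per accession.

    Both arguments are iterables of (accession, sequence) pairs.  Keying on
    the accession is an assumption made for this sketch.  An entry in
    `updates` replaces a release entry with the same accession; later
    updates replace earlier ones.
    """
    merged = {}
    for accession, sequence in release:
        merged[accession] = sequence
    for accession, sequence in updates:
        merged[accession] = sequence
    return merged


# Hypothetical records: X00002 is revised in the new data, X00003 is new.
release = [("X00001", "ACGT"), ("X00002", "GGCC")]
updates = [("X00002", "GGCCTT"), ("X00003", "TTAA")]
combined = merge_release_with_updates(release, updates)
print(sorted(combined))
```

The merged set would then be written out and reformatted for BLAST as a
single database, so that every search sees one consistent database size.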

	We are looking to have the daily GenBank and EMBL updates
available on the GOS BLAST server around mid-week.


				Dave Kristofferson
				GenBank Manager

				kristoff at genbank.bio.net
