how do I automatically update gene coordinates after re-sequencing the genome

Philipp Pagel philipp.pagel at gmx.de
Mon Jul 14 11:49:50 EST 2003

>>One could do this by writing a program that runs blastn for all the genes
>>against the new sequence and then picks out the new coordinates from the
>>nearly identical hits. Gene duplicates etc. could make it a bit messy.
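
A minimal sketch of the idea quoted above, assuming the genes were searched with blastn using tabular output (-outfmt 6, whose default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function name, the thresholds and the gene_lengths dictionary are illustrative assumptions, not part of any existing tool. Genes with exactly one near-identical, near-full-length hit get new coordinates; duplicates and genes without a good hit are reported separately.

```python
from collections import defaultdict

def remap_from_blast(tab_lines, gene_lengths, min_identity=99.0, min_coverage=0.99):
    """Sort genes into remapped / ambiguous / unmapped from BLAST tabular lines.

    tab_lines    : iterable of lines in blastn -outfmt 6 format
    gene_lengths : dict gene id -> gene length in bp (hypothetical input)
    Thresholds are arbitrary defaults chosen for illustration.
    """
    hits = defaultdict(list)
    for line in tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        gene, contig, pident = f[0], f[1], float(f[2])
        qstart, qend = int(f[6]), int(f[7])
        sstart, send = int(f[8]), int(f[9])
        # Fraction of the gene covered by this alignment.
        coverage = (qend - qstart + 1) / gene_lengths[gene]
        if pident >= min_identity and coverage >= min_coverage:
            # BLAST reports minus-strand hits with sstart > send.
            strand = "+" if sstart <= send else "-"
            hits[gene].append((contig, min(sstart, send), max(sstart, send), strand))

    remapped = {g: h[0] for g, h in hits.items() if len(h) == 1}
    ambiguous = {g: h for g, h in hits.items() if len(h) > 1}   # e.g. gene duplicates
    unmapped = set(gene_lengths) - set(hits)                    # no good hit at all
    return remapped, ambiguous, unmapped
```

Anything that lands in ambiguous or unmapped would still need a second look, which is where the "messy" part comes in.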

> For bacterial genomes, BLAST is probably fast enough.  For mammalian
> genomes it isn't (unless you have many hundreds of CPUs available, which
> only a few sites do).

How many times per day did you upgrade your sequence? While it is
definitely a good idea to throw a few CPUs at such a job, I don't see why
you would need hundreds of them. A BLAST search of all predicted mouse
genes against a database of about the same size runs for only a few days
on something like 10 CPUs.

>>Has anyone out there done this before, and do you have any tips?
>>Would be extremely grateful if you could share them!

> Most genome annotation projects have this problem, as you suggest.
> I used to work at Incyte Genomics, and while there I employed someone
> specifically to write code to solve this very problem.  Unfortunately
> the code was not made publicly available.

> I can't speak for the particular problems of bacterial genomes, but in
> the human genome we were hit by the usual issues; many features are very
> difficult to remap automatically.  For example, I remember trying to
> remap an STS tiling path across the coding regions of one particular
> gene from the original gene build (which was on HTG draft sequence) onto
> the final sequence that came along later.

> The problem was that this particular gene had about 12 alternative 5'
> exons, which were on average about 98% sequence identical with each
> other.  Made remapping very difficult (as well as designing unique STSs
> for that gene, of course!)

> The second problem was speed.  BLAST and other DP algorithms just
> weren't fast enough.  We did come up with an exact string matching
> method that was much faster, but were usually left with about 20% of
> features which the algorithm would flag up as needing human
> intervention; typically this occurred when the new version of the
> sequence contained indels relative to the original build sequence.
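
The exact-matching code mentioned above was never released, so the following is only a sketch of the general idea under simple assumptions: sequences are plain Python strings, features use 0-based half-open coordinates on the old sequence, and only the forward strand is searched. All names are hypothetical. A feature is remapped only if its old sequence occurs exactly once in the new sequence; everything else is flagged for manual review.

```python
def find_all(haystack, needle):
    """Return start positions of all (possibly overlapping) exact matches."""
    positions = []
    i = haystack.find(needle)
    while i != -1:
        positions.append(i)
        i = haystack.find(needle, i + 1)
    return positions

def remap_exact(old_seq, new_seq, features):
    """Remap features by exact string matching.

    features : dict name -> (start, end), 0-based half-open on old_seq
    Returns (remapped, needs_review).
    """
    remapped, needs_review = {}, {}
    for name, (start, end) in features.items():
        occurrences = find_all(new_seq, old_seq[start:end])
        if len(occurrences) == 1:
            new_start = occurrences[0]
            remapped[name] = (new_start, new_start + (end - start))
        elif not occurrences:
            # Typical cause: an indel or substitution inside the feature.
            needs_review[name] = "no exact match"
        else:
            needs_review[name] = "multiple matches"
    return remapped, needs_review
```

A feature that overlaps an indel in the new sequence ends up under "no exact match", which corresponds to the cases described above as needing human intervention.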

> The smaller the feature, the harder it is to remap, because a short
> sequence is more likely to match elsewhere purely by chance.  SNPs were
> the trickiest, since you then have to decide how much flanking sequence
> to use to help the mapping process.  The more you use, the more accurate
> the mapping gets, but the slower it is to run.
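
The flanking-sequence trade-off can be illustrated with a small sketch. It assumes plain string sequences, a 0-based SNP position, forward-strand search only, and that the new sequence carries the same base at the SNP site as the old one; the function names and the max_flank default are made up for the example. The flank is widened one base at a time until the context is unique in the new sequence.

```python
def find_all(haystack, needle):
    """Return start positions of all (possibly overlapping) exact matches."""
    positions = []
    i = haystack.find(needle)
    while i != -1:
        positions.append(i)
        i = haystack.find(needle, i + 1)
    return positions

def remap_snp(old_seq, new_seq, pos, max_flank=50):
    """Return (new_pos, flank_used); new_pos is None if the SNP cannot be placed."""
    for flank in range(1, max_flank + 1):
        left = max(0, pos - flank)
        right = min(len(old_seq), pos + flank + 1)
        hits = find_all(new_seq, old_seq[left:right])
        if len(hits) == 1:
            # Unique context: offset of the SNP inside the context is pos - left.
            return hits[0] + (pos - left), flank
        if not hits:
            # A longer context contains this one, so it cannot match either.
            return None, flank
        # More than one hit: context too short, widen the flank and retry.
    return None, max_flank
```

Each extra base of flank means another scan of the new sequence in this naive version, so uniqueness is paid for in run time.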

> Many annotation projects currently seem to prefer re-running their
> automated annotation pipelines rather than trying to remap their
> existing annotation.

> You may consider this to be burying one's head in the sand, and I
> couldn't possibly comment.  :-)

> Tim

> ---

Dr. Philipp Pagel                                Tel.  +49-89-3187-3675
Institute for Bioinformatics / MIPS              Fax.  +49-89-3187-3585
GSF - National Research Center for Environment and Health
Ingolstaedter Landstrasse 1
85764 Neuherberg, Germany

More information about the Bio-soft mailing list
