
how do I automatically update gene coordinates after re-sequencing the genome

Marcus Claesson m.claesson at student.ucc.ie
Tue Jul 15 08:57:10 EST 2003

Thanks for your answer Tim,

Frustrating that so many go through the same ordeals over and over
again, isn't it? However, after I sent out my question I found a public
program that seems to solve the problem, at least for prokaryotes. It's
Sequin, at http://www.ncbi.nlm.nih.gov/Sequin/

It has quite a nice feature that checks the differences between the old
and new sequences, lets you scroll through them and then updates the
annotation. The input is a FASTA file and a tab-separated text file with
annotations. Sequin then creates a GenBank entry that can be updated with
a new sequence in FASTA format.

To me this looks very good! Have a look at it!


timc at chiark.greenend.org.uk (Tim Cutts) wrote in message news:<S+x*-W7Wp at news.chiark.greenend.org.uk>...
> In article <e818c15b.0307100440.58d9f00b at posting.google.com>,
> Marcus Claesson <m.claesson at student.ucc.ie> wrote:
> >One could do this by writing a program that runs blastn with all the genes
> >against the new sequence and then picks out the new coordinates for the
> >nearly identical hits. Gene duplicates etc could make it a bit messy
> >though.
> For bacterial genomes, BLAST is probably fast enough.  For mammalian
> genomes it isn't (unless you have many hundreds of CPUs available, which
> only a few sites do).
> >Has anyone out there done this before, and do you have any tips?
> >Would be extremely grateful if you could share them!
> Most genome annotation projects have this problem, as you suggest.
> I used to work at Incyte Genomics, and while there I employed someone
> specifically to write code to solve this very problem.  Unfortunately
> the code was not made publicly available.
> I can't speak for the particular problems of bacterial genomes, but in
> the human genome we were hit by the usual issues; many features are very
> difficult to remap automatically.  For example, I remember trying to
> remap an STS tiling path across the coding regions of one particular
> gene from the original gene build (which was on HTG draft sequence) onto
> the final sequence that came along later.
> The problem was that this particular gene had about 12 alternative 5'
> exons, which were on average about 98% sequence identical with each
> other.  Made remapping very difficult (as well as designing unique STSs
> for that gene, of course!)
> The second problem was speed.  BLAST and other DP algorithms just
> weren't fast enough.  We did come up with an exact string matching
> method that was much faster, but were usually left with about 20% of
> features which the algorithm would flag up as needing human
> intervention; typically this occurred when the new version of the
> sequence contained indels relative to the original build sequence.
> The smaller the feature, the harder it is to remap, because its
> sequence is more likely to occur elsewhere by chance.  SNPs were the
> trickiest, since you then have to decide on how much flanking sequence
> to use to help the mapping process.  The more you use, the more accurate
> it gets, but the slower it is to run.
> Many annotation projects currently seem to prefer re-running their
> automated annotation pipelines rather than trying to remap their
> existing annotation.
> You may consider this to be burying one's head in the sand, and I
> couldn't possibly comment.  :-)
> Tim
> ---
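
For anyone who wants to try the exact string matching idea Tim describes,
here is a minimal sketch in Python. It is not the Incyte code (which was
never released) and the function names find_all and remap_feature are made
up for illustration. It assumes 0-based, half-open coordinates, plain
strings for the sequences, and forward-strand features only. A feature is
remapped only if its sequence (plus optional flanking bases) occurs exactly
once in the new sequence; anything else, including features hit by indels
or duplicated elsewhere, comes back as None to be checked by hand.

```python
from typing import List, Optional, Tuple


def find_all(haystack: str, needle: str) -> List[int]:
    """Return every 0-based start position of needle in haystack.

    Overlapping occurrences are reported too.
    """
    hits = []
    pos = haystack.find(needle)
    while pos != -1:
        hits.append(pos)
        pos = haystack.find(needle, pos + 1)
    return hits


def remap_feature(
    old_seq: str,
    new_seq: str,
    start: int,
    end: int,
    flank: int = 0,
) -> Optional[Tuple[int, int]]:
    """Remap the half-open interval [start, end) from old_seq onto new_seq.

    The probe is the feature plus up to `flank` bases on each side
    (clipped at the sequence ends). Returns the new (start, end) if the
    probe matches exactly once, otherwise None (needs human review).
    """
    left = max(0, start - flank)
    right = min(len(old_seq), end + flank)
    probe = old_seq[left:right]

    hits = find_all(new_seq, probe)
    if len(hits) != 1:
        # No exact match (e.g. an indel or substitution inside the probe)
        # or an ambiguous one (duplicates, or a very short feature).
        return None

    # Step past the left flank to get back to the feature itself.
    new_start = hits[0] + (start - left)
    return (new_start, new_start + (end - start))


if __name__ == "__main__":
    old = "TTGACCATGAAACGCTAAGGCT"
    new = "GG" + old  # two bases inserted upstream of everything

    # A 12-base "gene" is unique, so it remaps without flanking sequence.
    print(remap_feature(old, new, 6, 18))

    # A single-base feature (think SNP) is ambiguous on its own...
    print(remap_feature(old, new, 9, 10))
    # ...but can be placed once some flanking sequence is included.
    print(remap_feature(old, new, 9, 10, flank=5))
```

The flank argument is the trade-off mentioned above for SNPs: a longer
probe is less likely to match by chance but more likely to be broken by a
sequence change, and it costs more to search. Reverse-strand features and
inexact matches would still need BLAST or manual inspection.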
