Prot_map, a fast tool for aligning proteins to a genome and reconstructing
exon-intron structure, has been developed recently and is available to run at:
http://sun1.softberry.ru/berry.phtml?topic=prot_map&group=programs&subgroup=xmap
Prot_map maps a set of protein sequences onto a genomic sequence, producing
gene structures and the corresponding alignments of coding exons with similar
or identical protein queries. It takes a genomic sequence and a set of protein
sequences as input. Rather than the set of unordered alignment fragments that
the Blast program generates, Prot_map reconstructs the gene structure on the
basis of an identical or similar protein. The program is very fast and
produces gene structures with accuracy similar to that of the slow GeneWise
program, which in practice requires knowing the genomic location of the
protein (Table 1). The accuracy of gene reconstruction can be improved
significantly further by running the Fgenesh+ program on the results of
Prot_map, i.e. a fragment of genomic sequence and the protein sequence mapped
onto it (Table 2).
Prot_map is used (1) in the Fgenesh++ pipeline for automatic annotation of
new genomic sequences and (2) to generate a set of genes in new genomes
(those without known genes) for training the parameters of gene-finding
programs. (3) It is also very useful for finding pseudogenes, by selecting
the corrupted gene structures that result from mapping a set of known
proteins.
Figure 1. Example of mapping a protein sequence onto human chromosome 19.
L:3000000 Sequence Chr19 [cut:1 3000000]
[DD] Sequence: 1( 1), S: 105.56, L:1739
IPI:IPI00170643.1|SWISS-PROT:Q8TEK3-1 Tax_Id=9606 Splice isoform 2 of Q8TEK3
Summ of block lengths: 1284, Alignment bounds:
On first sequence: start 2146727, end 2167197, length 20471
On second sequence: start 263, end 1682, length 1420
Blocks of alignment: 21
1 E: 2146727 70 [ca GT] P: 2146727 263 L: 23, G: 101.574 S:14.75
2 E: 2147573 107 [AG GT] P: 2147575 287 L: 35, G: 103.465, S:18.56
3 E: 2148934 42 [AG GT] P: 2148934 322 L: 14, G: 103.043, S:11.68
4 E: 2150399 111 [AG GT] P: 2150399 336 L: 37, G: 102.130, S:18.82
5 E: 2150620 235 [AG GT] P: 2150620 373 L: 78, G: 101.500, S:27.15
6 E: 2151098 114 [AG GT] P: 2151100 452 L: 37, G: 106.924, S:19.76
7 E: 2151750 92 [AG GT] P: 2151752 490 L: 30, G: 101.424, S:16.82
8 E: 2153538 102 [AG GT] P: 2153538 520 L: 34, G: 100.496, S:17.73
9 E: 2153848 138 [AG GT] P: 2153848 554 L: 46, G: 99.003, S:20.30
10 E: 2154470 126 [AG GT] P: 2154470 600 L: 42, G: 101.283, S:19.87
1 11 2146713 2146723 2146739 2146769
gatcacagaggctgg(..)agtgtctgtgtttca?[GGRIVSSKPFAPLNFRINSRNLSg
---------------(..)evdhqlkerfanmke GGRIVSSKPFAPLNFRINSRNLS-
248 248 249 259 267 277
2146797 2146806 2147558 2147568 2147581 2147611
]gtaagaaactctcat(..)ctgtggctcctgcag[acIGTIMRVVELSPLKGSVSWTGK
---------------(..)--------------- -dIGTIMRVVELSPLKGSVSWTGK
286 286 286 286 289 299
2147641 2147671 2147686 2148919 2148926 2148937
PVSYYLHTIDRTI]gtgagtatctcgctg(..)ctttcttctttttag[LENYFSSLKNP
PVSYYLHTIDRTI ---------------(..)--------------- LENYFSSLKNP
309 319 322 322 322 323
2148967 2148982 2150384 2150391 2150402 2150432
KLR]gtaagtttgtgtgtt(..)ctgctctccttccag[EEQEAARRRQQRESKSNAATP
KLR ---------------(..)--------------- EEQEAARRRQQRESKSNAATP
333 336 336 336 337 347
2150462 2150492 2150513 2150523 2150609 2150619
TKGPEGKVAGPADAPM]gtaaggccccagcct(..)ccttgtgtcctccag[DSGAEEEK
TKGPEGKVAGPADAPM ---------------(..)--------------- DSGAEEEK
357 367 373 373 373 373
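The block listing in Figure 1 is regular enough to be read by a script. The following is a minimal sketch in Python, not part of Prot_map itself. It assumes, from the numbers in the figure, that the two values after "E:" are the genomic start and the length of the exon in nucleotides, that the two values after "P:" are a genomic and a protein position, and that "L:" is the aligned length in residues. The "G:" and "S:" values are kept as plain numbers without interpretation. The names Block, parse_blocks and intron_lengths are hypothetical.

```python
import re
from typing import List, NamedTuple


class Block(NamedTuple):
    index: int        # block number in the listing
    exon_start: int   # first value after "E:" (assumed genomic start)
    exon_len: int     # second value after "E:" (assumed length in nt)
    left_site: str    # dinucleotide before the exon, e.g. "AG"
    right_site: str   # dinucleotide after the exon, e.g. "GT"
    align_start: int  # first value after "P:" (assumed genomic position)
    protein_pos: int  # second value after "P:" (assumed protein position)
    aa_len: int       # value after "L:" (assumed length in residues)
    g: float          # value after "G:", not interpreted here
    s: float          # value after "S:", not interpreted here


# The comma after the "G:" value is missing on some lines, so it is optional.
BLOCK_RE = re.compile(
    r"^\s*(\d+)\s+E:\s*(\d+)\s+(\d+)\s+\[(\w+)\s+(\w+)\]\s+"
    r"P:\s*(\d+)\s+(\d+)\s+L:\s*(\d+),?\s+G:\s*([\d.]+),?\s+S:\s*([\d.]+)"
)


def parse_blocks(text: str) -> List[Block]:
    """Return one Block per line that looks like a block record."""
    blocks = []
    for line in text.splitlines():
        m = BLOCK_RE.match(line)
        if m is None:
            continue  # skip headers, alignments and anything else
        blocks.append(Block(
            int(m.group(1)), int(m.group(2)), int(m.group(3)),
            m.group(4), m.group(5), int(m.group(6)), int(m.group(7)),
            int(m.group(8)), float(m.group(9)), float(m.group(10)),
        ))
    return blocks


def intron_lengths(blocks: List[Block]) -> List[int]:
    """Gap in nucleotides between each pair of consecutive exons."""
    return [
        nxt.exon_start - (cur.exon_start + cur.exon_len)
        for cur, nxt in zip(blocks, blocks[1:])
    ]


SAMPLE = """\
1 E: 2146727 70 [ca GT] P: 2146727 263 L: 23, G: 101.574 S:14.75
2 E: 2147573 107 [AG GT] P: 2147575 287 L: 35, G: 103.465, S:18.56
"""

if __name__ == "__main__":
    parsed = parse_blocks(SAMPLE)
    print(parsed[0])
    print(intron_lengths(parsed))
```

Under these assumptions the first exon ends at 2146796 and the second begins at 2147573, so the sketch reports an intron of 776 nucleotides between blocks 1 and 2, which agrees with the intron starting at 2146797 in the alignment.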
Table 1. Speed of processing sequences by Prot_map, Fgenesh+ and GeneWise.
                                  Fgenesh+   Prot_map   GeneWise
88 sequences of genes < 20 kb     ~1 min     ~1 min     ~90 min
8 sequences of genes > 400 kb     ~1 min     ~1 min     ~1200 min
Table 2. Comparison of the accuracy of gene identification programs: ab initio
Fgenesh, and prediction with protein support by Fgenesh+, GeneWise and
Prot_map, on a set of human genes using mouse or Drosophila homologous
proteins. %CG (correct genes) is the percentage of exactly predicted genes.
Mouse homologs: 60% < similarity level < 80% - 1425 sequences
           Sn ex   Sno ex   Sp ex   Sn nuc   Sp nuc   CC       %CG
Fgenesh    83.4    90.9     86.8    93.2     94.9     0.937    30
GeneWise   88.1    96.5     90.5    97.8     99.2     0.984    43
Fgenesh+   93.9    97.9     94.9    98.4     99.3     0.988    65
Prot_map   87.0    96.5     86.6    97.0     98.5     0.976    40
Drosophila homologs: similarity level > 80% - 66 sequences
           Sn ex   Sno ex   Sp ex   Sn nuc   Sp nuc   CC       %CG
Fgenesh    90.5    93.8     95.1    97.9     96.9     0.950    55
GeneWise   79.3    83.9     86.8    97.3     99.5     0.985    23
Fgenesh+   95.1    97.8     97.0    98.9     99.5     0.9914   70
Prot_map   86.4    95.3     88.1    97.6     99.0     0.982    41
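The column names in Table 2 are not defined in the text. As a hedged illustration, the sketch below computes exon-level and nucleotide-level sensitivity (Sn) and specificity (Sp) and the correlation coefficient (CC) using the definitions commonly used in gene-prediction benchmarks: Sn = TP/(TP+FN), Sp = TP/(TP+FP), with CC computed over coding and non-coding nucleotides. That Table 2 was produced with exactly these formulas is an assumption, and the function names are hypothetical. Exons are given as inclusive (start, end) pairs, and the results are fractions rather than percentages.

```python
from math import sqrt
from typing import List, Set, Tuple

Exon = Tuple[int, int]  # inclusive (start, end) coordinates


def _positions(exons: List[Exon]) -> Set[int]:
    """Expand exons into the set of nucleotide positions they cover."""
    covered: Set[int] = set()
    for start, end in exons:
        covered.update(range(start, end + 1))
    return covered


def nucleotide_metrics(true_exons: List[Exon], pred_exons: List[Exon],
                       seq_len: int) -> Tuple[float, float, float]:
    """Return (Sn nuc, Sp nuc, CC) as fractions."""
    true_pos = _positions(true_exons)
    pred_pos = _positions(pred_exons)
    tp = len(true_pos & pred_pos)   # coding, predicted coding
    fp = len(pred_pos - true_pos)   # non-coding, predicted coding
    fn = len(true_pos - pred_pos)   # coding, predicted non-coding
    tn = seq_len - tp - fp - fn     # non-coding, predicted non-coding
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    denom = sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, cc


def exon_metrics(true_exons: List[Exon],
                 pred_exons: List[Exon]) -> Tuple[float, float]:
    """Return (Sn ex, Sp ex): only exons with both ends exact count."""
    exact = len(set(true_exons) & set(pred_exons))
    sn = exact / len(true_exons) if true_exons else 0.0
    sp = exact / len(pred_exons) if pred_exons else 0.0
    return sn, sp


if __name__ == "__main__":
    # Toy data: the second predicted exon starts 10 nt too late.
    annotated = [(101, 200), (301, 400)]
    predicted = [(101, 200), (311, 400)]
    print(exon_metrics(annotated, predicted))
    print(nucleotide_metrics(annotated, predicted, seq_len=1000))
```

In the toy data one of the two exons is predicted exactly, so exon-level Sn and Sp are both 0.5, while nucleotide-level Sn is 0.95 because only 10 of 200 coding nucleotides are missed. This is why the exon columns in Table 2 are lower than the nucleotide columns for every program.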
---