New FGENESB - Finding genes in microbial genomes
The new FgenesB is the fastest (the E. coli genome is analyzed in ~14 sec) and most
accurate ab initio bacterial gene prediction program available.
http://www.softberry.com/berry.phtml?topic=fgenesb
It uses parameters learned for different bacteria by the FgenesB-train script,
whose only input is the new bacterial sequence. The script automatically creates
a file with gene prediction parameters for the analyzed organism.
It takes only ~10 minutes to create such a file for a genome like that of
E. coli from its sequence. If you need parameters for a new bacterium,
please contact Softberry Inc.; we can include them in the Web list.
The algorithm is based on pattern recognition of different types of signals
and Markov chain models of coding regions. The optimal combination of these
features is then found by dynamic programming, and a set of gene models
is constructed along the given sequence.
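
To illustrate the dynamic programming step, here is a minimal Python sketch. It is not the FgenesB implementation: it assumes each candidate gene already has a single combined score (for example from signal and Markov chain models), ignores strands, and forbids any overlap between chosen genes. The function name best_gene_set and the toy candidates are hypothetical.

```python
from bisect import bisect_right

def best_gene_set(candidates):
    """Pick the highest-scoring set of non-overlapping candidates.

    candidates: list of (start, end, score), 1-based inclusive coordinates.
    Returns (total_score, chosen_candidates).
    Simplifying assumption: chosen genes may not overlap at all.
    """
    cands = sorted(candidates, key=lambda c: c[1])  # order by end position
    ends = [c[1] for c in cands]
    # best[i] holds the optimum over the first i candidates
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(cands):
        # j = number of earlier candidates that end before this one starts
        j = bisect_right(ends, start - 1, 0, i)
        take = (best[j][0] + score, best[j][1] + [(start, end, score)])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]

# Toy candidates (hypothetical scores)
toy = [(1, 100, 50.0), (90, 300, 120.0), (301, 400, 30.0), (150, 350, 100.0)]
total, chosen = best_gene_set(toy)
print(total, chosen)
```

Real bacterial genes can overlap and lie on either strand, so a production model would need a richer state than this one-dimensional recurrence.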
--------------------------------------------------------------------------------
Accuracy of prediction estimated on the B. subtilis sequence:
Genes whose start is not the first possible start codon - 19.1%
Borodovsky et al. (see the GeneMark Web pages) calculated accuracy for all genes
and for 3 sets of difficult short genes (L <= 300 bp) with protein similarity
support, to demonstrate that short genes can also be predicted reasonably well.
The first set (51set) has 51 genes with at least 10 strong similarities to known
proteins. The 72set has 72 genes with at least 2 strong similarities, and
the 123set has 123 genes with at least one homolog.
Here are the results for GeneMarkS and Glimmer as they calculated them, and for
FgenesB (after 3 iterations of FgenesB-train):
                   Sn, %                  Sn, %
                   (exact predictions)    (exact + overlapping predictions)
123set:
  Glimmer          57.0                   91.1
  GeneMarkS        82.9                   91.9
  FgenesB          89.3                   98.4
72set:
  Glimmer          57.0                   91.7
  GeneMarkS        88.9                   94.4
  FgenesB          91.5                   98.6
51set:
  Glimmer          51.0                   88.2
  GeneMarkS        90.2                   94.1
  FgenesB          02.0                   98.0
All genes set:
  Glimmer          62.4                   98.1
  GeneMarkS        83.9                   96.7
  FgenesB          83.8                   98.7
(PS: note that many genes in GenBank are annotated using the GeneMark
program, which should lead to an overestimate of the accuracy for GeneMark.)
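
For readers who want to reproduce figures of this kind, the following Python sketch shows one way the two Sn columns could be computed. The definitions are assumptions, since the exact criteria are not spelled out here: "exact" means same strand, start and end; "overlapping" means same strand with any shared position. The function name and the toy gene lists are hypothetical.

```python
def sensitivity(annotated, predicted):
    """Return (Sn_exact, Sn_exact_plus_overlapping) in percent.

    Genes are (strand, start, end) tuples, 1-based inclusive.
    Assumed definitions: exact = identical tuple; overlapping = same
    strand and at least one shared position (this includes exact hits).
    """
    pred_set = set(predicted)

    def overlaps(a, b):
        return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]

    exact = sum(1 for g in annotated if g in pred_set)
    found = sum(1 for g in annotated
                if any(overlaps(g, p) for p in predicted))
    n = len(annotated)
    return 100.0 * exact / n, 100.0 * found / n

# Toy data, not taken from any real annotation
annotated = [("+", 100, 400), ("-", 500, 800), ("+", 900, 1200), ("+", 1500, 1700)]
predicted = [("+", 100, 400), ("-", 500, 830), ("+", 2000, 2300)]
print(sensitivity(annotated, predicted))
```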
FgenesB output:
bact Tue Aug 27 00:12:46 EDT 2002
FgenesB: Finding genes in microbial genomes (Softberry Inc.)
Time: Tue Aug 27 00:12:46 2002
Seq name: Softberry SERVER PAST Sequence
Length of sequence - 12780 bp Parameters: Escherichia_coli_K-12.dat
Number of predicted genes - 12
N S Start End Score
1 + CDS 190 - 255 100.0
2 + CDS 337 - 2799 2467.0
3 + CDS 2801 - 3733 785.0
4 + CDS 3734 - 5020 1493.0
5 + CDS 5234 - 5530 161.0
6 - CDS 5683 - 6459 870.0
7 - CDS 6529 - 7959 1033.0
8 + CDS 8238 - 9191 1319.0
9 + CDS 9306 - 9893 544.0
10 - CDS 9928 - 10479 775.0
11 - CDS 10643 - 11356 594.0
12 - CDS 11382 - 11786 394.0
.................................
Predicted protein(s):
>GENE 1 190 - 255 21 aa, chain +
MKRISTTITTTITITTGNGAG
>GENE 2 337 - 2799 820 aa, chain +
MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDA
LPNISDAERIFAELLTGLAAAQPGFPLAQLKTFVDQEFAQIKHVLHGISLLGQCPDSINA
ALICRGEKMSIAIMAGVLEARGHNVTVIDPVEKLLAVGHYLESTVDIAESTRRIAASRIP
ADHMVLMAGFTAGNEKGELVVLGRNGSDYSAAVLAACLRADCCEIWTDVDGVYTCDPRQV
PDARLLKSMSYQEAMELSYFGAKVLHPRTITPIAQFQIPCLIKNTGNPQAPGTLIGASRD
EDELPVKGISNLNNMAMFSVSGPGMKGMVGMAARVFAAMSRARISVVLITQSSSEYSISF
CVPQSDCVRAERAMQEEFYLELKEGLLEPLAVTERLAIISVVGDGMRTLRGISAKFFAAL
ARANINIVAIAQGSSERSISVVVNNDDATTGVRVTHQMLFNTDQVIEVFVIGVGGVGGAL
LEQLKRQQSWLKNKHIDLRVCGVANSKALLTNVHGLNLENWQEELAQAKEPFNLGRLIRL
VKEYHLLNPVIVDCTSSQAVADQYADFLREGFHVVTPNKKANTSSMDYYHQLRYAAEKSR
RKFLYDTNVGAGLPVIENLQNLLNAGDELMKFSGILSGSLSYIFGKLDEGMSFSEATTLA
REMGYTEPDPRDDLSGMDVARKLLILARETGRELELADIEIEPVLPAEFNAEGDVAAFMA
NLSQLDDLFAARVAKARDEGKVLRYVGNIDEDGVCRVKIAEVDGNDPLFKVKNGENALAF
YSHYYQPLPLVLRGYGAGNDVTAAGVFADLLRTLSWKLGV
>GENE 3 2801 - 3733 310 aa, chain +
MVKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVEAAETFSLNNLGRFADKLPSEP
RENIVYQCWERFCQELGKQIPVAMTLEKNMPIGSGLGSSACSVVAALMAMNEHCGKPLND
TRLLALMGELEGRISGSIHYDNVAPCFLGGMQLMIEENDIISQQVPGFDEWLWVLAYPGI
KVSTAEARAILPAQYRRQDCIAHGRHLAGFIHACYSRQPELAAKLMKDVIAEPYRERLLP
GFRQARQAVAEIGAVASGISGSGPTLFALCDKPETAQRVADWLGKNYLQNQEGFVHICRL
DTAGARVLEN
...............................
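
The gene table in the output above is easy to post-process. The Python sketch below is only an illustration based on the layout of this sample, not an official parser: the regular expression and the function names are assumptions. It also checks a pattern visible in the three proteins shown: the protein length equals the CDS length in codons minus one, which suggests the reported coordinates include the stop codon.

```python
import re

# One row of the gene table, e.g. "2 + CDS 337 - 2799 2467.0"
GENE_ROW = re.compile(
    r"^\s*(\d+)\s+([+-])\s+CDS\s+(\d+)\s+-\s+(\d+)\s+(\d+(?:\.\d+)?)\s*$"
)

def parse_gene_table(text):
    """Return a list of (number, strand, start, end, score) tuples."""
    genes = []
    for line in text.splitlines():
        m = GENE_ROW.match(line)
        if m:
            n, strand, start, end, score = m.groups()
            genes.append((int(n), strand, int(start), int(end), float(score)))
    return genes

def implied_protein_length(start, end):
    """Amino acids implied by CDS coordinates, assuming the span
    includes the stop codon (consistent with the sample output)."""
    return (end - start + 1) // 3 - 1

sample = """\
N S Start End Score
1 + CDS 190 - 255 100.0
2 + CDS 337 - 2799 2467.0
3 + CDS 2801 - 3733 785.0
"""
for n, strand, start, end, score in parse_gene_table(sample):
    print(n, strand, end - start + 1, "bp ->",
          implied_protein_length(start, end), "aa")
```

For the three rows above this gives 21, 820 and 310 aa, matching the lengths in the GENE 1, GENE 2 and GENE 3 headers.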