FGENESH_GC - NONCANONICAL GC donor sites in gene prediction / ALTERNATIVELY SPLICED genes
A version of the FGENESH program that allows the NONCANONICAL GC dinucleotide
in donor splice sites is available for on-line use:
www.softberry.com
This program is useful for analyzing ALTERNATIVE gene structures, where
non-standard splice sites are often found (see also the FGENES-M program, which
predicts alternative gene variants), and for creating A SET of GENES and
PROTEINS that are absent from standard gene predictions.
The GC donor splice site accounts for the major part of non-standard splice
sites in human genes. It represents about 0.6% of all splice sites and is
observed in more than 5% of human genes. Gene prediction on large-scale genomic
sequences will therefore encounter hundreds of GC-donor exons and requires
programs that can predict most of them.
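To make the terminology concrete: the donor site is the dinucleotide at the
start of the intron, immediately after the last base of an exon. A minimal
sketch in Python, assuming a plus-strand sequence and 1-based inclusive exon
coordinates; the function names and the toy sequence are ours, for illustration
only, and are not part of FGENESH.

```python
def donor_dinucleotide(seq: str, exon_end: int) -> str:
    """Return the two intronic bases right after a 1-based, inclusive exon end."""
    # With 1-based inclusive coordinates, the base after the exon sits at
    # 0-based index exon_end.
    return seq[exon_end:exon_end + 2].upper()


def classify_donor(dinuc: str) -> str:
    """Label a donor dinucleotide as canonical GT, noncanonical GC, or other."""
    if dinuc == "GT":
        return "canonical GT"
    if dinuc == "GC":
        return "noncanonical GC"
    return "other"


# Toy sequence (made up): a 9-base exon followed by an intron starting with GC.
toy = "ATGGCCAAG" + "GCAAGTTTTT"
print(classify_donor(donor_dinucleotide(toy, 9)))  # prints "noncanonical GC"
```

A gene finder restricted to GT donors cannot end an exon at position 9 of this
toy sequence, which is the situation FGENESH_GC is meant to handle.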
We recently investigated noncanonical splice sites
(Burset, Seledtsov and Solovyev, 2000, Nucleic Acids Res., 28(21), 4364-4375)
and obtained about 20000 EST-verified splice sites. We obtained a very strong
GC-donor site weight matrix, which is used in the gene prediction program. We
developed this variant of the program to predict GC-donor exons in addition to
standard exons while preserving the accuracy of the program on standard genes.
Testing the program on 68 human genes with at least one GC donor site shows
that FGENESH (GC) provides a 10% higher rate of exact exon prediction for this
group and 5% higher accuracy at the nucleotide level.
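The text does not define its accuracy measures. As an illustration only, the
sketch below uses definitions common in gene-prediction benchmarks: exon-level
sensitivity (the share of annotated exons predicted with both boundaries exact)
and nucleotide-level sensitivity and specificity. The function names and the
coordinates are hypothetical and are not taken from the 68-gene test set.

```python
def exon_level_sensitivity(true_exons, predicted_exons):
    """Fraction of annotated exons predicted with both boundaries exact."""
    true_set = set(true_exons)
    if not true_set:
        return 0.0
    return len(true_set & set(predicted_exons)) / len(true_set)


def coding_positions(exons):
    """All 1-based positions covered by (start, end) inclusive exons."""
    return {pos for start, end in exons for pos in range(start, end + 1)}


def nucleotide_sn_sp(true_exons, predicted_exons):
    """Nucleotide-level sensitivity and specificity of the coding prediction."""
    true_pos = coding_positions(true_exons)
    pred_pos = coding_positions(predicted_exons)
    shared = len(true_pos & pred_pos)
    sn = shared / len(true_pos) if true_pos else 0.0
    sp = shared / len(pred_pos) if pred_pos else 0.0
    return sn, sp


# Hypothetical annotation and prediction: the second exon starts 10 bases late.
true_exons = [(101, 200), (301, 400)]
pred_exons = [(101, 200), (311, 400)]

print(exon_level_sensitivity(true_exons, pred_exons))  # prints 0.5
print(nucleotide_sn_sp(true_exons, pred_exons))        # prints (0.95, 1.0)
```

The example shows why the two measures are reported separately: one shifted
boundary costs a whole exon at the exon level but only a few percent at the
nucleotide level.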
To run it, select Human parameters and click the FGENESH_GC button. Paste your
sequence into the window, or load a file with your sequence in FASTA format.
Solovyev V.V. (2001) Statistical approaches in eukaryotic gene prediction.
In Handbook of Statistical Genetics (eds. Balding D. et al.),
John Wiley & Sons, Ltd., pp. 83-127.
Fgenesh_GC output:
(IN THIS EXAMPLE the 2nd EXON, WHICH HAS a GC-DONOR SITE, IS FOUND; it is MISSED
by STANDARD gene finders)
G - predicted gene number, counting from the start of the sequence;
Str - DNA strand (+ for direct or - for complementary);
Feature - type of coding sequence: CDSf - first coding segment (starting with
the start codon), CDSi - internal exon, CDSl - last coding segment (ending with
the stop codon);
TSS - position of transcription start (TATA-box position and score);
Start and End - positions of the feature;
Weight (the Score column in the output below) - log likelihood*10 score for
the feature;
ORF - start/end positions where the first complete codon starts and the last
complete codon ends.
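The exon rows in the sample output that follows can be read mechanically using
the field descriptions above. A minimal parsing sketch in Python, assuming rows
are laid out exactly as in that sample (whitespace-separated, with an unlabeled
exon number before the feature type). The `Exon` class and its field names are
ours, and other row types such as TSS are not handled.

```python
from dataclasses import dataclass


@dataclass
class Exon:
    gene: int        # G: predicted gene number
    strand: str      # Str: "+" or "-"
    number: int      # exon number within the gene (unlabeled column)
    feature: str     # CDSf, CDSi or CDSl
    start: int       # Start of the feature
    end: int         # End of the feature
    score: float     # Score (Weight) column
    orf_start: int   # first base of the first complete codon
    orf_end: int     # last base of the last complete codon
    orf_len: int     # Len column


def parse_exon_row(line: str) -> Exon:
    """Parse one coding-exon row of the predicted-exons table."""
    t = line.split()
    if len(t) != 12 or not t[3].startswith("CDS"):
        raise ValueError(f"not a coding-exon row: {line!r}")
    # Tokens 5 and 9 are the "-" separators inside the two coordinate ranges.
    return Exon(int(t[0]), t[1], int(t[2]), t[3], int(t[4]), int(t[6]),
                float(t[7]), int(t[8]), int(t[10]), int(t[11]))


# Second exon of the sample output (the one with the GC donor site).
exon = parse_exon_row("1 + 2 CDSi 747 - 853 22.53 748 - 852 105")
print(exon.feature, exon.start, exon.end, exon.score)  # prints CDSi 747 853 22.53
print(exon.orf_end - exon.orf_start + 1 == exon.orf_len)  # prints True
```

In the sample rows, Len equals the length of the ORF range, which the last line
checks for this exon.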
fgeneshgc Wed Jan 30 20:59:27 EST 2002
FGENESH (with GC possible donor site) Gene prediction in Human genomic DNA
Time: Wed Jan 30 20:59:27 2002
Seq name: Softberry SERVER PAST Sequence
Length of sequence: 2932 GC content: 65 Zone: 4
Number of predicted genes 1 in +chain 1 in -chain 0
Number of predicted exons 5 in +chain 5 in -chain 0
Positions of predicted genes and exons:
 G Str  Feature   Start      End   Score  ORF               Len
 1  +    1 CDSf     501 -    580   15.57     501 -    578    78
 1  +    2 CDSi     747 -    853   22.53     748 -    852   105
 1  +    3 CDSi    1847 -   1980   17.97    1849 -   1980   132
 1  +    4 CDSi    2255 -   2333   10.88    2255 -   2332    78
 1  +    5 CDSl    2563 -   2705   15.94    2565 -   2705   141
Predicted protein(s):
>FGENESH 1 5 exon (s) 501 - 2705 180 aa, chain +
MADSELQLVEQRIRSFPDFPTPGVVFRDISPVLKDPASFRAAIGLLARHLKATHGGRIDY
IAGLDSRGFLFGPSLAQELGLGCVLIRKRGKLPGPTLWASYSLEYGKAELEIQKDALEPG
QRVVVVDDLLATGGTMNAACELLGRLQAEVLECVSLVELTSLKGREKLAPVPFFSLLQYE
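The exon coordinates and the protein in the sample output can be cross-checked
with a little arithmetic. A minimal sketch in Python; the variable names are
ours. It assumes, as the field descriptions state, that the last coding segment
ends with the stop codon, so the number of codons should exceed the protein
length by one.

```python
# Start/End coordinates of the five predicted exons, copied from the table.
exons = [(501, 580), (747, 853), (1847, 1980), (2255, 2333), (2563, 2705)]

# Predicted protein, copied from the output (three lines of 60 residues).
protein = (
    "MADSELQLVEQRIRSFPDFPTPGVVFRDISPVLKDPASFRAAIGLLARHLKATHGGRIDY"
    "IAGLDSRGFLFGPSLAQELGLGCVLIRKRGKLPGPTLWASYSLEYGKAELEIQKDALEPG"
    "QRVVVVDDLLATGGTMNAACELLGRLQAEVLECVSLVELTSLKGREKLAPVPFFSLLQYE"
)

# Total coding length is the sum of the inclusive exon lengths.
cds_length = sum(end - start + 1 for start, end in exons)
codons, remainder = divmod(cds_length, 3)

print(cds_length, codons, remainder, len(protein))  # prints 543 181 0 180
```

The 543 coding bases form 181 complete codons: 180 amino acids plus the stop
codon, in agreement with the "180 aa" in the protein header.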