
New version of FGENESH with GC-donor exons gene prediction

Victor Solovyev solovyev at sanger.ac.uk
Wed Dec 15 22:32:43 EST 1999


 We have installed a new version of the HMM-based gene-finding program
 FGENESH (GC), which predicts multiple genes in genomic DNA, including
 exons with GC donor sites, at
                     http://genomic.sanger.ac.uk/

 
FGENESH (with possible GC donor) - Prediction of multiple genes in
genomic DNA sequences
  Large-scale genomic sequences will contain hundreds of GC-donor
exons, so gene prediction on this scale requires programs that can
predict most of them.
  This is a NEW version of the FGENESH program that allows the
NONCANONICAL GC dinucleotide in donor splice sites. It is the first
program to include noncanonical exons in its predictions. The GC donor
site accounts for the major part of non-standard splice sites in human
genes: it represents about 0.6% of all splice sites and is observed in
more than 5% of human genes.
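As a rough consistency check on the two figures above, one can ask how
often a gene would carry at least one GC donor if every donor site were
GC independently with probability 0.006. The independence assumption,
the function name and the intron counts below are illustrative
assumptions, not values from the splice site study.

```python
def fraction_with_gc_donor(n_introns, p_gc=0.006):
    """Probability that a gene with n_introns donor sites has at least
    one GC donor, assuming the sites are independent (a simplification)."""
    return 1.0 - (1.0 - p_gc) ** n_introns

# Hypothetical intron counts, chosen only to show the trend.
for n in (1, 5, 9, 20):
    print(n, round(fraction_with_gc_donor(n), 4))
```

Under these assumptions a gene with about nine introns already has a
chance of roughly 5% of containing a GC donor, so a site frequency of
0.6% and a gene frequency above 5% are compatible.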
  We recently investigated noncanonical splice sites (Burset, Seledtsov
and Solovyev, 1999, in preparation) and obtained about 20000 splice
sites verified by ESTs. We obtained a very strong GC-donor site weight
matrix, which is used in the gene prediction program.
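For readers unfamiliar with weight matrices, the general idea can be
sketched as follows. This is a generic log-odds matrix built from a few
made-up aligned sites. It is not the FGENESH matrix or its training
procedure, and the site window, pseudocount, uniform background and all
names are assumptions for illustration.

```python
import math

BASES = "ACGT"

def build_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Return one {base: log-odds} dict per aligned position."""
    length = len(sites[0])
    matrix = []
    for pos in range(length):
        # Start from the pseudocount so unseen bases keep a finite score.
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append(
            {b: math.log2(counts[b] / total / background) for b in BASES}
        )
    return matrix

def score_site(matrix, candidate):
    """Sum the per-position log-odds of a candidate of the same length."""
    return sum(col[base] for col, base in zip(matrix, candidate))

# Hypothetical toy alignment: 3 exon bases, then an intron starting with GC.
toy_sites = ["AAGGCAAGT", "CAGGCAAGT", "AAGGCAAGA", "AAGGCGAGT"]
wm = build_weight_matrix(toy_sites)
print(score_site(wm, "AAGGCAAGT") > score_site(wm, "TTTGCTTTT"))
```

A candidate that resembles the aligned sites scores higher than one
that shares only the GC dinucleotide, which is what lets such a matrix
separate real GC donors from chance occurrences of GC.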
  We developed this variant of the program to predict GC-donor exons in
addition to standard exons while preserving the accuracy of the program
on standard genes. Testing the program on 68 human genes with at least
one GC donor site shows that FGENESH (GC) provides a 10% higher rate of
exact exon prediction for this group and 5% higher accuracy at the
nucleotide level.
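The two measures can be illustrated with a small sketch. The
announcement does not define them, so the definitions below
(exon-level and nucleotide-level sensitivity) are an assumption, and
the coordinates are hypothetical.

```python
def exon_level_sensitivity(true_exons, predicted_exons):
    """Fraction of true exons predicted with both boundaries exact."""
    predicted = set(predicted_exons)
    return sum(1 for e in true_exons if e in predicted) / len(true_exons)

def nucleotide_sensitivity(true_exons, predicted_exons):
    """Fraction of true coding positions covered by predicted exons."""
    def positions(exons):
        return {p for start, end in exons for p in range(start, end + 1)}
    true_pos = positions(true_exons)
    return len(true_pos & positions(predicted_exons)) / len(true_pos)

# Hypothetical coordinates (1-based, inclusive), for illustration only.
true_exons = [(100, 199), (300, 399), (500, 599)]
predicted = [(100, 199), (300, 389), (500, 599)]
print(exon_level_sensitivity(true_exons, predicted))
print(nucleotide_sensitivity(true_exons, predicted))
```

A single misplaced donor site costs a whole exon at the exon level but
only a few positions at the nucleotide level, which is why the two
figures can differ.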

Paste your sequence into the first window, or load a file with your
nucleotide sequence in FASTA format.

Paste your protein sequence into the second window.

     References: Salamov A.A., Solovyev V.V. (1999), unpublished data. 
     Please reference: CGG WEB server:
     http://genomic.sanger.ac.uk/ 

     Fgenesh  output: 

             
      G - serial number of the predicted gene (counted from the start
          of the sequence)
      Str - DNA strand (+ for direct, - for complementary)
      Feature - type of coding sequence:
                 CDSf - first coding segment, starting with the start
                        codon;
                 CDSi - internal coding segment (internal exon);
                 CDSl - last coding segment, ending with the stop codon;
                 CDSo - the only coding segment of a single-exon gene
      TSS - position of the transcription start (TATA-box position and
            score)
      Start and End - positions of the feature
      Weight - log likelihood*10 score for the feature
      ORF-start/end - positions where the complete codons start and end
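The columns described above can be read into records with a short
script. This is a minimal sketch, assuming each feature row sits on a
single line (mail wrapping can push the Len column onto the next line,
which would have to be rejoined first). The regular expression and
field names are hypothetical and are not part of FGENESH.

```python
import re

# Rows without End/ORF columns (TSS, PolA) match through the optional groups.
ROW = re.compile(
    r"^\s*(?P<gene>\d+)\s+(?P<strand>[+-])\s+(?P<type>\w+)\s+"
    r"(?P<start>\d+)(?:\s+-\s+(?P<end>\d+))?\s+(?P<score>-?\d+\.\d+)"
    r"(?:\s+(?P<orf_start>\d+)\s+-\s+(?P<orf_end>\d+)\s+(?P<len>\d+))?\s*$"
)

def parse_feature_row(line):
    """Return a dict for one feature row, or None if the line is not a row."""
    m = ROW.match(line)
    if m is None:
        return None
    row = m.groupdict()
    for key in ("gene", "start", "end", "orf_start", "orf_end", "len"):
        if row[key] is not None:
            row[key] = int(row[key])
    row["score"] = float(row["score"])
    return row

sample = [
    "   1 +   TSS    19447             -7.15",
    "   1 +   CDSf   19541 -   19632   16.12   19541 -   19630    90",
]
rows = [parse_feature_row(line) for line in sample]
print(rows[1]["type"], rows[1]["start"], rows[1]["end"], rows[1]["len"])
```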

 FGENESH-1.1 Prediction of potential genes in genomic DNA
          Time:   Thu Nov 28 19:25:51 1999.
          Seq name: HUMHBB      73308 bp    DNA             PRI       20-JAN-1994
          length of sequence  73308bp  G+C content: 39 Isochore: 1
          number of predicted genes 7 in +chain 7 in -chain 0
          number of predicted exons 18 in +chain 18 in -chain 0

            Gn S   Type   Start       End   Score        ORF           Len
            -- -   ----   -----       ---   -----        ---           ---
             1 +   TSS    19447             -7.15
             1 +   CDSf   19541 -   19632   16.12   19541 -   19630    90
             1 +   CDSi   19755 -   19977   14.12   19756 -   19977   222
             1 +   CDSl   20833 -   20961    2.99   20833 -   20961   129
             1 +   PolA   21055              1.05

             2 +   TSS    34437             -7.15
             2 +   CDSf   34531 -   34622   15.25   34531 -   34620    90
             2 +   CDSi   34745 -   34967   20.74   34746 -   34967   222
             2 +   CDSl   35854 -   35982    5.59   35854 -   35982   129
             2 +   PolA   36043              1.05

             3 +   TSS    39373             -7.15
             3 +   CDSf   39467 -   39558   15.25   39467 -   39556    90
             3 +   CDSi   39681 -   39903   20.74   39682 -   39903   222
             3 +   CDSl   40770 -   40898    5.74   40770 -   40898   129
             3 +   PolA   40959              1.05

             4 +   TSS    44415             -8.75
             4 +   CDSf   45995 -   46151   16.01   45995 -   46150   156
             4 +   CDSl   46997 -   47100    2.71   46999 -   47100   102
             4 +   PolA   47243              1.05

             5 +   TSS    54703             -4.45
             5 +   CDSf   54790 -   54881   13.41   54790 -   54879    90
             5 +   CDSi   55010 -   55232   14.20   55011 -   55232   222
             5 +   CDSl   56131 -   56259    3.87   56131 -   56259   129
             5 +   PolA   56365              1.05

             6 +   TSS    62100             -6.65
             6 +   CDSf   62187 -   62278   13.59   62187 -   62276    90
             6 +   CDSi   62409 -   62631   19.50   62410 -   62631   222
             6 +   CDSl   63482 -   63610   10.23   63482 -   63610   129
             6 +   PolA   63718              1.05

             7 +   TSS    68088             -9.45
             7 +   CDSo   68183 -   68428   14.87   68183 -   68428   246
             7 +   PolA   68509              1.05

          Predicted protein(s):
          >ID  1   3 exon (s)  19541  -  20961    147 aa, chain +
          MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPK
          VKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFG
          KEFTPEVQAAWQKLVSAVAIALAHKYH
          >ID  2   3 exon (s)  34531  -  35982    147 aa, chain +
          MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPK
          VKAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFG
          KEFTPEVQASWQKMVTGVASALSSRYH
          >ID  3   3 exon (s)  39467  -  40898    147 aa, chain +
          MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPK
          VKAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFG
          KEFTPEVQASWQKMVTAVASALSSRYH
          >ID  4   2 exon (s)  45995  -  47100     86 aa, chain +
          MGNPKVKAHGKKVLISFGKAVMLTDDLKGTFATLSDLHCNKLHVDPENFLVSTLRQRDID
          CFGNPLQRGFYPTDTGFLAVTNKCCG
          >ID  5   3 exon (s)  54790  -  56259    147 aa, chain +
          MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPK
          VKAHGKKVLGAFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFG
          KEFTPQMQAAYQKVVAGVANALAHKYH
          >ID  6   3 exon (s)  62187  -  63610    147 aa, chain +
          MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
          VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
          KEFTPPVQAAYQKVVAGVANALAHKYH
          >ID  7   1 exon (s)  68183  -  68428     81 aa, chain +
          MEQSWAENDFDELREEGFRRSNYSKLKEEVRTNGKEVKNFEKKLDEWITRITNAQKSLKD
          LMELKTKAGELRDKYTSLSNR
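The coordinates in the feature table can be cross-checked against the
reported protein lengths. This is a minimal sketch, assuming that CDS
coordinates are 1-based and inclusive and that the last segment ends
with the stop codon, as the legend states for CDSl. The function name
is hypothetical. The coordinates and protein lengths are copied from
genes 1, 4 and 7 of the output above.

```python
def protein_length_from_cds(segments):
    """Amino acids encoded by CDS segments given as (start, end) pairs,
    1-based inclusive, assuming the last segment ends with a stop codon."""
    total = sum(end - start + 1 for start, end in segments)
    if total % 3 != 0:
        raise ValueError("CDS length is not a multiple of 3")
    return total // 3 - 1  # the stop codon encodes no amino acid

# gene number -> (CDS segments, reported protein length in aa)
genes = {
    1: ([(19541, 19632), (19755, 19977), (20833, 20961)], 147),
    4: ([(45995, 46151), (46997, 47100)], 86),
    7: ([(68183, 68428)], 81),
}
for gene, (segments, reported_aa) in genes.items():
    print(gene, protein_length_from_cds(segments), reported_aa)
```

The Start/End columns are used here rather than the ORF columns
because the ORF positions leave out the codons that are split across
exon junctions.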