Neural Network programs for sequence analysis?

Jacob Engelbrecht engel at biobase.aau.dk
Tue Sep 7 03:46:05 EST 1993

Some years ago we published the following, and we continue to work
with neural networks, as well as other methods, for localizing structure
and function in biological sequences.

Jacob Engelbrecht
Center for Biological Sequence Analysis
engel at virus.fki.dth.dk


   The  NetGene  mail  server  is a  service  producing  neural  network
   predictions  of splice  sites in  vertebrate  genes as described  in:
   Brunak, S.,  Engelbrecht,  J., and Knudsen, S.  (1991)  Prediction of
   Human mRNA Donor and Acceptor  Sites from the DNA  Sequence.  Journal
   of Molecular Biology, 220, 49-65.


   Artificial  neural  networks have been applied to the  prediction  of
   splice site location in human  pre-mRNA.  A joint  prediction  scheme
   where  prediction  of transition  regions  between  introns and exons
   regulates  a cutoff  level for  splice  site  assignment  was able to
   predict splice site locations with confidence  levels far better than
   previously  reported in the  literature.  The  problem of  predicting
   donor and  acceptor  sites in human genes is hampered by the presence
   of  large  numbers  of  false  positives.  In  the  paper  the
   distribution  of these false splice sites is examined and linked to a
   possible  scenario  for the  splicing  mechanism  in vivo.  When  the
   presented  method detects 95% of the true donor and acceptor sites it
   makes less than 0.1% false donor site  assignments and less than 0.4%
   false acceptor site assignments.  For the large data set used in this
   study this means that on the  average  there are one and a half false
   donor sites per true donor site and six false acceptor sites per true
   acceptor site.  With the joint assignment method more than a fifth of
   the true donor sites and around one fourth of the true acceptor sites
   could  be  detected  without  accompaniment  of  any  false  positive
   predictions.  Highly  confident  splice  sites  could not be isolated
   with a widely used weight  matrix  method or by separate  splice site
   networks.  A complementary  relation between the confidence levels of
   the  coding/non-coding  and the  separate  splice site  networks  was
   observed, with many weak splice sites having sharp transitions in the
   coding/non-coding  signal and many stronger  splice sites having more
   ill-defined transitions between coding and non-coding.


   In order to use the NetGene mail-server:

   1) Prepare a file with the sequence in a format  similar to the fasta
      format:  the first line must start  with the symbol  '>', the next
      word  on  that  line  is  used  as the  sequence  identifier.  The
      following lines should contain the actual sequence,  consisting of
      the symbols A, T, U, G, C and N.  U is converted to T, letters not
      mentioned  are converted to N.  All letters are converted to upper
      case.  Numbers,  blanks and other  nonletter  symbols are skipped.
      The lines  should not be longer than 80  characters.  The  minimum
      length  analyzed  is 451  nucleotides,  and the  maximum is 100000
      nucleotides  (your  mail  system  may have a lower  limit  for the
      maximum  size of a message).  Due to the  non-local  nature of the
      algorithm  sites  closer than 225  nucleotides  to the ends of the
      sequence will not be assigned.

   2) Mail the file to netgene at virus.fki.dth.dk.  The response time will
      depend on system  load.  If nothing else is running on the machine
      the  speed is about  1000  nucleotides/min.  It may  take  several
      hours  before you get the answer, so please do not  resubmit a job
      if you get no answer within a short while.
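
   The preparation rules in step 1 can be sketched in Python as shown
   below.  This is only an illustration of the rules as stated above,
   not the server's own code.  The function names and the default line
   width of 60 are assumptions, and so is the reading of the 225
   nucleotide margin as "positions 226 through length - 225".  The time
   estimate uses the unloaded speed of about 1000 nucleotides/min
   quoted in step 2.

```python
# Illustrative sketch only; names and defaults are hypothetical.
MIN_LENGTH = 451       # minimum length analyzed
MAX_LENGTH = 100_000   # maximum length analyzed
END_MARGIN = 225       # sites closer than this to an end are not assigned
MAX_LINE = 80          # lines should not be longer than this


def normalize_sequence(raw: str) -> str:
    """Apply the stated rules: skip non-letters, upper-case, U -> T, others -> N."""
    out = []
    for ch in raw:
        if not ch.isalpha():
            continue  # numbers, blanks and other non-letter symbols are skipped
        ch = ch.upper()
        if ch == "U":
            ch = "T"
        elif ch not in "ATGCN":
            ch = "N"
        out.append(ch)
    return "".join(out)


def format_submission(identifier: str, raw: str, width: int = 60) -> str:
    """Build the message body: a '>' header line followed by wrapped sequence lines."""
    if not 0 < width <= MAX_LINE:
        raise ValueError("line width must be between 1 and 80 characters")
    seq = normalize_sequence(raw)
    if not MIN_LENGTH <= len(seq) <= MAX_LENGTH:
        raise ValueError(f"sequence length {len(seq)} is outside 451..100000")
    lines = [">" + identifier]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"


def assignable_range(length: int) -> tuple[int, int]:
    """1-based inclusive positions at least END_MARGIN away from both ends (assumed reading)."""
    return END_MARGIN + 1, length - END_MARGIN


def estimate_minutes(length: int, nucleotides_per_min: float = 1000.0) -> float:
    """Rough processing time on an otherwise idle machine."""
    return length / nucleotides_per_min
```

   For example, format_submission("test", text) returns the text of a
   file that could be mailed as in step 2.  Under the assumed reading
   of the margin, a sequence of the minimum length of 451 nucleotides
   leaves exactly one position where a site can be assigned.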


   Publication  of output from  NetGene must be  referenced  as follows:
   Brunak, S.,  Engelbrecht,  J., and Knudsen, S.  (1991)  Prediction of
   Human mRNA Donor and Acceptor  Sites from the DNA  Sequence.  Journal
   of Molecular Biology, 220, 49-65.

   Your  submitted  sequence  will be  deleted automatically immediately 
   after processing by NetGene.


   Questions and comments should be addressed to:

   Jacob Engelbrecht

   e-mail: engel at virus.fki.dth.dk

   Department of Physical Chemistry
   The Technical University of Denmark
   Building 206
   DK-2800 Lyngby

   phone: +45 4288 2222 ext. 2478 (operator)
   phone: +45 4593 1222 ext. 2478 (tone)
   fax:   +45 4593 4808


   A file test.seq is prepared with an editor with the following contents:

   . Here come more lines with sequence.

   This is sent to the NetGene mail server on a Unix system like this:

      mail netgene@virus.fki.dth.dk < test.seq

   In return, an answer similar to this is produced:
