intron/exon borders

Dan Jacobson danj at welchgate.welch.jhu.edu
Wed Dec 30 17:09:52 EST 1992

In article <30DEC199214434286 at aardvark.ucs.uoknor.edu> bfrank at aardvark.ucs.uoknor.edu (FRANK,BART) writes:
>Can anyone recommend a good program to screen human genomic sequences
>and predict positions of intron/exon borders?
>Bart Frank

There are three mail servers which do this type of thing, namely
GRAIL, GENEID, and GENMARK.  I am including information about these
servers below.

Happy Holidays,

Dan Jacobson

danj at welchgate.welch.jhu.edu


Welcome to GRAIL (Gene Recognition and Analysis Internet Link)

Grail is an interface to a system which will ultimately provide
automated gene assembly from DNA sequence data.  Currently the
system provides analysis of protein coding potential of a DNA
sequence.  The coding recognition module (CRM) uses a multiple-sensor
neural network approach to identify coding exons that are
at least 100 bases long.  In its current configuration the CRM
identifies 90% of such regions with less than 1 false positive
coding exon per 5 coding exons indicated. Your success rate will
depend on a number of parameters including the G/C content of 
your sequence. In general, coding regions in sequences of low 
G/C content are not as well recognized as those in higher G/C.
Investigation is underway to try to improve the performance
for low G/C sequences.

This part of the system is specifically designed to locate
regions of DNA sequence with protein encoding potential.  The
system has been trained to recognize coding regions in Human DNA
but seems to work well on DNA sequences from other mammals. 
Because the system has not been tested extensively on species
other than human, no claims are made for the predictions of
coding potential on DNAs from other species.

To use GRAIL you must first register and get a user ID. 
To become a registered user please send the following
e-mail message to:

    grail at ornl.gov

Your Name
Your address
Your phone number
Your e-mail address

To have sequences analyzed send e-mail to:

     grail at ornl.gov

The message must start with the word "Sequences", followed by the
number of sequences you are sending, then your user ID, and then
the sequences you wish to have analyzed, in the following format:

Sequences number_of_sequences  your_user_ID
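
As a minimal sketch of the header line described above, the following
Python builds the body of a GRAIL request and checks the length limits
given in this help file (at least 100 bases, not more than 100kb).  The
names check_length and build_grail_request are hypothetical, "100kb" is
read as 100,000 bases as an assumption, and each sequence block is
passed through unchanged as text that is already formatted for GRAIL.

```python
MIN_BASES = 100        # shortest sequence GRAIL will interpret
MAX_BASES = 100_000    # assumption: "100kb" read as 100,000 bases


def check_length(bases: str) -> bool:
    """Return True if the bare sequence is within the stated limits."""
    n = sum(1 for c in bases if not c.isspace())
    return MIN_BASES <= n <= MAX_BASES


def build_grail_request(user_id: str, sequence_blocks: list[str]) -> str:
    """Build the body of a mail to the GRAIL server.

    The first line is "Sequences <count> <user ID>"; the sequence
    blocks follow exactly as supplied by the caller.
    """
    header = f"Sequences {len(sequence_blocks)} {user_id}"
    return "\n".join([header, *sequence_blocks]) + "\n"
```

The resulting text would be sent as the message body to the GRAIL
address given above.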



For the system to return any interpretation the sequence to be
analyzed must be at least 100 bases long (and not more than
100kb).  For each sequence the following information will be
returned:
1.  The score for the coding potential for each position analyzed
on each strand (the f-(forward) strand represents the sequence as
received, and the r-(reverse) strand is the reverse complement).
These scores range from  0.0 to 1.0 and a score greater than 0.5
identifies a region with protein encoding potential. Non-coding
regions often have a score of 0.000. To reduce the output, only 
regions with scores of at least 0.01 are reported.
2.  frame.  In calculating the coding potential, the system
calculates the reading-frame which is "preferred" in the window
over which the calculation is done and this information is
returned for regions with scores over 0.5.
3.  orf.  The limits between which the preferred frame is open are
returned for windows with scores over 0.5.
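
The strand convention and score thresholds listed above can be
sketched in Python as follows.  This is only an illustration of how a
user might read the output, not GRAIL's own code; the function names
and the category labels "coding", "reported" and "omitted" are
assumptions introduced here.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the r-strand: the complement of seq, read in reverse."""
    return seq.translate(COMPLEMENT)[::-1]


def classify_score(score: float) -> str:
    """Map one CRM score (0.0 to 1.0) to the cases described above."""
    if score > 0.5:
        return "coding"      # frame and orf are also returned
    if score >= 0.01:
        return "reported"    # listed, but below the coding threshold
    return "omitted"         # left out to reduce the output
```

For example, reverse_complement("ATGC") gives "GCAT", and a score of
exactly 0.5 is reported but not treated as coding.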

The second part of the output is the system's interpretation of
the raw data. This output gives the limits (in general a minimum)
of the extent of the coding exon, the most likely strand for the
exon with a probability for the correctness of the strand
assignment, the preferred reading frame for the exon and a
quality assessment.  An interesting phenomenon we have noted
is that some exons seem to have coding character on both strands
or even more coding character on the wrong strand. Be aware that strand
assignments are not always correct, and it is sometimes useful to
consider both strands as possible. Any exon with a quality score of
"excellent" is worth further consideration.  Please remember that
the system is designed to find coding exons of 100 or more bases,
so small coding exons may well be missed.

This implementation of the CRM has been tested on a set of human 
genes containing 102kb of sequence. This set contained 70 coding
exons; the system identified 62 (89%) and assigned them all to
the correct strand (though in a larger test set strand assignment
was 90-95% correct). The preferred reading frame assignment was
correct for 60 (96%) of these exons, while the frame assignment
for the other two had some ambiguity. Of the eight missed exons, six
were less than 100 bases long. Of 43 predicted exons with a quality
score of "excellent", all were actual coding exons. Of predicted
exons scoring "good", 11 of 16 (69%) were actual exons, and of 49
predicted exons with a score of "marginal", only 8 (16%) were
"real". Though this is a rather limited test set, the results
of this analysis give some guidance for interpreting CRM output.
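
The arithmetic behind these figures can be reproduced with a short
Python sketch.  The counts are taken from the paragraph above; the
variable names are chosen here for illustration only.

```python
total_exons = 70      # coding exons in the test set
found = 62            # exons identified by the CRM
frame_correct = 60    # identified exons with the correct frame

# quality label -> (predicted exons, predictions that were real exons)
by_quality = {
    "excellent": (43, 43),
    "good": (16, 11),
    "marginal": (49, 8),
}


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


sensitivity = percent(found, total_exons)
frame_accuracy = percent(frame_correct, found)
precision = {q: percent(real, n) for q, (n, real) in by_quality.items()}

print(f"exons found:   {sensitivity:.1f}%")
print(f"frame correct: {frame_accuracy:.1f}%")
for quality, value in precision.items():
    print(f"{quality:>9}: {value:.1f}% real")
```

Note that the real exons across the three quality classes (43, 11
and 8) add up to the 62 exons identified.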

N.B.  This is an alpha+ version so we are open to feedback.
We have a new e-mail address, GRAILMAIL at ORNL.GOV,
for user feedback to the GRAIL staff. Communication can also be
addressed specifically to us:

Direct questions to:  Richard J. Mural, e-mail:
     m9l at stc10.ctd.ornl.gov
     Phone: 615-576-2938


Edward C. Uberbacher, e-mail:
     uber at msr.epm.ornl.gov
     Phone: 615-574-6134


GRAIL staff, e-mail:
     grailmail at ornl.gov

To receive a copy of this help file send the message "help" to
     grail at ornl.gov. 

Appendix A: GRAIL updates
Modifications to the GRAIL rule base for constructing the exon
table from the coding probability information have been made as
of Feb. 19, 1992. These changes have been designed to recognize
situations where a single real exon, usually with significant 
extent, is recognized by GRAIL as multiple peaks or multiple exons.
These additional rules interconnect predicted peaks under
circumstances where consecutive predicted regions have the same
preferred reading frame, the frame is open between them, and they
are relatively close together. The result is generally a beneficial
simplification of the exon table and a more accurate representation
of exon structure. This also better adapts GRAIL for use with cDNAs.
Feedback or questions can be addressed to GRAILMAIL at ornl.gov.

  The GRAIL staff
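
The merging rule described in Appendix A might be sketched in Python
as below.  This is not the actual GRAIL rule base: the Region fields,
the max_gap value of 100 bases standing in for "relatively close
together", and the use of orf_end to decide whether the frame is open
between two regions are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Region:
    start: int     # first base of the predicted coding region
    end: int       # last base of the predicted coding region
    frame: int     # preferred reading frame, in sequence coordinates
    orf_end: int   # last base up to which that frame stays open


def merge_regions(regions: list[Region], max_gap: int = 100) -> list[Region]:
    """Join consecutive regions that share a frame, lie within max_gap
    bases of each other, and have an open frame between them."""
    merged: list[Region] = []
    for r in sorted(regions, key=lambda x: x.start):
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev.frame == r.frame          # same preferred frame
                and r.start - prev.end <= max_gap  # relatively close
                and prev.orf_end >= r.start):      # frame open between
            merged[-1] = Region(prev.start, max(prev.end, r.end),
                                prev.frame, max(prev.orf_end, r.orf_end))
        else:
            merged.append(Region(r.start, r.end, r.frame, r.orf_end))
    return merged
```

With these assumptions, two peaks at 100-250 and 300-500 in the same
open frame would be reported as a single exon at 100-500.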


>------------------------------ GENE-ID OUTPUT -------------------------------<

                              GENEID UPDATES

1. The top ranking gene model is now automatically compared to protein
   databases using the BLAST Network Service provided by the National 
   Center for Biotechnology Information.  The results will be mailed to
   you separately and might give you some clues as to the function of
   your gene.

2. NETGENE is now available on this server. Just include the keyword line
   "NetGene" between the keyword line "Genomic Sequence" and your
   sequence.  More information is available in the info file which can be
   obtained by including the keyword line "geneid info".

3. GENEID was originally developed to predict the exon structure of
   full-length pre-mRNA. If the sequence does not contain first or last
   exons, then GENEID will still try to predict first and last exons,
   although they will tend to be short (<15 bp) and have low scores 
   (<0.5). The lack of first or last exons may also affect the prediction
   of internal exons (see items 5-7 of the output). A future version
   will allow scanning for internal exons in small gene fragments.

4. If you have success in confirming GENEID predictions, we would like to
   hear about it. Send an email to steen at darwin.bu.edu.
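
As a hedged illustration of item 2 above, the following Python sketch
places the "NetGene" keyword line between "Genomic Sequence" and the
sequence.  The function name build_geneid_request is hypothetical, and
the sequence block is treated as text already prepared in the format
GENEID expects; the 80-character check reflects the line length limit
given in the GENEID help text.

```python
MAX_LINE = 80  # line length limit stated in the GENEID help text


def build_geneid_request(sequence_block: str, netgene: bool = False) -> str:
    """Build a GENEID mail body, optionally requesting NetGene too."""
    lines = ["Genomic Sequence"]
    if netgene:
        lines.append("NetGene")   # goes between the keyword and sequence
    lines.extend(sequence_block.splitlines())
    for line in lines:
        if len(line) > MAX_LINE:
            raise ValueError(f"line longer than {MAX_LINE} characters")
    return "\n".join(lines) + "\n"
```

Only one sequence can be submitted per mail, so the function takes a
single block.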


                           version 1.0 2/1/1992

Geneid is an Artificial Intelligence system for analyzing vertebrate genomic
DNA and predicting exons and gene structure (1). A prototype is implemented
as a fast, automatic email-response system. Users have the option of having 
their DNA sequence analyzed by NetGene (2) simultaneously.

Before or simultaneously with submitting a sequence for analysis, you need to
register your name by sending a line with the word "register", followed by
your name and address. Example:

register, Don Johnson,  Miami Vice,  Bayview Marina Dock A12,  Miami, FL  34566-1234, U.S.A.

NOTE>>  The line can be longer than 80 characters as long as it contains NO
linebreaks (that is, do NOT press the <Return> key until the end of the
line).

Send the line in a mail to: geneid at darwin.bu.edu.  The registration
information will only be used for maintaining a file of the number and
geographic distribution of the users.
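
A registration line of this kind could be assembled as in the Python
sketch below.  The function name build_register_line is hypothetical,
and the comma separator simply follows the example above; the point of
the sketch is that any line breaks in the input are collapsed so the
result is a single line.

```python
def build_register_line(name: str, *address_parts: str) -> str:
    """Return one "register" line with no line breaks in it."""
    fields = ["register", name, *address_parts]
    # str.split() with no argument splits on any whitespace, including
    # newlines, so rejoining with spaces removes embedded line breaks.
    cleaned = [" ".join(field.split()) for field in fields]
    return ", ".join(cleaned)
```

The returned line would then be mailed on its own, or together with a
sequence submission.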

You can submit only one sequence per mail. The sequence must be in
approximately the same format as used for FASTA, BLAST and GRAIL. Put the
sequence after the keyword "Genomic Sequence" as shown below:

Genomic Sequence


(Restrict the line length to 80 characters. The seqname is limited to 20


If the reply file with the results exceeds the mail limit of 300
kB, the reply will be split into several files.  On a UNIX system you
could send the File containing the sequence as follo
