summary of intron/exon splice site programs

Hirdeypal S. Bhathal hsb at ucselx.sdsu.edu
Sat Jul 4 17:27:11 EST 1992





I got four responses to my request about intron/exon splice site programs.
I haven't tried any of them yet. Rather than send personal mail to everyone who
asked for a summary, I am posting all the responses here. Thanks to all who
responded to my request.


------------------------------------------------------------------------------
Response #1


try GRAIL =======>>>>>  send email "help" to grail at ornl.gov

----------------------------------------------------------------------------- 
Response #2


Such a program was announced on the bionet.general board in mid-May.  It is 
used via the NetGene mail server.  To quote the help info:
"The NetGene mail server is a service producing neural network prediction of 
splice sites in vertebrate genes as described in Brunak, S., et al., 
'Prediction of Human mRNA Donor and Acceptor Sites from the DNA Sequence,' J. 
Mol. Biol. 220: 49-65 (1991)."  The info can be obtained by e-mail to 
Jacob Engelbrecht
engel at virus.fki.dth.dk
Department of Physical Chemistry
The Technical University of Denmark


A second source of help is the GeneID artificial intelligence system for 
analyzing vertebrate genomic DNA and predicting exons and gene structure.  
Obtain more information from steen at darwin.bu.edu.  This program accepts e-mail 
inquiries.

Hope the information is useful.

Gregg Wells
Department of Pathology
University of Pennsylvania
Philadelphia, PA  19103-4283
e-mail:  pathology at a1.mscf.upenn.edu
 
------------------------------------------------------------------------------
Response #3



Hi Hirdeypal:

	I do not know of any programs that are specifically available for 
this purpose.  I can suggest the following.

	Many computer programs relevant to biological sciences are available
at two sites that I know.  Although I have not done it before, I was given
to understand that you can download the programs by ftp from these sites.
I would suggest you check these places.  
They are	ftp.bio.indiana.edu
and		geneserver.uh.edu

	I believe you may be helped by a person who knows more about this.
Maybe you can send him mail.  His address is gilbert at bio.indiana.edu

We recently published a paper in the June issue of BioTechniques describing a
program that can identify the potential for silent mutagenesis to introduce
restriction enzyme sites.  This program can also be used to identify sites
that can be mutated to introduce other small sequences like splice sites.
If that is anywhere near what your plans are, see the paper: BioTechniques
12:882-884.  If you are interested in getting the program free of cost, send
me a diskette (DOS formatted) and I will be very happy to send you a copy.
	I hope you will find what you are looking for.
	If you need anything else, let me know.
Raj Shankarappa
bsh at med.pitt.edu
Pathology, University of Pittsburgh,
730B Scaife Hall, Pittsburgh PA 15261.
----------------------------------------------------------------------------
Response #4



Here is a pointer to a good program.  It is really for predicting exons but
finds likely splice sites in the process.  The program is not as good as mine :-)  
but then mine is still in development.  


      GENEID AND NETGENE ONLINE SYSTEMS FOR PREDICTION OF GENE STRUCTURE
                           version 1.0 2/1/1992

GENEID
_______________________________________________________________________________
GeneID is an Artificial Intelligence system for analyzing vertebrate genomic
DNA and predicting exons and gene structure (1). A prototype is implemented
as a fast, automatic email-response system. Users have the option of having 
their DNA sequence analyzed by NetGene (2) simultaneously.

REGISTRATION:
Before or simultaneously with submitting a sequence for analysis, you need to
register your name by sending a line with the word "register", followed by
your name and address. Example:

register, Don Johnson,  Miami Vice,  Bayview Marina Dock A12,  Miami, FL  34566-1234, U.S.A.

NOTE>>  The line can be longer than 80 characters as long as it contains NO
linebreaks, (that is, do NOT press the <Return> key until the end of the
address.)
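
As an illustration of the note above, here is a minimal Python sketch that
builds such a single-line registration request. The function name
registration_line and the comma-plus-space separator are assumptions taken
from the example line, not part of the GeneID documentation.

```python
def registration_line(name, *address_parts):
    """Build a one-line 'register' request (hypothetical helper).

    Any whitespace inside a field, including line breaks, is collapsed to a
    single space so the result never contains a linebreak.
    """
    fields = ["register", name, *address_parts]
    cleaned = [" ".join(str(field).split()) for field in fields]
    return ", ".join(cleaned)


if __name__ == "__main__":
    print(registration_line("Don Johnson", "Miami Vice", "Miami, FL", "U.S.A."))
```

The resulting string would be the entire body of the registration mail.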

Send the line in a mail to: geneid at darwin.bu.edu.  The registration
information will only be used for maintaining a file of the number and
geographic distribution of the users.

SUBMITTING SEQUENCES:
Your sequences must be submitted in the following format (approximately the
same format as used for FASTA, BLAST and GRAIL). You can submit only one
sequence per mail. Put the sequence after the keyword "Genomic Sequence" as
shown below:

Genomic Sequence

>seqname
TTGGCCACTCCCTCTCTGCGCGCTCGCTCGCTCACTGAGGCCGGGCGACCAAAGGTCGCC
CGACGCCCGGGCTTTGCCCGGGCGGCCTCAGTGAGCGAGCGAGCGCGCAGAGAGGGAGTG
GCCAACTCCATCACTA...................

(Restrict the line length to 80 characters. The seqname is limited to 20
characters).
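
The format above can be produced with a short Python sketch like the one
below. It is only an illustration under stated assumptions: the function name
format_submission and the default width of 60 characters (matching the
example lines) are my own choices, and only the keyword, the ">seqname"
header, the 80-character line limit and the 20-character name limit come from
the help text.

```python
def format_submission(seqname, sequence, width=60):
    """Return a mail body in the 'Genomic Sequence' format (hypothetical helper)."""
    if len(seqname) > 20:
        raise ValueError("seqname is limited to 20 characters")
    if not 1 <= width <= 80:
        raise ValueError("line length must be between 1 and 80 characters")
    # Remove any whitespace already present in the sequence, then re-wrap it.
    seq = "".join(sequence.split())
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "Genomic Sequence\n\n>" + seqname + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(format_submission("seqname", "TTGGCCACTCCCTCTCTGCGCGCTCGCTCGCTCACTGAGG" * 4))
```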

NOTE>>  IF YOUR MAIL DOES NOT CONTAIN THE KEYWORD "GENOMIC SEQUENCE", OR
ANY OTHER KEYWORDS LISTED IN THIS FILE, NO MAIL WILL BE RETURNED TO YOU.

If the reply file with the results exceeds the mail limit of 300
kB, the reply will be split into several files.  On a UNIX system you
could send the File containing the sequence as follows:

  mail -v geneid at darwin.bu.edu <File


LIMITS:
GeneID currently will not accept sequences smaller than 100 bp or larger
than 20 kb.
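
A quick length check before mailing might look like the following Python
sketch. The names are hypothetical, and it assumes that 20 kb means 20,000 bp
and that both bounds are inclusive, which the text above does not state
explicitly.

```python
MIN_BP = 100      # from the limit stated above
MAX_BP = 20_000   # assumption: 20 kb taken as 20,000 bp


def within_geneid_limits(sequence):
    """Return True if the sequence length falls inside the stated limits."""
    length = len("".join(sequence.split()))  # ignore whitespace and line breaks
    return MIN_BP <= length <= MAX_BP


if __name__ == "__main__":
    print(within_geneid_limits("ACGT" * 50))   # 200 bp
    print(within_geneid_limits("ACGT" * 10))   # 40 bp
```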

CONFIDENTIALITY:
Your submitted sequence will be deleted automatically immediately after
reception by GeneID.


ANALYSIS:
GeneID will scan your sequence for potential splice sites, start codons, and
stop codons. Then it will try to assemble these into potential first exons,
internal exons, and last exons. Exons will be evaluated according to a number
of characteristics related to coding and splicing, and only likely exons will
be kept. Mutually exchangeable exons (normally overlapping and in the same
frame) will be put together in classes. Only the top 15 ranking first and
last exon classes, and the top 35 ranking internal exon classes
from each sequence will be kept and assembled into potential gene models with
an open reading frame, which will be ranked according to the quality of the exons
they contain. The top 20 models will be included in the return mail. Your
return mail will also contain lists of the sites and exons created during the
analysis. GeneID will not analyze the reverse complement of your sequence. If
you suspect a gene on the other strand, submit the reverse complement sequence
separately.
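
Since the reverse strand has to be submitted separately, a small Python
sketch for producing the reverse complement may be useful. The function name
is my own; characters other than A, C, G and T (for example N) are passed
through unchanged, which is an assumption about how ambiguous bases should be
handled.

```python
# Translation table for the four standard bases, in both cases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence.

    Characters outside A/C/G/T are left as they are and only reversed.
    """
    return sequence.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("TTGGCCACTCCCTCTCTGCG"))
```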

TIPS FOR USE OF GENEID:
GeneID will try to identify first, internal, and last exons in each of the
sequences you submit, and try to assemble these into models of ONE likely
gene in each sequence. To avoid missing any exons, the number of exons will
be vastly overpredicted, and only a few of them are likely to be true (they
tend to be the top ranking exons, but a few true exons rank very low). But
these few true exons are likely to be found in the gene models because they
fit together to form a continuous open reading frame. Thus you should look to
the gene models to find a probable coding region.
If you submit a sequence that turns out to contain two genes, the behavior of
GeneID is unpredictable. It could either predict one large gene containing
both, or it could predict only the gene with the most typical characteristics.
If you submit a sequence that contains only part of a gene, GeneID will try to
identify an entire gene in this sequence. Thus the predicted first exon may
actually be part of a true internal exon, or the predicted last exon may be
part of a true internal exon. If GeneID fails to predict any genes, you might
look at the potential exon lists.
Thus you can experiment with input and response, by starting out with sequences
that are not too long (for example less than 10 kb), and see if GeneID is
able to extend the gene if you extend the sequence. If you have very large
sequences, it may be a good idea to request analysis by NetGene first (see
below). NetGene will analyze sequences up to 100 kb, and may find regions
containing exons of very high likelihood. These regions can then be resubmitted
to GeneID for further analysis.
GeneID will not construct models with more than 22 exons.
If the sequence contains frameshift errors in exons, then that may affect the
quality of the prediction in the current implementation.

ACCURACY:
In a test on 28 genes from GenBank, 91% of the nucleotides were correctly
predicted as coding or non-coding. Since these two categories are unequally
represented, a better measure of accuracy may be the correlation coefficient,
which was found to be 0.68. See paper for details.
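
To see why a correlation coefficient can be more informative than the
fraction of correctly predicted nucleotides, consider the Python sketch
below. It assumes the coefficient is the Matthews (phi) correlation between
true and predicted coding labels; the help text does not say which formula
was used, so treat this as an illustration rather than the authors' exact
measure. The function name and the toy data are hypothetical.

```python
import math


def coding_accuracy(true_labels, predicted_labels):
    """Return (fraction correct, correlation coefficient) per nucleotide.

    Labels are 1 for coding and 0 for non-coding. The correlation is the
    Matthews (phi) coefficient; it is reported as 0.0 when undefined.
    """
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    tp = tn = fp = fn = 0
    for t, p in zip(true_labels, predicted_labels):
        if t and p:
            tp += 1
        elif not t and not p:
            tn += 1
        elif p:
            fp += 1
        else:
            fn += 1
    accuracy = (tp + tn) / len(true_labels)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, cc


if __name__ == "__main__":
    truth = [0] * 90 + [1] * 10          # toy data: 10% coding
    all_noncoding = [0] * 100            # predicts "non-coding" everywhere
    print(coding_accuracy(truth, all_noncoding))
```

In this toy case, calling everything non-coding scores 90% of nucleotides
correct while the correlation coefficient is zero, which is the imbalance
effect described above.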

ANALYSIS TIME:
Will depend on the load on the system and grows approximately linearly with
the length of the sequence input. Expect at least 1 minute per kb. Longer
response times can occur if the system is temporarily down (check with the
UNIX command: "finger geneid at darwin.bu.edu").

FURTHER INFORMATION:
A preprint of a paper describing the development and testing of GeneID is
available as a Stuffit.hqx file for Macintosh. Simply include the line:

  Preprint Request

in your mail to geneid at darwin.bu.edu, and the manuscript will be mailed to you.


REFERENCING:
Publication of output from GeneID must be referenced as follows:
(1) Guigo, R., Knudsen, S., Drake, N., and Smith, T. (1992) Prediction of Gene
Structure. Journal of Molecular Biology. In Press.


PROBLEMS, COMMENTS, AND SUGGESTIONS:
Can be mailed to steen at darwin.bu.edu.

Users of the MBCRR and BMERC national compute


