EGCG release 8.1.0 is now available from ftp.sanger.ac.uk in directory
pub/pmr/egcg81 (tar.Z distribution). A VMS .bck distribution is in
preparation (for directory /pub/pmr/egcg81vms).
Major features include 33 new programs (or enhancements to GCG programs),
support for main programs written in C, support for a range of input
sequence formats, input sequence validation, automatic help and enhanced
sequence range selection.
The release notes are included below. E-mail support is available from
egcg at embnet.org
EGCG 8.1 requires GCG release 8.1, and is supported on Irix, OSF,
Solaris and VMS. We believe it may also run on AIX. The programs have
been built on each of these. Due to continued development, we may not
have tested the latest versions on all systems. Please contact us with
any questions, problems, comments or suggestions.
The EGCG Team
===================================================================
EGCG 8.1 Release Notes
======================
Peter Rice (Sanger Centre, Hinxton, UK)
Rodrigo Lopez (Norwegian EMBnet node)
Reinhard Doelz (Swiss EMBnet node)
Jack Leunissen (Netherlands EMBnet node)
EGCG 8.1 continues the work of the EGCG team in the previous release
[1]. The code has been further standardized, and critical parts of
the internals are now ported to C so that routines can also be called
from C main programs. The documentation has been reviewed, checked
for omissions and further standardized to the point where we are able
to produce an HTML Web version. The URL for the Web version, still
under development and changing rapidly, is http://www.sanger.ac.uk/egcg/
New programs in EGCG 8.1 are provided by members of the EGCG team, and
also by the French, German and Italian EMBnet nodes and David Mathog
at Caltech.
Further details will be provided at the GCG Users Forum in Heidelberg
on March 28th.
New programs in EGCG 8.1:
-------------------------
POLYDOT - produces all-against-all dotplots (compare -word style) on
many sequences. POLYDOT is intended to compare all contigs in a
fragment assembly project, but can also compare groups of database
entries to find overlaps, or compare protein families. The output is
graphical, but POLYDOT also writes a report of overlaps and an input
file for GCG's SEGMENTS program which can make the alignments.
PATTERNPLOT - produces a graphical representation of the results of
GCG's FindPatterns program.
PROFILEPLOT - produces a graphical report of the frequency of patterns
in a protein or nucleotide sequence.
SORTCONSENSUS - identifies the strong consensus regions of an
alignment in an MSF file and reports them in sorted order.
STSSEARCH - looks for primer pairs in a set of sequences.
GENETRANS - extracts and/or translates coding regions as defined in
the feature table of sequences stored in the EMBL or GenBank
databases.
MULTALIGN - does a simultaneous alignment for two or more DNA or
protein sequences. The program is based on a generalization of the
algorithm of Waterman, Smith and Beyer by Krueger and Osterburg.
ECLUSTALW, CLUSTREE and PROFALIGN - are parts of the original ClustalW
distribution from Des Higgins [2], modified for inclusion in EGCG.
WORDUP - Reports unusual (statistically significant) nucleotide
patterns of size 6 to 9 bases. The method used is that of Pesole et al. [3].
CHAOS - makes a CHAOS game representation of a nucleic acid sequence
using the method of Jeffrey [4]. We have used this program to
demonstrate patterns down to 5 base resolution in E.coli sequences.
PRIMA - GCG's PRIME program is now extended in EGCG. The only change to date
is to allow ranges to be specified relative to the end of the sequence, but
many others are planned in the near future.
TANDEM - Looks for tandem repeats of a given size range in nucleotide
sequences.
QUICKTANDEM - Rapidly scans a nucleotide sequence for potential tandem repeats.
INVERTED - Looks for imperfect inverted repeats in nucleotide sequences.
CPGREPORT - Reports potential CpG islands in nucleotide sequences.
ECOMPOSITION - an extended version of GCG's COMPOSITION which calculates
molecular weights for single and double stranded DNA and RNA.
EOVERLAP and FILTEROVERLAP - an extended version of GCG's Overlap and
a quality filter program for use in database nonredundancy checks
and fragment assembly project validation.
CODFISH - calculates a set of codon usage statistics for a sequence
using a specified codon usage table. The name comes from the original
requirement for codon usage analysis of fission yeast.
WORDCOUNT - counts the commonest words in a sequence and reports them
in order of frequency and sequence.
GAPFRAME - moves all gaps in a DNA sequence reading frame to be at
codon boundaries.
PEPCORRUPT - randomly introduces small numbers of substitutions,
insertions, and deletions into protein sequence(s).
RFINDPATTERNS - is a version of GCG's FINDPATTERNS that writes each
hit to a separate sequence file.
CREFORMAT - a version of GCG's REFORMAT that allows base ranges to be
selected or excluded, and some sequence characters to be replaced.
ECODONFREQUENCY .... ETRANSLATE - The remainder of Jaakko Hattula's
conversions of additional GCG programs to use the command line (from
EGCG 7.x) have been revived as his methods are often different to those
GCG used in version 8.0, and in some cases we feel they still can be
very useful. These programs also now support the new EGCG interface
options (see below).
EFROMFASTA - a version of GCG's FROMFASTA that preserves the case of
the output file name.
EPEPTIDESORT - a version of GCG's PEPTIDESORT with additional output
options.
ELINEUP - a version of GCG's LINEUP allowing up to 500 sequences with
improved row numbering and allowing extended screen sizes.
EPLOTSIMILARITY - a version of GCG's PLOTSIMILARITY with gaps where the
sequences are gapped.
IG2NBRF - a utility program that converts an IG formatted file into an
NBRF formatted database which GCG's PIRTOGCG can index.
PHYLIP2TREE - displays trees computed with one of the PHYLIP programs
in GCG style.
EMBLTOGCGSC - the Sanger Centre's version of EMBLTOGCG to improve
results with SRS and with Swiss-Prot updates.
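The chaos game representation behind the CHAOS entry above can be sketched in a few lines. This is an illustration only, not EGCG's code: the function name is hypothetical, and the assignment of bases to corners of the unit square is a commonly used convention that we assume here rather than take from the program.

```python
from typing import List, Tuple

# Assumed corner assignment on the unit square (a common convention).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def chaos_game_points(seq: str) -> List[Tuple[float, float]]:
    """Return one (x, y) point per unambiguous base in seq."""
    x, y = 0.5, 0.5  # start at the centre of the square
    points = []
    for base in seq.upper():
        corner = CORNERS.get(base)
        if corner is None:
            continue  # skip gaps and ambiguity codes
        # Move halfway from the current point towards the base's corner.
        x = (x + corner[0]) / 2
        y = (y + corner[1]) / 2
        points.append((x, y))
    return points
```

Plotting the returned points for a long sequence gives the familiar picture in which each sub-square at depth k corresponds to one k-base word.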
Enhancements:
-------------
An early release of EGCG 8.1 was compiled on AIX, although we expect
some further problems.
QUICKSEARCH and QUICKMATCH now support sequences longer than 32000,
for example cosmid sequences being compared to the complete database.
These programs also have new qualifiers to aid in database self-comparison.
PEPWHEEL and PEPNET can mark residues in their own style, or in the same
style as GCG's HELICALWHEEL program.
New interface details:
----------------------
All EGCG programs have a new qualifier "-help" which brings up the
egenhelp text on that program.
When asking for a sequence range, most EGCG programs can accept "-100"
to mean "100 bases from the end". If this is allowed, the prompt is
"Start" rather than "Begin".
Most EGCG programs which read sequences are now able to handle sequences
in GCG, FASTA, STADEN and TEXT formats by a slight change in syntax.
  fa:abc.fasta          reads a single FASTA format sequence
  fdb:xyz.fasta         reads a file with many sequences in FASTA format
                        (including SRS getz sequence output files).
  fdb:xyz.fasta:LACI    reads the sequence "LACI" from a multiple
                        sequence FASTA file.
  staden:abd.sdn        reads a sequence in STADEN format
  text:abc.txt          reads a sequence in plain text format
We expect to extend this syntax rapidly in the coming months. Any suggestions
are welcome for new sequence formats.
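The prefix syntax above can be pictured as a small parser. The sketch below is not the EGCG implementation; the names SequenceSpec and parse_sequence_spec are hypothetical, and we assume that any specification without one of the listed prefixes is passed through as an ordinary GCG specification.

```python
from typing import NamedTuple, Optional

# Prefixes listed in the release notes.
KNOWN_PREFIXES = {"fa", "fdb", "staden", "text"}


class SequenceSpec(NamedTuple):
    fmt: str              # "fa", "fdb", "staden", "text" or "gcg"
    filename: str
    entry: Optional[str]  # named sequence within a multiple sequence file


def parse_sequence_spec(spec: str) -> SequenceSpec:
    """Split a specification such as fdb:xyz.fasta:LACI into its parts."""
    prefix, sep, rest = spec.partition(":")
    if not sep or prefix.lower() not in KNOWN_PREFIXES:
        # Assumption: anything else is left to normal GCG handling.
        return SequenceSpec("gcg", spec, None)
    filename, _, entry = rest.partition(":")
    return SequenceSpec(prefix.lower(), filename, entry or None)
```

For example, parse_sequence_spec("fdb:xyz.fasta:LACI") yields the format "fdb", the file "xyz.fasta" and the entry "LACI".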
Not all EGCG programs support this style of sequence specification. Those
that do will provide an additional message when prompting for sequences,
for example:
  TWORDSEARCH uses protein sequence data
  TWORDSEARCH of what sequence ?
These programs will additionally check that all sequences are valid, so
the program does not need to perform any additional checks (for gaps,
ambiguity codes, DNA to RNA conversion, and so on).
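A rough picture of this kind of validation for nucleotide input is sketched below. It is an assumption-laden illustration, not the EGCG routine: the function name, the choice of "." and "-" as gap characters, and the decision to convert U to T are all ours.

```python
UNAMBIGUOUS = set("ACGT")
AMBIGUITY = set("RYMKSWBDHVN")  # IUPAC nucleotide ambiguity codes
GAPS = set(".-")                # assumed gap characters


def validate_nucleotide(seq: str, allow_gaps: bool = False,
                        allow_ambiguity: bool = False) -> str:
    """Return seq in upper case with U converted to T, or raise ValueError."""
    allowed = set(UNAMBIGUOUS)
    if allow_ambiguity:
        allowed |= AMBIGUITY
    if allow_gaps:
        allowed |= GAPS
    normalized = seq.upper().replace("U", "T")  # treat RNA input as DNA
    bad = sorted(set(normalized) - allowed)
    if bad:
        raise ValueError("invalid sequence characters: " + "".join(bad))
    return normalized
```

A program that cannot handle ambiguity codes would call this with the defaults and could then rely on seeing only A, C, G and T.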
Distribution:
=============
Current major version: 8.1.0
URL: ftp://ftp.sanger.ac.uk/pub/pmr/egcg81
E-mail contact:
===============
egcg at embnet.org
pmr at sanger.ac.uk
References:
===========
[1] Rice P. et al. "EGCG 8.0." embnet.news 2(2): 5-7 (1995)
[2] Thompson J.D., Higgins D.G. and Gibson T.J. "CLUSTAL W:
improving the sensitivity of progressive multiple sequence alignment
through sequence weighting, position-specific gap penalties and
weight matrix choice."
Nucleic Acids Research 22, 4673-4680 (1994).
[3] Pesole, G. et al. "WORDUP: an efficient algorithm for discovering
statistically significant patterns in DNA sequences."
Nucleic Acids Research 20, 2871-2875 (1992).
[4] Jeffrey, H.J. "Chaos game representation of gene structure."
Nucleic Acids Research 18, 2163-2170 (1990).
--
------------------------------------------------------------------------
Peter Rice | Informatics Division
E-mail: pmr at sanger.ac.uk | The Sanger Centre
Tel: (44) 1223 494967 | Hinxton Hall, Hinxton,
Fax: (44) 1223 494919 | Cambs, CB10 1RQ
URL: http://www.sanger.ac.uk/~pmr/ | England