Staden Package Release Notes

Tim Littlejohn tim at morgan.angis.su.OZ.AU
Sat Feb 3 16:48:58 EST 1996

	 Release for Staden Package 1996.0.0 30th January 1996

One of the major changes (and the one  that took most time to produce)
in this release is that GAP4 and TREV now have online help and we have
created  our own WWW pages.   The help can be  browsed from within the
programs using Netscape or   the simple inbuilt WWW  browser  included
with the package. Now that this  information is available we give less
detail in these Release Notes as the  online help reflects the current
status of the programs.

Apart from  the online help, the main  changes in this  release are to
GAP4 (for those who missed it, the  GAP4 paper came out: Bonfield,J K,
Smith,K  F and Staden,R. A New  DNA Sequence Assembly Program. Nucleic
Acids Res.  23, 4992-4999 (1995) )  but it also includes improvements,
bug fixes  and additions to other  components of the  package. Several
changes  have  been made  in response  to  requests by   users and  we
encourage more groups to contact us  with suggestions for improvements
and with bug reports.

As an example of this, the Genome Sequencing Centre, in St Louis asked
if we could reduce the file size for SCF  files and so reduce the cost
of disk  storage for  large projects   such as  theirs. We decided  to
change the way the different types of data were stored in SCF files so
that compression programs such as gzip would  work more effectively on
them.  These new style SCF files  (SCF version 3.00) can be compressed
to around 40% of their original size. Programs  like TREV and GAP4 can
read the files in compressed or uncompressed  form (as well as all the
older styles of SCF format).  All the new code for  this purpose is in
our  io_lib directory.  The documentation for  this  useful library of
functions  has also been improved.   Other  io_lib changes include two
new programs - "extract_seq"  extracts only the sequence  component of
either a  trace or experiment file,  and  "scf_update" can  be used to
convert between SCF formats 2 and  3; the RAWDATA environment variable
is now used as a list of directories when looking  for a trace file to
load (from an experiment file); and a few minor bug fixes.
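
As a rough illustration of the RAWDATA lookup (a sketch only, not the
io_lib code), the search can be thought of as trying each directory in
turn. The function name find_trace, the use of ":" (os.pathsep) as the
separator and the fallback to the current directory are assumptions
made for this example.

```python
import os
import tempfile


def find_trace(name, rawdata=None):
    """Return the first existing path for `name` in the RAWDATA directories.

    Assumption: directories are separated by os.pathsep (":" on Unix).
    """
    if rawdata is None:
        rawdata = os.environ.get("RAWDATA", "")
    # Assumption: an empty RAWDATA means "look in the current directory".
    directories = [d for d in rawdata.split(os.pathsep) if d] or ["."]
    for directory in directories:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None


# Demonstration: two temporary directories, only the second holds the trace.
with tempfile.TemporaryDirectory() as first, \
        tempfile.TemporaryDirectory() as second:
    trace_path = os.path.join(second, "example.scf")  # hypothetical name
    with open(trace_path, "wb") as handle:
        handle.write(b"placeholder")
    search_list = os.pathsep.join([first, second])
    found = find_trace("example.scf", search_list)
    missing = find_trace("absent.scf", search_list)
    found_in_second = (found == trace_path)
```

Directories are tried in the order listed, so an earlier directory
takes precedence when the same trace name exists in more than one.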

The Sanger Centre  said they wanted  to treat  readings produced using
dye terminators  as being equivalent  to having  a reading  from  both
strands of the sequence. That is they wanted all of the functions that
check for coverage  of both strands  of the sequence, for  example the
Experiment Suggestion functions or the  Quality Plot, to treat  single
stranded segments  that are   covered  by dye terminator   readings as
double   stranded.  This has  been  implemented  by  introducing a new
Experiment file record type, the CH or "Special Chemistry" record, and
a new entry in the GAP4 database  for storing "Reading Flags".  If the
CH   record is present  and set  to 1, the  reading  can be treated as
equivalent to two  readings, one from each  strand. Within GAP4  users
can choose whether they want  such readings to be  treated in this way
for any of the functions that calculate a consensus sequence.
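
The effect of the CH record on strand coverage can be sketched as
follows. This is an illustration under stated assumptions rather than
the GAP4 implementation: the dictionary keys (start, end, strand, ch)
and the function names are invented for the example.

```python
def strand_coverage(length, readings, honour_special_chemistry=True):
    """Return, for each consensus position, the set of strands covered.

    Each reading is a dict with hypothetical keys: start and end
    (0-based, end exclusive), strand ("+" or "-") and ch (CH record).
    """
    covered = [set() for _ in range(length)]
    for reading in readings:
        if honour_special_chemistry and reading.get("ch", 0) == 1:
            # A special chemistry reading counts as one from each strand.
            strands = {"+", "-"}
        else:
            strands = {reading["strand"]}
        first = max(0, reading["start"])
        last = min(length, reading["end"])
        for position in range(first, last):
            covered[position] |= strands
    return covered


def single_stranded(length, readings, honour_special_chemistry=True):
    """List the positions covered on exactly one strand."""
    coverage = strand_coverage(length, readings, honour_special_chemistry)
    return [i for i, strands in enumerate(coverage) if len(strands) == 1]


readings = [
    {"start": 0, "end": 6, "strand": "+", "ch": 0},
    {"start": 4, "end": 10, "strand": "+", "ch": 1},  # dye terminator
]
with_flag = single_stranded(10, readings)
without_flag = single_stranded(10, readings, honour_special_chemistry=False)
```

With the flag honoured only positions 0 to 3 are reported as single
stranded; with it ignored all ten positions are.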

We  made another change  to  Experiment file  format.  The terminology
surrounding the direction of a reading on a template, the reading's
sense, its strand and   orientation is confusing.   In an  attempt  to
simplify it we have extended the Primer Type (PR record) definition by
adding the type 4. Now 0 means unknown, 1 means forward from universal
primer, 2 means reverse from universal reverse primer, 3 means forward
custom  primer (previously  it was  any  custom  primer), and 4  means
reverse custom primer. We hope this  makes it easier to create correct
Experiment  files and means that the  Direction of Read (DR record) is
no longer required (although we will continue to support it for now).
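
The extended Primer Type values can be written down as a small lookup
table. The sketch below is illustrative only; the names PRIMER_TYPES
and direction_from_primer_type are assumptions, but the numeric values
and their meanings are those given above.

```python
# PR record value -> (description, direction of the reading).
PRIMER_TYPES = {
    0: ("unknown", None),
    1: ("forward from universal primer", "forward"),
    2: ("reverse from universal reverse primer", "reverse"),
    3: ("forward custom primer", "forward"),
    4: ("reverse custom primer", "reverse"),
}


def direction_from_primer_type(primer_type):
    """Return "forward", "reverse" or None (unknown) for a PR value."""
    try:
        _description, direction = PRIMER_TYPES[primer_type]
    except KeyError:
        raise ValueError("unrecognised primer type: %r" % (primer_type,))
    return direction


directions = [direction_from_primer_type(value) for value in range(5)]
```

Because types 3 and 4 now distinguish forward from reverse custom
primers, the direction follows from the PR value alone whenever it is
not 0.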

While looking at GAP4  databases  from external laboratories  we  have
discovered that   many    contain missing   or conflicting    records,
particularly relating to template information,  and have tracked  this
down to errors/omissions in their Experiment files. In some cases this
was due to use of   external (bugged) programs    for setting up   the
Experiment files, but it  has also emphasised the  need for us to help
people   to use PREGAP.   To improve  this we  have updated the PREGAP
documentation and added some    template configuration files.   It  is
important to realise that to get the best from GAP4 it is necessary to
give it complete and correct data about all the readings it assembles.

Changes to GAP4 include  a modified Quality  Plot to make the problems
more apparent; a new "Independent Assembly"  function in which a batch
of readings can be assembled as though  they were the only readings in
the database, ie they will only be  compared with one another; command
line arguments for the maximum consensus  length and maximum number of
records in the  database are  now  available; for long-running  tasks,
like assembly, results are now written  to the Output Window while the
function is  running,  rather  than  buffered  up until  the  task has
finished;  several  functions  have been greatly   speeded-up; padding
characters are now given an accuracy estimate  that is the mean of the
characters adjacent to it; the code for checking on Read Pairs and for
plotting readings and  templates  in  the  Template Display  has  been
greatly improved (including bug fixes) and  the listed output adjusted
accordingly; a  bug in assembly  that allowed reading names, that were
not the same as the Experiment file name, to be entered more than once
was fixed; a bug caused  by reading names  of 16 characters (thanks to
colleagues in Japan) was  found and fixed; a  bug that  sometimes gave
incorrect  consensus sequences in   Find   Internal Joins was   fixed;
consensus and quality cutoff figures were previously often not used; a
consensus tag corruption occurred in some specific joining cases;
extract readings now outputs correct TN lines  and is more robust with
very long sequences. Large numbers of less serious bugs were also fixed.
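
The accuracy estimate for padding characters mentioned above can be
sketched as below. This is not the GAP4 code: the use of "*" as the pad
character, the handling of runs of pads (nearest non-pad base on each
side) and of pads at the ends (the single available neighbour) are
assumptions made for the example.

```python
def pad_accuracies(sequence, accuracy, pad="*"):
    """Give each pad the mean accuracy of its neighbouring bases."""
    result = list(accuracy)
    for i, base in enumerate(sequence):
        if base != pad:
            continue
        neighbours = []
        # Nearest non-pad base to the left, if any.
        for j in range(i - 1, -1, -1):
            if sequence[j] != pad:
                neighbours.append(accuracy[j])
                break
        # Nearest non-pad base to the right, if any.
        for j in range(i + 1, len(sequence)):
            if sequence[j] != pad:
                neighbours.append(accuracy[j])
                break
        if neighbours:
            result[i] = sum(neighbours) / len(neighbours)
    return result


estimates = pad_accuracies("AC*GT", [40, 30, 0, 50, 20])
end_pad = pad_accuracies("*AC", [0, 60, 70])
```

In the first example the pad lies between bases with accuracies 30 and
50 and so receives 40.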

TREV can read SCF files via their Experiment file. All edits are saved
to Experiment files,  rather than to  the SCF file. Several small bugs
have been fixed in TREV. ALFSPLIT and CONVERT have also been improved.
People have started to use REPE for sequence families other than Alu and in
doing so  have uncovered a  number of Alu  specific assumptions, which
have now been removed.

Two bugs have  been fixed in the pattern   search routines in NIP  and
NIPL.   Bernard Caudron at the  Pasteur  pointed  out that the  CODATA
version of  PIR  files had changed  and this  had  broken our sequence
library  index creation programs and   our  reading routines. He  sent
fixes and they are included in the Release.

Silicon  Graphics  have now fixed  the  bug in their  Fortran that was
breaking  our  sequence library  access  routines  and  so, once again
libraries can be read on SGI machines.  This bug fix is available as a
patch  from SGI for  current systems, and is  fixed as standard in the
forthcoming Irix 6.2 release.

In summary the major changes have been the  addition of online help to
GAP4 and TREV, numerous bug fixes and speedups to GAP4, and changes to
SCF and Experiment file formats. Feedback welcome.

	Rodger Staden, James Bonfield and Kathryn Smith

     Release Notes for Staden Package 1995.1.0 14th September 1995

The major change with this release is the inclusion of our new version
of gap.  Currently, to distinguish this from  the existing gap, we are
calling our new version "gap4". Gap4 is currently considered as a beta
release.  When we finish the  beta test stage   gap will be renamed to
gap3, and gap4 will  be renamed to gap.  In  the longer term gap (i.e.
the new program) will be the only assembly program we support.  In the
even longer term the whole package will have a gap4-like interface. We
encourage the use  of  this release of  gap4  and are  unaware  of any
serious bugs.

An overview  of  gap4 is contained   in the file   doc/gap4.help and a
partially assembled  database B0334 which can  be used to try  out the
new program is in userdata.

We  also include our new    trace  viewer and  editor program   called
trev. This was initially written as an exercise in the use of Tcl and
Tk  but  now gives  a  better user   interface   and interaction  with
experiment files.

For those on our automatic update list we apologise for the long delay
since the last update,  which  is due  to us concentrating  on getting
gap4 into a releasable state. We hope you find it worth waiting for.

Highlights from gap4

One of our  main objectives with the new  program was to provide  many
more visual clues as to the current state of  a sequencing project and
to allow  the  users to interact   in more  intuitive ways  with their
data. We were particularly interested  in the problems of dealing with
repetitive sequences,  and  wanted to   supply tools  to  display  and
manipulate  the various   types of data   that   might help to   solve
difficult assemblies. To this end  we have introduced new displays and
a  new gap  data item the  "contig  order".  The new displays  are the
"contig selector", the   "Contig Comparator", the  "template display",
the  "restriction enzyme map"  and the "stop codon  map". We have also
made it  possible  to  have any  number of  contig  editors and contig
joining editors running simultaneously. The  same contig can be viewed
in several  editors simultaneously,  hence allowing repetitive regions
to be compared.

In previous versions of our assembly  programs the user had no control
over the relative order of contigs  during processing and, even had it
been possible, there  was no functionality to make  use of it. The new
gap stores the "contig order"  in its database  and through a new type
of display,  the "contig selector"  this information is always visible
while  the program is  running.  The  "contig  order"  is  simply  the
relative positions of  the  contigs.  In  the  "contig selector"   all
contigs are  shown,   each  being  represented by  a  horizontal  line
proportional  to its length.  The left to right   order of these lines
defines  the contig order. Users  can reorder  the contigs by dragging
the lines that   represent them  around  inside  the contig   selector
display. The  contig selector can also  be used  to select contigs for
processing. Tags can be displayed in the contig selector window.
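
Conceptually the contig order is just an ordered list of contigs, and
dragging a line in the contig selector moves one entry to a new
position. A minimal sketch, assuming contigs are identified by numbers
and using an invented function name, is:

```python
def move_contig(order, contig, new_index):
    """Return a new contig order with `contig` moved to `new_index`."""
    if contig not in order:
        raise ValueError("contig %r is not in the order" % (contig,))
    reordered = [c for c in order if c != contig]
    # Clamp the target so that dragging past either end is harmless.
    new_index = max(0, min(new_index, len(reordered)))
    reordered.insert(new_index, contig)
    return reordered


contig_order = [12, 7, 30, 4]           # hypothetical contig numbers
after_drag = move_contig(contig_order, 30, 0)
```

The original list is left untouched, which makes it easy to compare
the order before and after a move.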

The   Contig Comparator is used   to display the  results of comparing
contigs.  It is our  solution to  the  problem of displaying  multiple
types of data about the possible relationships between contigs. It can
currently show     the results of searches    for  templates that have
readings  in   more than one  contig,   the results of  the  old "find
internal joins" function, the results of searches  for repeats and the
results of "Check Assembly".  These  searches reveal information about
the possible  relative  order of   the  contigs, or the  positions  of
problems, and the Contig Comparator allows all  of their results to be
displayed and manipulated together. When  any of these types of search
is  performed the contig selector  automatically  converts to a Contig
Comparator  by duplicating itself  in  the vertical direction. Results
are  plotted  in the   rectangular  display created  in this  process.
Furthermore the manual contig  shuffling procedure outlined above  can
still be performed and the plotted results associated with any dragged
contig will move along with it to its new location in the display.  As
is  explained below this greatly  facilitates  contig ordering and can
help  users understand difficult assemblies  and plan experiments. The
Contig Comparator   can also be  used  to invoke the  join editor, the
contig editor and the template display.

The  template display shows   a schematic  of   all the  readings  and
templates for   a single contig. Each is   represented by a horizontal
line proportional to its length. Colour  coding shows strandedness and
arrows  indicate the direction  of the  reading.  Selected tags can be
plotted as can the quality plot (now colour  coded) that was available
in  the previous programs   and  a  new restriction  enzymes  display.
Templates that appear  in    more than one   contig are   also  colour
coded.   This  display  can also   be   used  to select   readings for

For those   who employ restriction enzyme  mapping  data to  aid their
assembly projects we have  added functions to  locate and display  the
positions  of restriction  sites. Selected sites   can be converted to
tags that can be displayed in all the usual ways.

A stop codon plot is available to display stop  codons in three or six
reading frames. It can  be linked to the  contig editor to reflect the
edits made.

The contig editor contains  several selectable status lines to display
information about the readings contributing  to the consensus and  for
displaying translations in any of the six reading frames.

A further new feature of gap is its ability to create and use "lists".
Users of our package will be familiar with  the idea of "files of file
names"  and know their value for  processing batches  of data. For the
new gap we  have extended  this concept  so that many  of its commands
operate on  lists of items. To facilitate   this mode of work  we have
provided routines to create and manage lists.

Changes and bug fixes to other programs

1. Pregap
	The "eba" phase was always running eba on the first reading in the
	input list.

	Correctly handles cases where the ID/EN lines are named
	differently  from the experiment file filename.

	Support for odd reading names, such as those that exist but are blank,
	or that contain spaces.

	Better ALF support (reading name generation).

	Interactive clipping using the trev program (a new version of ted).

	Uses the vepe screen against vector instead of the gap option.

2. Gap
	More robust with inputting invalid CS line input or zero length

	Bug fixes with X11 timings during contig editor quit. The align
	command in the contig editor works better.

	Removed memory leak from enter preassembled data. Also no longer
	crashes when the input file of filenames does not exist. More robust
	when we have no LN/LT/SQ lines.

	Fixed file permissions after copy database
	Busy files renamed to PROJECT.V.BUSY

	Tags of precisely length 1 previously sometimes caused problems

	Fixed buffer overrun in save consensus tags (which caused blank output
	files under SunOS 4). We also support specifying ranges for this now

	Removed cross hair usage for "find read pairs" (it didn't work)

	Corrected assembly alignment failure error code (was 5, now 2). Also
	some readings were failing incorrectly when requesting not to join.

	Break contig is more intelligent in cases where breaking at a single
	reading would generate more than 2 contigs.

	Disassembly previously could corrupt the tag lists.

	Removed ABMG tag from the standard GTAGDB file.

3. Vepe
	Added "screen for restriction sites"

	Added "screen against vector"

	Cosmid vector search now locates the left and right ends.

4. Trace file IO (within gap, gap4, ted, makeSCF, etc)
	Programs using SCF files now support the older SCF format too.

	ABI reading code will now recognise the simplest forms of MacBinary and
	automatically strip off the header (this assumes it's 128 bytes).

	Added getABIstring and getABIcomment functions to the ABI io library.

	Updated getABISampleName. As before, but greatly simplified code.

	New getABIdate command.

	Better support for reading experiment files via the trace level
	interface. We have better support for writing back to the original
	experiment file.

5. A new trace viewer program, named trev, to replace ted.
	Improved user interface.

	Better integration with experiment files.

6. New version of gcgentryname2 to support their new format.

7. Clip no longer crashes when run on blank files.

8. The convert program is more reliable when converting from bap to gap. The
default quality value for bases has been changed from 0 to 100.

	James Bonfield, Kathryn Smith and Rodger Staden


		Release Notes for Staden Package 1995.0.0

Most of the changes in the new release are bug fixes, but some
additions are listed at the top. Currently our efforts are going into
the production of a new graphical user interface and so we have less
time for writing new options.
We find that most of our bug reports come from a small subset of
users. Please, if you find problems with the package, let us know by
email, and we will fix things as quickly as possible, and make the
fixes available via ftp.

Changes and bug fixes

1. Major changes to gap. Added idea of "active tag types" which the user
can define and use during find internal joins and assembly. 
Added new option in assembly which enables reads that match but do not
align well to be entered as new contigs. 
Active tags are set in "set display parms". 
Replaced all references to operating on all or one contig by a new
routine that uses a radio button instead of the commonly used yesno. One
outcome is that many scripts will need changing. Also changed the
quality codes to be 0-9 rather than 0-4.

2. Added preassembly code. This is a new option that enters a
single contig into the database. To facilitate the change other
options have been changed. Save consensus tags now accepts a region;
extract readings now asks whether quality and position information
should be output (and a few other question shuffles here); expFileIO
and seqInfo have been updated to handle opos[] and conf[] items in exp

3. Added an interactive "cop" (ie find places in the consensus for
which the evidence is unclear) to the contig editor. Named as "Verify &" and
"Verify |" in the search window. Verify & means look at both strands
together, Verify | means treat strands separately.
Removed the old option from the menus. Bug fixed: Cop was taking
notice of strand information when not appropriate.

4. Extract gel readings now outputs SL, SR, CS, PR, ON and AV line types.

5. Changed "Type:" label in the search window to "Tag type:". This also has
the side effect of changing the name in the tag editor window, but this isn't
a problem.

6. Although it seems a backward step, modified the sequence library 
handling routines to include gcg as well as all the others. I do not
recommend changing to this format, but having the ability to deal with
it will help sites that want to support both packages and provide
sequence library access for each with minimal use of disk space.
My view is that there should be a single sequence library format
(not a different one for each collection and distribution centre), 
and that the libraries should be distributed ready to use, and hence 
should not require reformatting. We have made this possible for users
of the EMBL CDROM, and have come as close as we could for the other
libraries by providing index creating programs for their distributed formats.
From this point of view handling yet another proprietary format is a
retrograde step and it would be better if all packages supported the
distributed format. That said I hope some sites find it useful.
Requires modification of the division lookup file format (for gcg libraries) 
but will also work with existing files. At present the new sequence
reading code is not as efficient as that for the distributed formats
of the libraries, but I believe it works.
Old format:
     1 EMBLPATH/phg.dat
New format:
     1 GCGPATH/em_ph.seq GCGPATH/em_ph.ref
     1 EMBLPATH/phg.dat EMBLPATH/phg.dat
     1 EMBLPATH/phg.dat
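
A reader for the division lookup lines shown above might look like the
following sketch. It is not the package's own code: the function name,
the meaning attached to the fields, and the rule that an old-style
one-file entry uses the same file in both positions (as the second
new-format example suggests) are assumptions.

```python
def parse_division_line(line):
    """Split a division lookup line into (number, first_file, second_file).

    Accepts the old format (number and one file) and the new format
    (number and two files).
    """
    fields = line.split()
    if len(fields) not in (2, 3):
        raise ValueError("unexpected division line: %r" % (line,))
    number = int(fields[0])
    first_file = fields[1]
    # Assumption: with a single file, it serves in both positions.
    second_file = fields[2] if len(fields) == 3 else fields[1]
    return number, first_file, second_file


old_style = parse_division_line("     1 EMBLPATH/phg.dat")
gcg_style = parse_division_line("     1 GCGPATH/em_ph.seq GCGPATH/em_ph.ref")
embl_style = parse_division_line("     1 EMBLPATH/phg.dat EMBLPATH/phg.dat")
```

Under these assumptions the old one-file form and the two-file EMBL
form parse to the same result, which is consistent with existing files
continuing to work.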

7. The SunOS 4 Makefile has been changed to use X11R5. This cannot be done in
a portable fashion, so people will need to edit the file to change the gcc
library location.

8. Find internal joins. Several problems were found relating to the
new mode of use "search with single segment". These have been fixed.

9. Minor fix to hairpin loop search in nip: an uninitialised variable
for the case of zero matches caused problems for displaying the number
of matches found.

10. The temporary tags used by select oligo in the contig editor were
disrupting consensus tags. Crashes could result.

11. Traces for complemented readings would become misaligned when
adjusting cutoffs with padded sequence.

12. In course of trying to handle gcg sequence libraries discovered
mistakes in seqlibsubs.f: two sequence reading routines were sent an extra
argument and a 5 byte string was filled with 6 chars.

13. Spotted a missing 'break;' in the undo case statement. The consequence was
that undoing a confidence change also used the same data to undo a transpose

14. Delete contig could corrupt memory.

15. Fix get_gel_num() function when dealing with the /name convention. This
fixes the alter relationship gel code.

16. Fixed a bug with the compare strands function of the contig editor. Various
options would then fail when computing a consensus - eg deleting a consensus
pad, using align, dump contigs.

17. io_get_extension() would return negative lengths or crash in memcpy when
vector tags were in the used data. It could also miss VEC tags when they
overlapped the used data by > 1.

18. Assembly in gap (and presumably bap, etc) crashed if there were
more than maxc=100 overlaps.

19. Initialise the 'next' pointer to zero for newly created tags in the
spltag_() function. Previously this produced unpredictable results for break
contig and disassemble readings.

20. Tags weren't being shifted on the consensus correctly when disassemble
readings changed the contig start.

21. Recent improvement to give more information for infrequent restriction 
enzyme sites was bugged in routine findl1 which caused routine s2 to crash.

22. Fix freeDB() bug in contig editor. It didn't check if DB_Name and DB_Seq
had been allocated before freeing, and hence could free NULL data.

23. The Find Oligo editor function creates temporary tags. These had locally
(non malloced) defined comments that were later incorrectly freed.

	Rodger Staden and James Bonfield

		Release Notes for Staden Package 1994.2.0

		(Please make sure these notes reach the users)

	This release contains a very large number of changes to the
	software and the manual because, for the first time, it
	includes our new and long awaited assembly program gap, and
	all its associated programs and scripts. Gap replaces our
	previous assembly programs (dap and bap) which were always
	temporary. Rather than rewrite the same information in several
	ways we include below the preface to the new edition of the
	manual which contains a list of the major changes.
	The body of the manual contains descriptions of the new features and
	rewrites of changed options. We will only fix bugs in bap in
	future and dap will be removed from the distribution.
	Note that we have removed reference to sap and bap from the
	manual and that the contents concerned with assembly are true
	only for gap. Previous copies of the manual should be kept by
	groups finishing off bap projects. Note that the package
	includes a program for converting from dap and bap databases to
	gap database format.

	The testpackage directory includes a complete set of data for
	demonstrating pregap and gap.

	The copy of prosite on the distribution corresponds to release
	12.0. Note that this release is the first to contain the new
	"matrix" method for defining patterns. As yet we have not
	written our own code to deal with this format and so such
	patterns are not translated into pattern files by splitp3. For
	this release only one pattern is described as a matrix and is
	hence missing from our set of patterns for use by pip. We hope
	to be able to use the matrix files shortly.

	Preface to manual 94.2

As is obvious from its increased size (up by around 50 pages) this
new edition of the manual contains the greatest number of additions
and changes so far made, and the increase is due to the launch of our
final sequence assembly program and all its accompanying innovations.

The new program gap (genome assembly program) has a new database that
stores a great deal of extra information about each reading, and uses
experiment files as its standard input format. Although reliable
values are not yet available, gap is programmed to use numerical
estimates of base accuracy for several of its operations. The manual
includes essays on the use of experiment files and numerical estimates
of base accuracy in the introductory chapter.

To accompany gap and the use of experiment files we include several
new pre-assembly programs and a script to combine them all into a
single operation. The new script is called pregap and it uses init_exp
to initialise an experiment file, makeSCF to convert ABI and Pharmacia
ALF files to SCF files, eba to estimate base accuracies for data in
SCF files, clip to locate and mark the start of poor quality at the 3'
ends of readings, vepe to find and mark the positions of vector
sequences, and repe to perform a similar job for repeat families such
as Alu. Pregap is designed to be easily interfaced with local file
organisation and databases.

The gap database is completely new and is designed to be extendable
and robust in the event of system crashes. The gap program has a large
number of new options and improvements to previous ones. Routines have
been added to automatically locate problems and suggest solutions. The
program can find single stranded regions and suggest resequencing
particular reads on a long gel machine to fill them. Another routine
can find regions that are tagged as containing compressions and
suggests resequencing readings using Taq terminators. Routines that
extract data from the database such as "extract gel readings" now
write their output in experiment file format. Similarly routines that
read in tags use experiment file format.

The program cop has now been properly integrated into gap, and as gap
records all its edits in a better way than previous programs, its
results are obtained more quickly and are more reliable.

The contig editor has been improved in numerous ways, necessitating a
complete rewrite of the corresponding section of the manual. Trace
movement is now coupled to the movement of the editing cursor. The
"find next problem" command now works in three modes: a) find the next
position where the consensus is not A,C,G or T; b) find the next
position where there is not good data on both strands (and they
agree); c) find the next position where an edit has been performed.
Several new key bindings have been added and an alignment routine has
been provided to give alignments between hidden data and the
consensus, or between contigs in the join editor.
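
Mode (a) of "find next problem" amounts to a simple scan of the
consensus. The sketch below is illustrative only; the function name and
the use of 0-based positions are assumptions, and modes (b) and (c)
are not shown because they need strand and edit information.

```python
def find_next_problem(consensus, start=0):
    """Return the next position at or after `start` whose consensus
    character is not A, C, G or T, or None if there is none."""
    definite = set("ACGT")
    for position in range(start, len(consensus)):
        if consensus[position].upper() not in definite:
            return position
    return None


consensus = "ACGT-ACNGT*"
first_problem = find_next_problem(consensus)
second_problem = find_next_problem(consensus, first_problem + 1)
no_problem = find_next_problem("ACGTACGT")
```

Calling the function repeatedly, each time starting one position past
the last hit, steps through the problems in order.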

The consensus calculation and the "find next problem" options in the
contig editor have been reprogrammed to use only data for which the
numerical estimate of accuracy is above a cutoff value. In theory this
means that "find next problem" will only find problems in good data
because the poor data will be ignored. Almost all edits are performed
to make bad data agree with good, so in the long term, this procedure
should bring a huge saving in editing time. However reliable values
for base accuracy are best provided by the base calling software, and
the ones calculated by our routine eba, although correlated with the
numbers of edits required, should be viewed as being for demonstration
purposes only. By default therefore, the cutoff for inclusion of data
in the consensus calculation is set to -1, so that all bases are used.
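
The role of the cutoff can be shown with a toy consensus routine. This
is a sketch and not the algorithm used by gap: the simple majority
vote, the "-" returned for ties or missing data, and the function name
are assumptions. Only the filtering step (use a base when its accuracy
is above the cutoff, with -1 admitting everything) follows the
description above.

```python
from collections import Counter


def consensus_base(column, cutoff=-1):
    """Pick a consensus character for one column of (base, accuracy)."""
    # Keep only bases whose accuracy estimate is above the cutoff.
    used = [base for base, accuracy in column if accuracy > cutoff]
    if not used:
        return "-"
    counts = Counter(used).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "-"  # tie between the two most frequent bases
    return counts[0][0]


column = [("A", 90), ("C", 5), ("C", 8)]
all_bases = consensus_base(column)            # default cutoff of -1
good_only = consensus_base(column, cutoff=10)
nothing_left = consensus_base(column, cutoff=95)
```

With the default cutoff the two poor C calls outvote the good A; with
a cutoff of 10 only the A is used.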

For large scale projects we expect the use of experiment files to be
of great interest. They are augmented gel reading files and provide a
very simple mechanism for passing information between processing
steps. Based on EMBL sequence library entries with two-letter codes at
the beginning of each record, they are easy to parse and easy to
write. The general idea is that processing programs read all they need
from the experiment file, perform their particular operation and write
the result back to the end of the experiment file. For example vepe
reads, not only the sequence, but also the names of the vectors and
the primers used to produce the reading, then fetches the vector
sequence, performs its search and appends the result to the experiment
file. This new information can then be used by subsequent programs
including gap.
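
The read, process and append pattern described above can be sketched
in a few lines. This is an illustration, not the package's expFileIO
code: the function names, the three-space separator and all record
values are invented, and multi-line records such as the sequence are
deliberately left out.

```python
def parse_experiment_lines(lines):
    """Return a list of (code, value) pairs, one per non-blank line.

    The first two characters of each line are taken as the record type.
    """
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        records.append((line[:2], line[2:].strip()))
    return records


def append_record(lines, code, value):
    """Append a new record to the end of the experiment file lines."""
    lines.append("%s   %s" % (code, value))


experiment = [
    "ID   example.s1",   # hypothetical reading name
    "EN   example.s1",
    "PR   1",
]
append_record(experiment, "SL", "25")   # invented result of a later step
records = parse_experiment_lines(experiment)
primer_type = dict(records)["PR"]
```

Because every step only appends, a later program can pick up any
record written by an earlier one without knowing which program wrote
it.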

The "find internal joins" option in gap has been extended so that in
addition to the facility to compare the ends of all contigs with all
other contigs, users can now elect to compare the full length of each
contig against all others, or to select a single contig to compare
against all others. A further refinement allows tagged regions to be
either "marked" or "masked". Marking means that the tagged regions
will appear in lower case in the displayed alignments. Masking means
that the tagged regions will not be used to find matches between
contigs, although if a match is found adjacent to such a segment it
will be aligned and the alignment score included in the overall value.
These last two options are designed to help with highly repetitive DNA
such as human where Alu repeats will cause many spurious matches to be
found. At present the only tag type recognised by these options is
that for Alu but in the near future the method will be generalised to
include lists of tag types for masking and marking.
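
The difference between marking and masking can be illustrated as
follows. This is a sketch under assumptions, not the "find internal
joins" code: tagged regions are given as (start, end) pairs with the
end exclusive, and masking is modelled simply as refusing to start a
match on any word that overlaps a tagged region.

```python
def mark(sequence, tagged):
    """Return the sequence with tagged regions shown in lower case."""
    chars = list(sequence.upper())
    for start, end in tagged:
        for i in range(max(0, start), min(len(chars), end)):
            chars[i] = chars[i].lower()
    return "".join(chars)


def unmasked_word_starts(length, tagged, word):
    """Start positions of words of size `word` that avoid tagged regions."""
    starts = []
    for s in range(length - word + 1):
        if all(s + word <= start or s >= end for start, end in tagged):
            starts.append(s)
    return starts


sequence = "ACGTACGTAC"
alu_tags = [(3, 6)]                     # hypothetical Alu-tagged region
marked = mark(sequence, alu_tags)
seeds = unmasked_word_starts(len(sequence), alu_tags, word=3)
```

Marking changes only how the alignment is displayed, whereas masking
changes which words may be used to find a match in the first place.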

A program called convert, for converting bap databases to gap
databases, is included in the package.

Finally we note that an internet newsgroup for users of the package has
been created recently. Although the group, which is run from Montreal,
Canada by Tim Littlejohn, is independent of us, the developers, we
encourage people to make use of it. Requests for functions not
available in the package will help guide future development and
questions about existing programs will help us to improve this manual.

		Release Notes for Staden Package 1994.1.0
	This release contains few major changes. However it allows us
	to announce that the next release will contain a new assembly
	program called GAP (Genome Assembly Program). Further details
	are contained in the file PreRelease.GAP

	1. BAP has been changed to permit command line arguments for
	specifying the maximum consensus sequence length, and the
	maximum database size.

	2. Find internal joins has been broadened in its scope and made more
	useful for dealing with repetitive sequences:
	a. It will now allow comparisons of everything against everything
	   whereas originally it only compared a segment of size "probe
	   length" from the ends of each contig with every other contig.
	   Now all segments of size probe from along the length of every
	   contig can be compared against all other contigs.
	b. It will now allow a single probe to be compared against all
	   other sequences. This is useful for comparing a repeated region
	   against other occurrences of the repeat.
	c. For those assembling human data two new facilities can be used
	   to ease the problem of dealing with Alu rich sequences. Segments
	   of sequence tagged as containing Alu can now be masked or marked.
	   Masked means that Alu tagged segments will not be used to search
	   for matches but will be included when the percentage mismatch is 
	   calculated. (The Alu segments will also be shown in lower case
	   letters.) This means that all matches will contain a section of
	   size "minimum match length" that is not tagged as Alu. Marked
	   means that Alu tagged regions can be used during the matching
	   process but will be shown in lower case in the display of the 
	   alignment. These new facilities could easily be generalised to
	   other tag types.

	3. Numerous minor bug fixes in the assembly programs.

	4. Bug fix in the sequence library accessing routines that caused
	searches based on accession numbers to crash on DEC alphas.

		Release Notes for Staden Package 1994.0.0

	Sequence library changes

	The text and author searches of the sequence libraries are
	one of the strong points of the package. We have made some
	useful additions and changes.

	1. Taxon (or species) index search added.

	2. NOT operator added for index searches.

	3. The routines now only show options for which indexes are

	4. We have greatly simplified the installation procedure
	   for the sequence libraries. All relevant files are in
	   $STADTABL and further information is contained in file

	Prosite library changes

	1. We now provide access to prosite via the sequence library
	   index searching routines. Text searches can be performed
	   on both the .dat and the .doc files. As with the sequence
	   library index searches they are effectively instantaneous.
	   The new interface makes it very easy to search and browse
	   through prosite.

	2. A copy of the prosite library and indexes is now included
	   in the distribution in $STADTABL/prosite/indices.

	3. Copies of the reformatted prosite library suitable for use
	   by pip are now included in the distribution in
	   $STADTABL/prosite/pats. In addition to the environment variable
	   PROSITENAMES we have now included PROSITEP which is set to
	   $STADTABL/prosite/pats. This means that any entry in the prosite 
	   library (say entry PS00XYZ) can be searched for from pip by
	   using PROSITEP/PS00XYZ.PAT as the pattern file name.

	4. A bug was fixed in splitp3 and it is now modified to deal with
	   prosite.dat as it arrives on cdrom. ie it is no longer necessary
	   to remove the ^M before running splitp3 to produce the pattern
	   files for pip.
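
	For users of older copies of splitp3, the manual step that is no
	longer needed amounts to the following. This is a hedged Python
	sketch of removing the ^M (carriage return) characters, not part
	of the package; the function name is hypothetical.

```python
def strip_carriage_returns(data: bytes) -> bytes:
    """Convert DOS line endings (CR LF, shown as ^M at line ends) to LF."""
    return data.replace(b"\r\n", b"\n")
```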

	The provision of index searching makes the use of prosite from
	the package much easier and more powerful.

	New version of the manual 

	1. The manual has been updated to include recent additions and
	   is now 165 pages in length.
           It also includes a simple method of calculating the cloning
	   site and primer positions for vep. The manual is on disk and in
	   $STADENROOT/doc/manual.PS  (postscript)
	   $STADENROOT/doc/manual.RTF  (RTF format)

	Assembly program changes

	1. We have detected and fixed a bug which resulted in padding
	   characters in overlapping readings not being aligned. Future
	   alignments should be better.

	New script to aid data manipulation prior to assembly
	1. Prebap script (see directory $STADENROOT/src/scripts/prebap) will
           automate the procedures to take a folder of ABI samples to end
           with data (gel sequence files, SCF files, and a file of filenames)
           ready for input to bap/xbap. Please see the prebap manual for more

	New documentation files

	1. Prebap

	2. GelReadingFile.format

	3. KnownBugs

	4. new.help (in $STADENROOT/help directory)

Changes in 1993.3 release 16/11/93

	There is a new 154 page manual in $STADENROOT/doc/manual.rtf

	A test package ($STADENROOT/testpackage) is now available to run
	through most functions of mep, nip, nipf, pip and sip. There is
	also a test database for bap.

	Cop memory corruption fixed.

	Nip now complements uncertainty codes correctly. Non-fatal tRNA
	search bug fixed for SGI.
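
	To show what complementing uncertainty codes involves, here is a
	minimal Python sketch. It assumes the standard IUPAC nucleotide
	codes; the package may use its own symbol set, and this is not
	the code used in nip.

```python
# IUPAC codes and their complements, e.g. R (A or G) pairs with Y (T or C).
# Assumption: IUPAC symbols, which may differ from the package's own codes.
COMPLEMENT = str.maketrans(
    "ACGTRYKMSWBDHVNacgtrykmswbdhvn",
    "TGCAYRMKSWVHDBNtgcayrmkswvhdbn",
)


def reverse_complement(seq: str) -> str:
    """Complement every base, including uncertainty codes, then reverse."""
    return seq.translate(COMPLEMENT)[::-1]
```

	Applying the function twice should return the original sequence,
	which is a simple check that every code has been paired correctly.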

	Assembly program changes:
	1. Faster contig editor

	2. Find internal joins bug fixes (alignments no longer displayed
	   when passing mismatch test but failing pad test; corrections
	   when mismatch score is precisely the maximum).

	3. Repeat search option

	4. Single stranded 'calculate consensus' now available

	5. Check assemble additions

	Known bugs still in this release

	When using Find Internal Joins in xbap, using the 'save contig'
	option of the contig editor can cause problems for later
	joins found within the same 'round'. This is noticed when adjusting
	the position of the cutoff data.
	Quitting Find Internal Joins and restarting solves the problem.

Changes in 1993.2 release 21/9/93

	The assembly programs bap and xbap have several new functions:
	1. Find single stranded regions and try to fill them with "hidden"
	   data from the adjacent readings.

	2. Find single stranded regions (includes ends of contigs) and 
	   select primers and templates for double stranding them (joining

	3. Pre assembly screening for readings to find those that align
	   best. Optionally the hidden data can also be included in the
	   comparison (part of assembly function).

	4. Find pairs of readings taken from opposite ends of the same
	   template (ie forward and reverse read pairs). List or plot their

	5. A new function to check that readings have been assembled into
	   the correct positions. It aligns the hidden (previously termed
	   "unused") parts of readings with the consensus they overlap to see
	   how well they align. Poor alignments are reported.

	6. During assembly each reading is now allowed to match up to 100
	   different
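
	The idea of a single stranded region used in items 1 and 2 above
	can be sketched in Python as follows. This is an illustration
	under stated assumptions and not the code in bap: readings are
	taken to be (start, end, strand) tuples with half-open positions
	on the contig, and the function name is hypothetical.

```python
def single_stranded_regions(readings, contig_length):
    """Return (start, end) regions covered by readings on exactly one strand.

    readings: iterable of (start, end, strand), half-open, strand "+" or "-".
    """
    plus = [False] * contig_length
    minus = [False] * contig_length
    for start, end, strand in readings:
        cover = plus if strand == "+" else minus
        for i in range(max(start, 0), min(end, contig_length)):
            cover[i] = True

    regions = []
    region_start = None
    for i in range(contig_length):
        single = plus[i] != minus[i]  # covered on one strand but not the other
        if single and region_start is None:
            region_start = i
        elif not single and region_start is not None:
            regions.append((region_start, i))
            region_start = None
    if region_start is not None:
        regions.append((region_start, contig_length))
    return regions
```

	The ends of contigs typically appear in the result because they
	are usually covered by a single reading.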

	It might be guessed from the above that we are trying to improve
	our ability to deal with the assembly of human data. Hence, also
	the next addition.

	A new experimental program (rep) for screening readings for Alu
	sequences prior to assembly. The Alu containing segments are tagged
	so they can be seen in the contig editor. A library of Alu
	sequences is included in /tables/alus. The program is quite slow as
	it compares each reading in both orientations with all of the Alu
	sequences (126 of them) in order to find the best match. Only time
	and more data will tell how sensitive it is, and whether the
	current default score of 0.6 is "correct". BEWARE rep modifies the
	original reading files to include the tag information. The only
	information is in /doc/alu.help
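
	The search strategy described above (every reading, in both
	orientations, against every Alu sequence, keeping the best match
	and applying a cutoff) can be sketched in Python as below. This is
	not how rep is implemented: the similarity function is a stand-in
	from the Python standard library, it scores whole sequences rather
	than locating segments, and all names are hypothetical. Only the
	default cutoff of 0.6 comes from the text.

```python
from difflib import SequenceMatcher

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def similarity(a, b):
    # Placeholder score in the range 0 to 1; NOT the score used by rep.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def best_alu_match(reading, alu_library, cutoff=0.6):
    """Return (score, name, orientation) for the best match, or None.

    alu_library: dict mapping a name to an Alu sequence (assumed layout).
    """
    best = None
    for name, alu in alu_library.items():
        for orientation, seq in (("+", reading), ("-", reverse_complement(reading))):
            score = similarity(seq, alu)
            if best is None or score > best[0]:
                best = (score, name, orientation)
    if best is not None and best[0] >= cutoff:
        return best
    return None
```

	Because every reading is scored twice against each of the library
	sequences, the cost grows with the product of the number of
	readings and the size of the library, which is consistent with the
	program being slow.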

	A new program for extracting sets of sequences and their
	annotations from the sequence libraries (lip). The only information
	is in /help/lip.help

	Changes to the xterm user interface. These routines have been
	completely rewritten. One addition is that now ?? in response to a
	question will allow the user to get help on any function in a
	program. Help is also improved in the x version.

Changes in earlier releases

	DAP, XDAP have been replaced by BAP and XBAP (see below)

	A new function for examining repeats has been added to NIP

	A new repeat search has been added to SIP

	Some outputs have been changed to produce FASTA format files
	instead of PIR.

	MEP now allows searches for motifs in which any 8 out of a string
	of 20 can be switched on.
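
	One way to picture a motif with only some positions switched on is
	as a mask laid over each window of the sequence. The Python sketch
	below is an assumption about the general idea, not MEP's code; the
	mask notation ("1" for on, "0" for off) and the function name are
	hypothetical.

```python
from collections import Counter


def spaced_words(seq, mask):
    """Count the words seen through mask at every position of seq.

    mask: string of "1" (position switched on) and "0" (switched off),
    for example 20 characters long with 8 of them set to "1".
    """
    on = [i for i, c in enumerate(mask) if c == "1"]
    counts = Counter()
    for start in range(len(seq) - len(mask) + 1):
        window = seq[start:start + len(mask)]
        counts["".join(window[i] for i in on)] += 1
    return counts
```

	For example a mask of "1111" followed by twelve "0" and "1111"
	has 8 positions switched on in a string of 20.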

	The manual has been updated.

        Keyword and author searches on sequence libraries

	All programs that use the libraries can now perform author
        and keyword searches on all libraries (only nip did so before).

	Postscript output:

	All graphics can now be saved to disk in postscript form by
        use of a sub-option in "Redirect output".

	Sequence assembly:

	BAP, XBAP replace DAP and XDAP. A program to convert DAP databases
	to BAP databases (convert) is included. BAP databases can contain
	up to 8000 readings and a consensus of 500,000 bases. A minor edit
	and recompilation will allow up to 99,999 readings. The space is
	used more efficiently now as the databases grow as the number of
	readings increases. Reading names can be 16 characters in length.

	In addition:

	1) Assembly is 4 times as fast as in DAP.

	2) Find internal joins is 5 times as fast and now brings up the
	   join editor with the two contigs in the correct orientation and

	3) The assembly routines align pads better, plus a new automatic
	   function can also be used to align them prior to editing.

	4) The contig editor has been greatly speeded up and its
	   functionality has been enhanced.

	5) A routine for selecting oligos for primer walking is included. 

	6) A new routine allows batches of readings to be removed from a

	7) We have also included routines for making SCF files, for getting
	   the sequence from SCF files, and one for marking the poor quality
	   data in readings. See the manual.

	Sequence library formats:

	The standard sequence library indexing method is now that used on
	the EMBL CD-ROM. The libraries (EMBL nucleotide and SWISSPROT
	protein) can be left on the CD-ROM or copied to disk. We include in
	the package programs for creating this type of index for EMBL
	updates, PIR in codata format, NRL3D and GenBank. If the indexes
	are created all programs can read all these libraries. Programs and
	scripts for this task are contained in the directory indexseqlibs.
	The keyword and author searches are particularly fast and the
	keyword index is based on ALL text in the files - not just the

	Feature table formats

	The programs now use the new feature table format common to EMBL
	and GenBank, but retain the old format for SWISSPROT which has not
	yet changed.

	For details of the above see file SequenceLibraries.

	Pattern searches

	Pipl and Nipl now have the facility to find only the best scoring
	match for each sequence. The prompt is "? report all matches", so
	typing only return means all matches will be shown and typing n
	means only the highest scoring will be reported. It is particularly
	useful when employed to create alignments. The corresponding help
	file has not been updated. Also to incorporate long unix file names
	the pattern files no longer include the annotation "filename".
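
	The effect of answering n to "? report all matches" can be
	sketched in Python as follows. The tuple layout and function name
	are assumptions made for illustration and do not reflect the
	internals of Pipl or Nipl.

```python
def select_matches(matches, report_all=True):
    """Return all matches, or only the highest scoring match per sequence.

    matches: iterable of (sequence_name, position, score) tuples.
    """
    if report_all:
        return list(matches)
    best = {}
    for name, position, score in matches:
        if name not in best or score > best[name][2]:
            best[name] = (name, position, score)
    return list(best.values())
```

	Keeping one match per sequence is what makes the option convenient
	for building alignments, since each sequence then contributes a
	single position.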


	Option 38 in nip "translate and list" has been removed as the
	more flexible routines of option 39 incorporate all its
	functionality. Many options that relate to feature tables have been
	modified but their help files are not yet up to date.


	A program (vep) for automatic excising of vector (either sequencing
	vector or cosmid vector) sequences from readings is now included in
	the package.

	Rodger Staden, James Bonfield
