Release Notes for Staden Package 1996.0.0 30th January 1996
One of the major changes (and the one that took most time to produce)
in this release is that GAP4 and TREV now have online help and we have
created our own WWW pages. The help can be browsed from within the
programs using Netscape or the simple inbuilt WWW browser included
with the package. Now that this information is available we give less
detail in these Release Notes as the online help reflects the current
status of the programs.
Apart from the online help, the main changes in this release are to
GAP4 (for those who missed it, the GAP4 paper came out: Bonfield,J K,
Smith,K F and Staden,R. A New DNA Sequence Assembly Program. Nucleic
Acids Res. 23, 4992-4999 (1995) ) but it also includes improvements,
bug fixes and additions to other components of the package. Several
changes have been made in response to requests by users and we
encourage more groups to contact us with suggestions for improvements
and with bug reports.
As an example of this, the Genome Sequencing Centre in St Louis asked
if we could reduce the file size for SCF files and so reduce the cost
of disk storage for large projects such as theirs. We decided to
change the way the different types of data were stored in SCF files so
that compression programs such as gzip would work more effectively on
them. These new style SCF files (SCF version 3.00) can be compressed
to around 40% of their original size. Programs like TREV and GAP4 can
read the files in compressed or uncompressed form (as well as all the
older styles of SCF format). All the new code for this purpose is in
our io_lib directory. The documentation for this useful library of
functions has also been improved. Other io_lib changes include two
new programs - "extract_seq" extracts only the sequence component of
either a trace or experiment file, and "scf_update" can be used to
convert between SCF formats 2 and 3; the RAWDATA environment variable
is now used as a list of directories when looking for a trace file to
load (from an experiment file); and a few minor bug fixes.
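As a rough illustration of why rearranging the data can help a general
purpose compressor (this is not code from io_lib), the sketch below
packs four synthetic sample channels in two ways and compresses both
with zlib. The notes above do not spell out the new layout, so the
"one channel after another, stored as differences" arrangement used
here is an assumption made for the example, as are all the function
names.

```python
import struct
import zlib


def interleaved_bytes(channels):
    """Pack samples point by point: A0 C0 G0 T0 A1 C1 ... (16-bit big-endian)."""
    out = bytearray()
    for point in zip(*channels):
        out += struct.pack(">%dH" % len(point), *point)
    return bytes(out)


def delta_encode(samples):
    """Replace each sample by its difference from the previous one (mod 2**16)."""
    prev = 0
    deltas = []
    for s in samples:
        deltas.append((s - prev) & 0xFFFF)
        prev = s
    return deltas


def delta_decode(deltas):
    """Invert delta_encode, recovering the original samples."""
    prev = 0
    samples = []
    for d in deltas:
        prev = (prev + d) & 0xFFFF
        samples.append(prev)
    return samples


def planar_delta_bytes(channels):
    """Pack one whole channel after another, each stored as differences."""
    out = bytearray()
    for ch in channels:
        d = delta_encode(ch)
        out += struct.pack(">%dH" % len(d), *d)
    return bytes(out)


if __name__ == "__main__":
    # Four synthetic ramps standing in for trace channels (not real data).
    n = 2000
    channels = [[(k + 1) * i for i in range(n)] for k in range(4)]
    a = zlib.compress(interleaved_bytes(channels), 9)
    b = zlib.compress(planar_delta_bytes(channels), 9)
    print("interleaved:", len(a), "bytes; per-channel deltas:", len(b), "bytes")
```

Both layouts hold exactly the same information and occupy the same
space before compression; only the ordering differs. How much is
gained on real traces depends on the data.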
The Sanger Centre said they wanted to treat readings produced using
dye terminators as being equivalent to having a reading from both
strands of the sequence. That is they wanted all of the functions that
check for coverage of both strands of the sequence, for example the
Experiment Suggestion functions or the Quality Plot, to treat single
stranded segments that are covered by dye terminator readings as
double stranded. This has been implemented by introducing a new
Experiment file record type, the CH or "Special Chemistry" record, and
a new entry in the GAP4 database for storing "Reading Flags". If the
CH record is present and set to 1, the reading can be treated as
equivalent to two readings, one from each strand. Within GAP4 users
can choose whether they want such readings to be treated in this way
for any of the functions that calculate a consensus sequence.
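The effect of the CH record on strand coverage can be sketched as
follows. This is only an illustration under stated assumptions: the
Reading tuple, the field names and the honour_ch switch are all
hypothetical and are not taken from the GAP4 source.

```python
from collections import namedtuple

# start and end are inclusive consensus positions; strand is "+" or "-";
# special_chem is True when the CH record is present and set to 1.
Reading = namedtuple("Reading", "start end strand special_chem")


def double_stranded(readings, length, honour_ch=True):
    """Return one boolean per position: is it covered on both strands?"""
    plus = [False] * length
    minus = [False] * length
    for r in readings:
        # A special chemistry reading counts for both strands if requested.
        both = honour_ch and r.special_chem
        for pos in range(max(r.start, 0), min(r.end, length - 1) + 1):
            if both or r.strand == "+":
                plus[pos] = True
            if both or r.strand == "-":
                minus[pos] = True
    return [p and m for p, m in zip(plus, minus)]
```

Passing honour_ch=False corresponds to a user choosing not to treat
such readings as double stranded.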
We made another change to the Experiment file format. The terminology
surrounding the direction of a reading on a template, the reading's
sense, its strand and orientation is confusing. In an attempt to
simplify it we have extended the Primer Type (PR record) definition by
adding the type 4. Now 0 means unknown, 1 means forward from universal
primer, 2 means reverse from universal reverse primer, 3 means forward
custom primer (previously it was any custom primer), and 4 means
reverse custom primer. We hope this makes it easier to create correct
Experiment files and means that the Direction of Read (DR record) is
no longer required (although we will continue to support it for now).
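The five PR values listed above can be written as a small lookup
table, from which the direction of a reading follows directly. The
names below are illustrative only, and the function returns plain
words rather than any particular DR record encoding.

```python
# Primer Type (PR record) values as described in these notes.
PRIMER_TYPES = {
    0: "unknown",
    1: "forward from universal primer",
    2: "reverse from universal reverse primer",
    3: "forward custom primer",
    4: "reverse custom primer",
}


def direction_from_primer(pr):
    """Return "forward", "reverse" or None (unknown) for a PR value."""
    if pr in (1, 3):
        return "forward"
    if pr in (2, 4):
        return "reverse"
    if pr == 0:
        return None
    raise ValueError("unrecognised PR value: %r" % (pr,))
```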
While looking at GAP4 databases from external laboratories we have
discovered that many contain missing or conflicting records,
particularly relating to template information, and have tracked this
down to errors/omissions in their Experiment files. In some cases this
was due to use of external (bugged) programs for setting up the
Experiment files, but it has also emphasised the need for us to help
people to use PREGAP. To improve this we have updated the PREGAP
documentation and added some template configuration files. It is
important to realise that to get the best from GAP4 it is necessary to
give it complete and correct data about all the readings it assembles.
Changes to GAP4 include a modified Quality Plot to make the problems
more apparent; a new "Independent Assembly" function in which a batch
of readings can be assembled as though they were the only readings in
the database, ie they will only be compared with one another; command
line arguments for the maximum consensus length and maximum number of
records in the database are now available; for long-running tasks,
like assembly, results are now written to the Output Window while the
function is running, rather than buffered up until the task has
finished; several functions have been greatly speeded up; each padding
character is now given an accuracy estimate that is the mean of the
estimates for the characters adjacent to it; the code for checking on
Read Pairs and for
plotting readings and templates in the Template Display has been
greatly improved (including bug fixes) and the listed output adjusted
accordingly; a bug in assembly that allowed reading names, that were
not the same as the Experiment file name, to be entered more than once
was fixed; a bug caused by reading names of 16 characters (thanks to
colleagues in Japan) was found and fixed; a bug that sometimes gave
incorrect consensus sequences in Find Internal Joins was fixed;
consensus and quality cutoff figures were previously often not used; a
consensus tag corruption occurred in some specific joining cases;
extract readings now outputs correct TN lines and is more robust with
very long sequences. Large numbers of less serious bugs were also
fixed.
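The rule for padding characters mentioned above can be sketched like
this. It is a minimal illustration and not the GAP4 code: the use of
"*" as the pad character, the treatment of runs of pads and of pads at
the ends of a reading, and the function name are all assumptions.

```python
def pad_accuracies(seq, conf, pad="*"):
    """Return a copy of conf in which each pad gets the mean of its neighbours."""
    out = list(conf)
    n = len(seq)
    i = 0
    while i < n:
        if seq[i] != pad:
            i += 1
            continue
        j = i
        while j < n and seq[j] == pad:
            j += 1
        # seq[i:j] is a run of pads; use the real bases on either side.
        neighbours = []
        if i > 0:
            neighbours.append(conf[i - 1])
        if j < n:
            neighbours.append(conf[j])
        value = sum(neighbours) / len(neighbours) if neighbours else 0
        for k in range(i, j):
            out[k] = value
        i = j
    return out
```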
TREV can read SCF files via their Experiment file. All edits are saved
to Experiment files, rather than to the SCF file. Several small bugs
have been fixed in TREV. ALFSPLIT and CONVERT have also been improved. People
have started to use REPE for sequence families other than Alu and in
doing so have uncovered a number of Alu specific assumptions, which
have now been removed.
Two bugs have been fixed in the pattern search routines in NIP and
NIPL. Bernard Caudron at the Pasteur pointed out that the CODATA
version of PIR files had changed and this had broken our sequence
library index creation programs and our reading routines. He sent
fixes and they are included in the Release.
Silicon Graphics have now fixed the bug in their Fortran that was
breaking our sequence library access routines and so, once again,
libraries can be read on SGI machines. This bug fix is available as a
patch from SGI for current systems, and is fixed as standard in the
forthcoming Irix 6.2 release.
In summary the major changes have been the addition of online help to
GAP4 and TREV, numerous bug fixes and speedups to GAP4, and changes to
SCF and Experiment file formats. Feedback welcome.
Rodger Staden, James Bonfield and Kathryn Smith
Release Notes for Staden Package 1995.1.0 14th September 1995
The major change with this release is the inclusion of our new version
of gap. Currently, to distinguish this from the existing gap, we are
calling our new version "gap4". Gap4 is currently considered as a beta
release. When we finish the beta test stage gap will be renamed to
gap3, and gap4 will be renamed to gap. In the longer term gap (i.e.
the new program) will be the only assembly program we support. In the
even longer term the whole package will have a gap4-like interface. We
encourage the use of this release of gap4 and are unaware of any
An overview of gap4 is contained in the file doc/gap4.help and a
partially assembled database B0334 which can be used to try out the
new program is in userdata.
We also include our new trace viewer and editor program called
trev. This was initially written as an exercise in the use of Tcl and
Tk but now gives a better user interface and interaction with
For those on our automatic update list we apologise for the long delay
since the last update, which is due to us concentrating on getting
gap4 into a releasable state. We hope you find it worth waiting for.
Highlights from gap4
One of our main objectives with the new program was to provide many
more visual clues as to the current state of a sequencing project and
to allow the users to interact in more intuitive ways with their
data. We were particularly interested in the problems of dealing with
repetitive sequences, and wanted to supply tools to display and
manipulate the various types of data that might help to solve
difficult assemblies. To this end we have introduced new displays and
a new gap data item the "contig order". The new displays are the
"contig selector", the "Contig Comparator", the "template display",
the "restriction enzyme map" and the "stop codon map". We have also
made it possible to have any number of contig editors and contig
joining editors running simultaneously. The same contig can be viewed
in several editors simultaneously, hence allowing repetitive regions
to be compared.
In previous versions of our assembly programs the user had no control
over the relative order of contigs during processing and, even had it
been possible, there was no functionality to make use of it. The new
gap stores the "contig order" in its database and, through a new type
of display, the "contig selector", this information is always visible
while the program is running. The "contig order" is simply the
relative positions of the contigs. In the "contig selector" all
contigs are shown, each being represented by a horizontal line
proportional to its length. The left to right order of these lines
defines the contig order. Users can reorder the contigs by dragging
the lines that represent them around inside the contig selector
display. The contig selector can also be used to select contigs for
processing. Tags can be displayed in the contig selector window.
The Contig Comparator is used to display the results of comparing
contigs. It is our solution to the problem of displaying multiple
types of data about the possible relationships between contigs. It can
currently show the results of searches for templates that have
readings in more than one contig, the results of the old "find
internal joins" function, the results of searches for repeats and the
results of "Check Assembly". These searches reveal information about
the possible relative order of the contigs, or the positions of
problems, and the Contig Comparator allows all of their results to be
displayed and manipulated together. When any of these types of search
is performed the contig selector automatically converts to a Contig
Comparator by duplicating itself in the vertical direction. Results
are plotted in the rectangular display created in this process.
Furthermore the manual contig shuffling procedure outlined above can
still be performed and the plotted results associated with any dragged
contig will move along with it to its new location in the display. As
is explained below this greatly facilitates contig ordering and can
help users understand difficult assemblies and plan experiments. The
Contig Comparator can also be used to invoke the join editor, the
contig editor and the template display.
The template display shows a schematic of all the readings and
templates for a single contig. Each is represented by a horizontal
line proportional to its length. Colour coding shows strandedness and
arrows indicate the direction of the reading. Selected tags can be
plotted as can the quality plot (now colour coded) that was available
in the previous programs and a new restriction enzymes display.
Templates that appear in more than one contig are also colour
coded. This display can also be used to select readings for
For those who employ restriction enzyme mapping data to aid their
assembly projects we have added functions to locate and display the
positions of restriction sites. Selected sites can be converted to
tags that can be displayed in all the usual ways.
A stop codon plot is available to display stop codons in three or six
reading frames. It can be linked to the contig editor to reflect the
The contig editor contains several selectable status lines to display
information about the readings contributing to the consensus and for
displaying translations in any of the six reading frames.
A further new feature of gap is its ability to create and use "lists".
Users of our package will be familiar with the idea of "files of file
names" and know their value for processing batches of data. For the
new gap we have extended this concept so that many of its commands
operate on lists of items. To facilitate this mode of work we have
provided routines to create and manage lists.
Changes and bug fixes to other programs
The "eba" phase was always running eba on the first reading in the
Correctly handles cases where the ID/EN lines are named
differently from the experiment file filename.
Support for odd reading names, such as those that exist but are blank,
or that contain spaces.
Better ALF support (reading name generation).
Interactive clipping using the trev program (a new version of ted).
Uses the vepe screen against vector instead of the gap option.
More robust when given invalid CS line input or zero length
Bug fixes with X11 timings during contig editor quit.
Align command in contig editor works better.
Removed memory leak from enter preassembled data. Also no longer
crashes when the input file of filenames does not exist. More robust
when we have no LN/LT/SQ lines.
Fixed file permissions after copy database
Busy files renamed to PROJECT.V.BUSY
Tags of precisely length 1 previously sometimes caused problems
Fixed buffer overrun in save consensus tags (which caused blank output
files under SunOS 4). We also support specifying ranges for this now
Removed cross hair usage for "find read pairs" (it didn't work)
Corrected assembly alignment failure error code (was 5, now 2). Also
some readings were failing incorrectly when requesting not to join.
Break contig is more intelligent in cases where breaking at a single
reading would generate more than 2 contigs.
Disassembly previously could corrupt the tag lists.
Removed ABMG tag from the standard GTAGDB file.
Added "screen for restriction sites"
Added "screen against vector"
Cosmid vector search now locates the left and right ends.
4. Trace file IO (within gap, gap4, ted, makeSCF, etc)
Programs using SCF files now support the older SCF format too.
ABI reading code will now recognise the simplest forms of MacBinary and
automatically strip off the header (this assumes it's 128 bytes).
Added getABIstring and getABIcomment functions to the ABI io library.
Updated getABISampleName. As before, but greatly simplified code.
New getABIdate command.
Better support for reading experiment files via the trace level
interface. We have better support for writing back to the original
5. A new trace viewer program, named trev, to replace ted.
Improved user interface.
Better integration with experiment files.
6. New version of gcgentryname2 to support their new format.
7. Clip no longer crashes when run on blank files.
8. The convert program is more reliable when converting from bap to gap. The
default quality value for bases has been changed from 0 to 100.
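The MacBinary handling described under item 4 can be illustrated with
a short sketch. ABI trace files begin with the four bytes "ABIF"; the
idea of looking for that signature at offset 0 and then at offset 128
is an assumption about how such a check might be made, and the
function names are hypothetical rather than those of the ABI io
library.

```python
ABI_MAGIC = b"ABIF"
MACBINARY_HEADER_LEN = 128  # the header size assumed in the notes


def abi_data_offset(data):
    """Return the offset at which the ABI data starts, or None if not found."""
    if data[:4] == ABI_MAGIC:
        return 0
    start = MACBINARY_HEADER_LEN
    if data[start:start + 4] == ABI_MAGIC:
        return start
    return None


def strip_macbinary(data):
    """Return the ABI data with any 128 byte MacBinary header removed."""
    offset = abi_data_offset(data)
    if offset is None:
        raise ValueError("no ABI signature at offset 0 or 128")
    return data[offset:]
```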
James Bonfield, Kathryn Smith and Rodger Staden
Release Notes for Staden Package 1995.0.0
Most of the changes in the new release are bug fixes, but some
additions are listed at the top. Currently our efforts are going into
the production of a new graphical user interface and so we have less
time for writing new options.
We find that most of our bug reports come from a small subset of
users. Please, if you find problems with the package, let us know by
email, and we will fix things as quickly as possible, and make the
fixes available via ftp.
Changes and bug fixes
1. Major changes to gap. Added idea of "active tag types" which the user
can define and use during find internal joins and assembly.
Added new option in assembly which enables reads that match but do not
align well to be entered as new contigs.
Active tags are set in "set display parms".
Replaced all references to operating on all or one contig by a new
routine that uses a radio button instead of the commonly used yesno. One
outcome is that many scripts will need changing. Also changed the
quality codes to be 0-9 rather than 0-4.
2. Added preassembly code. This is a new option that enters a
single contig into the database. To facilitate the change other
options have been changed. Save consensus tags now accepts a region;
extract readings now asks whether quality and position information
should be output (and a few other question shuffles here); expFileIO
and seqInfo have been updated to handle opos and conf items in exp
3. Added an interactive "cop" (ie find places in the consensus for
which the evidence is unclear) to the contig editor. Named as "Verify &" and
"Verify |" in the search window. Verify & means look at both strands
together, Verify | means treat strands separately.
Removed the old option from the menus. Bug fixed: Cop was taking
notice of strand information when not appropriate.
4. Extract gel readings now outputs SL, SR, CS, PR, ON and AV line types.
5. Changed "Type:" label in the search window to "Tag type:". This also has
the side effect of changing the name in the tag editor window, but this isn't
6. Although it seems a backward step, modified the sequence library
handling routines to include gcg as well as all the others. I do not
recommend changing to this format, but having the ability to deal with
it will help sites that want to support both packages and provide
sequence library access for each with minimal use of disk space.
My view is that there should be a single sequence library format
(not a different one for each collection and distribution centre),
and that the libraries should be distributed ready to use, and hence
should not require reformatting. We have made this possible for users
of the EMBL CDROM, and have come as close as we could for the other
libraries by providing index creating programs for their distributed formats.
From this point of view handling yet another proprietary format is a
retrograde step and it would be better if all packages supported the
distributed format. That said I hope some sites find it useful.
Requires modification of the division lookup file format (for gcg libraries)
but will also work with existing files. At present the new sequence
reading code is not as efficient as that for the distributed formats
of the libraries, but I believe it works.
1 GCGPATH/em_ph.seq GCGPATH/em_ph.ref
1 EMBLPATH/phg.dat EMBLPATH/phg.dat
7. The SunOS 4 Makefile has been changed to use X11R5. This cannot be done in
a portable fashion without people editing the file to change the gcc
8. Find internal joins. Several problems were found relating to the
new mode of use "search with single segment". These have been fixed.
9. Minor fix to hairpin loop search in nip: an uninitialised variable
for the case of zero matches caused problems for displaying the number
of matches found.
10. The temporary tags used by select oligo in the contig editor were
disrupting consensus tags. Crashes could result.
11. Traces for complemented readings would become misaligned when
adjusting cutoffs with padded sequence.
12. In the course of trying to handle gcg sequence libraries we discovered
mistakes in seqlibsubs.f: two sequence reading routines were sent an extra
argument and a 5 byte string was filled with 6 chars.
13. Spotted a missing 'break;' in the undo case statement. The consequence was
that undoing a confidence change also used the same data to undo a transpose
14. Delete contig could corrupt memory.
15. Fix get_gel_num() function when dealing with the /name convention. This
fixes the alter relationship gel code.
16. Fixed a bug with the compare strands function of the contig editor. Various
options would then fail when computing a consensus - eg deleting a consensus
pad, using align, dump contigs.
17. io_get_extension() would return negative lengths or crash in memcpy when
vector tags were in the used data. It could also miss VEC tags when they
overlapped the used data by > 1.
18. Assembly in gap (and presumably bap, etc) crashed if there were
more than maxc=100 overlaps.
19. Initialise the 'next' pointer to zero for newly created tags in the
spltag_() function. Previously this produced unpredictable results for break
contig and disassemble readings.
20. Tags weren't being shifted on the consensus correctly when disassemble
readings changed the contig start.
21. Recent improvement to give more information for infrequent restriction
enzyme sites was bugged in routine findl1 which caused routine s2 to crash.
22. Fix freeDB() bug in contig editor. It didn't check if DB_Name and DB_Seq
had been allocated before freeing, and hence could free NULL data.
23. The Find Oligo editor function creates temporary tags. These had locally
(non malloced) defined comments that were later incorrectly freed.
Rodger Staden and James Bonfield
Release Notes for Staden Package 1994.2.0
(Please make sure these notes reach the users)
This release contains a very large number of changes to the
software and the manual because, for the first time, it
includes our new and long awaited assembly program gap, and
all its associated programs and scripts. Gap replaces our
previous assembly programs (dap and bap) which were always
temporary. Rather than rewrite the same information in several
ways we include below the preface to the new edition of the
manual which contains a list of the major changes.
The body of the manual contains descriptions of the new features and
rewrites of changed options. We will only fix bugs in bap in
future and dap will be removed from the distribution.
Note that we have removed reference to sap and bap from the
manual and that the contents concerned with assembly are true
only for gap. Previous copies of the manual should be kept by
groups finishing off bap projects. Note that the package
includes a program for converting from dap and bap databases to
gap database format.
The testpackage directory includes a complete set of data for
demonstrating pregap and gap.
The copy of prosite on the distribution corresponds to release
12.0. Note that this release is the first to contain the new
"matrix" method for defining patterns. As yet we have not
written our own code to deal with this format and so such
patterns are not translated into pattern files by splitp3. For
this release only one pattern is described as a matrix and is
hence missing from our set of patterns for use by pip. We hope
to be able to use the matrix files shortly.
Preface to manual 94.2
As is obvious from its increased size (up by around 50 pages) this
new edition of the manual contains the greatest number of additions
and changes so far made, and the increase is due to the launch of our
final sequence assembly program and all its accompanying innovations.
The new program gap (genome assembly program) has a new database that
stores a great deal of extra information about each reading, and uses
experiment files as its standard input format. Although reliable
values are not yet available, gap is programmed to use numerical
estimates of base accuracy for several of its operations. The manual
includes essays on the use of experiment files and numerical estimates
of base accuracy in the introductory chapter.
To accompany gap and the use of experiment files we include several
new pre-assembly programs and a script to combine them all into a
single operation. The new script is called pregap and it uses init_exp
to initialise an experiment file, makeSCF to convert ABI and Pharmacia
ALF files to SCF files, eba to estimate base accuracies for data in
SCF files, clip to locate and mark the start of poor quality at the 3'
ends of readings, vepe to find and mark the positions of vector
sequences, and repe to perform a similar job for repeat families such
as Alu. Pregap is designed to be easily interfaced with local file
organisation and databases.
The gap database is completely new and is designed to be extendable
and robust in the event of system crashes. The gap program has a large
number of new options and improvements to previous ones. Routines have
been added to automatically locate problems and suggest solutions. The
program can find single stranded regions and suggest resequencing
particular reads on a long gel machine to fill them. Another routine
can find regions that are tagged as containing compressions and
suggests resequencing readings using Taq terminators. Routines that
extract data from the database such as "extract gel readings" now
write their output in experiment file format. Similarly routines that
read in tags use experiment file format.
The program cop has now been properly integrated into gap, and as gap
records all its edits in a better way than previous programs, its
results are obtained more quickly and are more reliable.
The contig editor has been improved in numerous ways, necessitating a
complete rewrite of the corresponding section of the manual. Trace
movement is now coupled to the movement of the editing cursor. The
"find next problem" command now works in three modes: a) find the next
position where the consensus is not A,C,G or T; b) find the next
position where there is not good data on both strands (and they
agree); c) find the next position where an edit has been performed.
Several new key bindings have been added and an alignment routine has
been provided to give alignments between hidden data and the
consensus, or between contigs in the join editor.
The consensus calculation and the "find next problem" options in the
contig editor have been reprogrammed to use only data for which the
numerical estimate of accuracy is above a cutoff value. In theory this
means that "find next problem" will only find problems in good data
because the poor data will be ignored. Almost all edits are performed
to make bad data agree with good, so in the long term, this procedure
should bring a huge saving in editing time. However reliable values
for base accuracy are best provided by the base calling software, and
the ones calculated by our routine eba, although correlated with the
numbers of edits required, should be viewed as being for demonstration
purposes only. By default therefore, the cutoff for inclusion of data
in the consensus calculation is set to -1, so that all bases are used.
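The interaction between the accuracy cutoff and "find next problem"
can be sketched as below. This is an illustration only: the simple
majority vote, the use of "-" for positions with no clear consensus,
and the function names are assumptions, not the method used in gap.

```python
from collections import Counter


def consensus_base(column, cutoff=-1):
    """column is a list of (base, accuracy) pairs for one position.

    Only bases whose accuracy is above the cutoff take part, so the
    default of -1 lets every base with a non-negative value vote.
    """
    votes = Counter(base for base, acc in column if acc > cutoff)
    if not votes:
        return "-"          # no usable data at this position
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "-"          # tie: no clear consensus
    return ranked[0][0]


def find_next_problem(consensus, start=0):
    """Mode (a): next position from start where the consensus is not A, C, G or T."""
    for pos in range(start, len(consensus)):
        if consensus[pos] not in "ACGT":
            return pos
    return None
```

With a higher cutoff, low accuracy bases that disagree with good data
no longer take part in the vote, which is why fewer positions would be
reported as problems.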
For large scale projects we expect the use of experiment files to be
of great interest. They are augmented gel reading files and provide a
very simple mechanism for passing information between processing
steps. Based on EMBL sequence library entries with two-letter codes at
the beginning of each record, they are easy to parse and easy to
write. The general idea is that processing programs read all they need
from the experiment file, perform their particular operation and write
the result back to the end of the experiment file. For example vepe
reads, not only the sequence, but also the names of the vectors and
the primers used to produce the reading, then fetches the vector
sequence, performs its search and appends the result to the experiment
file. This new information can then be used by subsequent programs
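The read, process and append cycle described above might look like
this in outline. It is a minimal sketch under stated assumptions: each
record is taken to be a single line with a two-letter code followed by
white space and a value, multi-line records such as the sequence are
not handled, and the reading name and SL value in the demonstration
are invented.

```python
import os
import tempfile


def read_records(path):
    """Return a list of (code, value) pairs, one per non-blank line."""
    records = []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line.strip():
                continue
            records.append((line[:2], line[2:].strip()))
    return records


def first_value(records, code):
    """Return the value of the first record with this code, or None."""
    for c, value in records:
        if c == code:
            return value
    return None


def append_record(path, code, value):
    """Add a result to the end of the file, as a processing step would."""
    with open(path, "a") as fh:
        fh.write("%-2s   %s\n" % (code, value))


def demo():
    """Write a tiny file, append one record to it, and read it back."""
    fd, path = tempfile.mkstemp(suffix=".exp")
    try:
        with os.fdopen(fd, "w") as fh:
            fh.write("ID   xb54a3.s1\nPR   1\n")   # invented example data
        append_record(path, "SL", "25")            # invented result
        return read_records(path)
    finally:
        os.remove(path)
```

Because each step only appends, earlier records are left untouched and
a later program can pick up whichever codes it needs with first_value.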
The "find internal joins" option in gap has been extended so that in
addition to the facility to compare the ends of all contigs with all
other contigs, users can now elect to compare the full length of each
contig against all others, or to select a single contig to compare
against all others. A further refinement allows tagged regions to be
either "marked" or "masked". Marking means that the tagged regions
will appear in lower case in the displayed alignments. Masking means
that the tagged regions will not be used to find matches between
contigs, although if a match is found adjacent to such a segment it
will be aligned and the alignment score included in the overall value.
These last two options are designed to help with highly repetitive DNA
such as human where Alu repeats will cause many spurious matches to be
found. At present the only tag type recognised by these options is
that for Alu but in the near future the method will be generalised to
include lists of tag types for masking and marking.
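The difference between marking and masking can be shown with a small
sketch. It is not the gap implementation: tags are represented here as
(start, end) pairs with an exclusive end, and the function names are
hypothetical.

```python
def mark(seq, tags):
    """Lower-case the tagged regions for display; the rest is upper case."""
    chars = list(seq.upper())
    for start, end in tags:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = chars[i].lower()
    return "".join(chars)


def seed_positions(seq, word, tags, masked):
    """Start positions of words of length `word` that may be used to find matches.

    When masked is True, any word overlapping a tagged region is
    skipped, so tagged sequence cannot start a match.
    """
    positions = []
    for i in range(len(seq) - word + 1):
        overlaps_tag = any(i < end and start < i + word for start, end in tags)
        if masked and overlaps_tag:
            continue
        positions.append(i)
    return positions
```

Marking changes only how an alignment is displayed, whereas masking
changes which parts of the sequence can initiate a match.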
A program called convert, for converting bap databases to gap
databases, is included in the package.
Finally we note that an internet newsgroup for users of the package has
been created recently. Although the group, which is run from Montreal,
Canada by Tim Littlejohn, is independent of us, the developers, we
encourage people to make use of it. Requests for functions not
available in the package will help guide future development and
questions about existing programs will help us to improve this manual.
Release Notes for Staden Package 1994.1.0
This release contains few major changes. However it allows us
to announce that the next release will contain a new assembly
program called GAP (Genome Assembly Program). Further details
are contained in the file PreRelease.GAP
1. BAP has been changed to permit command line arguments for
specifying the maximum consensus sequence length, and the
maximum database size.
2. Find internal joins has been broadened in its scope and made more
useful for dealing with repetitive sequences:
a. It will now allow comparisons of everything against everything
whereas originally it only compared a segment of size "probe
length" from the ends of each contig with every other contig.
   Now all segments of size probe from along the length of every
contig can be compared against all other contigs.
b. It will now allow a single probe to be compared against all
other sequences. This is useful for comparing a repeated region
against other occurrences of the repeat.
c. For those assembling human data two new facilities can be used
to ease the problem of dealing with Alu rich sequences. Segments
of sequence tagged as containing Alu can now be masked or marked.
Masked means that Alu tagged segments will not be used to search
for matches but will be included when the percentage mismatch is
calculated. (The Alu segments will also be shown in lower case
letters.) This means that all matches will contain a section of
size "minimum match length" that is not tagged as Alu. Marked
means that Alu tagged regions can be used during the matching
process but will be shown in lower case in the display of the
alignment. These new facilities could easily be generalised to
other tag types.
3. Numerous minor bug fixes in the assembly programs.
4. Bug fix in the sequence library accessing routines that caused
searches based on accession numbers to crash on DEC alphas.
Release Notes for Staden Package 1994.0.0
Sequence library changes
The text and author searches of the sequence libraries are
one of the strong points of the package. We have made some
useful additions and changes.
1. Taxon (or species) index search added.
2. NOT operator added for index searches.
3. The routines now only show options for which indexes are
4. We have greatly simplified the installation procedure
   for the sequence libraries. All relevant files are in
$STADTABL and further information is contained in file
Prosite library changes
1. We now provide access to prosite via the sequence library
index searching routines. Text searches can be performed
on both the .dat and the .doc files. As with the sequence
library index searches they are effectively instantaneous.
The new interface makes it very easy to search and browse
2. A copy of the prosite library and indexes is now included
in the distribution in $STADTABL/prosite/indices.
3. Copies of the reformatted prosite library suitable for use
by pip are now included in the distribution in
$STADTABL/prosite/pats. In addition to the environment variable
PROSITENAMES we have now included PROSITEP which is set to
$STADTABL/prosite/pats. This means that any entry in the prosite
library (say entry PS00XYZ) can be searched for from pip by
using PROSITEP/PS00XYZ.PAT as the pattern file name.
4. A bug was fixed in splitp3 and it is now modified to deal with
prosite.dat as it arrives on CD-ROM, i.e. it is no longer necessary
to remove the ^M before running splitp3 to produce the pattern
files for pip.
The provision of index searching makes the use of prosite from
the package much easier and more powerful.
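For scripts that need to locate a pattern file outside pip, the naming
scheme in item 3 can be sketched as follows. This assumes PROSITEP is
set as described and that each pattern file is named after its entry
with a .PAT suffix; the helper name and the example directory are
hypothetical.

```python
import os


def prosite_pattern_path(entry, env=os.environ):
    """Build the pattern file path for a prosite entry under PROSITEP."""
    pats_dir = env.get("PROSITEP")
    if pats_dir is None:
        raise KeyError("PROSITEP is not set")
    return os.path.join(pats_dir, entry + ".PAT")


# Illustrative environment only; a real installation sets PROSITEP itself.
example_env = {"PROSITEP": "/staden/tables/prosite/pats"}
print(prosite_pattern_path("PS00XYZ", example_env))
```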
New version of the manual
1. The manual has been updated to include recent additions and
is now 165 pages in length.
It also includes a simple method of calculating the cloning
site and primer positions for vep. The manual is on disk and in
$STADENROOT/doc/manual.RTF (RTF format)
Assembly program changes
1. We have detected and fixed a bug which resulted in padding
characters in overlapping readings not being aligned. Future
alignments should be better.
New script to aid data manipulation prior to assembly
1. Prebap script (see directory $STADENROOT/src/scripts/prebap) will
automate the procedures that take a folder of ABI samples and end
with data (gel sequence files, SCF files, and a file of filenames)
ready for input to bap/xbap. Please see the prebap manual for more
New documentation files
4. new.help (in $STADENROOT/help directory)
Changes in 1993.3 release 16/11/93
There is a new 154 page manual in $STADENROOT/doc/manual.rtf
A test package ($STADENROOT/testpackage) is now available to run
through most functions of mep, nip, nipf, pip and sip. There is
also a test database for bap.
Cop memory corruption fixed.
Nip now complements uncertainty codes correctly. Non-fatal tRNA
search bug fixed for SGI.
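As an illustration of what complementing uncertainty codes involves,
here is a minimal sketch. It assumes the standard IUPAC nucleotide
codes, which may differ from the code set nip uses internally; the
function name is hypothetical.

```python
# Each IUPAC code maps to the code for the complementary base set,
# e.g. R (A or G) <-> Y (T or C), B (not A) <-> V (not T).
IUPAC_COMPLEMENT = str.maketrans(
    "ACGTRYKMSWBDHVNacgtrykmswbdhvn",
    "TGCAYRMKSWVHDBNtgcayrmkswvhdbn",
)


def reverse_complement(seq):
    """Complement every code, then reverse the sequence."""
    return seq.translate(IUPAC_COMPLEMENT)[::-1]


print(reverse_complement("ACGTRYN"))
```

Codes such as S, W and N are their own complements, so applying the
function twice returns the original sequence.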
Assembly program changes:
1. Faster contig editor
2. Find internal joins bug fixes (alignments no longer displayed
when passing mismatch test but failing pad test; corrections
when mismatch score is precisely the maximum).
3. Repeat search option
4. Single stranded 'calculate consensus' now available
5. Check assemble additions
Known bugs still in this release
When using Find Internal Joins in xbap, the 'save contig' option
of the contig editor can cause problems for later joins found
within the same 'round'. This is noticed when adjusting the
position of the cutoff data. Quitting Find Internal Joins and
restarting solves the problem.
Changes in 1993.2 release 21/9/93
The assembly programs bap and xbap have several new functions:
1. Find single stranded regions and try to fill them with "hidden"
data from the adjacent readings.
2. Find single stranded regions (includes ends of contigs) and
select primers and templates for double stranding them (joining
3. Pre assembly screening for readings to find those that align
best. Optionally the hidden data can also be included in the
comparison (part of assembly function).
4. Find pairs of readings taken from opposite ends of the same
template (i.e. forward and reverse read pairs). List or plot their
5. A new function to check that readings have been assembled into
the correct positions. It aligns the hidden (previously termed
"unused") parts of readings with the consensus they overlap to see
how well they align. Poor alignments are reported.
6. During assembly each reading is now allowed to match up to 100 different
It might be guessed from the above that we are trying to improve
our ability to deal with the assembly of human data. Hence, also
the next addition.
A new experimental program (rep) for screening readings for Alu
sequences prior to assembly. The Alu containing segments are tagged
so they can be seen in the contig editor. A library of Alu
sequences is included in /tables/alus. The program is quite slow as
it compares each reading in both orientations with all of the Alu
sequences (126 of them) in order to find the best match. Only time
and more data will tell how sensitive it is, and whether the
current default score of 0.6 is "correct". BEWARE rep modifies the
original reading files to include the tag information. The only
information is in /doc/alu.help
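The search strategy described for rep (each reading compared in both
orientations against every Alu sequence, keeping the best match) can
be sketched as below. The scoring function here is a stand-in from the
Python standard library, not the score rep computes; the library
contents and function names are hypothetical, and only the 0.6 cutoff
is taken from the text above.

```python
from difflib import SequenceMatcher

_COMP = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a plain ACGT sequence."""
    return seq.translate(_COMP)[::-1]


def similarity(a, b):
    # Stand-in score in [0, 1]; rep's real scoring is not described here.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def best_alu_match(reading, alu_library, cutoff=0.6):
    """Return (name, orientation, score) of the best hit, or None below cutoff."""
    best = None
    for name, alu in alu_library.items():
        for orientation, seq in (("+", reading), ("-", revcomp(reading))):
            score = similarity(seq, alu)
            if best is None or score > best[2]:
                best = (name, orientation, score)
    if best is not None and best[2] >= cutoff:
        return best
    return None


# Made-up two-entry library; the real one holds 126 Alu sequences.
library = {"alu1": "GGCCGGGCGCGGTGGCTCA", "alu2": "TTTTTTTTTT"}
print(best_alu_match(revcomp("GGCCGGGCGCGGTGGCTCA"), library))
```

Because every reading is scored twice against every library entry, the
run time grows with the product of the number of readings and the
library size, which is consistent with the program being quite slow.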
A new program for extracting sets of sequences and their
annotations from the sequence libraries (lip). The only information
is in /help/lip.help
Changes to the xterm user interface. These routines have been
completely rewritten. One addition is that now ?? in response to a
question will allow the user to get help on any function in a
program. Help is also improved in the x version.
Changes in earlier releases
DAP, XDAP have been replaced by BAP and XBAP (see below)
A new function for examining repeats has been added to NIP
A new repeat search has been added to SIP
Some outputs have been changed to produce FASTA format files
instead of PIR.
MEP now allows searches for motifs in which any 8 out of a string
of 20 can be switched on.
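The idea of switching positions of a motif on and off can be
illustrated with a minimal sketch. It is not MEP's algorithm; the
mask notation ('1' for on, '0' for off) and the function names are
assumptions. A mask of length 20 containing eight '1' characters would
correspond to the case described above; a short mask is used here for
readability.

```python
from collections import Counter


def masked_word(window, mask):
    """Keep characters where mask is '1'; replace the rest with '-'."""
    return "".join(c if m == "1" else "-" for c, m in zip(window, mask))


def count_masked_words(seq, mask):
    """Count every masked word of len(mask) along seq."""
    span = len(mask)
    return Counter(
        masked_word(seq[i:i + span], mask)
        for i in range(len(seq) - span + 1)
    )


# Positions 1, 2 and 4 are on; position 3 is ignored.
counts = count_masked_words("ACGTACTT", "1101")
print(counts.most_common(1))
```

Windows that differ only at switched-off positions collapse to the
same masked word, so they are counted as occurrences of one motif.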
The manual has been updated.
Keyword and author searches on sequence libraries
All programs that use the libraries can now perform author
and keyword searches on all libraries (only nip did so before).
All graphics can now be saved to disk in postscript form by
use of a sub-option in "Redirect output".
BAP, XBAP replace DAP and XDAP. A program to convert DAP databases
to BAP databases (convert) is included. BAP databases can contain
up to 8000 readings and a consensus of 500,000 bases. A minor edit
and recompilation will allow up to 99,999 readings. The space is
used more efficiently now as the databases grow as the number of
readings increases. Reading names can be 16 characters in length.
1) Assembly is 4 times as fast as in DAP.
2) Find internal joins is 5 times as fast and now brings up the
join editor with the two contigs in the correct orientation and
3) The assembly routines align pads better, plus a new automatic
function can also be used to align them prior to editing.
4) The contig editor has been greatly speeded up and its
functionality has been enhanced.
5) A routine for selecting oligos for primer walking is included.
6) A new routine allows batches of readings to be removed from a
7) We have also included routines for making SCF files, for getting
the sequence from SCF files, and one for marking the poor quality
data in readings. See the manual.
Sequence library formats:
The standard sequence library indexing method is now that used on
the EMBL CD-ROM. The libraries (EMBL nucleotide and SWISSPROT
protein) can be left on the CD-ROM or copied to disk. We include in
the package programs for creating this type of index for EMBL
updates, PIR in codata format, NRL3D and GenBank. If the indexes
are created all programs can read all these libraries. Programs and
scripts for this task are contained in the directory indexseqlibs.
The keyword and author searches are particularly fast and the
keyword index is based on ALL text in the files - not just the
Feature table formats
The programs now use the new feature table format common to EMBL
and GenBank, but retain the old format for SWISSPROT which has not
For details of the above see file SequenceLibraries.
Pipl and Nipl now have the facility to find only the best scoring
match for each sequence. The prompt is "? report all matches", so
typing only return means all matches will be shown and typing n
means only the highest scoring will be reported. It is particularly
useful when employed to create alignments. The corresponding help
file has not been updated. Also to incorporate long unix file names
the pattern files no longer include the annotation "filename".
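The effect of answering the "report all matches" prompt can be
sketched as follows. The tuple layout and function name are
hypothetical; the sketch only shows the selection logic, not how Pipl
or Nipl score matches.

```python
def select_matches(matches, report_all=True):
    """Filter matches given as (sequence_name, position, score) tuples.

    report_all=True  -> every match is returned (the default answer).
    report_all=False -> only the highest scoring match per sequence.
    """
    matches = list(matches)
    if report_all:
        return matches
    best = {}
    for name, pos, score in matches:
        if name not in best or score > best[name][2]:
            best[name] = (name, pos, score)
    return list(best.values())


hits = [("seqA", 10, 5.0), ("seqA", 40, 7.5), ("seqB", 3, 2.0)]
print(select_matches(hits, report_all=False))
```

Keeping one match per sequence gives exactly one aligned segment per
sequence, which is what makes the option convenient for building
alignments.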
Option 38 in nip "translate and list" has been removed as the
more flexible routines of option 39 incorporate all its
functionality. Many options that relate to feature tables have been
modified but their help files are not yet up to date.
A program (vep) for automatic excising of vector (either sequencing
vector or cosmid vector) sequences from readings is now included in
Rodger Staden, James Bonfield