tacg
a program for the restriction enzyme analysis of DNA
Release 1.33
by Harry Mangalam, UC Irvine
(mangalam at uci.edu, 714 824 4824)
# This posting is to announce the availability of 'tacg', a command line tool
for the restriction enzyme analysis of DNA for unix-like operating systems.
Binaries currently exist for IRIX (5.3), SunOS (5.3), OSF/1 (V3.0/347), and
Linux (1.2.8); others will be made available as I find systems on which to
compile them, or as others contribute binaries.
# For the impatient, here's an example of how to use it:
tail +44 seq.file | tacg -n 6 -o 5 -F 2 -l ladder.map -w 90 >seq.file.map
Translation: skip the first 43 lines of seq.file (tail +44 starts output at
line 44) and pipe the resulting sequence to tacg, returning info on all 6+
cutters (-n 6) that generate 5' overlaps (-o 5), giving me the sorted
fragment sizes of those enzymes that match (-F 2) and a ladder map
(-l ladder.map), along with the default linear restriction map with
1-letter, 1-frame translation, and write the output 90 characters wide
(-w 90) to a file called seq.file.map.
# If you're interested in using it, you can get it via anonymous ftp at:
ftp://mamba.bio.uci.edu/pub/tacg
# The source code is freely available for instructional and nonprofit
purposes, although since it is presently in beta release, I would suggest
that anyone contemplating incorporating it wait for the next release while
more bugs are shaken out. Assuming it's used a fair bit, I'd like to have
a chance to change it based on responses, document it more extensively, and
neaten it up before general release.
# This is citation-ware. If you use it, please allow it to spit back about
100 bytes of data so I can analyze its use and spread. You can check the
source code (especially udping.xx.c) to see what it does, and if it still
makes you uneasy, you can disable it from the command line or by recompiling.
# The design criteria were:
1) Simplicity.
It requires only 3 files - the executable, the restriction enzyme database
file (rebase.data), and the codon usage file (codon.prefs). The 2 data
files are ascii text and can be edited and modified by the user, if
required.
It was designed along the same lines as other small unix utilities - a tool
that does a small set of things, does them reasonably well and can be
chained to other utilities or used in conjunction with them to extract the
information you need without too much fuss.
The output of this program uses only alphanumeric characters so that all of
its output can be viewed on a vanilla vt100-like terminal, although you can
do more useful things if you're using an X display. For instance, some of
the output can best be viewed using very small fonts or in multiple columns
on a page, generated by feeding the output to a postscript conversion
package (lptops, enscript, nenscript, genscript, etc).
2) High Portability
The program is written in vanilla ANSI C, with no arcane ifdefs. It
compiles with few complaints on SGI's IRIX (5.3) with cc, Sparcs running
SunOS (5.3) with cc and gcc, DEC Alphas running OSF/1 (V3.0) with cc, and
*especially* Linux (1.2.8) with gcc.
3) Speed and Capacity
The program uses a hashtable-lookup of the restriction enzyme recognition
sites (generated on the fly) so that only about half of the sequence is
checked any further than the initial hash. Depending on what kind of
output you request and the i/o of the machine (output is by far the most
time-consuming part of the program), the program processes:
Speed*         Hardware            OS            Compiler, flags
~14-150Kb/s    i486/66/ISA         Linux 1.2.8   gcc -O2
~16-80Kb/s     Sparc 4/?MHz        SunOS 5.3     gcc -O
~25-130Kb/s    early DEC Alpha     OSF/1         gcc -O2
~23-260Kb/s    R4000/100 Indigo2   IRIX 5.3      cc -O2 -mips2
~94-700Kb/s    R4400/200 Indigo2   IRIX 5.3      cc -O2 -mips2
It also uses dynamic memory allocation so that while there are a few
hard-coded limitations (in output format), it easily handles sequences into
the millions of bases.
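The hash-then-verify idea described above can be sketched roughly as follows. This is an illustrative Python sketch, not tacg's C code: the key length K, the function names, and the three-enzyme site table are assumptions made for the example, and only the forward strand is scanned (the sites used here are palindromic). Reported positions are 1-based starts of the recognition sequence, not cut positions.

```python
from itertools import product

# IUPAC degeneracy codes mapped to the bases they stand for.
IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "r": "ag", "y": "ct", "m": "ac", "k": "gt", "s": "cg", "w": "at",
    "b": "cgt", "d": "agt", "h": "act", "v": "acg", "n": "acgt",
}

K = 4  # hypothetical hash key length; tacg's real key size is not stated


def build_table(sites):
    """Map every concrete K-mer that can start a site to the enzymes using it.

    Assumes every recognition site is at least K bases long.
    """
    table = {}
    for name, site in sites.items():
        prefix = site.lower()[:K]
        for kmer in product(*(IUPAC[b] for b in prefix)):
            table.setdefault("".join(kmer), []).append(name)
    return table


def matches(site, window):
    """Full check of a (possibly degenerate) site against a sequence window."""
    return len(window) == len(site) and all(
        w in IUPAC[s] for s, w in zip(site.lower(), window)
    )


def scan(seq, sites):
    """Return sorted (position, enzyme) hits for the forward strand."""
    seq = seq.lower()
    table = build_table(sites)
    hits = []
    for i in range(len(seq) - K + 1):
        # Windows whose K-mer is not in the table are dropped right here.
        for name in table.get(seq[i:i + K], ()):
            site = sites[name]
            if matches(site, seq[i:i + len(site)]):
                hits.append((i + 1, name))
    return sorted(hits)


# Illustrative subset of recognition sequences (all palindromic).
sites = {"EcoRV": "gatatc", "MaeII": "acgt", "AvaII": "ggwcc"}
seq = "agcggtccggctgtcgcggatgaatatgaccagccaacgtccgatatcacgaaggataaa"
print(scan(seq, sites))  # [(4, 'AvaII'), (37, 'MaeII'), (43, 'EcoRV')]
```

The point of the table is that most windows fail the cheap dictionary lookup and never reach the base-by-base comparison.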
4) Usability
Inspired by Christian Marck's elegant DNA Strider, I used a similar output
format, changing a few things I didn't like, adding a few things I wanted.
The Feature Set:
a) produces linear restriction maps.
The map shows EXACT cutting position (not just the start of the recognition
sequence - minor nitpick with Strider and other programs), with same-page
translation (ditto) in 1/3/6 frames in 1- or 3-letter codes. Tested on
sequences of more than a million bases. For example:
============================================================================
MspI
HpaII
Sau96I
AvaII
RsrII BstUI FokI MaeII EcoRV
\ \ \ \ \ \
13981 agcggtccggctgtcgcggatgaatatgaccagccaacgtccgatatcacgaaggataaa 14040
tcgccaggccgacagcgcctacttatactggtcggttgcaggctatagtgcttcctattt
^ * ^ * ^ * ^ * ^ * ^ *
S G P A V A D E Y D Q P T S D I T K D K
============================================================================
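The complementary strand and the 1-letter, frame-1 translation in the map above can be reproduced with a short Python sketch. It assumes the standard genetic code; tacg itself reads codon.prefs, whose format is not shown here, and the function names are made up for the example.

```python
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard genetic code, built in t/c/a/g order for each codon position.
CODON = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

COMPLEMENT = str.maketrans("acgt", "tgca")


def complement(seq):
    """Base-by-base complement, written in the same left-to-right order."""
    return seq.lower().translate(COMPLEMENT)


def translate(seq, frame=1):
    """1-letter translation of the top strand starting at frame 1, 2 or 3."""
    seq = seq.lower()[frame - 1:]
    return "".join(
        CODON.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


top = "agcggtccggctgtcgcggatgaatatgaccagccaacgtccgatatcacgaaggataaa"
print(complement(top))
print(translate(top))  # SGPAVADEYDQPTSDITKDK
```

The translation matches the amino acid line printed under the sample map, and the complement matches its lower strand.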
b) filters enzymes inclusively by:
- magnitude of recognition sequence (tgca=4, tgyrca=5, tgcnnngca=6, etc)
- overlap of resulting ends (5', 3', blunt)
- minimum and maximum number of times they cut the sequence
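The "magnitude" examples above (tgca=4, tgyrca=5, tgcnnngca=6) are consistent with counting each unambiguous base as 1, each two-fold degenerate base as 1/2, and n as 0. A minimal Python sketch of that scoring follows; the log-based weighting, and in particular the value it gives to three-fold codes (b, d, h, v), is an assumption that merely agrees with the three examples and may not be what tacg does.

```python
import math

# Number of bases each IUPAC code can stand for.
DEGENERACY = {
    "a": 1, "c": 1, "g": 1, "t": 1,
    "r": 2, "y": 2, "m": 2, "k": 2, "s": 2, "w": 2,
    "b": 3, "d": 3, "h": 3, "v": 3,
    "n": 4,
}


def magnitude(site):
    """Effective site length: 1 per exact base, 0.5 per 2-fold code, 0 per n."""
    return sum(1 - math.log(DEGENERACY[b], 4) for b in site.lower())


for site in ("tgca", "tgyrca", "tgcnnngca"):
    print(site, round(magnitude(site), 2))
```

A filter such as -n 6 would then keep only enzymes whose magnitude is at least 6.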
b) handles linear/circular topologies, subsequences
c) produces Summaries of cuts:
============================================================================
Restriction Enzymes that DO NOT CUT in this sequence:
BbeI EheI FseI KasI NarI NheI NotI
PacI PaeR7I SalI SfiI SpeI SwaI XhoI
Total Number of Cuts per Restriction Enzyme:
AatII 5 BsiYI 130 EcoNI 5 MluI 7 SalI 0
AccI 5 BsmI 30 EcoO109I 2 MmeI 8 SapI 7
AflII 2 BsmAI 26 EcoRI 3 MnlI 184 SauI 1
AflIII 13 Bsp120I 1 EcoRII 49 MscI 17 Sau96I 61
AgeI 12 Bsp1286I 26 EcoRV 14 MseI 106 ScaI 4
AluI 89 BspEI 22 EheI 0 MspI 278 ScrFI 145
<etc>
============================================================================
- Tables of cutting sites, e.g.:
(for enzymes that pass the filtering options)
============================================================================
** Cut Sites by Restriction Enzyme **
AatII G_ACGT'C - 5 cut(s)
5110 9399 11248 14979 29041
AccI GT'mk_AC - 5 cut(s)
2192 15262 18836 19475 31303
AflII C'TTAA_G - 2 cut(s)
6541 12619
AflIII A'CryG_T - 13 cut(s)
459 629 5549 11282 15373 17792 18285 19997 20953 22221 24134
24169 26529
============================================================================
- Tables of fragment sizes (unsorted, sorted or both), e.g.:
============================================================================
** SORTED Fragment Sizes by Restriction Enzyme **
AatII G_ACGT'C - 5 Fragment(s)
1849 3449 3731 4289 5110 14062
AccI GT'mk_AC - 5 Fragment(s)
639 1187 2192 3574 11828 13070
AflII C'TTAA_G - 2 Fragment(s)
6078 6541 19871
AflIII A'CryG_T - 13 Fragment(s)
35 170 459 493 956 1268 1712 1913 2360 2419 4091
4920 5733 5961
============================================================================
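The fragment table follows directly from the cut-site table. Here is a minimal Python sketch of that arithmetic; the function name is made up for the example, and the sequence length of 32490 is not stated anywhere in the output but inferred from the two sample tables (it is the only length that makes the last fragment of each enzyme come out right for a linear sequence).

```python
def fragment_sizes(cuts, length, circular=False):
    """Sorted fragment sizes produced by cutting at the given positions."""
    cuts = sorted(cuts)
    if not cuts:
        return [length]  # uncut sequence stays in one piece
    # Fragments between consecutive cuts.
    sizes = [b - a for a, b in zip(cuts, cuts[1:])]
    head = cuts[0]            # start of sequence to first cut
    tail = length - cuts[-1]  # last cut to end of sequence
    if circular:
        sizes.append(head + tail)  # the two ends are one joined fragment
    else:
        sizes += [head, tail]
    return sorted(sizes)


SEQ_LENGTH = 32490  # inferred from the sample tables, not given in the output

aat2 = [5110, 9399, 11248, 14979, 29041]
print(fragment_sizes(aat2, SEQ_LENGTH))
# [1849, 3449, 3731, 4289, 5110, 14062]
```

For a linear sequence n cuts give n+1 fragments; with circular=True the first and last pieces merge, giving n fragments.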
- Ladder map, with 5', 3', and blunt cutters indicated (\, /, |)
============================================================================
Ladder Map of Restriction Enzyme Cut Sites:
10000 20000 30000 40000
: : : :
AccI ---\-----\----------------------------------------------\--------
AceIII ---------------------------\---------------\---------------\---\-
AciI ----\--\\2-\2--\--2-\\-2\--\\-\\-\\-\--\-\-\----2\--\--\-\-\-\---
AflII --\--\--\\\\-----------------\\\-----------\\------\-------------
: : : :
AflIII --------------\-----------\------------\------\\\2---------------
AhdI -------/---------------------------------------------------------
AluI |3355323833353|43284|44-|45|324|54252-3426|2|22|52543|42323522|22
AlwI -2\--\2---3----2-----------------\\-\22-2\--\\\\-\---2--\-\-----2
: : : :
============================================================================
- A summary map (a la Strider) of enzymes that cut 2 times or fewer
(although this may be changed to be length-sensitive)
============================================================================
Summary of Enzymes that cut ** 2 ** times or less:
XhoI at 5733      Pfl1108I at 4774   PvuI at 22499    PshAI at 29823
DrdI at 2598      NarI at 27578      BssSI at 4941    BsiEI at 22499
BssSI at 38186    AhdI at 5113       BsaHI at 27578   BsmBI at 37969
| ||| | | |
|---------------------------------------------------------------------------
: : : :
: 10000 20000 30000 40000 50000
============================================================================
- A pseudo gel format that shows how different digests would look if run
on a gel. Currently, it uses a straight log10() approximation, but a
suggestion was made to use an additional transformation to mimic
different percentages of agarose/polyacrylamide. It uses the same
representation as the ladder map, with single fragments represented as
'|', and multiple fragments that cannot be resolved as a digit showing how
many map to that space.
============================================================================
Pseudo-Gel Map of Digestions: *Maximum* Cuts: 50
100 1000
. . . . . . . . . .
AccI
AceIII
AciI | ||| | | | 2||| 2| 2 | |||
AflII || | | | |
. . . . . . . . . .
AflIII | || |
AhdI
AlwI 7 || | || || | | | 2 | 2
Alw26I |
============================================================================
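The straight log10() placement can be sketched as below. This is a rough Python illustration, not tacg's code: the lane width, the size range, and the choice to cap the digit at 9 are all assumptions, and small fragments are placed on the left to follow the 100 ... 1000 scale in the sample.

```python
import math


def gel_lane(sizes, width=60, min_size=10, max_size=100000):
    """Render one lane: ' ' for empty, '|' for one band, a digit for several."""
    low = math.log10(min_size)
    span = math.log10(max_size) - low
    counts = [0] * width
    for size in sizes:
        if not (min_size <= size <= max_size):
            continue  # off the gel
        frac = (math.log10(size) - low) / span
        col = min(int(frac * width), width - 1)
        counts[col] += 1
    return "".join(
        " " if n == 0 else "|" if n == 1 else str(min(n, 9)) for n in counts
    )


# AatII fragment sizes from the sorted fragment table earlier in the post.
print("AatII  " + gel_lane([1849, 3449, 3731, 4289, 5110, 14062]))
```

Fragments whose log-scaled positions fall in the same column collapse into a count, which is how unresolvable bands end up shown as digits.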
d) Other options:
- extract subsequences from the input sequence (and make
circular/linear)
- translations with linear restriction map in 1, 3, or 6 frames,
with 1 or 3 letter codes
- Choose which of several codon preferences to use
- 'Write/don't write' most of the options
- User-settable printing widths to ~200 characters
5) The Odd one:
It was also designed to track its own use and spread - something in which
I'm also interested. To that end, the binaries have been compiled with
code that spits a small amount of data back to me at each usage, telling me
the IP number of the hosting machine, the UID of the person using it, the cpu
type and OS it was run on, what flags were used in calling it, and the
number of bases processed. It does not return host or domain names, user
names, or actual sequence. The exact data that is returned is shown on
stderr (usually the screen) each time.
Cheers
Harry
--
Harry J Mangalam, Microbiology and Molecular Genetics, UC Irvine,
Irvine, CA, 92717, (714) 824-4824, fax (714) 824 8598
http://hornet.mmg.uci.edu/~hjm/hjm.html
Computational Biology..SGI..Woodworking..Bicycling..Linux..WWW