Binary codes are useful when the phenotypic data do not allow the
underlying genotype to be determined exactly; the phenotype can then be
coded as the presence (1) or absence (0) of factors such as the A and B
antigens.
Most disease locus data can be coded very effectively using
affected/unaffected status and appropriate liability classes. I hope the
explanation is sufficiently clear.
(and another answer from Jurg Ott)
Binary factor notation can represent loci with codominant and
dominant modes of inheritance (full penetrance), while 'allele
numbers' notation is suitable only for codominant loci. Few people use
binary factor notation; they either use allele numbers for
codominant loci or 'affection status' notation for dominant loci
(complete or incomplete penetrance). The main reason binary
factor notation is still used is probably that the CEPH database is in
that notation.
Jurg Ott
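As a rough illustration (not from either answer above, and not actual
LINKAGE code), the A and B antigens mentioned in the first answer can each
be treated as one binary factor. The following minimal Python sketch shows
the mapping from ABO phenotype to binary factors; all names in it are made
up for this example.

# Minimal sketch of binary factor coding for the ABO system: each phenotype
# is recorded as presence (1) or absence (0) of the A and B antigens, so the
# underlying genotype (e.g. AA vs AO) need not be known.
ABO_BINARY_FACTORS = {
    "A":  (1, 0),   # A antigen present, B absent (genotype AA or AO)
    "B":  (0, 1),   # B antigen present, A absent (genotype BB or BO)
    "AB": (1, 1),   # both antigens present
    "O":  (0, 0),   # neither antigen present
}

def binary_factors(phenotype):
    """Return the (A, B) presence/absence pair for an ABO phenotype."""
    return ABO_BINARY_FACTORS[phenotype]

if __name__ == "__main__":
    for p in ("A", "B", "AB", "O"):
        print(p, binary_factors(p))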
What is the effect of having allele frequencies not add up to 1, e.g.
when some alleles are not present in a pedigree under study? [Ellen
Wijsman;16may94]
The best approach is to specify n+1 alleles, where n is the number of
alleles actually observed in the pedigree. Use the correct allele
frequencies for those n alleles, and for the (n+1)th allele use 1 minus the
sum of the frequencies of the observed alleles.
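A minimal Python sketch of this bookkeeping (my own illustration, not from
Wijsman's answer; the function name is made up):

def pad_allele_frequencies(observed_freqs):
    """Given the population frequencies of the n alleles observed in a
    pedigree, return n+1 frequencies in which the extra, unobserved allele
    absorbs the remaining probability mass so the list sums to 1."""
    total = sum(observed_freqs)
    if not 0.0 < total <= 1.0:
        raise ValueError("observed frequencies must sum to a value in (0, 1]")
    return list(observed_freqs) + [1.0 - total]

# Example: three observed alleles with population frequencies 0.20, 0.15,
# and 0.05; the fourth, catch-all allele gets frequency 0.60.
print(pad_allele_frequencies([0.20, 0.15, 0.05]))   # [0.2, 0.15, 0.05, 0.6]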
I use LINKAGE and/or FASTLINK. What references should I cite in
my papers?
FASTLINK:
As described in the papers:
R. W. Cottingham Jr., R. M. Idury, and A. A. Schaffer, Faster Sequential
Genetic Linkage Computations, American Journal of Human Genetics, 53(1993),
pp. 252-263.
and
A. A. Schaffer, S. K. Gupta, K. Shriram, and R. W. Cottingham, Jr.,
Avoiding Recomputation in Genetic Linkage Analysis, Human Heredity,
to appear. [Note: this has since appeared, so get the correct reference
from the current linkage distribution. --rootd]
In addition, all fastlink users should cite the LINKAGE papers:
G. M. Lathrop, J.-M. Lalouel, C. Julier, and J. Ott, Strategies for
Multilocus Analysis in Humans, PNAS 81(1984), pp. 3443-3446.
G. M. Lathrop and J.-M. Lalouel, Easy Calculations of LOD Scores
and Genetic Risks on Small Computers, American Journal of Human Genetics,
36(1984), pp. 460-465.
G. M. Lathrop, J.-M. Lalouel, and R. L. White, Construction of Human
Genetic Linkage Maps: Likelihood Calculations for Multilocus Analysis,
Genetic Epidemiology 3(1986), pp. 39-52.
A discussion of recoding alleles in linkage data
From: wijsman at max.u.washington.edu
Newsgroups: bionet.molbio.gene-linkage
Subject: Re: Large Allele numbers
Date: 11 Jul 94 21:35:13 PDT
>> In my group we are scanning the human genome for genes responsible for a
>> complex disease. Not too far into the search, we have run into a few
>> markers which have 16 or more alleles. I have been able to modify the
>> LINKAGE programs (v 5.2) to allow up to 14 alleles, but past that, I get
>> compiling errors informing me that I am out of memory. Further
>> examination
>> tells me that the UNKNOWN program creates a matrix of the size:
>> (maxall)*(maxall+1)/2 X (maxall)*(maxall+1)/2
>> which is too big for DOS to handle.
>>
>> My question is, is there any way to get around this limitation by
>> splitting
>> up the pedigree set, or some other method?
>>
>
>Tim Magnus writes:
>
>Conservative renumbering will allow you to renumber each family down to
>4 alleles. The founding parents get 1 through 4. Each time a spouse
>marries in, the spouse gets the two alleles missing from their mate.
>(of course - if the alleles are the same size they are numbered the same
>so you will not use all 4 alleles in every mating).
>
This type of renumbering is only possible when the genotypes of the
founders are known, which is frequently not the case for complex diseases.
In fact, in human genetics, with the exception of marker mapping in
CEPH-type pedigrees, it is typical for some founder genotypes to be
missing. Thus simply renumbering alleles usually does not fix the
problem.
>Jonathon Haines writes:
>This is a recurring problem that has been vexing the genetic linkage
>community for many years. The basic problem is to preserve the genetic/
>segregation information while reducing the number of alleles to a range
>that allows easy computation. The method of recoding (recycling) alleles
>described by Ott (AJHG, 1978) works very well, but can only be done when
>the mode of inheritance of the disease is known (thus allowing the recoding
>of spouses).
It is usually possible to recode marker alleles to some extent even if the
mode of inheritance of the disease is not known, since what is wanted for
the marker is a labelling that preserves the available information about
the source of each marker allele. It is important, however, where the full
ancestry of alleles cannot be traced in a pedigree, that the recoded
alleles retain the allele frequencies appropriate to the original alleles.
>In a complex disorder, this may not be possible. If the marker
>in question has 14 alleles in the general population, but only 9 alleles
>in the study population, it is possible to reduce the functional number of
>alleles to 9 or 10. For the former, we usually adjust the allele
>frequencies to sum to 1 by dividing each allele frequency by the sum of
>the (observed) allele frequencies. For the latter, all the allele
>frequencies remain the same, but the unobserved ones are collapsed into
>a single allele (and frequency).
If there are 9 observed alleles (but we know there are 14 in the
population), then rescaling the frequencies of the observed 9 alleles will
not produce quite correct results either. Consider the unlikely example of
a huge pedigree, with only the most recent generation observed, in which
the observed 9 alleles all have very low and equal population frequency:
if distantly separated relatives are affected and share a marker allele,
there is reasonable support for linkage, since the alleles are rare. But
if we rescale the frequencies to 1/9 per allele, then sharing of alleles no
longer looks so unlikely. Coding the marker with 10 alleles produces
correct results, since it gives the same lod scores as coding the marker
with all 14 alleles.
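To make the contrast concrete, here is a small numeric sketch (my own
hypothetical numbers, not from the post): nine observed alleles that are
each rare in the population look common after rescaling, whereas keeping
the true frequencies and adding a tenth catch-all allele preserves their
rarity.

# Hypothetical marker: 9 observed alleles, each with population frequency 0.01.
observed = [0.01] * 9

# Rescaling to sum to 1 makes each allele look common (~0.111), so allele
# sharing among distant relatives no longer looks surprising.
rescaled = [f / sum(observed) for f in observed]

# Keeping the true frequencies and adding a 10th catch-all allele preserves
# the rarity of the observed alleles.
lumped = observed + [1.0 - sum(observed)]   # the 10th allele gets ~0.91

print("rescaled:", [round(f, 3) for f in rescaled])
print("lumped:  ", [round(f, 3) for f in lumped])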
As Jonathon noted, the multiple-allele problem is a big one in analysis.
It became one of our biggest bottlenecks, since we were analyzing families
individually in order to reduce the number of alleles in each analysis.
Our partial solution was the following.
We use LIPED instead of LINKAGE for general 2-point analyses, for a number
of reasons which I won't go into. We modified LIPED so that, provided the
marker is codominant and its alleles are labelled in a predetermined
sequence (which we enforce with a preprocessor program), the program can
reread the specific observed alleles and their frequencies for each family.
It then assumes one more allele per family to account for all the other
alleles at the locus. For genomic screening we don't do any downcoding
(although we do downcode by hand for multipoint analyses and for analyses
of multi-looped pedigrees, for which even 6 alleles is often too many).
These program modifications, which let us process all our families together
with only the observed number of alleles (plus one) per pedigree, had an
enormous effect on our throughput; most analyses now run relatively
quickly. It is unusual for us to find more than 6-7 alleles in any one
pedigree, which brings computation time (and memory requirements) down to
reasonable levels. Thus for 2-point analyses downcoding is usually not
necessary. I should note that we do our analyses on a workstation, but I
don't see any reason that the modifications we made should not work on a
PC, assuming the Fortran is compatible.
Ellen Wijsman
Div of Medical Genetics, RG-25
and Dept of Biostatistics
University of Washington
Seattle, WA 98195
wijsman at u.washington.edu
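The following Python sketch is a rough reconstruction of the kind of
per-family relabelling described in the post above; it is not Ellen
Wijsman's actual LIPED modification or preprocessor, and all names and
numbers are made up. Within each family the observed marker alleles are
renumbered 1..k in a fixed order, their original population frequencies are
kept, and one extra allele (k+1) stands in for everything not seen in that
family.

def downcode_family(genotypes, population_freqs):
    """Relabel marker alleles within one family.

    genotypes        -- list of (allele1, allele2) pairs, 0 meaning untyped
    population_freqs -- dict mapping original allele labels to population
                        frequencies

    Returns (recoded genotypes, frequency list): observed alleles are
    renumbered 1..k in sorted order, and an extra allele k+1 pools all
    unobserved alleles.
    """
    observed = sorted({a for g in genotypes for a in g if a != 0})
    relabel = {orig: new for new, orig in enumerate(observed, start=1)}

    recoded = [tuple(relabel.get(a, 0) for a in g) for g in genotypes]
    freqs = [population_freqs[a] for a in observed]
    freqs.append(1.0 - sum(freqs))   # frequency of the catch-all allele k+1
    return recoded, freqs

# Hypothetical family typed at a marker with many alleles in the population:
family = [(7, 12), (3, 7), (0, 0), (3, 12)]
pop = {3: 0.10, 7: 0.05, 12: 0.20}   # made-up population frequencies
print(downcode_family(family, pop))
# observed alleles 3, 7, 12 become 1, 2, 3; allele 4 pools the rest (~0.65)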
COMPUTER ADMINISTRATION AND OPTIMIZATION
How can I increase the speed of the linkage/fastlink package on my
workstation? [rootd;15may94] [aha, finally a question I can confidently
answer!]
1. Use fastlink (it will increase your speed by an order of
magnitude)
2. Set up tons of paging space (using the hard drive as virtual
memory) and use the "fast" versions of fastlink. 300 megs is
usually plenty. Note that paging space is the same as swap
space.
3. Use gcc (the GNU/Free Software Foundation C compiler) to
compile fastlink (gcc produces machine code that is about
10% faster than sun's C compiler).
4. Install the generic-small kernel instead of the generic kernel
(the generic kernel has device files for almost EVERYTHING.
The generic-small kernel is configured for a system without
many devices and without many users). Installing a
generic-small kernel is an option during system installation on
sun workstations.
5. Reconfigure your kernel so it has only devices which you need.
This is a task for an experienced system administrator. This
should give you a small improvement in overall system speed,
but if you are already running the generic-small kernel,
additional improvement may be so small that it's not worth the
trouble. If the generic-small kernel is insufficient for your
system (so you were forced to install the generic kernel) this
step is a MUST. The generic kernel will slow down your
workstation significantly, and most of the device-support is
unnecessary.
6. Don't run your linkage analyses in the background, because
running programs in the background gives them a lower
priority (on suns it reduces the priority level by 3 out of a total
range of 40). Either do the runs in the foreground (which is fine
as long as you don't plan to log out) or you can use the root
password to renice the pedin process by -3 to compensate
(negative nice values give a higher priority). If you need to log
out, you can use the screen command (distributed by GNU/free
software foundation) and "detach" a session so you can log out
without programs terminating. Later you can log back in and
"reattach" the session, which continued to run while you were
logged out. The screen command is available at prep.ai.mit.edu,
and is also on the O'Reilly Unix Power Tools CD-ROM.
According to the sun documentation, renicing below -10 can
interfere with the operating system and actually reduce the
process' speed. I just run them at a priority/nice level of 0 (the
standard default level). That gives me reasonable response with
my other applications, but still lets fastlink run at a decent
speed.
7. Run with 100% penetrance. Runs with 100% penetrance can be
faster than runs with incomplete penetrance. Of course, if you
have an unaffected obligate carrier, this won't work. In
addition, incomplete-penetrance runs may be necessary for
your research to be "good" (decisions like this are why the
professors make the big bucks :-)
8. Change the block size of your filesystem (from Gerard Tromp).
One can increase the performance of a filesystem by increasing
the block size, which decreases the number of read-write
operations: a block device such as a hard disk reads or writes a
whole block of data at a time, so if you expect to work with
large files, large blocks are an advantage. The trade-off is the
number of bytes lost to partially filled fragments, since you
usually have to increase the fragment size to something larger
than 1024, e.g. 2048. Each file, or the tail end of a file, then
occupies a whole fragment, so a file of 100 bytes still occupies
2048 bytes. In short: bigger blocks are faster, but bigger blocks
force bigger fragments, which means more lost space. Bigger
blocks also allow more cylinders per group. (A rough sketch of
the space lost to fragments appears just after the newfs defaults
below.)
Related:
see newfs(8) ("create a new file system") for details on the
default values for file systems:
inode -- one inode per 2048 bytes
block -- 8192 bytes/block
frag(ment) -- 1024 bytes/fragment
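As mentioned in item 8, here is a rough back-of-the-envelope sketch (my own
illustration, with made-up file sizes) of the space lost to larger
fragments: the tail of every file is rounded up to a whole fragment, so a
larger fragment size typically wastes more space on small files.

def wasted_bytes(file_sizes, fragment_size):
    """Bytes lost to rounding each file's tail up to a whole fragment."""
    waste = 0
    for size in file_sizes:
        remainder = size % fragment_size
        if remainder:          # e.g. a 100-byte file still occupies one fragment
            waste += fragment_size - remainder
    return waste

# Made-up mix of small and large files (sizes in bytes):
files = [100, 700, 1500, 3000, 50000, 250000]
for frag in (1024, 2048):
    print("fragment size", frag, "wastes", wasted_bytes(files, frag), "bytes")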
Gerard Tromp notes that you can increase the speed of programs which
create/access large files in the /tmp directory by creating a tmpfs
filesystem. The stuff is complicated and I haven't fully
assimilated/understood his email yet, so I'm not including it yet. I'll be
happy to send any interested parties a forward of Gerard Tromp's
email. I hope to have tmpfs information in the next edition of the FAQ.
Of course, buying more RAM will increase your speed. I've heard that
increasing RAM from 16 to 32 megs results in a large increase in
speed, increasing from 32 to 64 megs results in a significant
increase, and increasing beyond 64 megs is not particularly helpful. Note
that this data is anecdotal (I haven't seen it myself), but it
makes intuitive sense to me. If someone sends me some SIMMs for our
sparcII, I'll be glad to test it out :-) A professor has offered to let me
run a fastlink benchmark on his sparc10 with 128 megs of RAM. I'll post
results as soon as they come in. Note: I run on a sun sparcII. I'd like to
hear data from people on other platforms, especially
data on the speed-RAM relationship.
I set up 300 megs of paging space on my workstation, but now I'm
running out of hard-drive space. Is there any way I can use my hard
drive space more efficiently? [rootd;29may94]
Paging space is hard-drive space which is used as virtual RAM. Unix
boxes use paging space constantly, swapping processes between the
hard drive and RAM. Remember that "paging space" is
the same as "swap space". There are two types of paging space on sun
systems (and many other types of Unix systems as well): paging files,
and paging partitions. Paging files are actual files (you can do an ls and
find them in a directory somewhere) in the filesystem. Paging
partitions are separate disk partitions, and as such are not in the
filesystem.
A filesystem has two types of overhead. Consider the following output:
bigbox% df
Filesystem kbytes used avail capacity Mounted on
/dev/sd0a 7735 5471 1491 79% /
/dev/sd0g 151399 127193 9067 93% /usr
/dev/sd3a 306418 266644 9133 97% /usr2
bigbox% df -i
Filesystem iused ifree %iused Mounted on
/dev/sd0a 951 3913 20% /
/dev/sd0g 10218 66390 13% /usr
/dev/sd3a 6278 150394 4% /usr2
The top df command shows the space available on "bigbox" in kbytes. Note
that, although sd3a has 306 megs, of which 267 megs are used, only 9
megs are available. This is because the filesystem keeps a 10% "rainy
day fund", so 10% of the filesystem is unusable. Although you can
reduce this percentage (with the root password and an arcane
command), it is not recommended: according to sun's documentation,
when a filesystem gets more than 90% full its speed
begins to drop rapidly. When you have a 100 meg
paging file, there is a corresponding 10 megs of rainy-day fund
which you cannot access, so setting up a 100 meg paging file requires
110 megs of disk space. But when you use a separate partition as a
paging partition, no 10% rainy-day fund is necessary: 100 megs of raw
disk space gives you 100 megs of virtual RAM.
The bottom df command shows the number of inodes available in each
filesystem. An inode points to a file, and is a part of the filesystem that
you rarely need to look at. By default, when you create a filesystem in
a partition, one inode is created for every 2k in the partition. The 306
meg partition has about 156,000 inodes, but only 4% of them are used. I
don't know how large an inode is (a quick search through my documentation
failed to find it), but I would guess that an inode is 256 bytes. If that's
true, the 150,000 unused inodes above are wasting roughly 37 megs of
disk space. One inode for every 2k is too many. When you create a 100
meg paging file, you use only 1 inode, but that 100 megs of filesystem
has a corresponding 50,000 inodes! If you create a paging partition,
you are not using a filesystem, so no inodes are necessary. In addition,
when you create a filesystem, you can reduce the number of inodes to
something more reasonable (like one inode for every 10k of disk
space). I generally don't mess with the inode count on my / and /usr
partitions, since they contain the operating system. Make certain not to
reduce the default inode number too much: YOU DON'T WANT TO
RUN OUT OF INODES. We converted our 350 megs of paging files
to a paging partition, and got another 70 megs of free disk space as a
result (20%)!
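A short worked calculation (my own sketch, using the df numbers quoted
above and the FAQ's guess of 256 bytes per inode) of how much of a
partition the 10% reserve and the default one-inode-per-2k allocation can
eat when the space is really destined for a single large paging file:

# Rough overhead estimate for a filesystem used mainly to hold a paging file.
# The 256-byte inode size is the FAQ's guess above, not a documented value.
partition_kb = 306418      # size of /usr2 from the df output above
free_inodes  = 150394      # unused inodes from the df -i output above
inode_bytes  = 256         # assumed inode size

reserve_kb     = 0.10 * partition_kb              # the 10% "rainy day fund"
inode_waste_kb = free_inodes * inode_bytes / 1024.0

print(f"10% reserve:        ~{reserve_kb / 1024:.1f} megs")
print(f"unused inode space: ~{inode_waste_kb / 1024:.1f} megs")
print(f"total overhead:     ~{(reserve_kb + inode_waste_kb) / 1024:.1f} megs")
# Together these come to roughly 65-70 megs, broadly in line with the ~70
# megs recovered above by converting paging files to a paging partition.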
But I don't know how to do all this optimization, and my research
assistant is spending all his/her time trying to figure it out.
[rootd;21may94]
Unix system administration is a complex task which requires
experience. An experienced sysadmin can do in minutes what it would
take you hours (or days) to accomplish. In addition, an experienced
sysadmin won't make stupid mistakes very often (let's see, while I was
learning on-the-job I ruined our backup tape during an upgrade
{luckily the upgrade was successful!}, moved a directory inside itself
as root, botched email service a couple times, and spent tons of time
figuring out how to accomplish simple tasks).
Most universities have small budgets for their system administrators.
Many head sysadmins have recruited students to assist them. Basically
the students slave away for nothing, learn tons of stuff, barely pass
their classes, become unix gods, and get hired for 40k+/year if/when
they graduate/flunk out. If your university has a sysadmin group like
this, you can probably "hire" them to support your machine for about
$6/hour at about 4 hours/week*machine. The head-sysadmin will be
happy to give some money to their more-experienced volunteers, the
volunteers get another line on their resume+additional experience, and
you get experienced sysadmins to run your machine. In addition, most
sysadmin groups have an automated nightly backup. Just think: your
machine gets backed up EVERY NIGHT AUTOMATICALLY!
At Portland State University the Electrical Engineering sysadmin
group has been hired to maintain the unix machines of four other
departments, at an average price of $15/week*machine (no additional
price for xterms!) The quality of the service is excellent (especially
since the most experienced volunteers are usually the ones given the
money), there is no annual training-gap as people leave (since the
experienced volunteers are constantly training the new ones) and you
have the entire resources and experience of the sysadmin group to help
you.
Of course, test them by deleting an unimportant file and seeing if they
can restore it from backups (the backup test is the most important in
system administration--have you tested your backups lately?). If they
successfully restore the file from backups, give them the
sun-optimization list (above two questions) and watch as the most
experienced volunteer turns the optimization into a recruit-training
session :-) They may even have a contest to see how small they can
make your kernel-configuration file!
If your location doesn't have such a group, perhaps another university
in town has one.
How can I identify how much paging space is available on my
workstation? [gerard tromp; 29apr94]
Paging space, also referred to as swap space, and its current usage can be
displayed with:
pstat -s (Non-root users need to use: /usr/etc/pstat -s)
e.g.
> sanger 1% /usr/etc/pstat -s
> 11456k allocated + 3108k reserved = 14564k used, 252744k available
> sanger 2%
Swap space can be spread over several disk partitions, that is, over
several partitions on the same disk or over a partition on each of several
disks.
e.g.
> sanger 2% cat /etc/fstab
> /dev/sd0a / 4.2 rw 1 1
> /dev/sd0e /usr 4.2 rw 1 2
> .
> ... several other partitions removed from listing
> .
> /dev/sd1b swap swap rw 0 0
> /dev/sd2b swap swap rw 0 0
> swap /tmp tmp rw 0 0
> sanger 3%
FILE FORMATS
How do I convert between crimap and linkage formats?
[rootd;29may94]
The crimap utilities package contains genlink and linkgen, which
convert between .gen files and linkage files. I am attempting to find an
ftp site; if you know of one, let me know. I already have the source. If I
could find the authors and have them authorize it, I'd be happy to put the
entire crimap-utilities package on one of my ftp sites.
How do I get my ceph data into crimap format? [rootd;29may94]
You can output the file in linkage format, and use link2gen (if you have
it, see F2). The disadvantage here is that your marker names are
separated from your data, and it's easy to make a mistake and get them
mixed up. You can instead output the file in ped.out format and use
mkcrigen. mkcrigen is a great program which automatically carries the
marker names along with the data (eliminating one source of error).
Unfortunately, I only have an executable with a hardcoded 80-marker
maximum. Nobody can find the source code.
lnktocri is very similar to link2gen, and is included in the multimap tar
file.
John Attwood has a ceph2cri program, which reads your ped.out file
and outputs a .gen file. It is available via anonymous ftp from
ftp.gene.ucl.ac.uk in /pub/packages/linkage_utils. It runs on DOS
machines. According to John Attwood: "Making the Unix-based
system available is much more complex, as it involves many scripts,
Makefiles and executables, but I'll try to do it when I have time." If
you need the unix version, send me email and I'll forward a summary
to John Attwood. That way he won't waste time putting together a unix
version unless there is definite interest.
Educational resources for teaching genetics
Genetics Construction Kit--fly genetics simulator [meir;10Aug94]
There is an excellent program called Genetics Construction Kit that
models fruit fly genetics - lots of features, and a pretty good interface.
It comes on a CD with a bunch of other really good biology education
software from a consortium called BioQuest ($89 for the CD, and it's
really worth it - only Mac stuff though). Look around on bulletin
boards for the Intro to BioQuest HyperCard stack, which gives their
philosophy and a description of the programs they have.
Michael Bacon says:
Well, recently out of a genetics class, I can recommend a program
called "Catlab." The idea is that you breed lots and lots of cats, and try
to figure out what genes control the cat's coat and tail.
gen5ajt says:
We use Populus 3.x for DOS (a Windows version is due out soonish); this is
an excellent population genetics package, and I couldn't recommend it
highly enough. It's free and downloadable by ftp from somewhere.
FAQ keeper: Darrell Root
rootd at ee.pdx.edu
or
rootd at ohsu.edu
HTML by Tim Trautmann
timt at ee.pdx.edu