
bionet.molbio.gene-linkage FREQUENTLY ASKED QUESTIONS (part 3 of 3)

Conan the Librarian rootd at ee.pdx.edu
Sun Nov 20 03:52:58 EST 1994


   Binary codes are useful when the phenotypic data do not allow one to
   determine the underlying genotype exactly; the phenotype can then be
   coded as the presence (1) or absence (0) of factors such as the A and
   B antigens. 
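
   As an illustration, here is a minimal Python sketch (mine, not from
   the FAQ) of binary-factor coding for the ABO system, where type A
   individuals may be genotype A/A or A/O but both produce the same
   phenotype:

   # Binary-factor coding for ABO: each phenotype is coded as the
   # presence (1) or absence (0) of the A and B antigens, since the
   # underlying genotype (e.g. A/A vs A/O) cannot be recovered from
   # the phenotype alone.
   def abo_binary_code(phenotype):
       codes = {"A": (1, 0), "B": (0, 1), "AB": (1, 1), "O": (0, 0)}
       return codes[phenotype]

   for ptype in ("A", "B", "AB", "O"):
       print(ptype, abo_binary_code(ptype))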

   Most disease locus data can be coded very effectively using
   affected/unaffected and appropriate liability classes. Hope the
   explanation is sufficiently clear. 

   (and another answer from Jurg Ott) 

   Binary factor notation allows representing loci with both codominant
   and dominant (fully penetrant) modes of inheritance, while "allele
   numbers" notation is good only for codominant loci. Few people use
   binary factor notation; they either use allele numbers for
   codominant loci or "affection status" notation for dominant loci
   (complete or incomplete penetrance). The main reason binary factor
   notation is used at all is probably that CEPH's database is in that
   notation. 

   Jurg Ott 

   What is the effect of having allele frequencies not add up to 1, e.g.
   when some alleles are not present in a pedigree under study? [Ellen
   Wijsman;16may94] 

   The best approach is to specify n+1 alleles, where n is the number of
   alleles actually observed in the pedigree. Use the correct allele
   frequencies for the n observed alleles, and for the (n+1)th allele use
   1 minus the sum of the frequencies of the observed alleles. 
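
   As a hypothetical worked example (the frequencies are invented), in
   Python:

   # The n+1 allele approach: n = 3 alleles observed in the pedigree,
   # with known population frequencies.  A 4th "lumped" allele absorbs
   # the remaining probability mass so the frequencies sum to 1.
   observed = [0.20, 0.15, 0.05]     # frequencies of the observed alleles
   lumped = 1.0 - sum(observed)      # frequency of the (n+1)th allele
   freqs = observed + [lumped]       # [0.20, 0.15, 0.05, 0.60]
   assert abs(sum(freqs) - 1.0) < 1e-9
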
   I use LINKAGE and/or FASTLINK. What references should I cite in
   my papers? 

   FASTLINK:
   As described in the papers:

   R. W. Cottingham Jr., R. M. Idury, and A. A. Schaffer, Faster Sequential 
   Genetic Linkage Computations, American Journal of Human Genetics, 53(1993),
   pp. 252-263.

   and

   A. A. Schaffer, S. K. Gupta, K. Shriram, and R. W. Cottingham, Jr.,
   Avoiding Recomputation in Genetic Linkage Analysis, Human Heredity,
   to appear. [NOTE, this has appeared, so get the correct reference from
   the current linkage distribution--rootd]

   In addition, all fastlink users should cite the LINKAGE papers:

   G. M. Lathrop, J.-M. Lalouel, C. Julier, and J. Ott, Strategies for
   Multilocus Analysis in Humans, PNAS 81(1984), pp. 3443-3446.

   G. M. Lathrop and J.-M. Lalouel, Easy Calculations of LOD Scores
   and Genetic Risks on Small Computers, American Journal of Human Genetics,
   36(1984), pp. 460-465.

   G. M. Lathrop, J.-M. Lalouel, and R. L. White, Construction of Human
   Genetic Linkage Maps: Likelihood Calculations for Multilocus Analysis,
   Genetic Epidemiology 3(1986), pp. 39-52.

   A discussion of recoding alleles in linkage data 

   From: wijsman at max.u.washington.edu
   Newsgroups: bionet.molbio.gene-linkage
   Subject: Re: Large Allele numbers
   Date: 11 Jul 94 21:35:13 PDT

   >> In my group we are scanning the human genome for genes responsible
   >> for a complex disease.  Not too far into the search, we have run
   >> into a few markers which have 16 or more alleles.  I have been able
   >> to modify the LINKAGE programs (v 5.2) to allow up to 14 alleles,
   >> but past that, I get compiling errors informing me that I am out of
   >> memory.  Further examination tells me that the UNKNOWN program
   >> creates a matrix of the size:
   >>   (maxall)*(maxall+1)/2  X  (maxall)*(maxall+1)/2
   >> which is too big for DOS to handle.
   >>
   >> My question is, is there any way to get around this limitation by
   >> splitting up the pedigree set, or some other method?
   >>
   >
   >Tim Magnus writes:
   > 
   >Conservative renumbering will allow you to renumber each family down to
   >4 alleles.  The founding parents get 1 through 4.  Each time a spouse 
   >marries in, the spouse gets the two alleles missing from their mate.
   >(of course - if the alleles are the same size they are numbered the same
   >so you will not use all 4 alleles in every mating).
   >

   This type of renumbering is only possible when the genotypes of the
   founders are known, which is frequently not the case for complex
   diseases.  In fact, in human genetics, with the exception of marker
   mapping in CEPH-type pedigrees, it is typical that some founder
   genotypes are missing.  Thus the simple answer of renumbering alleles
   usually does not fix the problem.
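
   For concreteness, a minimal Python sketch (assumptions mine; this is
   not code from the thread) of the relabelling step when all founder
   genotypes ARE known:

   # Relabel each distinct allele seen in the founders with the smallest
   # unused integer; non-founders then inherit the relabelled codes.
   def renumber_family(founder_genotypes):
       relabel = {}
       for a1, a2 in founder_genotypes:
           for allele in (a1, a2):
               if allele not in relabel:
                   relabel[allele] = len(relabel) + 1
       return relabel

   # A founding couple typed as 97/103 and 97/111 (repeat sizes) needs
   # only 3 allele codes: {97: 1, 103: 2, 111: 3}
   print(renumber_family([(97, 103), (97, 111)]))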

   >Jonathon Haines writes:

   >This is a recurring problem that has been vexing the genetic linkage
   >community for many years.  The basic problem is to preserve the genetic/
   >segregation information while reducing the number of alleles to a range
   >that allows easy computation.  The method of recoding (recycling) alleles
   >described by Ott (AJHG, 1978) works very well, but can only be done when
   >the mode of inheritance of the disease is known (thus allowing the recoding
   >of spouses).

   It is usually possible to recode marker alleles to some extent even if the
   mode of inheritance of the disease is not known since what is still desired
   with respect to the marker is a labelling which preserves the available
   information about the source of each marker allele.  It is important,
   however, where the full ancestry of alleles cannot be traced in a pedigree,
   that the recoded alleles maintain the allele frequencies appropriate to the
   original alleles.

   >In a complex disorder, this may not be possible.  If the marker
   >in question has 14 alleles in the general population, but only 9 alleles
   >in the study population, it is possible to reduce the functional number of
   >alleles to 9 or 10.  For the former, we usually adjust the allele
   >frequencies to sum to 1 by dividing each allele frequency by the sum of
   >the (observed) allele frequencies.  For the latter, all the allele
   >frequencies remain the same, but the unobserved ones are collapsed into
   >a single allele (and frequency).

   If there are 9 observed alleles (but we know there are 14 in the
   population), then rescaling the frequencies of the observed 9 alleles will
   also not produce quite correct results.  Consider the unlikely example of a
   huge pedigree with only the most recent generation observed in which the
   observed 9 alleles all have very low and equal frequency; if there are
   distantly separated relatives who are affected, there is some reasonable
   support for linkage, since the alleles are rare.  But if we rescale the
   frequencies to 1/9 per allele, then sharing of alleles isn't so unlikely. 
   Coding the marker with 10 alleles produces correct results, as it will
   produce the same lod scores as coding the marker with 14 alleles. 
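
   A small numeric illustration (hypothetical frequencies) of why
   rescaling misleads here:

   # 9 alleles observed, each with true population frequency 0.01; the
   # 5 unobserved alleles carry the remaining 0.91.
   true_freq = 0.01
   rescaled = true_freq / (9 * true_freq)   # rescaling gives 1/9 each

   # Probability that two unrelated founders carry the same given allele:
   print("true     :", true_freq ** 2)   # 0.0001 -> sharing is strong evidence
   print("rescaled :", rescaled ** 2)    # 0.0123 -> sharing looks unremarkable

   # The 10-allele coding keeps each observed allele at 0.01 and lumps
   # the unobserved alleles into one allele of frequency 0.91.
   freqs_10 = [true_freq] * 9 + [1 - 9 * true_freq]
   assert abs(sum(freqs_10) - 1.0) < 1e-9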

   As Jonathon noted, the multiple-allele problem is a big problem in
   analysis.  The multiple allele problem became one of our biggest
   bottlenecks since we were analyzing families individually to reduce the
   number of alleles in the analysis.  Our partial solution was the following. 
   We use LIPED instead of LINKAGE for general 2-point analyses for a number
   of reasons which I won't go into.   We modified LIPED so that if we assume
   a codominant marker and that alleles are labelled in a predetermined
   sequence (which we force through a preprocessor program), we can reread the
   specific observed alleles and their frequencies for each family.  The
   program then assumes one more allele per family to account for all the
   other alleles at the locus.  For genomic screening we don't do any
   downcoding (although we do downcode by hand for multipoint analyses and
   analyses with multi-looped pedigrees for which even 6 alleles is often too
   many).  But these program modifications, which allow us to process all
   our families together with only the observed number of alleles (plus
   one) per pedigree, had an enormous effect on our ability to push most
   analyses through relatively quickly.  It is relatively unusual that we
   find more than 6-7 alleles in any one pedigree, which brings
   computation time (and memory requirements) down to reasonable levels. 
   Thus for 2-point analyses downcoding is usually not necessary.  I
   should note that we do our analyses on a workstation, but I don't see
   any reason the modifications we made should not work on a PC, assuming
   the Fortran is compatible.
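
   A rough Python sketch of that per-family preprocessing step (my
   reconstruction under stated assumptions; the actual LIPED modification
   is not reproduced here):

   # For each family, keep only the alleles actually observed, relabel
   # them 1..n in a fixed order, and add an (n+1)th allele absorbing the
   # frequencies of everything else at the locus.
   def per_family_coding(family_alleles, population_freqs):
       observed = sorted(family_alleles)
       relabel = {allele: i + 1 for i, allele in enumerate(observed)}
       freqs = [population_freqs[a] for a in observed]
       freqs.append(1.0 - sum(freqs))   # the "one more allele per family"
       return relabel, freqs

   pop = {a: 1 / 14.0 for a in "ABCDEFGHIJKLMN"}   # 14 equifrequent alleles
   print(per_family_coding({"B", "E", "K"}, pop))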

   Ellen Wijsman
   Div of Medical Genetics, RG-25
   and Dept of Biostatistics
   University of Washington
   Seattle, WA   98195
   wijsman at u.washington.edu

   COMPUTER ADMINISTRATION AND OPTIMIZATION

   How can I increase the speed of the linkage/fastlink package on my
   workstation? [rootd;15may94] [aha, finally a question I can confidently
   answer!] 

    1. Use fastlink (it will increase your speed by an order of
      magnitude) 
    2. Set up tons of paging space (using the hard drive as virtual
      memory) and use the "fast" versions of fastlink. 300 megs is
      usually plenty. Note that paging space is the same as swap
      space. 
    3. Use gcc (the GNU/free software foundation C compiler) to
      compile fastlink (gcc produces machine code that is about
      10% faster than sun's C compiler). 
    4. Install the generic-small kernel instead of the generic kernel
      (the generic kernel has device files for almost EVERYTHING.
      The generic-small kernel is configured for a system without
      many devices and without many users). Installing a
      generic-small kernel is an option during system installation on
      sun workstations. 
    5. Reconfigure your kernel so it has only devices which you need.
      This is a task for an experienced system administrator. This
      should give you a small improvement in overall system speed,
      but if you are already running the generic-small kernel,
      additional improvement may be so small that it's not worth the
      trouble. If the generic-small kernel is insufficient for your
      system (so you were forced to install the generic kernel), this
      step is a MUST. The generic kernel will slow down your
      workstation significantly, and most of the device-support is
      unnecessary. 
    6. Don't run your linkage analyses in the background, because
      running programs in the background gives them a lower
      priority (on suns it reduces the priority level by 3 out of a total
      range of 40). Either do the runs in the foreground (which is fine
      as long as you don't plan to log out) or you can use the root
      password to renice the pedin process by -3 to compensate
      (negative nice values give a higher priority). If you need to log
      out, you can use the screen command (distributed by GNU/free
      software foundation) and "detach" a session so you can log out
      without programs terminating. Later you can log back in and
      "reattach" the session, which continued to run while you were
      logged out. The screen command is available at prep.ai.mit.edu,
      and is also on the O'Reilly Unix Power Tools CD-ROM.
      According to the sun documentation, renicing below -10 can
      interfere with the operating system and actually reduce the
      process' speed. I just run them at a priority/nice level of 0 (the
      standard default level). That gives me reasonable response with
      my other applications, but still lets fastlink run at a decent
      speed. 
    7. Run with 100% penetrance. Runs with 100% penetrance can run
      faster than runs with incomplete penetrance. Of course, if you
      have an unaffected obligate carrier, this won't work. In
      addition, incomplete-penetrance runs may be necessary for
      your research to be "good" (decisions like this are why the
      professors make the big bucks :-) 
    8. Change the block size of your filesystem (from Gerard Tromp).
      One can increase the performance of a filesystem by increasing
      the block size -- this decreases the number of read-write
      operations, since a block device such as a hard disk reads or
      writes a whole block at a time. Thus if you expect to use large
      files, having large blocks will be an advantage. The trade-off
      is bytes lost to partially-filled fragments, since one has to
      increase the fragment size to a number larger than 1024, e.g.
      2048. That is, each file (or tail of a file) occupies a whole
      number of fragments, so with 2048-byte fragments a file of 100
      bytes will still occupy 2048 bytes. In short: bigger blocks mean
      faster access, but bigger blocks force bigger fragments, and
      bigger fragments mean more lost space. Bigger blocks also allow
      more cylinders per group. (See the sketch after the newfs
      defaults below.) 

   Related:
   see: newfs (8)          - create a new file system
   for details on default values for file systems:
   inode           --      2048 bytes/inode
   block           --      8192 bytes/block
   frag(ment)      --      1024 bytes/fragment
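
   A back-of-the-envelope Python sketch of the block-size trade-off from
   item 8 (the file sizes are hypothetical):

   import math

   # Bytes a file actually occupies when space is allocated in whole
   # fragments: the last, partially-filled fragment is the wasted part.
   def disk_usage(file_size, frag_size):
       return math.ceil(file_size / frag_size) * frag_size

   for frag in (1024, 2048, 4096):
       used = disk_usage(100, frag)
       print("frag %4d: a 100-byte file occupies %4d bytes (%4d wasted)"
             % (frag, used, used - 100))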

   Gerard Tromp notes that you can increase the speed of programs which
   create/access large files in the /tmp directory by creating a tmpfs
   filesystem. The material is complicated and I haven't fully
   assimilated his email yet, so I'm not including it here. I'll be
   happy to forward Gerard Tromp's email to any interested parties. I
   hope to have tmpfs information in the next edition of the FAQ. 

   Of course, buying more RAM will increase your speed. I've heard that
   increasing RAM from 16 to 32 megs will result in a large increase in
   speed, and increasing from 32 to 64 megs in a significant one.
   Increasing beyond 64 megs is not particularly helpful. Note that this
   data is anecdotal (I haven't seen it myself), but it makes intuitive
   sense to me. If someone sends me some SIMMS for our sparcII, I'll be
   glad to test it out :-) A professor has offered to let me run a
   fastlink benchmark on his sparc10 with 128 megs RAM. I'll post results
   as soon as they come in. Note: I run on a sun sparcII. I'd like to
   hear data from people on other platforms, especially about the
   speed-RAM relationship. 

   I set up 300 megs of paging space on my workstation, but now I'm
   running out of hard-drive space. Is there any way I can use my hard
   drive space more efficiently? [rootd;29may94] 

   Paging space is hard-drive space which is used as virtual RAM. Unix
   boxes use paging space constantly, swapping processes between the
   hard drive and RAM. Remember that "paging space" is the same as
   "swap space". There are two types of paging space on sun systems
   (and many other types of Unix systems as well): paging files and
   paging partitions. Paging files are actual files in the filesystem
   (you can do an ls and find them in a directory somewhere). Paging
   partitions are separate disk partitions, and as such are not in the
   filesystem. 

   A filesystem has two types of overhead. Consider the following output:

   bigbox% df
   Filesystem            kbytes    used   avail capacity  Mounted on
   /dev/sd0a               7735    5471    1491    79%    /
   /dev/sd0g             151399  127193    9067    93%    /usr
   /dev/sd3a             306418  266644    9133    97%    /usr2
   bigbox% df -i
   Filesystem             iused   ifree  %iused  Mounted on
   /dev/sd0a                951    3913    20%   /
   /dev/sd0g              10218   66390    13%   /usr
   /dev/sd3a               6278  150394     4%   /usr2

   The top df command shows the space available on "bigbox" in kbytes.
   Note that, although sd3a has 306 megs, of which 267 megs are used,
   only 9 megs are available. This is because the filesystem reserves a
   10% "rainy-day fund", so 10% of the filesystem is unusable. Although
   you can reduce this percentage (with the root password and an arcane
   command), it is not recommended: according to sun's documentation,
   when the filesystem gets more than 90% full its speed begins to drop
   rapidly. When you have a 100 meg paging file, there is a corresponding
   10 megs of rainy-day fund which you cannot access, so setting up a 100
   meg paging file requires 110 megs of disk space. But when you use a
   separate partition as a paging partition, no 10% rainy-day fund is
   necessary: 100 megs of raw disk space will give you 100 megs of
   virtual RAM. 
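
   The arithmetic, as a quick Python sanity check (using the FAQ's rule
   of thumb of a flat 10% reserve):

   reserve = 0.10
   paging_file_megs = 100

   # A paging FILE lives in a filesystem, so its 10% rainy-day fund is
   # locked up along with it; a paging PARTITION uses raw disk directly.
   file_cost = paging_file_megs * (1 + reserve)   # ~110 megs of disk
   partition_cost = paging_file_megs              # 100 megs of disk
   print(file_cost, partition_cost)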

   The bottom df command shows the number of inodes available in each
   filesystem. An inode points to a file, and is a part of the filesystem
   that you rarely need to look at. By default, when you create a
   filesystem in a partition, one inode is created for every 2k in the
   partition. The 306 meg partition has about 156,000 inodes, but only 4%
   of them are used. I don't know how large an inode is (a quick search
   through my documentation failed to find it), but I would guess that an
   inode is 256 bytes. If that's true, the 150,000 unused inodes above
   are wasting 37.5 megs of disk space. One inode for every 2k is too
   many. When you create a 100 meg paging file, you use only 1 inode, but
   that 100 megs of filesystem has a corresponding 50,000 inodes! If you
   create a paging partition, you are not using a filesystem, so no
   inodes are necessary. In addition, when you create a filesystem, you
   can reduce the number of inodes to something more reasonable (like one
   inode for every 10k of disk space). I generally don't mess with the
   inode count on my / and /usr partitions, since those contain the
   operating system. Make certain not to reduce the default inode number
   too much: YOU DON'T WANT TO RUN OUT OF INODES. We converted our 350
   megs of paging files to paging partitions, and got another 70 megs of
   free disk space as a result (20%)! 
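
   The inode arithmetic above, as a sketch in Python (the 256-byte inode
   size is the author's guess, not a verified number):

   partition_kbytes = 306418   # from the df output above
   kbytes_per_inode = 2        # default: one inode per 2k
   inode_bytes = 256           # guessed size of one inode
   unused_inodes = 150394      # ifree for /usr2 in the df -i output

   total_inodes = partition_kbytes // kbytes_per_inode   # ~153,000
   wasted_kbytes = unused_inodes * inode_bytes / 1024    # ~37,600k wasted
   print(total_inodes, wasted_kbytes)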

   But I don't know how to do all this optimization, and my research
   assistant is spending all his/her time trying to figure it out.
   [rootd;21may94] 

   Unix system administration is a complex task which requires
   experience. An experienced sysadmin can do in minutes what it would
   take you hours (or days) to accomplish. In addition, an experienced
   sysadmin won't make stupid mistakes very often (let's see: while I was
   learning on the job I ruined our backup tape during an upgrade
   {luckily the upgrade was successful!}, moved a directory inside itself
   as root, botched email service a couple of times, and spent tons of
   time figuring out how to accomplish simple tasks). 

   Most universities have small budgets for their system administrators.
   Many head sysadmins have recruited students to assist them. Basically
   the students slave away for nothing, learn tons of stuff, barely pass
   their classes, become unix gods, and get hired for 40k+/year if/when
   they graduate/flunk out. If your university has a sysadmin group like
   this, you can probably "hire" them to support your machine for about
   $6/hour at about 4 hours/week*machine. The head-sysadmin will be
   happy to give some money to their more-experienced volunteers, the
   volunteers get another line on their resume+additional experience, and
   you get experienced sysadmins to run your machine. In addition, most
   sysadmin groups have an automated nightly backup. Just think: your
   machine gets backed up EVERY NIGHT AUTOMATICALLY! 

   At Portland State University the Electrical Engineering sysadmin
   group has been hired to maintain the unix machines of four other
   departments, at an average price of $15/week*machine (no additional
   price for xterms!) The quality of the service is excellent (especially
   since the most experienced volunteers are usually the ones given the
   money), there is no annual training-gap as people leave (since the
   experienced volunteers are constantly training the new ones) and you
   have the entire resources and experience of the sysadmin group to help
   you. 

   Of course, test them by deleting an unimportant file and seeing if they
   can restore it from backups (the backup test is the most important in
   system administration--have you tested your backups lately?). If they
   successfully restore the file from backups, give them the
   sun-optimization list (above two questions) and watch as the most
   experienced volunteer turns the optimization into a recruit-training
   session :-) They may even have a contest to see how small they can
   make your kernel-configuration file! 

   If your location doesn't have such a group, perhaps another university
   in town has one. 

   How can I identify how much paging space is available on my
   workstation? [gerard tromp; 29apr94] 

   Paging space (also referred to as swap space), as well as its current
   use, can be displayed with: 

   pstat -s        (Non-root users need to use: /usr/etc/pstat -s)

   e.g. 
   > sanger 1% /usr/etc/pstat -s
   > 11456k allocated + 3108k reserved = 14564k used, 252744k available
   > sanger 2%

   Swap space can be spread over several disk partitions, that is, over
   several partitions on the same disk or over partitions on several
   disks. 

   e.g. 
   > sanger 2% cat /etc/fstab
   > /dev/sd0a /                         4.2 rw                   1 1
   > /dev/sd0e /usr                      4.2 rw                   1 2
   >    .
   >    ... several other partitions removed from listing
   >    .
   > /dev/sd1b swap                      swap rw                  0 0
   > /dev/sd2b swap                      swap rw                  0 0
   > swap      /tmp                      tmp rw                   0 0
   > sanger 3%

   FILE FORMATS

   How do I convert between crimap and linkage formats?
   [rootd;29may94] 

   The crimap utilities package contains genlink and linkgen, which
   convert between .gen files and linkage files. I am attempting to find
   an ftp site; if you know of one, let me know. I already have source.
   If I could find the authors, to have them authorize it, I'd be happy
   to put the entire crimap-utilities package on one of my ftp sites. 

   How do I get my ceph data into crimap format? [rootd;29may94] 

   You can output the file in linkage format and use link2gen (if you
   have it; see F2). The disadvantage here is that your marker names are
   separated from your data, and it's easy to make a mistake and get them
   mixed up. Alternatively, you can output the file in ped.out format and
   use mkcrigen. mkcrigen is a great program which automatically
   transfers the marker names with the data (eliminating one source of
   error). Unfortunately, I only have an executable with a hardcoded
   80-marker maximum. Nobody can find the source code. 

   lnktocri is very similar to link2gen, and is included in the multimap
   tar file. 

   John Attwood has a ceph2cri program, which reads your ped.out file
   and outputs a .gen file. It is available via anonymous ftp from
   ftp.gene.ucl.ac.uk in /pub/packages/linkage_utils. It runs on DOS
   machines. According to John Attwood: "Making the Unix-based
   system available is much more complex, as it involves many scripts,
   Makefiles and executables, but I'll try to do it when I have time." If
   you need the unix version, send me email and I'll forward a summary
   to John Attwood. That way he won't waste time putting together a unix
   version unless there is definite interest. 

   Educational resources for teaching genetics

   Genetics Construction Kit--fly genetics simulator [meir;10Aug94]

   There is an excellent program called Genetics Construction Kit that
   models fruit fly genetics - lots of features, and a pretty good
   interface. It comes on a CD with a bunch of other really good biology
   education software from a consortium called BioQuest ($89 for the CD,
   and it's really worth it - only mac stuff, though). Look around on
   bulletin boards for the Intro to BioQuest hypercard stack, which gives
   their philosophy and a description of the programs they have. 

   Michael Bacon says:

   Well, recently out of a genetics class, I can recommend a program
   called "Catlab." The idea is that you breed lots and lots of cats, and try
   to figure out what genes control the cat's coat and tail. 

   gen5ajt says:

   We use Populus 3.x for DOS (a Windows version is out soonish); this is
   an excellent population genetics package, and I can't recommend it
   highly enough. It's free and downloadable by ftp from somewhere. 

   FAQ keeper: Darrell Root 
   rootd at ee.pdx.edu
   or 
   rootd at ohsu.edu
   HTML by Tim Trautmann 
   timt at ee.pdx.edu


