To:IN%"genetic-linkage at net.bio.net"
CC:
Subj:RE: comparing allele frequencies
Don Bowden "bowden at mgrp.bgsm.wfu.edu" writes:
Question:
This is probably a dumb question, but... we have genotyped a
microsatellite in a large number of caucasian and
african-american samples and would like to compare the allele
frequency distribution to see if they are different. I did this
using a contingency table to calculate chi square and there is a
significant difference. I seem to recall though that if any of
the cells has less than 5 elements in it, chi square is not the
appropriate way to go. What is the right way to do this, and more
importantly, is there some textbook or other source which would
show a simple-minded molecular biologist how to do it?
>
Following is a summary of the many responses I received; I left
off the names to protect the innocent, but will be happy to get
people in touch with each other if they want. Thanks for the many
helpful suggestions.....
This is not a dumb question. It is not easy to deal with
statistical analysis of loci with lots of alleles, as is
typical of micro-satellite repeats. You could look at Bruce
Weir's "Data Analysis" book; there is some stuff on tests
involving multiple locus markers. Depending on the number of
alleles it may be easy or hard; there has been quite a bit
published in the last few years on statistical tests involving
extraordinarily polymorphic systems, but this literature hasn't
made it into books yet.
You are correct to be leery of tests which are based on
large sample approximations when your samples aren't big enough.
The "5" rule for the chi-square test is more a rule-of-thumb than
a hard-and-fast rule. For tables with not too many cells, it is
often possible to use exact permutation tests instead. Rather
than just consult a textbook if you are unsure of what you are
doing, why don't you see if your university has a statistical
consulting service? At least they might steer you to appropriate
analyses, even if you have to carry out the
computations yourself.....
Bruce Weir's book, Genetic Data Analysis (Sinauer, 1990) provides
a thorough and expert (but not simple) discussion. Be aware that
this is a hot and extremely controversial question at the moment
(if we knew the one, or any one, definitively correct way of
comparing allele frequency distributions for samples drawn from
two populations of humans, typed for multiallelic DNA markers,
and from the comparison estimating accurately how much allele
frequencies truly vary between populations, most of the
controversy concerning forensic DNA typing could go away).
What you specifically need to be aware of is that several
competing "definitive" solutions exist at the moment, and
Weir's is only one of them....
You are right to question the accuracy of the Chi-square
result in the case where some of the cell numbers are < 5 in a
2Xn contingency table. The best way to do the test is by a
Monte-Carlo simulation, where many random datasets are generated
that all have the same marginal totals that your data have. The
Chi-square value is calculated for all of the tables and the
position of your table among all the tables is used as the
measure of significance.
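As a rough illustration of the Monte Carlo procedure described
above, here is a minimal Python sketch. It is not any
respondent's program: the function names (pearson_chi2,
monte_carlo_p), the number of simulations, and the example
counts are all made up for illustration. It assumes each row of
the table is one population and each column one allele.
Shuffling the allele labels among the sampled chromosomes keeps
both sets of marginal totals fixed.

```python
import random


def pearson_chi2(table):
    """Pearson chi-square statistic for a table given as a list of rows."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n
            if exp > 0:
                stat += (obs - exp) ** 2 / exp
    return stat


def monte_carlo_p(table, n_sim=10000, seed=0):
    """Estimate a p-value by shuffling allele labels across populations."""
    rng = random.Random(seed)
    n_cols = len(table[0])
    # One entry per sampled allele: its population (row) and allele (column).
    pops = [i for i, row in enumerate(table) for c in row for _ in range(c)]
    alleles = [j for row in table for j, c in enumerate(row) for _ in range(c)]
    observed = pearson_chi2(table)
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(alleles)  # row and column totals are unchanged
        sim = [[0] * n_cols for _ in table]
        for p, a in zip(pops, alleles):
            sim[p][a] += 1
        # Small tolerance so floating-point ties count as "at least as large".
        if pearson_chi2(sim) >= observed - 1e-9:
            hits += 1
    # Add-one form so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_sim + 1)


# Hypothetical counts: 2 populations x 4 alleles.
example = [[10, 3, 1, 6], [2, 8, 7, 3]]
chi2, p = monte_carlo_p(example, n_sim=2000)
print("chi-square:", round(chi2, 3), "estimated p:", round(p, 4))
```

The estimated p-value is the fraction of shuffled tables whose
chi-square is at least as large as the observed one; more
simulations give a more stable estimate.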
A biologically relevant reference for doing this is: Roff,
D.A. and Bentzen, P. 1989. The analysis of mitochondrial DNA
polymorphisms: chi-square and the problem of small samples.
Molecular Biology and Evolution 6:539-545.
A book that addresses this issue is: Agresti, A. 1990.
Categorical Data Analysis. Wiley.
A couple of years ago I wrote a DOS program to do the same
kind of analysis.....
Yes, Monte Carlo is a good way to do the tests, if programs
already exist. However, if the tables are very large and
sparse, standard Monte Carlo (just keeping the marginals fixed)
is very slow. Markov Chain Monte Carlo methods become an option
then, but to my knowledge these haven't yet been implemented in
existing programs that would handle your data.
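To give a feel for what a Markov chain approach could look like,
here is a small Python sketch. It is only an assumption about
one possible scheme, not a description of any published program:
the chain moves between tables with the same marginal totals by
adding +1/-1 to the four corners of a randomly chosen 2x2
sub-table, and a Metropolis acceptance step targets the
distribution of tables under no population difference. The
names (pearson_chi2, mcmc_p), the step count, and the example
counts are hypothetical.

```python
import random


def pearson_chi2(table):
    """Pearson chi-square statistic for a table given as a list of rows."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n
            if exp > 0:
                stat += (obs - exp) ** 2 / exp
    return stat


def mcmc_p(table, n_steps=20000, seed=0):
    """Random walk over tables with fixed margins; returns (p estimate, last table)."""
    rng = random.Random(seed)
    t = [list(r) for r in table]
    n_rows, n_cols = len(t), len(t[0])
    observed = pearson_chi2(table)
    hits = 0
    for _ in range(n_steps):
        i1, i2 = rng.sample(range(n_rows), 2)
        j1, j2 = rng.sample(range(n_cols), 2)
        # Proposed move: +1 at (i1,j1) and (i2,j2), -1 at (i1,j2) and (i2,j1).
        # With table probability proportional to 1 / product(n_ij!), the
        # Metropolis ratio is n12*n21 / ((n11+1)*(n22+1)).
        down = t[i1][j2] * t[i2][j1]  # zero if a count would go negative
        up = (t[i1][j1] + 1) * (t[i2][j2] + 1)
        if rng.random() < down / up:
            t[i1][j1] += 1
            t[i2][j2] += 1
            t[i1][j2] -= 1
            t[i2][j1] -= 1
        if pearson_chi2(t) >= observed - 1e-9:
            hits += 1
    return hits / n_steps, t


# Hypothetical counts: 2 populations x 4 alleles.
p_est, last = mcmc_p([[10, 3, 1, 6], [2, 8, 7, 3]], n_steps=5000)
print("estimated p:", round(p_est, 4))
```

Successive tables in the chain are correlated, so many more
steps are needed than independent simulations would require, and
the result should be treated as approximate.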
You should post a summary of your replies - you are
undoubtedly not the only person who wants to do such tests.
They are also relevant for case-control studies, e.g., when you
might be interested in linkage disequilibrium in the vicinity of
a mapped disease locus.....
Thanks for forwarding the message regarding the analysis of
sparse, many-celled contingency tables (by Monte-Carlo
simulations). My small, home-brewed program called was
designed to analyze these kinds of data. I've used it to
compare allele frequency distributions of RFLP VNTRs for the
forensic lab at the Royal Canadian Mounted Police and the FBI.
I've distributed the program to anyone who requests it....
There is a problem with markers with large numbers of
alleles. Your approach is basically right .... and you are right
that chi-square is not very robust with fewer than 5 per cell.
Most stat people will tell you to collapse cells until you
get at least 5 per cell... e.g. take the 110 and 112 bp alleles
and put them in one cell....
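The pooling idea above can be sketched in Python as follows.
This is just one way to do it, under the assumption that the
columns are ordered by allele size so that neighbours (such as
110 and 112 bp) are pooled together; the function name
pool_adjacent, the threshold argument, and the counts are
hypothetical.

```python
def pool_adjacent(table, min_expected=5.0):
    """Merge neighbouring allele columns until every expected count
    in each pooled column reaches min_expected."""
    row_tot = [sum(r) for r in table]
    n = sum(row_tot)
    pooled = []                   # finished columns, one list per column
    current = [0] * len(table)    # column being accumulated
    for col in zip(*table):
        current = [a + b for a, b in zip(current, col)]
        smallest = min(rt * sum(current) / n for rt in row_tot)
        if smallest >= min_expected:
            pooled.append(current)
            current = [0] * len(table)
    if any(current):
        # Leftover rare alleles at the end join the last finished column.
        if pooled:
            pooled[-1] = [a + b for a, b in zip(pooled[-1], current)]
        else:
            pooled.append(current)
    return [list(r) for r in zip(*pooled)]


# Hypothetical counts: 2 populations x 5 alleles ordered by size.
counts = [[1, 2, 20, 15, 1], [0, 4, 18, 22, 2]]
print(pool_adjacent(counts))
```

Pooling discards information about the rare alleles, so the
choice of which cells to combine should be made before looking
at the test result.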
According to my old version of Steele and Torrie, you need
to simply apply a correction for continuity to your test. They
quote Yates as proposing the reduction of the absolute deviation
(observed-expected) by 0.5....
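For reference, the continuity correction mentioned above can be
written out as a short Python sketch. Note that Yates's
correction is usually stated for 2x2 tables (one degree of
freedom), so this example assumes a 2x2 table; the function name
chi2_2x2 and the counts are hypothetical.

```python
def chi2_2x2(table, yates=True):
    """Chi-square for a 2x2 table, optionally with Yates's correction."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            exp = rows[i] * cols[j] / n
            dev = abs(table[i][j] - exp)
            if yates:
                # Reduce the absolute deviation by 0.5, but not below zero.
                dev = max(dev - 0.5, 0.0)
            stat += dev ** 2 / exp
    return stat


# Hypothetical counts: 2 populations x 2 allele classes.
counts = [[12, 5], [6, 14]]
print("uncorrected:", round(chi2_2x2(counts, yates=False), 3))
print("corrected:  ", round(chi2_2x2(counts, yates=True), 3))
```

The corrected statistic is never larger than the uncorrected
one, so the correction makes the test more conservative.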
You calculated the chi-square value as sum [ (O-E)^2 / E ].
A better statistic [but the exact same data structure] is
the G test. It is more closely distributed as a chi-square,
especially when class numbers are small. The rule of thumb is
to have no expected class number less than 1, for this test. If
classes are too small, a column can be lumped with another
column of a rare allele. The larger the table (more than 2X2),
the less sensitive this test is to small expected numbers. See
Sokal and Rohlf, Biometry, Second Ed., 1981, Freeman, pages
731-747. Your design is probably model II (maybe I, but not
III). See especially pages 744-6. This is a test for
independence, homogeneity, or heterogeneity. The G-test might
be called a likelihood ratio test in another text.....
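A minimal Python sketch of the G statistic for the same kind of
table is given here for illustration. It uses the standard
likelihood-ratio form G = 2 * sum [ O * ln(O/E) ], with expected
counts computed from the marginal totals as for chi-square; the
function name g_statistic and the counts are hypothetical, and
no small-sample correction is applied.

```python
import math


def g_statistic(table):
    """Return (G, degrees of freedom) for a rows x columns table."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    g = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            if obs > 0:  # empty cells contribute nothing to the sum
                exp = row_tot[i] * col_tot[j] / n
                g += obs * math.log(obs / exp)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return 2.0 * g, df


# Hypothetical counts: 2 populations x 4 alleles.
g, df = g_statistic([[10, 3, 1, 6], [2, 8, 7, 3]])
print("G =", round(g, 3), "with", df, "degrees of freedom")
```

G is then compared to a chi-square distribution with the
returned degrees of freedom, just as the Pearson statistic
would be.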
You should use a Fisher exact test (because of the small cell
sizes as you surmised) but probably a variation of the test
(since Fisher's test is for 2 X 2 tables) that was described by:
SW Guo and EA Thompson "Performing the exact test of
Hardy-Weinberg proportion for multiple alleles" Biometrics
48:361-372, 1992. Guo and Thompson also have a more detailed
technical report available from Elizabeth Thompson at the
University of Washington; call (206) 543-7237 and ask the
secretary to send Technical Report #187 and a program available
on request. While you are not necessarily testing H-W, the methods
are easily adapted to the comparisons of population allelic
distributions you are doing....
If you are interested in doing exact tests of Hardy-Weinberg
Equilibrium, there is a nice program available from Sun-Wei Guo
at the University of Michigan. His programs are written in `C'
and require compilation on your own machine. I've compiled them
on a Sun Workstation and found them very easy to use.....