Skip to main content

Hardy-Weinberg Equilibrium

Testing for Hardy-Weinberg Equilibrium (often abbreviated HWE) is a fairly common practice in population genetics. In a two-allele system, the HWE equation is defined as: p2+2pq+q2=1p^2 + 2pq + q^2 = 1 , where pp is the frequency of the first allele and qq is the frequency of the second allele. The formula describes the frequency of all possible genotypes where

HWE variableGenotypeState
p2p^2"pp"homozygous
q2q^2"qq"homozygous
2pq2pq"pq"heterozygous

Testing for deviation from HWE is usually done with a Chi-Squared (ฯ‡2\chi^2) test, where one compares the observed genotype frequencies to the expected genotype frequencies given the observed allele frequencies at a locus. Specifically, the equation is

โˆ‘(observedโˆ’expected)2expected\sum{\frac{(observed - expected)^2}{expected}}

where observedobserved is the observed genotype frequency and expectedexpected is the expected genotype frequency for a locus. To generate our test statistic, we calculate the degrees of freedom:

degreesย ofย freedom=nexpectedย genotypesโˆ’nobservedย allelesdegrees\ of\ freedom = n_{expected\ genotypes} - n_{observed\ alleles}

and use this as the parameter for our ฯ‡2\chi^2 distribution, followed by a cumulative density function using this ฯ‡2\chi^2 distribution and our ฯ‡2\chi^2 value calculated above.

Chi-Squared Testโ€‹

hwetest(data::PopData, by::String = "locus", correction::String = "none")

Calculate chi-squared test of HWE for each locus and returns observed and expected heterozygosity with chi-squared, degrees of freedom and p-values for each locus. Use by = "population" to perform this separately for each population (default: "locus"). Use correction = to specify a P-value correction method for multiple testing (recommended). For convenience, the correction method is appended to the name of the column, so you will always know how those P-values were adjusted.

Argumentsโ€‹

  • data : the input PopData

Keyword argumentsโ€‹

  • by : "locus" (default) or "population"
  • correction : a string specifying a P-value adjustment type (default: "none")

correction methodsโ€‹

  • "bonferroni" : Bonferroni adjustment
  • "holm" : Holm adjustment
  • "hochberg" : Hochberg adjustment
  • "bh" : Benjamini-Hochberg adjustment
  • "by" : Benjamini-Yekutieli adjustment
  • "bl" : Benjamini-Liu adjustment
  • "hommel" : Hommel adjustment
  • "sidak" : ล idรกk adjustment
  • "forwardstop" or "fs" : Forward-Stop adjustment
  • "bc" : Barber-Candeฬ€s adjustment

๐Ÿค” For more information on multiple testing adjustments, see MultipleTesting.jl

Examplesโ€‹

julia> hwetest(@gulfsharks)
2209ร—4 DataFrame
Row โ”‚ locus chisq df P
โ”‚ String Float64 Int64 Float64
โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
1 โ”‚ contig_35208 94.5678 6 0.0
2 โ”‚ contig_23109 50.789 2 9.36085e-12
3 โ”‚ contig_4493 40.7903 2 1.38832e-9
4 โ”‚ contig_10742 14.7325 2 0.000632247
5 โ”‚ contig_14898 58.1948 2 2.30704e-13
6 โ”‚ contig_8483 4.05732 2 0.131511
7 โ”‚ contig_8065 18.0799 2 0.000118574
8 โ”‚ contig_14708 13.6264 2 0.00109919
โ‹ฎ โ”‚ โ‹ฎ โ‹ฎ โ‹ฎ โ‹ฎ
2203 โ”‚ contig_18959 106.658 2 0.0
2204 โ”‚ contig_43517 23.8965 2 6.47041e-6
2205 โ”‚ contig_27356 14.4493 2 0.000728417
2206 โ”‚ contig_475 76.6038 2 0.0
2207 โ”‚ contig_19384 13.7915 2 0.00101209
2208 โ”‚ contig_22368 20.3787 2 3.75686e-5
2209 โ”‚ contig_2784 6.13433 2 0.0465531
2194 rows omitted

Interpreting the resultsโ€‹

Since the results are in table form, you can easily process the table using DataFramesMeta.jl or Query.jl to find loci above or below the alpha threshold you want. As an example, let's perform an HWE-test on the nancycats data without any P-value adjustments:

julia> ncats_hwe = hwetest(@nancycats , by = "population") ;

Now, we can now filter this table and leave only what we're interested in:

julia> ncats_hwe[(ncats_hwe.P .!== missing) .& (ncats_hwe.P .<= 0.05), :]

With this command, we filter the table for:

  1. the P-values are not missing
  2. the P-values are less than or equal to 0.05.

Note: You can use DataFramesMeta.jl, Query.jl, SplitApplyCombine.jl and others for more declarative dataframe manipulation.

This results in a table that now only includes non-missing P-values of 0.05 or lower:

71ร—5 DataFrame
Row โ”‚ locus population chisq df P
โ”‚ String String Float64 Int64 Float64?
โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
1 โ”‚ fca8 2 74.3426 30 1.24317e-5
2 โ”‚ fca8 3 85.2 42 9.15914e-5
3 โ”‚ fca8 6 70.4136 42 0.00390639
4 โ”‚ fca8 7 63.3673 30 0.00035342
5 โ”‚ fca8 10 26.4489 12 0.00926812
6 โ”‚ fca8 11 60.801 42 0.0302593
7 โ”‚ fca8 16 26.15 12 0.0102213
8 โ”‚ fca8 17 57.2 30 0.00198256
โ‹ฎ โ”‚ โ‹ฎ โ‹ฎ โ‹ฎ โ‹ฎ โ‹ฎ
65 โ”‚ fca96 16 37.0033 20 0.0116913
66 โ”‚ fca37 1 12.6562 6 0.0488314
67 โ”‚ fca37 3 24.1481 12 0.0194174
68 โ”‚ fca37 5 61.1317 12 1.40268e-8
69 โ”‚ fca37 7 13.0062 6 0.042938
70 โ”‚ fca37 11 61.8056 42 0.024858
71 โ”‚ fca37 12 61.618 42 0.0257958
56 rows omitted

Visualizing the resultsโ€‹

While not strictly necessary, it might sometimes make sense to generate of heatmap of the results for easier visualization. This is feasible for the nancycats data, but when loci are in the hundreds or thousands, this method quickly becomes counterproductive. In any case, here is a simple example of the HWE results for nancycats plotted as a heatmap using VegaLite.jl (other packages like Makie.jl, Plots.jl, Gadfly.jl etc. would work great too!):

using VegaLite

julia> ncats_hwe = hwetest(@nancycats , by = "population", correction = "bonferroni");

julia> ncats_hwe |> @vlplot(:rect, :locus, :population, color=:P_bonferroni)

hwe_test


Acknowledgementsโ€‹

While most of the arithmetic for the Hardy-Weinberg test is written by us, we rely on the Chi-Squared distribution and probability density function provided by Distributions.jl.