
· 5 min read
Pavel Dimens

A bit about FST

If you have spent any time being exposed to population genetics, you've likely encountered the term FST, our field's most beloved/maligned one-size-fits-all coefficient. That description is a gross oversimplification, but the purpose of this post isn't to dig deep into the world of differentiation statistics, but instead to shed light on the general concept and the logic behind its significance testing.

If you're like me, you've used and reported some kind of pairwise XST statistic in your work, where that value was generated using some fantastic software like adegenet or GenoDive, and you trusted that the methods were honest and appropriate (they are!). Part of the motivation behind PopGen.jl was to invest time into learning and understanding some of these methods I use, despite not coming from a mathematics or computer science background. Well, as I actually learn stuff and decode methods (no pun intended), I feel like it might be worthwhile to share some of my basic insights with others, in a way that I wish I could have been taught. So, this time it's FST. The purpose of this post is to cover the larger concepts without getting into the overwhelming math. Let's go!

What is FST?

To understand FST, we need to start with F, the inbreeding coefficient devised by brilliant-but-ethically-questionable Sewall Wright. The F coefficient was a mathematically simple way to understand the level of inbreeding in a group, originally described for cattle breeds (the 🐄 part isn't relevant, but it's a fun fact). There were three circumstantial versions of F, denoted by very straightforward subscripted letters: FIT, FIS, FST. The I stands for Individual, S for Subpopulation, T for Total. The combination of letters explains what relationship you are measuring: FIT was the comparison of F for an Individual vs the Total, FIS was the comparison of F for an Individual vs a Subpopulation.

Logically, FST would then mean you were looking at F for Subpopulations vs the Total, which is where population geneticists get all excited. The values range from 0 to 1, with 0 being panmixia (fully mixed) and 1 being complete isolation. Although it seems like it would be linear, it's not-- Wright considered 0.125 to be the cutoff between panmictic and divergent. If one was to calculate FST for one large group of individuals and get a value >0.125, this would suggest there is population subdivision happening (exciting!!).

Pairwise FST

Going one step further, you can narrow how you calculate FST to identify different trends. A "pairwise" FST is when you calculate FST for only a pair of populations at a time. With a pairwise FST, you are now testing if two populations are panmictic or divergent with each other. Depending on your study system and the questions you're trying to answer, this can be extremely valuable information.

From a genetic standpoint, it looks kinda like this: fst_diagram

This is an over-simplified system (diploid, single locus, biallelic, ignoring Hardy-Weinberg stuff), but it demonstrates the point. When you are investigating two groups, if they are both completely heterozygous for the locus (for the same alleles), you would assume the groups are fully mixed (FST = 0). And that makes sense, right? Given the information required to calculate FST, it's a reasonable conclusion to say those alleles are constantly shared between the groups. And the opposite then should be true too: if both groups are completely fixed for different alleles (FST = 1), then they clearly aren't sharing alleles with each other.
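The two extremes above can be put into numbers. As a sketch (not how any published estimator is actually implemented), Wright's basic FST for one biallelic locus is (HT - HS) / HT, where HT is the expected heterozygosity of the pooled total and HS is the mean expected heterozygosity within subpopulations:

```julia
# Sketch of Wright's basic FST for one biallelic locus.
# `p` holds the frequency of one allele in each subpopulation.
function wright_fst(p::Vector{Float64})
    pbar = sum(p) / length(p)                 # pooled (total) allele frequency
    Ht = 2 * pbar * (1 - pbar)                # expected heterozygosity, total
    Hs = sum(2 .* p .* (1 .- p)) / length(p)  # mean heterozygosity within subpops
    (Ht - Hs) / Ht
end

wright_fst([0.5, 0.5])  # identical allele frequencies → 0.0 (panmixia)
wright_fst([1.0, 0.0])  # fixed for different alleles → 1.0 (complete isolation)
```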


Like I alluded to in the opening paragraph, FST (and its derivatives) isn't a golden rule. There are natural phenomena that FST cannot account for, which may lead to misleading results or incorrect interpretations. For one, there are Hardy-Weinberg assumptions, so the presence of kin, overlapping generations, etc. kinda messes things up. Another consideration is that the genetic data we collect now is only a snapshot in time. If two populations are completely isolated from one another and have been for generations, but are long lived and slow to evolve, then FST may mislead you into believing they are panmictic (that happened in my shark study). These are just a few examples and there are more.

Significance testing

One way of being more rigorous with FST values is significance testing (you know, generating those p-values everyone loves so much). So the question is, how do we do that? One common (frequentist) solution is permutation testing. The rationale is this: we get some kind of observed FST with our samples arranged in their natural populations, so would that FST look the same if we shuffled the samples up? Honestly, that's the gist of it. We calculate our observed FST value, then we randomly shuffle the samples into two new populations and recalculate the pairwise FST for the new population pair. Then we just keep shuffling and recalculating FST a few thousand times. After all those iterations, we count how often the permuted FST values matched or exceeded our observed (original) value; that proportion is our p-value. This is known as a one-tailed test, since we're only interested in knowing if our observed FST is greater than randomness, meaning we're making the case that FST (and therefore divergence) is highest with our samples in this non-random population configuration.
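Stripped of the genetics, the permutation logic can be sketched in a few lines of Julia. This is not PopGen.jl's implementation; `statistic` is a stand-in for whatever pairwise FST method you'd use, and the populations are just vectors of values:

```julia
using Random, Statistics

# Generic one-tailed permutation test. `statistic` stands in for a
# pairwise FST calculation; pop1/pop2 are the two sample groups.
function permutation_pvalue(statistic, pop1, pop2; iterations = 1000)
    observed = statistic(pop1, pop2)
    pooled = vcat(pop1, pop2)
    n1 = length(pop1)
    hits = 0
    for _ in 1:iterations
        shuffled = shuffle(pooled)                        # random re-assignment
        permuted = statistic(shuffled[1:n1], shuffled[(n1 + 1):end])
        hits += permuted >= observed                      # permuted beat the observed?
    end
    return (hits + 1) / (iterations + 1)                  # proportion of "wins"
end

# Toy usage: absolute mean difference between two clearly separated groups
meandiff(a, b) = abs(mean(a) - mean(b))
permutation_pvalue(meandiff, [1.0, 1.1, 0.9, 1.2], [5.0, 4.9, 5.1, 4.8])
```

A small p-value says the observed arrangement produces a more extreme statistic than nearly all random shuffles, i.e. the population labels carry real signal.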

Other means

The field of population genetics has expanded tremendously since the Fisher and Wright era, and there are now all sorts of interesting ways to identify population subdivision. Each method has its strengths and weaknesses, which is why it's good practice to try multiple things and find agreement between methods. At the time of this writing, PopGen.jl can calculate pairwise FST and perform significance testing using the Nei 1987, Weir & Cockerham 1984, and Hudson et al. 1992 methods. If you're interested in contributing to PopGen.jl, we'd love to have you!

· 11 min read
Pavel Dimens
PopGen.jl <0.9.0

The kinship interface has changed a bit between versions 0.7 and 0.9. This post has not yet been updated for versions 0.9.0+. To follow along, use versions 0.8.0 or lower.

Getting Started

In a population genetics study, you often need to identify if there are kin in your data. This may be necessary because you are trying to remove kin from your data (because of Hardy-Weinberg assumptions), or maybe kinship is a central interest in your study. Either way, the goal of this post is to give you a basic tutorial on using PopGen.jl to perform a relatedness analysis, which is sometimes called a kinship analysis. To follow along, you'll need to have Julia installed, along with the packages PopGen.jl, PopGenSims.jl, and StatsBase.jl. We'll be using the nancycats data because it's smaller than gulfsharks, so things should be a lot quicker.

using PopGen, PopGenSims, StatsBase

julia> cats = @nancycats
PopData{Diploid, 9 Microsatellite Loci}
Samples: 237
Populations: 17


Like Coancestry and the R packages that wrap it (i.e. relate, related), PopGen.jl provides a whole bunch of relatedness estimators that you can choose from for your data. Unfortunately, there is no right answer and you will need to use your discretion. Some people choose an estimator based on the heterozygosity of the data, others choose one based on more liberal or conservative values, and there are yet more criteria one can consider for choosing an estimator. To keep things simple, we're going to use LynchLi. Why? Because I'm the one writing this tutorial, and I said so 😁.

The Steps

1. Calculate pairwise relatedness

This seems pretty obvious, but it needs to be said. In order to do the analysis, you need to get the pairwise relatedness values for the individuals of interest in your data. To keep things simple, we're going to do an all-by-all comparison. But, we also want to bootstrap the pairs to create confidence intervals ("CI") for each pair, so let's talk about that.

2. Bootstrap to calculate CI

It would behoove us to bootstrap the loci in a pairwise comparison n number of times so we can create a confidence interval for the relatedness estimates for each pair. This inflates the runtime of the analysis substantially, but it's a very useful method in making sense of our estimates. If one was to be conservative (and we generally are), then we would reject an estimate for a pair whose CI includes zero. In PopGen.jl, the estimates and bootstrapping are done all at once to minimize processing time, so the command for that would be

julia> rel_out  = kinship(cats, method = LynchLi, iterations = 1000)

By default, the kinship function uses a 95% CI (interval = (0.025, 0.975)), but you can change that with interval = (low,high) where low and high are decimals of your quantiles. The kinship() function returns a NamedTuple of dataframes whenever you are bootstrapping, where each element shares its name with the method used, so in this case, we can access our results with rel_out.LynchLi.

julia> rel_out.LynchLi
27966×8 DataFrame
Row │ sample_1 sample_2 n_loci LynchLi LynchLi_mean LynchLi_median LynchLi_S ⋯
│ String String Int64 Float64? Float64? Float64? Float64? ⋯
1 │ N215 N216 8 0.743535 0.747288 0.748042 0.75344 ⋯
2 │ N215 N217 8 0.230605 0.233593 0.240085 0.34187
3 │ N215 N218 8 0.230605 0.230507 0.221861 0.32161
4 │ N215 N219 8 0.230605 0.23601 0.23567 0.32782
5 │ N215 N220 8 0.333191 0.33798 0.350492 0.39898 ⋯
6 │ N215 N221 8 0.589656 0.594223 0.601308 0.61945
7 │ N215 N222 8 0.0254328 0.0347216 0.0262021 0.21408
8 │ N215 N223 8 0.333191 0.329983 0.331411 0.38402
9 │ N215 N224 8 -0.0258602 -0.021062 -0.0301579 0.21112 ⋯
10 │ N215 N7 8 -0.282325 -0.27967 -0.288337 0.33611
11 │ N215 N141 8 -0.0771532 -0.0796867 -0.083113 0.21261
12 │ N215 N142 8 0.0254328 0.0302549 0.0330718 0.23957
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
27955 │ N295 N289 7 0.322731 0.347021 0.34118 0.41168 ⋯
27956 │ N295 N290 7 0.153414 0.160102 0.164866 0.22862
27957 │ N296 N297 7 -0.0159038 -0.0182747 -0.0187108 0.16981
27958 │ N296 N281 7 0.0405353 0.037025 0.0422647 0.15294
27959 │ N296 N289 7 0.322731 0.328379 0.337317 0.35578 ⋯
27960 │ N296 N290 7 0.153414 0.152384 0.16194 0.19131
27961 │ N297 N281 7 -0.0159038 -0.0128349 -0.0303449 0.21030
27962 │ N297 N289 7 0.37917 0.392517 0.392818 0.45139
27963 │ N297 N290 7 0.435609 0.437829 0.450044 0.47027 ⋯
27964 │ N281 N289 8 0.20428 0.21279 0.207425 0.29611
27965 │ N281 N290 7 0.37917 0.386583 0.390585 0.45471
27966 │ N289 N290 7 0.209853 0.217811 0.222894 0.28649
2 columns and 27942 rows omitted

3. Create CIs for the sibships

There's more???

The next set of steps seem like a lot more work, so allow me to explain. The estimators all generally give you some value between 0-1 (or 0-0.5, same idea) and you can intuit that certain values mean certain things, like that 0 is "unrelated", 0.25 is "half-sib", and 0.5 is "full-sib". However, those are fixed values, so how do we know how far we can deviate from 0.25 (for example) and still call our pair half-siblings? Instead of hand-waving, we can create confidence intervals from simulated data to act as sibship ranges for our data. If this doesn't make sense yet, it will below. Promise!

i. simulate known sibship pairs

Next, we need to further contextualize what our estimates actually mean, and we need to devise a rigorous and defensible method to define the sibling-ship ("sibship") of a pair of samples as unrelated, half-sibs, or full-sibs. To do this, we're going to use PopGenSims.jl to simulate data based on the allele frequencies in our data. What simulatekin does is simulate individuals based on the allele frequencies in your PopData, then simulate offspring of a particular sibship resulting from the "mating" of those individuals. For example, when you specify "fullsib", it generates two offspring resulting from two parents, n number of times. We want to do this for each of the three kinds of sibships.

julia> kin_sims = simulatekin(cats, fullsib = 500, halfsib = 500, unrelated = 500)
PopData{Diploid, 9 Microsatellite Loci}
Samples: 3000
Populations: 3

We can keep all three simulated relationships together, but it will be easier to explain things (for instructional purposes) if we don't, so let's split out each into its own PopData.

julia> fullsib = kin_sims[genodata(kin_sims).population .== "fullsib"] ;
julia> halfsib = kin_sims[genodata(kin_sims).population .== "halfsib"] ;
julia> unrelated = kin_sims[genodata(kin_sims).population .== "unrelated"] ;

ii. get relatedness estimates for the simulated data

Next, we want to get the relatedness estimate for each simulated pair of "known" sibship. We are only interested in the values for the simulated pairs and not samples across pairs. If you aren't sure why that is, think of it this way: we're trying to create a range of values where we can confidently say unknown things are full-sibs (or half-sib, etc.), so we want to know what range of values we get from a bunch of known fullsib pairs, not the unknown relationships of samples between pairs.

It would be a nightmare to manually specify only 2 individuals at a time in kinship(). Instead, the function has a shortcut built into it that will recognize the population names generated from simulatekin and only give you relatedness estimates for those pairs. So, we just need to run it once for each of our simulations, this time without bootstrapping because we are only interested in the estimates. Make sure to use the same estimator! The run will be a lot faster this time because it only needs to calculate estimates for 500 pairs each time (our n from above) and without bootstrapping. When not bootstrapping, kinship() returns a dataframe (versus a NamedTuple of dataframes).

julia> un_sims_rel = kinship(unrelated, method = LynchLi)
500×4 DataFrame
Row │ sample_1 sample_2 n_loci LynchLi
│ String String Int64 Float64?
1 │ sim001_unrelated_1 sim001_unrelated_2 9 -0.11419
2 │ sim002_unrelated_1 sim002_unrelated_2 9 -0.337028
3 │ sim003_unrelated_1 sim003_unrelated_2 9 -0.0696222
⋮ │ ⋮ ⋮ ⋮ ⋮
498 │ sim498_unrelated_1 sim498_unrelated_2 9 0.019513
499 │ sim499_unrelated_1 sim499_unrelated_2 9 0.019513
500 │ sim500_unrelated_1 sim500_unrelated_2 9 0.019513
494 rows omitted
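The half-sib and full-sib simulations are handled the same way; the calls below mirror the one above and produce the `half_sims_rel` and `full_sims_rel` dataframes that appear in the plotting and quantile steps:

```julia
julia> half_sims_rel = kinship(halfsib, method = LynchLi)

julia> full_sims_rel = kinship(fullsib, method = LynchLi)
```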

And if we wanted to plot what that looks like (optional):

using Plots, StatsPlots

julia> density(rel_out.LynchLi.LynchLi, label = "real data", color = :grey, fill = (0, :grey))
julia> density!(un_sims_rel.LynchLi, label = "unrelated")
julia> density!(half_sims_rel.LynchLi, label = "halfsib", color = :blue)
julia> density!(full_sims_rel.LynchLi, label = "fullsib", color = :black)
julia> title!("relatedness estimates on simulated and real data")


Hopefully by now you are starting to contextualize why we're doing all of this. The distributions generated from our simulated data are giving us a better indication of what "unrelated", "halfsib", and "fullsib" estimates look like in our data.

iii. sibship intervals

What we just did is create null distributions for each sibship relationship, so now all that's left is to get a confidence interval from each. Keep in mind that your values will be a bit different due to the randomization involved with many of these steps.

julia> unrelated_ci = quantile(un_sims_rel.LynchLi, (0.025, 0.975))
(-0.3380513384854698, 0.33097433075726507)

julia> halfsibs_ci = quantile(half_sims_rel.LynchLi, (0.025, 0.975))
(-0.06652262584414452, 0.5556155725649398)

julia> fullsibs_ci = quantile(full_sims_rel.LynchLi, (0.025, 0.975))
(0.1989727186688347, 0.8219939374819633)

So, given our data and the simulations we made, we can now make a reasonable assumption regarding the ranges for each sibship relationship:

| Relationship  | Lower Bound | Upper Bound |
| ------------- | ----------- | ----------- |
| half sibling  | -0.066      | 0.555       |
| full sibling  | 0.199       | 0.82        |

4. Finally, the data assessment

Now that we have our relatedness estimates and the acceptable sibship ranges given our data, we can see where our samples fall. We can say that a particular sample pair is unrelated/halfsib/fullsib if:

  1. that pair's confidence interval does not include zero and
  2. that pair's estimate falls within any of the three calculated ranges

Unfortunately, the ranges overlap quite a bit, which makes interpretation difficult, so it may be suitable to use a different estimator.
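As a hypothetical sketch of that decision rule (plain Julia, not a PopGen.jl function), binning each pair's point estimate into the simulated ranges might look like this. Criterion 1 would additionally check the pair's bootstrapped CI from `rel_out`; and because the ranges overlap, this sketch simply lets the more-related label win ties:

```julia
# Hypothetical classifier: assign a sibship label from a point estimate,
# using the simulated sibship ranges computed above. Overlap between the
# ranges is resolved by preferring the more-related category.
function sibship(estimate; halfsib_ci = (-0.066, 0.555), fullsib_ci = (0.199, 0.82))
    ismissing(estimate) && return missing
    fullsib_ci[1] <= estimate <= fullsib_ci[2] && return "full-sib"
    halfsib_ci[1] <= estimate <= halfsib_ci[2] && return "half-sib"
    return "unrelated"
end

sibship(0.74)   # → "full-sib"
sibship(-0.28)  # → "unrelated"
```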

Closing remarks

There is more that can be done for relatedness, like network analysis. However, this tutorial covers what we consider a reasonable way to assess kinship in population genetic studies. If removing kin is your ultimate goal, consider the effects doing that may have on the analyses you are looking to do. Additionally, consider which individual of a pair you would remove and why. If you were curious as to how many identical loci a pair of samples may share, you can check that using pairwiseidentical(). Good luck!

· One min read
Pavel Dimens

If you haven't guessed already, it's the beloved Punnett Square! Nothing screams genetics like everyone's first entry-level genetic diagram!

The Logo Graveyard

While this is completely unrelated to anything important about population genetics or Julia, I (Pavel) want you all to understand the logo-development process that led us to our logo so you can feel my struggle. The process is as follows:

  1. Jason and I spitball ideas

  2. I procrastinate real work and open up Inkscape to draft some ideas

  3. I compose 1-3 versions of an idea and send it to Jason

  4. if Jason != veto
    for i in 1:5
    Jason critiques and suggests changes
    I make the changes
  5. Jason and I finalize the idea!

  6. Weeks pass and I'm not quite satisfied and we start at 1 again

I'm happy to say that we love the Punnett Square and it's for keeps, but have a look at the scrapped ideas as a little walk down memory lane:

logo graveyard