Skip to main content

Comparison

There's a reason we started investing so many hours and so many new grey hairs into writing PopGen.jl when there was an existing ecosystem in R to perform these same tasks. Like we explain in the home page of these docs, we want a platform that is:

  1. fast(er)
  2. written in a single language
  3. easy to use

So, we'd like to prove that Julia and PopGen.jl actually achieves that by showing a few benchmarks comparing PopGen.jl to popular population genetics packages in R. It's worth mentioning that we ourselves use and have published work incorporating these packages, and are incredibly grateful for the work invested in them. We appreciate those folks and have tremendous respect and envy for the work they continue to do! Here are links to adegenet, pegas, hierfstat, and ape.

Benchmarks

To make this a practical comparison, we're going to use the gulfsharks data because it is considerably larger (212 samples x 2209 loci) than nancycats (237 x 9) and a bit more of a "stress test". All benchmarks in R are performed using the microbenchmark package, and BenchmarkTools are used for Julia.

using BenchmarkTools, PopGen

As a note, the reported benchmarks are being performed on a 64-bit Manjaro Linux system on a nothing-special Lenovo Thinkbook 14S with 8gigs of RAM and a 8th gen Intel i5 mobile processor. Note: all of the Julia benchmarks, unless explicitly stated, are performed single-threaded (i.e. not parallel, distributed, or GPU).

Reading in data

Since gulfsharks is provided in PopGenCore.jl, we will just load it with genepop(). If you would like to try this yourself in R, find the gulfsharks.gen file in the package repository under /data/source/gulfsharks.gen. Since the file importer now uses CSV.jl to read files, there is one step of the genepop parser that is multithreaded. However, the majority of the data parsing (formatting the raw data into a correct PopData structure) occurs using a single thread. This R benchmark will take a few minutes. Consider making some tea while you wait.

@benchmark sharks = genepop("data/source/gulfsharks.gen", silent = true)
BenchmarkTools.Trial: 10 samples with 1 evaluation.
Range (min … max): 472.671 ms … 526.777 ms ┊ GC (min … max): 0.00% … 5.75%
Time (median): 507.187 ms ┊ GC (median): 5.00%
Time (mean ± σ): 501.984 ms ± 17.301 ms ┊ GC (mean ± σ): 3.47% ± 2.99%

▁ ▁ ▁ ▁ ▁ ▁ █ ▁ ▁
█▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁█▁█▁▁▁█▁▁▁▁▁▁▁▁▁▁█ ▁
473 ms Histogram: frequency by time 527 ms <

Memory estimate: 172.46 MiB, allocs estimate: 2246870.

import plot

Comparing averages, Julia (via PopGen.jl) clocks in at 507ms versus R (via adegenet) at 6.745s , so ~13.3x faster.

PopData vs genind size

It was pretty tricky to come up with a sensible/efficient/convenient data structure for PopGen.jl, and while the two-DataFrames design might not seem like it took a lot of effort, we ultimately decided that the column-major style and available tools, combined with careful genotype Typing was a decent "middle-ground" of ease-of-use vs performance.

Anyway, it's important to understand how much space your data will take up in memory (your RAM) when you load it in, especially since data's only getting bigger! Keep in mind that gulfsharks in PopGen.jl also provides lat/long data, which should inflate the size of the object somewhat compared to the genind, which we won't add any location data to.

julia> Base.summarysize(sharks)
3452870
#bytes

data structure plot

The original genepop file is 3.2mb (the vertical line), and our PopData object takes up ~3.45mb in memory (300kb larger than the source file) versus the ~5.3mb of a genind, which is ~1.5x larger than the source file. That's quite a big difference!

Summary statistics

The obvious hallmark of population genetics is heterozygosity values and F-statistics. Here we'll compare the basic summary statistics that can be produced using hierfstat and PopGen.jl.

julia> @benchmark summary(sharks, by = "global")
BenchmarkTools.Trial:
memory estimate: 88.42 MiB
allocs estimate: 1307128
--------------
minimum time: 151.963 ms (0.00% GC)
median time: 171.484 ms (7.60% GC)
mean time: 172.456 ms (6.08% GC)
maximum time: 186.606 ms (7.04% GC)
--------------
samples: 29
evals/sample: 1

summary statistics plot

Comparing averages, PopGen.jl clocks in at ~171ms versus hierfstat's 4.6s, which is ~27x faster on these data. However, when testing on a data that was 401 samples x 5331 loci (not shown), PopGen.jl performed 36.6x faster. This gap seems to increase the larger the data is, but we have not tested the upper limits of this.

Chi-squared test for HWE

This is a classic population genetics test and a relatively simple one. The R benchmark will take a while again, so if you're following along, this would be a good time to reconnect with an old friend.

julia> @benchmark hwe_test(sharks)
BenchmarkTools.Trial:
memory estimate: 46.22 MiB
allocs estimate: 1050525
--------------
minimum time: 145.476 ms (0.00% GC)
median time: 179.146 ms (4.35% GC)
mean time: 176.142 ms (3.56% GC)
maximum time: 204.813 ms (0.00% GC)
--------------
samples: 29
evals/sample: 1

chi squared plot

Comparing averages, PopGen.jl clocks in at ~176ms versus adegenet's 6.3s, so ~35.8x faster on these data(!)

Pairwise FST

You all know it, you all love it. What's population genetics without a little pairwise FST sprinkled in? This benchmark compairs the Weir & Cockerham pairwise FST calculation in PopGen.jl against hierfstat

We will add the extra keywords samples and seconds to the benchmark macro so we can get a full 100 evaluations. You will need to start Julia with 1 available threads via julia --threads 1 (julia >= v1.5) or JULIA_NUM_THREADS=1 (< v1.5).

julia> @benchmark pairwise_fst(sharks) samples = 100 seconds = 700
BenchmarkTools.Trial:
memory estimate: 869.93 MiB
allocs estimate: 6090633
--------------
minimum time: 557.995 ms (9.29% GC)
median time: 571.297 ms (11.26% GC)
mean time: 580.627 ms (12.59% GC)
maximum time: 754.451 ms (31.41% GC)
--------------
samples: 100
evals/sample: 1

pairwise fst plot

On a single thread, pairwise FST in PopGen.jl is ~340x faster than in hierfstat, and a whopping 665x faster using 6 threads with the optimized matrix-based implementation.