Advanced PopData Indexing

In version 0.7.0, we introduce a powerful new way to index PopData by building directly on the work done in DataFrames.jl. You can now index and subset PopData almost as though you were subsetting the genodata dataframe directly, and depending on the index you get back a new subsetted PopData object, a genodata table, or one or more columns. We'll go through some examples using the nancycats data. The conceptual syntax (the arrows are for demonstration only) looks like:

# return new PopData
popdata[column -> condition]

# return a new genodata table
popdata[column -> condition, :]

# return a specific column
popdata[column -> condition, :column]

# return specific columns
popdata[column -> condition, [:col1, :col2]]

julia> using PopGen
julia> ncats = @nancycats
PopData{Diploid, 9 Microsatellite loci}
Samples: 237
Populations: 17

Basic conditional indexing

Basic conditional indexing is a fancy way of saying "pulling out specific information". Let's say we wanted to omit locus fca8.

julia> ncats[genodata(ncats).locus .!= "fca8"]
PopData{Diploid, 8 Microsatellite loci}
Samples: 237
Populations: 17

Or, maybe we only want loci fca8 and fca23. We use the ∈ (\in<TAB>) operator and wrap the loci in Ref() to keep the set itself from being broadcast over.

julia> ncats[genodata(ncats).locus .∈ Ref(["fca8", "fca23"])]
PopData{Diploid, 2 Microsatellite loci}
Samples: 237
Populations: 17
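
To see what Ref() does in isolation, here is a minimal sketch using only Base Julia. The `locus` vector is a made-up stand-in for the locus column of a long-format table, not real nancycats data. Without Ref(), broadcasting would try to pair the elements of the two vectors with each other instead of testing each locus against the whole set.

```julia
# Hypothetical stand-in for the `locus` column of a long-format genotype table
locus = ["fca8", "fca23", "fca37", "fca8", "fca23", "fca37"]
wanted = ["fca8", "fca23"]

# Ref(wanted) makes broadcasting treat `wanted` as a single value, so every
# element of `locus` is tested for membership in the whole set
mask = locus .∈ Ref(wanted)

# A Boolean mask like this is what goes inside the square brackets
kept = locus[mask]
println(kept)
```

The mask has one Boolean per row, which is the shape of condition the indexing examples on this page pass to PopData.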

Perhaps we want only populations 1 through 5. Again, we bind the set in Ref() to prevent broadcasting over its elements. We also need to change the integers to strings because population names are always strings.

julia> ncats[genodata(ncats).population .∈ Ref(string.(1:5))]
PopData{Diploid, 9 Microsatellite loci}
Samples: 82
Populations: 5

Maybe we just want to know the names of the samples in population 5 (although for something like this you could just as well index the sampleinfo dataframe). Note that we need unique here because the genodata table is in long format, meaning each sample name occurs as many times as there are loci.

julia> ncats[genodata(ncats).population .== "5", :name] |> unique
15-element Vector{InlineStrings.String7}:
"N55"
"N56"
"N57"
"N58"
"N59"
"N60"
"N61"
"N62"
"N63"
"N64"
"N65"
"N66"
"N67"
"N68"
"N69"

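The need for unique can be illustrated with a small sketch in Base Julia. The two vectors below are invented toy columns (two loci per sample), standing in for the name and population columns of a long-format table.

```julia
# Hypothetical long-format columns: each sample appears once per locus (2 loci here)
name       = ["N55", "N55", "N56", "N56", "N1", "N1"]
population = ["5",   "5",   "5",   "5",   "1",  "1"]

# Filtering by population keeps one entry per sample per locus
names_in_5 = name[population .== "5"]

# unique collapses the repeats, keeping first-occurrence order
samples_in_5 = unique(names_in_5)
println(samples_in_5)
```

Before unique, the filtered vector still holds each name twice, once for every locus row.
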
Advanced conditional indexing

Just like in DataFrames.jl, we can chain conditions with a broadcasted "and" operator (.&) and really pull out information of interest. This also works for a broadcasted "or" operator (.|). Something to keep in mind is that each statement needs to be wrapped in parentheses like:

popdata[(statement1) .& (statement2)]
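
As a rough illustration of why the parentheses matter, the sketch below combines masks over two invented toy columns in Base Julia. In Julia, `.&` binds more tightly than `.==`, so leaving the parentheses out would group the expression differently from what is intended.

```julia
# Hypothetical columns standing in for `locus` and `population`
locus      = ["fca8", "fca8", "fca23", "fca23"]
population = ["1",    "2",    "1",     "2"]

# Each comparison is parenthesized so it is evaluated before .& or .|
both   = (locus .== "fca8") .& (population .== "2")   # rows meeting both conditions
either = (locus .== "fca8") .| (population .== "2")   # rows meeting at least one

println(both)
println(either)
```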

Let's find all the samples in population 2 that are heterozygous for allele 133 in locus fca8 and return just a dataframe. Notice we are using the ishet method ishet(genotype, allele) and broadcasting it with ishet.() over an array of genotypes.

julia> gd = genodata(ncats) ;
julia> ncats[(gd.locus .== "fca8") .& (gd.population .== "2") .& (ishet.(gd.genotype, 133)), :]

6×4 DataFrame
 Row │ name      population  locus   genotype
     │ String7…  String      String  Tuple…?
─────┼──────────────────────────────────────────
   1 │ N141      2           fca8    (129, 133)
   2 │ N142      2           fca8    (129, 133)
   3 │ N146      2           fca8    (129, 133)
   4 │ N151      2           fca8    (129, 133)
   5 │ N154      2           fca8    (133, 135)
   6 │ N155      2           fca8    (131, 133)

How about which samples are missing data for locus fca8?

julia> gd = genodata(ncats) ;
julia> ncats[(gd.locus .== "fca8") .& (ismissing.(gd.genotype)), :name]
20-element PooledArrays.PooledVector{InlineStrings.String7, UInt8, Vector{UInt8}}:
"N215"
"N216"
"N188"
"N189"
"N190"
"N191"
"N192"
⋮
"N197"
"N198"
"N199"
"N200"
"N201"
"N206"

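The missing-data query follows the same masking pattern. Here is a minimal Base Julia sketch on invented toy columns, where `missing` marks an absent genotype. The sample names and genotypes are made up for illustration.

```julia
# Hypothetical long-format columns; `missing` marks an absent genotype
name     = ["N1",       "N2",    "N1",       "N2"]
locus    = ["fca8",     "fca8",  "fca23",    "fca23"]
genotype = [(129, 133), missing, (140, 146), missing]

# ismissing is broadcast over the genotypes, then combined with the locus condition
mask = (locus .== "fca8") .& ismissing.(genotype)

missing_at_fca8 = name[mask]
println(missing_at_fca8)
```

Only the row that is both at locus fca8 and missing a genotype passes the mask, so the N2 row at fca23 is excluded even though its genotype is also missing.
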
This should get you started on thinking of ways to explore your data 😄.