Advanced PopData Indexing
In version 0.7.0, we introduce a powerful new way to index PopData by directly piggy-backing off of the incredible work done in DataFrames.jl.
Now, you can index and subset PopData [almost] as though you were directly subsetting the genodata dataframe, and it will return a new subsetted PopData object (or, depending on the index, a dataframe or a column). We'll go through some examples using the nancycats data.
The conceptual syntax (the arrows are for demonstration) looks like:
# return new PopData
popdata[column -> condition]
# return a new genodata table
popdata[column -> condition, :]
# return a specific column
popdata[column -> condition, :column]
# return specific columns
popdata[column -> condition, [:col1, :col2]]
julia> using PopGen
julia> ncats = @nancycats
PopData{Diploid, 9 Microsatellite loci}
Samples: 237
Populations: 17
Basic conditional indexing
Basic conditional indexing is a fancy way of saying "pulling out specific information". Let's say we wanted to omit locus fca8.
julia> ncats[genodata(ncats).locus .!= "fca8"]
PopData{Diploid, 8 Microsatellite loci}
Samples: 237
Populations: 17
Or, maybe we only want loci fca8 and fca23. We use the ∈ (\in<TAB>) operator and wrap the loci in Ref() so the set is treated as a single value rather than being broadcast over.
julia> ncats[genodata(ncats).locus .∈ Ref(["fca8", "fca23"])]
PopData{Diploid, 2 Microsatellite loci}
Samples: 237
Populations: 17
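To see what Ref() is doing here, the following is a minimal sketch in plain Julia. The locus and wanted vectors are made-up stand-ins for the genodata column and the set of loci, not PopGen.jl objects.

```julia
# Hypothetical stand-in for the `locus` column of a long-format table
locus = ["fca8", "fca23", "fca37", "fca8", "fca23", "fca37"]
wanted = ["fca8", "fca23"]

# Ref() makes `wanted` behave as a single value, so every element of
# `locus` is tested for membership in the whole set
mask = locus .∈ Ref(wanted)

# The same test written with the function form of `in`
mask_fn = in.(locus, Ref(wanted))

# Without Ref(), broadcasting would try to pair `locus` and `wanted`
# element by element, which fails here because their lengths differ
kept = locus[mask]
```

The resulting Boolean mask has one entry per row, which is the shape the indexing syntax above expects.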
Perhaps we want only populations 1 through 5. Again, we wrap the set in Ref() to prevent broadcasting over its elements. We also need to convert the integers to strings because population names are always strings.
julia> ncats[genodata(ncats).population .∈ Ref(string.(1:5))]
PopData{Diploid, 9 Microsatellite loci}
Samples: 82
Populations: 5
Maybe we just wanted to know the names of the samples in population 5, although for something like this you can just as well index the sampleinfo dataframe. Note that we need to use unique here because the genodata table is in long format, meaning there are as many occurrences of each sample name as there are loci.
julia> ncats[genodata(ncats).population .== "5", :name] |> unique
15-element Vector{InlineStrings.String7}:
"N55"
"N56"
"N57"
"N58"
"N59"
"N60"
"N61"
"N62"
"N63"
"N64"
"N65"
"N66"
"N67"
"N68"
"N69"
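The effect of the long format on repeated names can be sketched with plain vectors. The columns below are hypothetical (two samples scored at three loci each) and only imitate the layout of the genodata table.

```julia
# Hypothetical long-format columns: two samples scored at three loci each
name       = repeat(["N55", "N56"], inner = 3)
locus      = repeat(["fca8", "fca23", "fca37"], outer = 2)
population = fill("5", 6)

# Filtering by population keeps one row per sample per locus
hits = name[population .== "5"]

# unique() collapses the repeats down to one entry per sample
samples = unique(hits)
```

Here hits has six entries while samples has two, which is why the example above pipes its result into unique.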
Advanced conditional indexing
Just like in DataFrames.jl, we can chain conditions with a broadcasted "and" operator (.&) and really pull out information of interest. This also works for a broadcasted "or" operator (.|). Something to keep in mind is that each statement needs to be wrapped in parentheses like:
popdata[(statement1) .& (statement2)]
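The parentheses matter because .& and .| bind more tightly than comparisons such as .== in Julia, so an unparenthesized chain would not group the way it reads. A minimal sketch with made-up vectors standing in for two genodata columns:

```julia
# Hypothetical stand-ins for two columns of a long-format table
locus      = ["fca8", "fca8", "fca23", "fca23"]
population = ["1", "2", "1", "2"]

# Each comparison is wrapped in parentheses, then combined element-wise
both   = (locus .== "fca8") .& (population .== "2")   # rows meeting both conditions
either = (locus .== "fca8") .| (population .== "2")   # rows meeting at least one
```

Each result is a Boolean mask with one entry per row, ready to be used as an index.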
Let's find all the samples in population 2 that are heterozygous for allele 133 in locus fca8 and return just a dataframe.
Notice we are using the ishet method ishet(genotype, allele) and broadcasting it with ishet.() over an array of genotypes.
julia> gd = genodata(ncats) ;
julia> ncats[(gd.locus .== "fca8") .& (gd.population .== "2") .& (ishet.(gd.genotype, 133)), :]
6×4 DataFrame
 Row │ name      population  locus   genotype
     │ String7…  String      String  Tuple…?
─────┼──────────────────────────────────────────
   1 │ N141      2           fca8    (129, 133)
   2 │ N142      2           fca8    (129, 133)
   3 │ N146      2           fca8    (129, 133)
   4 │ N151      2           fca8    (129, 133)
   5 │ N154      2           fca8    (133, 135)
   6 │ N155      2           fca8    (131, 133)
How about which samples are missing data for locus fca8?
julia> gd = genodata(ncats) ;
julia> ncats[(gd.locus .== "fca8") .& (ismissing.(gd.genotype)), :name]
20-element PooledArrays.PooledVector{InlineStrings.String7, UInt8, Vector{UInt8}}:
"N215"
"N216"
"N188"
"N189"
"N190"
"N191"
"N192"
⋮
"N197"
"N198"
"N199"
"N200"
"N201"
"N206"
This should get you started on thinking of ways to explore your data.