
Viewing data

PopGen.jl includes commands for inspecting and altering PopData. Following standard Julia convention, only commands ending with a bang (!) are mutating, meaning they alter the input data. So, a command like populations will show you population information, whereas populations! will change that information in your PopData. The mutating commands here alter the data in your PopData, but not the source data (i.e. the files used to create the PopData). The "manipulation" commands are separated into smaller sections to make them less overwhelming, and using the gulfsharks data, you can explore each of the sections like a little tutorial. The sections don't follow any particular order, so feel free to jump around however you like.
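The bang convention is not specific to PopGen.jl. As a minimal sketch using only Base Julia, with an invented vector of population names standing in for real data (it is not PopData), the difference looks like this:

```julia
# Toy stand-in for population labels; these values are made up.
pops = ["Georgia", "Florida", "Georgia"]

# Non-mutating: replace returns a new vector and leaves `pops` untouched.
renamed = replace(pops, "Georgia" => "GA")

# Mutating: replace! (note the bang) changes `pops` in place.
replace!(pops, "Florida" => "FL")

println(renamed)  # ["GA", "Florida", "GA"]
println(pops)     # ["Georgia", "FL", "Georgia"]
```

The same pattern applies to the paired commands described here: the version without the bang reports or returns something, and the version with the bang modifies its input.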

avoid accessing fields directly

TL;DR: End-users (vs developers) shouldn't access PopData fields directly and should use the accessor functions instead.

In earlier versions of PopGen.jl, you were encouraged to directly access the internal fields of PopData. After careful consideration and discussion with other users and developers, it's been decided that we should follow standard-ish convention and provide function wrappers to view PopData fields and discourage direct access (unless you're a developer). This decision is intended to limit unintentional errors, but also means a user has less to learn to get started.

A little hands-on training will probably go a long way, so let's walk through some of the functions available in PopGen.jl with the included data. This tutorial includes both inputs and outputs so you can be confident that what you see in your Julia session is what's supposed to happen. Sometimes the outputs can be a little lengthy, so they will be arranged in code "tabs".

don't manually edit or sort

There are specific relationships between the record entries in PopData objects, so do not use sort, sort!, or manually arrange/add/delete anything in PopData. There are included functions to remove samples or loci, rename things, add location data, etc.

Loading in the data

Let's keep things simple by loading in the nancycats data and calling it ncats.

julia> ncats = @nancycats
PopData{Diploid, 9 Microsatellite loci}
Samples: 237
Populations: 17

Now that we have nancycats loaded in, we can use standard Julia accessor conventions to view the elements within our PopData. DataFrames.jl uses the convention dataframe.colname to directly access the columns we want.

The metadata (data about the data)

Some critical information about the data is front-loaded into a PopData object so that these values don't have to be recomputed in every calculation. To view this information, use metadata().

julia> metadata(ncats)
ploidy: 2
loci: 9
samples: 237
populations: 17
biallelic: false

Included in metadata are two DataFrames, one for sample information, and another for locus information.

sampleinfo

To view the sample information, you can use sampleinfo().

julia> sampleinfo(ncats)
237×3 DataFrame
 Row │ name      population  ploidy
     │ String7…  String      Int8
─────┼──────────────────────────────
   1 │ N217      1                2
   2 │ N218      1                2
   3 │ N219      1                2
   4 │ N220      1                2
   5 │ N221      1                2
   6 │ N222      1                2
  ⋮  │    ⋮          ⋮         ⋮
 232 │ N197      14               2
 233 │ N198      14               2
 234 │ N199      14               2
 235 │ N200      14               2
 236 │ N201      14               2
 237 │ N206      14               2
                    222 rows omitted

Using the standard DataFrames getindex methods, we can access these columns like so:

julia> sinfo = sampleinfo(ncats) ;
julia> sinfo.name
237-element Array{String,1}:
 "N1"
 "N2"
 "N3"
 "N4"
 "N5"
 "N6"
 "N7"
 "N8"
 ⋮
 "N230"
 "N231"
 "N232"
 "N233"
 "N234"
 "N235"
 "N236"
 "N237"
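The dataframe.colname form is one of several equivalent DataFrames.jl ways to pull out a column. A minimal sketch on an invented three-row table (the toy values are assumptions, not nancycats data) shows how they relate:

```julia
using DataFrames

# Hypothetical miniature of a sample table; all values are invented.
toy = DataFrame(name = ["s1", "s2", "s3"],
                population = ["A", "A", "B"],
                ploidy = Int8[2, 2, 2])

names_direct = toy.name        # the stored column itself, no copy
names_nocopy = toy[!, :name]   # same object as toy.name
names_copy   = toy[:, :name]   # an independent copy of the column

println(names_direct)
println(names_nocopy === names_direct)  # true: identical object
println(names_copy === names_direct)    # false: a separate vector
println(names_copy == names_direct)     # true: same contents
```

Because dataframe.colname hands back the stored column rather than a copy, changing that vector changes the table it came from. For inspection only, this is harmless; for anything else, working on a copy is the safer habit.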

The genotype table

genodata

You can view the genotype information with genodata().

julia> genodata(ncats)
2133×4 DataFrame
  Row │ name    population  locus   genotype
      │ String  String      String  Tuple…?
──────┼────────────────────────────────────────
    1 │ N215    1           fca8    missing
    2 │ N216    1           fca8    missing
    3 │ N217    1           fca8    (135, 143)
    4 │ N218    1           fca8    (133, 135)
    5 │ N219    1           fca8    (133, 135)
    6 │ N220    1           fca8    (135, 143)
  ⋮   │   ⋮         ⋮         ⋮         ⋮
 2128 │ N295    17          fca37   (208, 208)
 2129 │ N296    17          fca37   (208, 220)
 2130 │ N297    17          fca37   (208, 208)
 2131 │ N281    17          fca37   (208, 208)
 2132 │ N289    17          fca37   (208, 208)
 2133 │ N290    17          fca37   (208, 208)
                             2121 rows omitted

Because the genotype data is in long format (aka "tidy"), accessing genotypes in a meaningful way is fairly straightforward if you have any experience with dataframe manipulation. For a deeper look into indexing PopData, read Advanced PopData Indexing.
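To illustrate what long format makes easy, here is a minimal sketch using plain DataFrames.jl on an invented table with the same four columns. The sample names, loci, genotypes, and the n_genotyped column name are all made up for the example, and the table is an ordinary DataFrame rather than PopData:

```julia
using DataFrames

# Invented long-format rows: one row per sample per locus.
toy = DataFrame(
    name = ["s1", "s2", "s1", "s2"],
    population = ["A", "B", "A", "B"],
    locus = ["loc1", "loc1", "loc2", "loc2"],
    genotype = Union{Missing, Tuple{Int, Int}}[(135, 143), missing, (208, 208), (208, 220)],
)

# Keep only the rows belonging to one locus.
loc1 = filter(:locus => ==("loc1"), toy)

# Count the non-missing genotypes within each locus.
counts = combine(groupby(toy, :locus),
                 :genotype => (g -> count(!ismissing, g)) => :n_genotyped)

println(loc1)
println(counts)
```

Since every row is one sample at one locus, row filtering and group-wise summaries cover most everyday questions without reshaping the data.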

The functions here help you inspect your PopData and pull information from it easily. The examples in the sections below use the gulfsharks data, stored in a variable named sharks.

View specific information

sample names

samplenames(data::PopData)

View individual/sample names in a PopData.

julia> samplenames(sharks)
212-element Array{String,1}:
 "cc_001"
 "cc_002"
 "cc_003"
 "cc_005"
 "cc_007"
 ⋮
 "seg_027"
 "seg_028"
 "seg_029"
 "seg_030"
 "seg_031"

locus names

loci(data::PopData)

Returns a vector of the locus names in a PopData.

julia> loci(sharks)
2213-element Array{String,1}:
 "contig_35208"
 "contig_23109"
 "contig_4493"
 "contig_10742"
 "contig_14898"
 ⋮
 "contig_43517"
 "contig_27356"
 "contig_475"
 "contig_19384"
 "contig_22368"
 "contig_2784"

View genotypes

all genotypes in one locus or sample

genotypes(data::PopData, samplelocus::String)

Returns a vector (view) of genotypes for a locus or a sample, depending on which of the two the function finds in your data. Don't worry too much about the elaborate type signature of the returned vector.

julia> genotypes(sharks, "contig_2784")
212-element view(::PooledArrays.PooledVector{Union{Missing, Tuple{Int8, Int8}}, UInt8, Vector{UInt8}}, [468097, 468098, 468099, 468100, 468101, 468102, 468103, 468104, 468105, 468106 … 468299, 468300, 468301, 468302, 468303, 468304, 468305, 468306, 468307, 468308]) with eltype Union{Missing, Tuple{Int8, Int8}}:
 (1, 1)
 (1, 1)
 (1, 1)
 ⋮
 (1, 1)
 (1, 1)
 (1, 1)


julia> genotypes(sharks, "cc_001")
2209-element view(::PooledArrays.PooledVector{Union{Missing, Tuple{Int8, Int8}}, UInt8, Vector{UInt8}}, [1, 213, 425, 637, 849, 1061, 1273, 1485, 1697, 1909 … 466189, 466401, 466613, 466825, 467037, 467249, 467461, 467673, 467885, 468097]) with eltype Union{Missing, Tuple{Int8, Int8}}:
 (1, 2)
 (1, 1)
 (1, 2)
 ⋮
 (2, 2)
 (1, 1)
 (1, 1)

one sample, one locus

genotype(data::PopData, sample::String => locus::String)

Returns the genotype of the sample at the locus. Uses Pair notation.

julia> genotype(sharks, "cc_001" => "contig_2784")
(1, 1)
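For readers new to Pair notation, the => operator builds an ordinary Julia Pair, and a function can behave differently depending on the types on each side of it. The sketch below uses only Base Julia; query_kind is a hypothetical helper written for this illustration and is not part of PopGen.jl, nor a claim about how PopGen.jl is implemented:

```julia
# "a" => "b" builds a Pair; first and last pull out its two sides.
query = "cc_001" => "contig_2784"
println(typeof(query))  # Pair{String, String}
println(first(query))   # cc_001
println(last(query))    # contig_2784

# Hypothetical helper (not from PopGen.jl): one method per combination
# of single name vs. vector of names on each side of the Pair.
query_kind(q::Pair{String, String}) = "one sample, one locus"
query_kind(q::Pair{Vector{String}, String}) = "many samples, one locus"
query_kind(q::Pair{String, Vector{String}}) = "one sample, many loci"
query_kind(q::Pair{Vector{String}, Vector{String}}) = "many samples, many loci"

println(query_kind(query))
println(query_kind(["cc_001", "cc_002"] => "contig_2784"))
println(query_kind("cc_001" => ["contig_2784", "contig_475"]))
```

This is why the same sample => locus shape can be reused whether either side is a single name or a vector of names.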

many samples, one locus

genotypes(data::PopData, samples::Vector{String} => locus::String)

Returns a subdataframe of the genotypes of the samples at the locus. Uses Pair notation.

julia> genotypes(sharks, samplenames(sharks)[1:3] => "contig_2784")
3×4 SubDataFrame
 Row │ name     population     locus        genotype
     │ String7  String         String       Tuple…?
─────┼────────────────────────────────────────────────
   1 │ cc_001   CapeCanaveral  contig_2784  (1, 1)
   2 │ cc_002   CapeCanaveral  contig_2784  (1, 1)
   3 │ cc_003   CapeCanaveral  contig_2784  (1, 1)

one sample, many loci

genotypes(data::PopData, sample::String => loci::Vector{String})

Returns a subdataframe of the genotypes of the sample at the loci. Uses Pair notation.

julia> genotypes(sharks, "cc_001" => loci(sharks)[1:3])
3×4 SubDataFrame
 Row │ name     population     locus         genotype
     │ String7  String         String        Tuple…?
─────┼─────────────────────────────────────────────────
   1 │ cc_001   CapeCanaveral  contig_35208  (1, 2)
   2 │ cc_001   CapeCanaveral  contig_23109  (1, 1)
   3 │ cc_001   CapeCanaveral  contig_4493   (1, 2)

many samples, many loci

genotypes(data::PopData, samples::Vector{String} => loci::Vector{String})

Returns a subdataframe of the genotypes of the samples at the loci. Uses Pair notation.

julia> genotypes(sharks, samplenames(sharks)[1:3] => loci(sharks)[1:3])
9×4 SubDataFrame
 Row │ name     population     locus         genotype
     │ String7  String         String        Tuple…?
─────┼─────────────────────────────────────────────────
   1 │ cc_001   CapeCanaveral  contig_35208  (1, 2)
   2 │ cc_002   CapeCanaveral  contig_35208  (1, 2)
   3 │ cc_003   CapeCanaveral  contig_35208  (1, 1)
   4 │ cc_001   CapeCanaveral  contig_23109  (1, 1)
   5 │ cc_002   CapeCanaveral  contig_23109  (1, 2)
   6 │ cc_003   CapeCanaveral  contig_23109  missing
   7 │ cc_001   CapeCanaveral  contig_4493   (1, 2)
   8 │ cc_002   CapeCanaveral  contig_4493   (1, 1)
   9 │ cc_003   CapeCanaveral  contig_4493   (1, 1)