Clustering
A study that lacks prior population information usually begins by estimating the number of clusters present in the data. This can be done with a number of methods, such as K-means, K-medoids, and Fuzzy C-means. PopGen.jl extends several of the clustering algorithms available in Clustering.jl to work directly with PopData objects.
The cluster wrapper
All of the clustering methods implemented in PopGen.jl (described below) can be accessed through a single function, cluster.
cluster(::PopData, method::Function; matrixtype::Symbol, kwargs...)
A convenience wrapper to perform clustering on a PopData object using a designated method. The
chosen method must also be supplied with the appropriate keyword arguments for that method. For more information on
a specific method, read more below or see its docstring in a Julia session with ?methodname (e.g., ?kmedoids). The keyword argument matrixtype sets which input matrix is used for clustering: either :pca (default, principal components of the scaled allele frequencies) or :freq (scaled allele frequencies).
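The pattern described above, one entry point that takes the method as a function and forwards keyword arguments to it, can be sketched in plain Julia. This is only an illustration of the pattern, not PopGen.jl's implementation: ToyData, toy_cluster, split_at_mean, and split_at_threshold are hypothetical names invented for the example.

```julia
# Hypothetical stand-in for a data container; this is not the PopData type.
struct ToyData
    values::Vector{Float64}
end

# Two toy "methods". Each accepts the shared `matrixtype` keyword
# plus whatever keywords are specific to that method.
function split_at_mean(d::ToyData; matrixtype::Symbol = :pca)
    m = sum(d.values) / length(d.values)
    return [v > m ? 2 : 1 for v in d.values]
end

function split_at_threshold(d::ToyData; threshold::Float64, matrixtype::Symbol = :pca)
    return [v > threshold ? 2 : 1 for v in d.values]
end

# The wrapper receives the method as a function and forwards every keyword to it.
function toy_cluster(d::ToyData, method::Function; matrixtype::Symbol = :pca, kwargs...)
    return method(d; matrixtype = matrixtype, kwargs...)
end

d = ToyData([1.0, 2.0, 9.0, 10.0])
a = toy_cluster(d, split_at_mean)                        # method defaults only
b = toy_cluster(d, split_at_threshold, threshold = 5.0)  # forwards `threshold`
```

A keyword that the chosen method does not accept raises a MethodError, which is why each method must be given only its own keyword arguments.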
Clustering Methods
| Method Name | Method Type | Keyword Arguments |
|---|---|---|
| kmeans | K-means++ | k, iterations |
| kmedoids | K-medoids | k, iterations |
| hclust | Hierarchical Clustering | linkage, branchorder, distance |
| fuzzycmeans | Fuzzy C-means | c, fuzziness, iterations |
| dbscan | Density-based Spatial Clustering of Applications with Noise (DBSCAN) | radius, minpoints, distance |
Examples
julia> cats = @nancycats;
julia> cluster(cats, kmeans, k = 2, iterations = 100);
julia> cluster(cats, dbscan, radius = 0.5, matrixtype = :freq)
The results of these clustering methods can then be validated using any of the methods available in Clustering.jl.
Since the clustering methods are exported, you can technically skip the cluster wrapper and use any of the methods directly (e.g. kmeans(PopData, k = 5)), although cluster() is the preferred method.
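As one example of such validation, the sketch below compares mean silhouette widths across several values of k. It calls Clustering.jl directly on a synthetic matrix rather than on a PopData object, so the matrix X and its two-group layout are assumptions made for illustration. Note that Clustering.jl's own kmeans uses the keyword maxiter, unlike the iterations keyword documented here.

```julia
using Clustering, Distances, Random, Statistics

Random.seed!(42)
# Synthetic stand-in for an input matrix: 2 features × 60 samples,
# drawn as two well-separated groups (columns are samples).
X = hcat(randn(2, 30), randn(2, 30) .+ 8.0)

# Pairwise Euclidean distances between samples (columns).
D = pairwise(Euclidean(), X, dims = 2)

# Mean silhouette width for each candidate number of clusters.
ks = 2:4
mean_sil = [mean(silhouettes(kmeans(X, k; maxiter = 100), D)) for k in ks]

# The k with the highest mean silhouette is the best supported by this metric.
best_k = ks[argmax(mean_sil)]
```

Silhouette width is only one of the validation measures Clustering.jl offers, and no single measure is decisive for real data.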
Clustering Methods
K-means
kmeans(data::PopData; k::Int64, iterations::Int64 = 100, matrixtype::Symbol = :pca)
Perform K-means clustering (using K-means++ from Arthur & Vassilvitskii 2007) on a PopData object. Returns a KmeansResult
object. Use the keyword argument iterations (default: 100) to set the maximum number of iterations allowed to
achieve convergence. Clustering is performed on the matrix selected by matrixtype: either the principal components of the scaled allele frequencies (:pca)
or the scaled allele frequencies themselves (:freq). In both cases, missing values are replaced by the global mean allele frequency.
Keyword Arguments
- k: the number of desired clusters, given as an Integer
- iterations::Int64: the maximum number of iterations to attempt to reach convergence (default: 100)
- matrixtype: type of input matrix to compute (default: :pca)
  - :pca: matrix of Principal Components of :freq
  - :freq: matrix of scaled allele frequencies
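The two matrix types can be illustrated with a small sketch using only the standard library. It is not PopGen.jl's code and rests on stated assumptions: the freqs table is made up, "global mean" is taken to mean the per-allele mean across all samples, "scaled" is taken to mean centering and dividing by the standard deviation, and principal components are computed with a singular value decomposition.

```julia
using LinearAlgebra, Statistics

# Hypothetical allele-frequency table: rows are samples, columns are alleles.
freqs = Union{Missing, Float64}[
    0.0  0.5      1.0
    0.5  missing  0.5
    1.0  0.5      0.0
    0.5  1.0      missing
]

# Replace each missing value with the mean of the observed values in its column.
function impute_colmeans(A)
    out = Matrix{Float64}(undef, size(A))
    for j in axes(A, 2)
        col = A[:, j]
        μ = mean(skipmissing(col))
        out[:, j] = coalesce.(col, μ)
    end
    return out
end

filled = impute_colmeans(freqs)

# Center and scale each column (the assumed meaning of "scaled", akin to :freq).
scaled = (filled .- mean(filled, dims = 1)) ./ std(filled, dims = 1)

# Principal component scores via SVD (akin to :pca); rows are samples,
# columns are components ordered by decreasing singular value.
F = svd(scaled)
scores = F.U * Diagonal(F.S)
```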
Example
julia> cats = @nancycats ;
julia> km = kmeans(cats, k = 2)
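The returned KmeansResult can be inspected with the accessors from Clustering.jl. The sketch below shows this on a synthetic matrix passed straight to Clustering.jl's kmeans, since the result type is the same. The matrix X is an assumption made for illustration.

```julia
using Clustering, Random

Random.seed!(1)
# Synthetic matrix, 2 features × 40 samples (columns are samples),
# drawn as two well-separated groups.
X = hcat(randn(2, 20), randn(2, 20) .+ 10.0)

res = kmeans(X, 2; maxiter = 100)

a = assignments(res)   # cluster index for each sample
n = counts(res)        # number of samples in each cluster
c = res.centers        # one column per cluster center
ok = res.converged     # whether convergence was reached within maxiter
```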
K-medoids
kmedoids(data::PopData; k::Int64, iterations::Int64 = 100, distance::PreMetric = euclidean, matrixtype::Symbol = :pca)
Perform K-medoids (Kaufman & Rousseeuw, 1990) clustering on a PopData object. Returns a KmedoidsResult
object. Use the keyword argument iterations (default: 100) to set the maximum number of iterations allowed to
achieve convergence. Clustering is performed on the matrix selected by matrixtype: either the principal components of the scaled allele frequencies (:pca)
or the scaled allele frequencies themselves (:freq). In both cases, missing values are replaced by the global mean allele frequency.
Keyword Arguments
- k: the number of desired clusters, given as an Integer
- iterations::Int64: the maximum number of iterations to attempt to reach convergence (default: 100)
- distance: type of distance matrix to calculate on matrixtype (default: euclidean)
  - see Distances.jl for a list of options (e.g. sqeuclidean, etc.)
- matrixtype: type of input matrix to compute (default: :pca)
  - :pca: matrix of Principal Components of :freq
  - :freq: matrix of scaled allele frequencies
Example
julia> cats = @nancycats ;
julia> km = kmedoids(cats, k = 2, distance = sqeuclidean)
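K-medoids operates on pairwise distances rather than on the data matrix itself, which is why a distance option is available. A minimal sketch with Clustering.jl and Distances.jl on a synthetic matrix follows. The matrix X is an assumption, and the distance matrix is built explicitly here, whereas the PopData method is described as doing this internally.

```julia
using Clustering, Distances, Random

Random.seed!(2)
# Synthetic matrix, 2 features × 30 samples (columns are samples).
X = hcat(randn(2, 15), randn(2, 15) .+ 10.0)

# Pairwise squared Euclidean distances between samples.
D = pairwise(SqEuclidean(), X, dims = 2)

res = kmedoids(D, 2; maxiter = 100)

m = res.medoids        # indices of the samples chosen as cluster medoids
a = assignments(res)   # cluster index for each sample
```

Unlike K-means centers, medoids are always actual samples, so res.medoids can be used to look up the representative individuals.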
Hierarchical Clustering
hclust(data::PopData; linkage::Symbol = :single, branchorder::Symbol = :r, distance::PreMetric = euclidean, matrixtype::Symbol = :pca)
Perform hierarchical clustering (Bar-Joseph et al., 2001) on a PopData object. Returns an Hclust object, which contains many metrics but does not include cluster assignments. Use
cutree(::PopData, ::Hclust; krange...) to compute the sample assignments for a range of k clusters. Clustering is performed on the matrix selected by matrixtype: either the principal components of the scaled allele frequencies (:pca)
or the scaled allele frequencies themselves (:freq). In both cases, missing values are replaced by the global mean allele frequency.
Keyword Arguments
- linkage: defines how the distances between the data points are aggregated into the distances between the clusters
  - :single: use the minimum distance between any of the cluster members (default)
  - :average: use the mean distance between any of the cluster members
  - :complete: use the maximum distance between any of the members
  - :ward: the distance is the increase of the average squared distance of a point to its cluster centroid after merging the two clusters
  - :ward_presquared: same as :ward, but assumes that the distances in the distance matrix are already squared
- branchorder: algorithm to order leaves and branches (default: :r)
  - :r: ordering based on the node heights and the original elements order (compatible with R's hclust)
  - :optimal: branches are ordered to reduce the distance between neighboring leaves from separate branches using the "fast optimal leaf ordering" algorithm
- distance: type of distance matrix to calculate on matrixtype (default: euclidean)
  - see Distances.jl for a list of options (e.g. sqeuclidean, etc.)
- matrixtype: type of input matrix (default: :pca)
  - :pca: matrix of Principal Components of :freq
  - :freq: matrix of allele frequencies
cutree
cutree(::PopData, hcres::Hclust; krange::UnitRange{Int64}, height::Union{Int64, Nothing} = nothing)
cutree(::PopData, hcres::Hclust; krange::Vector{Int64}, height::Union{Int64, Nothing} = nothing)
An extension of the Clustering.cutree method (from Clustering.jl) that performs cluster assignments for each k in krange
on the Hclust output from hclust(). Returns a DataFrame with a column of sample names and one column of assignments
per k in krange. The PopData object is used only for retrieving the sample names.
Keyword Arguments
- krange: the number of desired clusters, given as a vector (ex. [2,4,5]) or range (2:5)
- height: the height at which the tree is cut, given as an Integer (optional)
Example
julia> cats = @nancycats ;
julia> hca = hclust(cats, branchorder = :optimal) ;
julia> cutree(cats, hca, krange = 2:5)
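The idea of cutting one tree at several values of k can be sketched directly with Clustering.jl's hclust and cutree. This is an illustration on synthetic data, not the PopData method: the matrix X is made up, and the result is collected into a plain matrix rather than a DataFrame with sample names.

```julia
using Clustering, Distances, Random

Random.seed!(3)
# Synthetic matrix, 2 features × 30 samples, drawn as three separated groups.
X = hcat(randn(2, 10), randn(2, 10) .+ 10.0, randn(2, 10) .- 10.0)
D = pairwise(Euclidean(), X, dims = 2)

# Build the tree once from the distance matrix.
hc = hclust(D, linkage = :average, branchorder = :r)

# Cut the same tree at each k; each column holds the assignments for one k.
krange = 2:4
assign = reduce(hcat, [cutree(hc, k = k) for k in krange])
```

Because the tree is built once and only the cut changes, trying additional values of k is cheap compared with rerunning a partitioning method.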
Fuzzy C-means
fuzzycmeans(data::PopData; c::Int64, fuzziness::Int64 = 2, iterations::Int64 = 100, matrixtype::Symbol = :pca)
Perform Fuzzy C-means clustering (Bezdek et al. 1984) on a PopData object. Returns a FuzzyCMeansResult object, which contains the assignment weights in the .weights field. Clustering is performed on the matrix selected by matrixtype: either the principal components of the scaled allele frequencies (:pca)
or the scaled allele frequencies themselves (:freq). In both cases, missing values are replaced by the global mean allele frequency.
Keyword Arguments
- c: the number of desired clusters, given as an Integer
- fuzziness::Integer: clusters' fuzziness, must be >1 (default: 2)
  - a fuzziness of 2 is common for systems with unknown numbers of clusters
- iterations::Int64: the maximum number of iterations to attempt to reach convergence (default: 100)
- matrixtype: type of input matrix to compute (default: :pca)
  - :pca: matrix of Principal Components of :freq
  - :freq: matrix of scaled allele frequencies
Example
julia> cats = @nancycats ;
julia> fuzzycats = fuzzycmeans(cats, c = 5) ;
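The .weights field holds one row per sample and one column per cluster, with each row summing to 1. The sketch below shows how those weights can be reduced to a single most likely cluster per sample. It calls Clustering.jl's fuzzy_cmeans on a synthetic matrix, so X and the variable names are assumptions made for illustration.

```julia
using Clustering, Random

Random.seed!(4)
# Synthetic matrix, 2 features × 40 samples (columns are samples).
X = hcat(randn(2, 20), randn(2, 20) .+ 10.0)

# Arguments: data, number of clusters, fuzziness (must be greater than 1).
res = fuzzy_cmeans(X, 2, 2.0; maxiter = 100)

W = res.weights                              # samples × clusters membership weights
hard = [argmax(row) for row in eachrow(W)]   # cluster with the largest weight per sample
```

Collapsing to hard assignments discards the uncertainty that fuzzy clustering provides, so samples with nearly equal weights are worth inspecting separately.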
DBSCAN
dbscan(::PopData; radius::Float64, minpoints::Int64 = 2, distance::PreMetric = euclidean, matrixtype::Symbol = :pca)
Perform Density-based Spatial Clustering of Applications with Noise (DBSCAN: Ester et al. 1996)
on a PopData object. Returns a DbscanResult object, which contains the assignments in the
.assignments field. Clustering is performed on the matrix selected by matrixtype: either the principal components of the scaled allele frequencies (:pca)
or the scaled allele frequencies themselves (:freq). In both cases, missing values are replaced by the global mean allele frequency.
Keyword Arguments
- radius::Float64: the radius of a point neighborhood
- minpoints::Int: the minimum number of neighbors required for a core point (default: 2)
- distance: type of distance matrix to calculate on matrixtype (default: euclidean)
  - see Distances.jl for a list of options (e.g. sqeuclidean, etc.)
- matrixtype: type of input matrix (default: :pca)
  - :pca: matrix of Principal Components
  - :freq: matrix of allele frequencies
Example
julia> cats = @nancycats ;
julia> dbcats = dbscan(cats, radius = 0.5) ;
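DBSCAN differs from the other methods in that some samples may be left unassigned as noise. The sketch below shows how to separate noise from clustered samples. It assumes Clustering.jl version 0.15 or newer, where dbscan called on a points matrix returns a DbscanResult and unassigned samples have an assignment of 0. Older versions use a different interface. The matrix X and the parameter values are made up for illustration.

```julia
using Clustering, Random

Random.seed!(5)
# Two dense groups plus one distant point that should not join either group.
X = hcat(randn(2, 25) .* 0.3, randn(2, 25) .* 0.3 .+ 10.0, [50.0, 50.0])

# Assumes Clustering.jl >= 0.15: points matrix, radius, then keyword options.
res = dbscan(X, 1.0; min_neighbors = 3, min_cluster_size = 3)

noise = findall(==(0), res.assignments)   # samples not assigned to any cluster
nclusters = length(res.counts)            # number of clusters found
```

The radius and neighbor thresholds strongly affect how many samples end up as noise, so they usually need tuning for each dataset.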
Acknowledgments
Much of the heavy lifting within these clustering methods is actually the work of the amazing authors and contributors of Clustering.jl, along with the Principal Component Analysis available from MultivariateStats.jl.