Tutorial
For this example we will use PBMC (Peripheral Blood Mononuclear Cell) data from the paper Integrated analysis of multimodal single-cell data by Hao et al. You can find the original data here, in MatrixMarker (.mtx) format. For convenience, you can download the samples recompressed as .h5 files. Direct links:
- Cell annotations (.csv.gz)
- Donor P1 (.h5)
- Donor P2 (.h5)
- Donor P3 (.h5)
- Donor P4 (.h5)
- Donor P5 (.h5)
- Donor P6 (.h5)
- Donor P7 (.h5)
- Donor P8 (.h5)
First we load SingleCellProjections and the packages DataFrames and CSV for handling annotations.
julia> using SingleCellProjections, DataFrames, CSV
Loading Data
Then we load samples "P1" and "P2", by specifiying the paths to the files and naming them.
julia> base_path = "/path/to/downloads/";julia> sample_paths = joinpath.(base_path, ["GSE164378_RNA_ADT_3P_P1.h5", "GSE164378_RNA_ADT_3P_P2.h5"]);julia> counts = load_counts(sample_paths; sample_names=["P1","P2"])DataMatrix (33766 variables and 35340 observations) SparseMatrixCSC{Int64, Int32} Variables: id, feature_type, name, genome, read, pattern, sequence Observations: cell_id, sampleName, barcode
Data sets in SingleCellProjections are represented as DataMatrix objects, which are matrices with annotations for var (variables/genes/features) and obs (observations, typically cells). Above, counts is a DataMatrix where the counts are stored in a sparse matrix. You can also see the available annotations for variables and observations. To access the different parts, use:
counts.matrix- For the matrixcounts.var- Variable annotations (DataFrame)counts.obs- Observation annotations (DataFrame)
Cell Annotations
Here we compute a new obs annotation where we count the fraction of reads coming from Mitochondrial genes for each cell:
julia> var_counts_fraction!(counts, "name"=>startswith("MT-"), "feature_type"=>isequal("Gene Expression"), "fraction_mt")DataMatrix (33766 variables and 35340 observations) SparseMatrixCSC{Int64, Int32} Variables: id, feature_type, name, genome, read, pattern, sequence Observations: cell_id, sampleName, barcode, fraction_mt Models: VarCountsFractionModel(subset_size=13, total_size=33538, col="fraction_mt")
Note that the new annotation fraction_mt is present in the output.
We will also load some more cell annotations from the provided file.
julia> cell_annotations = CSV.read(joinpath(base_path, "GSE164378_RNA_ADT_3P.csv.gz"), DataFrame);To merge, we use the DataFrames function leftjoin!, since it takes care of matching the cells in counts to the cells in cell_annotations based on the :barcode column.
julia> leftjoin!(counts.obs, cell_annotations; on=:barcode);
Let's look at some annotations for the first few cells:
julia> counts.obs[1:5,["cell_id","sampleName","barcode","fraction_mt","celltype.l1"]]5×5 DataFrame Row │ cell_id sampleName barcode fraction_mt celltype.l1 │ String String String Float64 String7? ─────┼─────────────────────────────────────────────────────────────────────────────────── 1 │ P1_L1_AAACCCAAGACATACA P1 L1_AAACCCAAGACATACA 0.0330832 CD4 T 2 │ P1_L1_AAACCCACATCGGTTA P1 L1_AAACCCACATCGGTTA 0.0649309 Mono 3 │ P1_L1_AAACCCAGTGGAACAC P1 L1_AAACCCAGTGGAACAC 0.0541372 NK 4 │ P1_L1_AAACCCATCTGCGGAC P1 L1_AAACCCATCTGCGGAC 0.0712244 CD4 T 5 │ P1_L1_AAACGAAAGTTACTCG P1 L1_AAACGAAAGTTACTCG 0.0625228 CD4 T
Transformation
The raw counts data is not suitable for analyses like PCA, since the data is far from normally distributed. A common strategy to handle this is to transform the data. Here we will use SCTransform (see also original sctransform implementation in R).
julia> transformed = sctransform(counts)DataMatrix (20239 variables and 35340 observations) A+B₁B₂B₃ Variables: id, feature_type, name, genome, read, pattern, sequence, logGeneMean, outlier, ... Observations: cell_id, sampleName, barcode, fraction_mt, nCount_ADT, nFeature_ADT, nCount_RNA, nFeature_RNA, orig.ident, ... Models: SCTransformModel(nvar=20239, clip=34.32), VarCountsFraction
From the output, we see that the number of variables have been reduced, since the default sctransform options remove variables present in very few cells and only keeps variables with feature_type set to "Gene Expression".
The matrix is now shown as A+B₁B₂B₃. This is normally not very important from the user's point of view, but it is critical for explaining how SingleCellProjections can be fast and not use too much memory. Instead of storing the SCTransformed matrix as a huge dense matrix, it is stored in memory as a MatrixExpression, in this case a sparse matrix A plus a product of three smaller matrices B₁,B₂ and B₃.
Normalization
After transformation we always want to normalize the data. At the very least, data should be centered for PCA to work properly, this can be achieved by just running normalize_matrix with the default parameters. Here, we also want to regress out "fraction_mt". You can add more obs annotations (categorical and/or numerical) to regress out if you need.
julia> normalized = normalize_matrix(transformed, "fraction_mt")DataMatrix (20239 variables and 35340 observations) A+B₁B₂B₃+(-β)X' Variables: id, feature_type, name, genome, read, pattern, sequence, logGeneMean, outlier, ... Observations: cell_id, sampleName, barcode, fraction_mt, nCount_ADT, nFeature_ADT, nCount_RNA, nFeature_RNA, orig.ident, ... Models: NormalizationModel(rank=2, ~1+num(fraction_mt)), SCTransform, VarCountsFraction
Now the matrix is shown as A+B₁B₂B₃+(-β)X', i.e. another low-rank term was added to handle the normalization/regression. The first two terms are reused to make sure memory is not wasted.
Filtering
It is possible to filter variables and observations. Here we keep all cells that are not labeled as "other".
julia> filtered = filter_obs("celltype.l1"=>!isequal("other"), normalized)DataMatrix (20239 variables and 34639 observations) Aᵣ+B₁B₂B₃ᵣ+(-β)Xₗ' Variables: id, feature_type, name, genome, read, pattern, sequence, logGeneMean, outlier, ... Observations: cell_id, sampleName, barcode, fraction_mt, nCount_ADT, nFeature_ADT, nCount_RNA, nFeature_RNA, orig.ident, ... Models: FilterModel(:, "celltype.l1"=>!Fix2{typeof(isequal), String}(isequal, "other")), Normalization, SCTransform, VarCountsFraction
Principal Component Analysis (PCA)
Now we are ready to perform Principal Component Analysis (PCA). This is computed by the Singular Value Decomposition (SVD), so we should call the svd function. The number of dimensions is specified using the nsv parameter.
julia> reduced = svd(filtered; nsv=20)DataMatrix (20239 variables and 34639 observations) SVD (20 dimensions) Variables: id, feature_type, name, genome, read, pattern, sequence, logGeneMean, outlier, ... Observations: cell_id, sampleName, barcode, fraction_mt, nCount_ADT, nFeature_ADT, nCount_RNA, nFeature_RNA, orig.ident, ... Models: SVDModel(nsv=20), Filter, Normalization, SCTransform, VarCountsFraction
The matrix is now stored as an SVD object, which includes low dimensional representations of the observations and variables. To retrieve the low dimensional coordinates, use obs_coordinates and var_coordinates respectively.
Download interactive PCA plot.
Visualization
Expand this to show some example PlotlyJS plotting code.
You can of course use your own favorite plotting library instead. Use obs_coordinates to get the coordinates for each cell, and data.obs to access cell annotations for coloring.
using PlotlyJS
function plot_categorical_3d(data, annotation; marker_size=3)
points = obs_coordinates(data)
traces = GenericTrace[]
for sub in groupby(data.obs, annotation; sort=true)
value = sub[1,annotation]
ind = parentindices(sub)[1]
push!(traces, scatter3d(;x=points[1,ind], y=points[2,ind], z=points[3,ind], mode="markers", marker_size, name=value))
end
plot(traces, Layout(;legend=attr(itemsizing="constant")))
endUse it like this:
julia> plot_categorical_3d(reduced, "celltype.l1")For visualization purposes, it is often useful to further reduce the dimension after running PCA. (In contrast, analyses are generally run on the PCA/normalized/original data, since the methods below necessarily distort the data to force it down to 2 or 3 dimensions.)
Force Layout
Force Layout plots (also known as SPRING Plots) are created like this:
julia> fl = force_layout(reduced; ndim=3, k=100)DataMatrix (3 variables and 34639 observations) Matrix{Float64} Variables: id Observations: cell_id, sampleName, barcode, fraction_mt, nCount_ADT, nFeature_ADT, nCount_RNA, nFeature_RNA, orig.ident, ... Models: NearestNeighborModel(base="force_layout", k=10), SVD, Filter, Normalization, SCTransform, ...
Download interactive Force Layout plot.
UMAP
SingleCellProjections can be used together with UMAP.jl:
julia> using UMAPjulia> umapped = umap(reduced, 3)DataMatrix (3 variables and 34639 observations) Matrix{Float64} Variables: id Observations: cell_id, sampleName, barcode, fraction_mt, nCount_ADT, nFeature_ADT, nCount_RNA, nFeature_RNA, orig.ident, ... Models: UMAPModel(n_components=3), SVD, Filter, Normalization, SCTransform, ...
Download interactive UMAP plot.
t-SNE
Similarly, t-SNE plots are supported using TSne.jl. In this example, we just run it one every 10ᵗʰ cell, because t-SNE doesn't scale very well with the number of cells:
julia> using TSnejulia> t = tsne(reduced[:,1:10:end], 3)DataMatrix (3 variables and 3464 observations) Matrix{Float64} Variables: id Observations: cell_id, sampleName, barcode, fraction_mt, nCount_ADT, nFeature_ADT, nCount_RNA, nFeature_RNA, orig.ident, ... Models: NearestNeighborModel(base="tsne", k=10), Filter, SVD, Filter, Normalization, ...
Download interactive t-SNE plot.
Other
It is of course possible to use your own favorite dimension reduction method/package. The natural input for most cases are the coordinates after dimension reduction by PCA (obs_coordinates(reduced)).
Projections
SingleCellProjections is built to make it very easy to project one dataset onto another.
Let's load count data for two more samples:
julia> sample_paths_proj = joinpath.(base_path, ["GSE164378_RNA_ADT_3P_P5.h5", "GSE164378_RNA_ADT_3P_P6.h5"]);julia> counts_proj = load_counts(sample_paths_proj; sample_names=["P5","P6"]);julia> leftjoin!(counts_proj.obs, cell_annotations; on=:barcode);julia> counts_projDataMatrix (33766 variables and 42553 observations) SparseMatrixCSC{Int64, Int32} Variables: id, feature_type, name, genome, read, pattern, sequence Observations: cell_id, sampleName, barcode, nCount_ADT, nFeature_ADT, nCount_RNA, nFeature_RNA, orig.ident, lane, ...
And project them onto the Force Layout we created above:
julia> fl_proj = project(counts_proj, fl)[ Info: Projecting onto VarCountsFractionModel(subset_size=13, total_size=33538, col="fraction_mt") [ Info: Projecting onto SCTransformModel(nvar=20239, clip=34.32) [ Info: - Removed 13527 variables that were not found in Model [ Info: Projecting onto NormalizationModel(rank=2, ~1+num(fraction_mt)) [ Info: Projecting onto FilterModel(:, "celltype.l1"=>!Fix2{typeof(isequal), String}(isequal, "other")) [ Info: Projecting onto SVDModel(nsv=20) [ Info: Projecting onto NearestNeighborModel(base="force_layout", k=10) DataMatrix (3 variables and 41095 observations) Matrix{Float64} Variables: id Observations: cell_id, sampleName, barcode, nCount_ADT, nFeature_ADT, nCount_RNA, nFeature_RNA, orig.ident, lane, ... Models: NearestNeighborModel(base="force_layout", k=10), SVD, Filter, Normalization, SCTransform, ...
The result looks similar to the force layout plot above, since the donors "P5" and "P6" are similar to donors "P1" and "P2".
Download interactive Force Layout projection plot.
Under the hood, SingleCellProjections recorded a ProjectionModel for every step of the analysis leading up to the Force Layout. Let's take a look:
julia> fl.models6-element Vector{ProjectionModel}: VarCountsFractionModel(subset_size=13, total_size=33538, col="fraction_mt") SCTransformModel(nvar=20239, clip=34.32) NormalizationModel(rank=2, ~1+num(fraction_mt)) FilterModel(:, "celltype.l1"=>!Fix2{typeof(isequal), String}(isequal, "other")) SVDModel(nsv=20) NearestNeighborModel(base="force_layout", k=10)
When projecting, these models are applied one by one (C.f. output from project above), ensuring that the projected data is processed correctly. In most cases, projecting is not the same as running the same analysis independently, since information about the data set is recorded in the model.