Tutorial
For this example we will use PBMC (Peripheral Blood Mononuclear Cell) data from the paper Integrated analysis of multimodal single-cell data by Hao et al. You can find the original data here, in MatrixMarket (.mtx) format. For convenience, you can download the samples recompressed as .h5 files. Direct links:
- Cell annotations (.csv.gz)
- Donor P1 (.h5)
- Donor P2 (.h5)
- Donor P3 (.h5)
- Donor P4 (.h5)
- Donor P5 (.h5)
- Donor P6 (.h5)
- Donor P7 (.h5)
- Donor P8 (.h5)
First we load SingleCellProjections and the packages DataFrames and CSV for handling annotations.
julia> using SingleCellProjections, DataFrames, CSV
Loading Data
Then we load samples "P1" and "P2" by specifying the paths to the files and naming them.
julia> base_path = "/path/to/downloads/";
julia> sample_paths = joinpath.(base_path, ["GSE164378_RNA_ADT_3P_P1.h5", "GSE164378_RNA_ADT_3P_P2.h5"]);
julia> counts = load_counts(sample_paths; sample_names=["P1","P2"])
DataMatrix (33766 variables and 35340 observations)
  SparseMatrixCSC{Int64, Int32}
  Variables: id, feature_type, name, genome, read, pattern, sequence
  Observations: cell_id, sampleName, barcode
Data sets in SingleCellProjections are represented as DataMatrix objects, which are matrices with annotations for var (variables/genes/features) and obs (observations, typically cells). Above, counts is a DataMatrix where the counts are stored in a sparse matrix. You can also see the available annotations for variables and observations. To access the different parts, use:
- counts.matrix - The matrix itself
- counts.var - Variable annotations (DataFrame)
- counts.obs - Observation annotations (DataFrame)
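For example (a quick sketch using standard Julia and DataFrames functions; output omitted), you can inspect these parts directly:

julia> size(counts.matrix)        # (number of variables, number of observations)

julia> first(counts.var, 3)       # first rows of the variable annotations DataFrame

julia> names(counts.obs)          # available observation annotation columns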
Cell Annotations
Here we compute a new obs annotation with the fraction of reads coming from mitochondrial genes for each cell:
julia> var_counts_fraction!(counts, "name"=>startswith("MT-"), "feature_type"=>isequal("Gene Expression"), "fraction_mt")
DataMatrix (33766 variables and 35340 observations)
  SparseMatrixCSC{Int64, Int32}
  Variables: id, feature_type, name, genome, read, pattern, sequence
  Observations: cell_id, sampleName, barcode, fraction_mt
  Models: VarCountsFractionModel(subset_size=13, total_size=33538, col="fraction_mt")
Note that the new annotation fraction_mt is present in the output.
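As a quick sanity check (a sketch using the standard Statistics library; output omitted), you could summarize the new column:

julia> using Statistics

julia> extrema(counts.obs.fraction_mt), mean(counts.obs.fraction_mt)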
We will also load some more cell annotations from the provided file.
julia> cell_annotations = CSV.read(joinpath(base_path, "GSE164378_RNA_ADT_3P.csv.gz"), DataFrame);
To merge, we use the DataFrames function leftjoin!, since it takes care of matching the cells in counts to the cells in cell_annotations based on the :barcode column.
julia> leftjoin!(counts.obs, cell_annotations; on=:barcode);
Let's look at some annotations for the first few cells:
julia> counts.obs[1:5,["cell_id","sampleName","barcode","fraction_mt","celltype.l1"]]
5×5 DataFrame
 Row │ cell_id                 sampleName  barcode              fraction_mt  celltype.l1
     │ String                  String      String               Float64      String7?
─────┼───────────────────────────────────────────────────────────────────────────────────
   1 │ P1_L1_AAACCCAAGACATACA  P1          L1_AAACCCAAGACATACA    0.0330832  CD4 T
   2 │ P1_L1_AAACCCACATCGGTTA  P1          L1_AAACCCACATCGGTTA    0.0649309  Mono
   3 │ P1_L1_AAACCCAGTGGAACAC  P1          L1_AAACCCAGTGGAACAC    0.0541372  NK
   4 │ P1_L1_AAACCCATCTGCGGAC  P1          L1_AAACCCATCTGCGGAC    0.0712244  CD4 T
   5 │ P1_L1_AAACGAAAGTTACTCG  P1          L1_AAACGAAAGTTACTCG    0.0625228  CD4 T
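To get an overview of the annotated cell types (a sketch using standard DataFrames functions; output omitted), you can tabulate the number of cells per type:

julia> combine(groupby(counts.obs, "celltype.l1"), nrow => :ncells)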
Transformation
The raw count data is not suitable for analyses like PCA, since it is far from normally distributed. A common strategy to handle this is to transform the data. Here we will use SCTransform (see also the original sctransform implementation in R).
julia> transformed = sctransform(counts)
DataMatrix (20239 variables and 35340 observations)
  A+B₁B₂B₃
  Variables: id, feature_type, name, genome, read, pattern, sequence, logGeneMean, outlier, ...
  Observations: cell_id, sampleName, barcode, fraction_mt, nCount_ADT, nFeature_ADT, nCount_RNA, nFeature_RNA, orig.ident, ...
  Models: SCTransformModel(nvar=20239, clip=34.32), VarCountsFraction
From the output, we see that the number of variables has been reduced, since the default sctransform options remove variables present in very few cells and only keep variables with feature_type set to "Gene Expression".
The matrix is now shown as A+B₁B₂B₃. This is normally not very important from the user's point of view, but it is critical for explaining how SingleCellProjections can be fast without using too much memory. Instead of storing the SCTransformed matrix as a huge dense matrix, it is stored in memory as a MatrixExpression, in this case a sparse matrix A plus a product of three smaller matrices B₁, B₂ and B₃.
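To illustrate why this matters (a standalone sketch in plain Julia, not part of the SingleCellProjections API; the sizes and ranks are made up), multiplying A+B₁B₂B₃ by a vector can be done term by term, without ever forming the dense sum:

using SparseArrays, LinearAlgebra

A  = sprand(20_000, 35_000, 1e-3)   # large sparse term
B1 = randn(20_000, 8)               # low-rank factors
B2 = randn(8, 8)
B3 = randn(8, 35_000)
x  = randn(35_000)

# (A + B1*B2*B3) * x, computed without materializing the dense matrix:
y = A * x + B1 * (B2 * (B3 * x))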
Normalization
After transformation we always want to normalize the data. At the very least, data should be centered for PCA to work properly; this can be achieved by just running normalize_matrix with the default parameters. Here, we also want to regress out "fraction_mt". You can add more obs annotations (categorical and/or numerical) to regress out if you need to.
julia> normalized = normalize_matrix(transformed, "fraction_mt")
DataMatrix (20239 variables and 35340 observations)
  A+B₁B₂B₃+(-β)X'
  Variables: id, feature_type, name, genome, read, pattern, sequence, logGeneMean, outlier, ...
  Observations: cell_id, sampleName, barcode, fraction_mt, nCount_ADT, nFeature_ADT, nCount_RNA, nFeature_RNA, orig.ident, ...
  Models: NormalizationModel(rank=2, ~1+num(fraction_mt)), SCTransform, VarCountsFraction
Now the matrix is shown as A+B₁B₂B₃+(-β)X', i.e. another low-rank term was added to handle the normalization/regression. The first two terms are reused to make sure memory is not wasted.
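If you want to regress out additional covariates, pass more annotation names to the same call. For example (an illustrative variation, not part of this tutorial's pipeline; assuming normalize_matrix accepts extra annotation names as described above), you could also regress out the sample of origin:

julia> normalize_matrix(transformed, "fraction_mt", "sampleName")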
Filtering
It is possible to filter variables and observations. Here we keep all cells that are not labeled as "other".
julia> filtered = filter_obs("celltype.l1"=>!isequal("other"), normalized)
DataMatrix (20239 variables and 34639 observations)
  Aᵣ+B₁B₂B₃ᵣ+(-β)Xₗ'
  Variables: id, feature_type, name, genome, read, pattern, sequence, logGeneMean, outlier, ...
  Observations: cell_id, sampleName, barcode, fraction_mt, nCount_ADT, nFeature_ADT, nCount_RNA, nFeature_RNA, orig.ident, ...
  Models: FilterModel(:, "celltype.l1"=>!Fix2{typeof(isequal), String}(isequal, "other")), Normalization, SCTransform, VarCountsFraction
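Variables can be filtered in the same style. A sketch (assuming the analogous filter_var function takes the same predicate-first arguments; shown only as an illustration, not a recommended preprocessing step) that would drop mitochondrial genes:

julia> filter_var("name"=>!startswith("MT-"), filtered)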
Principal Component Analysis (PCA)
Now we are ready to perform Principal Component Analysis (PCA). This is computed using the Singular Value Decomposition (SVD), so we call the svd function. The number of dimensions is specified using the nsv parameter.
julia> reduced = svd(filtered; nsv=20)
DataMatrix (20239 variables and 34639 observations)
  SVD (20 dimensions)
  Variables: id, feature_type, name, genome, read, pattern, sequence, logGeneMean, outlier, ...
  Observations: cell_id, sampleName, barcode, fraction_mt, nCount_ADT, nFeature_ADT, nCount_RNA, nFeature_RNA, orig.ident, ...
  Models: SVDModel(nsv=20), Filter, Normalization, SCTransform, VarCountsFraction
The matrix is now stored as an SVD object, which includes low-dimensional representations of the observations and variables. To retrieve the low-dimensional coordinates, use obs_coordinates and var_coordinates respectively.
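For example (a sketch; output omitted), the observation coordinates come back as a matrix with one row per dimension and one column per cell:

julia> coords = obs_coordinates(reduced);

julia> size(coords)   # expected to be (20, 34639) here: 20 dimensions, one column per cell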
Download interactive PCA plot.
Visualization
Below is some example PlotlyJS plotting code. You can of course use your own favorite plotting library instead. Use obs_coordinates to get the coordinates for each cell, and data.obs to access cell annotations for coloring.
using PlotlyJS

function plot_categorical_3d(data, annotation; marker_size=3)
    points = obs_coordinates(data)      # one column of coordinates per cell
    traces = GenericTrace[]
    for sub in groupby(data.obs, annotation; sort=true)
        value = sub[1,annotation]       # the category label for this group
        ind = parentindices(sub)[1]     # indices of the cells in this group
        push!(traces, scatter3d(;x=points[1,ind], y=points[2,ind], z=points[3,ind], mode="markers", marker_size, name=value))
    end
    plot(traces, Layout(;legend=attr(itemsizing="constant")))
end
Use it like this:
julia> plot_categorical_3d(reduced, "celltype.l1")
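For numerical annotations such as fraction_mt, a single trace with a continuous color scale works well. A minimal sketch (plot_numerical_3d is just an illustrative helper, not part of SingleCellProjections):

function plot_numerical_3d(data, annotation; marker_size=3)
    points = obs_coordinates(data)      # one column of coordinates per cell
    plot(scatter3d(; x=points[1,:], y=points[2,:], z=points[3,:], mode="markers",
                   marker=attr(size=marker_size, color=data.obs[!,annotation],
                               colorscale="Viridis", showscale=true)))
end

plot_numerical_3d(reduced, "fraction_mt")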
For visualization purposes, it is often useful to further reduce the dimension after running PCA. (In contrast, analyses are generally run on the PCA/normalized/original data, since the methods below necessarily distort the data to force it down to 2 or 3 dimensions.)
Force Layout
Force Layout plots (also known as SPRING Plots) are created like this:
julia> fl = force_layout(reduced; ndim=3, k=100)
DataMatrix (3 variables and 34639 observations)
  Matrix{Float64}
  Variables: id
  Observations: cell_id, sampleName, barcode, fraction_mt, nCount_ADT, nFeature_ADT, nCount_RNA, nFeature_RNA, orig.ident, ...
  Models: NearestNeighborModel(base="force_layout", k=10), SVD, Filter, Normalization, SCTransform, ...
Download interactive Force Layout plot.
UMAP
SingleCellProjections can be used together with UMAP.jl:
julia> using UMAP
julia> umapped = umap(reduced, 3)
DataMatrix (3 variables and 34639 observations)
  Matrix{Float64}
  Variables: id
  Observations: cell_id, sampleName, barcode, fraction_mt, nCount_ADT, nFeature_ADT, nCount_RNA, nFeature_RNA, orig.ident, ...
  Models: UMAPModel(n_components=3), SVD, Filter, Normalization, SCTransform, ...
Download interactive UMAP plot.
t-SNE
Similarly, t-SNE plots are supported using TSne.jl. In this example, we only run it on every 10ᵗʰ cell, because t-SNE doesn't scale very well with the number of cells:
julia> using TSne
julia> t = tsne(reduced[:,1:10:end], 3)
DataMatrix (3 variables and 3464 observations)
  Matrix{Float64}
  Variables: id
  Observations: cell_id, sampleName, barcode, fraction_mt, nCount_ADT, nFeature_ADT, nCount_RNA, nFeature_RNA, orig.ident, ...
  Models: NearestNeighborModel(base="tsne", k=10), Filter, SVD, Filter, Normalization, ...
Download interactive t-SNE plot.
Other
It is of course possible to use your own favorite dimension reduction method/package. In most cases, the natural input is the coordinates after dimension reduction by PCA (obs_coordinates(reduced)).
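A minimal sketch of how that hand-off might look (some_dimension_reduction is a hypothetical stand-in for whatever method you choose; many packages expect one observation per row, hence the transpose):

X = permutedims(obs_coordinates(reduced))   # cells × 20 matrix of PCA coordinates
# Y = some_dimension_reduction(X; ndim=2)   # hypothetical call to your method of choice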
Projections
SingleCellProjections is built to make it very easy to project one dataset onto another.
Let's load count data for two more samples:
julia> sample_paths_proj = joinpath.(base_path, ["GSE164378_RNA_ADT_3P_P5.h5", "GSE164378_RNA_ADT_3P_P6.h5"]);
julia> counts_proj = load_counts(sample_paths_proj; sample_names=["P5","P6"]);
julia> leftjoin!(counts_proj.obs, cell_annotations; on=:barcode);
julia> counts_proj
DataMatrix (33766 variables and 42553 observations)
  SparseMatrixCSC{Int64, Int32}
  Variables: id, feature_type, name, genome, read, pattern, sequence
  Observations: cell_id, sampleName, barcode, nCount_ADT, nFeature_ADT, nCount_RNA, nFeature_RNA, orig.ident, lane, ...
And project them onto the Force Layout we created above:
julia> fl_proj = project(counts_proj, fl)
[ Info: Projecting onto VarCountsFractionModel(subset_size=13, total_size=33538, col="fraction_mt")
[ Info: Projecting onto SCTransformModel(nvar=20239, clip=34.32)
[ Info: - Removed 13527 variables that were not found in Model
[ Info: Projecting onto NormalizationModel(rank=2, ~1+num(fraction_mt))
[ Info: Projecting onto FilterModel(:, "celltype.l1"=>!Fix2{typeof(isequal), String}(isequal, "other"))
[ Info: Projecting onto SVDModel(nsv=20)
[ Info: Projecting onto NearestNeighborModel(base="force_layout", k=10)
DataMatrix (3 variables and 41095 observations)
  Matrix{Float64}
  Variables: id
  Observations: cell_id, sampleName, barcode, nCount_ADT, nFeature_ADT, nCount_RNA, nFeature_RNA, orig.ident, lane, ...
  Models: NearestNeighborModel(base="force_layout", k=10), SVD, Filter, Normalization, SCTransform, ...
The result looks similar to the force layout plot above, since the donors "P5" and "P6" are similar to donors "P1" and "P2".
Download interactive Force Layout projection plot.
Under the hood, SingleCellProjections recorded a ProjectionModel for every step of the analysis leading up to the Force Layout. Let's take a look:
julia> fl.models
6-element Vector{ProjectionModel}:
 VarCountsFractionModel(subset_size=13, total_size=33538, col="fraction_mt")
 SCTransformModel(nvar=20239, clip=34.32)
 NormalizationModel(rank=2, ~1+num(fraction_mt))
 FilterModel(:, "celltype.l1"=>!Fix2{typeof(isequal), String}(isequal, "other"))
 SVDModel(nsv=20)
 NearestNeighborModel(base="force_layout", k=10)
When projecting, these models are applied one by one (cf. the output from project above), ensuring that the projected data is processed correctly. In most cases, projecting is not the same as running the same analysis independently, since information about the data set is recorded in the model.
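Since every intermediate DataMatrix carries its own list of models, you can project onto any stage of the pipeline, not just the final one. A sketch (assuming project accepts any DataMatrix as the base, as it does for fl above), projecting the new samples onto the PCA representation:

julia> reduced_proj = project(counts_proj, reduced);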