Tutorial

For this example we will use PBMC (Peripheral Blood Mononuclear Cell) data from the paper Integrated analysis of multimodal single-cell data by Hao et al. You can find the original data here, in MatrixMarker (.mtx) format. For convenience, you can download the samples recompressed as .h5 files. Direct links:

First we load SingleCellProjections and the packages DataFrames and CSV for handling annotations.

julia> using SingleCellProjections, DataFrames, CSV

Loading Data

Then we load samples "P1" and "P2", by specifiying the paths to the files and naming them.

julia> base_path = "/path/to/downloads/";
julia> sample_paths = joinpath.(base_path, ["GSE164378_RNA_ADT_3P_P1.h5", "GSE164378_RNA_ADT_3P_P2.h5"]);
julia> counts = load_counts(sample_paths; sample_names=["P1","P2"])DataMatrix (33766 variables and 35340 observations)
  SparseMatrixCSC{Int64, Int32}
  Variables: id, feature_type, name, genome, read, pattern, sequence
  Observations: cell_id, sampleName, barcode

Data sets in SingleCellProjections are represented as DataMatrix objects, which are matrices with annotations for var (variables/genes/features) and obs (observations, typically cells). Above, counts is a DataMatrix where the counts are stored in a sparse matrix. You can also see the available annotations for variables and observations. To access the different parts, use:

counts.matrix - For the matrix
counts.var - Variable annotations (DataFrame)
counts.obs - Observation annotations (DataFrame)

Cell Annotations

Here we compute a new obs annotation where we count the fraction of reads coming from Mitochondrial genes for each cell:

julia> var_counts_fraction!(counts, "name"=>startswith("MT-"), "feature_type"=>isequal("Gene Expression"), "fraction_mt")DataMatrix (33766 variables and 35340 observations)
  SparseMatrixCSC{Int64, Int32}
  Variables: id, feature_type, name, genome, read, pattern, sequence
  Observations: cell_id, sampleName, barcode, fraction_mt
  Models: VarCountsFractionModel(subset_size=13, total_size=33538, col="fraction_mt")

Note that the new annotation fraction_mt is present in the output.

We will also load some more cell annotations from the provided file.

julia> cell_annotations = CSV.read(joinpath(base_path, "GSE164378_RNA_ADT_3P.csv.gz"), DataFrame);

To merge, we use the DataFrames function leftjoin!, since it takes care of matching the cells in counts to the cells in cell_annotations based on the :barcode column.

julia> leftjoin!(counts.obs, cell_annotations; on=:barcode);

Let's look at some annotations for the first few cells:

julia> counts.obs[1:5,["cell_id","sampleName","barcode","fraction_mt","celltype.l1"]]5×5 DataFrame
 Row │ cell_id                 sampleName  barcode              fraction_mt  celltype.l1 
     │ String                  String      String               Float64      String7?    
─────┼───────────────────────────────────────────────────────────────────────────────────
   1 │ P1_L1_AAACCCAAGACATACA  P1          L1_AAACCCAAGACATACA    0.0330832  CD4 T
   2 │ P1_L1_AAACCCACATCGGTTA  P1          L1_AAACCCACATCGGTTA    0.0649309  Mono
   3 │ P1_L1_AAACCCAGTGGAACAC  P1          L1_AAACCCAGTGGAACAC    0.0541372  NK
   4 │ P1_L1_AAACCCATCTGCGGAC  P1          L1_AAACCCATCTGCGGAC    0.0712244  CD4 T
   5 │ P1_L1_AAACGAAAGTTACTCG  P1          L1_AAACGAAAGTTACTCG    0.0625228  CD4 T

Transformation

The raw counts data is not suitable for analyses like PCA, since the data is far from normally distributed. A common strategy to handle this is to transform the data. Here we will use SCTransform (see also original sctransform implementation in R).

julia> transformed = sctransform(counts)DataMatrix (20239 variables and 35340 observations)
  A+B₁B₂B₃
  Variables: id, feature_type, name, genome, read, pattern, sequence, logGeneMean, outlier, ...
  Observations: cell_id, sampleName, barcode, fraction_mt, nCount_ADT, nFeature_ADT, nCount_RNA, nFeature_RNA, orig.ident, ...
  Models: SCTransformModel(nvar=20239, clip=34.32), VarCountsFraction

From the output, we see that the number of variables have been reduced, since the default sctransform options remove variables present in very few cells and only keeps variables with feature_type set to "Gene Expression".

The matrix is now shown as A+B₁B₂B₃. This is normally not very important from the user's point of view, but it is critical for explaining how SingleCellProjections can be fast and not use too much memory. Instead of storing the SCTransformed matrix as a huge dense matrix, it is stored in memory as a MatrixExpression, in this case a sparse matrix A plus a product of three smaller matrices B₁,B₂ and B₃.

Normalization

After transformation we always want to normalize the data. At the very least, data should be centered for PCA to work properly, this can be achieved by just running normalize_matrix with the default parameters. Here, we also want to regress out "fraction_mt". You can add more obs annotations (categorical and/or numerical) to regress out if you need.

julia> normalized = normalize_matrix(transformed, "fraction_mt")DataMatrix (20239 variables and 35340 observations)
  A+B₁B₂B₃+(-β)X'
  Variables: id, feature_type, name, genome, read, pattern, sequence, logGeneMean, outlier, ...
  Observations: cell_id, sampleName, barcode, fraction_mt, nCount_ADT, nFeature_ADT, nCount_RNA, nFeature_RNA, orig.ident, ...
  Models: NormalizationModel(rank=2, ~1+num(fraction_mt)), SCTransform, VarCountsFraction

Now the matrix is shown as A+B₁B₂B₃+(-β)X', i.e. another low-rank term was added to handle the normalization/regression. The first two terms are reused to make sure memory is not wasted.

Filtering

It is possible to filter variables and observations. Here we keep all cells that are not labeled as "other".

julia> filtered = filter_obs("celltype.l1"=>!isequal("other"), normalized)DataMatrix (20239 variables and 34639 observations)
  Aᵣ+B₁B₂B₃ᵣ+(-β)Xₗ'
  Variables: id, feature_type, name, genome, read, pattern, sequence, logGeneMean, outlier, ...
  Observations: cell_id, sampleName, barcode, fraction_mt, nCount_ADT, nFeature_ADT, nCount_RNA, nFeature_RNA, orig.ident, ...
  Models: FilterModel(:, "celltype.l1"=>!Fix2{typeof(isequal), String}(isequal, "other")), Normalization, SCTransform, VarCountsFraction

Principal Component Analysis (PCA)

Now we are ready to perform Principal Component Analysis (PCA). This is computed by the Singular Value Decomposition (SVD), so we should call the svd function. The number of dimensions is specified using the nsv parameter.

julia> reduced = svd(filtered; nsv=20)DataMatrix (20239 variables and 34639 observations)
  SVD (20 dimensions)
  Variables: id, feature_type, name, genome, read, pattern, sequence, logGeneMean, outlier, ...
  Observations: cell_id, sampleName, barcode, fraction_mt, nCount_ADT, nFeature_ADT, nCount_RNA, nFeature_RNA, orig.ident, ...
  Models: SVDModel(nsv=20), Filter, Normalization, SCTransform, VarCountsFraction

The matrix is now stored as an SVD object, which includes low dimensional representations of the observations and variables. To retrieve the low dimensional coordinates, use obs_coordinates and var_coordinates respectively.

Principal Component Analysis

Download interactive PCA plot.

Visualization

Expand this to show some example PlotlyJS plotting code.

You can of course use your own favorite plotting library instead. Use obs_coordinates to get the coordinates for each cell, and data.obs to access cell annotations for coloring.

using PlotlyJS
function plot_categorical_3d(data, annotation; marker_size=3)
    points = obs_coordinates(data)
    traces = GenericTrace[]
    for sub in groupby(data.obs, annotation; sort=true)
        value = sub[1,annotation]
        ind = parentindices(sub)[1]
        push!(traces, scatter3d(;x=points[1,ind], y=points[2,ind], z=points[3,ind], mode="markers", marker_size, name=value))
    end
    plot(traces, Layout(;legend=attr(itemsizing="constant")))
end

Use it like this:

julia> plot_categorical_3d(reduced, "celltype.l1")

For visualization purposes, it is often useful to further reduce the dimension after running PCA. (In contrast, analyses are generally run on the PCA/normalized/original data, since the methods below necessarily distort the data to force it down to 2 or 3 dimensions.)

Force Layout

Force Layout plots (also known as SPRING Plots) are created like this:

julia> fl = force_layout(reduced; ndim=3, k=100)DataMatrix (3 variables and 34639 observations)
  Matrix{Float64}
  Variables: id
  Observations: cell_id, sampleName, barcode, fraction_mt, nCount_ADT, nFeature_ADT, nCount_RNA, nFeature_RNA, orig.ident, ...
  Models: NearestNeighborModel(base="force_layout", k=10), SVD, Filter, Normalization, SCTransform, ...

force_layout

Download interactive Force Layout plot.

UMAP

SingleCellProjections can be used together with UMAP.jl:

julia> using UMAP
julia> umapped = umap(reduced, 3)DataMatrix (3 variables and 34639 observations)
  Matrix{Float64}
  Variables: id
  Observations: cell_id, sampleName, barcode, fraction_mt, nCount_ADT, nFeature_ADT, nCount_RNA, nFeature_RNA, orig.ident, ...
  Models: UMAPModel(n_components=3), SVD, Filter, Normalization, SCTransform, ...

umap

Download interactive UMAP plot.

t-SNE

Similarly, t-SNE plots are supported using TSne.jl. In this example, we just run it one every 10ᵗʰ cell, because t-SNE doesn't scale very well with the number of cells:

julia> using TSne
julia> t = tsne(reduced[:,1:10:end], 3)DataMatrix (3 variables and 3464 observations)
  Matrix{Float64}
  Variables: id
  Observations: cell_id, sampleName, barcode, fraction_mt, nCount_ADT, nFeature_ADT, nCount_RNA, nFeature_RNA, orig.ident, ...
  Models: NearestNeighborModel(base="tsne", k=10), Filter, SVD, Filter, Normalization, ...

t-SNE

Download interactive t-SNE plot.

Other

It is of course possible to use your own favorite dimension reduction method/package. The natural input for most cases are the coordinates after dimension reduction by PCA (obs_coordinates(reduced)).

Projections

SingleCellProjections is built to make it very easy to project one dataset onto another.

Let's load count data for two more samples:

julia> sample_paths_proj = joinpath.(base_path, ["GSE164378_RNA_ADT_3P_P5.h5", "GSE164378_RNA_ADT_3P_P6.h5"]);
julia> counts_proj = load_counts(sample_paths_proj; sample_names=["P5","P6"]);
julia> leftjoin!(counts_proj.obs, cell_annotations; on=:barcode);
julia> counts_projDataMatrix (33766 variables and 42553 observations)
  SparseMatrixCSC{Int64, Int32}
  Variables: id, feature_type, name, genome, read, pattern, sequence
  Observations: cell_id, sampleName, barcode, nCount_ADT, nFeature_ADT, nCount_RNA, nFeature_RNA, orig.ident, lane, ...

And project them onto the Force Layout we created above:

julia> fl_proj = project(counts_proj, fl)[ Info: Projecting onto VarCountsFractionModel(subset_size=13, total_size=33538, col="fraction_mt")
[ Info: Projecting onto SCTransformModel(nvar=20239, clip=34.32)
[ Info: - Removed 13527 variables that were not found in Model
[ Info: Projecting onto NormalizationModel(rank=2, ~1+num(fraction_mt))
[ Info: Projecting onto FilterModel(:, "celltype.l1"=>!Fix2{typeof(isequal), String}(isequal, "other"))
[ Info: Projecting onto SVDModel(nsv=20)
[ Info: Projecting onto NearestNeighborModel(base="force_layout", k=10)
DataMatrix (3 variables and 41095 observations)
  Matrix{Float64}
  Variables: id
  Observations: cell_id, sampleName, barcode, nCount_ADT, nFeature_ADT, nCount_RNA, nFeature_RNA, orig.ident, lane, ...
  Models: NearestNeighborModel(base="force_layout", k=10), SVD, Filter, Normalization, SCTransform, ...

The result looks similar to the force layout plot above, since the donors "P5" and "P6" are similar to donors "P1" and "P2".

force_layout_projected

Download interactive Force Layout projection plot.

Under the hood, SingleCellProjections recorded a ProjectionModel for every step of the analysis leading up to the Force Layout. Let's take a look:

julia> fl.models6-element Vector{ProjectionModel}:
 VarCountsFractionModel(subset_size=13, total_size=33538, col="fraction_mt")
 SCTransformModel(nvar=20239, clip=34.32)
 NormalizationModel(rank=2, ~1+num(fraction_mt))
 FilterModel(:, "celltype.l1"=>!Fix2{typeof(isequal), String}(isequal, "other"))
 SVDModel(nsv=20)
 NearestNeighborModel(base="force_layout", k=10)

When projecting, these models are applied one by one (C.f. output from project above), ensuring that the projected data is processed correctly. In most cases, projecting is not the same as running the same analysis independently, since information about the data set is recorded in the model.