Interface

SingleCellProjections.DataMatrixType
struct DataMatrix{T,Tv,To}

A DataMatrix represents a matrix together with annotations for variables and observations.

Fields:

  • matrix::T - The matrix.
  • var::Tv - Variable annotations.
  • obs::To - Observation annotations.
  • models::Vector{ProjectionModel} - Models used in the creation of this DataMatrix.

The first column of the var and obs tables should contain unique IDs.

source
SingleCellProjections.DataMatrixMethod
DataMatrix(matrix, var, obs; kwargs...)

Create a DataMatrix with the given matrix, var and obs.

The first column of var/obs are used as IDs.

Kwargs:

  • duplicate_var - Set to :ignore, :warn or :error to decide what happens if duplicate var IDs are found.
  • duplicate_obs - Set to :ignore, :warn or :error to decide what happens if duplicate obs IDs are found.
source
SingleCellProjections.LowRankType
LowRank

A matrix decomposition UVᵀ where each row of U represents a variable and each column of Vᵀ represents a sample. Intended for situations where the product is low rank, i.e. size(U,2)==size(Vt,1) is small.

source
SingleCellProjections.NormalizationModelMethod
NormalizationModel(data::DataMatrix, design::DesignMatrix;
                   scale=false, min_std=1e-6, annotate=true,
                   rtol=sqrt(eps()), var=:copy, obs=:copy)

Create a NormalizationModel based on data and a design matrix.

  • scale - Set to true to normalize variables to unit standard deviation. Can also be set to a vector with a scaling factor for each variable.
  • min_std - If scale==true, the scale vector is set to 1.0 ./ max.(std, min_std). That is, min_std is used to suppress variables that are very small (and any fluctuations can be assumed to be noise).
  • annotate - Only used if scale!=false. With annotate=true, the scale vector is added as a var annotation.
  • rtol - Singular values of the design matrix that are ≤rtol are discarded. Needed for numerical stability.
  • var - Can be :copy (make a copy of source var) or :keep (share the source var object).
  • obs - Can be :copy (make a copy of source obs) or :keep (share the source obs object).

See also: normalize_matrix, designmatrix

source
SingleCellProjections.SCTransformModelMethod
SCTransformModel([T=Float64], counts::DataMatrix;
                 var_filter = hasproperty(counts.var, :feature_type) ? :feature_type => isequal("Gene Expression") : nothing,
                 rtol=1e-3, atol=0.0, annotate=true,
                 post_var_filter=:, post_obs_filter=:,
                 obs=:copy,
                 kwargs...)

Computes the SCTransform parameter estimates for counts and creates a SCTransformModel that can be applied to the same or another data set. Defaults to only using "Gene Expression" features.

Optionally, T can be specified to control the eltype of the sparse transformed matrix. T=Float32 can be used to lower the memory usage, with little impact on the results, since downstream analysis is still done with Float64.

  • var_filter - Control which variables (features) to use for parameter estimation. Defaults to "feature_type" => isequal("Gene Expression"), if a feature_type column is present in counts.var. Can be set to nothing to disable filtering. See DataFrames.filter for how to specify filters.
  • var_filter_cols - Additional columns used to ensure features are unique. Defaults to "feature_type" if present in counts.var. Use a Tuple/Vector for specifying multiple columns. Can be set to nothing to not include any additional columns.
  • rtol - Relative tolerance when constructing low rank approximation.
  • atol - Absolute tolerance when constructing low rank approximation.
  • annotate - Set to true to include SCTransform parameter estimates as feature annotations.
  • post_var_filter - Equivalent to applying variable (feature) filtering after sctransform, but computationally more efficient.
  • post_obs_filter - Equivalent to applying observation (cell) filtering after sctransform, but computationally more efficient.
  • obs - Can be :copy (make a copy of source obs) or :keep (share the source obs object).
  • kwargs... - Additional kwargs are passed on to SCTransform.scparams.

Examples

Setup SCTransformModel (Gene Expression features):

julia> SCTransformModel(counts)

Setup SCTransformModel (Antibody Capture features):

julia> SCTransformModel(counts; var_filter = :feature_type => isequal("Antibody Capture"))

See also: sctransform, SCTransform.scparams, DataFrames.filter

source
Base.copyMethod
copy(data::DataMatrix; var=:copy, obs=:copy, matrix=:keep)

Copy DataMatrix data. By default, var and obs annotations are copied, but the matrix is shared. Set kwargs var, obs and matrix to :keep/:copy for fine grained control.

source
LinearAlgebra.svdMethod
svd(data::DataMatrix; nsv=3, var=:copy, obs=:copy, kwargs...)

Compute the Singular Value Decomposition (SVD) of data using the Random Subspace SVD algorithm from [Halko et al. "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions"]. SVD is often used to perform Principal Component Analysis (PCA), which assumes that the data is centered.

  • nsv - The number of singular values.
  • var - Can be :copy (make a copy of source var) or :keep (share the source var object).
  • obs - Can be :copy (make a copy of source obs) or :keep (share the source obs object).

Additional kwargs related to numerical precision are passed to SingleCellProjections.implicitsvd.

See also: SingleCellProjections.implicitsvd

source
SCTransform.sctransformMethod
sctransform([T=Float64], counts::DataMatrix; verbose=true, kwargs...)

Compute the SCTransform of the DataMatrix counts. The result is stored as a Matrix Expression with the sum of a sparse and a low-rank term. I.e. no large dense matrix is created.

Optionally, T can be specified to control the eltype of the sparse transformed matrix. T=Float32 can be used to lower the memory usage, with little impact on the results, since downstream analysis is still done with Float64.

See SCTransformModel for description of kwargs....

Examples

Compute SCTransform (Gene Expression features):

julia> sctransform(counts)

Compute SCTransform (Antibody Capture features):

julia> sctransform(counts; var_filter = :feature_type => isequal("Antibody Capture"))

Compute SCTransform (Gene Expression features), using eltype Float32 to lower memory usage:

julia> sctransform(Float32, counts)

See also: SCTransformModel, SCTransform.scparams

source
SingleCellProjections.adjacency_distancesFunction
adjacency_distances(adj, X, Y=X)

For each structural non-zero in adj, compute the Euclidean distance between the point in the DataMatrix Y and the point in the DataMatrix X.

Can be useful when adj is created using e.g. a lower dimensional representation and we want to know the distances in the original, high dimensional space.

At the moment all points in Y are required to have the same number of neighbors in X, for computation reasons.

source
SingleCellProjections.covariateFunction
covariate(src, group_a, [group_b])

Create a two-group covariate referring to src, comparing group_a to group_b.

src is one of:

  • String - referring to a column in the DataMatrix obs.
  • DataFrame - with exactly two columns, the first should contain IDs matching IDs in obs, and the second should be the covariate.
  • Annotations (experimental) - with ID matching the DataMatrix obs and a second column for the covariate.

If src is a String it will refer to a column in the DataMatrix obs. src can also be an Annotations object, with ID matching the DataMatrix obs. group_a and group_b must be values occuring in the column src.

If group_b is not given, group_a will be compared to all other observations.

See also: designmatrix

source
SingleCellProjections.covariateFunction
covariate(src, type=:auto)

Create a covariate referring to src.

src is one of:

  • String - referring to a column in the DataMatrix obs.
  • DataFrame - with exactly two columns, the first should contain IDs matching IDs in obs, and the second should be the covariate.
  • Annotations (experimental) - with ID matching the DataMatrix obs and a second column for the covariate.

type must be one of :auto, :numerical, :categorical, :twogroup and :intercept. :auto means auto-detection by checking if the values in the column are numerical or categorical. type==:intercept adds an intercept to the model (in which case the src parameter is ignored).

See also: designmatrix

source
SingleCellProjections.designmatrixMethod
designmatrix(data::DataMatrix, [covariates...]; center=true, max_categories=100)

Creates a design matrix from data.obs and the given covariates. Covariates can be specied using strings (column name in data.obs), with autodetection of whether the covariate is numerical or categorical, or using the covariate function for more control.

  • center - If true, an intercept is added to the design matrix. (Should only be set to false in very rare circumstances.)
  • max_categories - Safety parameter, an error will be thrown if there are too many categories. In this case, it is likely a mistake that the covariate was used as a categorical covariate. Using a very large number of categories is also bad for performance and memory consumption.

Examples

Centering only:

julia> designmatrix(data)

Regression model with intercept (centering) and "fraction_mt" (numerical annotation):

julia> designmatrix(data, "fraction_mt")

As above, but also including "batch" (categorical annotation):

julia> designmatrix(data, "fraction_mt", "batch")

See also: normalize_matrix, NormalizationModel, covariate

source
SingleCellProjections.filter_matrixMethod
filter_matrix(fvar, fobs, data::DataMatrix)

Return a new DataMatrix, containing only the variables and observations passing the filters.

fvar/fobs can be:

  • An AbstractVector of indices to keep.
  • A AbstractVector of booleans (true to keep, false to discard).
  • : indicating that all variables/observations should be kept.
  • Anything you can pass on to DataFrames.filter (see DataFrames documentation for details).

Also note that indexing of a DataMatrix supports AbstractVectors of indices/booleans and :, and is otherwise identical to filter_matrix.

Examples

Keep every 10th variable and 3rd observation:

julia> filter_matrix(1:10:size(data,1), 1:3:size(data,2), data)

Or, using indexing syntax:

julia> data[1:10:end, 1:3:end]

For more examples, see filter_var and filter_obs.

See also: filter_var, filter_obs, DataFrames.filter

source
SingleCellProjections.filter_obsMethod
filter_obs(f, data::DataMatrix)

Return a new DataMatrix, containing only the observations passing the filter.

f can be:

  • An AbstractVector of indices to keep.
  • A AbstractVector of booleans (true to keep, false to discard).
  • : indicating that all observations should be kept.
  • Anything you can pass on to DataFrames.filter (see DataFrames documentation for details).

Examples

Keep every 10th observation:

julia> filter_obs(1:10:size(data,2), data)

Remove observations where "celltype" equals "other":

julia> filter_obs("celltype"=>!isequal("other"), data)

See also: filter_matrix, filter_var, DataFrames.filter

source
SingleCellProjections.filter_varMethod
filter_var(f, data::DataMatrix; kwargs...)

Return a new DataMatrix, containing only the variables passing the filter.

f can be:

  • An AbstractVector of indices to keep.
  • A AbstractVector of booleans (true to keep, false to discard).
  • : indicating that all variables should be kept.
  • Anything you can pass on to DataFrames.filter (see DataFrames documentation for details).

Examples

Keep every 10th variable:

julia> filter_var(1:10:size(data,1), data)

Keep only variables of the type "Gene Expression":

julia> filter_var("feature_type"=>isequal("Gene Expression"), data)

See also: filter_matrix, filter_obs, DataFrames.filter

source
SingleCellProjections.force_layoutMethod
force_layout(data::DataMatrix;
             ndim=3,
             k,
             adj,
             kprojection=10,
             obs=:copy,
             adj_out,
             niter = 100,
             link_distance = 4,
             link_strength = 2,
             charge = 5,
             charge_min_distance = 1,
             theta = 0.9,
             center_strength = 0.05,
             velocity_decay = 0.9,
             initialAlpha = 1.0,
             finalAlpha = 1e-3,
             initialScale = 10,
             seed,
             rng)

Compute the Force Layout (also known as a force directed knn-graph or SPRING plots) for data. Usually, data is a DataMatrix after reduction to 10-100 dimensions by svd.

A Force Layout is computed by running a physics simulation were the observations are connected by springs (such that connected observations are attracted), a general "charge" force repelling all observations from each other and a centering force that keep the observations around the origin. The implementation is based on d3-force: https://github.com/d3/d3-force, also see LICENSE.md.

Exactly one of the kwargs k and adj must be provided. See details below.

General parameters:

  • k - Number of nearest neighbors to connect each observation to (computes adj below).
  • adj - An sparse, symmetric, adjacency matrix with booleans. true if two observations are connected by a spring and false otherwise.
  • kprojection - The number of nearest neighbors used when projecting onto the resulting force layout. (Not used in the computation of the layout, only during projection.)
  • obs - Can be :copy (make a copy of source obs) or :keep (share the source obs object).
  • adj_out - Optional Ref. If specified, the (computed) adj matrix will be assigned to adj_out.

Paramters controlling the physics simulation:

  • niter - Number of iterations to run the simulation.
  • link_distance - The length of each spring.
  • link_strength - The strength of the spring force.
  • charge - The strength of the charge force.
  • charge_min_distance - Used to avoid numerical instabilities by limiting the charge force for observations that are very close.
  • theta - Parameter controlling accuracy in the Barnes-Hut approximation for charge forces.
  • center_strength - Strength of the centering force.
  • velocity_decay - At each iteration, the current velocity for an observations is multiplied by velocity_decay.
  • initialAlpha - The alpha value decreases over time and allows larger changes to happen early, while being more stable towards the end.
  • finalAlpha - See initialAlpha
  • initialScale - The simulation is initialized by randomly drawing each observation from a multivariate Gaussian, and is scaled by initialScale.
  • seed - Optional random seed used to init rng. NB: This requires the package StableRNGs to be loaded.
  • rng - Optional RNG object. Useful for reproducibility.

Examples

julia> force_layout(data; ndim=3, k=100)
source
SingleCellProjections.ftest!Method
ftest!(data::DataMatrix, h1; h0, kwargs...)

Performs an F-Test with the given h1 (alternative hypothesis) and h0 (null hypothesis). Examples of F-Tests are ANOVA and Quadratic Regression, but any linear model can be used.

ftest! adds a F-statistic and a p-value column to data.var.

See ftest_table for usage examples and more details on computations and parameters.

In addition ftest! supports the kwarg:

  • prefix - Output column names for F-statistics and p-values will be prefixed with this string. If none is given, it will be constructed from h1 and h0.

See also: ftest_table, ftest, ttest!

source
SingleCellProjections.ftestMethod
ftest(data::DataMatrix, h1; h0, var=:copy, obs=:copy, matrix=:keep, kwargs...)

Performs an F-Test with the given h1 (alternative hypothesis) and h0 (null hypothesis). Examples of F-Tests are ANOVA and Quadratic Regression, but any linear model can be used.

ftest creates a copy of data and adds a F-statistic and a p-value column to data.var.

See ftest_table and ftest! for usage examples and more details on computations and parameters.

See also: ftest!, ftest_table, ttest

source
SingleCellProjections.ftest_tableMethod
ftest_table(data::DataMatrix, h1; h0, kwargs...)

Performs an F-Test with the given h1 (alternative hypothesis) and h0 (null hypothesis). Examples of F-Tests are ANOVA and Quadratic Regression, but any linear model can be used. (See "Examples" below for concrete examples.)

F-tests can be performed on any DataMatrix, but it is almost always recommended to do it directly after transforming the data using e.g. sctransform, logtransform or tf_idf_transform.

Normalization

Do not use ftest_table after normalizing the data using normalize_matrix: ftest_table needs to know about the h0 model (regressed out covariates) for correction computations. Failing to do so can result in incorrect results. If you want to correct for the same covariates, pass them as h0 to ftest_table.

h1 can be:

  • A string specifying a column name of data.obs. Auto-detection determines if the column is categorical (ANOVA) or numerical.
  • A covariate for more control of how to interpret the values in a column.
  • A tuple or vector of the above for compound models.

ftest_table returns a Dataframe with columns for variable IDs, F-statistics and p-values.

Supported kwargs are:

  • h0 - Use a non-trivial h0 (null) model. Specified in the same way as h1 above.
  • center=true - Add an intercept to the h0 (null) model.
  • statistic_col="F" - Name of the output column containing the F-statistics. (Set to nothing to remove from output.)
  • pvalue_col="pValue" - Name of the output column containing the p-values. (Set to nothing to remove from output.)
  • h1_missing=:skip - One of :skip and :error. If skip, missing values in h1 columns are skipped, otherwise an error is thrown.
  • h0_missing=:error - One of :skip and :error. If skip, missing values in h0 columns are skipped, otherwise an error is thrown.
  • allow_normalized_matrix=false - Set to true to accept running on a DataMatrix that has been normalized.

Examples

Perform an ANOVA using the "celltype" annotation.

julia> ftest_table(transformed, "celltype")

Perform an ANOVA using the "celltype" annotation, while correcting for fraction_mt (a linear covariate).

julia> ftest_table(transformed, "celltype"; h0="fraction_mt")

Perform an ANOVA using the "celltype" annotation, while correcting for fraction_mt (a linear covariate) and "phase" (a categorical covariate).

julia> ftest_table(transformed, "celltype"; h0=("fraction_mt","phase"))

Perform Quadractic Regression using the covariate x, by first creating an annotation for x squared, and then using a compound model.

julia> data.obs.x2 = data.obs.x.^2;

julia> ftest_table(transformed, ("x","x2"))

See also: ftest!, ftest, ttest_table, covariate

source
SingleCellProjections.implicitsvdMethod
implicitsvd(A; nsv=3, subspacedims=4nsv, niter=2, stabilize_sign=true, seed, rng)

Compute the SVD of A using Random Subspace SVD. [Halko et al. "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions"]

  • nsv - Number of singular values/vectors to compute
  • subspacedims - Number of dimensions used for the subspace approximating the action of A.
  • niter - Number of iterations. In each iteration, one multiplication of A with a matrix and one multiplication of A' with a matrix will be performed.
  • stabilize_sign - If true, handles the problem that the SVD is only unique up to the sign of each component (for real matrices), by ensuring that the l1 norm of the positive entires for each column in U is larger than the l1 norm of the negative entries.
  • seed - Use a random seed to init the rng. NB: This requires the package StableRNGs to be loaded.
  • rng - Specify a custom RNG.
source
SingleCellProjections.load10xMethod
load10x(filename; lazy=false, var_id=nothing, var_id_delim='_')

Load a CellRanger ".h5" or ".mtx[.gz]" file as a DataMatrix.

  • lazy - If true, the count matrix itself will not be loaded, only features and barcodes. This is used internally in load_counts to merge samples more efficiently. Use load_counts to later load the count data.
  • var_id - If a pair var_id_col=>cols, the contents of columns cols will be merged to create new IDs. Useful to ensure that IDs are unique.
  • var_id_delim - Delimiter used to when merging variable columns to create the variable id column.

Examples

Load counts from a CellRanger ".h5" file. (Recommended.)

julia> counts = load10x("filtered_feature_bc_matrix.h5")

Load counts from a CellRanger ".mtx" file. Tries to find barcode and feature annotation files in the same folder.

julia> counts = load10x("matrix.mtx.gz")

Lazy loading followed by loading.

julia> counts = load10x("filtered_feature_bc_matrix.h5");
julia> counts = load_counts(counts)

See also: load_counts

source
SingleCellProjections.load_countsMethod
load_counts([loadfun=load10x], filenames;
            sample_names,
            sample_name_col,
            obs_id_col = "cell_id",
            lazy,
            lazy_merge = false,
            obs_id_delim = '_',
            obs_id_prefixes = sample_names,
            extra_var_id_cols::Union{Nothing,String,Vector{String}},
            duplicate_var,
            duplicate_obs,
            callback=nothing)

Load and merge multiple samples efficiently.

Defaults to loading 10x CellRanger files. The files are first loaded lazily, then the merged count matrix is allocated and finally each sample is loaded directly into the merged count matrix. (This strategy greatly reduces memory usage, since only one copy of data is needed instead of two.)

filenames specifies which files to load. (It can be a vector of filenames or a single filename string.) For each file, loadfun is called.

  • sample_names - Specify the sample names. Should be a vector of the same length as filenames. Set to nothing to not create a sample name annotation.
  • sample_name_col - Column for sample names in obs, defaults to "sampleName".
  • obs_id_col - Colum for merged ids in obs.
  • lazy - Enable lazy loading. Defaults to true if load10x is used, and false otherwise.
  • lazy_merge - Enable lazy merging, i.e. var and obs are created, but the count matrix merging is postponed until a second call to load_counts.
  • obs_id_delim - Delimiter used when creating merged obs IDs.
  • obs_id_prefixes - Prefix (one per sample) used to create new IDs. Set to nothing to keep old IDs. Defaults to sample_names.
  • extra_var_id_cols - Additional columns to use to ensure variable IDs are unique during merging. Defaults to "feature_type" if that column is present for all samples. Can be a Vector{String} to include multiple columns. Set to nothing to disable.
  • duplicate_var - Set to :ignore, :warn or :error to decide what happens if duplicate var IDs are found.
  • duplicate_obs - Set to :ignore, :warn or :error to decide what happens if duplicate obs IDs are found.
  • callback - Experimental callback functionality. The callback function is called between samples during merging. Return true to abort loading and false to continue.
  • Additional kwargs (including duplicate_var/duplicate_obs if specified) are passed to loadfun.

Examples

Load and name samples:

julia> counts = load_counts(["s1.h5", "s2.h5"]; sample_names=["Sample A", "Sample B"])

See also: load10x, merge_counts

source
SingleCellProjections.loadh5adMethod
loadh5ad(filename; var_id_column=:id, obs_id_column=:id)

Experimental loading of .h5ad files.

Note

This function is deprecated. Load Muon.jl and see help for create_datamatrix.

source
SingleCellProjections.local_outlier_factor!Method
local_outlier_factor!(data::DataMatrix, full::DataMatrix; k=10, col="LOF")

Compute the Local Outlier Factor (LOF) for each observation in data, and add as column to data.obs with the name col.

When working with projected DataMatrices, use local_outlier_factor_projection! instead.

NB: This function might be very slow for high values of k.

First, the k nearest neighbors are found for each observation in data. Then, the Local Outlier Factor is computed by considering the distance between the neighbors, but this time in the full DataMatrix. Thus full must have the same observations as are present in data.

A LOF value smaller than or close to one is means that the observation is an inlier, but a LOF value much larger than one means that the observation is an inlier.

By specifiying full=data, this is coincides with the standard definition for Local Outlier Factor. However, it is perhaps more useful to find neighbors in a dimension reduced space (after e.g. svd (PCA) or umap), but then compute distances in the high dimensional space (typically after normalization). This way, an observation is concidered an outlier if the reduction to a lower dimensional space didn't properly represent the neighborhood of the observation.

Note

The interface is not yet fully decided and subject to change.

Examples

Compute the Local Outlier Factor, with nearest neighbors based only on reduced, but later using distances in full for the actual LOF computation.

julia> reduced = svd(normalized; nsv=10)

julia> local_outlier_factor!(reduced, normalized; k=10)

See also: local_outlier_factor, local_outlier_factor_table, local_outlier_factor_projection!

source
SingleCellProjections.local_outlier_factor_projection!Method
local_outlier_factor_projection!(data::DataMatrix, full::DataMatrix, base::DataMatrix, base_full::DataMatrix; k=10, col="LOF_projection")

Compute the Local Outlier Factor (LOF) for each observation in data, and add as column to data.obs with the name col.

Use local_outlier_factor_projection! if you are working with projected data, and local_outlier_factor! otherwise.

Parameters:

  • data - A DataMatrix for which we compute LOF for each observation. Expected to be a DataMatrix projected onto base, so that the data and base use the same coordinate system.
  • full - A DataMatrix with the same observations as data, used to compute distances in the LOF computation. Expected to be a DataMatrix projected onto base_full, so that the full and base_full use the same coordinate system.
  • base - The base DataMatrix.
  • base_full - The base DataMatrix.
  • k - The number of nearest neighbors to use. NB: This function might be very slow for high values of k.

First, for each observation in data, the k nearest neighbors in base are found. Then, the distance to each neighbor is computed using full and base_full. Thus full must have the same observations as are present in data, and base_full must have the same observations as base.

A LOF value smaller than or close to one is means that the observation is an inlier, but a LOF value much larger than one means that the observation is an inlier.

By specifiying full=data and base_full=base, this is coincides with the standard definition for Local Outlier Factor. However, it is perhaps more useful to find neighbors in a dimension reduced space (after e.g. svd (PCA) or umap), but then compute distances in the high dimensional space (typically after normalization). This way, an observation is concidered an outlier if the reduction to a lower dimensional space didn't properly represent the neighborhood of the observation.

Note

The interface is not yet fully decided and subject to change.

Examples

Compute the Local Outlier Factor (LOF) for each observation in a data set reduced, which has been projected onto base_reduced.

The nearest neighbors are computed between observations in reduced and base_reduced, but the distances in the actual LOF computation are between the same observations in normalized and base_normalized.

julia> base_reduced = svd(base_normalized; nsv=10)

julia> normalized = project(counts, base_normalized);

julia> reduced = project(normalized, base_reduced);

julia> local_outlier_factor!(reduced, normalized, base_reduced, base_normalized; k=10)

See also: local_outlier_factor_projection, local_outlier_factor_projection_table, local_outlier_factor!

source
SingleCellProjections.logtransformMethod
logtransform([T=Float64], counts::DataMatrix;
             var_filter = hasproperty(counts.var, "feature_type") ? "feature_type" => isequal("Gene Expression") : nothing,
             var_filter_cols = hasproperty(counts.var, "feature_type") ? "feature_type" : nothing,
             scale_factor=10_000,
             var=:copy,
             obs=:copy)

Log₂-transform counts using the formula:

  log₂(1 + cᵢⱼ*scale_factor/(∑ᵢcᵢⱼ))

Optionally, T can be specified to control the eltype of the sparse transformed matrix. T=Float32 can be used to lower the memory usage, with little impact on the results, since downstream analysis is still done with Float64.

  • var_filter - Control which variables (features) to use for parameter estimation. Defaults to "feature_type" => isequal("Gene Expression"), if a feature_type column is present in counts.var. Can be set to nothing to disable filtering. See DataFrames.filter for how to specify filters.
  • var_filter_cols - Additional columns used to ensure features are unique. Defaults to "feature_type" if present in counts.var. Use a Tuple/Vector for specifying multiple columns. Can be set to nothing to not include any additional columns.
  • var - Can be :copy (make a copy of source var) or :keep (share the source var object).
  • obs - Can be :copy (make a copy of source obs) or :keep (share the source obs object).

Examples

julia> transformed = logtransform(counts)

Use eltype Float32 to lower memory usage:

julia> transformed = logtransform(Float32, counts)

See also: sctransform

source
SingleCellProjections.mannwhitney!Method
mannwhitney!(data::DataMatrix, column, [group_a, group_b]; kwargs...)

Perform a Mann-Whitney U-test (also known as a Wilcoxon rank-sum test) between two groups of observations. The U statistic is corrected for ties, and p-values are computed using a normal approximation.

Note that data must be a DataMatrix containing a sparse matrix only. It is recommended to first logtransform (or tf_idf_transform) the raw counts before performing the Mann-Whitney U-test.

mannwhitney! adds a U statistic and a p-value column to data.var. See mannwhitney_table for more details on groups and kwargs.

In addition mannwhitney! supports the kwarg:

  • prefix - Output column names for U statistics and p-values will be prefixed with this string. If none is given, it will be constructed from column, group_a and group_b.

See also: mannwhitney_table, mannwhitney

source
SingleCellProjections.mannwhitneyMethod
mannwhitney(data::DataMatrix, column, [group_a, group_b]; var=:copy, obs=:copy, matrix=:keep, kwargs...)

Perform a Mann-Whitney U-test (also known as a Wilcoxon rank-sum test) between two groups of observations. The U statistic is corrected for ties, and p-values are computed using a normal approximation.

Note that data must be a DataMatrix containing a sparse matrix only. It is recommended to first logtransform (or tf_idf_transform) the raw counts before performing the Mann-Whitney U-test.

mannwhitney creates a copy of data and adds a U statistic and a p-value column to data.var. See mannwhitney! and mannwhitney_table for more details on groups and kwargs.

See also: mannwhitney!, mannwhitney_table

source
SingleCellProjections.mannwhitney_tableMethod
mannwhitney_table(data::DataMatrix, column, [group_a, group_b]; kwargs...)

Perform a Mann-Whitney U-test (also known as a Wilcoxon rank-sum test) between two groups of observations. The U statistic is corrected for ties, and p-values are computed using a normal approximation.

Note that data must be a DataMatrix containing a sparse matrix only. It is recommended to first logtransform (or tf_idf_transform) the raw counts before performing the Mann-Whitney U-test.

column specifies a column in data.obs and is used to determine which observations belong in which group.

If group_a and group_b are not given, the column must contain exactly two unique values (except missing). If group_a is given, but not group_b, the observations in group A are compared to all other observations (except missing). If both group_a and group_b are given, the observations in group A are compared the observations in group B.

mannwhitney_table returns a Dataframe with columns for variable IDs, U statistics and p-values.

Supported kwargs are:

  • statistic_col="U" - Name of the output column containing the U statistics. (Set to nothing to remove from output.)
  • pvalue_col="pValue" - Name of the output column containing the p-values. (Set to nothing to remove from output.)
  • h1_missing=:skip - One of :skip and :error. If skip, missing values in column are skipped, otherwise an error is thrown.

The following kwargs determine how the computations are threaded:

  • nworkers - Number of worker threads used in the computation. Set to 1 to disable threading.
  • chunk_size - Number of variables processed in each chunk.
  • channel_size - Max number of unprocessed chunks in queue.

See also: mannwhitney!, mannwhitney

source
SingleCellProjections.merge_countsMethod
merge_counts(samples, sample_names;
             lazy=false,
             sample_name_col = sample_names===nothing ? nothing : "sampleName",
             obs_id_col = "cell_id",
             obs_id_delim = '_',
             obs_id_prefixes = sample_names,
             extra_var_id_cols::Union{Nothing,String,Vector{String}},
             duplicate_var,
             duplicate_obs,
             callback=nothing)

Merge samples to create one large DataMatrix, by concatenating the obs. The union of the variables from the samples is used, and if a variable is not present in a sample, the count will be set to zero.

The obs IDs are created by concatenating the current obs ID columns, together with the sample_names (if provided).

  • lazy - Lazy merging. Use load_counts to actually perform the merging.
  • sample_name_col - Column in which the sample_names are stored.
  • obs_id_col - Name of obs ID column after merging. (Set to nothing to keep old column name.)
  • obs_id_delim - Delimiter used when merging obs IDs.
  • obs_id_prefixes - Prefix (one per sample) used to create new IDs. Set to nothing to keep old IDs. Defaults to sample_names.
  • extra_var_id_cols - Additional columns to use to ensure variable IDs are unique during merging. Defaults to "feature_type" if that column is present for all samples. Can be a Vector{String} to include multiple columns. Set to nothing to disable.
  • duplicate_var - Set to :ignore, :warn or :error to decide what happens if duplicate var IDs are found.
  • duplicate_obs - Set to :ignore, :warn or :error to decide what happens if duplicate obs IDs are found.
  • callback - Experimental callback functionality. The callback function is called between samples during merging. Return true to abort loading and false to continue.

See also: load_counts

source
SingleCellProjections.normalize_matrixMethod
normalize_matrix(data::DataMatrix, [covariates...]; center=true, scale=false, kwargs...)

Normalize data. By default, the matrix is centered. Any covariates specified (using column names of data.obs) will be regressed out.

  • center - Set to true to center the data matrix.
  • scale - Set to true to scale the variables in the data matrix to unit standard deviation.

For other kwargs and more detailed descriptions, see NormalizationModel and designmatrix.

Examples

Centering only:

julia> normalize_matrix(data)

Regression model with intercept (centering) and "fraction_mt" (numerical annotation):

julia> normalize_matrix(data, "fraction_mt")

As above, but also including "batch" (categorical annotation):

julia> normalize_matrix(data, "fraction_mt", "batch")

See also: NormalizationModel, designmatrix

source
SingleCellProjections.obs_coordinatesFunction
obs_coordinates(data::DataMatrix)

Returns a matrix with coordinates for the observations. Not available for all types of DataMatrices. Mostly useful for data matrices after dimension reduction such as svd or force_layout has been applied.

In the case of SVD (PCA), obs_coordinates returns the principal components, scaled by the singular values. This is a a good starting point for downstream analysis, since it is the optimal linear approximation of the original data for the given number of dimensions.

source
SingleCellProjections.projectMethod
project(data::DataMatrix, models, args...; verbose=true, kwargs...)

Convenience function for projection onto multiple models. Essentially calls foldl and prints some @info messages (if verbose=true). In most cases, it is better to call project(data, base::DataMatrix) instead of using this method directly.

source
SingleCellProjections.projectMethod
project(data::DataMatrix, base::DataMatrix, args...; from=nothing, kwargs...)

Project data onto base, by applying ProjectionModels from base one by one.

Since data already might have some models applied, project will try to figure out which models from base to use. See "Examples" below for concrete examples. Here's a more technical overview:

Consider a base data matrix with four models:

base: A -> B -> C -> D

Given some new data (typically counts), we can project that onto base, given the result proj by applying all four models:

data:
proj: A -> B -> C -> D

If data already has some models applied (e.g. we already projected onto A and B above), project will look for the last model in data (in this case B) in the list of models in base, and only apply models after that (in this case C and D).

data: A -> B
proj: A -> B -> C -> D

It is also possible to use the from kwarg to specify exactly which models to apply. (The models in from must be a prefix of the models in base, or in other words, base was created by applying additional operations to from.)

data: X
base: A -> B -> C -> D
from: A -> B
proj: X -> C -> D

Note that it is necessary to use the from kwarg if the last model in data does not occurr in base, because project cannot figure out on its own which models it makes sense to apply.

Examples

First, we construct a "base" by loading counts, SCTransforming, normalizing, computing the svd and finally computing a force layout:

julia> fp = ["GSE164378_RNA_ADT_3P_P1.h5", "GSE164378_RNA_ADT_3P_P2.h5"];
julia> counts = load_counts(fp; sample_names=["P1","P2"]);
julia> transformed = sctransform(counts);
julia> normalized = normalize_matrix(transformed);
julia> reduced = svd(normalized; nsv=10);
julia> fl = force_layout(reduced; ndim=3, k=100)
  DataMatrix (3 variables and 35340 observations)
  Matrix{Float64}
  Variables: id
  Observations: id, sampleName, barcode
  Models: NearestNeighborModel(base="force_layout", k=10), SVD, Normalization, SCTransform

Note how the last line lists all ProjectionModels used in the creation of fl.

Next, let's load some more samples for projection:

julia> fp2 = ["GSE164378_RNA_ADT_3P_P5.h5", "GSE164378_RNA_ADT_3P_P6.h5"];
julia> counts2 = load_counts(fp2; sample_names=["P5","P6"]);

It is easy to project the newly loaded counts2 onto the "base" force layout fl:

julia> project(counts2, fl)
DataMatrix (3 variables and 42553 observations)
  Matrix{Float64}
  Variables: id
  Observations: id, sampleName, barcode
  Models: NearestNeighborModel(base="force_layout", k=10), SVD, Normalization, SCTransform

We can also project in two or more steps, to get access to intermediate results:

julia> reduced2 = project(counts2, reduced)
DataMatrix (20239 variables and 42553 observations)
  SVD (10 dimensions)
  Variables: id, feature_type, name, genome, read, pattern, sequence, logGeneMean, outlier, beta0, ...
  Observations: id, sampleName, barcode
  Models: SVDModel(nsv=10), Normalization, SCTransform

julia> project(reduced2, fl)
DataMatrix (3 variables and 42553 observations)
  Matrix{Float64}
  Variables: id
  Observations: id, sampleName, barcode
  Models: NearestNeighborModel(base="force_layout", k=10), SVD, Normalization, SCTransform

If the DataMatrix we want to project is modified, we need to use the from kwarg to tell project which models to use:

julia> filtered = counts2[:,1:10:end]
DataMatrix (33766 variables and 4256 observations)
  SparseArrays.SparseMatrixCSC{Int64, Int32}
  Variables: id, feature_type, name, genome, read, pattern, sequence
  Observations: id, sampleName, barcode
  Models: FilterModel(:, 1:10:42551)

julia> reduced2b = project(filtered2, reduced; from=counts)
DataMatrix (20239 variables and 4256 observations)
  SVD (10 dimensions)
  Variables: id, feature_type, name, genome, read, pattern, sequence, logGeneMean, outlier, beta0, ...
  Observations: id, sampleName, barcode
  Models: SVDModel(nsv=10), Normalization, SCTransform, Filter

After that, it is possible to continue without specifying from:

julia> project(reduced2b, fl)
DataMatrix (3 variables and 4256 observations)
  Matrix{Float64}
  Variables: id
  Observations: id, sampleName, barcode
  Models: NearestNeighborModel(base="force_layout", k=10), SVD, Normalization, SCTransform, Filter
source
SingleCellProjections.projectMethod
project(data::DataMatrix, model::ProjectionModel, args...; verbose=true, kwargs...)

Core projection function. Project data based on the single ProjectionModel model. In most cases, it is better to call project(data, base::DataMatrix) instead of using this method directly.

source
SingleCellProjections.pseudobulkMethod
pseudobulk(data::DataMatrix, obs_col, [additional_columns...]; var=:copy)

Create a new DataMatrix by averging over groups, as specified by the categorical annotation obs_col (and optionally additional columns).

  • var - Can be :copy (make a copy of source var) or :keep (share the source var object).

Examples

Create a pseudobulk representation of each sample:

julia> pseudobulk(transformed, "sampleName")

Create a pseudobulk representation for each celltype in each sample:

julia> pseudobulk(transformed, "sampleName", "celltype")
source
SingleCellProjections.set_obs_id_col!Method
set_obs_id_col!(data::DataMatrix, obs_id_col::String; duplicate_obs=:error)

Set which column to use as observation IDs. It will be moved to the first column of data.obs. The rows of this column in data.obs must be unique.

  • duplicate_obs - Set to :ignore, :warn or :error to decide what happens if duplicate IDs are found.
source
SingleCellProjections.set_var_id_col!Method
set_var_id_col!(data::DataMatrix, var_id_col::String; duplicate_var=:error)

Set which column to use as variable IDs. It will be moved to the first column of data.var. The rows of this column in data.var must be unique.

  • duplicate_var - Set to :ignore, :warn or :error to decide what happens if duplicate IDs are found.
source
SingleCellProjections.tf_idf_transformMethod
tf_idf_transform([T=Float64], counts::DataMatrix;
                 var_filter = hasproperty(counts.var, "feature_type") ? "feature_type" => isequal("Gene Expression") : nothing,
                 var_filter_cols = hasproperty(counts.var, "feature_type") ? "feature_type" : nothing,
                 scale_factor = 10_000,
                 idf = vec(size(counts,2) ./ max.(1,sum(counts.matrix; dims=2))),
                 annotate = true,
                 var = :copy,
                 obs = :copy)

Compute the TF-IDF (term frequency-inverse document frequency) transform of counts, using the formula log( 1 + scale_factor * tf * idf ) where tf is the term frequency counts.matrix ./ max.(1, sum(counts.matrix; dims=1)).

Optionally, T can be specified to control the eltype of the sparse transformed matrix. T=Float32 can be used to lower the memory usage, with little impact on the results, since downstream analysis is still done with Float64.

  • var_filter - Control which variables (features) to use for parameter estimation. Defaults to "feature_type" => isequal("Gene Expression"), if a feature_type column is present in counts.var. Can be set to nothing to disable filtering. See DataFrames.filter for how to specify filters.
  • var_filter_cols - Additional columns used to ensure features are unique. Defaults to "feature_type" if present in counts.var. Use a Tuple/Vector for specifying multiple columns. Can be set to nothing to not include any additional columns.
  • annotate - If true, idf will be added as a var annotation.
  • var - Can be :copy (make a copy of source var) or :keep (share the source var object).
  • obs - Can be :copy (make a copy of source obs) or :keep (share the source obs object).
source
SingleCellProjections.ttest!Method
ttest!(data::DataMatrix, h1, [group_a], [group_b]; h0, kwargs...)

Performs a t-Test with the given h1 (alternative hypothesis) and h0 (null hypothesis). Examples of t-Tests are Two-Group tests and Linear Regression.

ttest! adds a t-statistic, a p-value and a difference column to data.var.

See ttest_table for usage examples and more details on computations and parameters.

In addition ttest! supports the kwarg:

  • prefix - Output column names for t-statistics, p-values and differences will be prefixed with this string. If none is given, it will be constructed from h1, group_a, group_b and h0.

See also: ttest_table, ttest, ftest!, mannwhitney!

source
SingleCellProjections.ttestMethod
ttest(data::DataMatrix, h1, [group_a], [group_b]; h0, var=:copy, obs=:copy, matrix=:keep, kwargs...)

Performs a t-Test with the given h1 (alternative hypothesis) and h0 (null hypothesis). Examples of t-Tests are Two-Group tests and Linear Regression.

ttest creates a copy of data and adds a t-statistic, a p-value and a difference column to data.var.

See ttest_table and ttest! for usage examples and more details on computations and parameters.

See also: ttest!, ttest_table, ftest, mannwhitney

source
SingleCellProjections.ttest_tableMethod
ttest_table(data::DataMatrix, h1, [group_a], [group_b]; h0, kwargs...)

Performs a t-Test with the given h1 (alternative hypothesis) and h0 (null hypothesis). Examples of t-Tests are Two-Group tests and Linear Regression.

T-tests can be performed on any DataMatrix, but it is almost always recommended to do it directly after transforming the data using e.g. sctransform, logtransform or tf_idf_transform.

Normalization

Do not use ttest_table after normalizing the data using normalize_matrix: ttest_table needs to know about the h0 model (regressed out covariates) for correction computations. Failing to do so can result in incorrect results. If you want to correct for the same covariates, pass them as h0 to ttest_table.

h1 can be:

  • A string specifying a column name of data.obs. Auto-detection determines if the column is categorical (Two-Group) or numerical (linear regression).
    • If group_a and group_b are specified, a Two-Group test between group_a and group_b is performed.
    • If group_a is specified, but not group_b, a Two-Group test between group_a and all other observations is performed.
  • A covariate for more control of how to interpret the values in the column.

ttest_table returns a Dataframe with columns for variable IDs, t-statistics, p-values and differences. For Two-group tests, difference is the difference in mean between the two groups. For linear regression, the difference corresponds to the rate of change.

Supported kwargs are:

  • h0 - Use a non-trivial h0 (null) model. Specified in the same way as h1 above.
  • center=true - Add an intercept to the h0 (null) model.
  • statistic_col="t" - Name of the output column containing the t-statistics. (Set to nothing to remove from output.)
  • pvalue_col="pValue" - Name of the output column containing the p-values. (Set to nothing to remove from output.)
  • difference_col="difference" - Name of the output column containing the differences. (Set to nothing to remove from output.)
  • h1_missing=:skip - One of :skip and :error. If skip, missing values in h1 columns are skipped, otherwise an error is thrown.
  • h0_missing=:error - One of :skip and :error. If skip, missing values in h0 columns are skipped, otherwise an error is thrown.
  • allow_normalized_matrix=false - Set to true to accept running on a DataMatrix that has been normalized.

Examples

Perform a Two-Group t-test between celltypes "Mono" and "DC".

julia> ttest_table(transformed, "celltype", "Mono", "DC")

Perform a Two-Group t-test between celltype "Mono" and all other cells.

julia> ttest_table(transformed, "celltype", "Mono")

Perform a Two-Group t-test between celltypes "Mono" and "DC", while correcting for "fraction_mt" (a linear covariate).

julia> ttest_table(transformed, "celltype", "Mono", "DC")

Perform Linear Regression using the covariate "fraction_mt".

julia> ttest_table(transformed, "fraction_mt")

See also: ttest!, ttest, ftest_table, mannwhitney_table, covariate

source
SingleCellProjections.update_matrixFunction
update_matrix(data::DataMatrix, matrix, model=nothing;
              var::Union{Symbol,String,DataFrame} = "",
              obs::Union{Symbol,String,DataFrame} = "")

Create a new DataMatrix by replacing parts of data with new values. Mostly useful when implementing new ProjectionModels.

  • matrix - the new matrix.
  • model - will be appended to the list of models from data. If set to nothing, the resulting list of models will be empty.

Kwargs:

  • var - One of:
    • :copy - Copy from data.
    • :keep - Share var with data.
    • ::DataFrame - Replace with a new table with variable annotations.
    • prefix::String - Prefix, the new variables will be named prefix1, prefix2, etc.
  • obs See var.
source
SingleCellProjections.ustatistic_singleMethod
ustatistic_single(X, j, groups, n1, n2)

NB: Assumes all sparse non-zeros are positive.

X is a sparse matrix where each column is a variable. j is the current variable. groups is a vector with values: 1 for each sample in group 1, 2 for each sample in group 2 and 0 for samples in neither group. n1 number of elements in group 1 (precomputed from groups) n2 number of elements in group 2 (precomputed from groups)

source
SingleCellProjections.var_coordinatesFunction
var_coordinates(data::DataMatrix)

Returns a matrix with coordinates for the variables. Only available for DataMatrices that have a dual representation (e.g. SVD/PCA).

In the case of SVD (PCA), var_coordinates returns the principal components as unit vectors.

source
SingleCellProjections.var_counts_fraction!Method
var_counts_fraction!(counts::DataMatrix, sub_filter, tot_filter, col; check=true, var=:keep, obs=:keep)

For each observation, compute the fraction of counts that match a specific variable pattern.

  • sub_filter decides which variables are counted.
  • tot_filter decides which variables to include in the total.

kwargs:

  • var - Use this to set var in the ProjectionModel.
  • obs - Use this to set obs in the ProjectionModel. Note that counts.obs is changed in place, regardless of the value of obs.

If check=true, an error will be thrown if no variables match the patterns.

For more information on filtering syntax, see examples below and the documentation on DataFrames.filter.

Examples

Compute the fraction of reads in MT- genes, considering only "Gene Expression" features (and not e.g. "Antibody Capture").

var_counts_fraction!(counts, "name"=>startswith("MT-"), "feature_type"=>isequal("Gene Expression"), "fraction_mt")

Compute the fraction of reads in MT- genes, when there is no feature_type annotation (i.e. all variables are genes).

var_counts_fraction!(counts, "name"=>startswith("MT-"), Returns(true), "fraction_mt")

See also: var_counts_fraction

source
SingleCellProjections.var_counts_fractionMethod
var_counts_fraction(counts::DataMatrix, sub_filter, tot_filter, col; check=true, var=:copy, obs=:copy)

For each observation, compute the fraction of counts that match a specific variable pattern.

  • sub_filter decides which variables are counted.
  • tot_filter decides which variables to include in the total.

kwargs:

  • var - Can be :copy (make a copy of source var) or :keep (share the source var object).
  • obs - Can be :copy (make a copy of source obs) or :keep (share the source obs object).

If check=true, an error will be thrown if no variables match the patterns.

For more information on filtering syntax, see examples below and the documentation on DataFrames.filter.

Examples

Compute the fraction of reads in MT- genes, considering only "Gene Expression" features (and not e.g. "Antibody Capture").

var_counts_fraction(counts, "name"=>startswith("MT-"), "feature_type"=>isequal("Gene Expression"), "fraction_mt")

Compute the fraction of reads in MT- genes, when there is no feature_type annotation (i.e. all variables are genes).

var_counts_fraction(counts, "name"=>startswith("MT-"), Returns(true), "fraction_mt")

See also: var_counts_fraction!

source
SingleCellProjections.var_counts_sum!Method
var_counts_sum!([f=identity], counts::DataMatrix, filter, col; check=true, var=:keep, obs=:keep)

For each observation, compute the sum of counts matching the filter.

If f is specified, it is applied to each element before summing. (Similar to sum.)

kwargs:

  • var - Use this to set var in the ProjectionModel.
  • obs - Use this to set obs in the ProjectionModel. Note that counts.obs is changed in place, regardless of the value of obs.

If check=true, an error will be thrown if no variables match the pattern.

For more information on filtering syntax, see examples below and the documentation on DataFrames.filter.

Examples

Sum all "Gene Expression" counts:

var_counts_sum!(counts, "feature_type"=>isequal("Gene Expression"), "totalRNACount")

Compute the number of "Gene Expression" variables that are expressed (i.e. nonzero):

var_counts_sum!(!iszero, counts, "feature_type"=>isequal("Gene Expression"), "nonzeroRNACount")

See also: var_counts_sum

source
SingleCellProjections.var_counts_sumMethod
var_counts_sum([f=identity], counts::DataMatrix, filter, col; check=true, var=:keep, obs=:keep)

For each observation, compute the sum of counts matching the filter.

If f is specified, it is applied to each element before summing. (Similar to sum.)

kwargs:

  • var - Use this to set var in the ProjectionModel.
  • obs - Use this to set obs in the ProjectionModel. Note that counts.obs is changed in place, regardless of the value of obs.

If check=true, an error will be thrown if no variables match the pattern.

For more information on filtering syntax, see examples below and the documentation on DataFrames.filter.

Examples

Sum all "Gene Expression" counts:

var_counts_sum(counts, "feature_type"=>isequal("Gene Expression"), "totalRNACount")

Compute the number of "Gene Expression" variables that are expressed (i.e. nonzero):

var_counts_sum(!iszero, counts, "feature_type"=>isequal("Gene Expression"), "nonzeroRNACount")

See also: var_counts_sum!

source