Data Matrices
DataMatrix objects (annotated matrices where rows are variables and columns are observations) are central to SingleCellProjections.jl. In other software packages, a DataMatrix is sometimes called an "Assay".
An overview of a DataMatrix is shown when the object is displayed:
julia> data
DataMatrix (33766 variables and 35340 observations)
  SparseMatrixCSC{Int64, Int32}
  Variables: id, feature_type, name, genome, read, pattern, sequence
  Observations: id, sampleName, barcode
Here we see the matrix size (number of variables and observations), a brief description of the matrix contents, and an overview of available variable and observation annotations. The underlined annotation names are the ID columns (see IDs below for more details).
Variables
Variables, or var for short, are typically genes, features (such as CITE-seq features), or variables after dimension reduction (e.g. "UMAP1"). The variables are stored as a DataFrame and can be accessed by:
julia> data.var
33766×7 DataFrame
   Row │ id           feature_type      name         genome  read    pattern  sequence
       │ String       String            String       String  String  String   String
───────┼───────────────────────────────────────────────────────────────────────────────
     1 │ MIR1302-2HG  Gene Expression   MIR1302-2HG  hg19
     2 │ FAM138A      Gene Expression   FAM138A      hg19
     3 │ OR4F5        Gene Expression   OR4F5        hg19
     4 │ AL627309.1   Gene Expression   AL627309.1   hg19
   ⋮   │      ⋮              ⋮               ⋮         ⋮       ⋮        ⋮        ⋮
 33764 │ CD144        Antibody Capture  CD144        hg19
 33765 │ CD202b       Antibody Capture  CD202b       hg19
 33766 │ CD11c        Antibody Capture  CD11c        hg19
                                                                  33759 rows omitted
Observations
Observations, or obs for short, are typically cells, but can in theory be any kind of observation. The observations are stored as a DataFrame and can be accessed by:
julia> data.obs
35340×3 DataFrame
   Row │ id                      sampleName  barcode
       │ String                  String      String
───────┼─────────────────────────────────────────────────────────
     1 │ P1_L1_AAACCCAAGACATACA  P1          L1_AAACCCAAGACATACA
     2 │ P1_L1_AAACCCACATCGGTTA  P1          L1_AAACCCACATCGGTTA
     3 │ P1_L1_AAACCCAGTGGAACAC  P1          L1_AAACCCAGTGGAACAC
     4 │ P1_L1_AAACCCATCTGCGGAC  P1          L1_AAACCCATCTGCGGAC
   ⋮   │           ⋮                 ⋮                ⋮
 35338 │ P2_L5_TTTGTTGTCAACACCA  P2          L5_TTTGTTGTCAACACCA
 35339 │ P2_L5_TTTGTTGTCATGCATG  P2          L5_TTTGTTGTCATGCATG
 35340 │ P2_L5_TTTGTTGTCCGTGCGA  P2          L5_TTTGTTGTCCGTGCGA
                                            35333 rows omitted
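The relationship between the matrix and its two annotation tables can be sketched in plain Julia. This is a toy illustration using only DataFrames.jl and SparseArrays, not the SingleCellProjections.jl API; the names var_annot, obs_annot, counts and all values are made up for the example.

```julia
using DataFrames, SparseArrays

# Toy annotations; names and values are illustrative only.
var_annot = DataFrame(id = ["GeneA", "GeneB", "CD11c"],
                      feature_type = ["Gene Expression", "Gene Expression", "Antibody Capture"])
obs_annot = DataFrame(id = ["P1_AAAC", "P1_AAAG", "P2_TTTG", "P2_TTTC"],
                      sampleName = ["P1", "P1", "P2", "P2"])

# Sparse counts with one row per variable and one column per observation.
counts = sparse([1, 2, 3, 1], [1, 2, 3, 4], [3, 1, 8, 2],
                nrow(var_annot), nrow(obs_annot))
@assert size(counts) == (nrow(var_annot), nrow(obs_annot))

# Subsetting variables selects matrix rows and annotation rows together.
keep = var_annot.feature_type .== "Gene Expression"
gene_counts = counts[keep, :]
gene_annot = var_annot[keep, :]
@assert size(gene_counts, 1) == nrow(gene_annot)
```

The point of the sketch is the invariant: the number of matrix rows always equals the number of variable annotation rows, and the number of matrix columns always equals the number of observation annotation rows.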
IDs
Each variable and each observation must have a unique ID. That is, each row in the DataFrame must be unique when only the ID columns are considered. As seen above, the ID columns are underlined when displaying a DataMatrix. Note that in the package version used to generate this page, accessing the ID column names as fields raises an error:
julia> data.var_id_cols
ERROR: type DataMatrix has no field var_id_cols
julia> data.obs_id_cols
ERROR: type DataMatrix has no field obs_id_cols
Most of the time, IDs are handled automatically by SingleCellProjections.jl. Sometimes, you need to make sure IDs are unique when loading or merging data matrices. In particular, when loading a DataMatrix that should be projected onto another DataMatrix, the user must ensure that the relevant IDs match.
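A check along these lines can be sketched with DataFrames.jl alone. The helper has_unique_ids and the toy tables are hypothetical and not part of SingleCellProjections.jl; the example only illustrates what "unique when considering the ID columns" means, and how one might compare IDs between two data sets before projecting.

```julia
using DataFrames

# Hypothetical helper: true if no two rows of `df` agree on all columns in `cols`.
has_unique_ids(df, cols) = !any(nonunique(df, cols))

# The same barcode can occur in several samples, so barcodes alone are not unique.
obs_annot = DataFrame(sampleName = ["P1", "P1", "P2"],
                      barcode = ["L1_AAAC", "L1_AAAG", "L1_AAAC"])
barcode_ok = has_unique_ids(obs_annot, [:barcode])
pair_ok = has_unique_ids(obs_annot, [:sampleName, :barcode])

# Combining sample name and barcode gives a single unique ID column.
obs_annot.id = obs_annot.sampleName .* "_" .* obs_annot.barcode
id_ok = has_unique_ids(obs_annot, [:id])

# Comparing variable IDs between a base data set and a data set to be projected.
base_var_ids = ["GeneA", "GeneB", "GeneC"]
new_var_ids = ["GeneB", "GeneC", "GeneD"]
shared_ids = intersect(base_var_ids, new_var_ids)
only_in_base = setdiff(base_var_ids, new_var_ids)
```

Here barcode_ok is false while pair_ok and id_ok are true, which mirrors the observation table above where the id column combines the sample name and the barcode.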
Matrix
The matrix can be accessed by data.matrix. Depending on the stage of analysis, different kinds of matrices (or matrix-like objects) are used. Most of this complexity is hidden from the user, but internally SingleCellProjections.jl depends on this functionality to be fast and to reduce memory usage.
SingleCellProjections.jl will reuse matrices when possible, in order to reduce memory usage. For example, normalize_matrix will reuse and extend the Matrix Expression of the source DataMatrix, without creating a copy of the actual data. Exactly when matrices are reused or copied is considered an implementation detail, and can change at any time. Users of SingleCellProjections.jl should thus treat the matrices as "read-only". This should rarely present problems in practice.
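Why shared storage calls for read-only treatment can be illustrated with ordinary Julia arrays. This is a generic sketch using SparseArrays only, with made-up variable names; it does not show how SingleCellProjections.jl shares matrices internally.

```julia
using SparseArrays

counts = sparse([1, 2, 3], [1, 2, 3], [4, 5, 6])

shared = counts          # a second name for the same storage, no copy is made
private = copy(counts)   # independent storage

private[1, 1] = 10       # safe: only the copy changes
@assert counts[1, 1] == 4

shared[2, 2] = 10        # unsafe pattern: the original changes too
@assert counts[2, 2] == 10
```

If a matrix really needs to be modified, working on an explicit copy avoids affecting any other object that may share the same data.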
Roughly, the matrix types used at different stages are:
- Counts: SparseMatrixCSC
- Transformed and normalized data: Matrix Expressions
- SVD (PCA) result: SVD
- ForceLayout/UMAP/t-SNE result: Matrix{Float64}
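The concrete Julia types in this list can be seen with the standard libraries alone. The sketch below is not the package's pipeline: it uses a dense log1p transform as a stand-in for the lazy Matrix Expressions, and simply takes two rows of the SVD factor as a stand-in for a two-dimensional layout.

```julia
using SparseArrays, LinearAlgebra

# Toy counts: 4 variables (rows) by 3 observations (columns), Int32 indices.
rows = Int32[1, 2, 4, 1, 3]
cols = Int32[1, 1, 2, 3, 3]
vals = Int64[5, 1, 2, 7, 3]
counts = sparse(rows, cols, vals, 4, 3)
@assert counts isa SparseMatrixCSC{Int64, Int32}

# Dense stand-in for transformed data (the package uses Matrix Expressions here).
X = log1p.(Matrix{Float64}(counts))

# Singular value decomposition from LinearAlgebra; the result is an SVD object.
F = svd(X)
@assert F isa SVD

# A 2-by-3 dense matrix: two coordinates per observation.
coords = F.Vt[1:2, :]
@assert coords isa Matrix{Float64}
```

Checking typeof(data.matrix) at each stage of a real analysis is a quick way to see which of these representations is currently in use.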