Data Matrices

DataMatrix objects – annotated matrices where rows are variables and columns are observations – are central to SingleCellProjections.jl. In other software packages, a DataMatrix is sometimes called an "Assay".

An overview of a DataMatrix is shown when the object is displayed:

julia> data
DataMatrix (33766 variables and 35340 observations)
  SparseMatrixCSC{Int64, Int32}
  Variables: id, feature_type, name, genome, read, pattern, sequence
  Observations: id, sampleName, barcode

Here we see the matrix size (number of variables and observations), a brief description of the matrix contents, and an overview of available variable and observation annotations. The underlined annotation names are the ID columns (see IDs below for more details).
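
As a rough mental model of this layout, the sketch below builds a toy stand-in using only the Julia standard library. It is not the package's DataMatrix type: the named tuple `toy`, the IDs, the count values, and the use of plain vectors in place of DataFrames are all assumptions made for illustration.

```julia
using SparseArrays

# Toy stand-in for the layout only (NOT the package's DataMatrix type).
# Rows are variables (here: genes), columns are observations (here: cells).
var_ids = ["GeneA", "GeneB", "GeneC"]   # hypothetical variable IDs
obs_ids = ["P1_cell1", "P1_cell2"]      # hypothetical observation IDs

# 3 variables × 2 observations, stored sparsely like a count matrix
counts = sparse(Int32[1, 3, 2], Int32[1, 1, 2], [5, 1, 2], 3, 2)

toy = (matrix = counts, var = (id = var_ids,), obs = (id = obs_ids,))

# One annotation entry per matrix row (variable) and per matrix column (observation)
nvar, nobs = size(toy.matrix)
println("toy matrix ($nvar variables and $nobs observations)")
```

The point of the sketch is only that the number of variable annotations equals the number of matrix rows and the number of observation annotations equals the number of matrix columns.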

Variables

Variables, or var for short, are typically genes, features (such as CITE-seq features) or variables after dimension reduction (e.g. "UMAP1"). The variables are stored as a DataFrame and can be accessed by:

julia> data.var
33766×7 DataFrame
   Row │ id           feature_type      name         genome  read    pattern  sequence
       │ String       String            String       String  String  String   String
───────┼───────────────────────────────────────────────────────────────────────────────
     1 │ MIR1302-2HG  Gene Expression   MIR1302-2HG  hg19
     2 │ FAM138A      Gene Expression   FAM138A      hg19
     3 │ OR4F5        Gene Expression   OR4F5        hg19
     4 │ AL627309.1   Gene Expression   AL627309.1   hg19
   ⋮   │      ⋮              ⋮               ⋮         ⋮       ⋮        ⋮        ⋮
 33764 │ CD144        Antibody Capture  CD144        hg19
 33765 │ CD202b       Antibody Capture  CD202b       hg19
 33766 │ CD11c        Antibody Capture  CD11c        hg19
                                                                     33759 rows omitted

Observations

Observations, or obs for short, are typically cells, but can in theory be any kind of observation. The observations are stored as a DataFrame and can be accessed by:

julia> data.obs
35340×3 DataFrame
   Row │ id                      sampleName  barcode
       │ String                  String      String
───────┼─────────────────────────────────────────────────────────
     1 │ P1_L1_AAACCCAAGACATACA  P1          L1_AAACCCAAGACATACA
     2 │ P1_L1_AAACCCACATCGGTTA  P1          L1_AAACCCACATCGGTTA
     3 │ P1_L1_AAACCCAGTGGAACAC  P1          L1_AAACCCAGTGGAACAC
     4 │ P1_L1_AAACCCATCTGCGGAC  P1          L1_AAACCCATCTGCGGAC
   ⋮   │           ⋮                 ⋮                ⋮
 35338 │ P2_L5_TTTGTTGTCAACACCA  P2          L5_TTTGTTGTCAACACCA
 35339 │ P2_L5_TTTGTTGTCATGCATG  P2          L5_TTTGTTGTCATGCATG
 35340 │ P2_L5_TTTGTTGTCCGTGCGA  P2          L5_TTTGTTGTCCGTGCGA
                                               35333 rows omitted

IDs

Each variable and each observation must have a unique ID; that is, the rows of the DataFrame must be unique when only the ID columns are considered. As seen above, the ID columns are underlined when displaying a DataMatrix. We can also access them directly:

julia> data.var_id_cols
ERROR: type DataMatrix has no field var_id_cols

julia> data.obs_id_cols
ERROR: type DataMatrix has no field obs_id_cols

Most of the time, IDs are handled automatically by SingleCellProjections.jl. Sometimes, you need to make sure IDs are unique when loading or merging data matrices. In particular, when loading a DataMatrix that should be projected onto another DataMatrix, the user must ensure that the relevant IDs match.
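
The two requirements above, uniqueness and matching, can be checked by hand before loading or merging. The following is a minimal sketch using plain vectors and Base Julia functions only; the ID values and variable names are hypothetical, and it does not use or represent any SingleCellProjections.jl function.

```julia
# Hypothetical ID vectors standing in for the ID columns of two data matrices
reference_var_ids = ["GeneA", "GeneB", "GeneC", "GeneD"]
new_var_ids       = ["GeneB", "GeneD", "GeneE"]

# Uniqueness: every ID may occur only once
@assert allunique(reference_var_ids)
@assert allunique(new_var_ids)

# With several ID columns, uniqueness applies to the combination of columns
sample_names = ["P1", "P1", "P2"]
barcodes     = ["AAAC", "AAAG", "AAAC"]   # "AAAC" repeats, but in another sample
combined_unique = allunique(collect(zip(sample_names, barcodes)))

# Matching: which IDs of the new data are present in the reference?
shared = intersect(new_var_ids, reference_var_ids)
missing_in_reference = setdiff(new_var_ids, reference_var_ids)

println("shared: ", shared)
println("not in reference: ", missing_in_reference)
```

In this toy case the barcodes alone are not unique, but the combination of sample name and barcode is, which is why combining columns into one ID can be useful when merging samples.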

Matrix

The matrix can be accessed by data.matrix. Depending on the stage of analysis, different kinds of matrices (or matrix-like objects) are used. Most of this complexity is hidden from the user, but internally SingleCellProjections.jl depends on this functionality to be fast and to reduce memory usage.

Read-only

SingleCellProjections.jl will reuse matrices when possible, in order to reduce memory usage. For example, normalize_matrix will reuse and extend the Matrix Expression of the source DataMatrix, without creating a copy of the actual data. Exactly when matrices are reused rather than copied is considered an implementation detail and can change at any time. Users of SingleCellProjections.jl should thus treat the matrices as "read-only". This should rarely present problems in practice.
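
To see why shared storage calls for read-only treatment, consider the sketch below. It uses a plain sparse matrix from the standard library rather than a DataMatrix, and the variable names are hypothetical; it only illustrates the general Julia behavior that assignment does not copy, so changes made through one name are visible through the other.

```julia
using SparseArrays

# Toy count matrix (3 variables × 2 observations)
original = sparse([1, 2, 3], [1, 1, 2], [4, 1, 7], 3, 2)

shared  = original         # same object: no data is copied
private = copy(original)   # independent copy with its own storage

private[1, 1] = 100        # safe: only the copy changes
# shared[1, 1] = 100       # would also change `original`, since both names refer to the same matrix

println(original[1, 1])    # still 4
```

If matrix values ever need to be modified, working on an explicit copy avoids silently changing data that other objects may still refer to.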

Roughly, the matrix types used at different stages are:

  1. Counts - SparseMatrixCSC
  2. Transformed and normalized data - Matrix Expressions
  3. SVD (PCA) result - SVD
  4. ForceLayout/UMAP/t-SNE result - Matrix{Float64}
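
The progression in the list above can be imitated with standard library types. This is a sketch under stated assumptions, not the package's pipeline: a dense matrix stands in for the package's Matrix Expressions, the log transform and centering are generic choices, and the final step simply scales the first two right singular vectors instead of running ForceLayout, UMAP, or t-SNE.

```julia
using SparseArrays, LinearAlgebra

# 1. Counts: sparse integer matrix (4 variables × 5 observations, toy values)
counts = sparse(Int32[1, 2, 2, 3, 4, 4, 1], Int32[1, 1, 2, 3, 4, 5, 5],
                [3, 1, 2, 5, 1, 4, 2], 4, 5)

# 2. Transformed and normalized data: a dense matrix is used here as a
#    stand-in for the package's Matrix Expressions (assumption for illustration)
X = log1p.(Matrix(counts))
X = X .- sum(X; dims=2) ./ size(X, 2)   # center each variable (row)

# 3. SVD (PCA) result
F = svd(X)

# 4. Low-dimensional coordinates, one column per observation
embedding = F.S[1:2] .* F.Vt[1:2, :]

println(typeof(counts))
println(typeof(F))
println(typeof(embedding))
```

Note that step 2 makes the data dense in this sketch, which is what the package avoids by reusing and extending matrices as described earlier.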