dsb (denoised and scaled by background) is a lightweight R package developed in John Tsang’s Lab (NIH-NIAID) for removing noise and normalizing protein data from single cell methods such as CITE-seq, REAP-seq, and Mission Bio Tapestri. See the dsb Biorxiv preprint for details on the method and please consider citing the paper if you use this package or find the protein noise modeling results useful.
This vignette outlines how to use dsb to normalize droplet-based single cell protein data from antibody counts and how to integrate the normalized values in an analysis workflow using Seurat, as well as integration with Bioconductor and Scanpy. If you have a question please first see the FAQ section then open an issue on the github site https://github.com/niaid/dsb/ so the community can benefit.
CITE-seq protein data suffers from substantial background noise (for example, see supplementary fig 5a in Stoeckius et. al. 2017 Nat. Methods). We performed experiments and analysis to dissect this noise and the dsb method is based on 3 key findings outlined in our paper.
Based on unstained control experiments and modeling we found a major source of protein background noise comes from ambient, unbound antibody encapsulated in droplets.
“Empty” droplets (containing ambient mRNA and antibody but no cell), outnumber cell-containing droplets by 10-100 fold and capture the ambient component of protein background noise.
Cell-to-cell technical variations such as stochastic differences in cell lysis/capture, RT efficiency, sequencing depth, and non-specific antibody binding can be estimated and removed by defining the “technical component” of each cell’s protein library.
DSBNormalizeProtein() function to
1) normalize raw protein counts of a cells (columns) by proteins (rows) matrix
cells_citeseq_mtx estimating the ambient component of noise with
empty_drop_citeseq_mtx, a matrix of background/empty droplets. 2) model and remove the ‘technical component’ of each cell’s protein library by setting
denoise.counts = TRUE and include isotype controls in the calculation of the technical component with
use.isotype.control = TRUE.
# install dsb package library(dsb) adt_norm = DSBNormalizeProtein( # remove ambient protien noise reflected in counts from empty droplets cell_protein_matrix = cells_citeseq_mtx, # cell-containing droplet raw protein count matrix empty_drop_matrix = empty_drop_citeseq_mtx, # empty/background droplet raw protein counts # recommended step II: model and remove the technical component of each cell's protein library denoise.counts = TRUE, # model and remove each cell's technical component use.isotype.control = TRUE, # use isotype controls to define the technical component isotype.control.name.vec = rownames(cells_citeseq_mtx)[67:70] # vector of isotype control names )
Download RAW (not filtered!) feature / cellmatrix raw from public 10X Genomics CITE-seq data here. The tutorial below uses use R 3.6. We emphasize normalized protein data and raw RNA data from this workflow at step III can be used with Seurat, Bioconductor or Python’s AnnData class in Scanpy. We use a convenience function from Seurat
Read10X to load the raw data. We then provide a suggested workflow after normalization with dsb based on the CITE-seq analysis used in our paper on baseline immune system states: Kotliarov et. al. 2020 Nat. Medicine.
Please see multiplexing experiments section if your experiment used superloading/demultiplexing and FAQ section for guidance if you have multiple 10X lanes and or batches.
Below we load the raw output from the Cell Ranger count alignment. The raw output is a sparse matrix of possible cell barcodes vs proteins / mRNA. The number of cell barcodes ranges 500k-6M depending on the kit/chemistry version.
library(dsb) library(Seurat) # Seurat Version 3 used below for the Read10X function library(tidyverse) # used for ggplot and %>% (tidyverse is not a dsb dependency) # read raw data using the Seurat function "Read10X" raw = Read10X("data/10x_data/10x_pbmc5k_V3/raw_feature_bc_matrix/") # Read10X formats the output as a list for the ADT and RNA assays: split this list prot = raw$`Antibody Capture` rna = raw$`Gene Expression` # create a metadata dataframe of simple qc stats for each droplet rna_size = log10(Matrix::colSums(rna)) prot_size = log10(Matrix::colSums(prot)) ngene = Matrix::colSums(rna > 0) mtgene = grep(pattern = "^MT-", rownames(rna), value = TRUE) propmt = Matrix::colSums(rna[mtgene, ]) / Matrix::colSums(rna) md = as.data.frame(cbind(propmt, rna_size, ngene, prot_size)) md$bc = rownames(md)
Now that we loaded the raw output, we define cells and background droplets. Methods for defining droplets depend on the experiment design. The number of droplets you define as cell-containing should be in line with the number of cells loaded and the expected cell recovery. In this experiment, 5000 cells were loaded in a single 10X lane. In the protein library size histogram below, large peaks < 1 and ~2 contain > 30,000 and 60,000 drops respectively. The smaller peaks > 3 (for both mRNA and protein) contain ~5000 drops and therefore the rightmost peaks > 3 are the cell-containing droplets we next perform quality control as in any single cell preprocessing workflow. Again, see multiplexing experiments section and FAQ if you have many lanes / batches and or superloading.
p1 = ggplot(md[md$rna_size > 0, ], aes(x = rna_size)) + geom_histogram(fill = "dodgerblue") + ggtitle("RNA library size \n distribution") p2 = ggplot(md[md$prot_size> 0, ], aes(x = prot_size)) + geom_density(fill = "firebrick2") + ggtitle("Protein library size \n distribution") cowplot::plot_grid(p1, p2, nrow = 1)
Below we define background droplets as the major peak in the background distribution between 1.4 and 2.5 log total protein counts-note one could also define the empty/ background droplets as the entire population from 0 to 2.5 with little impact on normalized values (see the dsb paper section on sensitivity analysis to different definitions of background for details). In addition, we add mRNA quality control-based filters to remove potential low quality cells from both the background drops and the cells as you would in any scRNAseq workflow (e.g. see Seurat tutorials and Luecken et. al. 2019 Mol Syst Biol).
# define a vector of background / empty droplet barcodes based on protein library size and mRNA content background_drops = md[md$prot_size > 1.4 & md$prot_size < 2.5 & md$ngene < 80, ]$bc negative_mtx_rawprot = prot[ , background_drops] %>% as.matrix() # define a vector of cell-containing droplet barcodes based on protein library size and mRNA content positive_cells = md[md$prot_size > 2.8 & md$ngene < 3000 & md$ngene > 200 & propmt <0.2, ]$bc cells_mtx_rawprot = prot[ , positive_cells] %>% as.matrix()
Check: are the number of cells in line with the expected recovery from the experiment?
Yes. After quality control above we have 3481 cells which is in line with the 5000 cells loaded in this experiment and closely matches the estimated cells from the filtered output from cell ranger (not shown). The protein library size distribution of the cells (orange) and background droplets (blue) used for normalization are highlighted below. >67,000 negative droplets are used to estimate the ambient background for normalizing the 3,481 single cells. (see FAQ for code to generate this plot below)
If you have isotype control proteins, set
denoise.counts = TRUE and
use.isotype.control = TRUE and provide a vector containing names of isotype control proteins (the rownames of the protein matrix that are isotype controls). For example, in this experiment, rows 30:32 of the data are the isotype control proteins so we set
isotype.control.name.vec = rownames(cells_mtx_rawprot)[30:32]. If you don’t have Isotype controls see the section ‘Quick overview for experiments without isotype controls’.
#normalize protein data for the cell containing droplets with the dsb method. dsb_norm_prot = DSBNormalizeProtein( cell_protein_matrix = cells_mtx_rawprot, # cell containing droplets empty_drop_matrix = negative_mtx_rawprot, # estimate ambient noise with the background drops denoise.counts = TRUE, # model and remove each cell's technical component use.isotype.control = TRUE, # use isotype controls to define the technical component isotype.control.name.vec = rownames(cells_mtx_rawprot)[30:32] # names of isotype control abs )
The function returns a matrix of normalized protein values which can be integrated with any single cell analysis software. We provide an example with Seurat, Bioconductor and Scanpy below.
# filter raw protein, RNA and metadata to only include cell-containing droplets cells_rna = rna[ ,positive_cells] md = md[positive_cells, ] # create Seurat object with cell-containing drops (min.cells is a gene filter, not a cell filter) s = Seurat::CreateSeuratObject(counts = cells_rna, meta.data = md, assay = "RNA", min.cells = 20) # add DSB normalized "dsb_norm_prot" protein data to a assay called "CITE" created in step II s[["CITE"]] = Seurat::CreateAssayObject(data = dsb_norm_prot)
This object can be used in downstream analysis using Seurat.
This is similar to the workflow used in our paper Kotliarov et. al. 2020 Nat. Medicine where we analyzed mRNA states within interpretable clusters defined by dsb normalized protein data. We first run spectral clustering using Seurat directly on the dsb normalized protein values without reducing dimensionality of the cells x protein matrix with PCA.
# define Euclidean distance matrix on dsb normalized protein data (without isotype controls) dsb = s@assays$CITE@data[1:29, ] p_dist = dist(t(dsb)) p_dist = as.matrix(p_dist) # Cluster using Seurat s[["p_dist"]] = Seurat::FindNeighbors(p_dist)$snn s = Seurat::FindClusters(s, resolution = 0.5, graph.name = "p_dist")
A heatmap of average dsb normalized values in each cluster help in annotating clusters results. The values for each cell represent the number of standard deviations of each protein from the expected noise from reflected by the protein’s distribution in empty droplets, +/- the residual of the fitted model to the cell-intrinsic technical component.
# calculate the average of each protein separately for each cluster prots = rownames(s@assays$CITE@data) adt_plot = adt_data %>% group_by(seurat_clusters) %>% summarize_at(.vars = prots, .funs = mean) %>% column_to_rownames("seurat_clusters") # plot a heatmap of the average dsb normalized values for each cluster pheatmap::pheatmap(t(adt_plot), color = viridis::viridis(25, option = "B"), fontsize_row = 8, border_color = NA)