Using dsb to normalize single cell protein data: analysis workflow and integration with Seurat, Bioconductor and Scanpy

Matt Mulè

dsb (denoised and scaled by background) is a lightweight R package developed in John Tsang’s Lab (NIH-NIAID) for removing noise and normalizing protein data from single cell methods such as CITE-seq, REAP-seq, and Mission Bio Tapestri. See the dsb Biorxiv preprint for details on the method and please consider citing the paper if you use this package or find the protein noise modeling results useful.

This vignette outlines how to use dsb to normalize droplet-based single cell protein data from antibody counts and how to integrate the normalized values in an analysis workflow using Seurat, as well as integration with Bioconductor and Scanpy. If you have a question please first see the FAQ section then open an issue on the github site so the community can benefit.

Background and motivation for the dsb method

CITE-seq protein data suffers from substantial background noise (for example, see supplementary fig 5a in Stoeckius et. al. 2017 Nat. Methods). We performed experiments and analysis to dissect this noise and the dsb method is based on 3 key findings outlined in our paper.

  1. Based on unstained control experiments and modeling we found a major source of protein background noise comes from ambient, unbound antibody encapsulated in droplets.

  2. “Empty” droplets (containing ambient mRNA and antibody but no cell), outnumber cell-containing droplets by 10-100 fold and capture the ambient component of protein background noise.

  3. Cell-to-cell technical variations such as stochastic differences in cell lysis/capture, RT efficiency, sequencing depth, and non-specific antibody binding can be estimated and removed by defining the “technical component” of each cell’s protein library.

Installation and quick overview with pre-loaded package data

Use the DSBNormalizeProtein() function to
1) normalize raw protein counts of a cells (columns) by proteins (rows) matrix cells_citeseq_mtx estimating the ambient component of noise with empty_drop_citeseq_mtx, a matrix of background/empty droplets. 2) model and remove the ‘technical component’ of each cell’s protein library by setting denoise.counts = TRUE and include isotype controls in the calculation of the technical component with use.isotype.control = TRUE.

# install dsb package

adt_norm = DSBNormalizeProtein(
  # remove ambient protien noise reflected in counts from empty droplets 
  cell_protein_matrix = cells_citeseq_mtx, # cell-containing droplet raw protein count matrix
  empty_drop_matrix = empty_drop_citeseq_mtx, # empty/background droplet raw protein counts
  # recommended step II: model and remove the technical component of each cell's protein library
  denoise.counts = TRUE, # model and remove each cell's technical component
  use.isotype.control = TRUE, # use isotype controls to define the technical component = rownames(cells_citeseq_mtx)[67:70] # vector of isotype control names

Tutorial with public 10X genomics data

Download RAW (not filtered!) feature / cellmatrix raw from public 10X Genomics CITE-seq data here. The tutorial below uses use R 3.6. We emphasize normalized protein data and raw RNA data from this workflow at step III can be used with Seurat, Bioconductor or Python’s AnnData class in Scanpy. We use a convenience function from Seurat Read10X to load the raw data. We then provide a suggested workflow after normalization with dsb based on the CITE-seq analysis used in our paper on baseline immune system states: Kotliarov et. al. 2020 Nat. Medicine.

Please see multiplexing experiments section if your experiment used superloading/demultiplexing and FAQ section for guidance if you have multiple 10X lanes and or batches.

Step Ia: load raw count alignment (e.g. Cell Ranger) output and define cell metadata variables

Below we load the raw output from the Cell Ranger count alignment. The raw output is a sparse matrix of possible cell barcodes vs proteins / mRNA. The number of cell barcodes ranges 500k-6M depending on the kit/chemistry version.

Step Ib: Quality control and defining cell-containing and background droplets

Now that we loaded the raw output, we define cells and background droplets. Methods for defining droplets depend on the experiment design. The number of droplets you define as cell-containing should be in line with the number of cells loaded and the expected cell recovery. In this experiment, 5000 cells were loaded in a single 10X lane. In the protein library size histogram below, large peaks < 1 and ~2 contain > 30,000 and 60,000 drops respectively. The smaller peaks > 3 (for both mRNA and protein) contain ~5000 drops and therefore the rightmost peaks > 3 are the cell-containing droplets we next perform quality control as in any single cell preprocessing workflow. Again, see multiplexing experiments section and FAQ if you have many lanes / batches and or superloading.

Below we define background droplets as the major peak in the background distribution between 1.4 and 2.5 log total protein counts-note one could also define the empty/ background droplets as the entire population from 0 to 2.5 with little impact on normalized values (see the dsb paper section on sensitivity analysis to different definitions of background for details). In addition, we add mRNA quality control-based filters to remove potential low quality cells from both the background drops and the cells as you would in any scRNAseq workflow (e.g. see Seurat tutorials and Luecken et. al. 2019 Mol Syst Biol).

Check: are the number of cells in line with the expected recovery from the experiment?

[1] 3481

Yes. After quality control above we have 3481 cells which is in line with the 5000 cells loaded in this experiment and closely matches the estimated cells from the filtered output from cell ranger (not shown). The protein library size distribution of the cells (orange) and background droplets (blue) used for normalization are highlighted below. >67,000 negative droplets are used to estimate the ambient background for normalizing the 3,481 single cells. (see FAQ for code to generate this plot below)

Step II: normalize protein data with the DSBNormalizeProtein Function.

If you have isotype control proteins, set denoise.counts = TRUE and use.isotype.control = TRUE and provide a vector containing names of isotype control proteins (the rownames of the protein matrix that are isotype controls). For example, in this experiment, rows 30:32 of the data are the isotype control proteins so we set = rownames(cells_mtx_rawprot)[30:32]. If you don’t have Isotype controls see the section ‘Quick overview for experiments without isotype controls’.

The function returns a matrix of normalized protein values which can be integrated with any single cell analysis software. We provide an example with Seurat, Bioconductor and Scanpy below.

Integrated workflow with Seurat

Step III (option a) Integrate with Seurat

This object can be used in downstream analysis using Seurat.

Suggested Step IV: Protein based clustering + cluster annotation

This is similar to the workflow used in our paper Kotliarov et. al. 2020 Nat. Medicine where we analyzed mRNA states within interpretable clusters defined by dsb normalized protein data. We first run spectral clustering using Seurat directly on the dsb normalized protein values without reducing dimensionality of the cells x protein matrix with PCA.

Suggested Step V: cluster annotation based on average dsb normalized protein values

A heatmap of average dsb normalized values in each cluster help in annotating clusters results. The values for each cell represent the number of standard deviations of each protein from the expected noise from reflected by the protein’s distribution in empty droplets, +/- the residual of the fitted model to the cell-intrinsic technical component.