Installation

## install from source
## library(devtools)
## devtools::install_github("YY-SONG0718/scOntoMatch")
library(scOntoMatch)
library(ontologyIndex)

Load data

We use the Tabula Muris and Tabula Sapiens Smart-seq2 lung datasets as an example. scOntoMatch works on any number of input datasets. Two demo Seurat objects are included in inst/extdata; for each we sampled two cells per cell type (original annotation), since the focus here is the cell type hierarchy in the two datasets.

metadata = '../inst/extdata/metadata.tsv'

anno_col = 'cell_ontology_class'
onto_id_col = 'cell_ontology_id'

obo_file = '../inst/extdata/cl-basic.obo'
propagate_relationships = c('is_a', 'part_of')
ont <- ontologyIndex::get_OBO(obo_file, propagate_relationships = propagate_relationships)
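To illustrate what propagate_relationships controls, the base R sketch below computes the ancestors of a term while following only selected relation types. The term ids (A to D), the edges table and the ancestors helper are hypothetical and are not part of ontologyIndex or scOntoMatch; this is a toy illustration under those assumptions, not how get_OBO is implemented.

```r
## Toy relation table (hypothetical terms): each row links a child to a parent.
edges <- data.frame(
  child    = c("B", "C", "C"),
  parent   = c("A", "B", "D"),
  relation = c("is_a", "is_a", "part_of"),
  stringsAsFactors = FALSE
)

## Collect all ancestors of `term`, following only the given relation types.
ancestors <- function(term, relations) {
  keep <- edges[edges$relation %in% relations, ]
  found <- character(0)
  frontier <- term
  while (length(frontier) > 0) {
    up <- unique(keep$parent[keep$child %in% frontier])
    frontier <- setdiff(up, found)   # only terms not seen before
    found <- c(found, frontier)
  }
  sort(found)
}

only_is_a <- ancestors("C", "is_a")
with_part_of <- ancestors("C", c("is_a", "part_of"))
print(only_is_a)
print(with_part_of)
```

In this toy graph, adding 'part_of' makes D an ancestor of C in addition to A and B, which is the kind of difference the relation choice makes to the hierarchy.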

List the dataset names and file paths in the first and second columns of a metadata file. Store each Seurat object in RDS format and use getSeuratRds to read them in.
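As a rough sketch of the metadata layout, a two-column tab-separated file could be written from base R as follows. The file paths are hypothetical placeholders, and whether a header row is expected is not stated here, so this sketch assumes none; check the package documentation before relying on it.

```r
## Hypothetical dataset names and paths; replace with your own RDS files.
meta_table <- data.frame(
  name = c("TM_lung", "TS_lung"),
  path = c("path/to/TM_lung.rds", "path/to/TS_lung.rds"),
  stringsAsFactors = FALSE
)

## Assumption: no header row, name in column 1 and path in column 2.
metadata_file <- tempfile(fileext = ".tsv")
write.table(meta_table, file = metadata_file, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)

## Read the file back to confirm the two-column layout.
check <- read.table(metadata_file, sep = "\t", stringsAsFactors = FALSE)
print(check)
```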

obj_list = getSeuratRds(metadata = metadata, sep = "\t")
## start loading seurat rds objects
levels(factor((obj_list$TM_lung@meta.data$cell_ontology_class)))
##  [1] "B cell"                                         
##  [2] "NA"                                             
##  [3] "T cell"                                         
##  [4] "ciliated columnar cell of tracheobronchial tree"
##  [5] "classical monocyte"                             
##  [6] "epithelial cell of lung"                        
##  [7] "leukocyte"                                      
##  [8] "lung endothelial cell"                          
##  [9] "monocyte"                                       
## [10] "myeloid cell"                                   
## [11] "natural killer cell"                            
## [12] "stromal cell"
levels(factor((obj_list$TS_lung@meta.data$cell_ontology_class)))
##  [1] "adventitial cell"                      
##  [2] "alveolar fibroblast"                   
##  [3] "b cell"                                
##  [4] "basal cell"                            
##  [5] "basophil"                              
##  [6] "bronchial smooth muscle cell"          
##  [7] "capillary aerocyte"                    
##  [8] "capillary endothelial cell"            
##  [9] "cd4-positive, alpha-beta t cell"       
## [10] "cd8-positive, alpha-beta t cell"       
## [11] "classical monocyte"                    
## [12] "club cell"                             
## [13] "dendritic cell"                        
## [14] "endothelial cell of artery"            
## [15] "endothelial cell of lymphatic vessel"  
## [16] "fibroblast"                            
## [17] "lung ciliated cell"                    
## [18] "lung microvascular endothelial cell"   
## [19] "macrophage"                            
## [20] "mesothelial cell"                      
## [21] "neutrophil"                            
## [22] "nk cell"                               
## [23] "non-classical monocyte"                
## [24] "pericyte cell"                         
## [25] "plasma cell"                           
## [26] "plasmacytoid dendritic cell"           
## [27] "respiratory goblet cell"               
## [28] "serous cell of epithelium of bronchus" 
## [29] "type i pneumocyte"                     
## [30] "type ii pneumocyte"                    
## [31] "vascular associated smooth muscle cell"
## [32] "vein endothelial cell"
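The two label sets above differ in letter case and wording (for example "B cell" versus "b cell", and "natural killer cell" versus "nk cell"), so comparing raw strings finds little overlap. The base R sketch below uses a handful of labels copied from the listings above; the vector names are only for illustration.

```r
## A few labels copied from the TM_lung and TS_lung listings.
tm_labels <- c("B cell", "T cell", "classical monocyte", "natural killer cell")
ts_labels <- c("b cell", "cd4-positive, alpha-beta t cell",
               "classical monocyte", "nk cell")

## Exact string matching only finds identically written labels.
exact_overlap <- intersect(tm_labels, ts_labels)

## Lower-casing recovers case differences, but not synonyms or
## parent-child relationships between terms.
lower_overlap <- intersect(tolower(tm_labels), tolower(ts_labels))

print(exact_overlap)
print(lower_overlap)
```

Neither comparison relates "T cell" to "cd4-positive, alpha-beta t cell", which is where the ontology hierarchy becomes useful.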

Match ontology annotation

Trim the ontology tree per dataset

Within a single dataset, it is common to find parent-child relationships between cell types. This happens because some cells can be classified into more fine-grained groups, while others are only recognized as the respective parent cell type.

This is not a problem when analyzing individual datasets: we do want to keep those rare, identifiable cell populations distinct. However, it can be a problem when mapping annotations across datasets, since it is unclear which populations the parent term contains in each dataset.

We provide ontoMultiMinimal for merging descendant terms into existing ancestor terms within one dataset, to get a minimal ontology representation of the cell type tree.

Note that trimming the ontology tree is optional, and it is always possible to return to the original annotation later in the analysis.
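The idea of merging descendant terms into ancestor terms that already exist in the same dataset can be sketched in base R on a toy hierarchy. The parent_of map and both helper functions are hypothetical, and the tree is simplified to one parent per term; this illustrates the concept only and is not the implementation of ontoMultiMinimal.

```r
## Toy hierarchy (hypothetical, one parent per term): child -> parent.
parent_of <- c(
  "T cell"       = "leukocyte",
  "B cell"       = "leukocyte",
  "leukocyte"    = "cell",
  "stromal cell" = "cell",
  "cell"         = NA
)

## Ancestors of a term, ordered from nearest to most distant.
ancestors_of <- function(term) {
  out <- character(0)
  current <- parent_of[[term]]
  while (!is.na(current)) {
    out <- c(out, current)
    current <- parent_of[[current]]
  }
  out
}

## Map each label to its most distant ancestor that is itself used as a
## label in the same dataset; labels without such an ancestor are kept.
merge_to_ancestors <- function(labels) {
  present <- unique(labels)
  vapply(labels, function(term) {
    anc <- ancestors_of(term)
    hits <- anc[anc %in% present]
    if (length(hits) > 0) hits[length(hits)] else term
  }, character(1), USE.NAMES = FALSE)
}

labels <- c("T cell", "B cell", "leukocyte", "stromal cell")
merged <- merge_to_ancestors(labels)
print(merged)
```

Here "T cell" and "B cell" collapse into "leukocyte" because that parent term is also used as a label, while "stromal cell" is kept because its only ancestor, "cell", does not occur in the dataset.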

obj_list_minimal = scOntoMatch::ontoMultiMinimal(obj_list = obj_list, ont = ont, anno_col = anno_col, onto_id_col = onto_id_col)
## translate annotation to ontology id
## translating TM_lung
## warning: some cell type annotations do not have corresponding ontology id, consider manual re-annotate
## NA
## translating TS_lung
## warning: some cell type annotations do not have corresponding ontology id, consider manual re-annotate
## alveolar fibroblast, capillary aerocyte, nk cell
## Loading required package: SeuratObject
## Attaching sp
## mapping from name: lung endothelial cell to name: epithelial cell of lung
## mapping from name: classical monocyte to name: myeloid cell
## mapping from name: T cell to name: leukocyte
## mapping from name: B cell to name: leukocyte
## mapping from name: monocyte to name: myeloid cell
## mapping from name: natural killer cell to name: leukocyte
## after matching to base level ontology, TM_lung has cell types:
## NA, ciliated columnar cell of tracheobronchial tree, epithelial cell of lung, leukocyte, myeloid cell, stromal cell
## mapping from name: plasmacytoid dendritic cell to name: dendritic cell
## after matching to base level ontology, TS_lung has cell types:
## adventitial cell, alveolar fibroblast, b cell, basal cell, basophil, bronchial smooth muscle cell, capillary aerocyte, capillary endothelial cell, cd4-positive, alpha-beta t cell, cd8-positive, alpha-beta t cell, classical monocyte, club cell, dendritic cell, endothelial cell of artery, endothelial cell of lymphatic vessel, fibroblast, lung ciliated cell, lung microvascular endothelial cell, macrophage, mesothelial cell, neutrophil, nk cell, non-classical monocyte, pericyte cell, plasma cell, respiratory goblet cell, serous cell of epithelium of bronchus, type i pneumocyte, type ii pneumocyte, vascular associated smooth muscle cell, vein endothelial cell

We can see that some cell types in TS_lung cannot be matched to an ontology term, so manual re-annotation should be considered. We advise always checking the literature before manual curation, to make sure the chosen ontology annotation is the one you want!

obj_list$TS_lung@meta.data[[anno_col]] = as.character(obj_list$TS_lung@meta.data[[anno_col]])

## nk cell can certainly be matched
obj_list$TS_lung@meta.data[which(obj_list$TS_lung@meta.data[[anno_col]] == 'nk cell'), anno_col] = 'natural killer cell'

## type 1 and type 2 alveolar fibroblasts both belong to fibroblast of lung

obj_list$TS_lung@meta.data[which(obj_list$TS_lung@meta.data[[anno_col]] == 'alveolar fibroblast'), anno_col] = 'fibroblast of lung'

## capillary aerocyte is a recently discovered lung-specific cell type, so it is worth keeping as is
## Gillich, A., Zhang, F., Farmer, C.G. et al. Capillary cell-type specialization in the alveolus. Nature 586, 785–789 (2020). https://doi.org/10.1038/s41586-020-2822-7
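When several labels need re-annotation, the same replacements can be expressed as a lookup table. The sketch below works on a plain data frame standing in for the Seurat meta.data slot; the meta data frame and the relabel vector are illustrative names, while the label pairs are the ones used above.

```r
## Stand-in for obj_list$TS_lung@meta.data (hypothetical toy data).
meta <- data.frame(
  cell_ontology_class = factor(c("nk cell", "alveolar fibroblast",
                                 "capillary aerocyte", "nk cell"))
)

## Old label -> curated label; labels not listed here are left unchanged.
relabel <- c(
  "nk cell"             = "natural killer cell",
  "alveolar fibroblast" = "fibroblast of lung"
)

## Convert the factor to character first, then replace matching labels.
labels <- as.character(meta$cell_ontology_class)
hit <- labels %in% names(relabel)
labels[hit] <- unname(relabel[labels[hit]])
meta$cell_ontology_class <- labels
print(meta$cell_ontology_class)
```

Keeping the replacements in one named vector makes the manual curation easy to review alongside the literature that justifies each change.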

Now we can trim again

obj_list_minimal = scOntoMatch::ontoMultiMinimal(obj_list = obj_list, ont = ont, anno_col = anno_col, onto_id_col = onto_id_col)
## translate annotation to ontology id
## translating TM_lung
## warning: some cell type annotations do not have corresponding ontology id, consider manual re-annotate
## NA
## translating TS_lung
## warning: some cell type annotations do not have corresponding ontology id, consider manual re-annotate
## capillary aerocyte
## mapping from name: lung endothelial cell to name: epithelial cell of lung
## mapping from name: classical monocyte to name: myeloid cell
## mapping from name: T cell to name: leukocyte
## mapping from name: B cell to name: leukocyte
## mapping from name: monocyte to name: myeloid cell
## mapping from name: natural killer cell to name: leukocyte
## after matching to base level ontology, TM_lung has cell types:
## NA, ciliated columnar cell of tracheobronchial tree, epithelial cell of lung, leukocyte, myeloid cell, stromal cell
## mapping from name: fibroblast of lung to name: fibroblast
## mapping from name: plasmacytoid dendritic cell to name: dendritic cell
## after matching to base level ontology, TS_lung has cell types:
## adventitial cell, b cell, basal cell, basophil, bronchial smooth muscle cell, capillary aerocyte, capillary endothelial cell, cd4-positive, alpha-beta t cell, cd8-positive, alpha-beta t cell, classical monocyte, club cell, dendritic cell, endothelial cell of artery, endothelial cell of lymphatic vessel, fibroblast, lung ciliated cell, lung microvascular endothelial cell, macrophage, mesothelial cell, natural killer cell, neutrophil, non-classical monocyte, pericyte cell, plasma cell, respiratory goblet cell, serous cell of epithelium of bronchus, type i pneumocyte, type ii pneumocyte, vascular associated smooth muscle cell, vein endothelial cell

Ontology tree for individual dataset

Functions are provided to plot the cell type tree. Before trimming, there are parent-child relationships within both datasets.

## compute the ontology ids once and reuse them for both arguments
tm_ids = names(getOntologyId(obj_list$TM_lung@meta.data[['cell_ontology_class']], ont = ont))
plotOntoTree(ont = ont,
             onts = tm_ids,
             ont_query = tm_ids,
             plot_ancestors = TRUE, roots = 'CL:0000548',
             fontsize = 25)

[Figure: ontology tree of the TM_lung cell types before trimming]

## compute the ontology ids once and reuse them for both arguments
ts_ids = names(getOntologyId(obj_list$TS_lung@meta.data[['cell_ontology_class']], ont = ont))
plotOntoTree(ont = ont,
             onts = ts_ids,
             ont_query = ts_ids,
             plot_ancestors = TRUE, roots = 'CL:0000548',
             fontsize = 25)