CB2 Tutorial

Hyun-Hwan Jeong

2019-02-19

CRISPRBetaBinomial (CB2) is a package for CRISPR pooled screen data analysis. It provides a statistical hypothesis test for robust target identification and an accurate mapping algorithm for quantifying sgRNA abundances, while keeping the number of required parameters to a minimum. This document shows how to use CB2 to analyze CRISPR pooled screen data.

First, load the CB2 package with library(). It is also helpful to load the following packages:

library(CB2)
library(magrittr)
library(glue)
library(tibble)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:glue':
## 
##     collapse
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

CB2 provides three basic functions. The first quantifies sgRNA counts from NGS samples. It requires a library file (.fasta or .fa) and a list of sample files (.fastq). The library file must contain an annotation for each sgRNA in the library used in the screen. An sgRNA annotation consists of a barcode sequence (the 20 nt sequence that the sgRNA targets) and the name of the gene that the sgRNA is supposed to target.

Here is an example of loading the data for a screen analysis. The files used in the example are included in the CB2 package.

# load the file path of the annotation file.
FASTA <- system.file("extdata", "toydata",
                     "small_sample.fasta",
                     package = "CB2")
system("tail -6 {FASTA}" %>% glue)

The first two lines of the annotation file describe the first sgRNA, the next two lines describe the second sgRNA, and so on. The first line of an annotation is formatted as ><genename>_<id>, where <genename> is the symbol of the gene targeted by the sgRNA and <id> is a unique identifier for the sgRNA. <genename>_<id> is the complete identifier of the sgRNA, and a complete identifier must not appear more than once. The second line of an annotation is the 20 nt sequence, which indicates the part of the target gene that the sgRNA targets.

The first annotation shows that the library contains an sgRNA whose identifier is RAB_3. This sgRNA is supposed to target the RAB gene, and the intended target locus is CTGTAGAAGCTACATCGGCT.
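The annotation layout described above can be parsed with a few lines of base R. The following is a minimal sketch on an in-memory toy annotation, not CB2's own parser: RAB_3 and its sequence are taken from the text, while the second record (GENEX_1) is made up for illustration.

```r
# Toy annotation in the two-lines-per-sgRNA layout described above.
# RAB_3 and its sequence come from the text; the second record is hypothetical.
fasta_lines <- c(
  ">RAB_3",
  "CTGTAGAAGCTACATCGGCT",
  ">GENEX_1",                # hypothetical sgRNA identifier
  "ACGTACGTACGTACGTACGT"     # hypothetical 20 nt sequence
)

is_header <- startsWith(fasta_lines, ">")
sgrna_id  <- sub("^>", "", fasta_lines[is_header])
barcode   <- fasta_lines[!is_header]

# Split "<genename>_<id>" at the last underscore, so that gene names
# which themselves contain "_" stay intact.
annotation <- data.frame(
  sgRNA    = sgrna_id,
  gene     = sub("_[^_]*$", "", sgrna_id),
  id       = sub("^.*_", "", sgrna_id),
  sequence = barcode,
  stringsAsFactors = FALSE
)

# Checks implied by the format: unique identifiers and 20 nt barcodes.
stopifnot(!anyDuplicated(annotation$sgRNA),
          all(nchar(annotation$sequence) == 20))
annotation
```

Splitting at the last underscore is a design choice of this sketch; the text only states that the header has the form <genename>_<id>.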

We also have an example of the NGS sample file. The following snippet will display the contents of an NGS sample file.

FASTQ <- system.file("extdata", "toydata",
                     "Base1.fastq",
                     package = "CB2")
system("head -8 {FASTQ}" %>% glue)

The NGS sample file contains multiple reads, and each read consists of four consecutive lines. The first line is the ID of the read. The second line is the sequence of the read; we assume that the read contains the nucleotide sequence of an sgRNA as a substring. The third line contains only ‘+’, and the last line gives the quality of each nucleotide of the read (Phred quality score).

Let’s start the analysis. Before running it, we list the FASTQ files available in the toydata example.
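To make the four-line record layout concrete, here is a small base R sketch that splits FASTQ lines into reads and decodes the quality string. The read below is made up (it is not taken from Base1.fastq), and the decoding assumes the common Phred+33 ASCII encoding, which the text does not specify.

```r
# One made-up read in the four-line FASTQ layout.
fastq_lines <- c(
  "@read_1",
  "TTGACTGTAGAAGCTACATCGGCTGTTT",
  "+",
  paste0(strrep("I", 24), strrep("5", 4))
)

# Every read occupies exactly four lines.
stopifnot(length(fastq_lines) %% 4 == 0)
first <- seq(1, length(fastq_lines), by = 4)

reads <- data.frame(
  id       = sub("^@", "", fastq_lines[first]),
  sequence = fastq_lines[first + 1],
  quality  = fastq_lines[first + 3],
  stringsAsFactors = FALSE
)

# Decode per-base Phred scores, assuming Phred+33 encoding.
phred <- lapply(reads$quality, function(q) utf8ToInt(q) - 33L)
mean_quality <- vapply(phred, mean, numeric(1))

# The read contains the RAB_3 barcode from the annotation as a substring.
has_rab3 <- grepl("CTGTAGAAGCTACATCGGCT", reads$sequence, fixed = TRUE)
```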

ex_path <- system.file("extdata", "toydata", package = "CB2")
Sys.glob("{ex_path}/*.fastq" %>% glue) %>% basename()
## [1] "Base1.fastq" "Base2.fastq" "High1.fastq" "High2.fastq" "Low1.fastq" 
## [6] "Low2.fastq"

From the output above, we can see that the example data contain three groups (Base, High, and Low), each with two replicates. We will perform an analysis between Base and High. The first step is to create a design matrix. The code below shows how to build it.

df_design <- tribble(
  ~group, ~sample_name,
  "Base", "Base1",  
  "Base", "Base2", 
  "High", "High1",
  "High", "High2"
) %>% mutate(
  fastq_path = glue("{ex_path}/{sample_name}.fastq")
)

df_design
## # A tibble: 4 x 3
##   group sample_name fastq_path                                             
##   <chr> <chr>       <S3: glue>                                             
## 1 Base  Base1       /private/var/folders/v4/006474k92950c1m5lvg4nxn80000gp…
## 2 Base  Base2       /private/var/folders/v4/006474k92950c1m5lvg4nxn80000gp…
## 3 High  High1       /private/var/folders/v4/006474k92950c1m5lvg4nxn80000gp…
## 4 High  High2       /private/var/folders/v4/006474k92950c1m5lvg4nxn80000gp…

df_design contains three columns, and each row describes one sample. The group column is the group the sample belongs to, and sample_name is a convenient name for the sample. fastq_path is the location of the NGS sample file.

After creating df_design, we can run the sgRNA quantification by calling run_sgrna_quant.
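Before quantification, it can be worth checking that every fastq_path in the design actually exists and that each group has the expected number of replicates. The sketch below uses base R only and a temporary directory with empty placeholder files as a stand-in for the toydata directory; the directory name and the deliberately missing High2 file are assumptions made for illustration.

```r
# Hypothetical stand-in for the toydata directory: empty placeholder files.
ex_dir <- file.path(tempdir(), "cb2_design_demo")
dir.create(ex_dir, showWarnings = FALSE)
file.create(file.path(ex_dir, c("Base1.fastq", "Base2.fastq", "High1.fastq")))

# Same three columns as df_design, built as a plain data frame.
design <- data.frame(
  group       = c("Base", "Base", "High", "High"),
  sample_name = c("Base1", "Base2", "High1", "High2"),
  stringsAsFactors = FALSE
)
design$fastq_path <- file.path(ex_dir, paste0(design$sample_name, ".fastq"))

# Samples whose FASTQ file is missing (High2 was deliberately not created).
missing_samples <- design$sample_name[!file.exists(design$fastq_path)]

# Number of replicates per group.
replicates <- table(design$group)
```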

cb2_count <- run_sgrna_quant(FASTA, df_design)
head(cb2_count$count)
##           Base1 Base2 High1 High2
## RAB_4        24     5    12     3
## NEGCTRL_3     4    27     0     0
## TRIM28_3     24    44     2     4
## POSCTRL_3    48     9     0     0
## TRIM28_2     14    25     1     1
## RAB_2        35     9    35     9

After running run_sgrna_quant, we have a data frame (cb2_count$count) and a numeric vector (cb2_count$total). The data frame contains the sgRNA counts for each sample, and the numeric vector contains the number of reads in each sample. In the data frame, each row corresponds to an sgRNA and each column to a sample. Each value is the read count for the corresponding sgRNA and sample, that is, how many reads from the sample file were aligned to the sgRNA. We assume that this number approximates the number of cells in which the target gene of the sgRNA has been knocked out.

head(cb2_count$total)
## Base1 Base2 High1 High2 
##   688   608   703   659
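To illustrate what a per-sgRNA read count means, the sketch below counts reads by naive exact substring matching in base R. This is not CB2's mapping algorithm, only an illustration of the assumption that a read contains the sgRNA sequence as a substring. RAB_3 is from the annotation example; GENEX_1 and all reads are made up.

```r
# Library of sgRNA barcodes; RAB_3 is from the text, GENEX_1 is hypothetical.
library_seqs <- c(
  RAB_3   = "CTGTAGAAGCTACATCGGCT",
  GENEX_1 = "ACGTACGTACGTACGTACGT"
)

# Made-up reads: two contain RAB_3, one contains GENEX_1, one matches nothing.
reads <- c(
  "TTGACTGTAGAAGCTACATCGGCTGTTT",
  "CTGTAGAAGCTACATCGGCTAAAA",
  "GGACGTACGTACGTACGTACGTCC",
  "TTTTTTTTTTTTTTTTTTTTTTTT"
)

# For each barcode, count the reads that contain it as an exact substring.
counts <- vapply(
  library_seqs,
  function(s) sum(grepl(s, reads, fixed = TRUE)),
  integer(1)
)

# Total number of reads in the sample, mapped or not.
total <- length(reads)
```

Here counts plays the role of one column of cb2_count$count and total the role of one entry of cb2_count$total.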

We can also look up the CPM (counts per million mapped reads) using get_CPM.

get_CPM(cb2_count$count)
##               Base1     Base2      High1      High2
## RAB_4     34883.721  8223.684  17069.701   4552.352
## NEGCTRL_3  5813.953 44407.895      0.000      0.000
## TRIM28_3  34883.721 72368.421   2844.950   6069.803
## POSCTRL_3 69767.442 14802.632      0.000      0.000
## TRIM28_2  20348.837 41118.421   1422.475   1517.451
## RAB_2     50872.093 14802.632  49786.629  13657.056
## TRIM28_4  20348.837 67434.211   1422.475   3034.901
## PARK2_4   34883.721 52631.579 149359.886 212443.096
## PARK2_2   56686.047 42763.158 120910.384  84977.238
## NEGCTRL_4  8720.930 18092.105      0.000   1517.451
## POSCTRL_5 58139.535 46052.632      0.000      0.000
## NEGCTRL_5 40697.674 23026.316      0.000      0.000
## NEGCTRL_2 34883.721  9868.421   1422.475      0.000
## RAB_1     68313.953 23026.316 133712.660  42488.619
## POSCTRL_1 66860.465 77302.632  12802.276  13657.056
## NEGCTRL_1 63953.488 19736.842      0.000      0.000
## RAB_5     21802.326 34539.474  42674.253  63732.929
## RAB_3     10174.419 34539.474  29871.977  95599.393
## POSCTRL_4 52325.581 27960.526   2844.950   1517.451
## TRIM28_5  61046.512 31250.000   1422.475      0.000
## TRIM28_1  33430.233 77302.632      0.000      0.000
## PARK2_1   36337.209  6578.947 172119.488  28831.563
## PARK2_3   26162.791 70723.684  82503.556 210925.645
## PARK2_5   58139.535 74013.158 174964.438 209408.194
## POSCTRL_2 30523.256 67434.211   2844.950   6069.803

There are four functions we can use to check the quality of the input data. The first function (plot_count_distribution) plots the distribution of sgRNA counts (here, CPM values) for each sample.

plot_count_distribution(cb2_count$count %>% get_CPM, df_design)
## Warning: Removed 15 rows containing non-finite values (stat_density).
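The warning is likely harmless. The CPM table above contains exactly 15 zero entries, and the logarithm of zero is not finite, so the 15 removed rows would be consistent with a log-scaled density plot. That the plot uses a log transform is an assumption here, not something the output confirms. A minimal sketch of the effect, using the High1 CPM values of the first six sgRNAs:

```r
# High1 CPM values for the first six sgRNAs in the table above.
cpm_high1 <- c(RAB_4 = 17069.701, NEGCTRL_3 = 0, TRIM28_3 = 2844.950,
               POSCTRL_3 = 0, TRIM28_2 = 1422.475, RAB_2 = 49786.629)

# Zeros become -Inf on a log scale and would be dropped by a density estimate.
log_cpm <- log10(cpm_high1)
n_non_finite <- sum(!is.finite(log_cpm))

# Adding a pseudo-count of 1 (an arbitrary choice) keeps every value finite.
log_cpm_pseudo <- log10(cpm_high1 + 1)
```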

We can also check the mappability (the proportion of all reads that were successfully aligned to an sgRNA in the library) using the calc_mappability function.

calc_mappability(cb2_count, df_design)
## # A tibble: 4 x 3
##   group sample_name mappability
##   <chr> <chr>             <dbl>
## 1 Base  Base1               100
## 2 Base  Base2               100
## 3 High  High1               100
## 4 High  High2               100
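The definition of mappability can be written out directly. The sketch below assumes the value is reported as a percentage, which the output of 100 suggests, and that every read in the toy data was mapped. The last line uses a hypothetical sample to show a value below 100.

```r
# Total reads per sample, copied from cb2_count$total.
total_reads <- c(Base1 = 688, Base2 = 608, High1 = 703, High2 = 659)

# Assumption: every read was mapped, as the reported mappability of 100 implies.
mapped_reads <- total_reads

# Mappability as a percentage of all reads.
mappability <- 100 * mapped_reads / total_reads

# Hypothetical sample in which only 450 of 600 reads map to the library.
hypothetical <- 100 * 450 / 600
```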

plot_PCA is another way of checking data quality: it draws a principal component analysis (PCA) plot of the samples.

plot_PCA(cb2_count$count %>% get_CPM, df_design)
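For readers who want to see the numbers behind such a plot, here is a sketch of a sample-level PCA with base R's prcomp. It uses only the six sgRNAs printed by head(cb2_count$count), and the log2(CPM + 1) transform is an assumption; plot_PCA may preprocess the data differently.

```r
# Counts for the first six sgRNAs, copied from head(cb2_count$count).
counts <- rbind(
  RAB_4     = c(24,  5, 12, 3),
  NEGCTRL_3 = c( 4, 27,  0, 0),
  TRIM28_3  = c(24, 44,  2, 4),
  POSCTRL_3 = c(48,  9,  0, 0),
  TRIM28_2  = c(14, 25,  1, 1),
  RAB_2     = c(35,  9, 35, 9)
)
colnames(counts) <- c("Base1", "Base2", "High1", "High2")
totals <- c(Base1 = 688, Base2 = 608, High1 = 703, High2 = 659)

# CPM, then a log transform with a pseudo-count (assumed preprocessing).
log_cpm <- log2(sweep(counts, 2, totals, "/") * 1e6 + 1)

# Samples must be rows for prcomp, so transpose the sgRNA-by-sample matrix.
pca <- prcomp(t(log_cpm), center = TRUE, scale. = FALSE)

# Share of the total variance captured by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Sample coordinates on the first two components.
pca$x[, 1:2]
```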

The last function (plot_corr_heatmap) displays an sgRNA-level correlation heatmap of the NGS samples. If the data quality is good, we expect samples in the same group to cluster together.

plot_corr_heatmap(cb2_count$count %>% get_CPM, df_design)
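The matrix behind such a heatmap can be sketched with base R's cor. The example uses the six sgRNAs printed by head(cb2_count$count); the Pearson correlation and the log2(CPM + 1) transform are assumptions, since the text does not say which correlation or transform plot_corr_heatmap uses.

```r
# Counts for the first six sgRNAs, copied from head(cb2_count$count).
counts <- rbind(
  RAB_4     = c(24,  5, 12, 3),
  NEGCTRL_3 = c( 4, 27,  0, 0),
  TRIM28_3  = c(24, 44,  2, 4),
  POSCTRL_3 = c(48,  9,  0, 0),
  TRIM28_2  = c(14, 25,  1, 1),
  RAB_2     = c(35,  9, 35, 9)
)
colnames(counts) <- c("Base1", "Base2", "High1", "High2")
totals <- c(Base1 = 688, Base2 = 608, High1 = 703, High2 = 659)

# CPM with a log transform and pseudo-count (assumed preprocessing).
log_cpm <- log2(sweep(counts, 2, totals, "/") * 1e6 + 1)

# Pairwise Pearson correlation between samples (columns).
corr <- cor(log_cpm)
round(corr, 2)
```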

Once the data quality looks good enough to proceed, we can perform an sgRNA-level analysis using run_estimation.

sgrna_stat <- run_estimation(cb2_count$count, df_design, "High", "Base")
sgrna_stat
## # A tibble: 25 x 19
##    sgRNA gene    n_a   n_b phat_a[,1] vhat_a[,1] phat_b[,1] vhat_b[,1]
##    <chr> <chr> <int> <int>      <dbl>      <dbl>      <dbl>      <dbl>
##  1 NEGC… NEGC…     2     2   0           0.          0.0419  0.000496 
##  2 NEGC… NEGC…     2     2   0.000734    5.39e-7     0.0225  0.000160 
##  3 NEGC… NEGC…     2     2   0           0.          0.0251  0.000369 
##  4 NEGC… NEGC…     2     2   0.000734    5.39e-7     0.0132  0.0000175
##  5 NEGC… NEGC…     2     2   0           0.          0.0320  0.0000876
##  6 PARK… PARK2     2     2   0.101       5.14e-3     0.0215  0.000224 
##  7 PARK… PARK2     2     2   0.103       3.46e-4     0.0499  0.0000707
##  8 PARK… PARK2     2     2   0.147       4.11e-3     0.0483  0.000487 
##  9 PARK… PARK2     2     2   0.181       9.53e-4     0.0432  0.0000319
## 10 PARK… PARK2     2     2   0.192       1.14e-4     0.0656  0.0000473
## # … with 15 more rows, and 11 more variables: cpm_a <dbl>, cpm_b <dbl>,
## #   logFC <dbl>, t_value[,1] <dbl>, df[,1] <dbl>, p_ts[,1] <dbl>,
## #   p_pa[,1] <dbl>, p_pb[,1] <dbl>, fdr_ts <dbl>, fdr_pa <dbl>,
## #   fdr_pb <dbl>

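As shown above, the function needs four arguments. The first is a matrix of read counts, and the second is the design data frame. The last two are the names of the two groups to compare in the differential abundance test for each sgRNA.

The data.frame of sgRNA-level statistics contains the following columns: sgRNA and gene (the sgRNA identifier and its target gene); n_a and n_b (the number of replicates in the first and second group); phat_a, phat_b, vhat_a, and vhat_b (the estimated proportion of the sgRNA and its variance in each group); cpm_a and cpm_b (the mean CPM in each group); logFC (the log fold change between the two groups); t_value and df (the t-statistic and its degrees of freedom); p_ts, p_pa, and p_pb (p-values for a difference between the groups, for enrichment in the first group, and for enrichment in the second group); and fdr_ts, fdr_pa, and fdr_pb (the corresponding adjusted p-values).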

Once we finish the sgRNA-level test, we can perform a gene-level test using measure_gene_stats.

gene_stats <- measure_gene_stats(sgrna_stat)
gene_stats
## # A tibble: 5 x 11
##   gene  n_sgrna  cpm_a  cpm_b   logFC   p_ts    p_pa    p_pb fdr_ts fdr_pa
##   <chr>   <int>  <dbl>  <dbl>   <dbl>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>
## 1 NEGC…       5 2.94e2 26920.  10.8   0.227  0.999   0.0304   0.283 1.000 
## 2 PARK2       5 1.45e5 45892.  -1.69  0.0352 0.00329 1.000    0.102 0.0164
## 3 POSC…       5 3.97e3 51117.   8.23  0.0410 1.000   0.00391  0.102 1.000 
## 4 RAB         5 4.93e4 30118.  -0.462 0.678  0.256   0.874    0.678 0.639 
## 5 TRIM…       5 1.77e3 45953.   6.81  0.155  0.999   0.0187   0.258 1.000 
## # … with 1 more variable: fdr_pb <dbl>

The data.frame of gene-level statistics contains the following columns: gene (the gene name); n_sgrna (the number of sgRNAs targeting the gene in the library); cpm_a and cpm_b (the mean CPM in the first and second group); logFC (the log fold change between the two groups); p_ts, p_pa, and p_pb (p-values for a difference between the groups, for enrichment in the first group, and for enrichment in the second group); and fdr_ts, fdr_pa, and fdr_pb (the corresponding adjusted p-values).

After obtaining the gene-level results, we can select genes using different measures. For example, to find genes with differential abundance between the two groups, use fdr_ts for hit selection. To find genes whose abundance is enriched in the first group (i.e., depleted in the other group), look at fdr_pa; fdr_pb can be used in the same way for enrichment in the second group. Here, we use fdr_ts to identify the hit genes.

gene_stats %>% 
  filter(fdr_ts < 0.1)
## # A tibble: 0 x 11
## # … with 11 variables: gene <chr>, n_sgrna <int>, cpm_a <dbl>,
## #   cpm_b <dbl>, logFC <dbl>, p_ts <dbl>, p_pa <dbl>, p_pb <dbl>,
## #   fdr_ts <dbl>, fdr_pa <dbl>, fdr_pb <dbl>
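The empty result can be checked by hand from the printed p-values. In the sketch below, applying the Benjamini-Hochberg adjustment (p.adjust with method = "BH") to the printed p_ts and p_pa values reproduces the printed fdr_ts and fdr_pa columns up to rounding. That CB2 uses this adjustment is an inference from the numbers, not something stated in the text. The full gene names are taken from the sgRNA identifiers in the count table, because the gene_stats printout truncates them.

```r
# Gene-level p-values copied from the gene_stats printout (rounded as printed).
gene_p <- data.frame(
  gene = c("NEGCTRL", "PARK2", "POSCTRL", "RAB", "TRIM28"),
  p_ts = c(0.227, 0.0352, 0.0410, 0.678, 0.155),
  p_pa = c(0.999, 0.00329, 1.000, 0.256, 0.999),
  stringsAsFactors = FALSE
)

# Assumed adjustment: Benjamini-Hochberg across the five genes.
gene_p$fdr_ts <- p.adjust(gene_p$p_ts, method = "BH")
gene_p$fdr_pa <- p.adjust(gene_p$p_pa, method = "BH")

# Two-sided hits at FDR < 0.1: none, matching the empty tibble above.
hits_two_sided <- gene_p$gene[gene_p$fdr_ts < 0.1]

# Genes enriched in the first group (High) at FDR < 0.1.
hits_enriched_a <- gene_p$gene[gene_p$fdr_pa < 0.1]
```

With these inputs, the two smallest two-sided values both adjust to about 0.1025, just above the 0.1 cutoff, which is why the fdr_ts filter returns no rows, while the one-sided fdr_pa filter would select PARK2.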

CB2 also provides a dot plot function for looking up the read counts of a gene. It can be used to check whether an interesting hit is valid.

plot_dotplot(cb2_count$count, df_design, "PARK2")
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.