The CaRpools (CRISPR AnalyzeR for Pooled Screens) package allows users to analyze raw NGS read count data from pooled CRISPR Screens in an end-to-end fashion and serves as a basis for creating customized reports for more advanced users.
These pooled screens must use lentivirus-based libraries, such as those available from Addgene. CaRpools provides functions to create different quality control plots, normalize the data, compare the data and perform hit identification via three different methods.
Furthermore, it can be used to analyze the data completely in a streamlined workflow with the provided analysis template.
With CaRpools, the user can analyze pooled CRISPR/Cas9 screens end-to-end in a standardized fashion allowing for reproducible data analysis.
This includes:
CaRpools is available as an R package caRpools without the scripts and template files.
The complete package with the PERL scripts and all template files can be obtained from Github (https://github.com/boutroslab/carpools) and our website CRISPR-AnalyzeR.de.
We recommend downloading the template files and scripts from Github and installing caRpools in R using the package installer `install.packages("caRpools")`.
Available Quality Control plots include (for sgRNAs or summarized for genes):
carpools.read.distribution
carpools.read.depth
carpools.reads.genedesigns
carpools.read.ballmap
carpools.read.count.vs
carpools.raw.genes
Since several gene identifiers are used for generating CRISPR pooled KO libraries, e.g. EnsemblID, loaded data-sets can automatically be annotated with further gene annotation, such as the official gene symbol or description, using the biomaRt interface.
More details can be found below in the section on get.gene.info or via ?get.gene.info / ?biomaRt.
Furthermore, sgRNA read-count data can be aggregated (summed up) to the corresponding genes with aggregatetogenes, or gene data can be excluded from the analysis using gene.remove.
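The idea behind aggregating and removing genes can be sketched in base R. This is only an illustration of the concept, not the implementation of aggregatetogenes or gene.remove; the toy identifiers, the column names and the "_" separator are assumptions made for this example.

```r
# Toy read-count table: one row per sgRNA (identifiers are made up).
counts <- data.frame(
  sgRNA = c("Gene1_1", "Gene1_2", "Gene2_1", "Gene2_2", "Gene3_1"),
  reads = c(120, 80, 15, 5, 300),
  stringsAsFactors = FALSE
)

# Derive the gene identifier by cutting at the first "_" separator
# (assumption: gene identifiers themselves contain no "_").
counts$gene <- sub("^(.+?)_.*$", "\\1", counts$sgRNA, perl = TRUE)

# Sum the sgRNA read counts of each gene.
gene_counts <- aggregate(reads ~ gene, data = counts, FUN = sum)

# Exclude one gene from the further analysis.
gene_counts_filtered <- gene_counts[gene_counts$gene != "Gene3", ]
print(gene_counts_filtered)
```

Within caRpools itself, the packaged functions should be preferred; see their help pages for the actual arguments.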
Moreover, this package can be used to identify screening hits using any of the following methods:
stat.DESeq
stat.wilcox
stat.mageck
Hit calling and data analysis can be compared between all the methods with compare.analysis or carpools.hit.overview. Finally, the evaluated data analysis can be visualized in different ways with carpools.hitident for all methods listed above.
Moreover, you will also be provided with in-depth information about the sgRNAs of your genes.
This information can be derived from carpools.raw.genes for any gene or, in an automated way, from carpools.hit.sgrna for your hit candidates.
All this data can be used to automatically generate a standardized report using the provided R Markdown template file, including all plots and tables for your data analysis.
Two different templates are provided:
The R Markdown templates allow the user to generate HTML output as well as PDF output (LaTeX installation required), including high-quality plots.
Provided PDF templates: CaRpools-PDF.rmd and CaRpools-extended-PDF.rmd
Provided HTML templates: CaRpools-HTML.rmd and CaRpools-extended-HTML.rmd
We also provide a VirtualBox image that already includes all necessary software and package files.
You only need to install VirtualBox 5 from the VirtualBox website.
You can download the caRpools VirtualBox image from our website crispr-analyzer.de or Github (https://github.com/boutroslab/carpools).
Please see the VirtualBox tutorial for instructions.
For CRISPR libraries with about 12,000 sgRNAs (12K), caRpools will work on any laptop/PC with at least 4 GB of RAM and a modern dual-core CPU.
CRISPR libraries with more than 100,000 sgRNAs (100K) run best with at least 8 GB of RAM.
CaRpools was tested on Mac OS X Yosemite and Ubuntu 14.04 LTS.
However, it should work on any operating system that fulfills the software requirements.
The following software needs to be installed:
The following R packages need to be installed (this can be done via load.packages()):
Please note that for any annotation, biomaRt needs full access to the internet. In case of incorrect proxy settings, the report generation will fail with a biomaRt error.
This means that if any proxy server is used, this has to be configured before using caRpools as described in the following articles:
Install all software listed above according to the installation instructions on the respective software website.
All necessary R packages can be installed automatically with load.packages().
Since CaRpools fosters reproducibility of CRISPR/Cas9 screens, the following requirements for pooled screening data must be fulfilled to analyze data:
The usage of more than two replicates at once is not yet implemented, but will be in the near future.
The following files are required for data analysis:
Either
* NGS FASTQ file for each sample
* Library reference in FASTA format

or
* Final read count file for each sample
* Library reference in FASTA format
CaRpools accepts either FASTQ files or read count files for each sample. FASTQ files are first extracted and then mapped against the library reference using Bowtie2. Finally, read count files are generated for each sample.
Alternatively, final read count files can be provided directly, so that no extraction or mapping is necessary.
In addition, a library reference file in FASTA format is required. Usually, this is the file that was used for ordering the custom oligo library.
File structures are shown below.
General information about the FASTQ file format can be obtained in an easy-to-understand article from Wikipedia.
maschine.pattern
The machine pattern used for extracting the sequences is a regular expression that identifies the read ID lines, which include the ID of your sequencing machine.
In the case of the above sample, the PERL regular expression must be M01100.
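How a machine pattern can pick out read ID lines may be sketched in R as follows. The FASTQ records are hypothetical and only reuse the machine ID M01100 from the text; anchoring the pattern with "^@" is an assumption of this sketch and not necessarily what CRISPR-extract.pl does internally.

```r
# Machine ID as given in the text.
machine_pattern <- "M01100"

# Two hypothetical FASTQ records (4 lines each). Note that the last
# quality line also starts with "@", which is a valid quality character.
fastq_lines <- c(
  "@M01100:1:EXAMPLE:1:1101:1000:2000 1:N:0:1",
  "ACGTACGTACGTACGTACGTACGT",
  "+",
  "IIIIIIIIIIIIIIIIIIIIIIII",
  "@M01100:1:EXAMPLE:1:1101:1001:2001 1:N:0:1",
  "TTTTACGTACGTACGTACGTAAAA",
  "+",
  "@IIIIIIIIIIIIIIIIIIIIIII"
)

# A line is treated as a read ID only if "@" is followed by the machine ID.
is_read_id <- grepl(paste0("^@", machine_pattern), fastq_lines, perl = TRUE)

# The nucleotide sequence is the line directly after each read ID.
sequences <- fastq_lines[which(is_read_id) + 1]
print(sequences)
```

Matching on the machine ID rather than on "@" alone avoids mistaking a quality line that begins with "@" for a read ID.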
CaRpools extracts the integrated DNA sequence of your target sequence as a DNA barcode.
In order to extract this sequence, a PERL regular expression pattern is used to identify the desired nucleotide sequence, which is called seq.pattern.
As an example, part of a U6 promoter-driven sgRNA cassette is given as follows:
Since we want to extract the target sequence, the regular expression will use a part of the U6 promoter and a part of the sgRNA backbone to identify the target sequence.
CACC (.{20}) GTTTTAGAGC
The parentheses are necessary to capture the target sequence. For more information, please see RegExR.
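The effect of the capture group can be tried out in R with a minimal sketch. The reads below are synthetic, the helper extract_target is hypothetical, and the pattern is written without spaces, which is an assumption about how seq.pattern is meant to be entered.

```r
# Pattern from the text, written without spaces (assumption).
seq_pattern <- "CACC(.{20})GTTTTAGAGC"

# Synthetic 20 nt target sequence embedded in a synthetic read.
target <- "GATCGATCGATCGATCGATC"
read_with_cassette <- paste0("TTGTGGAAAGGACGAAA", "CACC", target,
                             "GTTTTAGAGC", "TAGAAATAGC")
read_without_cassette <- "ACGTACGTACGTACGTACGTACGT"

# Hypothetical helper: return the captured group, or NA if no match.
extract_target <- function(read, pattern) {
  m <- regmatches(read, regexec(pattern, read, perl = TRUE))[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

reads <- c(read_with_cassette, read_without_cassette)
extracted <- vapply(reads, extract_target, character(1),
                    pattern = seq_pattern, USE.NAMES = FALSE)
print(extracted)
```

Only the 20 nucleotides inside the parentheses are returned; the flanking promoter and backbone bases serve as anchors and are discarded.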
The library reference file must be in FASTA format and include ALL sgRNAs present in the data-set with exactly the same naming.
e.g.
CaRpools also takes read count files. If FASTQ files are provided and extracted/mapped with CaRpools, read count files will be created for each sample.
Data for each sample must be formatted in a tab-separated way as follows:
As shown in the above file, Gene1 is the gene identifier, **_3423** is the part that uniquely identifies this sgRNA for the given gene, and Gene1_3423 is the complete identifier, which uniquely identifies this sgRNA within the data-set.
The complete name of each sgRNA must contain the gene it targets, given as a gene symbol or any other gene identifier, followed by a separator and a part that is unique for each sgRNA of that gene.
The name can therefore be anything, as long as the gene identifier is the same for all sgRNAs of a gene and the same separator is used throughout the data.
In principle, an sgRNA identifier must consist of a gene identifier, followed by a separator (e.g. _ or -) and a secondary sgRNA identifier.
Please note:
* Read counts MUST be numeric only.
* Within the sgRNA/gene identifier, no special characters except for _ and - are allowed!
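The naming rules above can be checked before an analysis with a small base R sketch. The helper split_id and the example identifiers are hypothetical, and splitting at the first separator is an assumption of this sketch.

```r
# Example identifiers; the last two violate the rules on purpose.
ids <- c("Gene1_3423", "Gene1_3424", "Gene2_17", "Gene 3_1", "Gene4")

# Rule: only letters, digits, "_" and "-" are allowed.
valid_chars <- grepl("^[A-Za-z0-9_-]+$", ids)

# Hypothetical helper: split an identifier at the first separator.
split_id <- function(id, sep = "_") {
  pos <- as.integer(regexpr(sep, id, fixed = TRUE))
  if (pos < 2 || pos == nchar(id)) {
    return(c(gene = NA_character_, sgrna = NA_character_))
  }
  c(gene = substr(id, 1, pos - 1), sgrna = substr(id, pos + 1, nchar(id)))
}

parts <- t(vapply(ids, split_id, character(2)))
has_both_parts <- !is.na(parts[, "gene"])

# Rule: read counts must be numeric only.
raw_counts <- c("120", "80", "15", "7", "n/a")
numeric_counts <- !is.na(suppressWarnings(as.numeric(raw_counts)))

print(data.frame(id = ids, valid_chars, has_both_parts, numeric_counts))
```

Rows that fail any of the three checks would need to be corrected in the read count file and in the library reference alike, since both must use exactly the same names.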
Files can be loaded via load.file(filename, header, sep) with the arguments filename, header and sep, where sep is set to \t for tab-separated files. Please see ?read.table for a detailed description of the arguments.
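Since the arguments follow read.table, loading a tab-separated read count file can be illustrated with base R alone. This sketch does not call load.file itself; the file content, the column names and the absence of a header line are assumptions made for the example.

```r
# Write a small, made-up read count file to a temporary location.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Gene1_3423\t120",
             "Gene1_3424\t80",
             "Gene2_17\t15"), tmp)

# Read it as a tab-separated file without a header line.
counts <- read.table(tmp, header = FALSE, sep = "\t",
                     stringsAsFactors = FALSE,
                     col.names = c("sgRNA", "reads"))

# Read counts must be numeric for any downstream analysis.
stopifnot(is.numeric(counts$reads))
print(counts)

unlink(tmp)
```

With caRpools installed, the corresponding call would pass the same filename, header and sep values to load.file.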
Please note: the MAIN FOLDER must be the R working directory!
Data and Script paths can be adjusted in the MIACCS file.
The following files are necessary to use CaRpools for report generation:
MIACCS.xls
Minimum Information About CRISPR/Cas Screens. This file needs to be filled out to provide all necessary information about the screen.
R Markdown Template files
One of CaRpools-extended-PDF.rmd, CaRpools-PDF.rmd, CaRpools-extended-HTML.rmd or CaRpools-HTML.rmd. This is the template for report generation.
Data Files
Two replicates per Control and Treated. Can be FASTQ files OR already mapped, not normalized read count files.
CRISPR-mapping.pl
PERL script to map your extracted FASTQ files, if desired (as indicated in the MIACCS.xls).
CRISPR-extract.pl
PERL script to extract the 20 nt target sequence from FASTQ files, if desired (as indicated in the MIACCS.xls).
CaRpools.png
The logo file
The following files are necessary to use single CaRpools functions:
Data Files
Either raw read count files or FASTQ files (that need to be extracted and mapped using CaRpools)
Please note that CaRpools always starts with loading data files. For raw read count files, use load.file. For FASTQ files, please see the sections below.
CaRpools folder structure for Report Generation using raw Read Count files:
CaRpools folder structure for Report Generation using raw Read Count files AFTER REPORT GENERATION:
CaRpools folder structure for Report Generation using FASTQ files:
CaRpools folder structure for Report Generation using FASTQ files AFTER REPORT GENERATION: