In a standard RNA-seq experiment, differential expression analysis is incredibly common. The idea is to compare the expression of thousands of transcripts between two conditions, which I’ll refer to as the reference and experimental samples. Once you have compiled a list of genes whose expression differs significantly between the two conditions, a reasonable follow-up question is “what caused the change in each of these genes?” The three broad answers to that question are 1) Something altered the RNA synthesis rates; 2) Something altered the RNA degradation rates; or 3) A little bit of both 1) and 2). Unfortunately, standard RNA-seq analysis cannot answer this mechanistic question, as RNA-seq only tells you about changes in expression level and cannot deduce the kinetic causes of those changes.
If the kinetics is of interest to you, then a recently developed solution to this problem is nucleotide recoding RNA-seq, which exists in several fundamentally similar flavors, the primary ones being TimeLapse-seq, SLAM-seq, and TUC-seq. The strategy involves feeding cells with a metabolic label (typically 4-thiouridine, a.k.a. 4sU), extracting the RNA as you normally would for RNA-seq, and then using a unique chemistry to convert the metabolic label into something that a sequencer will read as a different base (e.g., a cytosine analog for 4sU). Thus, labeled and unlabeled reads can be separated bioinformatically upon sequencing by the presence of mutations (e.g., U-to-C mutations).
One problem, though: the metabolic label must first get into cells, and then it must compete with the normal nucleotide for incorporation into RNA. The result is that mutation rates are typically low (often around 5% in RNA produced during metabolic labeling, meaning that each U in a sequencing read from these RNAs has about a 5% chance of being mutated to a C), making simple and definitive separation of labeled and unlabeled reads impossible. The solution: advanced statistical modeling implemented by bakR, an R package for analyzing nucleotide recoding sequencing data. The analyses implemented in this package will enable you to perform differential kinetic analysis just as easily as you can perform differential expression analysis, allowing you to elucidate the mechanisms of regulated gene expression.
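To get a feel for why low mutation rates preclude simple read-by-read classification, here is a quick back-of-the-envelope sketch (not part of bakR; the 20-U read length is an arbitrary, illustrative choice):

# Probability that a read shows zero U-to-C mutations, for a read covering 20 Us
p_new <- 0.05   # typical mutation rate in reads from labeled (new) RNA
p_old <- 0.001  # typical background mutation rate in reads from unlabeled (old) RNA
n_U <- 20       # illustrative number of Us covered by a read

dbinom(0, size = n_U, prob = p_new)  # ~0.36; over a third of new reads look unlabeled
dbinom(0, size = n_U, prob = p_old)  # ~0.98; old reads also usually show no mutations

In other words, a mutation-free read could easily have come from either population, which is why a mixture model rather than a hard cutoff is needed.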
Of course you will want to load the bakR package for this vignette; instructions to do so can be found at this link. Also, you’ll want to set the seed to ensure that the results you get reproduce those presented in the vignette.
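For example (the particular seed value below is arbitrary; use whatever value the version of the vignette you are following specifies):

# Load bakR and set a seed for reproducibility
library(bakR)
set.seed(123)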
The first step to using bakR is to create a bakRData object. A bakRData object consists of two components: a cB dataframe and a metadf dataframe. cB stands for counts binomial and contains all of the information about mutations seen in sequencing reads in each sample sequenced. metadf stands for metadata data frame and contains important information about the experimental details of each sample (i.e., how long the metabolic label feed was, which samples are reference samples, and which are experimental samples).
Let’s get an idea of what a cB looks like by taking a peek at an example:
# Load small example cB
data("cB_small")

# Observe contents of cB
head(cB_small)
One key aspect of the cB dataframe to keep in mind is that each row corresponds to a set of reads providing identical data. Each column records one piece of information about those reads: which sample they came from, which feature they aligned to, and their mutational content (see ?bakRData for a full description of each column).
The structure of the cB file most closely reflects the output of a pipeline for analyzing nucleotide recoding RNA-seq data developed by the Simon lab (available here), but the same information can be extracted from other published pipelines, such as SLAMDUNK (see the alleyoop dump function in SLAMDUNK).
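If you are assembling a cB from your own pipeline’s output, here is a hypothetical, hand-built sketch of what such a data frame might look like. The column names below (sample, XF, TC, nT, and n) are assumed to match what you see when you run head(cB_small) above; double-check your own cB against that output.

# Hypothetical toy cB-style data frame (column names assumed; see head(cB_small))
toy_cB <- data.frame(
  sample = c("WT_ctl", "WT_1", "WT_1"),  # sample the reads came from
  XF = c("GeneA", "GeneA", "GeneA"),     # feature the reads aligned to
  TC = c(0, 0, 2),                       # number of T-to-C (i.e., U-to-C) mutations
  nT = c(20, 20, 20),                    # number of Ts (Us) covered by the reads
  n = c(150, 100, 12)                    # number of reads with this exact combination
)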
Now let’s look at the corresponding metadf dataframe:
# Load metadf data frame; will be loaded as metadf in global environment
data("metadf")

# Print the transpose of metadf
# Rows will be the columns and columns will be the rows
print(t(metadf))
The transpose of metadf is displayed above to save space (so the columns depicted are the rows of metadf, and vice versa). metadf contains far less information than the cB data frame, but that information is crucial. One extremely important feature of metadf is its row names (column names as depicted above). Examine the row names and compare them to the sample names in the cB:
# metadf row names
print(rownames(metadf))
#> [1] "WT_ctl" "WT_2" "WT_1" "KO_ctl" "KO_2" "KO_1"

# cB sample names
print(unique(cB_small$sample))
#> [1] "WT_ctl" "WT_2" "WT_1" "KO_ctl" "KO_2" "KO_1"
They’re the same, and they HAVE TO BE!! Each row of metadf corresponds to a sample, and the order in which the sample names appear as the row names of metadf MUST match the order in which they appear in the cB’s sample column.
Keeping in mind that each row represents a sample, the columns record the experimental details of that sample: how long the metabolic label feed was and which experimental condition (reference or experimental) the sample belongs to.
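As a purely illustrative sketch, a hand-built metadf for the samples above might look like the following. The column names tl (label feed time, with 0 marking unfed controls) and Exp_ID (the numerical condition ID referenced later in this vignette), as well as the 60-minute feed time, are assumptions here; compare against the metadf example loaded above for the exact format.

# Hypothetical hand-built metadf (column names and feed time are assumptions)
toy_metadf <- data.frame(
  tl = c(0, 60, 60, 0, 60, 60),   # label feed time; 0 = no-label control
  Exp_ID = c(1, 1, 1, 2, 2, 2)    # 1 = reference condition, 2 = experimental condition
)

# Row names must match the cB sample names, in the same order
rownames(toy_metadf) <- c("WT_ctl", "WT_2", "WT_1", "KO_ctl", "KO_2", "KO_1")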
Once you have these two data frames correctly constructed, you can create a bakRData object in one line of code.
# Create bakRData object
bakRData <- bakRData(cB_small, metadf)
bakR implements several blends of fundamentally similar statistical models to perform differential kinetic analysis. No matter your intentions though, you must first fit the data with the most efficient model available. This model (implemented in the fast_analysis() function, so see ?fast_analysis for more details) estimates mutation rates and then finds the maximum likelihood estimate (MLE) of the fraction of sequencing reads from new RNA (new meaning RNA synthesized after the start of metabolic labeling) for each feature in each 4sU-fed sample. The reason you have to run this model first is that part of its output is used by every other model you can run with bakR. More on that later though; first, let’s create a simulated dataset and fit it with the efficient model. Using a simulated dataset will allow us to compare the results to the ground truth and validate the model’s accuracy.
# Simulate a nucleotide recoding dataset
sim_data <- Simulate_bakRData(500)

# This will simulate 500 features, 2 experimental conditions
# and 3 replicates for each experimental condition
# See ?Simulate_bakRData for details regarding tunable parameters

# Extract simulated bakRData object
bakRData <- sim_data$bakRData

# Extract simulated ground truths
sim_truth <- sim_data$sim_list

# Run the efficient model
Fit <- bakRFit(bakRData)
#> Finding reliable Features
#> Filtering out unwanted or unreliable features
#> Processing data...
#> Estimating labeled mutation rate
#> Estimated pnews for each sample are:
#> # A tibble: 6 x 3
#> # Groups:   mut [2]
#>     mut  reps   pnew
#>   <int> <int>  <dbl>
#> 1     1     1 0.0501
#> 2     1     2 0.0499
#> 3     1     3 0.0500
#> 4     2     1 0.0501
#> 5     2     2 0.0502
#> 6     2     3 0.0500
#> Estimating unlabeled mutation rate
#> Estimated pold is: 0.000999785857498937
#> Estimating fraction labeled
#> Estimating per replicate uncertainties
#> Estimating read count-variance relationship
#> Averaging replicate data and regularizing estimates
#> Assessing statistical significance
bakRFit() is used here as a wrapper for two functions in bakR: cBprocess() and fast_analysis(). cBprocess() will extract and properly format the information needed for all statistical models from the bakRData object and fast_analysis() will perform the efficient analysis. This should take no more than a minute to run. Messages are printed along the way to keep you updated as to the progress of the analysis.
One of the most important messages is the one regarding the estimated pnews and pold. The pnews are the estimated mutation rates of reads from new RNAs (new meaning RNAs synthesized after the start of 4sU labeling) in each sample (mut = Exp_ID, and reps = a numerical replicate ID that corresponds to the order replicates appear in the cB), and pold is a global estimate of the background mutation rate used in all analyses. The simulated mutation rates are 0.05 for new reads and 0.001 for old reads in all samples in this simulation, so the estimates are pretty darn close. You can also input mutation rate estimates from elsewhere (say a run of the full Stan model implemented in bakR) to bakRFit():
# Run efficient model with known mutation rates
# Pass the Fit object rather than the bakRData object and set FastRerun to TRUE
Fit <- bakRFit(Fit,
               FastRerun = TRUE,
               pnew = rep(0.05, times = 6),
               pold = 0.001)
#> Estimating fraction labeled
#> Estimating per replicate uncertainties
#> Estimating read count-variance relationship
#> Averaging replicate data and regularizing estimates
#> Assessing statistical significance
Since we’ve already run fast_analysis() once, we can speed up the analysis by passing the Fit object from the first run rather than the bakRData object. This allows bakR to skip the time-consuming pre-processing steps. Just make sure to set FastRerun to TRUE.
If you run into problems with the default mutation rate strategy (e.g., the mutation rate estimates don’t seem to make sense), another slightly less efficient solution is to use MCMC implemented in Stan to estimate mutation rates. The idea is to fit a binomial mixture model to only a small subset of features with sufficient sequencing depth so that the model has enough data to work with. This used to be my go-to strategy, but after I made improvements to the default mutation rate estimation I found myself using this approach less and less. Still, here is an example of its use for posterity’s sake:
# Set StanRateEst to TRUE to use Stan to estimate rates
# low_reads and high_reads define the read count cutoffs used to select features
# default = between 1000 and 5000 reads
# RateEst_size determines the number of features to use (default = 7)
Fit <- bakRFit(Fit,
               FastRerun = TRUE,
               StanRateEst = TRUE,
               low_reads = 500,
               high_reads = 5500,
               RateEst_size = 5)
No matter how you decide to run the fast model, you might wonder how well it worked. As mentioned earlier, one benefit of using simulated data is that you can easily answer that question by comparing the simulated parameters (e.g., the fraction new) to the model’s estimates:
# Features that made it past filtering
XFs <- unique(Fit$Fast_Fit$Effects_df$XF)

# Simulated logit(fraction news) from features making it past filtering
true_fn <- sim_truth$Fn_rep_sim$Logit_fn[sim_truth$Fn_rep_sim$Feature_ID %in% XFs]

# Estimated logit(fraction news)
est_fn <- Fit$Fast_Fit$Fn_Estimates$logit_fn

# Compare estimate to truth
plot(true_fn, est_fn, xlab = "True logit(fn)", ylab = "Estimated logit(fn)")
abline(0, 1, col = "red")
Looks pretty good to me!
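If you want a single number to go along with the plot, here is a quick sketch using the objects created above (plain base R, not a bakR function):

# Correlation between the true and estimated logit(fraction new) values
cor(true_fn, est_fn)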
To run the heavier, more trustworthy models, just rerun bakRFit() on the Fit object, but with either the StanFit or HybridFit parameter set to TRUE.
# Load options that will make running models more efficient
rstan::rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

# Run Hybrid model (This might take several minutes to run)
Fit <- bakRFit(Fit, HybridFit = TRUE)

# Run Full model (This might take ~10-30 minutes to run)
Fit <- bakRFit(Fit, StanFit = TRUE)
You might rightfully ask: what makes these models “more trustworthy”? Didn’t we just show that the logit(fn) estimates from the fast model were pretty good? Why bother with these computationally expensive models? The answer is that a lot of the magic happens at the level of uncertainty quantification. While all of the models will agree pretty well on the estimates of kinetic parameters, they won’t necessarily agree on how uncertain those estimates are. The fastest model takes a lot of shortcuts and uses a lot of approximations that make it fast but also make its uncertainty quantification less rigorous. I like to say that the fast model should be used for initial investigations of your data, a rapid means by which to satisfy your desire to dive into whatever exciting biology you might be exploring. When you have a sense of how promising your data is and what you want to analyze, switch on over to the heavier models. These will give you more peace of mind and allow you to comfortably dig a little deeper into your datasets as you prepare for stunning presentations and groundbreaking papers. The full Stan model (StanFit = TRUE) should be your go-to for this, but if you are working with an unusually large dataset, look to the hybrid model. A general rule of thumb for what constitutes a large dataset in this case is if (number of experimental conditions) x (number of replicates of each condition) is greater than 15. For example, 3 replicates of 6 different conditions (3 x 6 = 18) is going to take a while for the full model to process, likely on the order of weeks.
The Fit object contains a list pertaining to the fit of each model that has been run (e.g., the Fast_Fit element used throughout this vignette); see ?bakRFit for a full description of the possible contents.
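A simple way to see exactly what your particular Fit object contains is to inspect its element names:

# Top-level contents of the Fit object
names(Fit)

# Contents of the fast (MLE) fit used throughout this vignette
names(Fit$Fast_Fit)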
bakR provides a variety of easy to use functions for beginning to investigate your data. The visualizations are particularly aimed at revealing trends in RNA stabilization or destabilization. Analyzing changes in RNA synthesis rate will require pairing the output of bakR with that of a differential expression analysis tool, like DESeq2 or limma. More on that later.
One visualization powered by bakR is a L2FC(kdeg) MA plot. Each point in these plots will represent a feature-condition combination. The x-axis is the average number of sequencing reads mapping to that feature, averaging over all replicates in both the experimental and reference condition. The y-axis is the difference in stability between the reference and experimental condition, quantified as a log2-fold change (L2FC) in the degradation rate constant (kdeg). Positive values of L2FC(kdeg) represent features that are less stable (have a higher kdeg) in the experimental condition relative to the reference condition. Negative values thus represent features that are more stable in the experimental condition. The plotMA function in bakR allows you to make this plot with two inputs, the bakRFit object and the model fit to use (MLE, Hybrid, or MCMC):
## MA Plot with Fast Fit
bakR::plotMA(Fit, Model = "MLE")
If you have more than one experimental condition, you can choose to plot all comparisons or just a subset of those in which you are interested. See the Exps and Exp_shape parameters in the ?plotMA help file for details.
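As a hedged sketch of what that might look like (the exact format Exps expects is documented in ?plotMA; the value used here is only illustrative):

## Hypothetical: restrict the MA plot to the comparison involving experimental
## condition 2 (check ?plotMA for the expected format of the Exps argument)
bakR::plotMA(Fit, Model = "MLE", Exps = 2)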
Another common plot to make with the bakR fits is a volcano plot. The x-axis in a volcano plot is the same as the y-axis of an MA plot (so L2FC in a kinetic parameter), and the y-axis is a measure of statistical significance (e.g., FDR adjusted p-value), usually on a -log10 scale. The plotVolcano function is here to help you with this one too, containing many of the same parameters as plotMA:
## Volcano Plot with Fast Fit; significance assessed relative to an FDR control of 0.05
plotVolcano(Fit$Fast_Fit)
While volcano and MA plots are great for visualizing the results of bakR’s model fits, another important class of plots is those designed to detect anomalies. Sometimes, high-throughput sequencing datasets are plagued by batch effects, biases that affect one or a few of your samples. These can lead to inflated false discovery rates and/or reduced power and are thus important to look out for. One simple way to see if there are large-scale biases present in any of your samples is to perform principal component analysis (PCA), a dimension reduction algorithm that will project your high-dimensional sequencing data onto a few dimensions. bakR specifically implements a novel fraction new PCA, where the high-dimensional object to be compressed is a matrix of fraction new estimates, with each column of the matrix corresponding to a sample and each row to a feature (e.g., a transcript). If replicates don’t cluster together in the 2D PCA plot produced by bakR’s FnPCA, that could indicate the presence of batch effects:
## 2D PCA plot with replicate fraction news
FnPCA(Fit$Fast_Fit)
In this case, replicates from the same condition cluster together fairly well, which is promising, and the difference between the red point in the lower left and those in the upper left is not significant enough to cause concern. I have found that clustering is improved with the full MCMC fit, so if you’ve taken the time to run that model, I suggest using it for any PCA plots.
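For example, assuming the full model has been run as shown earlier and that its fit is stored in a Stan_Fit element of the Fit object (mirroring the Fast_Fit element used above; run names(Fit) to confirm the name on your end), the analogous call would be:

## 2D PCA plot using the full (MCMC) model's fraction new estimates
## (assumes the full fit lives in Fit$Stan_Fit; check names(Fit) to confirm)
FnPCA(Fit$Stan_Fit)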
I hope that this vignette has provided a useful introduction to using bakR with your nucleotide recoding RNA-seq datasets. Inevitably, not all of your questions will be answered in the Getting-Started vignette. Some of your remaining quandaries may be challenging problems specific to your exciting research, the unavoidable roadblocks of doing science that you will eventually conquer, but with little help from any documentation I could dream of writing. That being said, I can still imagine many important implementation questions not addressed here, like:
If the first two questions are of interest to you, head over to the “Going beyond the defaults” vignette!
If the last question is of interest, then I would suggest heading over to some of the amazing documentation written by the Stan team. In short though, warnings from Stan are a sign that the model did not converge. The cause of this is likely anomalies in your data. Some challenges that I have come across that could lead to such warnings include:
Sometimes, small warnings from Stan (like a few max treedepth warnings here and there) are no cause for concern. How do you know whether or not there is a serious problem though? I would suggest comparing the output of the Hybrid/MCMC implementation to that of the MLE implementation. In particular, check the correlation between the L2FC(kdeg) estimates. If the correlation is good, then there probably is nothing to worry about. If the estimates for a handful of transcripts differ wildly between the implementations though, that is strong evidence that the model didn’t converge and that one of the anomalies mentioned above is afflicting your data.
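As a rough sketch of that sanity check (the Stan_Fit element and the L2FC_kdeg column name are assumptions here; inspect your own Effects_df objects for the exact names):

# Compare L2FC(kdeg) estimates from the MLE and full (MCMC) implementations.
# Assumes both fits store effect sizes in an Effects_df with XF and L2FC_kdeg
# columns; adjust the names to match your own Fit object.
mle_effects  <- Fit$Fast_Fit$Effects_df
mcmc_effects <- Fit$Stan_Fit$Effects_df

# With a single reference vs. experimental comparison, each feature (XF)
# appears once in each table, so merging on XF lines the estimates up.
merged <- merge(mle_effects, mcmc_effects, by = "XF", suffixes = c("_mle", "_mcmc"))

# A strong correlation suggests there is nothing to worry about; wildly
# discordant estimates for some transcripts point to convergence problems.
cor(merged$L2FC_kdeg_mle, merged$L2FC_kdeg_mcmc)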