# Biopeak R Package Implementation

## 1 Overview

The Biopeak R package enables the user to systematically identify and visualize impulse-like gene expression changes within short genomic series experiments. On a systems level, this tool allows to characterize dynamic gene expression signatures underlying biological processes. In order to detect such activation peaks, the gene expression is treated as a signal that propagates along the experimental axis (time, temperature or other series conditions).

## 2 Peak Detection with the Biopeak package

To demonstrate the functionality of this package, we are using a dataset comprising transcriptional profiles of human epithelial cells in response to heat available on GEO (GSE7458) (J. M. Laramie 2008). In vitro cultured HEp2 cells were heated at 37 to 43 °C for 60 min and microarray gene expression profiles were acquired at 37, 40, 41, 42 and 43 °C.

In a first step, we load the heat-shock data:

library(Biopeak)
# load the heat-shock data
exprmat <- heat
# Show first rows and columns of the expression matrix
head(exprmat)
##            Hep2_37    Hep2_40    Hep2_41   Hep2_42    Hep2_43
## DAZL      25.64188   26.44370   48.65872  47.51223   17.54344
## RAD23A   776.89917  609.15550  602.97183 823.58900  502.01850
## RAP1B   1033.05450 1232.86550 1028.10150 900.49950 1074.90267
## ARHGAP6   35.05837   53.72763   26.60629  31.21045   29.21352
## GYPB     157.53467  167.29190  159.62667 157.61000  196.08050
## UBE2N    464.30233  517.65315  549.37597 682.46653  521.46451
# Transform data frame to numeric matrix
exprmat <- as.matrix(exprmat)

One of the strengths of the peak detection algorithm lays in its great flexibility to process data source-independently. While the main focus of this package is aimed at the characterization of genomic data, in theory, any type of series experiment that can be described as a signal could be analyzed. Since peaks are assessed for each gene relative to its expression over time or over the course of different stimuli strenghts, prior normalization steps are usualy not required. Therefore, we can directly subject the Rna-Seq or microarray count expression matrix to the peak detection algorithm: the peakDetection function.

The peakDetection function takes as input the expression matrix and the series description, a numeric vector which defines the conditions/time-points of sample acquisition. In order to identify biomarkers which are specific to a distinct phase of the observed biological process, we provide a set of tunable filters that define the characterstics of a peak. Peak characteristics that have to be defined are:

• activation strength (actstrength): Threshold for minimal activation relative to the mean expression across all conditions.
• prominence: Threshold for minimal peak prominence relative to the second highest peak. In simple terms: how much higher does the mainpeak have to be relative to the second highest peak. If set to 1, multiple peaks per gene are allowed.
• type: Definition of study type with possible values ‘microarray’ or ‘rnaseq’.

Furthermore, the function supports optional parameters for more complex operations:

• minimum expression (minexpr): Threshold for minimal mean expression across all condition for a given gene.
• background correction (bgcorr): Background noise correction. Genes with an expression lower than the 5 % quantile of the entire expression matrix are discarded.
• peakwidth: Minimal number of conditions/time-points that a peak spans (based on sustact threshold).
• sustained activation (sustact): Threshold for minimal peakheight relative to the main peak to be considered as sustained activation.

In this case study, we are primarily interested in finding genes that peak at a single condition, in order to identify gene activation thresholds under heat-shock conditions. Thus, we set the prominence parameter to 1.3, to select for genes that feature a peak that is at least 30 % higher in magnitude than any other peak. Moreover, we want to filter out peaks with low heights relative to the mean expression. Therefore the actstrength parameter is set to 1.5, selecting for peaks that feature at least 50 % higher expression than the mean expression of that gene. Lastly, we selected only the top 50 % expressed genes, to reduce the computational load.

# define condition series
series <- c(37,40,41,42,43)
# Find the expression limit for the 50 % highest expressing genes and selec them
exprlim <- quantile(exprmat, probs <- seq(0,1,0.01))[51]
exprmat <- exprmat[which(rowMeans(exprmat) > exprlim),]
# run the peak detection algorithm
peakdet <- peakDetection(exprmat, series, type ='rnaseq', actstrength = 1.5, prominence = 1.3,
bgcorr = F)

The peakDetection function returns a set of vectors and matrices:

• peakgenes: A character vector of gene symbols for which a peak has been identified
• peakloc: A numeric vector containing the location of each peak.
• peakheight: A numeric vector containing the absolute height of each peak.
• neighbors: A numeric matrix containing time-points with sustained activation. Conditions/time-points with sustained activation are ranked by their distance to the main peak which always has value 1. Thus, a time-point with a value of 3 is 3-1 = 2 conditions/time-points away from the main peak. Conditions with value 0 show no sustained activation. If the peakwidth parameter is set greater than 0, the output of the peakDetection function can be used to filter for genes with sustained activation within the main peak neighborhood.

## 3 Discovering the temporal regulation of activated biomarkers

The output of the peakDetection function provides a format that facilitates queries on a gene-level and systems-level.

# number of genes detected by the peakDetection function
length(peakdet$peakgenes) ## [1] 205 # filter for genes which peak at 43 °C peakdet$peakgenes[which(peakdet$peakloc == 43)] ## [1] "CNNM2" "SFI1" ## [3] "PNN" "KRT7" ## [5] "BASP1" "RAF1" ## [7] "GBA /// GBAP1 /// LOC100510710" "GADD45B" ## [9] "MKL1" "CDK13" ## [11] "NRG1" "XKRY /// XKRY2" ## [13] "H3F3A /// H3F3B /// MIR4738" "RCAN2" ## [15] "LOC100507577 /// LONP2" "SLC7A5" ## [17] "RPL23 /// RPL23P8 /// SNORA21" "AKR7A2" ## [19] "ZNF787" "NUP188" ## [21] "ACTA2" "APOF" ## [23] "HOXD13" "BSN" ## [25] "PPAP2B" "CAPN1" ## [27] "CNKSR2" "USP6NL /// USP6NL-IT1" ## [29] "FLRT2 /// LOC100506718" "TNFRSF8" ## [31] "MAGEC1" "ARSF" ## [33] "BRS3" "C2CD2L" ## [35] "KRT18" "DLEC1" ## [37] "SYT17" "ANXA11" ## [39] "KIR3DL1 /// KIR3DL2" "ARHGAP29" ## [41] "MXRA5" "IFI44L" ## [43] "HAAO" "CKMT2" ## [45] "DBI" "CUL3" ## [47] "DOPEY2" "CYTH2" ## [49] "TUBB2B" "TNFRSF14" ## [51] "C5orf45" "CAPRIN1" ## [53] "ITGA5" "GBF1" ## [55] "HEXIM1" "P2RX5" ## [57] "ID2" "CHD9" ## [59] "MT1F /// MT1G /// MT1M /// MT1P3" "GATA6" ## [61] "MSX2P1" "Fas" # return the peakheight and location of the gene CDK13 c(peakdet$peakheight[which(peakdet$peakgenes == 'CDK13')], peakdet$peakloc[which(peakdet$peakgenes == 'CDK13')]) ## [1] 4114.631 43.000 # assess the main peak neighborhood for sustained activation peakdet$neighbors[which(peakdet$peakgenes == 'CDK13'),] ## [1] 0 4 0 0 1 The plotExpression function allows to plot the expression signal of individual genes: ### 3.1 Plot the expression signal of individual genes plotExpression(exprmat,'CDK13',series,peakdet) ### 3.2 Gene co-expression analysis Gene co-expression analyses are a widely used tool in genomic studies. In this package we provide the getCormat function which calculates a gene-wise correlation matrix and plots a bi-clustered heatmap. The function returns both the heatmap object and the re-ordered correlation matrix: dev.off() ## null device ## 1 corobjects <- getCormat(peakdet, exprmat, method = 'spearman') # extract heatmap object corheatmap <- corobjects$hm
# extract re-ordered correlation matrix

### 3.5 Save the peak detection output to a text file

The function saveOutput saves the output of the peakDetetection funtion (peakgenes, peaklocation and peakheight) to a text file.

saveOutput(peakdet,file.path(tempdir(),'heat_out.txt'))

## 4 References

J. M. Laramie, B. Brownstein, T. P. Chung. 2008. Transcriptional Profiles of Human Epithelial Cells in Response to Heat. Shock.