`pencal`

?`pencal`

is an `R`

package that was developed
to make it easy and efficient to estimate and apply **Penalized
Regression Calibration**, a statistical method proposed in Signorelli et
al. (2021).

Longitudinal and high-dimensional measurements have become increasingly common in biomedical research. However, methods to predict survival outcomes using a large number of longitudinal covariates as predictors are currently missing.

Penalized Regression Calibration (PRC) is a statistical method that
has been developed to fill this methodological gap, making it possible
to **predict survival using as predictors a large (potentially
high-dimensional) number of longitudinal covariates**.

PRC is described in detail in the following scientific article:

Signorelli, M., Spitali, P., Al-Khalili Sgyziarto, C., The Mark-MD Consortium, Tsonaka, R. (2021). Penalized regression calibration: A method for the prediction of survival outcomes using complex longitudinal and high-dimensional data. Statistics in Medicine, 40 (27), 6178-6196. DOI: 10.1002/sim.9178.

In short, PRC comprises three modelling steps:

in the first step, we

**model the trajectories described by the longitudinal covariates using mixed effects models**;in the second step, we

**compute subject-specific summaries of the longitudinal trajectories**. In practice, these summaries are the predicted random effects from the mixed models estimated in step 1, and they allow us to summarize the way in which the trajectories evolve over time across subjects;in the third step, we

**estimate a penalized Cox model**where the summaries computed in step 3 (alongside with any relevant time-independent covariate, such as baseline age, gender, …) are employed as predictors of the survival outcome.

The final output of PRC is a penalized Cox model that allows to compute predicted survival probabilities for each individual.

Additionally, one may want to quantify the predictive performance of
the fitted model. To achieve this aim, in `pencal`

we have
implemented a **Cluster Bootstrap Optimism Correction
Procedure** (CBOCP) that can be used to obtain optimism-corrected
estimates of the C index and time-dependent AUC associated to the fitted
model. Depending on the dimensionality of your dataset, computing the
CBOCP might be time consuming; for this reason, we offer the possibility
to parallelize the CBOCP using multiple cores.

Below you can see a graphical representation of the steps involved in the estimation of PRC (see the elements in the lightblue box) and in the computation of the CBOCP (elements in the salmon box).

Even though `pencal`

is available from `CRAN`

,
it includes a `Bioconductor`

packages among its dependencies
(`survcomp`

). This can create problems in the installation
phase, since the usual `install.packages( )`

only fetches
`CRAN`

dependencies, and not `Bioconductor`

ones.
Below I explain two alternative ways to successfully install
`pencal`

.

The `R`

package `pencal`

and all of its
dependencies can be installed all in one go using:

```
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
::install('pencal') BiocManager
```

Alternatively, you may choose to first install `survcomp`

using `BiocManager::install( )`

, and then to install
`pencal`

and all of its CRAN dependencies using
`install.packages( )`

. To do so, you need to use

```
# step 1: install survcomp
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
::install('survcomp')
BiocManager# step 2: install pencal and CRAN dependencies
install.packages('pencal')
```

After having installed `pencal`

, you can load the package
with

`library(pencal)`

`pencal`

Currently, `pencal`

contains functions to estimate the
PRC-LMM and PRC-MLPMM models proposed in Signorelli et al. (2021). An
overview of the different functions is presented in the following
subsections.

Estimation of the PRC-LMM model can be performed by running sequentially the following three functions:

`fit_lmms`

, which implements the first step of the estimation of the PRC-LMM;`summarize_lmms`

, which carries out the second step;`fit_prclmm`

, which performs the third step.

As already mentioned, these functions have to be run sequentially,
with the output of `fit_lmms`

used as input for
`summarize_lmms`

, and the output of
`summarize_lmms`

as input for `fit_prclmm`

.

Similarly, estimation of the PRC-MLPMM model can be performed using:

`fit_mlpmms`

, which implements the first step of the estimation of the PRC-MLPMM;`summarize_mlpmms`

, which carries out the second step;`fit_prcmlpmpm`

, which performs the third step.

Computation of the predicted survival probabilities from the PRC-LMM
and PRC-MLPMM models can be done with the functions
`survpred_prclmm`

and `survpred_prcmlpmm`

.

The evaluation of the predicted performance of the estimated PRC-LMM
and PRC-MLPMM can be done using the function
`performance_prc`

, which returns the naive and
optimism-corrected estimates of the C index and of the time-dependent
AUC. The optimism-corrected estimates are based on the CBOCP proposed in
Signorelli et
al. (2021).

A technical note on how the CBOCP is implemented in
`pencal`

: most of the computations required by the CBOCP are
performed by `fit_lmms`

, `summarize_lmms`

and
`fit_prclmm`

for the PRC-LMM, and by `fit_mlpmms`

,
`summarize_mlpmms`

and `fit_prcmlpmm`

for the
PRC-MLPMM. Such computations may be time-consuming, and for this reason
these functions are designed to work with parallel computing (this can
be easily done by setting the argument `n.cores`

to a value
\(> 1\)). The last step of the CBOCP
is performed by the function `performance_prc`

, which
aggregates the outputs of the previous functions and computes the naive
and optimism-corrected estimates of the C index and of the
time-dependent AUC.

**Important note**: if you just want to estimate the PRC
model, and you do not wish to compute the CBOCP, simply set
`n.boots = 0`

as argument of `fit_lmms`

. If,
instead, you do want to compute the CBOCP, set `n.boots`

equal to the desired number of bootstrap samples (e.g., 100).

In addition to the functions mentioned above, `pencal`

comprises also 3 functions that can be used to simulate example
datasets:

`simulate_t_weibull`

to simulate survival data from a Weibull model;`simulate_prclmm_data`

to simulate an example dataset for PRC-LMM that is comprehensive of a number of longitudinal biomarkers, a survival outcome and a censoring indicator;`simulate_prcmlpmm_data`

to simulate an example dataset for PRC-MLPMM;`pencox_baseline`

and`performance_pencox_baseline`

to estimate a penalized Cox model with baseline predictors only (no longitudinal information used for prediction).

To illustrate how `pencal`

works, let us simulate an
example dataset from the PRC-LMM model. Hereafter we generate a dataset
that comprises \(n = 100\) subjects,
\(p = 10\) longitudinal biomarkers that
are measured at \(t = 0, 0.2, 0.5, 1, 1.5,
2\) years from baseline, and a survival outcome that is
associated with 5 (`p.relev`

) of the 10 biomarkers:

```
set.seed(1234)
= 10
p = simulate_prclmm_data(n = 100, p = p, p.relev = 5,
simdata lambda = 0.2, nu = 1.5,
seed = 1234, t.values = c(0, 0.2, 0.5, 1, 1.5, 2))
ls(simdata)
```

`## [1] "censoring.prop" "long.data" "surv.data" "theta.true"`

Note that in this example we are setting \(n > p\), but `pencal`

can
handle both low-dimensional (\(n >
p\)) and high-dimensional (\(n \leq
p\)) datasets.

In order to estimate the PRC-LMM, you need to **provide the
following two inputs**:

**a dataset in long format**, which should contain (at least) the following variables: a subject identifier that should be named`id`

, the longitudinal biomarkers (here called`marker1`

, …,`marker10`

), and the relevant time variables (in this example we will use`age`

as covariate in the LMMs estimated in step 1, and`baseline.age`

as covariate in the penalized Cox model estimated in step 3):

```
# view the dataset in long format
head(simdata$long.data)
```

```
## id base.age t.from.base age marker1 marker2 marker3 marker4
## 1 1 4.269437 0.0 4.269437 1.5417408 4.452282 15.13419 5.809207
## 2 1 4.269437 0.2 4.469437 1.2346437 5.252873 15.98066 5.896591
## 3 1 4.269437 0.5 4.769437 2.0773929 3.714174 18.44501 6.634897
## 4 1 4.269437 1.0 5.269437 0.2137868 4.092887 20.23194 6.179779
## 5 1 4.269437 1.5 5.769437 1.3354611 5.032044 18.99531 5.914160
## 6 1 4.269437 2.0 6.269437 0.7811953 4.946483 22.19621 5.981212
## marker5 marker6 marker7 marker8 marker9 marker10
## 1 13.62807 -7.041495 15.19982 10.12011 2.7166023 15.16749
## 2 15.03320 -5.763194 16.25356 10.20624 1.2764132 13.11855
## 3 14.90197 -6.478355 17.40369 11.74692 1.9369628 13.91899
## 4 16.48330 -8.994558 18.44549 11.91967 1.5949944 15.50285
## 5 16.38560 -9.034169 19.61104 12.59247 1.3730259 15.86172
## 6 17.23651 -10.220797 19.71013 12.79891 -0.2164965 15.89041
```

```
# visualize the trajectories for a randomly picked biomarker
# (the code in the if statement below relies on the ptmixed package)
if ('ptmixed' %in% rownames(installed.packages())) {
library(ptmixed)
::make.spaghetti(x = age, y = marker5,
ptmixedid = id, group = id,
data = simdata$long.data,
margins = c(4, 4, 2, 2),
legend.inset = - 1)
}
```

**a dataset with information on the survival outcome**, which should contain (at least) the following variables: a subject identifier that should be named`id`

, the time to event outcome called`time`

, and the binary event indicator called`event`

(NB: make sure that the variable names associated to these three variables are indeed`id`

,`time`

and`event`

!)

```
# view the dataset with the survival data
head(simdata$surv.data)
```

```
## id baseline.age time event
## 1 1 4.269437 0.8368389 0
## 2 2 4.705434 1.4288656 1
## 3 3 3.220979 1.6382975 1
## 4 4 4.379400 0.5809532 1
## 5 5 4.800329 0.2441706 1
## 6 6 3.394879 0.4901404 1
```

```
# what is the proportion of censoring in this dataset?
$censoring.prop simdata
```

`## [1] 0.22`

```
# visualize an estimate of the survival function
library(survival)
library(survminer)
```

`## Loading required package: ggplot2`

`## Loading required package: ggpubr`

```
##
## Attaching package: 'survminer'
```

```
## The following object is masked from 'package:survival':
##
## myeloma
```

```
= survival::Surv(time = simdata$surv.data$time,
surv.obj event = simdata$surv.data$event)
= survival::survfit(surv.obj ~ 1,
kaplan type="kaplan-meier")
::ggsurvplot(kaplan, data = simdata$surv.data) survminer
```

`pencal`

Before we begin to estimate the PRC-LMM, let’s determine the number of cores that will be used for the computation of the CBOCP. In general you can use as many cores as available to you; to do this, you can set

`= parallel::detectCores() n.cores `

Since the CRAN Repository Policy allow us to use at most 2 cores when building the vignettes, in this example we will limit the number of cores used to 2:

`= 2 n.cores `

Be aware, however, that **using more than 2 cores will speed
computations up, and it is thus recommended**. Several functions
in `pencal`

will actually return a warning when you perform
computations using less cores than available: the goal of such warnings
is to remind you that you could use more cores to speed computations up;
however, if you are purposedly using a smaller number of cores you can
ignore the warning.

Hereafter we show how to implement the three steps involved in the estimation of the PRC-LMM, alongside with the computation of the CBOCP.

In the first step, for each biomarker we estimate a **linear
mixed model** (LMM) where the longitudinal biomarker levels \(y_{ij}\) depend on two fixed effects (one
intercept, \(\beta_0\) and one slope
for age, \(\beta_1\)), on a
subject-specific random intercept \(u_{0i}\) and on a random slope for age
\(u_{1i}\):

\[y_{ij} = \beta_0 + u_{0i} + \beta_1 a_{ij} + u_{1i} a_{ij} + \varepsilon_{ij}.\]

To do this in `R`

we use the `fit_lmms`

function:

```
= paste('marker', 1:p, sep = '')
y.names = fit_lmms(y.names = y.names,
step1 fixefs = ~ age, ranefs = ~ age | id,
long.data = simdata$long.data,
surv.data = simdata$surv.data,
t.from.base = t.from.base,
n.boots = 10, n.cores = n.cores)
```

`## You requested 2 cores for this computation. It seems that your computer actually has 8 cores available. Consider increasing n.cores accordingly to speed computations up! =)`

```
## Sorting long.data by subject id
## Sorting surv.data by subject id
## Preliminary step: remove measurements taken after event / censoring.
## Removed: 209 measurements. Retained: 391 measurements.
## This computation will be run in parallel
## Estimating the LMMs on the original dataset...
## ...done
## Bootstrap procedure started
## Bootstrap procedure finished
## Computation of step 1: finished :)
```

The LMM fitted here is just an example of how to model longitudinal biomarker trajectories: depending on the data you are dealing with, you may choose to specify different fixed and random effects formulas, or even to consider the MLPMM instead of the LMM.

Note that here I have set `n.boots = 10`

to reduce
computing time for the CBOCP, given that CRAN only allows me to use two
cores when compiling the vignette. In general, it is recommended to set
`n.boots = 0`

if you do not wish to compute the CBOCP, or to
set `n.boots`

equal to a larger number (e.g., 50, 100 or 200)
if you want to accurately compute the CBOCP. In the latter case,
consider using as many cores as available to you to speed computations
up.

`fit_lmms`

returns as output a list with several elements;
among them is `lmm.fits.orig`

, which contains the LMMs fitted
to each biomarker:

`ls(step1)`

```
## [1] "boot.ids" "call.info" "df.sanitized" "lmm.fits.boot"
## [5] "lmm.fits.orig" "n.boots"
```

```
# estimated LMM for marker1:
$lmm.fits.orig[1] step1
```

```
## $marker1
## Linear mixed-effects model fit by REML
## Data: df
## Log-restricted-likelihood: -678.8923
## Fixed: fixef.formula
## (Intercept) age
## 3.491175 -1.025144
##
## Random effects:
## Formula: ~age | id
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 1.4804799 (Intr)
## age 0.8240204 0.392
## Residual 0.7262056
##
## Number of Observations: 391
## Number of Groups: 100
```

A useful helper function to inspect the parameter estimates from the
LMMs is the `getlmm`

function:

`getlmm(step1, 'marker3', 'betas')`

```
## (Intercept) age
## 5.779226 1.577539
```

`getlmm(step1, 'marker3', 'tTable')`

```
## Value Std.Error DF t-value p-value
## (Intercept) 5.779226 0.3066900 290 18.84387 2.827700e-52
## age 1.577539 0.1166793 290 13.52029 1.241922e-32
```

`getlmm(step1, 'marker3', 'variances')`

```
## id = pdLogChol(age)
## Variance StdDev Corr
## (Intercept) 0.4374300 0.6613849 (Intr)
## age 0.8941522 0.9455962 0.909
## Residual 0.4994799 0.7067389
```

For more details about the arguments of `fit_lmms`

and its
outputs, see the help pages: `?fit_lmms`

and
`getlmm`

.

In the second step we **compute the predicted random intercepts
and random slopes** for the LMMs fitted in step 1:

`= summarize_lmms(object = step1, n.cores = n.cores) step2 `

`## You requested 2 cores for this computation. It seems that your computer actually has 8 cores available. Consider increasing n.cores accordingly to speed computations up! =)`

```
## This computation will be run in parallel
## Computing the predicted random effects on the original dataset...
## ...done
## Bootstrap procedure started
## Bootstrap procedure finished
## Computation of step 2: finished :)
```

`summarize_lmms`

returns as output a list that contains,
among other elements, a matrix `ranef.orig`

with the
predicted random effects for the LMMs fitted in step 1:

`ls(step2)`

```
## [1] "boot.ids" "call" "n.boots" "ranef.boot.train"
## [5] "ranef.boot.valid" "ranef.orig"
```

```
# view predicted random effects for the first two markers
$ranef.orig[1:5, 1:4] step2
```

```
## marker1_b_int marker1_b_age marker2_b_int marker2_b_age
## 1 0.469478883 0.50242811 0.4806166 0.9897540
## 2 -0.653587521 -1.18903060 0.3094300 -1.2798248
## 3 0.544229791 1.35575525 -1.1855794 -0.4835888
## 4 1.066468253 0.82788124 -0.4332657 0.1949000
## 5 0.002425792 -0.09187156 -0.4224204 -1.4788434
```

For more details about the arguments of `summarize_lmms`

and its outputs, see the help page: `?summarize_lmms`

.

Lastly, in the third step of PRC-LMM we estimate a **penalized
Cox model** where we employ as predictors baseline age and all
the summaries (predicted random effects) computed in step 2:

```
= fit_prclmm(object = step2, surv.data = simdata$surv.data,
step3 baseline.covs = ~ baseline.age,
penalty = 'ridge', n.cores = n.cores)
```

`## You requested 2 cores for this computation. It seems that your computer actually has 8 cores available. Consider increasing n.cores accordingly to speed computations up! =)`

```
## Estimating penalized Cox model on the original dataset...
## ...done
## Bootstrap procedure started
## This computation will be run in parallel
## Bootstrap procedure finished
## Computation of step 3: finished :)
```

In this example I have set `penalty = 'ridge'`

, but you
may also use elasticnet or lasso as alternatives. Moreover, by default
the predicted random effects are standardized when included in the
penalized Cox model (if you don’t want to perform such standardization,
set `standardize = F`

).

`fit_prclmm`

returns as output an object of class
`prclmm`

with its own `print`

and
`summary`

methods:

`print(step3)`

```
## Fitted model: PRC-LMM
## Penalty function used: ridge
## Sample size: 100
## Number of events: 78
## Bootstrap optimism correction: computed using 10 bootstrap samples
## Penalized likelihood estimates (rounded to 4 digits):
## baseline.age marker1_b_int marker1_b_age marker2_b_int marker2_b_age
## 1 -0.03 -0.089 0.0784 0.1868 0.5521
## marker3_b_int marker3_b_age marker4_b_int marker4_b_age marker5_b_int
## 1 0.3243 0.2093 -0.0564 -0.0163 -0.9441
## marker5_b_age marker6_b_int marker6_b_age marker7_b_int marker7_b_age
## 1 -0.1867 20.6379 0.0538 0.1136 0.0204
## marker8_b_int marker8_b_age marker9_b_int marker9_b_age marker10_b_int
## 1 -1.0472 0.0768 -0.1583 -0.087 0.0868
## marker10_b_age
## 1 0.002
```

In practice, the `prclmm`

object is also a list that
contains, among other elements, the fitted penalized Cox model
`pcox.orig`

, which is a `glmnet`

object:

`ls(step3)`

`## [1] "boot.ids" "call" "n.boots" "pcox.boot" "pcox.orig" "surv.data"`

`class(step3$pcox.orig)`

`## [1] "cv.glmnet"`

`library(glmnet)`

`## Loading required package: Matrix`

`## Loaded glmnet 4.1-6`

`t(as.matrix(coef(step3$pcox.orig)))`

```
## baseline.age marker1_b_int marker1_b_age marker2_b_int marker2_b_age
## 1 -9.07517e-05 -0.01937723 -0.005830572 0.05177944 0.1419437
## marker3_b_int marker3_b_age marker4_b_int marker4_b_age marker5_b_int
## 1 0.09567494 0.06274591 -0.0094313 -0.0005147802 -0.355839
## marker5_b_age marker6_b_int marker6_b_age marker7_b_int marker7_b_age
## 1 -0.07001313 6.311638 0.02016179 0.03792503 0.009298512
## marker8_b_int marker8_b_age marker9_b_int marker9_b_age marker10_b_int
## 1 -0.2801484 0.02157958 -0.1438505 -0.02858716 0.02709515
## marker10_b_age
## 1 0.001512091
```

For more details about the arguments of `fit_prclmm`

and
its outputs, see the help page: `?fit_prclmm`

.

Estimation of the PRC-MLPMM(U) and PRC-MLPMM(U+B) models proceeds in
a similar fashion. For details and examples about this multivariate
approach, see: `?fit_mlpmms`

(step 1),
`?summarize_mlpmms`

(step 2), and `?fit_prcmlpmm`

(step 3).

After fitting the model, you will probably want to obtain
**predicted survival probabilities** for each individual at
several time points. This can be done through the function
`survpred_prclmm`

, which takes as inputs the outputs of
step1, step 2 and step 3, alongside with the time points at which to
compute the survival probabilities:

```
= survpred_prclmm(step1, step2, step3, times = c(1, 2, 3))
preds ls(preds)
```

`## [1] "call" "predicted_survival"`

`head(preds$predicted_survival)`

```
## id S(1) S(2) S(3)
## 1 1 0.4402297 0.1335658 0.0571555
## 2 2 0.7879884 0.5573018 0.4355453
## 3 3 0.7230333 0.4512491 0.3226312
## 4 4 0.4924490 0.1758515 0.0845033
## 5 5 0.6469203 0.3434663 0.2188754
## 6 6 0.5550192 0.2358348 0.1282560
```

To accurately quantify the predicted performance of the fitted PRC
model, we need to recur to some form of **internal
validation** strategy (e.g., bootstrap, cross-validation,
etc…).

In `pencal`

the internal validation is performed through a
Cluster Bootstrap Optimism Correction Procedure (CBOCP) that allows to
compute **optimism-corrected estimates of the concordance (C)
index and of the time-dependent AUC**.

Most of the steps that the CBOCP requires are directly computed by
the functions `fit_lmms`

, `summarize_lmms`

and
`fit_prclmm`

whenever the argument `n.boots`

of
`fit_lmms`

is set equal to an integer > 0 (in other words:
most of the computations needed for the CBOCP have already been
performed in the code chunks executed above, so we are almost
done!).

To gather the results of the CBOCP we can use the function
`performance_prc`

:

```
= performance_prc(step2, step3, times = c(1, 2, 3),
cbocp n.cores = n.cores)
```

`## You requested 2 cores for this computation. It seems that your computer actually has 8 cores available. Consider increasing n.cores accordingly to speed computations up! =)`

```
## This computation will be run in parallel
## Computation of optimism correction started
## Computation of the optimism correction: finished :)
```

```
# C index estimates:
$concordance cbocp
```

```
## n.boots C.naive cb.opt.corr C.adjusted
## 1 10 0.7867 -0.0371 0.7496
```

```
# time-dependent AUC estimates:
$tdAUC cbocp
```

```
## pred.time tdAUC.naive cb.opt.corr tdAUC.adjusted
## 1 1 0.8920 -0.0334 0.8586
## 2 2 0.8388 -0.0552 0.7836
## 3 3 0.8347 -0.0685 0.7662
```

From the results above we can see that:

- the naive C index is estimated = 0.786, and the optimism-corrected C index is estimated = 0.747;
- the naive tdAUC for predictions at 1 year from baseline is estimated = 0.887, and the optimism-corrected estimate is = 0.850;
- the naive tdAUC for predictions at 2 years from baseline is estimated = 0.834, and the optimism-corrected estimate is = 0.796;
- the naive tdAUC for predictions at 3 years from baseline is estimated = 0.830, and the optimism-corrected estimate is = 0.770.

The aim of this vignette is to provide a quick-start introduction to
the `R`

package `pencal`

. Here I have focused my
attention on the fundamental aspects that one needs to know to be able
use the package. More details, functions and examples can be found in the manual
of the package. The methodology that `pencal`

relies on
is described in Signorelli et
al. (2021).