Naive-Pooled Parameter Estimation

John Harrold & Anson Abraham

2020-01-20

Introduction

The workshop (https://workshop.ubiquity.tools) provides several examples of performing parameter estimation in ubiquity. To make a copy of these scripts and other supporting files in the current working directory run the following:

library(ubiquity)
fr = workshop_fetch(section="Estimation", overwrite=TRUE)

Each of these scripts build on the previous one to demonstrate different features of the ubiquity parameter estimation routines. All of these examples use a system (shown in the figure below) that contains three differential equations tracking the mass of the parent drug in the blood (Mpb) and in a tissue space (Mpt). In the blood the parent can form (fmKp) a metabolite (Mmb) with subsequent elimination (Km). The parent/metabolite model was adapted from the ADAPT5 Users Manual (https://bmsr.usc.edu/files/2013/02/ADAPT5-User-Guide.pdf).

Model structure of parent/metabolite model

Model structure of parent/metabolite model

Using system_new (use ?system_new to see a list of the available system file examples) you can copy this template file into the current directory and build the system:

library(ubiquity)
system_new(file_name="system.txt", system_file="adapt", overwrite = TRUE)
cfg = build_system(system_file = "system.txt")

Once the system has been built, you can create a local copy of an estimation template for the system:

system_fetch_template(cfg, template="Estimation")

This should create the file: analysis_estimate.R in the working directory. At the beginning of the script there are three variables that are created to control what the script does and the format of the output. The variable analysis_name defines the prefix that will be prepended to the output generated by the script. Archiving the analysis results is controlled by the Boolean variable archive_results. Lastly, the script is controlled using the flowctl variable (the possible options are listed and commented out).

The estimation script has the following main components:

  1. Select the parameter set and subset of parameters to estimate
  2. Set options (simulation, estimation, etc)
  3. Load datasets
  4. Define cohorts
  5. Estimate parameters
  6. Plot results

These will be explored using the scripts above.

Least squares estimation/single output (analysis_parent.r)

The first example above (analysis_parent.r) begins by specifying that we want to perform a parameter estimation and archive the results using the name1 parent_d1030 to indicate we are analyzing the parent PK for the 10 and 30 mg dosing cohorts:

flowctl = 'estimate'
archive_results = TRUE
analysis_name = 'parent_d1030'

Next we select the parameters to estimate. Because we’re only estimating parent data, only the names of the parameters relevant to the parent PK (pnames) are selected. To estimate all parameters simply exclude the third argument. Because only system parameters are being estimated a weighted least squares objective will be used2.

pnames = c('Vp', 'Vt', 'CLp', 'Q')
cfg = system_select_set(cfg, "default", pnames)

Next we set options relevant to the estimation and the underlying simulation routines. In this example only the simulation output times have been specified, but any relevant parameters can be set. The template that is generated has several common options that are commented out. These can be uncommented and modified as needed.

cfg=system_set_option(cfg, group  = "simulation", 
                           option = "output_times", 
                           seq(0,100,1))

The dataset (shown in the table below) is included with the ubiquity package and is accessed here using system.file. In the example script it is referenced explicitly. The format requirements of datasets is that they be flat files with a header row. The format is flexible and only requires time/observation information and columns required to filter the data and isolate cohorts that have received the same treatment. The inputs are defined when the cohorts themselves are defined. Data files are loaded using system_load_data and a name3 is assigned the dataset. Data can be read from files (csv, xls, and tab) or from an existing data frame. This name (pm_data here) will be used to reference this dataset later in the script.

cfg = system_load_data(cfg, dsname     = "pm_data", 
                            data_file  = system.file("ubinc", "csv",
                                                     "pm_data.csv", 
                                                     package = "ubiquity"))

The dataset has a column for the observation time (TIME), the subject id (ID), concentrations for the parent (PT) and the metabolite (MT), a BLQ flag and the nominal dose (DOSE). But only some of these columns will be used (TIME, PT, MT and DOSE), and any others (BQL) will be ignored. This is intended to be an example and not a general guide for a dataset format. It is not necessary to have different outputs in different columns (wide format). The dataset could be loaded in the tall skinny format as well. In this example we are going to analyze the parent data for individuals given 10 or 30 mg using a least squares objective.

Next we need to identify the data and inputs associated with those data. This is done by defining cohorts, groups of data that receive the same treatment and the model inputs (bolus dosing, infusion rates, covariates) associated with those cohorts. In this example we will define a cohort for each dosing group. To make sure we are starting from a blank slate we can use system_clear_cohorts to remove any previously defined information.

cfg = system_clear_cohorts(cfg)

For each cohort we define a list with information about that cohort. Each cohort should have a unique name field4 and a dataset field pointing to the dataset (pm_data) loaded above. The optional cohort filter (cf) field is used to reduce the entire dataset to the records associated with this cohort. See the help for system_define_cohort for more information on how to construct cohort filters.

cohort  = c()
cohort$name                                   = 'dose_10'
cohort$dataset                                = 'pm_data'
cohort$cf$DOSE                                = c(10)

Next it is necessary to define the inputs for this cohort. Here inputs refers to model inputs which may include both dosing as well as covariates, and the estimation template (generated using system_fetch_template above) should contain placeholders for each of these defined in the system file. Note: For bolus and infusion inputs, it is only necessary to define inputs that are nonzero, and for covariates it is only necessary to define those that differ from the definitions in the system file. For each input there is an AMT field and TIME field and the units here are those specified in the system file (both AMT and TIME are internal indentifiers and not taken from the dataset).

cohort$inputs$bolus$Mpb$AMT                   = c(10)
cohort$inputs$bolus$Mpb$TIME                  = c(0)

Next we need to match the outputs in the model to the outputs in the dataset. Under cohort$outputs there is a field used to group each output. Here the cohort output mapping for the blood PK of the parent output is Parent. The times and observations in the dataset are found in the 'TIME’ column and the ’PT’ column (missing data specified by -1 will be dropped). These are mapped to the model timescale ('hours', speified with <TS:?>) and model output (’Cpblood’, specified with <O>). Note the units in the dataset must be the same as those in the model:

cohort$outputs$Parent$obs$time                = 'TIME'
cohort$outputs$Parent$obs$value               = 'PT'
cohort$outputs$Parent$obs$missing             = -1
cohort$outputs$Parent$model$time              = 'hours'
cohort$outputs$Parent$model$value             = 'Cpblood'
cohort$outputs$Parent$model$variance          = '1'

For each output grouping in the cohort the marker color, shape and line type can be specified (controlling the plotted output).

cohort$outputs$Parent$options$marker_color    = 'black'   
cohort$outputs$Parent$options$marker_shape    = 1
cohort$outputs$Parent$options$marker_line     = 2

Finally the cohort is defined using system_define_cohort:

cfg = system_define_cohort(cfg, cohort)

We do the same thing for the 30 mg dose group:

cohort  = c()
cohort$name                                   = 'dose_30'
cohort$dataset                                = 'pm_data'
cohort$cf$DOSE                                = c(30)
cohort$inputs$bolus$Mpb$AMT                   = c(30)
cohort$inputs$bolus$Mpb$TIME                  = c(0)
cohort$outputs$Parent$obs$time                = 'TIME'
cohort$outputs$Parent$obs$value               = 'PT'
cohort$outputs$Parent$obs$missing             = -1
cohort$outputs$Parent$model$time              = 'hours'
cohort$outputs$Parent$model$value             = 'Cpblood'
cohort$outputs$Parent$model$variance          = '1'
cohort$outputs$Parent$options$marker_color    = 'red'   
cohort$outputs$Parent$options$marker_shape    = 2
cohort$outputs$Parent$options$marker_line     = 2

cfg = system_define_cohort(cfg, cohort)

After the cohorts have been defined we call the estimation function (system_estimate_parameters). If flowctl is set to 'plot previous estimate' or 'plot guess' the those values will just be returned.

pest = system_estimate_parameters(cfg, 
                                  flowctl         = flowctl, 
                                  analysis_name   = analysis_name, 
                                  archive_results = archive_results)

If one of the estimation options are selected for the flowctl then several files will be generated in the output folder with the analysis_name as a prefix:

Next the system is simulated at the estimate and the data is stored in erp.

cfg=system_set_option(cfg, group  = "simulation", 
                           option = "output_times", 
                           seq(0,100,5))
erp = system_simulate_estimation_results(pest = pest, cfg = cfg) 

The information in this variable will be used to generate some standard plots below, but it may be desirable to save this information or generate your own figures. To do this it is necessary to understand the structure of erp. This list has two different fields.

Lastly the predictions overlaying the data and the observed vs predicted plots are generated using system_plot_cohorts. Basic formatting of these figures is controlled using the plot_opts list (see the ?system_plot_cohorts for details).

plot_opts = c()
plot_opts$outputs$Parent$yscale       = 'log'
plinfo = system_plot_cohorts(erp, plot_opts, cfg, analysis_name=analysis_name)

When called, system_plot_cohorts will write png and pdf output for the time course and observed vs predicted files. It will also return a list with the ggplot objects and relative paths to the files as well. In this example the following will be generated:

Timecourse

Timecourse

Observed vs Predicted

Observed vs Predicted

Automated Reporting

The outputs above provide components for generating presentations and other documents. Coping and pasting these figures and tables into documents can be tedious. It can be convenient to automate this process and this is accomplished with the function system_report_estimation.

PowerPoint

To append the results of an analysis to a PowerPoint document simply initialize a new report (rpttype="PowerPoint"), call system_report_estiamtion with the appropriate analysis_name, and the results of the analysis will be attached as slides.

This will then save the analysis results to the output directory.

Word

The process for a Word document is the same. Just make sure that the rpttype is set to "Word" when the report is initialized:

For more information on integrated report generation see the Reporting vignette.

Maximum likelihood/two outputs (analysis_parent_metabolite.r)

This example is similar to the last except we are analyzing two different outputs (parent and metabolite) and we have a proportional variance model. So now we can estimate the parameters associated with those outputs as well as the variance parameters:

pnames = c('Vp',
           'Vt',
           'Vm',
           'CLp',
           'Q',
           'CLm',
           'slope_parent',
           'slope_metabolite');

cfg = system_select_set(cfg, "default", pnames)

The parameters being estimated contain variance parameters (slope_parent and slope_metabolite) so a maximum likelihood objective will be used. The cohort definitions look much the same as those before except the variance model here is defined as 'slope_parent*PRED^2', and there is a separate output named Metabolite.

cfg = system_clear_cohorts(cfg)
cohort = c()
cohort$name                                      = 'dose_10'
cohort$cf$DOSE                                   = c(10)
cohort$dataset                                   = 'pm_data'
                                                 
cohort$inputs$bolus$Mpb$TIME                     = c(0)  # hours 
cohort$inputs$bolus$Mpb$AMT                      = c(10) # mpk 

cohort$outputs$Parent$obs$time                   = 'TIME' 
cohort$outputs$Parent$obs$value                  = 'PT'    
cohort$outputs$Parent$obs$missing                = -1;
cohort$outputs$Parent$model$time                 = 'hours'        
cohort$outputs$Parent$model$value                = 'Cpblood'
cohort$outputs$Parent$model$variance             = 'slope_parent*PRED^2'
cohort$outputs$Parent$options$marker_color       = 'black'
cohort$outputs$Parent$options$marker_shape       = 1
cohort$outputs$Parent$options$marker_line        = 1 

cohort$outputs$Metabolite$obs$time               = 'TIME' 
cohort$outputs$Metabolite$obs$value              = 'MT'    
cohort$outputs$Metabolite$obs$missing            = -1;
cohort$outputs$Metabolite$model$time             = 'hours'        
cohort$outputs$Metabolite$model$value            = 'Cmblood'
cohort$outputs$Metabolite$model$variance         = 'slope_metabolite*PRED^2'
cohort$outputs$Metabolite$options$marker_color   = 'blue'
cohort$outputs$Metabolite$options$marker_shape   = 1
cohort$outputs$Metabolite$options$marker_line    = 1 
cfg = system_define_cohort(cfg, cohort)

Similar modifications were made to the 30 mg dosing cohort.

Global estimation routines (analysis_parent_metabolite_global.r)

Now we build on the previous example to demonstrate how to select different optimization routines. By default, the parameter estimation is carried out using the Nelder-Mead optimization method from the optim library. You can specify different functions in the library. See the documentation for optim (?optim) for valid values for method and elements for the control. For example, to use simulated annealing change the method to SANN.

cfg = system_set_option(cfg, group  = "estimation",
                             option = "method",
                             value  = "SANN")

There are also global optimization libraries in R, and two of these can be readily used with ubiquity: Particle Swarm Optimizer (pso package) and Genetic Algorithms (GA package).

To use the pso package set the following options:

library("pso")
cfg = system_set_option(cfg, group  = "estimation",
                             option = "optimizer", 
                             value  = "pso")

cfg = system_set_option(cfg, group  = "estimation",
                             option = "method",
                             value  = "psoptim")

To use the GA package set the following options:

library(GA)

cfg = system_set_option(cfg, group  = "estimation",
                             option = "optimizer", 
                             value  = "ga")

cfg = system_set_option(cfg, group  = "estimation",
                             option = "method",
                             value  = "ga")

cfg = system_set_option(cfg, group  = "estimation",
                             option = "control", 
                             value  = list(maxiter   = 10000, 
                                           optimArgs = list(method  = "Nelder-Mead",
                                                            maxiter = 1000)))

Note: Optimizers like SANN and the global optimizers (pso and GA) are good for identifying parameter sets outside of the region of the initial guess. However, one consequence of these algorithms is they can quickly approach the bounds. Consequently it is important to provide realistic upper and lower bounds on the parameters (the <P> descriptor in the system file or using system_set_guess at the scripting level). If you use the default value of machine precision (eps) for the lower bound and infinity (Inf) for the upper bound these optimization routines can choose parameter values that can cause the internal simulations to fail.

Cohorts from NONMEM dataset (analysis_parent_metabolite_nm_data.r)

In the examples above, cohorts are defined manually. Sometimes you may have data in a NONMEM dataset where the dosing information is located in the dataset. It may be convenient to simply define a cohort for each subject in the dataset. To do that the function system_define_cohorts_nm can be used. The differences between the script analysis_parent_metabolite.r will now be highlighted:

First we load the NONMEM dataset and clear the cohorts:

cfg = system_load_data(cfg, dsname     = "nm_pm_data", 
                            data_file  = system.file("ubinc", "csv",
                                                     "nm_data.csv", 
                                                     package = "ubiquity"))
cfg = system_clear_cohorts(cfg);

Next we define a filter to use on the dataset (include only the 10 and 30 mg doses):

filter = list()
filter$DOSE = c(10, 30)

For more information on filtering datasets see the help for nm_select_records.

Now we define maps for the different outputs. For each output we specify the variance, compartment (CMT) number, model output and the missing number flag:

OBSMAP = list()
OBSMAP$PT = list(variance     = 'slope_parent*PRED^2',
                 CMT          =  1,
                 output       = 'Cpblood', 
                 missing      =  -1 )

OBSMAP$MT = list(variance     = 'slope_metabolite*PRED^2',
                 CMT          =  2,
                 output       = 'Cmblood', 
                 missing      =  -1 )

Lastly we define a map for the model inputs. In this case we only have a bolus in the Mpb compartment:

INPUTMAP = list()
INPUTMAP$bolus$Mpb$CMT_NUM             =  1

Unused columns in the dataset will be ignored. With the filter, input and observation maps defined, we now add the cohorts

cfg = system_define_cohorts_nm(cfg, 
                               DS       = 'nm_pm_data',
                               col_ID   = 'ID',   col_CMT  = 'CMT', col_DV   = 'DV',
                               col_TIME = 'TIME', col_AMT  = 'AMT', col_RATE = 'RATE',
                               col_EVID = 'EVID', col_GROUP= 'DOSE',
                               filter   =  filter, 
                               INPUTS   =  INPUTMAP,
                               OBS      =  OBSMAP,
                               group    =  FALSE)

Contents of system.txt

#
# Parent/Metabolite example taken from Section 9.3 of the ADAPT5 Users Manual
#
<P> Vp    10.0   1e-5   100  L        yes       System     
<P> Vt    10.0   1e-5   100  L        yes       System     
<P> Vm    30.0   1e-5   100  L        yes       System
<P> CLp    1.0   1e-5   100  L/hr     yes       System
<P> CLm    1.0   1e-5   100  L/hr     yes       System
<P> Q      0.3   1e-5   100  L/hr     yes       System

<PSET:default> Original Estimates

<VP> slope_parent     0.1  1e-9    10  --  no Variance
<VP> slope_metabolite 0.1  1e-9    10  --  no Variance

<B:times>;           [  0  ];      1;          hours
<B:events>;   Mpb;   [  0  ];     70;          mpk

<ODE:Mpb> -(CLp/Vp + Q/Vp)*Mpb + Q/Vt*Mpt 
<ODE:Mpt> Q/Vp*Mpb - Q/Vt*Mpt
<ODE:Mmb> CLp/Vp*Mpb  - CLm/Vm*Mmb

<O> Cpblood     = Mpb/Vp
<O> Cmblood     = Mmb/Vm   

<TS:hours> 1
<TS:days>  1/24

<IIV:ETAVp>    0.08       
<IIV:ETAVp:LN> Vp     
<IIV:ETAVt>    0.08       
<IIV:ETAVt:LN> Vt     
<IIV:ETACLp>   0.08       
<IIV:ETACLp:LN> CLp
<IIV:ETACLm>   0.08       
<IIV:ETACLm:LN> CLm


<OPT:output_times> SIMINT_SEQ[0][100][1]
<DATA:HEADER:AUTOMATIC>

  1. analysis names must start with a letter and containing only letters, numbers, and _

  2. The objective function is set based on the parameters being estimated. If there are no variance parameters being estimated a weighted least squares objective will be used. If there are variance parameters being estimated then a maximum likelihood objective will be used.

  3. dataset names must start with a letter and containing only letters, numbers, and _

  4. cohort names must start with a letter and containing only letters, numbers, and _