This vignette aims to illustrate the use of pubh
functions for typical regression analysis in Public Health. In
particular, the vignette shows the use of the following functions from
pubh
:
cosm_reg
for reporting tables of coefficients.cross_tbl
for reporting tables of descriptive
statistics by exposure of interest.multiple
for adjusting confidence intervals and
p-values for post-hoc analysis.get_r2
for accessing \(R^2\) or pseudo-\(R^2\) from regression models.glm_coef
for some special cases of regression
models.The advantages and limitations of glm_coef
are:
gee
, glm
and
survreg
.We start by loading relevant packages and setting the theme for the plots (as suggested in the Template of this package):
rm(list = ls())
library(dplyr)
library(rstatix)
library(easystats)
library(pubh)
library(sjlabelled)
theme_set(see::theme_lucid(base_size = 10))
theme_update(legend.position = "top")
options('huxtable.knit_print_df' = FALSE)
options('huxtable.autoformat_number_format' = list(numeric = "%5.2f"))
knitr::opts_chunk$set(comment = NA)
There is no need to exponentiate the results for continuous outcomes unless the outcome fits in the log scale. In our first example, we want to estimate the effect of smoking and race on babies’ birth weights.
We can generate factors and assign labels in the same
pipe
stream:
data(birthwt, package = "MASS")
birthwt <- birthwt |>
mutate(
smoke = factor(smoke, labels = c("Non-smoker", "Smoker")),
race = factor(race, labels = c("White", "African American", "Other"))
) |>
var_labels(
bwt = 'Birth weight (g)',
smoke = 'Smoking status',
race = 'Race'
)
Is good to start with some basic descriptive statistics, so we can compare the birth weight between groups.
Graphical analysis:
Another way to compare the means between the groups is with
gen_bst_df
, which estimates the means of corresponding
bootstrapped CIs.
Birth weight (g) | LowerCI | UpperCI | Race | smoke |
---|---|---|---|---|
3428.75 | 3228.14 | 3635.42 | White | Non-smoker |
2826.85 | 2666.18 | 2996.39 | White | Smoker |
2854.50 | 2536.03 | 3139.63 | African American | Non-smoker |
2504.00 | 2124.68 | 2853.81 | African American | Smoker |
2815.78 | 2623.95 | 2998.48 | Other | Non-smoker |
2757.17 | 2292.10 | 3136.92 | Other | Smoker |
We fit a linear model.
Note: Model diagnostics are not be discussed in this vignette.
Traditional output from the model:
Anova Table (Type II tests)
Response: bwt
Sum Sq Df F value Pr(>F)
smoke 7322575 1 15.4588 0.0001191 ***
race 8712354 2 9.1964 0.0001557 ***
Residuals 87631356 185
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Parameter | Coefficient | SE | 95% CI | t(185) | p
-------------------------------------------------------------------------------------
(Intercept) | 3334.95 | 91.78 | [3153.89, 3516.01] | 36.34 | < .001
smoke [Smoker] | -428.73 | 109.04 | [-643.86, -213.60] | -3.93 | < .001
race [African American] | -450.36 | 153.12 | [-752.45, -148.27] | -2.94 | 0.004
race [Other] | -452.88 | 116.48 | [-682.67, -223.08] | -3.89 | < .001
Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
using a Wald t-distribution approximation.
# Indices of model performance
AIC | AICc | BIC | R2 | R2 (adj.) | RMSE | Sigma
----------------------------------------------------------------------
3012.223 | 3012.551 | 3028.432 | 0.123 | 0.109 | 680.924 | 688.246
Table of coefficients for publication:
model_norm |>
tbl_regression() |>
cosm_reg() |> theme_pubh() |>
add_footnote(get_r2(model_norm), font_size = 9)
Variable | Beta | 95% CI | p-value |
---|---|---|---|
Smoking status | <0.001 | ||
Non-smoker | — | — | |
Smoker | -429 | -644, -214 | |
Race | <0.001 | ||
White | — | — | |
African American | -450 | -752, -148 | |
Other | -453 | -683, -223 | |
R2 = 0.123 |
model_norm |>
glm_coef(labels = model_labels(model_norm)) |>
as_hux() |> set_align(everywhere, 2:3, "right") |>
theme_pubh() |>
add_footnote(get_r2(model_norm), font_size = 9)
Parameter | Coefficient | Pr(>|t|) |
---|---|---|
Constant | 3334.95 (3153.89, 3516.01) | < 0.001 |
Smoking status: Smoker | -428.73 (-643.86, -213.6) | < 0.001 |
Race: African American | -450.36 (-752.45, -148.27) | 0.004 |
Race: Other | -452.88 (-682.67, -223.08) | < 0.001 |
R2 = 0.123 |
Function glm_coef
allows the use of robust standard
errors.
model_norm |>
glm_coef(se_rob = TRUE, labels = model_labels(model_norm)) |>
as_hux() |> set_align(everywhere, 2:3, "right") |>
theme_pubh() |>
add_footnote(paste(
get_r2(model_norm), "\n",
"CIs and p-values estimated with robust standard errors."),
font_size = 9)
Parameter | Coefficient | Pr(>|t|) |
---|---|---|
Constant | 3334.95 (3144.36, 3525.53) | < 0.001 |
Smoking status: Smoker | -428.73 (-652.88, -204.58) | < 0.001 |
Race: African American | -450.36 (-734.09, -166.63) | 0.002 |
Race: Other | -452.88 (-701.4, -204.35) | < 0.001 |
R2 = 0.123 CIs and p-values estimated with robust standard errors. |
We can use functions from easystats
:
estimate_means
calculates the mean birth weight for each
combination with corresponding confidence intervals.
Estimated Marginal Means
race | smoke | Mean | SE | 95% CI
---------------------------------------------------------------------
White | Non-smoker | 3334.95 | 91.78 | [3153.89, 3516.01]
African American | Non-smoker | 2884.59 | 141.34 | [2605.74, 3163.44]
Other | Non-smoker | 2882.07 | 86.32 | [2711.77, 3052.37]
White | Smoker | 2906.22 | 86.21 | [2736.14, 3076.30]
African American | Smoker | 2455.86 | 150.74 | [2158.48, 2753.24]
Other | Smoker | 2453.34 | 122.81 | [2211.05, 2695.63]
Marginal means estimated at race
We can plot the output from estimate_means
using a
recipe from see
:
model_norm |>
estimate_means(by = c("race", "smoke")) |>
plot() |>
gf_labs(
x = "", y = "Birth weight (g)", title = "",
col = "Smoking status"
)
We have more flexibility and construct a more tidy plot too:
model_norm |>
estimate_means(c("race", "smoke")) |>
gf_point(Mean ~ race) |>
gf_errorbar(CI_low + CI_high ~ race, width = 0.3) |>
gf_facet_wrap(~smoke) |>
gf_labs(
x = "Race",
y = "Birth weight (g)"
)
When the explanatory variables are categorical, another option is
emmip
from the emmeans
package. We can include
CIs in emmip
but as estimates are connected, the
resulting plots look messy, so I recommend emmip
to look at
the trace.
For logistic regression, we are interested in odds ratios. We will examine the effect of fibre intake on the development of coronary heart disease.
data(diet, package = "Epi")
diet <- diet |>
mutate(
chd = factor(chd, labels = c("No CHD", "CHD"))
) |>
var_labels(
chd = "Coronary Heart Disease",
fibre = "Fibre intake (10 g/day)"
)
We start with descriptive statistics:
Coronary Heart Disease | N | Min | Max | Mean | Median | SD | CV | |
---|---|---|---|---|---|---|---|---|
Fibre intake (10 g/day) | No CHD | 288.00 | 0.60 | 5.35 | 1.75 | 1.69 | 0.58 | 0.33 |
CHD | 45.00 | 0.76 | 2.43 | 1.49 | 1.51 | 0.40 | 0.27 |
We fit a linear logistic model:
Summary:
Parameter | Odds Ratio | SE | 95% CI | z | p
--------------------------------------------------------------
(Intercept) | 0.95 | 0.58 | [0.30, 3.16] | -0.08 | 0.936
fibre | 0.33 | 0.12 | [0.15, 0.66] | -2.94 | 0.003
Uncertainty intervals (profile-likelihood) and p-values (two-tailed)
computed using a Wald z-distribution approximation.
# Indices of model performance
AIC | AICc | BIC | Tjur's R2 | RMSE | Sigma | Log_loss | Score_log | Score_spherical | PCP
--------------------------------------------------------------------------------------------------------
257.535 | 257.571 | 265.151 | 0.028 | 0.337 | 1.000 | 0.381 | -6.359 | 0.017 | 0.773
Table of coefficients for publication:
model_binom |>
tbl_regression(exponentiate = TRUE) |>
cosm_reg() |> theme_pubh() |>
add_footnote(get_r2(model_binom), font_size = 9)
Variable | OR | 95% CI | p-value |
---|---|---|---|
Fibre intake (10 g/day) | 0.33 | 0.15, 0.66 | 0.001 |
Tjur's R2 = 0.028 |
Effect plot:
model_binom |>
estimate_means(by = "fibre") |>
gf_line(Probability ~ fibre) |>
gf_ribbon(CI_low + CI_high ~ fibre) |>
gf_labs(
x = "Fibre intake (10 g/day)",
y = "Probability of CHD"
)
We will look at a matched case-control study on the effect of oestrogen use and history of gall bladder disease on the development of endometrial cancer.
data(bdendo, package = "Epi")
bdendo <- bdendo |>
mutate(
cancer = factor(d, labels = c('Control', 'Case')),
gall = factor(gall, labels = c("No GBD", "GBD")),
est = factor(est, labels = c("No oestrogen", "Oestrogen"))
) |>
var_labels(
cancer = 'Endometrial cancer',
gall = 'Gall bladder disease',
est = 'Oestrogen'
)
We start with descriptive statistics:
bdendo |>
mutate(
cancer = relevel(cancer, ref = "Case"),
gall = relevel(gall, ref = "GBD"),
est = relevel(est, ref = "Oestrogen")
) |>
copy_labels(bdendo) |>
select(cancer, gall, est) |>
tbl_strata(
strata = est,
.tbl_fun = ~ .x |>
tbl_summary(by = gall)
) |>
cosm_sum(bold = TRUE, head_label = " ") |>
theme_pubh(2) |>
set_align(1, everywhere, "center")
Oestrogen | No oestrogen | |||
---|---|---|---|---|
GBD | No GBD | GBD | No GBD | |
Endometrial cancer | ||||
Case | 13 (45%) | 43 (28%) | 4 (33%) | 3 (2.5%) |
Control | 16 (55%) | 111 (72%) | 8 (67%) | 117 (98%) |
bdendo |>
gf_percents(~ cancer|gall, fill = ~ est, position = "dodge", alpha = 0.6) |>
gf_labs(
y = "Percent",
x = "",
fill = ""
)
We fit the conditional logistic model:
require(survival, quietly = TRUE)
model_clogit <- clogit(cancer == 'Case' ~ est * gall + strata(set), data = bdendo)
model_clogit |>
glm_coef(labels = c("Oestrogen/No oestrogen", "GBD/No GBD",
"Oestrogen:GBD Interaction")) |>
as_hux() |> set_align(everywhere, 2:3, "right") |>
theme_pubh() |>
add_footnote(get_r2(model_clogit), font_size = 9)
Parameter | Odds ratio | Pr(>|z|) |
---|---|---|
Oestrogen/No oestrogen | 14.88 (4.49, 49.36) | < 0.001 |
GBD/No GBD | 18.07 (3.2, 102.01) | 0.001 |
Oestrogen:GBD Interaction | 0.13 (0.02, 0.9) | 0.039 |
Nagelkerke's R2 = 0.305 |
We use data about house satisfaction.
#require(MASS, quietly = TRUE)
data(housing, package = "MASS")
housing <- housing |>
var_labels(
Sat = "Satisfaction",
Infl = "Perceived influence",
Type = "Type of rental",
Cont = "Contact"
)
We fit the ordinal logistic model:
model_ord |>
tbl_regression(exponentiate = TRUE) |>
cosm_reg() |> theme_pubh() |>
add_footnote(get_r2(model_ord), font_size = 9)
Variable | OR | 95% CI | p-value |
---|---|---|---|
Perceived influence | <0.001 | ||
Low | — | — | |
Medium | 1.76 | 1.44, 2.16 | |
High | 3.63 | 2.83, 4.66 | |
Type of rental | <0.001 | ||
Tower | — | — | |
Apartment | 0.56 | 0.45, 0.71 | |
Atrium | 0.69 | 0.51, 0.94 | |
Terrace | 0.34 | 0.25, 0.45 | |
Contact | <0.001 | ||
Low | — | — | |
High | 1.43 | 1.19, 1.73 | |
Nagelkerke's R2 = 0.108 |
Effect plots:
model_ord |>
estimate_means(by = c("Infl", "Cont", "Type")) |>
mutate(
Mean = inv_logit(Mean),
CI_low = inv_logit(CI_low),
CI_high = inv_logit(CI_high)
) |>
gf_errorbar(CI_low + CI_high ~ Infl, width = 0.2) |>
gf_point(Mean ~ Infl, size = 0.7) |>
gf_facet_grid(~ Type + Cont) |>
gf_labs(
x = "Perceived influence",
y = "Satisfaction"
)
For Poisson regression, we are interested in incidence rate ratios. We will examine the effect of sex, ethnicity, and age group on the number of absent days from school in a year.
data(quine, package = "MASS")
levels(quine$Eth) <- c("Aboriginal", "White")
levels(quine$Sex) <- c("Female", "Male")
quine <- quine |>
var_labels(
Days = "Number of absent days",
Eth = "Ethnicity",
Age = "Age group"
)
Descriptive statistics:
quine |>
cross_tbl(by = "Eth") |>
theme_pubh(2) |>
add_footnote("n (%); Median (IQR)", font_size = 9)
Ethnicity | |||
---|---|---|---|
Aboriginal | White | Overall | |
Sex | |||
Female | 38 (55%) | 42 (55%) | 80 (55%) |
Male | 31 (45%) | 35 (45%) | 66 (45%) |
Age group | |||
F0 | 13 (19%) | 14 (18%) | 27 (18%) |
F1 | 20 (29%) | 26 (34%) | 46 (32%) |
F2 | 20 (29%) | 20 (26%) | 40 (27%) |
F3 | 16 (23%) | 17 (22%) | 33 (23%) |
Lrn | |||
AL | 40 (58%) | 43 (56%) | 83 (57%) |
SL | 29 (42%) | 34 (44%) | 63 (43%) |
Number of absent days | 15 (6, 32) | 7 (4, 16) | 11 (5, 23) |
n (%); Median (IQR) |
We start by fitting a standard Poisson linear regression model:
model_pois |>
tbl_regression(exponentiate = TRUE) |>
cosm_reg() |> theme_pubh() |>
add_footnote(get_r2(model_pois), font_size = 9)
Variable | IRR | 95% CI | p-value |
---|---|---|---|
Ethnicity | <0.001 | ||
Aboriginal | — | — | |
White | 0.59 | 0.54, 0.64 | |
Sex | 0.012 | ||
Female | — | — | |
Male | 1.11 | 1.02, 1.21 | |
Age group | <0.001 | ||
F0 | — | — | |
F1 | 0.80 | 0.70, 0.91 | |
F2 | 1.42 | 1.26, 1.60 | |
F3 | 1.35 | 1.19, 1.53 | |
Nagelkerke's R2 = 0.896 |
# Indices of model performance
AIC | AICc | BIC | Nagelkerke's R2 | RMSE | Sigma | Score_log | Score_spherical
-----------------------------------------------------------------------------------------------
2342.982 | 2343.586 | 2360.883 | 0.896 | 14.940 | 1.000 | -7.983 | 0.049
The assumption is that the mean is equal to the variance. If that is the case, deviance should be close to the degrees of freedom of the residuals. We can check for overdispersion:
# Overdispersion test
dispersion ratio = 13.890
Pearson's Chi-Squared = 1944.553
p-value = < 0.001
Overdispersion detected.
Thus, we have overdispersion. One option is to use a negative binomial distribution.
model_negbin |>
tbl_regression(exponentiate = TRUE) |>
cosm_reg() |> theme_pubh() |>
add_footnote(get_r2(model_negbin), font_size = 9)
Variable | IRR | 95% CI | p-value |
---|---|---|---|
Ethnicity | <0.001 | ||
Aboriginal | — | — | |
White | 0.57 | 0.41, 0.77 | |
Sex | 0.7 | ||
Female | — | — | |
Male | 1.07 | 0.77, 1.47 | |
Age group | 0.023 | ||
F0 | — | — | |
F1 | 0.69 | 0.43, 1.09 | |
F2 | 1.20 | 0.75, 1.89 | |
F3 | 1.29 | 0.80, 2.06 | |
Nagelkerke's R2 = 0.21 |
# Indices of model performance
AIC | AICc | BIC | Nagelkerke's R2 | RMSE | Sigma | Score_log | Score_spherical
-----------------------------------------------------------------------------------------------
1109.653 | 1110.464 | 1130.538 | 0.210 | 15.087 | 1.000 | -3.908 | 0.067
Effect plot:
We adjust for multiple comparisons:
contrast Eth ratio SE null z.ratio p.value lower.CL upper.CL
1 F1 / F0 Aboriginal 0.69 0.16 1 -1.57 0.40 0.38 1.26
2 F2 / F0 Aboriginal 1.20 0.28 1 0.77 0.86 0.66 2.17
3 F2 / F1 Aboriginal 1.73 0.35 1 2.65 0.04 1.02 2.92
4 F3 / F0 Aboriginal 1.29 0.31 1 1.04 0.72 0.69 2.40
5 F3 / F1 Aboriginal 1.86 0.40 1 2.89 0.02 1.07 3.21
6 F3 / F2 Aboriginal 1.08 0.23 1 0.34 0.99 0.62 1.88
7 F1 / F0 White 0.69 0.16 1 -1.57 0.40 0.38 1.26
8 F2 / F0 White 1.20 0.28 1 0.77 0.86 0.66 2.17
9 F2 / F1 White 1.73 0.35 1 2.65 0.04 1.02 2.92
10 F3 / F0 White 1.29 0.31 1 1.04 0.72 0.69 2.40
11 F3 / F1 White 1.86 0.40 1 2.89 0.02 1.07 3.21
12 F3 / F2 White 1.08 0.23 1 0.34 0.99 0.62 1.88
We can see the comparison graphically with:
We will use an example on the effect of thiotepa versus placebo on the development of bladder cancer.
Warning in data(bladder): data set 'bladder' not found
bladder <- bladder |>
mutate(times = stop,
rx = factor(rx, labels=c("Placebo", "Thiotepa"))
) |>
var_labels(times = "Survival time",
rx = "Treatment")
Using robust standard errors:
model_surv |>
glm_coef(labels = c("Treatment: Thiotepa/Placebo", "Scale"), se_rob = TRUE) |>
as_hux() |> set_align(everywhere, 2:3, "right") |>
theme_pubh(1) |>
add_footnote(get_r2(model_surv), font_size = 9)
Parameter | Survival time ratio | Pr(>|z|) |
---|---|---|
Treatment: Thiotepa/Placebo | 1.64 (0.89, 3.04) | 0.116 |
Scale | 1 (0.85, 1.18) | 0.992 |
Nagelkerke's R2 = 0.02 |
In this example, the scale parameter is not statistically different from one, meaning the hazard is constant, and thus, we can use the exponential distribution:
model_exp |>
glm_coef(labels = c("Treatment: Thiotepa/Placebo"), se_rob = TRUE) |>
as_hux() |> set_align(everywhere, 2:3, "right") |>
theme_pubh() |>
add_footnote(get_r2(model_exp), font_size = 9)
Parameter | Survival time ratio | Pr(>|z|) |
---|---|---|
Treatment: Thiotepa/Placebo | 1.64 (0.85, 3.16) | 0.139 |
Nagelkerke's R2 = 0.02 |
Interpretation: Patients receiving Thiotepa live on average 64% more than those in the Placebo group.
Using naive standard errors:
model_exp |>
glm_coef(labels = c("Treatment: Thiotepa/Placebo")) |>
as_hux() |> set_align(everywhere, 2:3, "right") |>
theme_pubh() |>
add_footnote(get_r2(model_exp), font_size = 9)
Parameter | Survival time ratio | Pr(>|z|) |
---|---|---|
Treatment: Thiotepa/Placebo | 1.64 (1.11, 2.41) | 0.012 |
Nagelkerke's R2 = 0.02 |
model_cox |>
tbl_regression(exponentiate = TRUE) |>
cosm_reg() |> theme_pubh() |>
add_footnote(get_r2(model_cox), font_size = 9)
Variable | HR | 95% CI | p-value |
---|---|---|---|
Treatment | 0.022 | ||
Placebo | — | — | |
Thiotepa | 0.64 | 0.44, 0.94 | |
Nagelkerke's R2 = 0.016 |
Estimated Marginal Means
rx | Mean | SE | 95% CI
-------------------------------------
Placebo | 1.00 | 0.00 | [1.00, 1.00]
Thiotepa | 0.64 | 0.13 | [0.44, 0.94]
Marginal means estimated at rx
Interpretation: Patients receiving Thiotepa are 64% less likely of dying than those in the Placebo group.