To study incidence and prevalence, you will interact with four core analytic functions from this package:
generateDenominatorCohortSet() - this function will identify a set of denominator populations that can be used for calculations of prevalence and incidence
estimatePointPrevalence() - this function will estimate point prevalence for outcomes among denominator populations
estimatePeriodPrevalence() - this function will estimate period prevalence for outcomes among denominator populations
estimateIncidence() - this function will estimate incidence rates for outcomes among denominator populations
Below, we show an example analysis to provide a broad overview of how the functionality provided by the IncidencePrevalence package can be used. More context and further examples for each of these functions are provided in later vignettes.
First, let’s load relevant libraries.
library(CDMConnector)
library(IncidencePrevalence)
library(dplyr)
library(tidyr)
library(ggplot2)
The IncidencePrevalence package works with data mapped to the OMOP CDM. We first need to connect to a database, after which we can use the CDMConnector package to represent our mapped data as a single R object. This could look something like:
con <- DBI::dbConnect(RPostgres::Postgres(),
  dbname = Sys.getenv("CDM5_POSTGRESQL_DBNAME"),
  host = Sys.getenv("CDM5_POSTGRESQL_HOST"),
  user = Sys.getenv("CDM5_POSTGRESQL_USER"),
  password = Sys.getenv("CDM5_POSTGRESQL_PASSWORD")
)
cdm <- CDMConnector::cdm_from_con(con,
  cdm_schema = Sys.getenv("CDM5_POSTGRESQL_CDM_SCHEMA")
)
For this example, though, we'll generate 50,000 hypothetical patients using the mockIncidencePrevalenceRef() function.
cdm <- mockIncidencePrevalenceRef(
  sampleSize = 50000,
  outPre = 0.5
)
Importantly, this example data already includes an outcome cohort. In practice, study-specific outcome cohorts of interest will need to be created. If the outcome cohorts are defined as JSON, we can use the CDMConnector package to read in and generate the cohorts.
outcome_cohorts <- CDMConnector::readCohortSet(here::here("outcome_cohorts"))
cdm <- CDMConnector::generateCohortSet(
  cdm = cdm,
  cohortSet = outcome_cohorts,
  name = outcome_table
)
Once we have a connection to the database set up, we can use the generateDenominatorCohortSet() function to identify the denominator cohorts to use later when calculating incidence and prevalence. In this case we identify three denominator cohorts: one with males, one with females, and one with both males and females included. For each of these cohorts, only people aged 18 to 65 between 2008-01-01 and 2012-01-01 who had 365 days of prior history available are included.
cdm <- generateDenominatorCohortSet(
  cdm = cdm,
  name = "denominator",
  cohortDateRange = c(as.Date("2008-01-01"), as.Date("2012-01-01")),
  ageGroup = list(c(18, 65)),
  sex = c("Male", "Female", "Both"),
  daysPriorHistory = 365
)
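To make these eligibility criteria concrete, the base R sketch below works out when one hypothetical person would enter and leave such a denominator cohort. It assumes that a person enters at the latest of the study start date, the date they reach the minimum age, and the date they have accumulated the required prior history, and that they leave at the earliest of the study end date, the day before they exceed the maximum age, and the end of their observation period. The person's dates and the add_years() helper are invented for illustration; this is not the package's internal code.

```r
# Hypothetical person; all dates are made up for illustration
birth_date <- as.Date("1990-06-15")
observation_start <- as.Date("2007-03-01")
observation_end <- as.Date("2011-09-30")

# Criteria mirroring the call above
study_start <- as.Date("2008-01-01")
study_end <- as.Date("2012-01-01")
min_age <- 18
max_age <- 65
days_prior_history <- 365

# Illustrative helper: shift a date by a whole number of years
add_years <- function(date, years) {
  seq(date, by = paste(years, "years"), length.out = 2)[2]
}

# Assumed rule: enter at the latest of the three eligibility dates
entry_date <- max(
  study_start,
  add_years(birth_date, min_age),
  observation_start + days_prior_history
)

# Assumed rule: exit at the earliest of the three end dates
exit_date <- min(
  study_end,
  add_years(birth_date, max_age + 1) - 1,
  observation_end
)

# The person contributes time only if entry falls on or before exit
contributes <- entry_date <= exit_date
entry_date
exit_date
```

Under these assumptions, this person would enter on their 18th birthday and leave when their observation period ends.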
We can see that each of our denominator cohorts is in the format of an OMOP CDM cohort:
cdm$denominator %>%
  glimpse()
#> Rows: ??
#> Columns: 4
#> Database: DuckDB 0.8.1 [eburn@Windows 10 x64:R 4.2.1/:memory:]
#> $ cohort_definition_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ subject_id <chr> "2", "3", "13", "19", "29", "46", "88", "149", "1…
#> $ cohort_start_date <date> 2008-01-01, 2010-12-20, 2008-05-28, 2009-03-08, …
#> $ cohort_end_date <date> 2008-08-03, 2011-10-16, 2009-12-29, 2009-03-10, …
We can also see the settings associated with each cohort:
cohortSet(cdm$denominator)
#> # A tibble: 3 × 10
#> cohort_definition_id cohort_name age_group sex days_prior_history start_date
#> <int> <chr> <chr> <chr> <dbl> <date>
#> 1 1 Denominato… 18 to 65 Male 365 2008-01-01
#> 2 2 Denominato… 18 to 65 Fema… 365 2008-01-01
#> 3 3 Denominato… 18 to 65 Both 365 2008-01-01
#> # ℹ 4 more variables: end_date <date>, strata_cohort_definition_id <lgl>,
#> # strata_cohort_name <lgl>, closed_cohort <lgl>
And we can also see the count for each cohort:
cohortCount(cdm$denominator)
#> # A tibble: 3 × 3
#> cohort_definition_id number_records number_subjects
#> <int> <dbl> <dbl>
#> 1 1 3021 3021
#> 2 2 3177 3177
#> 3 3 6198 6198
Now that we have our denominator cohorts, and using the outcome cohort that was also generated by the mockIncidencePrevalenceRef() function, we can estimate prevalence for each using the estimatePeriodPrevalence() function. Here we calculate period prevalence on a quarterly basis, setting minCellCount = 0 so that no counts are obscured.
prev <- estimatePeriodPrevalence(
  cdm = cdm,
  denominatorTable = "denominator",
  outcomeTable = "outcome",
  interval = "quarters",
  minCellCount = 0
)
prev %>%
  glimpse()
#> Rows: 48
#> Columns: 30
#> $ analysis_id <chr> "1", "1", "1", "1", "1", "1", …
#> $ prevalence_start_date <date> 2008-01-01, 2008-04-01, 2008-…
#> $ prevalence_end_date <date> 2008-03-31, 2008-06-30, 2008-…
#> $ n_cases <int> 29, 43, 26, 37, 38, 34, 37, 37…
#> $ n_population <int> 640, 633, 661, 668, 698, 710, …
#> $ prevalence <dbl> 0.04531250, 0.06793049, 0.0393…
#> $ prevalence_95CI_lower <dbl> 0.03173229, 0.05082084, 0.0269…
#> $ prevalence_95CI_upper <dbl> 0.06431847, 0.09025267, 0.0570…
#> $ cohort_obscured <chr> "FALSE", "FALSE", "FALSE", "FA…
#> $ result_obscured <chr> "FALSE", "FALSE", "FALSE", "FA…
#> $ outcome_cohort_id <chr> "1", "1", "1", "1", "1", "1", …
#> $ outcome_cohort_name <chr> "cohort_1", "cohort_1", "cohor…
#> $ analysis_outcome_lookback_days <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ analysis_type <chr> "period", "period", "period", …
#> $ analysis_interval <chr> "quarters", "quarters", "quart…
#> $ analysis_complete_database_intervals <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, …
#> $ analysis_time_point <chr> "start", "start", "start", "st…
#> $ analysis_full_contribution <lgl> FALSE, FALSE, FALSE, FALSE, FA…
#> $ analysis_min_cell_count <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ denominator_cohort_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ denominator_cohort_name <chr> "Denominator cohort 1", "Denom…
#> $ denominator_age_group <chr> "18 to 65", "18 to 65", "18 to…
#> $ denominator_sex <chr> "Male", "Male", "Male", "Male"…
#> $ denominator_days_prior_history <dbl> 365, 365, 365, 365, 365, 365, …
#> $ denominator_start_date <date> 2008-01-01, 2008-01-01, 2008-…
#> $ denominator_end_date <date> 2012-01-01, 2012-01-01, 2012-…
#> $ denominator_strata_cohort_definition_id <lgl> NA, NA, NA, NA, NA, NA, NA, NA…
#> $ denominator_strata_cohort_name <lgl> NA, NA, NA, NA, NA, NA, NA, NA…
#> $ denominator_closed_cohort <lgl> FALSE, FALSE, FALSE, FALSE, FA…
#> $ cdm_name <chr> "test_database", "test_databas…
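Each prevalence estimate is the number of cases divided by the population for that interval. As a check, the first row above can be reproduced in base R. The printed confidence limits are consistent with a Wilson score interval, so that is what this sketch uses; this is an inference from the numbers shown rather than a statement about the package internals, and the wilson_ci() function name is our own.

```r
# Values taken from the first row of the output above
n_cases <- 29
n_population <- 640

prevalence <- n_cases / n_population

# Wilson score interval for a binomial proportion (assumed method)
wilson_ci <- function(x, n, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)
  p <- x / n
  centre <- (p + z^2 / (2 * n)) / (1 + z^2 / n)
  half_width <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  c(lower = centre - half_width, upper = centre + half_width)
}

prevalence
wilson_ci(n_cases, n_population)
```

The results should agree closely with the prevalence, prevalence_95CI_lower, and prevalence_95CI_upper values in the first row.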
Similarly, we can use the estimateIncidence() function to estimate incidence rates. Here we estimate annual incidence rates, with a 180-day outcome washout window.
inc <- estimateIncidence(
  cdm = cdm,
  denominatorTable = "denominator",
  outcomeTable = "outcome",
  interval = c("Years"),
  outcomeWashout = 180
)
inc %>%
  glimpse()
#> Rows: 12
#> Columns: 30
#> $ analysis_id <chr> "1", "1", "1", "1", "2", "2", …
#> $ n_persons <int> 1044, 1052, 1031, 1089, 1101, …
#> $ person_days <dbl> 159970, 161590, 157191, 172517…
#> $ n_events <int> 125, 133, 144, 162, 126, 149, …
#> $ incidence_start_date <date> 2008-01-01, 2009-01-01, 2010-…
#> $ incidence_end_date <date> 2008-12-31, 2009-12-31, 2010-…
#> $ person_years <dbl> 437.9740, 442.4093, 430.3655, …
#> $ incidence_100000_pys <dbl> 28540.51, 30062.66, 33459.93, …
#> $ incidence_100000_pys_95CI_lower <dbl> 23756.87, 25170.84, 28218.21, …
#> $ incidence_100000_pys_95CI_upper <dbl> 34004.73, 35627.65, 39392.91, …
#> $ cohort_obscured <chr> "FALSE", "FALSE", "FALSE", "FA…
#> $ result_obscured <chr> "FALSE", "FALSE", "FALSE", "FA…
#> $ outcome_cohort_id <chr> "1", "1", "1", "1", "1", "1", …
#> $ outcome_cohort_name <chr> "cohort_1", "cohort_1", "cohor…
#> $ analysis_outcome_washout <dbl> 180, 180, 180, 180, 180, 180, …
#> $ analysis_repeated_events <lgl> FALSE, FALSE, FALSE, FALSE, FA…
#> $ analysis_interval <chr> "years", "years", "years", "ye…
#> $ analysis_complete_database_intervals <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, …
#> $ denominator_cohort_id <int> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, …
#> $ analysis_min_cell_count <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
#> $ denominator_cohort_name <chr> "Denominator cohort 1", "Denom…
#> $ denominator_age_group <chr> "18 to 65", "18 to 65", "18 to…
#> $ denominator_sex <chr> "Male", "Male", "Male", "Male"…
#> $ denominator_days_prior_history <dbl> 365, 365, 365, 365, 365, 365, …
#> $ denominator_start_date <date> 2008-01-01, 2008-01-01, 2008-…
#> $ denominator_end_date <date> 2012-01-01, 2012-01-01, 2012-…
#> $ denominator_strata_cohort_definition_id <lgl> NA, NA, NA, NA, NA, NA, NA, NA…
#> $ denominator_strata_cohort_name <lgl> NA, NA, NA, NA, NA, NA, NA, NA…
#> $ denominator_closed_cohort <lgl> FALSE, FALSE, FALSE, FALSE, FA…
#> $ cdm_name <chr> "test_database", "test_databas…
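The incidence columns relate to one another in a simple way, which the base R sketch below reproduces for the first row above: person-years are person-days divided by 365.25, and the rate is events per person-year scaled to 100,000 person-years. The printed confidence limits are consistent with exact Poisson limits for the event count, so the sketch assumes that method; this is an inference from the numbers shown, not a description of the package internals.

```r
# Values taken from the first row of the output above
n_events <- 125
person_days <- 159970

person_years <- person_days / 365.25
incidence_100000_pys <- n_events / person_years * 100000

# Exact Poisson limits for the event count (assumed method), scaled to a rate
lower <- qchisq(0.025, 2 * n_events) / 2 / person_years * 100000
upper <- qchisq(0.975, 2 * (n_events + 1)) / 2 / person_years * 100000

c(estimate = incidence_100000_pys, lower = lower, upper = upper)
```

The three values should agree closely with incidence_100000_pys and its 95% confidence limits in the first row.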
After gathering results, we can export them as CSVs in a zip folder using the exportIncidencePrevalenceResults() function.
exportIncidencePrevalenceResults(
  resultList = list(
    "incidence" = inc,
    "prevalence" = prev
  ),
  zipName = "example_results",
  outputFolder = here::here()
)