library(sdglinkage)
set.seed(1234)

In this vignette, we show how to use sdglinkage to generate a realistic synthetic gold standard file, and how to damage that file into multiple linkage files that can be used for linkage research.

Usually, when a trusted third party releases a dataset to a research organisation, it removes sensitive identifiers to prevent the data from being linked back to real individuals. Meanwhile, for research purposes, some organisations publish statistics on the errors that occur in their datasets. This vignette is aimed at people in a research organisation who have access to non-sensitive predictor variables and to statistics on the errors that occurred in both the predictor variables and the sensitive identifiers, and who would like to share a synthetic gold standard file and linkage files for this dataset, with realistic identifiers, with a wider audience. People in a trusted third party who have access to sensitive identifiers should see the vignette From_Sensitive_Real_Identifiers_to_Synthetic_Identifiers.

  • Assumptions:
    • Real gold standard file (real_gsf): We have a gold standard file with non-sensitive predictor variables that we would like to synthesise.
    • Error statistics: We have statistics on the errors that occurred in both the predictor variables and the sensitive identifiers.
  • Aim:
    • To generate synthetic predictor variables.
    • To add external identifier variables to the synthetic predictor variables; the result is our synthetic gold standard file (syn_gsf).
    • To damage the synthetic gold standard file according to the error statistics, which gives us the synthetic linkage files (syn_lf).
    • To show how these linkage files can be used for linkage method evaluation.
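
The aims above can be illustrated with a toy sketch in base R. It deliberately does not call any sdglinkage function: the variable values, the `damage_missing()` helper and the 40% missing rate are all hypothetical and stand in for the learned generator and the published error statistics.

```r
# Toy sketch in base R only; it does not use the sdglinkage API.
set.seed(1234)
n <- 5

# Step 1: synthetic predictor variables.
# Random sampling is a stand-in for a generator learned from real_gsf.
syn_predictors <- data.frame(
  sex = sample(c("M", "F"), n, replace = TRUE),
  age = sample(18:90, n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Step 2: add external identifiers to obtain the synthetic gold standard file.
# The identifier values are made up for illustration.
syn_gsf <- data.frame(
  id = sprintf("ID%03d", seq_len(n)),
  firstname = c("alice", "brian", "chloe", "david", "emma"),
  syn_predictors,
  stringsAsFactors = FALSE
)

# Step 3: damage a copy of the gold standard file using an error statistic.
# Hypothetical statistic: 40% of first names are missing.
firstname_missing_rate <- 0.4

# Hypothetical helper: set each value to NA with probability `rate`.
damage_missing <- function(x, rate) {
  flag <- runif(length(x)) < rate
  x[flag] <- NA
  x
}

syn_lf <- syn_gsf
syn_lf$firstname <- damage_missing(syn_lf$firstname, firstname_missing_rate)
```

Because `syn_gsf` keeps the undamaged values, any linkage method run on `syn_lf` can be scored against it, which is the purpose of the last aim.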

We need two types of files: a gold standard file that gives us the true values of the variables of interest, and linkage files that mimic the original error formats of the real data. The following figure outlines the framework for generating these two types of files. In this example, we assume that we have access to predictor variables such as sex, age and ethnicity, but not to sensitive identifiers such as nhsid and names. We also assume that we know the errors that occurred in all variables. We simulate three types of variables: predictor variables learned together with the encoded error flags, external dependent identifiers, and independent identifiers. The generated synthetic variables are then merged into a gold standard file and further damaged by inferred synthetic errors, which gives us the synthetic linkage files.