library(sdglinkage)
set.seed(1234)
In this vignette, we show how to use sdglinkage to generate a realistic synthetic gold standard file, and how to damage that gold standard file into multiple linkage files that can be used for linkage research.
Usually, when a trusted third party releases a dataset to a research organisation, it removes sensitive identifiers to prevent the data from being linked back to real individuals. Meanwhile, for research purposes, some organisations publish statistics on the errors that occur in their datasets. This vignette is aimed at people in a research organisation that has access to non-sensitive predictor variables and to statistics on the errors that occurred in both the predictor variables and the sensitive identifiers, and who would like to share a synthetic gold standard and linkage files of this dataset, with realistic identifiers, with a wider audience. People in a trusted third party that has access to sensitive identifiers should see the vignette From_Sensitive_Real_Identifiers_to_Synthetic_Identifiers.
We generate two types of files: a gold standard file that gives us the true values of the variables of interest, and linkage files that mimic the original error formats of the real data. The following figure outlines the framework for generating these two types of files. In this example, we assume we have access to predictor variables such as sex, age and ethnicity, but not to sensitive identifiers such as nhsid and names. We also assume we know the errors that occurred in all variables. We simulate three types of variables: predictor variables learned together with the encoded error flags, external dependent identifiers, and independent identifiers. These generated synthetic variables are then merged into a gold standard file, which is further damaged according to the inferred synthetic errors to give us synthetic linkage files.
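The flow described above can be sketched with a toy example in base R. This is a minimal illustration under stated assumptions, not the sdglinkage API: the column names, the flag names, the name lookup table and the `damage` helper are all hypothetical, and the random draws stand in for the learned generators that the package provides.

```r
# Toy sketch in base R only; every name below is hypothetical.
set.seed(1234)
n <- 6

# 1. Predictor variables generated together with encoded error flags
#    (1 means "this value will be damaged in the linkage file").
predictors <- data.frame(
  sex = sample(c("male", "female"), n, replace = TRUE),
  age = sample(20:80, n, replace = TRUE),
  age_missing_flag = rbinom(n, 1, 0.3),
  firstname_typo_flag = rbinom(n, 1, 0.3),
  stringsAsFactors = FALSE
)

# 2. An identifier that depends on external data: first names drawn by sex.
name_lookup <- list(male = c("Oliver", "Harry"),
                    female = c("Olivia", "Amelia"))
firstname <- vapply(predictors$sex,
                    function(s) sample(name_lookup[[s]], 1),
                    character(1), USE.NAMES = FALSE)

# 3. An independent identifier: a random 10-digit string.
nhsid <- vapply(seq_len(n),
                function(i) paste(sample(0:9, 10, replace = TRUE), collapse = ""),
                character(1))

# Merge the three kinds of variables into the gold standard file.
gold_standard <- data.frame(nhsid = nhsid, firstname = firstname,
                            predictors, stringsAsFactors = FALSE)

# Damage the gold standard according to the error flags.
damage <- function(gold) {
  linkage <- gold
  # Flagged ages become missing values.
  linkage$age[gold$age_missing_flag == 1] <- NA
  # Flagged first names lose their last character (a simple deletion typo).
  typo <- gold$firstname_typo_flag == 1
  linkage$firstname[typo] <- substr(gold$firstname[typo], 1,
                                    nchar(gold$firstname[typo]) - 1)
  # The linkage file keeps the data columns but not the flags.
  linkage[, c("nhsid", "firstname", "sex", "age")]
}

linkage_file <- damage(gold_standard)
```

Keeping the error flags as columns of the gold standard means the same file records both the true values and where each error was injected, so a linkage method run on `linkage_file` can be scored against `gold_standard`.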