# Background and Motivating Example

Record linkage (also called entity resolution or de-duplication) joins records across one or more databases in order to remove duplicate entities. While record linkage removes the duplicate entities from the data, many researchers are interested in performing inference, prediction, or other post-linkage analysis on the linked data (e.g., regression or capture-recapture), which we call the downstream task. Depending on the downstream task, one may wish to find the most representative record before performing the post-linkage analysis. For example, when the values of features used in a downstream task differ across linked records, which values should be used? This is where representr comes in. Before introducing our new package representr, from the paper by Kaplan, Betancourt, and Steorts, we first provide an introduction to record linkage.

Throughout this vignette, we will use data that is available in the representr package, rl_reg1 (rl = record linkage, reg = regression, 1 = amount of noisiness).

```r
# load libraries
library(representr)
library(stringdist)

data("rl_reg1")          # data for record linkage and regression
data("identity.rl_reg1") # true identity of each record
head(rl_reg1)            # first six records
```
| fname | lname | bm | bd | by | sex | education | income | bp |
|---|---|---|---|---|---|---|---|---|
| jasmine | sirotic | 3 | 31 | 1972 | F | High school graduates, no college | 32 | 127 |
| hugo | white | 6 | 9 | 1958 | M | Bachelor’s degree only | 72 | 134 |
| madeline | burgemeifter | 12 | 9 | 1967 | F | Some college or associate degree | 30 | 130 |
| kyle | clarke | 4 | 9 | 1952 | M | Advanced degree | 90 | 125 |
| livia | braciak | 11 | 6 | 1950 | F | High school graduates, no college | 27 | 134 |
| phoebe | green | 8 | 8 | 1957 | F | High school graduates, no college | 32 | 128 |

This is simulated data, which consists of 500 records with 30% duplication and the following attributes:

- fname: First name
- lname: Last name
- bm: Birth month (numeric)
- bd: Birth day
- by: Birth year
- sex: Sex (“M” or “F”)
- education: Education level (“Less than a high school diploma”, “High school graduates, no college”, “Some college or associate degree”, “Bachelor’s degree only”, or “Advanced degree”)
- income: Yearly income (in \$1000s)
- bp: Systolic blood pressure

Before we perform prototyping with representr to get a representative data set, we must first perform record linkage to remove duplication in the data set. In the absence of a unique identifier (such as a social security number), we can use probabilistic methods to perform record linkage. We recommend clustering records to latent entities, known in the literature as graphical entity resolution; see Binette and Steorts for a review.

For the examples in this vignette, we will use blink, a Bayesian graphical record linkage model (Steorts 2015). We will run only a small number of iterations, which will certainly not have converged; for a real example, we would want to let this model run much longer. For a faster, more scalable alternative, we recommend the recent work of Marchant et al. (2020), available as either dblink or dblinkR.

```r
# load blink
library(blink)

# params for running record linkage
a <- 1; b <- 99 # distortion hyperparameters
c <- 1          # string distribution hyperparameter
d <- function(string1, string2) { # Jaro-Winkler string distance, vectorized
  n1 <- length(string1)
  n2 <- length(string2)
  res <- matrix(NA, n1, n2)
  for (i in seq_len(n1)) {
    for (j in seq_len(n2)) {
      res[i, j] <- stringdist(string1[i], string2[j], method = "jw")
    }
  }
  res
}
num.gs <- 10          # number of Gibbs iterations
M <- nrow(rl_reg1)    # upper bound on the number of entities
str_idx <- c(1, 2)    # string columns
cat_idx <- c(3, 4, 5) # categorical columns
```

```r
# data prep:
# X.c contains the categorical variables
# X.s contains the string variables
X.c <- apply(as.matrix(rl_reg1[, cat_idx]), 2, as.character)
X.s <- as.matrix(rl_reg1[, str_idx])

# X.c and X.s include all files "stacked" on top of each other.
# The vector below keeps track of which rows of X.c and X.s are in which files.
file.num <- rep(1, nrow(rl_reg1))

# perform record linkage
linkage.rl <- rl.gibbs(file.num, X.s, X.c, num.gs = num.gs,
                       a = a, b = b, c = c, d = d, M = M)
```

In fact, we can load the results of running this model for 100,000 iterations, which are stored in the package as a data object called linkage.rl.

```r
data("linkage.rl")
```

## The Downstream Task

After record linkage is complete, one may want to perform analyses of the linked data; this is what we call the “downstream task”. As motivation, consider modeling blood pressure (bp) in our example data rl_reg1 using two features (covariates): income and sex. We want to fit this model after performing record linkage based on first name, last name, and full date of birth. Here is an example of four records (from the data in the representr package) that represent the same individual, based on the results from record linkage.

|  | fname | lname | bm | bd | by | sex | education | income | bp |
|---|---|---|---|---|---|---|---|---|---|
| 370 | elenys | reit | 11 | 11 | 1954 | F | High school graduates, no college | 28 | 109 |
| 371 | elen7 | reicl | 11 | 22 | 1954 | F | Advanced degree | 52 | 118 |
| 372 | dleny | rejd | 11 | 11 | 1954 | M | Bachelor’s degree only | 63 | 109 |
| 373 | eleni | reid | 11 | 11 | 1954 | F | Bachelor’s degree only | 52 | 109 |

Examination of this table raises an important question that must be addressed before performing a particular downstream task: which values of bp, income, and sex should be used as the representative features (or covariates) in a regression model? In this vignette, we provide multiple solutions to this question using a prototyping approach.

# Prototyping methods

representr includes four methods to choose or create the representative record from linked data. This process is a function of the data and the linkage structure, and we present both probabilistic and deterministic methods. The result in all cases is a representative data set to be passed on to the downstream task. The prototyping is completed using the represent() function.

## Random Prototyping

Our first proposal to choose a representative record (prototype) for a cluster is the simplest and serves as a baseline or benchmark. One simply chooses the representative record uniformly at random or using a more informed distribution.
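As a sketch of this idea (not the package's internal code), uniform random prototyping over a linkage structure takes only a few lines of base R; `lambda` below is a hypothetical toy assignment of records to latent entities:

```r
set.seed(42)

# toy linkage structure: records 1-6 assigned to 3 latent entities
lambda <- c(1, 1, 2, 3, 3, 3)

# for each cluster, choose one record id uniformly at random
random_prototype <- function(lambda) {
  vapply(split(seq_along(lambda), lambda),
         # guard the singleton case: sample(5, 1) would draw from 1:5
         function(ids) if (length(ids) == 1) ids else sample(ids, 1),
         integer(1))
}

random_prototype(lambda) # one record id per latent entity
```

The singleton guard matters: R's `sample(x, 1)` treats a length-one numeric `x` as the range `1:x`, which would silently pick the wrong record.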

For demonstration purposes, we can create a representative data set using the last iteration of the results from running the record linkage model in blink. This is accomplished using the represent() function, passing "proto_random" as the type of prototyping.

```r
# linkage structure from the last Gibbs iteration
lambda <- linkage.rl[nrow(linkage.rl), ]

# ids for representative records (random)
random_id <- represent(rl_reg1, lambda, "proto_random", parallel = FALSE)
rep_random <- rl_reg1[random_id, ] # representative records (random)
```

We can have a look at a few records chosen as representative in this way.

|  | fname | lname | bm | bd | by | sex | education | income | bp |
|---|---|---|---|---|---|---|---|---|---|
| 240 | callum | lecke | 8 | 5 | 1981 | M | Advanced degree | 86 | 126 |
| 353 | lachlan | ebert | 5 | 15 | 1954 | M | Some college or associate degree | 42 | 150 |
| 162 | francesco | petito | 6 | 16 | 1972 | M | High school graduates, no college | 41 | 151 |
| 6 | phoebe | green | 8 | 8 | 1957 | F | High school graduates, no college | 32 | 128 |
| 315 | peter | chittleborough | 7 | 11 | 1980 | M | High school graduates, no college | 36 | 153 |

## Minimax Prototyping

Our second proposal to choose a representative record is to select the record that “most closely captures” the latent entity. Of course, this is quite subjective. We propose selecting the record whose farthest neighbor within the cluster is closest, where closeness is measured by a record distance function $$d_r(\cdot, \cdot)$$. We can write this as the record $$r = (i, j)$$ within each cluster $$\Lambda_{j'}$$ such that $r = \arg\min\limits_{(i, j) \in \Lambda_{j'}} \max\limits_{(i^*, j^*) \in \Lambda_{j'}} d_r((i, j), (i^*, j^*)).$ The result is a set of representative records, one for each latent individual, each closest to the other records in its cluster. When there is a tie within a cluster, we select a record uniformly at random.
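A minimal sketch of this rule (not the package implementation): given a matrix of pairwise distances within one cluster, pick the record whose farthest neighbor is closest, breaking ties uniformly at random.

```r
minimax_prototype <- function(dist_mat) {
  worst <- apply(dist_mat, 1, max)          # farthest neighbor per record
  candidates <- which(worst == min(worst))  # minimax record(s)
  if (length(candidates) == 1) candidates else sample(candidates, 1)
}

# toy 3-record cluster: record 2's farthest neighbor (distance 2) is closest
d_mat <- matrix(c(0, 1, 4,
                  1, 0, 2,
                  4, 2, 0), nrow = 3, byrow = TRUE)
minimax_prototype(d_mat) # 2
```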

There are many distance functions that can be used for $$d_r(\cdot, \cdot)$$. We define the distance function to be a weighted average of column-wise distances, where each column's distance depends on its type. Given two records, $$(i, j)$$ and $$(i^*, j^*)$$, this produces the single distance metric $d_r((i, j), (i^*, j^*)) = \sum\limits_{\ell = 1}^p w_\ell d_{r\ell}((i, j), (i^*, j^*)),$ where $$\sum\limits_{\ell = 1}^p w_\ell = 1$$. The column-wise distance functions $$d_{r\ell}(\cdot, \cdot)$$ we use are presented below.

| Column type | $$d_{r\ell}(\cdot, \cdot)$$ |
|---|---|
| String | Any string distance function, e.g. Jaro-Winkler string distance |
| Numeric | Absolute distance, $$d_{r\ell}((i, j), (i^*, j^*)) = \mid x_{ij\ell} - x_{i^*j^*\ell} \mid$$ |
| Categorical | Binary distance, $$d_{r\ell}((i, j), (i^*, j^*)) = \mathbb{I}(x_{ij\ell} \ne x_{i^*j^*\ell})$$ |
| Ordinal | Absolute distance between levels. Let $$\gamma(x_{ij\ell})$$ be the order of the value $$x_{ij\ell}$$; then $$d_{r\ell}((i, j), (i^*, j^*)) = \mid \gamma(x_{ij\ell}) - \gamma(x_{i^*j^*\ell}) \mid$$ |

The weighting of variable distances is used to place importance on individual features according to prior knowledge of the data set and to scale the feature distances to a common range. In this vignette, we scale all column-wise distances to be values between $$0$$ and $$1$$.
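The weighted record distance can be sketched as follows. This is not the package's dist_col_type; for illustration, a length-normalised edit distance (base R's adist) stands in for a string distance like Jaro-Winkler, numeric columns are scaled by a supplied column range, and ordinals by their level gap, so every column-wise distance lies in $$[0, 1]$$:

```r
record_dist <- function(r1, r2, col_type, weights, ranges, levels_list) {
  d <- mapply(function(x, y, type, rng, lev) {
    switch(type,
      string      = adist(x, y) / max(nchar(x), nchar(y)),  # scaled edit distance
      numeric     = abs(as.numeric(x) - as.numeric(y)) / rng,
      categorical = as.numeric(x != y),
      ordinal     = abs(match(x, lev) - match(y, lev)) / (length(lev) - 1))
  }, r1, r2, col_type, ranges, levels_list)
  sum(weights * d) # weights assumed to sum to 1
}

# two noisy copies of the same person: fname, sex, income
record_dist(c("eleni", "F", "52"), c("elen7", "M", "63"),
            col_type    = c("string", "categorical", "numeric"),
            weights     = c(0.5, 0.25, 0.25),
            ranges      = c(NA, NA, 100),   # hypothetical range of income
            levels_list = list(NA, NA, NA)) # no ordinal columns here
```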

Again, we can create a representative data set using the last iteration of the results from running the record linkage model in blink, but this time we need to specify additional parameters, such as the column types. This is accomplished using the represent() function, passing "proto_minimax" as the type of prototyping.

```r
# additional parameters for minimax prototyping:
# column types, the level orders for any ordinal variables, and column weights
col_type <- c("string", "string", "numeric", "numeric", "numeric",
              "categorical", "ordinal", "numeric", "numeric")
orders <- list(education = c("Less than a high school diploma",
                             "High school graduates, no college",
                             "Some college or associate degree",
                             "Bachelor's degree only",
                             "Advanced degree"))
weights <- c(.25, .25, .05, .05, .1, .15, .05, .05, .05)

# ids for representative records (minimax)
minimax_id <- represent(rl_reg1, linkage.rl[nrow(linkage.rl), ], "proto_minimax",
                        distance = dist_col_type, col_type = col_type,
                        weights = weights, orders = orders,
                        scale = TRUE, parallel = FALSE)
rep_minimax <- rl_reg1[minimax_id, ] # representative records (minimax)
```

We can have a look at some of the representative records chosen via minimax prototyping.

|  | fname | lname | bm | bd | by | sex | education | income | bp |
|---|---|---|---|---|---|---|---|---|---|
| 240 | callum | lecke | 8 | 5 | 1981 | M | Advanced degree | 86 | 126 |
| 353 | lachlan | ebert | 5 | 15 | 1954 | M | Some college or associate degree | 42 | 150 |
| 162 | francesco | petito | 6 | 16 | 1972 | M | High school graduates, no college | 41 | 151 |
| 6 | phoebe | green | 8 | 8 | 1957 | F | High school graduates, no college | 32 | 128 |
| 315 | peter | chittleborough | 7 | 11 | 1980 | M | High school graduates, no college | 36 | 153 |

## Composite Records

Our third proposal is to aggregate the records in each cluster to form a composite record that includes information from each linked record. The form of aggregation can depend on the column type, and the aggregation itself can be weighted by prior knowledge of the data sources or by posterior information from the record linkage model. For quantitative variables, we use a weighted arithmetic mean to combine linked values; for categorical variables, a weighted majority vote. For string variables, we use a weighted majority vote for each character position, which allows noisy strings to differ on a continuum. This is accomplished using the represent() function, passing "composite" as the type of prototyping.
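The per-character majority vote for string columns can be sketched as follows. This is a simplified stand-in for the package's weighted implementation: all records get equal weight, strings are padded to a common width, and ties go to the first mode.

```r
char_majority <- function(strings) {
  chars <- strsplit(strings, "")
  width <- max(lengths(chars))
  # pad with empty strings so every position has one vote per record
  padded <- lapply(chars, function(x) c(x, rep("", width - length(x))))
  voted <- vapply(seq_len(width), function(i) {
    votes <- table(vapply(padded, `[`, character(1), i))
    names(votes)[which.max(votes)] # ties go to the first mode
  }, character(1))
  paste(voted, collapse = "")
}

# the four noisy first names from the motivating example
char_majority(c("elenys", "elen7", "dleny", "eleni")) # "eleny"
```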

```r
# representative records (composite)
rep_composite <- represent(rl_reg1, linkage.rl[nrow(linkage.rl), ], "composite",
                           col_type = col_type, parallel = FALSE)
```

We can have a look at some of the representative records.

|  | fname | lname | bm | bd | by | sex | education | income | bp |
|---|---|---|---|---|---|---|---|---|---|
| 240 | callum | lecke | 8 | 5.0 | 1981 | M | Advanced degree | 86.0 | 126 |
| 353 | lachlan | ebert | 5 | 15.0 | 1954 | M | Some college or associate degree | 42.0 | 150 |
| 158 | francesco | petito | 6 | 16.4 | 1972 | M | High school graduates, no college | 52.6 | 150 |
| 6 | phoebe | green | 8 | 8.0 | 1957 | F | High school graduates, no college | 32.0 | 128 |
| 315 | peter | chittleborough | 7 | 11.0 | 1980 | M | High school graduates, no college | 36.0 | 153 |

## Posterior Prototyping

Our fourth proposal utilizes the minimax prototyping method in a fully Bayesian setting. This is desirable because the posterior distribution of the linkage structure is used to weight the downstream task, which allows error from the record linkage stage to be naturally propagated downstream.

We propose two methods for utilizing the posterior prototyping (PP) weights — a weighted downstream task and a thresholded representative data set based on the weights. As already mentioned, PP weights naturally propagate the linkage error into the downstream task, which we now explain. For each MCMC iteration from the Bayesian record linkage model, we obtain the most representative records using minimax prototyping and then compute the probability of each record being selected over all MCMC iterations. The posterior prototyping (PP) probabilities can then either be used as weights for each record in the regression or as a thresholded variant where we only include records whose PP weights are above $$0.5$$. Note that a record with PP weight above 0.5 has a posterior probability greater than 0.5 of being chosen as a prototype and should be included in the final data set.
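The weight computation described above can be sketched as follows (a simplified stand-in for the package's pp_weights, giving every posterior draw equal weight). Here `proto_ids` is a hypothetical list holding the vector of prototype record ids selected at each retained MCMC iteration:

```r
# posterior prototyping weight = share of iterations in which each
# record is chosen as a prototype
pp_weight_sketch <- function(proto_ids, n_records) {
  counts <- tabulate(unlist(proto_ids), nbins = n_records)
  counts / length(proto_ids)
}

# toy example: 4 posterior draws of prototypes over 5 records
draws <- list(c(1, 3), c(1, 4), c(2, 3), c(1, 3))
pp_weight_sketch(draws, 5) # 0.75 0.25 0.75 0.25 0.00
```

Records 1 and 3 would survive a 0.5 threshold here; the others would be dropped.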

```r
# posterior prototyping weights
pp_weights <- pp_weights(rl_reg1, linkage.rl[seq(80000, 100000, by = 100), ],
                         "proto_minimax", distance = dist_col_type,
                         col_type = col_type, weights = weights, orders = orders,
                         scale = TRUE, parallel = FALSE)
```

We can look at the minimax PP weights distribution for the true and duplicated records in the data set as an example. Note that the true records consistently have higher PP weights and the proportion of duplicated records with high weights is relatively low.

We can make a representative dataset with these weights by using the cutoff of $$0.5$$, and look at some of the records.

```r
# representative records (PP threshold)
rep_pp_thresh <- rl_reg1[pp_weights > 0.5, ]
head(rep_pp_thresh)
```
| fname | lname | bm | bd | by | sex | education | income | bp |
|---|---|---|---|---|---|---|---|---|
| jasmine | sirotic | 3 | 31 | 1972 | F | High school graduates, no college | 32 | 127 |
| hugo | white | 6 | 9 | 1958 | M | Bachelor’s degree only | 72 | 134 |
| madeline | burgemeifter | 12 | 9 | 1967 | F | Some college or associate degree | 30 | 130 |
| kyle | clarke | 4 | 9 | 1952 | M | Advanced degree | 90 | 125 |
| livia | braciak | 11 | 6 | 1950 | F | High school graduates, no college | 27 | 134 |
| phoebe | green | 8 | 8 | 1957 | F | High school graduates, no college | 32 | 128 |

Each of these four proposed methods has potential benefits. The goal of prototyping is to select the correct representations of latent entities as often as possible; uniform random selection, however, has no mechanism for pursuing this goal. Turning to minimax selection, if a distance function can accurately reflect the distance between pairs of records in the data set, then this method may perform well. Composite records, by contrast, necessarily alter the data for every entity with multiple copies in the data, which can heavily affect some downstream tasks (like linear regression). The ability of posterior prototyping to propagate record linkage error into the downstream task is an attractive feature and a great strength of the Bayesian paradigm, and using the entire posterior distribution of the linkage structure offers the potential for superior downstream performance.

# Evaluation

We can evaluate the performance of our methods by assessing how close, distributionally, each representative data set is to the true records. This criterion is useful because one benefit of a two-stage approach to record linkage and downstream analysis is the ability to perform multiple analyses with the same data set; the downstream performance of representative records may therefore depend on the particular task being performed. To assess distributional closeness, we use an empirical Kullback-Leibler (KL) divergence metric. Let $$\hat{F}_{rep}(\boldsymbol x)$$ and $$\hat{F}_{true}(\boldsymbol x)$$ be the empirical distribution functions for the representative data set and true data set, respectively (with continuous variables transformed to categorical using a histogram approach with statistically equivalent data-dependent bins). The empirical KL divergence metric we use is then defined as $\hat{D}_{KL}(\hat{F}_{rep} || \hat{F}_{true}) = \sum_{\boldsymbol x} \hat{F}_{rep}(\boldsymbol x) \log\left(\frac{\hat{F}_{rep}(\boldsymbol x)}{\hat{F}_{true}(\boldsymbol x)}\right).$
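To make the formula concrete, here is a simplified sketch of such a divergence for a single categorical variable, using empirical frequencies and treating $$0 \log 0$$ as $$0$$. It is not the package's emp_kl_div, which also handles continuous variables via binning and joint cells:

```r
emp_kl_sketch <- function(rep_x, true_x) {
  cells <- union(rep_x, true_x)
  p <- table(factor(rep_x,  levels = cells)) / length(rep_x)
  q <- table(factor(true_x, levels = cells)) / length(true_x)
  keep <- p > 0 # 0 * log(0) taken as 0; assumes q > 0 wherever p > 0
  sum(p[keep] * log(p[keep] / q[keep]))
}

# identical sex distributions give divergence 0
emp_kl_sketch(c("F", "F", "M", "M"), c("F", "F", "M", "M")) # 0
```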

This metric is accessed in representr using the emp_kl_div() command.

```r
true_dat <- rl_reg1[unique(identity.rl_reg1), ] # true records
emp_kl_div(true_dat, rep_random,    c("sex"), c("income", "bp"))
#> [1] 0.0204802
emp_kl_div(true_dat, rep_minimax,   c("sex"), c("income", "bp"))
#> [1] 0.005069378
emp_kl_div(true_dat, rep_composite, c("sex"), c("income", "bp"))
#> [1] 0.0563834
emp_kl_div(true_dat, rep_pp_thresh, c("sex"), c("income", "bp"))
#> [1] 0.004226951
```

The representative dataset based on the posterior prototyping weights is the closest to the truth using the three variables we might be interested in using for regression. This might indicate that we should use this representation in a downstream model, like linear regression.

# References

Binette, Olivier, and Rebecca C. Steorts. 2020. “(Almost) All of Entity Resolution.” arXiv preprint arXiv:2008.04443. https://arxiv.org/abs/2008.04443.

Kaplan, Andee, Brenda Betancourt, and Rebecca C. Steorts. 2018. “Posterior Prototyping: Bridging the Gap Between Record Linkage and Regression.” arXiv preprint arXiv:1810.01538. https://arxiv.org/abs/1810.01538.

Marchant, Neil G., Andee Kaplan, Daniel N. Elazar, Benjamin I. P. Rubinstein, and Rebecca C. Steorts. 2020. “D-Blink: Distributed End-to-End Bayesian Entity Resolution.” Journal of Computational and Graphical Statistics (just accepted). Taylor & Francis: 1–42.

Steorts, Rebecca C. 2015. “Entity Resolution with Empirically Motivated Priors.” Bayesian Analysis 10 (4). International Society for Bayesian Analysis: 849–75.