# Microclustr

#### 2020-09-28

We present a simulated data set from Betancourt, Zanella, and Steorts (2020), "Random Partition Models for Microclustering Tasks" (minor revision). The microclustr package performs entity resolution with categorical variables using partition-based Bayesian clustering models.

Our goals include:

• Creating a synthetic data set with a fixed partition
• Illustrating how the user can perform entity resolution using the microclustr package
• Illustrating how the user can calculate standard evaluation metrics when a unique identifier is known

We first load all packages needed for this example.

```r
# load all packages needed
library('microclustr')
# set seed for reproducibility
set.seed(123)
```

## Creating a Synthetic Data Set with a Fixed Partition

Now we create a synthetic data set, where we assume a fixed partition.

Assume the maximum cluster size is four, with 50 clusters of each size from one to four. Assume that each record has 5 categorical fields, with 10 potential categories per field, and that the distortion probability for each field is 0.01.

Our synthetic data set is produced with the SimData() function, which generates duplicate records: 500 records in all, of which 200 are unique.

```r
# true partition to generate simulated data
# 50 clusters of each size, max cluster size is 4
truePartition <- c(50, 50, 50, 50)
# number of fields
numberFields <- 5
# number of categories per field
numberCategories <- rep(10, 5)
# distortion probability for the fields
trueBeta <- 0.01
# generate simulated data
simulatedData <- SimData(true_L = truePartition, nfields = numberFields,
                         ncat = numberCategories, true_beta = trueBeta)
# dimension of data set
dim(simulatedData)
#> [1] 500   5
```
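As a quick sanity check (a sketch using base R only), the partition above specifies 50 clusters of each size from one to four, so the record and entity counts can be verified directly:

```r
# 50 clusters of each size 1..4: 50*(1+2+3+4) = 500 records in total
sum((1:4) * truePartition)
#> [1] 500
# one entity per cluster: 200 unique records
sum(truePartition)
#> [1] 200
```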

## Partition Priors for Entity Resolution

This package implements four random partition models for entity resolution tasks:

• Two traditional random partition models:
1. Dirichlet process (DP) mixtures
2. Pitman–Yor process (PY) mixtures
• Two random partition models that exhibit the microclustering property, the Exchangeable Sequences of Clusters (ESC) models (further defined in our paper):
1. The ESCNB model
2. The ESCD model

## Posterior Samples

In order to obtain posterior samples of the cluster assignments and the model parameters the user needs to specify the following:

• the data,
• the random partition prior (“DP”, “PY”, “ESCNB”, or “ESCD”),
• the burn-in period,
• and the number of iterations for the Gibbs sampler to run.

## Investigation for Synthetic Data Set

Let’s investigate this for the synthetic data set by drawing posterior samples from the ESCD model using our simulated data, with a burn-in period of 5 iterations and 10 posterior samples.

```r
# example of drawing from the posterior with the ESCD prior
posteriorESCD <- SampleCluster(data = simulatedData, Prior = "ESCD", burn = 5, nsamples = 10)
#> [1] "iter= 10"
```

The output is a list with two elements:

• Z: A matrix of size nsamples x N containing the samples of the cluster assignments.
• Params: A matrix with nsamples rows and one column per model hyperparameter, containing the samples of the hyperparameters. The columns named beta_1 to beta_L correspond to the distortion probabilities of the fields in the data.

Each row corresponds to an iteration of the Gibbs sampler, and each column corresponds to a record; as expected, we have 10 Gibbs iterations and 500 records. We can inspect the first five rows and first 10 columns of the posterior output.

```r
dim(posteriorESCD$Z)
#> [1] 10 500
posteriorESCD$Z[1:5, 1:10]
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,]  123   90  154  105  121  110  264  122    8   260
#> [2,]  119   87  149  102  262  107  254  118    8   250
#> [3,]  116   84  146   99  253  104  246  115    8   257
#> [4,]  112  186  142   96  238  101  233  111    8   242
#> [5,]  108  181  137   92  230   97  225  107    7   234
```

In addition, we can inspect the samples of the model hyperparameters. In the case of the ESCD model, there are three hyperparameters: $\alpha$, $r$, and $p$.

```r
head(posteriorESCD$Params)
#>      alpha          r         p     beta_1     beta_2     beta_3     beta_4
#> [1,]     1 0.21983936 0.5273158 0.09402145 0.09792707 0.08056948 0.09822450
#> [2,]     1 0.69413123 0.5162523 0.07294254 0.06327406 0.06065586 0.09775310
#> [3,]     1 0.75932593 0.3238241 0.06741252 0.07269144 0.06716123 0.09311753
#> [4,]     1 0.96935686 0.4881323 0.04866242 0.04653387 0.04441998 0.05344789
#> [5,]     1 0.41957922 0.4396494 0.03370082 0.04297929 0.04055417 0.06170793
#> [6,]     1 0.09803845 0.2113132 0.03990637 0.03613336 0.04101764 0.05028134
#>          beta_5
#> [1,] 0.09990110
#> [2,] 0.09543110
#> [3,] 0.09561922
#> [4,] 0.08492962
#> [5,] 0.06825275
#> [6,] 0.04042812
```

Samples for the DP, PY, and ESCNB models can be obtained similarly:

```r
posteriorDP <- SampleCluster(simulatedData, "DP", 5, 10)
posteriorPY <- SampleCluster(simulatedData, "PY", 5, 10)
posteriorESCNB <- SampleCluster(simulatedData, "ESCNB", 5, 10)
```

## Evaluation Metrics

If the ground truth for the partition of the data is available, the average False Negative Rate (FNR) and False Discovery Rate (FDR) over the posterior samples can be computed (for any model) using the mean_fnr and mean_fdr functions:

```r
maxPartitionSize <- length(truePartition)
uniqueNumberRecords <- sum(truePartition)
# true cluster assignments
id <- rep(1:uniqueNumberRecords, times=rep(1:maxPartitionSize, times=truePartition))
# average fnr
mean_fnr(posteriorESCD$Z, id)
#> [1] 0.3066
# average fdr
mean_fdr(posteriorESCD$Z, id)
#> [1] 0.1661055
```
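The same metrics apply to any of the priors, so the models can be compared side by side. The following is a sketch that assumes the posteriorDP, posteriorPY, and posteriorESCNB samples above have been drawn:

```r
# compare average error rates over the posterior samples of each prior
posteriors <- list(DP = posteriorDP, PY = posteriorPY,
                   ESCNB = posteriorESCNB, ESCD = posteriorESCD)
sapply(posteriors, function(post)
  c(fnr = mean_fnr(post$Z, id), fdr = mean_fdr(post$Z, id)))
```

Each column of the resulting matrix holds the average FNR and FDR for one prior.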

Of course, in practice one would want to run the sampler much longer to reliably estimate the error rates above.