tlsh

Rebecca C. Steorts

2020-11-02

We present a small example from Steorts, R., Ventura, S., Sadinle, M., and Fienberg, S. (2014). “Blocking Comparisons for Record Linkage.” Privacy in Statistical Databases (Lecture Notes in Computer Science 8744), ed. J. Domingo-Ferrer, Springer, 252-268. We will be using the blink package in R and the RLdata500 data set, which was previously available in the RecordLinkage package (now deprecated). Here, we illustrate transitive locality sensitive hashing (TLSH).

In a record linkage task, one wants to remove duplicate entries from one or more databases. Before performing this task, however, one needs a means of dimension reduction (blocking) so that the record linkage task is computationally scalable.
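
A quick calculation shows why: comparing all pairs of records scales quadratically with the database size.

# All-pairs comparisons grow quadratically in the number of records
choose(500, 2)   # 124750 comparisons for the 500 records used below
choose(1e6, 2)   # about 5e11 comparisons for a million records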

Using the TLSH algorithm from this package, we illustrate an example on a German data set comprised of first name, last name, and full date of birth.

Our goals include understanding the RLdata500 data set, applying TLSH to it to form blocks, and evaluating the blocking via the recall and the reduction ratio.

Understanding the RLdata500 dataset

The RLdata500 data set is included in the blink package in R. We review this data set for the user.

The RLdata500 data consists of 500 records with 10 percent duplication; thus, there are 450 unique individuals. Each record has complete information: first name, last name, and full date of birth.

We first load the blink package and the RLdata500 data set, and show the first few lines of the data. We also remove the second name columns (fname_c2 and lname_c2), which are all missing in this data set.

library(blink)
library(plyr)
library(tlsh)
data(RLdata500)
head(RLdata500)
##   fname_c1 fname_c2 lname_c1 lname_c2   by bm bd
## 1  CARSTEN     <NA>    MEIER     <NA> 1949  7 22
## 2     GERD     <NA>    BAUER     <NA> 1968  7 27
## 3   ROBERT     <NA> HARTMANN     <NA> 1930  4 30
## 4   STEFAN     <NA>    WOLFF     <NA> 1957  9  2
## 5     RALF     <NA>  KRUEGER     <NA> 1966  1 13
## 6  JUERGEN     <NA>   FRANKE     <NA> 1929  7  4
data.500 <- RLdata500[-c(2,4)]
head(data.500)
##   fname_c1 lname_c1   by bm bd
## 1  CARSTEN    MEIER 1949  7 22
## 2     GERD    BAUER 1968  7 27
## 3   ROBERT HARTMANN 1930  4 30
## 4   STEFAN    WOLFF 1957  9  2
## 5     RALF  KRUEGER 1966  1 13
## 6  JUERGEN   FRANKE 1929  7  4
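
As a sanity check on the stated duplication rate, the vector identity.RLdata500, which ships with the data and is used again later in this vignette, gives the true entity label of each record:

# identity.RLdata500 gives the true entity id of each record;
# 450 distinct ids confirm 50 duplicates among the 500 records
length(unique(identity.RLdata500))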

TLSH applied to RLdata500

We now explain how to run TLSH on the RLdata500 data set, piece by piece.

  1. We first must create a universal set of tokens.
  2. We then find the number of tokens in the universal set.
  3. Then we generate a vector of random hash functions (a sketch of steps 1-3 follows this list).
  4. Next, we create an index vector and apply the hash functions to each record.
  5. Then we build an edge list, divide the graph into communities, and subdivide the communities further if needed.
  6. The resulting communities are our blocks.
  7. Finally, we compute the reduction ratio and the recall.
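
To make steps 1 through 3 concrete, here is a minimal illustrative sketch of k-shingling and of drawing a random hash function. The helpers shingle_record and make_hash are hypothetical, written only for exposition; they are not functions exported by tlsh.

# Hypothetical helper: k-shingle a record by concatenating its fields
# and taking all overlapping substrings of length k (here k = 2)
shingle_record <- function(rec, k = 2) {
  s <- paste(rec, collapse = "")
  substring(s, 1:(nchar(s) - k + 1), k:nchar(s))
}
shingle_record(c("CARSTEN", "MEIER", "1949", "7", "22"))
# "CA" "AR" "RS" ... ; the universal set is the union of the
# shingles over all records

# Hypothetical helper: one random hash function h(x) = (a*x + b) mod p,
# applied to token indices; TLSH draws a vector of such functions
make_hash <- function(p = 2147483647) {
  a <- sample.int(p - 1, 1)
  b <- sample.int(p - 1, 1)
  function(x) (a * x + b) %% p
}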

The function that finds the blocks is called block_setup_v2.

 blocks <- block_setup_v2(RLdata500, b=22, k=2)
## [1] "Creating the universal set of tokens"
## elapsed 
##   0.005 
## [1] "Number of tokens in universal set"
## [1] 404
## [1] "Generating a vector of random hash functions"
## elapsed 
##   0.003 
## [1] "Creating index vector and applying hash functions to first record"
##    user  system elapsed 
##   0.006   0.000   0.006 
##    user  system elapsed 
##   3.205   0.021   3.234 
## [1] "Creating edgelist"
##    user  system elapsed 
##   0.207   0.007   0.214 
## [1] 23146     2
## [1] "Building graph from edgelist"
##    user  system elapsed 
##   0.001   0.000   0.002 
## [1] "Dividing graph into communities initially"
##    user  system elapsed 
##   0.017   0.000   0.017 
## [1] "Subdividng communities"
##    user  system elapsed 
##       0       0       0
 summary(blocks)
##      Length Class  Mode   
## [1,]  48    -none- numeric
## [2,]  62    -none- numeric
## [3,] 141    -none- numeric
## [4,] 249    -none- numeric

Here, b is the number of buckets and k is the shingle size.

Observe that the blocks need not be of equal size; here, they range from 48 to 249 records.
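
The summary above suggests that each element of blocks is a numeric vector of record indices, so (under that assumption) the blocking can be inspected directly:

# Block sizes; a partition of the 500 records should sum to 500
sapply(blocks, length)
sum(sapply(blocks, length))

# Peek at the records assigned to the first block
head(RLdata500[blocks[[1]], ])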

The function that computes the recall is eval.blocksetup.

eval.blocksetup(RLdata500, b=26, key=identity.RLdata500)
## [1] "Creating the universal set of tokens"
## elapsed 
##   0.004 
## [1] "Number of tokens in universal set"
## [1] 2516
## [1] "Generating a vector of random hash functions"
## elapsed 
##   0.003 
## [1] "Creating index vector and applying hash functions to first record"
##    user  system elapsed 
##   0.006   0.000   0.006 
##    user  system elapsed 
##   3.298   0.020   3.328 
## [1] "Creating edgelist"
##    user  system elapsed 
##   0.260   0.005   0.269 
## [1] 13434     2
## [1] "Building graph from edgelist"
##    user  system elapsed 
##   0.001   0.001   0.002 
## [1] "Dividing graph into communities initially"
##    user  system elapsed 
##   0.003   0.000   0.003 
## [1] "Subdividng communities"
##    user  system elapsed 
##       0       0       0
##   recall
## 1   0.86
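
Recall here is the proportion of truly matching record pairs that land in the same block. Assuming blocks is a list of record-index vectors and the key holds the true entity labels, it could be computed along the following lines. This is an illustrative sketch, not the package's internal code, and its value may differ from the output above, which used b=26 rather than the b=22 blocks.

# Illustrative recall: matching pairs preserved within blocks,
# divided by all matching pairs (records sharing an entity label)
recall_from_blocks <- function(blocks, key) {
  total  <- sum(choose(table(key), 2))
  within <- sum(sapply(blocks, function(blk) sum(choose(table(key[blk]), 2))))
  within / total
}
recall_from_blocks(blocks, identity.RLdata500)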

The function that computes the reduction ratio is reduction.ratio.from.blocking.

(rr <- reduction.ratio.from.blocking(blocks)) 
## [1] 0.6491784
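
The reduction ratio is one minus the share of comparisons that survive blocking: with block sizes n_b, it is 1 - (sum over blocks of choose(n_b, 2)) / choose(n, 2). Recomputing it from the block sizes reproduces the value above:

# 1 - (within-block pairs) / (all pairs); with sizes 48, 62, 141, 249
# this is 1 - 43765/124750 = 0.6491784
sizes <- sapply(blocks, length)
1 - sum(choose(sizes, 2)) / choose(nrow(RLdata500), 2)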

To summarize, we have reduced the comparison space by roughly 65 percent, and the recall is 0.86, which means that roughly 14 percent of the truly matching pairs are split across blocks.