Getting started

Colin Fay

2017-09-29

tidystringdist

Compute string distance the tidy way. Built on top of the ‘stringdist’ package.

Install tidystringdist

devtools::install_github("ColinFay/tidystringdist")

tidystringdist basic workflow

tidycomb

First, you need to create a tibble with the combinations of words you want to compare. You can do this with the tidy_comb and tidy_comb_all functions. The first takes a base word and combines it with each elements of a list or a column of a data.frame, the 2nd combines all the possible couples from a list or a column.

If you already have a data.frame with two columns containing the strings to compare, you can skip this part.

library(tidystringdist)

tidy_comb_all(LETTERS[1:3])
#>   V1 V2
#> 1  A  B
#> 2  A  C
#> 3  B  C
tidy_comb_all(iris, Species)
#>           V1         V2
#> 1     setosa versicolor
#> 2     setosa  virginica
#> 3 versicolor  virginica
tidy_comb("Paris", state.name[1:3])
#>        V1    V2
#> 1 Alabama Paris
#> 2  Alaska Paris
#> 3 Arizona Paris

tidy_string_dist

Once you’ve got this data.frame, you can use tidy_string_dist to compute string distance. This function takes a data.frame, the two columns containing the strings, and a stringdist method.

Note that if you’ve used the tidy_comb function to create you data.frame, you won’t need to set the column names.

library(dplyr)
data(starwars)
tidy_comb_sw <- tidy_comb_all(starwars, name)
tidy_stringdist(tidy_comb_sw, method = "jw")
#> # A tibble: 3,741 x 3
#>                V1                 V2 string_dist
#>  *         <fctr>             <fctr>       <dbl>
#>  1 Luke Skywalker              C-3PO   1.0000000
#>  2 Luke Skywalker              R2-D2   1.0000000
#>  3 Luke Skywalker        Darth Vader   0.5752165
#>  4 Luke Skywalker        Leia Organa   0.5335498
#>  5 Luke Skywalker          Owen Lars   0.4624339
#>  6 Luke Skywalker Beru Whitesun lars   0.4656085
#>  7 Luke Skywalker              R5-D4   1.0000000
#>  8 Luke Skywalker  Biggs Darklighter   0.5728291
#>  9 Luke Skywalker     Obi-Wan Kenobi   0.6349206
#> 10 Luke Skywalker   Anakin Skywalker   0.2816558
#> # ... with 3,731 more rows

The goal is to provide a convenient interface to work with other tools from the tidyverse.

tidy_stringdist(tidy_comb_sw, method= "osa") %>%
  filter(string_dist > 20) %>%
  arrange(desc(string_dist))
#> # A tibble: 11 x 3
#>                       V1                    V2 string_dist
#>                   <fctr>                <fctr>       <dbl>
#>  1                 C-3PO Jabba Desilijic Tiure          21
#>  2                 C-3PO Wicket Systri Warrick          21
#>  3                 R2-D2 Wicket Systri Warrick          21
#>  4                 R5-D4 Wicket Systri Warrick          21
#>  5 Jabba Desilijic Tiure                 IG-88          21
#>  6 Jabba Desilijic Tiure                 Cordé          21
#>  7 Jabba Desilijic Tiure                R4-P17          21
#>  8 Jabba Desilijic Tiure                   BB8          21
#>  9                 IG-88 Wicket Systri Warrick          21
#> 10 Wicket Systri Warrick                R4-P17          21
#> 11 Wicket Systri Warrick                   BB8          21
starwars %>%
  filter(species == "Droid") %>%
  tidy_comb_all(name) %>%
  tidy_stringdist() %>% 
  filter(string_dist > 2) %>%
  arrange(desc(string_dist))
#> # A tibble: 9 x 3
#>       V1     V2 string_dist
#>   <fctr> <fctr>       <dbl>
#> 1  C-3PO  R2-D2           5
#> 2  C-3PO  R5-D4           5
#> 3  C-3PO  IG-88           5
#> 4  C-3PO    BB8           5
#> 5  R2-D2    BB8           5
#> 6  R5-D4    BB8           5
#> 7  R2-D2  IG-88           4
#> 8  R5-D4  IG-88           4
#> 9  IG-88    BB8           4

Contact

Questions and feedbacks welcome!