General

The R source code comparison is based on similarity coefficients for the names used in R programs or expressions. Use cases would be the detection of

In the first case, detection of similar code sequences can lead to better code quality if similar code is embedded in a function rather than repeatedly in different places. In the second case, cheating is looked for.

The goal, however, is not perfect detection of similar code sequences, but rather to give clues as to where similar code sequences might be.

We have several steps to take:

  1. read in the source code
  2. create documents from source codes
  3. calculate similarities
  4. get an overview of the calculated similarities
  5. compare code sequences

A thanks to Masatoshi Nishimura to his blog post The Best Document Similarity Algorithm: A Beginner’s Guide which is very inspriring.

Step 1: Reading in source codes

The makers of the package SimilaR (R Source Code Similarity Evaluation) have provided some sample files for testing:

files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs  <- sourcecode(files, basename=TRUE)
#> 
#> /tmp/RtmpPp7OUw/Rinst33c862ec301e/rscc/examples/aa.R 
#> /tmp/RtmpPp7OUw/Rinst33c862ec301e/rscc/examples/aa1.R 
#> /tmp/RtmpPp7OUw/Rinst33c862ec301e/rscc/examples/bucketSort1.R 
#> /tmp/RtmpPp7OUw/Rinst33c862ec301e/rscc/examples/bucketSort1_addLines.R 
#> /tmp/RtmpPp7OUw/Rinst33c862ec301e/rscc/examples/bucketSort1_variables.R 
#> /tmp/RtmpPp7OUw/Rinst33c862ec301e/rscc/examples/isPrime2.R 
#> /tmp/RtmpPp7OUw/Rinst33c862ec301e/rscc/examples/isPrime2_addLines.R 
#> /tmp/RtmpPp7OUw/Rinst33c862ec301e/rscc/examples/isPrime2_callReverse.R 
#> /tmp/RtmpPp7OUw/Rinst33c862ec301e/rscc/examples/kendall4.R 
#> /tmp/RtmpPp7OUw/Rinst33c862ec301e/rscc/examples/kendall4_variables.R 
#> /tmp/RtmpPp7OUw/Rinst33c862ec301e/rscc/examples/kombinuj1.R 
#> /tmp/RtmpPp7OUw/Rinst33c862ec301e/rscc/examples/kombinuj1_variables.R 
#> /tmp/RtmpPp7OUw/Rinst33c862ec301e/rscc/examples/kwantyle1.R 
#> /tmp/RtmpPp7OUw/Rinst33c862ec301e/rscc/examples/kwantyle1_variables.R
names(prgs)
#>  [1] "aa.R"                    "aa1.R"                  
#>  [3] "bucketSort1.R"           "bucketSort1_addLines.R" 
#>  [5] "bucketSort1_variables.R" "isPrime2.R"             
#>  [7] "isPrime2_addLines.R"     "isPrime2_callReverse.R" 
#>  [9] "kendall4.R"              "kendall4_variables.R"   
#> [11] "kombinuj1.R"             "kombinuj1_variables.R"  
#> [13] "kwantyle1.R"             "kwantyle1_variables.R"

The parameter basename=TRUE ensures that names of the list elements are the basename of the files and not the file names including the path.

The parameter silent=TRUE suppresses the output of the parsed files. If an error occurs during parsing, the file will not be loaded and will be included in the following steps.

If you want to consider expressions and not the whole R file, you have to set the parameter minlines. sourcecode checks whether an expression in the source file has more than minlines lines. If so, the expression is kept for further analysis. The name of the list items in prgs is then filename[number]. For example, you could access the expression named prgs[["aa.R[1]"]].

files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs  <- sourcecode(files, basename=TRUE, minlines=3, silent=TRUE)
names(prgs)
#>  [1] "aa.R[1]"                    "aa1.R[1]"                  
#>  [3] "bucketSort1.R[1]"           "bucketSort1_addLines.R[1]" 
#>  [5] "bucketSort1_variables.R[1]" "isPrime2.R[1]"             
#>  [7] "isPrime2_addLines.R[1]"     "isPrime2_callReverse.R[1]" 
#>  [9] "kendall4.R[1]"              "kendall4_variables.R[1]"   
#> [11] "kombinuj1.R[1]"             "kombinuj1_variables.R[1]"  
#> [13] "kwantyle1.R[1]"             "kwantyle1_variables.R[1]"

Step 2: Create documents

The next step is to create documents, based on the names of variables, functions and operators from the parsed source codes-

docs <- documents(prgs)
# create term document frequency table
freq_table(docs)[1:8,1:8]
#>                             vword
#> vdoc                         asd asd1 asd2 asd3 asd4 asd5 asd6 asd7
#>   aa.R[1]                      1    0    0    0    0    0    0    0
#>   aa1.R[1]                     1    0    0    0    0    0    0    0
#>   bucketSort1.R[1]             0    0    0    0    0    0    0    0
#>   bucketSort1_addLines.R[1]    0    0    0    0    0    0    0    0
#>   bucketSort1_variables.R[1]   0    0    0    0    0    0    0    0
#>   isPrime2.R[1]                0    0    0    0    0    0    0    0
#>   isPrime2_addLines.R[1]       0    0    0    0    0    0    0    0
#>   isPrime2_callReverse.R[1]    0    1    1    1    1    1    1    1

With the type parameter you can distinguish between different types of names and with ignore.case the names are either treated case-insensitive or not.

type

With the type parameter you can distinguish between different types of names:

cat(as.character(prgs[[1]]))                       # source code
#> asd <- function(x) {
#>     for (i in x) {
#>         cat(i)
#>         x[i] <- 3
#>     }
#> }
all.vars(prgs[[1]])                                # type="v", default
#> [1] "asd" "i"   "x"
all.names(prgs[[1]])                               # type="n"
#>  [1] "<-"       "asd"      "function" "{"        "for"      "i"       
#>  [7] "x"        "{"        "cat"      "i"        "<-"       "["       
#> [13] "x"        "i"
setdiff(all.names(prgs[[1]]), all.vars(prgs[[1]])) # type="f"
#> [1] "<-"       "function" "{"        "for"      "cat"      "["

minlen and ignore.case

With the parameter minlen you can exclude names that are shorter than minlen. The default is minlen=2 because the name of an index variable in loops often consists of only one letter, for example for (i in 1:n). ignore.case is either TRUE or FALSE. If TRUE (default), then names wil case-insensitive, otherwuse case-sensitive.

Step 3a: Calculate similarity coefficients

In cluster analysis has been developed several similarity coefficients. They use words and basically calculate the intersection of words in documents over the union of words.

To calculate similarity coefficients between all source text segments based on the names used:

files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs  <- sourcecode(files, basename=TRUE, silent=TRUE)
docs  <- documents(prgs)
similarities(docs)[1:8,1:8]
#>                         aa.R aa1.R bucketSort1.R bucketSort1_addLines.R
#> aa.R                       1     1     0.0000000              0.0000000
#> aa1.R                      1     1     0.0000000              0.0000000
#> bucketSort1.R              0     0     1.0000000              1.0000000
#> bucketSort1_addLines.R     0     0     1.0000000              1.0000000
#> bucketSort1_variables.R    0     0     0.1428571              0.1428571
#> isPrime2.R                 0     0     0.0000000              0.0000000
#> isPrime2_addLines.R        0     0     0.0000000              0.0000000
#> isPrime2_callReverse.R     0     0     0.0000000              0.0000000
#>                         bucketSort1_variables.R isPrime2.R isPrime2_addLines.R
#> aa.R                                  0.0000000  0.0000000           0.0000000
#> aa1.R                                 0.0000000  0.0000000           0.0000000
#> bucketSort1.R                         0.1428571  0.0000000           0.0000000
#> bucketSort1_addLines.R                0.1428571  0.0000000           0.0000000
#> bucketSort1_variables.R               1.0000000  0.0000000           0.0000000
#> isPrime2.R                            0.0000000  1.0000000           1.0000000
#> isPrime2_addLines.R                   0.0000000  1.0000000           1.0000000
#> isPrime2_callReverse.R                0.0000000  0.1111111           0.1111111
#>                         isPrime2_callReverse.R
#> aa.R                                 0.0000000
#> aa1.R                                0.0000000
#> bucketSort1.R                        0.0000000
#> bucketSort1_addLines.R               0.0000000
#> bucketSort1_variables.R              0.0000000
#> isPrime2.R                           0.1111111
#> isPrime2_addLines.R                  0.1111111
#> isPrime2_callReverse.R               1.0000000

This calculates a matrix of Jaccard coefficients based on the variable names.

If the Jaccard coefficient is used then each entry can be interpreted as the percentage:

\[\frac{\text{number of words used in both documents}}{\text{number of all words used in both documents}}.\] The interpretation will be different if another similarity coefficient is used! But in any case, a higher similarity coefficient corresponds to a larger proportion of variable names in both files (or expressions).

coeff (similarity)

With the parameter coeff a certain similarity coefficient can be calculated (default: jaccard).

If you specify two sets with unique names set1, set2 and one set setfull with predefined names, four numbers will be calculated (default: setfull <- unique(c(set1,set2))):

inset1 <- setfull %in% unique(set1)
inset2 <- setfull %in% unique(set2)
p      <- length(setfull)
n11    <- sum(inset1 & inset2)
n10    <- sum(inset1 & !inset2)
n01    <- sum(!inset1 & inset2)
n00    <- sum(!inset1 & !inset2)

The following coefficients can be calculated:

  • braun = n11/max(n01+n11, n10+n11),
  • dice = 2*n11/(n01+n10+2*n11),
  • jaccard = n11/(n01+n10+n11) (default),
  • kappa = 1/(1+p/2*(n01+n10)/(n00*n11-n01*n10)),
  • kulczynski = n11/(n01+n10),
  • matching = (n00+n11)/p,
  • ochiai = n11/sqrt((n11+n10)*(n11+n10)),
  • phi = (n11*n00-n10*n01)/sqrt((n11+n10)*(n11+n10)*(n00+n10)*(n00+n10)),
  • russelrao = n11/p,
  • simpson = n11/min(n01+n11, n10+n11),
  • sneath = n11/(n11+2*n01+2*n10),
  • tanimoto = (n11+n00)/(n11+2*n01+2*n10+n00), and
  • yule = (n11*n00-n01*n10)/(n11*n00-n01*n10).

If a coefficient name is not found or a NaN is generated then a zero is returned.

files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs  <- sourcecode(files, basename=TRUE, silent=TRUE)
docs  <- documents(prgs)
similarities(docs, coeff="m")[1:8,1:8]
#>                         aa.R aa1.R bucketSort1.R bucketSort1_addLines.R
#> aa.R                       1     1     0.0000000              0.0000000
#> aa1.R                      1     1     0.0000000              0.0000000
#> bucketSort1.R              0     0     1.0000000              1.0000000
#> bucketSort1_addLines.R     0     0     1.0000000              1.0000000
#> bucketSort1_variables.R    0     0     0.1428571              0.1428571
#> isPrime2.R                 0     0     0.0000000              0.0000000
#> isPrime2_addLines.R        0     0     0.0000000              0.0000000
#> isPrime2_callReverse.R     0     0     0.0000000              0.0000000
#>                         bucketSort1_variables.R isPrime2.R isPrime2_addLines.R
#> aa.R                                  0.0000000  0.0000000           0.0000000
#> aa1.R                                 0.0000000  0.0000000           0.0000000
#> bucketSort1.R                         0.1428571  0.0000000           0.0000000
#> bucketSort1_addLines.R                0.1428571  0.0000000           0.0000000
#> bucketSort1_variables.R               1.0000000  0.0000000           0.0000000
#> isPrime2.R                            0.0000000  1.0000000           1.0000000
#> isPrime2_addLines.R                   0.0000000  1.0000000           1.0000000
#> isPrime2_callReverse.R                0.0000000  0.1111111           0.1111111
#>                         isPrime2_callReverse.R
#> aa.R                                 0.0000000
#> aa1.R                                0.0000000
#> bucketSort1.R                        0.0000000
#> bucketSort1_addLines.R               0.0000000
#> bucketSort1_variables.R              0.0000000
#> isPrime2.R                           0.1111111
#> isPrime2_addLines.R                  0.1111111
#> isPrime2_callReverse.R               1.0000000

Step 3b: Similarity based on term frequency–inverse document frequency (td-idf)

Another possibility to find similar source codes as cosine of angles of term frequency–inverse document frequency:

files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs  <- sourcecode(files, basename=TRUE, silent=TRUE)
docs  <- documents(prgs)
tfidf(docs)[1:8,1:8]
#>                          
#>                           aa.R aa1.R bucketSort1.R bucketSort1_addLines.R
#>   aa.R                       1     1     0.0000000              0.0000000
#>   aa1.R                      1     1     0.0000000              0.0000000
#>   bucketSort1.R              0     0     1.0000000              1.0000000
#>   bucketSort1_addLines.R     0     0     1.0000000              1.0000000
#>   bucketSort1_variables.R    0     0     0.3128966              0.3128966
#>   isPrime2.R                 0     0     0.0000000              0.0000000
#>   isPrime2_addLines.R        0     0     0.0000000              0.0000000
#>   isPrime2_callReverse.R     0     0     0.0000000              0.0000000
#>                          
#>                           bucketSort1_variables.R isPrime2.R
#>   aa.R                                  0.0000000  0.0000000
#>   aa1.R                                 0.0000000  0.0000000
#>   bucketSort1.R                         0.3128966  0.0000000
#>   bucketSort1_addLines.R                0.3128966  0.0000000
#>   bucketSort1_variables.R               1.0000000  0.0000000
#>   isPrime2.R                            0.0000000  1.0000000
#>   isPrime2_addLines.R                   0.0000000  1.0000000
#>   isPrime2_callReverse.R                0.0000000  0.2021137
#>                          
#>                           isPrime2_addLines.R isPrime2_callReverse.R
#>   aa.R                              0.0000000              0.0000000
#>   aa1.R                             0.0000000              0.0000000
#>   bucketSort1.R                     0.0000000              0.0000000
#>   bucketSort1_addLines.R            0.0000000              0.0000000
#>   bucketSort1_variables.R           0.0000000              0.0000000
#>   isPrime2.R                        1.0000000              0.2021137
#>   isPrime2_addLines.R               1.0000000              0.2021137
#>   isPrime2_callReverse.R            0.2021137              1.0000000

The attribute tfidf of the result matrix contains the term frequency–inverse document frequency matrix.

Step 4: Get an overview about the results

Since the matrix of coefficients can not easily be overviewed we have must transform them for visualiziation.

matrix2dataframe

With matrix2dataframe you transform the matrix of coefficients to a data frame where the first column is the row index, the scrond column the column index and the third column the the matrix entry.

files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs  <- sourcecode(files, basename=TRUE, silent=TRUE)
docs  <- documents(prgs)
simm  <- similarities(docs, coeff="m")
simdf <- matrix2dataframe(simm)
head(simdf, 10)
#>                       row                     col  matching
#> 1                    aa.R                   aa1.R 1.0000000
#> 2           bucketSort1.R  bucketSort1_addLines.R 1.0000000
#> 3              isPrime2.R     isPrime2_addLines.R 1.0000000
#> 4             kombinuj1.R   kombinuj1_variables.R 0.3000000
#> 5             kwantyle1.R   kwantyle1_variables.R 0.2000000
#> 6           bucketSort1.R bucketSort1_variables.R 0.1428571
#> 7  bucketSort1_addLines.R bucketSort1_variables.R 0.1428571
#> 8              kendall4.R    kendall4_variables.R 0.1428571
#> 9              isPrime2.R  isPrime2_callReverse.R 0.1111111
#> 10    isPrime2_addLines.R  isPrime2_callReverse.R 0.1111111

As you can see the lines are ordered by decreasing values. The parameter decreasing=FALSE will order it by ascending order.

The output can be interpreted line by line:

  • The variable names used in the files aa.R and aa1.R are identical. Each variable name used in aa.R is also used in aa1.R and vice versa.
  • The variable names in the files bucketSort1_addLines.R and bucketSort1.R are identical.
  • The variable names in the files isPrime2_addLines.R and isPrime2.R are identical.
  • The variable names in the files kombinuj1_variables.R and kombinuj1.R are not identical. In both files, the variable names overlap by 30%.
  • And so on.

In case of a symmetric similarity matrix only the upper triangle is considered otherwise the whole matrix.

You may use a stripchart to see all similarity coefficients:

stripchart(simdf[,3], "jitter", pch=19, xlab=names(simdf)[3])

as_igraph

In a second step you can plot the coefficients in a diagram, where thicker edges correspond to higher similarity coefficients.

library("igraph")
#> 
#> Attache Paket: 'igraph'
#> Die folgenden Objekte sind maskiert von 'package:stats':
#> 
#>     decompose, spectrum
#> Das folgende Objekt ist maskiert 'package:base':
#> 
#>     union
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs  <- sourcecode(files, basename=TRUE, silent=TRUE)
docs  <- documents(prgs, type="n", minlen=3)
simm  <- similarities(docs)
graph <- as_igraph(simm, diag=FALSE)
# color all edges wit a large similarity coefficients in red
E(graph)$color <- ifelse(E(graph)$weight>0.4, "red", "grey")
plot(graph, edge.width=1+3*E(graph)$weight)
box()