General

The R source code comparison is based on similarity coefficients for the names used in R programs or expressions. Use cases would be the detection of similar code sequences within a code base and the detection of plagiarism across different files.

In the first case, detecting similar code sequences can lead to better code quality if the similar code is moved into a function rather than repeated in different places. In the second case, the aim is to find cheating.

The goal, however, is not perfect detection of similar code sequences, but rather to give clues as to where similar code sequences might be.

The workflow consists of four steps:

  1. read in the source code
  2. calculate similarity coefficients
  3. get an overview of the calculated similarity coefficients
  4. compare code sequences

Step 1: Read in the source code

The makers of the package SimilaR (R Source Code Similarity Evaluation) have provided some sample files for testing:

files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs  <- sourcecode(files, basename=TRUE)
#> 
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/aa.R 
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/aa1.R 
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/bucketSort1.R 
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/bucketSort1_addLines.R 
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/bucketSort1_variables.R 
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/isPrime2.R 
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/isPrime2_addLines.R 
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/isPrime2_callReverse.R 
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/kendall4.R 
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/kendall4_variables.R 
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/kombinuj1.R 
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/kombinuj1_variables.R 
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/kwantyle1.R 
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/kwantyle1_variables.R
names(prgs)
#>  [1] "aa.R"                    "aa1.R"                  
#>  [3] "bucketSort1.R"           "bucketSort1_addLines.R" 
#>  [5] "bucketSort1_variables.R" "isPrime2.R"             
#>  [7] "isPrime2_addLines.R"     "isPrime2_callReverse.R" 
#>  [9] "kendall4.R"              "kendall4_variables.R"   
#> [11] "kombinuj1.R"             "kombinuj1_variables.R"  
#> [13] "kwantyle1.R"             "kwantyle1_variables.R"

The parameter basename=TRUE ensures that the names of the list elements are the base names of the files rather than the full file paths.

The parameter silent=TRUE suppresses the output of the parsed files. If an error occurs during parsing, the file will not be loaded and will not be included in the following steps.
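How a parse failure might be caught can be sketched in base R. The helper safe_parse below is hypothetical and not part of rscc; it only illustrates that code with a syntax error yields no expressions and therefore cannot take part in the later steps.

```r
# Hypothetical helper (not part of rscc): parse R code and return NULL
# instead of stopping when the code contains a syntax error.
safe_parse <- function(code) {
  tryCatch(parse(text = code, keep.source = FALSE),
           error = function(e) NULL)
}

good <- safe_parse("f <- function(x) x + 1")
bad  <- safe_parse("f <- function(x) {")   # unbalanced brace, cannot be parsed

is.null(good)  # FALSE: the code was parsed
is.null(bad)   # TRUE: nothing to analyse for this input
```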

If you want to consider individual expressions and not the whole R file, you have to set the parameter minlines. sourcecode then checks whether an expression in the source file has more than minlines lines. If so, the expression is kept for further analysis. The names of the list items in prgs then have the form filename[number]. For example, you could access the first such expression of aa.R with prgs[["aa.R[1]"]].
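One way such a line count could be obtained in base R is through the source references that parse() keeps for each top-level expression. This is only a sketch of the idea and not necessarily how sourcecode works internally; the variable names are illustrative, and the comparison with > simply follows the wording "more than minlines lines".

```r
code <- c(
  "x <- 1",
  "f <- function(a) {",
  "  b <- a + 1",
  "  b * 2",
  "}"
)
exprs <- parse(text = code, keep.source = TRUE)

# A srcref starts with: first line, first byte, last line, last byte
refs   <- attr(exprs, "srcref")
nlines <- vapply(refs, function(s) {
  s <- as.integer(s)
  s[3] - s[1] + 1L
}, integer(1))
nlines                              # number of lines per top-level expression

minlines <- 3
keep <- exprs[nlines > minlines]    # keep expressions with more than minlines lines
length(keep)
```

Here the one-line assignment is dropped and only the four-line function definition is kept.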

files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs  <- sourcecode(files, basename=TRUE, minlines=3)
#> 
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/aa.R 
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/aa1.R 
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/bucketSort1.R 
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/bucketSort1_addLines.R 
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/bucketSort1_variables.R 
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/isPrime2.R 
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/isPrime2_addLines.R 
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/isPrime2_callReverse.R 
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/kendall4.R 
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/kendall4_variables.R 
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/kombinuj1.R 
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/kombinuj1_variables.R 
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/kwantyle1.R 
#> /tmp/Rtmp3F3aCE/Rinst163d67fdabb9/rscc/examples/kwantyle1_variables.R
names(prgs)
#>  [1] "aa.R[1]"                    "aa1.R[1]"                  
#>  [3] "bucketSort1.R[1]"           "bucketSort1_addLines.R[1]" 
#>  [5] "bucketSort1_variables.R[1]" "isPrime2.R[1]"             
#>  [7] "isPrime2_addLines.R[1]"     "isPrime2_callReverse.R[1]" 
#>  [9] "kendall4.R[1]"              "kendall4_variables.R[1]"   
#> [11] "kombinuj1.R[1]"             "kombinuj1_variables.R[1]"  
#> [13] "kwantyle1.R[1]"             "kwantyle1_variables.R[1]"

Step 2: Calculate the similarity coefficients

The next step is to calculate similarity coefficients between all source text segments based on the names used:

files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs  <- sourcecode(files, basename=TRUE, silent=TRUE)
simy  <- similarities(prgs)
head(simy)
#>             row                     col   jaccard
#> 1          aa.R                   aa1.R 1.0000000
#> 2 bucketSort1.R  bucketSort1_addLines.R 1.0000000
#> 3    isPrime2.R     isPrime2_addLines.R 1.0000000
#> 4   kombinuj1.R   kombinuj1_variables.R 0.3000000
#> 5   kwantyle1.R   kwantyle1_variables.R 0.2000000
#> 6 bucketSort1.R bucketSort1_variables.R 0.1428571

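This calculates the Jaccard coefficients based on the variable names.

The output can be interpreted line by line: a coefficient of 1, as for aa.R and aa1.R, means that both files use exactly the same set of considered variable names, while a coefficient of 0.3, as for kombinuj1.R and kombinuj1_variables.R, means that 30% of the variable names found in either file occur in both.

The interpretation will be different if a different similarity coefficient is used! But in any case, a higher similarity coefficient corresponds to a larger proportion of variable names shared by both files (or expressions).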
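The Jaccard coefficient for two sets of variable names can be worked out by hand in base R. The two expressions below are made up for illustration; all.vars is the same base R function that is used for type="v".

```r
# Two small expressions that differ in one variable name
e1 <- quote(total <- sum(values) / count)
e2 <- quote(total <- sum(items) / count)

v1 <- all.vars(e1)   # "total" "values" "count"
v2 <- all.vars(e2)   # "total" "items"  "count"

# Jaccard: shared names divided by names occurring in either expression
jaccard <- length(intersect(v1, v2)) / length(union(v1, v2))
jaccard              # 2 shared names out of 4 distinct names = 0.5
```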

type

With the type parameter you can distinguish between different types of names:

cat(as.character(prgs[[1]]))                       # source code
#> asd <- function(x) {
#>     for (i in x) {
#>         cat(i)
#>         x[i] <- 3
#>     }
#> }
all.vars(prgs[[1]])                                # type="v", default
#> [1] "asd" "i"   "x"
all.names(prgs[[1]])                               # type="n"
#>  [1] "<-"       "asd"      "function" "{"        "for"      "i"       
#>  [7] "x"        "{"        "cat"      "i"        "<-"       "["       
#> [13] "x"        "i"
setdiff(all.names(prgs[[1]]), all.vars(prgs[[1]])) # type="f"
#> [1] "<-"       "function" "{"        "for"      "cat"      "["

minlen and ignore.case

With the parameter minlen you can exclude names that are shorter than minlen. The default is minlen=2 because the name of an index variable in loops often consists of only one letter, for example for (i in 1:n). The parameter ignore.case is either TRUE or FALSE. If TRUE (default), upper and lower case are not distinguished, so "A" and "a" count as the same name.
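The effect of both parameters on a set of names can be sketched as follows. The helper filter_names is hypothetical and not part of rscc, and folding to lower case with tolower is an assumption about how ignore.case might be realised.

```r
# Hypothetical helper (not part of rscc) mimicking minlen and ignore.case
filter_names <- function(nms, minlen = 2, ignore.case = TRUE) {
  if (ignore.case) nms <- tolower(nms)   # assumption: fold to lower case
  unique(nms[nchar(nms) >= minlen])      # drop names shorter than minlen
}

nms <- c("i", "x", "Data", "data", "asd")
filter_names(nms)                        # "data" "asd"
filter_names(nms, minlen = 4)            # "data"
filter_names(nms, ignore.case = FALSE)   # "Data" "data" "asd"
```

With minlen=4, a file whose only names are asd, i and x would be left without any names to compare.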

files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs  <- sourcecode(files, basename=TRUE, silent=TRUE)
simy  <- similarities(prgs, minlen=4)
#> Warning in similarities(prgs, minlen = 4): no names found in aa.R, aa1.R
head(simy)
#>                      row                     col   jaccard
#> 1          bucketSort1.R  bucketSort1_addLines.R 1.0000000
#> 2          bucketSort1.R bucketSort1_variables.R 1.0000000
#> 3 bucketSort1_addLines.R bucketSort1_variables.R 1.0000000
#> 4             isPrime2.R     isPrime2_addLines.R 1.0000000
#> 5            kombinuj1.R   kombinuj1_variables.R 0.3333333
#> 6            kwantyle1.R   kwantyle1_variables.R 0.2000000

same.file

If you are only interested in similarities between different files, you can set same.file=FALSE; the similarities between expressions that are in the same file are then set to zero. The use case here is to detect plagiarism across different files.
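The idea of zeroing out same-file pairs can be sketched on a small matrix. The similarity values below are made up, and recovering the file name by stripping the trailing [number] is an assumption based on the naming scheme filename[number]; this is not the package's own code.

```r
# Illustrative similarity matrix for three expressions from two files
# (values are made up for this sketch)
ids <- c("a.R[1]", "a.R[2]", "b.R[1]")
m   <- matrix(c(1.0, 0.8, 0.5,
                0.8, 1.0, 0.2,
                0.5, 0.2, 1.0), nrow = 3, dimnames = list(ids, ids))

# Strip the trailing "[number]" to recover the file of each expression
file_of <- sub("\\[[0-9]+\\]$", "", ids)

# Zero out every pair that comes from the same file (including the diagonal)
m[outer(file_of, file_of, "==")] <- 0
m
```

Only the entries that compare an expression of a.R with the expression of b.R remain non-zero.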

files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs  <- sourcecode(files, basename=TRUE, silent=TRUE, minlines=1)
simy  <- similarities(prgs)
attr(simy, "similarity")[1:3,1:3]
#>                  aa.R[1] aa1.R[1] bucketSort1.R[1]
#> aa.R[1]                1        1                0
#> aa1.R[1]               1        1                0
#> bucketSort1.R[1]       0        0                1
simy  <- similarities(prgs, same.file=FALSE)
attr(simy, "similarity")[1:3,1:3]
#>                  aa.R[1] aa1.R[1] bucketSort1.R[1]
#> aa.R[1]                0        1                0
#> aa1.R[1]               1        0                0
#> bucketSort1.R[1]       0        0                0

coeff (similarity)

With the parameter coeff you choose which similarity coefficient is calculated (default: jaccard).

Given two sets of unique names, set1 and set2, and a set setfull of predefined names, four counts are calculated (default: setfull <- unique(c(set1,set2))):

inset1 <- setfull %in% unique(set1)
inset2 <- setfull %in% unique(set2)
p      <- length(setfull)
n11    <- sum(inset1 & inset2)
n10    <- sum(inset1 & !inset2)
n01    <- sum(!inset1 & inset2)
n00    <- sum(!inset1 & !inset2)

The following coefficients can be calculated:

  • braun = n11/max(n01+n11, n10+n11),
  • dice = 2*n11/(n01+n10+2*n11),
  • jaccard = n11/(n01+n10+n11) (default),
  • kappa = 1/(1+p/2*(n01+n10)/(n00*n11-n01*n10)),
  • kulczynski = n11/(n01+n10),
  • matching = (n00+n11)/p,
  • ochiai = n11/sqrt((n11+n10)*(n11+n01)),
  • phi = (n11*n00-n10*n01)/sqrt((n11+n10)*(n11+n01)*(n00+n10)*(n00+n01)),
  • russelrao = n11/p,
  • simpson = n11/min(n01+n11, n10+n11),
  • sneath = n11/(n11+2*n01+2*n10),
  • tanimoto = (n11+n00)/(n11+2*n01+2*n10+n00), and
  • yule = (n11*n00-n01*n10)/(n11*n00+n01*n10).

If a coefficient name is not found or a NaN is generated then a zero is returned.
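The four counts and a few of the coefficients can be reproduced in base R. The helper name_coeffs is hypothetical and not part of rscc; it uses the standard definitions of the Jaccard, Dice, simple matching and Ochiai coefficients, and the name sets are made up.

```r
# Hypothetical helper (not part of rscc): 2x2 counts and a few coefficients
name_coeffs <- function(set1, set2, setfull = unique(c(set1, set2))) {
  inset1 <- setfull %in% unique(set1)
  inset2 <- setfull %in% unique(set2)
  p   <- length(setfull)
  n11 <- sum(inset1 & inset2)     # names in both sets
  n10 <- sum(inset1 & !inset2)    # names only in set1
  n01 <- sum(!inset1 & inset2)    # names only in set2
  n00 <- sum(!inset1 & !inset2)   # names in neither set
  c(jaccard  = n11 / (n01 + n10 + n11),
    dice     = 2 * n11 / (n01 + n10 + 2 * n11),
    matching = (n00 + n11) / p,
    ochiai   = n11 / sqrt((n11 + n10) * (n11 + n01)))
}

set1 <- c("total", "values", "count")
set2 <- c("total", "items", "count")

# Default setfull: no name lies outside both sets, so n00 = 0
res <- name_coeffs(set1, set2)
res

# A larger predefined setfull adds names found in neither set (n00 = 2)
res2 <- name_coeffs(set1, set2,
                    setfull = c(set1, "items", "extra1", "extra2"))
res2
```

With the default setfull, n00 is always zero, so the matching coefficient coincides with the Jaccard coefficient. This would explain why a run with coeff="m" and the default settings reports the same values as the Jaccard run.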

files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs  <- sourcecode(files, basename=TRUE, silent=TRUE)
simy  <- similarities(prgs, coeff="m")
head(simy)
#>             row                     col  matching
#> 1          aa.R                   aa1.R 1.0000000
#> 2 bucketSort1.R  bucketSort1_addLines.R 1.0000000
#> 3    isPrime2.R     isPrime2_addLines.R 1.0000000
#> 4   kombinuj1.R   kombinuj1_variables.R 0.3000000
#> 5   kwantyle1.R   kwantyle1_variables.R 0.2000000
#> 6 bucketSort1.R bucketSort1_variables.R 0.1428571

decreasing and tol

The matrix m of similarities is first checked for symmetry: it is considered symmetric if abs(m-t(m))<=tol holds for all entries. The result is stored in the attribute symmetrical. The matrix m is then transformed into a data frame whose first column is the row index, whose second column is the column index, and whose third column is the similarity coefficient. If decreasing is TRUE (default), the data frame is sorted in descending order by the third column.
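The symmetry check and the conversion into a sorted data frame can be sketched in base R. The matrix values and the value of tol are made up, and taking only the upper triangle for a symmetric matrix is an assumption; the package may proceed differently.

```r
# Illustrative symmetric similarity matrix (values are made up)
ids <- c("a.R", "b.R", "c.R")
m   <- matrix(c(1.0, 0.3, 0.9,
                0.3, 1.0, 0.1,
                0.9, 0.1, 1.0), nrow = 3, dimnames = list(ids, ids))

tol <- 1e-8                                  # assumed tolerance
symmetrical <- all(abs(m - t(m)) <= tol)

# For a symmetric matrix the upper triangle holds every pair once
idx <- which(upper.tri(m), arr.ind = TRUE)
df  <- data.frame(row = rownames(m)[idx[, "row"]],
                  col = colnames(m)[idx[, "col"]],
                  similarity = m[idx],
                  stringsAsFactors = FALSE)

# decreasing = TRUE: largest coefficients first
df <- df[order(df$similarity, decreasing = TRUE), ]
rownames(df) <- NULL
df
```

The pair a.R and c.R, which has the largest coefficient, ends up in the first row.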

files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs  <- sourcecode(files, basename=TRUE, silent=TRUE)
simy  <- similarities(prgs)
head(simy)
#>             row                     col   jaccard
#> 1          aa.R                   aa1.R 1.0000000
#> 2 bucketSort1.R  bucketSort1_addLines.R 1.0000000
#> 3    isPrime2.R     isPrime2_addLines.R 1.0000000
#> 4   kombinuj1.R   kombinuj1_variables.R 0.3000000
#> 5   kwantyle1.R   kwantyle1_variables.R 0.2000000
#> 6 bucketSort1.R bucketSort1_variables.R 0.1428571

Step 3: Get an overview of the results

The first step is to look at the calculated coefficients in a strip chart.

files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs  <- sourcecode(files, basename=TRUE, silent=TRUE)
simy  <- similarities(prgs, type="n", minlen=3)
stripchart(simy$jaccard, "jitter", pch=19, xlab="Jaccard")

In a second step you can plot the coefficients as a graph, where thicker edges correspond to higher similarity coefficients.

library("igraph")
#> 
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:stats':
#> 
#>     decompose, spectrum
#> The following object is masked from 'package:base':
#> 
#>     union
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs  <- sourcecode(files, basename=TRUE, silent=TRUE)
simy  <- similarities(prgs, type="n", minlen=3)
graph <- as_igraph(simy, diag=FALSE)
# color all edges with a large similarity coefficient in red
E(graph)$color <- ifelse(E(graph)$weight>0.4, "red", "grey")
plot(graph, edge.width=1+3*E(graph)$weight)
box()