diffobj - Diffs for R Objects

Brodie Gaslam

Introduction

diffobj uses the same comparison mechanism used by git diff and diff to highlight differences between rendered R objects:

library(diffobj)
a <- b <- matrix(1:100, ncol=2)
a <- a[-20,]
b <- b[-45,]
b[c(18, 44)] <- 999
diffPrint(target=a, current=b)

< a                       > b
@@ 17,6 @@                @@ 17,7 @@
~       [,1] [,2]         ~       [,1] [,2]
  [16,]   16   66           [16,]   16   66
  [17,]   17   67           [17,]   17   67
< [18,]   18   68         > [18,]  999   68
  [19,]   19   69           [19,]   19   69
~                         > [20,]   20   70
  [20,]   21   71           [21,]   21   71
  [21,]   22   72           [22,]   22   72
@@ 42,6 @@                @@ 43,5 @@
  [41,]   42   92           [42,]   42   92
  [42,]   43   93           [43,]   43   93
< [43,]   44   94         > [44,]  999   94
< [44,]   45   95         ~
  [45,]   46   96           [45,]   46   96
  [46,]   47   97           [46,]   47   97

diffobj comparisons work best when objects have some similarities, or when they are relatively small. The package was originally developed to help diagnose failed unit tests by comparing test results to reference objects in a human-friendly manner.

If your terminal supports formatting through ANSI escape sequences, diffobj will output colored diffs to the terminal. If not, it will output colored diffs to your IDE viewport if it is supported, or to your browser otherwise.

Interpreting Diffs

Shortest Edit Script

The output from diffobj is a visual representation of the Shortest Edit Script (SES). An SES is the shortest set of deletion and insertion instructions for converting one sequence of elements into another. In our case, the elements are lines of text. We encode the instructions to convert a to b by deleting lines from a (in yellow) and inserting new ones from b (in blue).
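To make the idea of an edit script concrete, here is a minimal sketch that uses the package's ses function (covered under Minimal Diff) on two short character vectors. The variable names target, current, and script are illustrative only.

```r
library(diffobj)

# Two short sequences; the second drops "a" and "d" from the first
target  <- letters[1:5]          # "a" "b" "c" "d" "e"
current <- letters[c(2:3, 5)]    # "b" "c" "e"

# ses() returns the edit script as a character vector of commands in the
# style of GNU diff, one command per contiguous edit
script <- ses(target, current)
print(script)
```

Each command names the affected lines in the first input, the operation (a for add, d for delete, c for change), and the corresponding position in the second input.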

Diff Structure

The first line of our diff output acts as a legend: it associates the colors and symbols used to represent the differences present in each object with the name of that object:

< a                       > b

After the legend come the hunks, which are portions of the objects that have differences with nearby matching lines provided for context:

@@ 17,6 @@                @@ 17,7 @@
~       [,1] [,2]         ~       [,1] [,2]
  [16,]   16   66           [16,]   16   66
  [17,]   17   67           [17,]   17   67
< [18,]   18   68         > [18,]  999   68
  [19,]   19   69           [19,]   19   69
~                         > [20,]   20   70
  [20,]   21   71           [21,]   21   71
  [21,]   22   72           [22,]   22   72

At the top of the hunk is the hunk header: it tells us that the first displayed hunk, including context lines, starts at line 17 and spans 6 lines for a and 7 for b. These are display lines, not object row indices, which is why the first matrix row shown is row 16. You might also have noticed that the line after the hunk header is out of place:

~       [,1] [,2]         ~       [,1] [,2]

This is a special context line that is not technically part of the hunk, but is shown nonetheless because it is useful in helping understand the data. The line is styled differently to highlight that it is not part of the hunk. Since it is not part of the hunk, it is not accounted for in the hunk header. See ?guides for more details.

The actual mismatched lines are highlighted in the colors of the legend, with additional visual cues in the gutters:

< [18,]   18   68         > [18,]  999   68
  [19,]   19   69           [19,]   19   69
~                         > [20,]   20   70
  [20,]   21   71           [21,]   21   71

diffobj uses a line by line diff to identify which portions of each of the objects are mismatches, so even if only part of a line mismatches it will be considered different. diffobj then runs a word diff within the hunks and further highlights mismatching words.

Let’s examine the last two lines from the previous hunk more closely:

~                         > [20,]   20   70
  [20,]   21   71           [21,]   21   71

Here b has an extra line so diffobj adds an empty line to a to maintain the alignment for subsequent matching lines. This additional line is marked with a tilde in the gutter and is shown in a different color to indicate it is not part of the original text.

If you look closely at the next matching line you will notice that the a and b values are not exactly the same. The row indices are different, but diffobj excludes row indices from the diff so that rows that are identical otherwise are shown as matching. diffobj indicates this is happening by showing the portions of a line that are ignored in the diff in grey.

See ?guides and ?trim for details and limitations on guide line detection and on the trimming of non-semantic metadata.
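As a rough illustration of the two mechanisms described above, the sketch below reruns the matrix comparison with both turned off through the guides and trim arguments of diffPrint. It recreates a and b as in the introduction; d.default and d.plain are illustrative names.

```r
library(diffobj)

a <- b <- matrix(1:100, ncol=2)
a <- a[-20,]
b <- b[-45,]
b[c(18, 44)] <- 999

# Default behavior: the column header is shown as a guide line and the
# row indices are excluded from the comparison
d.default <- diffPrint(a, b)

# Plain line diff: no guide lines, and row indices take part in the
# comparison, so rows whose index shifted are reported as different
d.plain <- diffPrint(a, b, guides=FALSE, trim=FALSE)
d.plain
```

Comparing the two renderings side by side is a quick way to see what guide detection and trimming contribute for a given object.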

Atomic Vectors

Since R can display multiple elements in an atomic vector on the same line, and diffPrint is fundamentally a line diff, we use specialized logic when diffing atomic vectors. Consider:

state.abb2 <- state.abb[-16]
state.abb2[37] <- "Pennsylvania"
diffPrint(state.abb, state.abb2)

@@ 1,5 @@                               @@ 6,5 @@
   [1] "AL" "AK" "AZ" "AR" "CA" "CO"      [11] "HI"           "ID"
   [7] "CT" "DE" "FL" "GA" "HI" "ID"      [13] "IL"           "IN"
< [13] "IL" "IN" "IA" "KS" "KY" "LA"    > [15] "IA"           "KY"
  [19] "ME" "MD" "MA" "MI" "MN" "MS"      [17] "LA"           "ME"
  [25] "MO" "MT" "NE" "NV" "NH" "NJ"      [19] "MD"           "MA"
@@ 6,4 @@                               @@ 17,5 @@
~                                         [33] "ND"           "OH"
  [31] "NM" "NY" "NC" "ND" "OH" "OK"      [35] "OK"           "OR"
< [37] "OR" "PA" "RI" "SC" "SD" "TN"    > [37] "Pennsylvania" "RI"
  [43] "TX" "UT" "VT" "VA" "WA" "WV"      [39] "SC"           "SD"
  [49] "WI" "WY"                          [41] "TN"           "TX"

Because the two vectors wrap at different frequencies, no line in the text display of one matches a line in the other. Despite this, diffPrint only highlights the lines that actually contain differences. The side effect is that lines that only contain matching elements are shown as matching even though the actual lines may be different. You can turn off this behavior in favor of a normal line diff with the unwrap.atomic argument to diffPrint.

Currently this only works for unnamed vectors, and even for them some inputs may produce sub-optimal results. Nested vectors inside lists will not be unwrapped. You can also use diffChr (see below) to do a direct element by element comparison.
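The two alternatives mentioned above can be tried as follows. This is a sketch using the same state.abb example; d.lines and d.elems are illustrative names.

```r
library(diffobj)

state.abb2 <- state.abb[-16]
state.abb2[37] <- "Pennsylvania"

# Ordinary line diff on the printed vectors: the special handling of
# atomic vectors is switched off, so differently wrapped lines mismatch
d.lines <- diffPrint(state.abb, state.abb2, unwrap.atomic=FALSE)
d.lines

# Direct element by element comparison, one display line per element
d.elems <- diffChr(state.abb, state.abb2)
d.elems
```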

Other Diff Functions

Method Overview

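diffobj defines several S4 generics and default methods to go along with them. Each of them uses a different text representation of the inputs:

- diffPrint: the print/show output, as shown in the introduction
- diffStr: the output of str
- diffObj: picks between print/show and str, whichever gives the better overview of the differences
- diffChr: the inputs coerced to character vectors with as.character
- diffDeparse: the deparsed inputs
- diffFile: the text contents of two files
- diffCsv: two CSV files loaded into data frames and compared as with diffPrint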

Note the diff* functions use lowerCamelCase in keeping with S4 method name convention, whereas the package name itself is all lower case.

Compare Structure with diffStr

For complex objects it is often useful to compare structures:

mdl1 <- lm(Sepal.Length ~ Sepal.Width, iris)
mdl2 <- lm(Sepal.Length ~ Sepal.Width + Species, iris)
diffStr(mdl1$qr, mdl2$qr, line.limit=15)

@@ 1,9 @@                                                               @@ 1,10 @@
  List of 5                                                               List of 5
<  $ qr   : num [1:150, 1:2] -12.2474 0.0816 0.0816 0.0816 0.0816 ...   >  $ qr   : num [1:150, 1:4] -12.2474 0.0816 0.0816 0.0816 0.0816 ...
    ..- attr(*, "dimnames")=List of 2                                       ..- attr(*, "dimnames")=List of 2
<   ..- attr(*, "assign")= int [1:2] 0 1                                >   ..- attr(*, "assign")= int [1:4] 0 1 2 2
~                                                                       >   ..- attr(*, "contrasts")=List of 1
<  $ qraux: num [1:2] 1.08 1.02                                         >  $ qraux: num [1:4] 1.08 1.02 1.05 1.11
<  $ pivot: int [1:2] 1 2                                               >  $ pivot: int [1:4] 1 2 3 4
   $ tol  : num 1e-07                                                      $ tol  : num 1e-07
<  $ rank : int 2                                                       >  $ rank : int 4
   - attr(*, "class")= chr "qr"                                            - attr(*, "class")= chr "qr"
3 differences are hidden by our use of `max.level`

If you specify a line.limit with diffStr it will fold nested levels in order to fit under line.limit so long as there remain visible differences. If you prefer to see all the differences you can leave line.limit unspecified.

Compare Vector Elements with diffChr

Sometimes it is useful to do a direct element by element comparison:

diffChr(letters[1:3], c("a", "B", "c"))

@@ 1,3 @@     @@ 1,3 @@
  a             a
< b           > B
  c             c

Notice how we are comparing the contents of the vectors with one line per element.

Why S4?

The diff* functions are defined as S4 generics with default methods (signature c("ANY", "ANY")) so that users can customize behavior for their own objects. For example, a custom method could set many of the default parameters to values more suitable for a particular object. If the objects in question are S3 objects the S3 class will have to be registered with setOldClass.

Return Value

All the diff* methods return a Diff S4 object. It has a show method which is responsible for rendering the Diff and displaying it to the screen. Because of this you can compute and render diffs in two steps:

x <- diffPrint(letters, LETTERS)
x   # or equivalently: `show(x)`

Be aware that the diff may render oddly if you change the screen width or similar settings between the two steps.

There are also summary, any, and as.character methods. The summary method provides a high level overview of where the differences are, which can be helpful for large diffs:

summary(diffStr(mdl1, mdl2))
Found differences in 12 hunks:
45 insertions, 39 deletions, 18 matches (lines)

Diff map (line:char scale is 1:1 for single chars, 1-2:1 for char seqs):
DDDIII.DDDIII.DI.DI..DDDIIIII.DI.DDDDDIIIIIII.DDDIII..DDDIII..DDIII.DDDIII..DDII.

any returns TRUE if there are differences, and as.character returns the character representation of the diff.
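A short sketch of these methods in use, assuming nothing beyond the methods named above (x, y, and txt are illustrative names):

```r
library(diffobj)

x <- diffChr(letters[1:3], c("a", "B", "c"))   # one mismatching element
y <- diffChr(letters[1:3], letters[1:3])       # identical inputs

any(x)    # TRUE
any(y)    # FALSE

# The rendered diff as a character vector, e.g. for writing to a log file
txt <- as.character(x)
is.character(txt)

# High level overview of where the differences are
summary(x)
```

Because any returns a plain logical, it is convenient for branching in scripts or tests, with the rendered diff shown only when differences exist.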

Controlling Diffs and Their Appearance

Parameters

The diff* family of methods has an extensive set of parameters that allow you to fine tune how the diff is applied and displayed. We will review some of the major ones in this section. For a full description see ?diffPrint.

While the parameter list is extensive, only the objects being compared are required. All the other parameters have default values, and most of them are for advanced use only. The defaults can all be adjusted via the diffobj.* options.

Display Mode

There are three built-in display modes that are similar to those found in GNU diff: “sidebyside”, “unified”, and “context”. For example, by varying the mode parameter with:

x <- y <- letters[24:26]
y[2] <- "GREMLINS"
diffChr(x, y)

we get:

mode="sidebyside":

@@ 1,3 @@       @@ 1,3 @@
  x               x
< y             > GREMLINS
  z               z

mode="unified":

@@ 1,3 / 1,3 @@
  x
< y
> GREMLINS
  z

mode="context":

@@ 1,3 / 1,3 @@
  x
< y
  z
~ ----------
  x
> GREMLINS
  z

By default diffobj will try to use mode="sidebyside" if reasonable given display width, and otherwise will switch to mode="unified". You can always force a particular display style by specifying it with the mode argument.
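For instance, the three layouts can be requested explicitly as in the sketch below (d.side, d.unified, and d.context are illustrative names):

```r
library(diffobj)

x <- y <- letters[24:26]
y[2] <- "GREMLINS"

# Force each of the three built-in display modes in turn
d.side    <- diffChr(x, y, mode="sidebyside")
d.unified <- diffChr(x, y, mode="unified")
d.context <- diffChr(x, y, mode="context")

d.side
d.unified
d.context
```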

Color Mode

The default color mode uses yellow and blue to symbolize deletions and insertions for accessibility to dichromats. If you prefer the more traditional color mode you can specify color.mode="rgb" in the parameter list, or use options(diffobj.color.mode="rgb"):

diffChr(x, y, color.mode="rgb")

@@ 1,3 @@       @@ 1,3 @@
  x               x
< y             > GREMLINS
  z               z

Output Formats

If your terminal supports it, diffobj will format the output with ANSI escape sequences. diffobj uses Gábor Csárdi’s crayon package to detect ANSI support and to apply ANSI-based formatting. If you are using RStudio or another IDE that supports getOption("viewer"), diffobj will output an HTML/CSS formatted diff to the viewport. In other terminals that do not support ANSI colors, diffobj will attempt to output an HTML/CSS formatted diff to your browser using browseURL.

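You can explicitly specify the output format with the format parameter:

- format="raw" for unformatted diffs
- format="ansi8" for standard ANSI 8 color formatting
- format="ansi256" for ANSI 256 color formatting
- format="html" for HTML/CSS output and styling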

See Pagers for more details.
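As a small sketch of choosing a format explicitly, the example below requests unformatted output, which can be handy when the diff is written to a log or a plain text file. The names d.raw and out are illustrative.

```r
library(diffobj)

x <- y <- letters[24:26]
y[2] <- "GREMLINS"

# Unformatted diff: gutters and hunk headers only, no colors or markup
d.raw <- diffChr(x, y, format="raw")
d.raw

# The same diff as a character vector
out <- as.character(d.raw)
cat(out, sep="\n")
```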

Brightness

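The brightness parameter allows you to pick a color scheme compatible with the background color of your terminal. The options are:

- “light”: for use with light-toned terminals
- “dark”: for use with dark-toned terminals
- “neutral”: for use with either light or dark terminals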

Here are examples of terminal screen renderings for both “rgb” and “yb” color.mode for the three brightness levels.

The examples for “light” and “dark” have the backgrounds forcefully set to a color compatible with the scheme. In actual use the base background and foreground colors are left unchanged, which will look bad if you use “dark” with light colored backgrounds or vice versa. Since we do not know of a good cross platform way of detecting terminal background color the default brightness value is “neutral”.

At this time the only format that is affected by this parameter is “ansi256”. If you want to specify your own light/dark/neutral schemes you may do so either by specifying a style directly or with Palette of Styles.
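A sketch of setting brightness explicitly follows. It forces format="ansi256" because, as noted above, that is the only format the parameter affects; the result only looks right in a terminal that renders ANSI 256 colors. The names d.dark and d.light are illustrative.

```r
library(diffobj)

x <- y <- letters[24:26]
y[2] <- "GREMLINS"

# Scheme intended for terminals with a dark background
d.dark <- diffChr(x, y, format="ansi256", brightness="dark")
d.dark

# Scheme intended for light backgrounds, with the traditional colors
d.light <- diffChr(
  x, y, format="ansi256", brightness="light", color.mode="rgb"
)
d.light
```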

Pagers

In interactive mode, if the diff output is very long or if your terminal does not support ANSI colors, diff* methods will pipe output to a pager. This is done by writing the output to a temporary file and passing the file reference to the pager. The default pager is chosen as follows: if your terminal supports ANSI colors and the pager is known to support them as well (as of this writing, only less is assumed to), the pager is invoked with file.show; otherwise getOption("viewer") is used if available (this outputs to the viewport in RStudio); failing that, browseURL is used.

You can fine tune when, how, and if a pager is used with the pager parameter. See ?diffPrint and ?Pager for more details.
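For example, paging can be switched off for a single call so that even a long diff is written straight to the console. This sketch uses the "off" value of the pager parameter described in ?diffPrint; d.nopage is an illustrative name.

```r
library(diffobj)

# Every one of the 26 elements differs, so the diff is fairly long;
# pager="off" keeps the output in the console regardless
d.nopage <- diffPrint(letters, LETTERS, pager="off")
d.nopage
```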

Styles

You can control almost all aspects of the diff output formatting via the style parameter. To do so, pass an appropriately configured Style object. See ?Style for more details on how to do this.

The default is to auto pick a style based on the values of the format, color.mode, and brightness parameters. This is done by using the computed values for each of those parameters to subset the PaletteOfStyles object passed as the palette.of.styles parameter. This PaletteOfStyles object contains a Style object for every possible combination of the format, brightness, and color.mode parameters. See ?PaletteOfStyles.

If you specify the style parameter the values of the format, brightness, and color.mode parameters will be ignored.

Diff Algorithm

The primary diff algorithm is Myers’ solution to the shortest edit script / longest common subsequence problem, with the Hirschberg linear space refinement, as described in:

E. Myers, “An O(ND) Difference Algorithm and Its Variations”, Algorithmica 1, 2 (1986), 251-266.

and should be the same algorithm used by GNU diff. The implementation used here is a heavily modified version of Michael B. Allen’s diff program from the libmba C library. Any and all bugs in the C code in this package were most likely introduced by yours truly. Please note that the resulting C code is incompatible with the original libmba library.

Performance Considerations

Diff

The diff algorithm scales with the square of the number of differences. For reasonably small diffs (< 10K differences), the diff itself is unlikely to be the bottleneck.

Capture and Processing

Capture of inputs for diffPrint and diffStr, and processing of output for all diff* methods will account for most of the execution time unless you have large numbers of differences. This input and output processing scales mostly linearly with the input size.

You can improve performance somewhat by using diffChr since that skips the capture part, and by turning off word.diff:

v1 <- 1:5e4
v2 <- v1[-sample(v1, 100)]
diffChr(v1, v2, word.diff=FALSE)

will be ~2x as fast as:

diffPrint(v1, v2)

Note: turning off word.diff when using diffPrint with unnamed atomic vectors can actually slow down the diff because there may well be fewer element by element differences than line differences as displayed. For example, when comparing 1:1e6 to 2:1e6 there is only one element difference, but every line as displayed is different because of the shift. Using word.diff=TRUE (and unwrap.atomic=TRUE) allows diffPrint to compare element by element rather than line by line. diffChr always compares element by element.

Minimal Diff

If you are looking for the fastest possible diff you can use ses and completely bypass most input and output processing. Inputs will be coerced to character if they are not character.

ses(v1, v2)

This will be 10-20x faster than diffChr, at the cost of less useful output.
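The relative costs discussed in this section can be checked on your own machine with a rough timing sketch such as the one below. It times only the construction of the diff, not its rendering by show, and the ratios will vary with hardware and inputs, so no particular result is implied. The t.* names and the seed are illustrative.

```r
library(diffobj)

set.seed(42)                      # arbitrary seed for reproducibility
v1 <- 1:5e4
v2 <- v1[-sample(v1, 100)]

# Elapsed seconds for each approach, from most to least processing
t.print <- system.time(diffPrint(v1, v2))[["elapsed"]]
t.chr   <- system.time(diffChr(v1, v2, word.diff=FALSE))[["elapsed"]]
t.ses   <- system.time(ses(v1, v2))[["elapsed"]]

round(c(diffPrint=t.print, diffChr=t.chr, ses=t.ses), 3)
```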

Acknowledgements