User Guide

‘ggpmisc’ 0.3.0

Pedro J. Aphalo

2018-07-16

Preliminaries

We load all the packages used in the examples.

library(ggpmisc)
## Loading required package: ggplot2
## For news about 'ggpmisc', please, see https://www.r4photobiology.info/
## For on-line documentation see https://docs.r4photobiology.info/ggpmisc/
library(ggplot2)
library(ggrepel)
library(xts)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(tibble)
library(nlme)

As we will use text and labels on the plotting area we change the default theme to an uncluttered one.

old_theme <- theme_set(theme_bw())

Introduction

Many of the functions, including ggplot statistics and geoms, included in package ‘ggpmisc’ had their origin in my need to produce plots for use in teaching. Some of them are more generally useful, such as stat_poly_eq(), but others like stat_fit_deviations() are squarely aimed and producing learning material. Finally, several statistics for debugging and learning how ggplot statistics and geoms interact with each other will be of use only to developers of new statistics and geoms. Function try_tibble() opens the door to easily converting time series stored in objects of several different classes into data frames. This, for example, useful for plotting time series data with ggplot(). New ggplot() methods for classes ts and xts make the call to try_tibble() and conversion automatic.

ggplot methods for time series

ggplot() methods for classes "ts" and "xts" automate plotting of time series data. For plotting time series data stored in objects of other classes, see the conversion functions try_tibble() and try_data_frame() in the last section of this vignette.

ggplot(lynx) + geom_line()

ggplot(lynx, as.numeric = FALSE) + geom_line()

ggplot(AirPassengers) + geom_line()

ggplot(AirPassengers, as.numeric = FALSE) + geom_line()

stat_peaks and stat_valleys

Using POSIXct for time and the default formatting of labels.

ggplot(lynx, as.numeric = FALSE) + geom_line() + 
  stat_peaks(colour = "red") +
  stat_peaks(geom = "text", colour = "red", vjust = -0.5) +
  ylim(-100, 7300)

Using numeric values for time and the default formatting of labels.

ggplot(lynx) + geom_line() + 
  stat_peaks(colour = "red") +
  stat_peaks(geom = "text", colour = "red", vjust = -0.5) +
  ylim(-100, 7300)

Using POSIXct for time but supplying a format string. In addition marking both peaks and valleys.

ggplot(lynx, as.numeric = FALSE) + geom_line() + 
  stat_peaks(colour = "red") +
  stat_peaks(geom = "text", colour = "red", vjust = -0.5, x.label.fmt = "%Y") +
  stat_valleys(colour = "blue") +
  stat_valleys(geom = "text", colour = "blue", vjust = 1.5, x.label.fmt = "%Y") +
  ylim(-100, 7300)

Using numeric for time but supplying a format string. In addition marking both peaks and valleys.

ggplot(lynx) + geom_line() + 
  stat_peaks(colour = "red") +
  stat_peaks(geom = "text", colour = "red", vjust = -0.5, x.label.fmt = "%4.0f") +
  stat_valleys(colour = "blue") +
  stat_valleys(geom = "text", colour = "blue", vjust = 1.5, x.label.fmt = "%4.0f") +
  ylim(-100, 7300)

Rotating the labels.

ggplot(lynx, as.numeric = FALSE) + geom_line() + 
  stat_peaks(colour = "red") +
  stat_peaks(geom = "text", colour = "red", angle = 66,
             hjust = -0.1, x.label.fmt = "%Y") +
  ylim(NA, 8000)

Using geom_rug for the peaks and valleys.

ggplot(lynx, as.numeric = FALSE) + geom_line() + 
  stat_peaks(colour = "red") +
  stat_peaks(geom = "rug", colour = "red") +
  stat_valleys(colour = "blue") +
  stat_valleys(geom = "rug", colour = "blue")

stat_quadrant_counts

This statistic automates the annotation of plots with number of observations, either by quadrant, by pairs of quadrants or the four quadrants taken together.

We generate some artificial data.

set.seed(4321)
# generate artificial data
x <- -99:100
y <- x + rnorm(length(x), mean = 0, sd = abs(x))
my.data <- data.frame(x, 
                      y, 
                      group = c("A", "B"))

Using defaults except for colour.

ggplot(my.data, aes(x, y)) +
  geom_hline(yintercept = 0, colour = "red") +
  geom_vline(xintercept = 0, colour = "red") +
  geom_point() +
  stat_quadrant_counts(colour = "red")

Pooling quadrants along the x-axis.

ggplot(my.data, aes(x, y)) +
  geom_hline(yintercept = 0, colour = "red") +
  geom_point() +
  stat_quadrant_counts(colour = "red", pool.along = "x")

Manual positioning of the text annotations and pooling of all four quadrants.

ggplot(my.data, aes(x, y)) +
  geom_point() +
  stat_quadrant_counts(quadrants = 0L, labels.range.x = c(-90, -90), hjust = 0)

Annotation of only specific quadrants.

ggplot(my.data, aes(x, y)) +
  geom_hline(yintercept = 0, colour = "red") +
  geom_vline(xintercept = 0, colour = "red") +
  geom_point() +
  stat_quadrant_counts(colour = "red", quadrants = c(2, 4))

ggplot(my.data, aes(x, y)) +
  geom_hline(yintercept = 0, colour = "red") +
  geom_vline(xintercept = 0, colour = "red") +
  geom_point() +
  stat_quadrant_counts(colour = "red") +
  facet_wrap(~group)

stat_poly_eq

We generate some artificial data.

set.seed(4321)
# generate artificial data
x <- 1:100
y <- (x + x^2 + x^3) + rnorm(length(x), mean = 0, sd = mean(x^3) / 4)
my.data <- data.frame(x, 
                      y, 
                      group = c("A", "B"), 
                      y2 = y * c(0.5,2),
                      block = c("a", "a", "b", "b"),
                      wt = sqrt(x))

First one example using defaults.

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(formula = formula, parse = TRUE)

stat_poly_eq() makes available three different labels in the returned data frame. One of these is used by default, but aes() can be used to select a different one.

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = ..adj.rr.label..), formula = formula, 
               parse = TRUE)

BIC and AIC labels are also returned.

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = ..AIC.label..), 
               formula = formula, 
               parse = TRUE)

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = ..eq.label..), formula = formula, 
               parse = TRUE)

Within aes() it is possible to compute new labels based on those returned plus “arbitrary” text. The supplied labels are meant to be parsed into expressions, so any text added should be valid for a string that will be parsed.

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label =  paste(..eq.label.., ..adj.rr.label.., sep = "~~~~")),
               formula = formula, parse = TRUE)

Above we inserted spaces between the labels, but it is possible to add also other characters, even character strings. In this example we use several labels instead of just two, and separate them with comma followed by a space. Do note the need to escape the embedded quotation marks.

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label =  paste(..eq.label.., ..rr.label.., 
                                  ..AIC.label.., ..BIC.label..,
                                  sep = "*\",\"~~")),
               formula = formula, parse = TRUE)

It is also possible to insert arbitrary text, and format it, using plain(), italic(), bold() or bolditalic().

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label =  paste(..eq.label.., ..adj.rr.label.., 
                                  sep = "~~italic(\"with\")~~")),
               formula = formula, parse = TRUE)

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = paste("atop(", ..AIC.label.., ",", ..BIC.label.., ")", sep = "")), 
               formula = formula, 
               parse = TRUE)

Two examples of removing and modifying the default lhs and/or rhs of the equation.

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = ..eq.label..),
               eq.with.lhs = FALSE,
               formula = formula, parse = TRUE)

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = ..eq.label..),
               eq.with.lhs = "italic(hat(y))~`=`~",
               formula = formula, parse = TRUE)

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  labs(x = expression(italic(z)), y = expression(italic(h)) ) + 
  stat_poly_eq(aes(label = ..eq.label..),
               eq.with.lhs = "italic(h)~`=`~",
               eq.x.rhs = "~italic(z)",
               formula = formula, parse = TRUE)

As any valid R expression can be used, Greek letters are also supported, as well as the inclusion in the label of variable transformations used in the model formula.

formula <- y ~ poly(x, 2, raw = TRUE)
ggplot(my.data, aes(x, log10(y + 1e6))) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = ..eq.label..),
               eq.with.lhs = "plain(log)[10](italic(y)+10^6)~`=`~",
               formula = formula, parse = TRUE)

A couple of additional examples of polynomials of different orders, and specified in different ways.

Higher order polynomial.

formula <- y ~ poly(x, 5, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = ..eq.label..), formula = formula, parse = TRUE)

Intercept forced to zero.

formula <- y ~ x + I(x^2) + I(x^3) - 1
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = ..eq.label..), formula = formula, 
               parse = TRUE)

We give below several examples to demonstrate how other components of the ggplot object affect the behaviour of this statistic.

Facets work as expected either with fixed or free scales. Although bellow we had to adjust the size of the font used for the equation. In addition to we manually position the equation label by supplying coordinates.

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y2)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = ..eq.label..), size = 3,
               formula = formula, parse = TRUE) +
  facet_wrap(~group)

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y2)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = ..eq.label..), size = 3,
               formula = formula, parse = TRUE) +
  facet_wrap(~group, scales = "free_y")

Grouping, in this example using the colour aesthetic also works as expected. We can use justification and supply an absolute location for the equation.

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y2, colour = group)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = ..eq.label..),
               formula = formula, parse = TRUE)

Label positions relative to the ranges of the x and y scales are also supported, both through string constants and numeric values in the range 0 to 1.

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y2, colour = group)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = ..eq.label..),
               formula = formula, parse = TRUE, label.y.npc = "center")

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y2, colour = group)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = ..eq.label..),
               formula = formula, parse = TRUE, label.y.npc = 0.75)

The default locations are now based on normalized coordinates, and consequently these defaults work even when the range of the x and y scales varies from panel to panel.

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y2, fill = block)) +
  geom_point(shape = 21, size = 3) +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = ..rr.label..), size = 3,
               geom = "label", alpha = 0.33,
               formula = formula, parse = TRUE) +
  facet_wrap(~group, scales = "free_y")

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y2, colour = group, fill = block)) +
  geom_point(shape = 21, size = 3) +
  geom_smooth(method = "lm", formula = formula) +
  stat_poly_eq(aes(label = ..rr.label..), size = 3,
               geom = "label", alpha = 0.2,
               formula = formula, parse = TRUE,
               label.y.npc = 0.66) +
  facet_wrap(~group, scales = "free_y")

stat_fit_residuals

I had the need to quickly plot residuals matching fits plotted with geom_smooth() using grouping and facets, so a new (simple) statistic was born.

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  stat_fit_residuals(formula = formula)

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  stat_fit_residuals(formula = formula, resid.type = "working")

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y, color = group)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  stat_fit_residuals(formula = formula)

stat_fit_deviations

As I also had the need to highlight residuals in slides and notes to be used in teaching, another statistic was born.

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  geom_smooth(method = "lm", formula = formula) +
  stat_fit_deviations(formula = formula, color = "red") +
  geom_point()

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y, color = group)) +
  geom_smooth(method = "lm", formula = formula) +
  stat_fit_deviations(formula = formula) +
  geom_point()

formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  geom_smooth(method = "lm", formula = formula) +
  stat_fit_deviations(formula = formula, color = "red",
                      arrow = arrow(length = unit(0.015, "npc"), 
                                   ends = "both")) +
  geom_point()

stat_fit_glance

# formula <- y ~ poly(x, 3, raw = TRUE)
# broom::augment does not handle poly correctly!
formula <- y ~ x + I(x^2) + I(x^3)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_fit_glance(method = "lm", 
                  method.args = list(formula = formula),
                  geom = "text",
                  aes(label = signif(..p.value.., digits = 4)))

# formula <- y ~ poly(x, 3, raw = TRUE)
# broom::augment does not handle poly() correctly!
formula <- y ~ x + I(x^2) + I(x^3)
ggplot(my.data, aes(x, y, color = group)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_fit_glance(method = "lm", 
                  method.args = list(formula = formula),
                  geom = "text", 
                  aes(label = paste("P-value = ", signif(..p.value.., digits = 4), sep = "")))

# formula <- y ~ poly(x, 3, raw = TRUE)
# broom::augment does not handle poly correctly!
formula <- y ~ x + I(x^2) + I(x^3)
ggplot(my.data, aes(x, y, color = group)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_fit_glance(method = "lm", 
                  method.args = list(formula = formula),
                  label.x.npc = "right",
                  label.y.npc = "bottom",
                  geom = "text", 
                  aes(label = paste("P-value = ", signif(..p.value.., digits = 4), sep = "")))

It is also possible to fit a non-linear model with method = "nls", and any other model for which a glance() method exists. Do consult the documentation for package ‘broom’. Here we fit the Michaelis-Menten equation to reaction rate versus concentration data.

micmen.formula <- y ~ SSmicmen(x, Vm, K) 
ggplot(Puromycin, aes(conc, rate, colour = state)) +
  geom_point() +
  geom_smooth(method = "nls", 
              formula = micmen.formula,
              se = FALSE) +
  stat_fit_glance(method = "nls", 
                  method.args = list(formula = micmen.formula),
                  geom = "text",
                  aes(label = paste("AIC = ", signif(..AIC.., digits = 3), 
                                    ", BIC = ", signif(..BIC.., digits = 3),
                                    sep = "")))

stat_fit_tidy

This stat makes it possible to add the equation for any fitted model for which broom::tidy() is implemented. Alternatively, individual values such as estimates for the fitted parameters, standard errors, or P-values can be added to a plot. However, the user has to explicitly construct the labels within aes(). This statistic respects grouping based on aesthetics, and reshapes the output of tidy() so that the values for a given group are in a single row in the returned data.

micmen.formula <- y ~ SSmicmen(x, Vm, K) 
ggplot(Puromycin, aes(conc, rate, colour = state)) +
  geom_point() +
  geom_smooth(method = "nls", 
              formula = micmen.formula,
              se = FALSE) +
  stat_fit_tidy(method = "nls", 
                method.args = list(formula = micmen.formula),
                geom = "text",
                label.x.npc = "right",
                label.y.npc = "bottom",
                aes(label = paste("V[m]~`=`~", signif(..Vm_estimate.., digits = 3),
                                  "%+-%", signif(..Vm_se.., digits = 2),
                                  "~~~~K~`=`~", signif(..K_estimate.., digits = 3),
                                  "%+-%", signif(..K_se.., digits = 2),
                                  sep = "")),
                parse = TRUE)

micmen.formula <- y ~ SSmicmen(x, Vm, K) 
ggplot(Puromycin, aes(conc, rate, colour = state)) +
  geom_point() +
  geom_smooth(method = "nls", 
              formula = micmen.formula,
              se = FALSE) +
  stat_fit_tidy(method = "nls", 
                method.args = list(formula = micmen.formula),
                geom = "text",
                label.x.npc = 0.9,
                label.y.npc = 0.3,
                aes(label = paste("V~`=`~frac(", signif(..Vm_estimate.., digits = 2), "~C,",
                                  signif(..K_estimate.., digits = 2), "+C)",
                                  sep = "")),
                parse = TRUE) +
  labs(x = "C", y = "V")

stat_fit_tb

This stat makes it possible to add summary or ANOVA tables for any fitted model for which broom::tidy() is implemented. The output from tidy() is embedded as a single list value within the returned data, an object of class tibble. This statistic ignores grouping based on aesthetics. This allows fitting models when x or y is a factor (as in such cases ggplot splits the data into groups, one for each level of the factor, which is needed for example for stat_summary() to work as expected). By default, the "table" geometry is used. The use of geom_table() is described in a separate section of this User Guide.

The default output of stat_fit_tb is the default output from tidy(mf) where mf is the fitted model.

formula <- y ~ x + I(x^2) + I(x^3)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_fit_tb(method = "lm",
              method.args = list(formula = formula),
              tb.vars = c(Parameter = "term", 
                          Estimate = "estimate", 
                          "s.e." = "std.error", 
                          "italic(t)" = "statistic", 
                          "italic(P)" = "p.value"),
              label.y.npc = "top", label.x.npc = "left",
              parse = TRUE)

When tb.type = "fit.anova" the output returned is that from tidy(anova(mf)) where mf is the fitted model.

formula <- y ~ x + I(x^2) + I(x^3)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_fit_tb(method = "lm",
              method.args = list(formula = formula),
              tb.type = "fit.anova",
              tb.vars = c(Effect = "term", 
                          "df",
                          "M.S." = "meansq", 
                          "italic(F)" = "statistic", 
                          "italic(P)" = "p.value"),
               label.y.npc = "top", label.x.npc = "left",
              parse = TRUE)

When tb.type = "fit.coefs" the output returned is that of tidy(mf) after selecting the term and estimate columns.

formula <- y ~ x + I(x^2) + I(x^3)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  stat_fit_tb(method = "lm",
              method.args = list(formula = formula),
              tb.type = "fit.coefs",
              label.y.npc = "center", label.x.npc = "left")

Faceting works as expected, but grouping is ignored as mentioned above. In this case, the colour aesthetic is not applied to the text of the tables. Furthermore, if label.x.npc or label.y.npc are passed numeric vectors of length > 1, the values are obeyed by the different panels.

micmen.formula <- y ~ SSmicmen(x, Vm, K)
ggplot(Puromycin, aes(conc, rate, colour = state)) +
  facet_wrap(~state) +
  geom_point() +
  geom_smooth(method = "nls",
              formula = micmen.formula,
              se = FALSE) +
  stat_fit_tb(method = "nls",
              method.args = list(formula = micmen.formula),
              tb.type = "fit.coefs",
              label.x.npc = 0.67,
              label.y.npc = c(0.6, 0.3)) +
  theme(legend.position = "none") +
  labs(x = "C", y = "V")

The data in the example below are split by ggplot into six groups based on the levels of the feed factor. However, as stat_fit_tb() ignores groupings, we can still fit a linear model to all the data in the panel.

ggplot(chickwts, aes(factor(feed), weight)) +
  stat_summary(fun.data = "mean_se") +
  stat_fit_tb(tb.type = "fit.anova",
              label.x.npc = 1.07, hjust = 1,
              label.y.npc = "bottom") +
  expand_limits(y = 0)

We can flip the system of coordinates, if desired. Be aware that in this case the specification of the position and justification of the table is no longer intuitive.

ggplot(chickwts, aes(factor(feed), weight)) +
  stat_summary(fun.data = "mean_se") +
  stat_fit_tb(tb.type = "fit.anova",
              label.x.npc = 1.1, hjust = 0,
              label.y.npc = 0, vjust = 1) +
  expand_limits(y = 0) +
  coord_flip()

It is also possible to rotate the table using angle.

ggplot(chickwts, aes(factor(feed), weight)) +
  stat_summary(fun.data = "mean_se") +
  stat_fit_tb(tb.type = "fit.anova",
              angle = 90,
              tb.vars = c(Effect = "term", 
                          "df",
                          "M.S." = "meansq", 
                          "italic(F)" = "statistic", 
                          "italic(P)" = "p.value"),
              label.x.npc = 0.5, hjust = 0.5,
              label.y.npc = 0, vjust = 1,
              parse = TRUE) +
  expand_limits(y = 0) +
  coord_flip()

stat_fit_augment

Experimental! Use ggplot2::stat_smooth instead of stat_fit_augment if possible.

For a single panel and no grouping, there is little advantage in using this statistic compared to the examples in the documentation of package ‘broom’. With grouping and faceting stat_fit_augment may occasionally be more convenient than ggplot2::stat_smooth because of its flexibility.

# formula <- y ~ poly(x, 3, raw = TRUE)
# broom::augment does not handle poly correctly!
formula <- y ~ x + I(x^2) + I(x^3)
ggplot(my.data, aes(x, y)) +
  geom_point() +
  stat_fit_augment(method = "lm",
                   method.args = list(formula = formula))

formula <- y ~ x + I(x^2) + I(x^3)
ggplot(my.data, aes(x, y, color = group)) +
  geom_point() +
  stat_fit_augment(method = "lm", 
                   method.args = list(formula = formula))

We can override the variable returned as y to be any of the variables in the data frame returned by broom::augment while still preserving the original y values.

formula <- y ~ x + I(x^2) + I(x^3)
ggplot(my.data, aes(x, y)) +
  stat_fit_augment(method = "lm",
                   method.args = list(formula = formula),
                   geom = "point",
                   y.out = ".resid")

formula <- y ~ x + I(x^2) + I(x^3)
ggplot(my.data, aes(x, y, color = group)) +
  stat_fit_augment(method = "lm",
                   method.args = list(formula = formula),
                   geom = "point",
                   y.out = ".std.resid")

We can use any model fitting method for which augment is implemented.

args <- list(formula = y ~ k * e ^ x,
             start = list(k = 1, e = 2))
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  stat_fit_augment(method = "nls",
                   method.args = args)

args <- list(formula = y ~ k * e ^ x,
             start = list(k = 1, e = 2))
ggplot(mtcars, aes(wt, mpg)) +
  stat_fit_augment(method = "nls",
                   method.args = args,
                   geom = "point",
                   y.out = ".resid")

args <- list(model = y ~ SSlogis(x, Asym, xmid, scal),
             fixed = Asym + xmid + scal ~1,
             random = Asym ~1 | group,
             start = c(Asym = 200, xmid = 725, scal = 350))
ggplot(Orange, aes(age, circumference, color = Tree)) +
  geom_point() +
  stat_fit_augment(method = "nlme",
                   method.args = args,
                   augment.args = list(data = quote(data)))

stat_dens2d_labels and stat_dens2d_filter

These stats had their origin in an enhancement suggestion for ‘ggrepel’ from Hadley Wickham and discussion with Kamil Slowikowski (ggrepel’s author) and others. In fact the code is based on code Kamil gave during the discussion, but simplified and taking a few further ideas from ggplot::stat_dens2d.

Warning! Which observations are selected by the algorithm used, based on MASS:kde2d, depends strongly on the values of parameters h and n. You may need to alter the defaults by passing explicit arguments to these stats. Beware, though, that what are good values, may depend on individual data sets even if they include the same number of observations. For the selection of observations to work cleanly, the argument for n must create a dense grid. Much larger values of n than in the examples in the documentation of MASS::kde2d and ggplot2::stat_dens2d will be needed in most cases.

Some random data with random labels.

random_string <- function(len = 6) {
paste(sample(letters, len, replace = TRUE), collapse = "")
}

# Make random data.
set.seed(1001)
d <- tibble::tibble(
  x = rnorm(100),
  y = rnorm(100),
  group = rep(c("A", "B"), c(50, 50)),
  lab = replicate(100, { random_string() })
)

stat_dens2d_filter

The stat stat_dens2d_filter filters observations, in other words passes to the geom a subset of the data received as input. The default value for geom is "point".

Using defaults except for the color aesthetic. Highlight 1/10 of observations from lowest density areas of the plot panel.

Highlighting 1/4 of the observations by under-plotting with larger black points.

A different way of highlighting 1/4 of the observations, using over-plotting with a ‘hollow’ shape. We also shift one group with respect to the other.

Highlight 1/4 of observations from lowest density areas of the plot, with density considered separately for each individual group. In this case grouping is based on the color aesthetic.

Add text labels to 1/10 of the observations. The “text_repel” geom sees only these observations.

Add text labels to 1/2 of the observations.

stat_dens2d_labels

The stat stat_dens2d_labels replaces the values of the label (aesthetic) variable in data in the high density regions of the panel with the argument passed to label.fill before passing them to the geom. The default value for geom is "text". The default value of label.fill is "" which results in empty labels, while using NA as fill label results in observations being omitted. Using NA as label.fill is not too different from using stat_dens2d_filter as long as the geom used requires a label aesthetic.

Label 1/10 of observations from lowest density areas of the plot panels.

Add text labels to 45% of the observations.

When using geom "text" we can statically adjust the positioning of labels, but this is rarely enough even when keeping 1/4 of the labels.

Using the geoms from package ‘ggrepel’ avoids clashes among labels or on top of data points. This works with versions 0.6.0 and newer of ‘ggrepel’. One example with geom_text_repel follows.

With geom_label_repel one needs to use a smaller value for keep.fracton, or a smaller size, as labels use more space on the plot than the test alone.

Additional arguments can be used to change the angle and position of the text, but may give unexpected output when labels are long as the repulsion algorithm “sees” always a rectangular bounding box that is not rotated. With short labels or angles that are multiples of 90 degrees, there is no such problem.

Using NA as fill makes the observations with labels set to NA to be skipped completely, possibly leading to text overlapping the points corresponding to unlabelled observations around the boundary of the regions where labels are kept or discarded. We use here alpha so that the overlaps can be seen.

geom_table and stat_fmt_table

The geom_table() geometry plots a data frame or tibble, nested in a tibble passed as data argument, using easthetics x and y for positioning, and label for the data frame containing the data for the table. The table is created as a ‘grid’ grob and added as usual to the ggplot object.

tb <- tibble(date = ymd(c("2017-05-21", "2007-05-21"), tz = "UCT"),
             value = 1:2 * 10)
data.tb <- tibble(x = 0, y = 8e5, tb = list(tb))
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_table(data = data.tb, aes(label = tb), hjust = 0, vjust = 0) +
  theme_bw()

Using stat_fmt_tb() we can rename the columns of the tibble as shown below, and/or select a subset of columns or a slice of rows.

tb <- tibble(date = ymd(c("2017-05-21", "2007-05-21"), tz = "UCT"),
             value = 1:2 * 10)
data.tb <- tibble(x = 0, y = 8e5, tb = list(tb))
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_table(data = data.tb, aes(label = tb), hjust = 0, vjust = 0,
             stat = "fmt_tb", tb.vars = c(Date = "date", Amount = "value")) +
  theme_bw()

tb <- tibble(date = ymd(c("2017-05-21", "2007-05-21"), tz = "UCT"),
             value = 1:2 * 10)
data.tb <- tibble(x = 25, y = 8e5, tb = list(tb))
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_table(data = data.tb, aes(label = tb), colour = "blue", size = 4) +
  theme_bw()

Parsed text, using plot math syntax is supported in the table, with fall-back to plain text in case of parsing errors, on a cell by cell basis.

tb.pm <- tibble(parameter = c("frac(beta[1], a^2)", "frac(beta[2], a^3)"),
             value = c("10^2.4", "10^3.532"))
data.tb <- tibble(x = 12.5, y = 8.5e5, tb = list(tb.pm))
ggplot(my.data, aes(x, y)) +
  geom_point() +
  geom_table(data = data.tb, aes(label = tb), parse = TRUE) +
  theme_bw()

try_tibble

Time series

Several different formats for storing time series data are used in R. Here we use in the examples objects of class ts but several other classes are supported as try.xts() is used internally. The first example is a quarterly series.

## [1] "ts"
## [1] "tbl_df"     "tbl"        "data.frame"
## $time
## [1] "POSIXct" "POSIXt" 
## 
## $x
## [1] "numeric"
## # A tibble: 4 x 2
##   time                     x
##   <dttm>               <dbl>
## 1 1971-04-01 00:00:00 13067.
## 2 1971-07-01 00:00:00 13130.
## 3 1971-10-01 00:00:00 13198.
## 4 1972-01-01 00:00:00 13254.

The next chunk demonstrates that numeric times are expressed as decimal years in the returned data frame.

## $time
## [1] "numeric"
## 
## $x
## [1] "numeric"
## # A tibble: 4 x 2
##    time      x
##   <dbl>  <dbl>
## 1 1971. 13067.
## 2 1972. 13130.
## 3 1972. 13198.
## 4 1972  13254.

This second example is for a series of yearly values.

## [1] "ts"
## [1] "tbl_df"     "tbl"        "data.frame"
## $time
## [1] "POSIXct" "POSIXt" 
## 
## $x
## [1] "numeric"
## # A tibble: 3 x 2
##   time                    x
##   <dttm>              <dbl>
## 1 1820-02-01 00:00:00   269
## 2 1821-02-01 00:00:00   321
## 3 1822-02-01 00:00:00   585

Above there is a small rounding error of 1 s for these old dates. We can correct this by rounding to year.

## # A tibble: 3 x 2
##   time                    x
##   <dttm>              <dbl>
## 1 1821-01-01 00:00:00   269
## 2 1822-01-01 00:00:00   321
## 3 1823-01-01 00:00:00   585

In addition we can convert the POSIXct values into numeric values in calendar years plus a decimal fraction.

## $time
## [1] "numeric"
## 
## $x
## [1] "numeric"
## # A tibble: 3 x 2
##    time     x
##   <dbl> <dbl>
## 1  1821   269
## 2  1822   321
## 3  1823   585

Other classes

try_tibble() attempts to handle gracefully objects that are not time series.

## # A tibble: 5 x 1
##       x
##   <int>
## 1     1
## 2     2
## 3     3
## 4     4
## 5     5
## # A tibble: 5 x 1
##   x    
##   <fct>
## 1 a    
## 2 b    
## 3 c    
## 4 d    
## 5 e
## # A tibble: 5 x 1
##   x    
##   <fct>
## 1 a    
## 2 b    
## 3 c    
## 4 d    
## 5 e
## # A tibble: 5 x 2
##       x     y
##   <dbl> <int>
## 1     1     1
## 2     1     2
## 3     1     3
## 4     1     4
## 5     1     5
## # A tibble: 5 x 2
##       x     y
##   <dbl> <int>
## 1     1     1
## 2     1     2
## 3     1     3
## 4     1     4
## 5     1     5
## # A tibble: 5 x 2
##      V1    V2
##   <int> <int>
## 1     1     6
## 2     2     7
## 3     3     8
## 4     4     9
## 5     5    10

Appendix: Additional examples of the use density filtering

We define a function to simplify the generation of random data sets.

make_data_tbl <- function(nrow = 100, rfun = rnorm, ...) {
  if (nrow %% 2) {
    nrow <- nrow + 1
  }
  
  set.seed(1001)
  
  tibble::tibble(
    x = rfun(nrow, ...),
    y = rfun(nrow, ...),
    group = rep(c("A", "B"), c(nrow / 2, nrow / 2))
  )
}

As we will draw many points on the plotting area we change the default theme to an uncluttered one.

old_theme <- theme_set(theme_bw())

In all the examples in this vignette we use colours to demonstrate which data points are selected, but any other suitable aesthetic and discrete scale can be used instead. With keep.sparse = FALSE we keep 1/3 of the observations in the denser region of the plot. Although here we first plot all data points and later overplot the selected ones this is not necessary.

ggplot(data = make_data_tbl(300), aes(x, y)) +
  geom_point() +
  stat_dens2d_filter(color = "red", 
                     keep.sparse = FALSE, 
                     keep.fraction = 1/3)

Here we highlight the observations split in three group equal groups, each of a different density of observations.

ggplot(data = make_data_tbl(300), aes(x, y)) +
  geom_point() +
  stat_dens2d_filter(color = "red", 
                     keep.sparse = FALSE, 
                     keep.fraction = 1/3)+
  stat_dens2d_filter(color = "blue", 
                     keep.fraction = 1/3)

The algorithm seems to work well also with other distributions, in this example the uniform distribution.

ggplot(data = make_data_tbl(300, rfun = runif), aes(x, y)) +
  geom_point() +
  stat_dens2d_filter(color = "red", keep.fraction = 1/2)

One example with the gamma distribution, which is asymmetric.

ggplot(data = make_data_tbl(300, rfun = rgamma, shape = 2), 
       aes(x, y)) +
  geom_point() +
  stat_dens2d_filter(color = "red", keep.fraction = 1/3)