supportR Vignette

Nicholas J Lyon

Overview

The supportR package is an amalgam of distinct functions I’ve written to accomplish small data wrangling, quality control, or visualization tasks. These functions tend to be short and narrowly defined. Because each was written to solve a specific problem, they are not necessarily inter-related or united by a common theme. If this vignette feels somewhat scattered as a result, I hope that doesn’t detract from how informative it is!

This vignette describes the main functions of supportR using the examples included in each function.

library(supportR)

Summarizing Data

To demonstrate the summarizing function(s), we’ll use some example data from Dr. Allison Horst’s palmerpenguins R package.

# Load library
library(palmerpenguins)

# Glimpse the penguins dataset
dplyr::glimpse(penguins)
#> Rows: 344
#> Columns: 8
#> $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
#> $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
#> $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex               <fct> male, female, female, NA, female, male, female, male…
#> $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

With that data loaded, we can use the summary_table function to quickly get group-wise summaries and retrieve generally useful summary statistics.

# Summarize the data
summary_table(data = penguins, groups = c("species", "island"),
              response = "bill_length_mm", drop_na = TRUE)
#>     species    island  mean std_dev sample_size std_error
#> 1    Adelie    Biscoe 38.98    2.48          44      0.37
#> 2    Adelie     Dream 38.50    2.47          56      0.33
#> 3    Adelie Torgersen 38.95    3.03          52      0.42
#> 4 Chinstrap     Dream 48.83    3.34          68      0.41
#> 5    Gentoo    Biscoe 47.50    3.08         124      0.28

The groups argument accepts a vector of all of the column names to group by, while response must be a single numeric column. The drop_na argument allows group combinations that result in an NA to be dropped automatically (e.g., a penguin without a listed island would be dropped). The mean, standard deviation (SD), sample size, and standard error (SE) are all returned to facilitate easy figure creation. There is also a round_digits argument that lets you specify how many digits to retain for the mean, SD, and SE, as shown below.
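For instance, here is a sketch of the same summary retaining just one digit (the round_digits value is purely illustrative):

# Summarize again, keeping one digit for the mean, SD, and SE
summary_table(data = penguins, groups = c("species", "island"),
              response = "bill_length_mm", drop_na = TRUE,
              round_digits = 1)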

Quality Control

I write a lot of lengthy, tidyverse-style pipes (%>%) that create and remove potentially many columns without creating intermediary objects. This has the potential to accidentally drop desired columns without my noticing (especially when using dplyr::select to implicitly exclude columns). I created the diff_check function in part to quickly confirm that this isn’t happening.

Generally speaking, diff_check just compares two vectors and reports back what is in the first but not the second (i.e., what is “lost”) and what is in the second but not the first (i.e., what is “gained”).

# Make two vectors
vec1 <- c("x", "a", "b")
vec2 <- c("y", "z", "a")

# Compare them!
diff_check(old = vec1, new = vec2)
#> Following element(s) found in old object but not new:
#> [1] "b" "x"
#> Following element(s) found in new object but not old:
#> [1] "y" "z"

diff_check also includes optional logical arguments sort and return that will, respectively, sort the differences in both vectors and return them as a two-element list if set to TRUE. As I said above, I most often feed diff_check the vector of column names returned by names before and after a pipe to ensure that I know which columns were lost or gained by a set of operations.
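Here is a minimal sketch of that workflow (the columns dropped by the pipe are arbitrary):

# Record the column names before a pipe
name_check <- names(penguins)

# Run a pipe that (implicitly) excludes some columns
penguins_v2 <- dplyr::select(penguins, -dplyr::starts_with("bill"))

# Confirm exactly which columns were lost / gained
diff_check(old = name_check, new = names(penguins_v2))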

This package also includes the function num_check, which identifies all values of a column that would be coerced to NA if as.numeric were run on that column.

# Make a dataframe with non-numbers in a number column
fish <- data.frame('species' = c('salmon', 'bass', 'halibut', 'eel'),
                   'count' = c(1, '14x', '_23', 12))

# Use `num_check` to identify non-numbers
num_check(data = fish, col = "count")
#> [1] "14x" "_23"

Once these non-numbers are identified, you can handle them in whatever way you feel is most appropriate. num_check is intended only to flag these values for your attention, not to attempt a fix using a method you may or may not support.
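For example, one (purely hypothetical) fix would be to strip all non-numeric characters before coercing:

# Remove any character that is not a digit or decimal point, then coerce
fish$count_fixed <- as.numeric(gsub(pattern = "[^0-9.]", replacement = "",
                                    x = fish$count))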

date_check, on the other hand, does the same thing but checks a column for entries that would be coerced to NA by as.Date instead. Note that if a date is badly formatted enough, as.Date will throw an error rather than coercing to NA, so date_check will do the same.

# Make a dataframe including malformed dates
sites <- data.frame('site' = c("LTR", "GIL", "PYN", "RIN"),
                    'visit' = c('2021-01-01', '2021-01-0w', '1990', '2020-10-xx'))

# Now we can use our function to identify bad dates
date_check(data = sites, col = 'visit')
#> [1] "2021-01-0w" "1990"       "2020-10-xx"

Both num_check and date_check have an expanded version (multi_num_check and multi_date_check, respectively) that uses a col_vec argument instead of col. col_vec accepts a vector of column names and checks all of those columns for non-numbers or malformed dates.
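As a minimal sketch (this example dataframe and its column names are invented for illustration):

# Make a dataframe with two date columns that contain some bad entries
sites2 <- data.frame('site' = c("LTR", "GIL"),
                     'first_visit' = c('2021-01-01', '2021-01-0w'),
                     'last_visit' = c('2021-02-14', 'never'))

# Check both columns at once
multi_date_check(data = sites2, col_vec = c("first_visit", "last_visit"))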

Finally (for now), there is the descriptively-named date_format_guess function. This function checks a column of dates (stored as characters!) and tries to guess the format of each date (e.g., month/day/year, day/month/year, etc.).

It can make a more informed guess if there is a grouping column, because it can use the frequency of the numbers within each group to infer whether a given number is the day or the month. This is based on the assumption that sampling occurs more often within months than across them, so if the “left” number repeats more often than the “right” one, it is likely the month, while the less frequent number (i.e., the one with more unique values in that position) is likely the day.

If you are uncomfortable with that assumption (totally fine!), you can set groups to FALSE and the function will fall back on more commonplace assessments (e.g., a number greater than 12 must be the day, etc.).

# Make a dataframe with dates in various formats and a grouping column
my_df <- data.frame('data_enterer' = c('person A', 'person B',
                                       'person B', 'person B',
                                       'person C', 'person D',
                                       'person E', 'person F',
                                       'person G'),
                    'bad_dates' = c('2022.13.08', '2021/2/02',
                                    '2021/2/03', '2021/2/04',
                                    '1899/1/15', '10-31-1901',
                                    '26/11/1901', '08.11.2004',
                                    '6/10/02'))

# Now we can invoke the function!
date_format_guess(data = my_df, date_col = "bad_dates",
                  group_col = "data_enterer", return = "dataframe")
#> Returning dataframe of data format guesses
#>   data_enterer  bad_dates     format_guess
#> 1     person A 2022.13.08   year/day/month
#> 2     person B  2021/2/02   year/month/day
#> 3     person B  2021/2/03   year/month/day
#> 4     person B  2021/2/04   year/month/day
#> 5     person C  1899/1/15   year/month/day
#> 6     person D 10-31-1901   month/day/year
#> 7     person E 26/11/1901   day/month/year
#> 8     person F 08.11.2004 FORMAT UNCERTAIN
#> 9     person G    6/10/02 FORMAT UNCERTAIN

# If preferred, do it without groups and return a vector
date_format_guess(data = my_df, date_col = "bad_dates",
                  groups = FALSE, return = "vector")
#> Defining `groups` is strongly recommended! If none exist, consider adding a single artificial group shared by all rows then re-run this function
#> Returning vector of data format guesses
#> [1] "year/day/month"   "FORMAT UNCERTAIN" "FORMAT UNCERTAIN" "FORMAT UNCERTAIN"
#> [5] "year/month/day"   "month/day/year"   "day/month/year"   "FORMAT UNCERTAIN"
#> [9] "FORMAT UNCERTAIN"

Note that dates whose format cannot be guessed will return “FORMAT UNCERTAIN” so that you can handle them using your knowledge of the system (or by returning to your raw data if need be).
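For instance, if you knew from experience that person G recorded dates as month/day/year, you could (hypothetically) resolve that guess by hand:

# Store the dataframe of format guesses
guess_df <- date_format_guess(data = my_df, date_col = "bad_dates",
                              group_col = "data_enterer", return = "dataframe")

# Manually overwrite an uncertain guess using knowledge of the data
guess_df$format_guess[guess_df$bad_dates == "6/10/02"] <- "month/day/year"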

Data Visualization

I’ve created a set of custom ggplot2 theme elements to guarantee that all of my figures share similar aesthetics. Feel free to use theme_lyon if you have similar preferences!

# Load ggplot2
library(ggplot2)

# Create a plot
ggplot(penguins, aes(x = species, y = body_mass_g, fill = species)) +
  geom_boxplot(outlier.shape = 24) +
  theme_lyon()
#> Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).

theme_lyon makes several changes to a ggplot2 plot’s default aesthetics; see ?theme_lyon for the complete list of changes.

I’ve also created nms_ord and pcoa_ord for Non-Metric Multidimensional Scaling (NMS) and Principal Coordinates Analysis (PCoA) ordinations, respectively.

# Load data from the `vegan` package
utils::data("varespec", package = 'vegan')
resp <- varespec

# Make a column to split the data into four groups
factor_4lvl <- c(rep.int("Trt_1", (nrow(resp)/4)),
                 rep.int("Trt_2", (nrow(resp)/4)),
                 rep.int("Trt_3", (nrow(resp)/4)),
                 rep.int("Trt_4", (nrow(resp)/4)))

# And combine them into a single data object
data <- cbind(factor_4lvl, resp)

# Actually perform multidimensional scaling
mds <- vegan::metaMDS(data[-1], autotransform = FALSE, 
                      expand = FALSE, k = 2, try = 10)

# With the scaled object and original dataframe we can use this function
nms_ord(mod = mds, groupcol = data$factor_4lvl,
        title = '4-Level NMS', leg_pos = 'topright',
        leg_cont = c('1', '2', '3', '4'))

pcoa_ord has the same syntax as nms_ord but it expects an object created by ape::pcoa rather than vegan::metaMDS.
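Here is a minimal sketch of that workflow reusing the data objects from above (the choice of distance metric is arbitrary):

# Calculate a distance matrix, then perform principal coordinates analysis
dist_mat <- vegan::vegdist(x = resp, method = "kulczynski")
pcoa_mod <- ape::pcoa(D = dist_mat)

# Create the ordination with the same syntax as `nms_ord`
pcoa_ord(mod = pcoa_mod, groupcol = data$factor_4lvl,
         title = '4-Level PCoA', leg_pos = 'bottomright',
         leg_cont = c('1', '2', '3', '4'))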

Reshaping Data

array_melt allows users to ‘melt’ an array of dimensions X, Y, and Z into a dataframe containing columns “x”, “y”, “z”, and “value” where “value” is whatever was stored at those coordinates in the array.

# Make data to fill the array
vec1 <- c(5, 9, 3)
vec2 <- c(10:15)

# Create dimension names (x = col, y = row, z = which matrix)
x_vals <- c("Col_1","Col_2","Col_3")
y_vals <- c("Row_1","Row_2","Row_3")
z_vals <- c("Mat_1","Mat_2")

# Make an array from these components
g <- array(data = c(vec1, vec2), dim = c(3, 3, 2),
           dimnames = list(x_vals, y_vals, z_vals))

# "Melt" the array into a dataframe
array_melt(array = g)
#>        z     y     x value
#> 1  Mat_1 Col_1 Row_1     5
#> 2  Mat_1 Col_1 Row_2    10
#> 3  Mat_1 Col_1 Row_3    13
#> 4  Mat_1 Col_2 Row_1     9
#> 5  Mat_1 Col_2 Row_2    11
#> 6  Mat_1 Col_2 Row_3    14
#> 7  Mat_1 Col_3 Row_1     3
#> 8  Mat_1 Col_3 Row_2    12
#> 9  Mat_1 Col_3 Row_3    15
#> 10 Mat_2 Col_1 Row_1     5
#> 11 Mat_2 Col_1 Row_2    10
#> 12 Mat_2 Col_1 Row_3    13
#> 13 Mat_2 Col_2 Row_1     9
#> 14 Mat_2 Col_2 Row_2    11
#> 15 Mat_2 Col_2 Row_3    14
#> 16 Mat_2 Col_3 Row_1     3
#> 17 Mat_2 Col_3 Row_2    12
#> 18 Mat_2 Col_3 Row_3    15

crop_tri also exists under this category and provides a hopefully straightforward way of dropping one “triangle” of a symmetric dataframe or matrix. It also includes a drop_diag argument that accepts a logical for whether to drop the diagonal of the data object. See ?crop_tri for a complete explanation of the syntax; a sketch is below.
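Here is a minimal sketch using a correlation matrix (the drop_tri argument name is an assumption on my part; consult ?crop_tri for the actual arguments):

# Build a symmetric matrix (correlations among penguin measurements)
corr_mat <- cor(penguins[, c("bill_length_mm", "bill_depth_mm",
                             "flipper_length_mm")],
                use = "complete.obs")

# Drop the upper triangle and the diagonal
crop_tri(data = corr_mat, drop_tri = "upper", drop_diag = TRUE)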

Miscellaneous Other Functions

Finally, I’ve written rmd_export, which knits and exports a given R Markdown file both locally and to a user-designated Google Drive folder. Note that you MUST authenticate your R session with the googledrive package so that it has permission to access the Drive folder you supply. I recommend running googledrive::drive_auth() and completing the authentication “dance” before using rmd_export to ensure that this doesn’t cause issues for you.

# Authorize R to interact with GoogleDrive
googledrive::drive_auth()

# Use `rmd_export()` to knit and export an .Rmd file
rmd_export(rmd = "my_markdown.Rmd",
           in_path = file.path("Folder in my WD with the .Rmd named in `rmd`"),
           out_path = file.path("Folder in my WD to save the knit file to"),
           out_name = "desired name for output",
           out_type = "html",
           drive_link = "<Full Google Drive link>")

Looking Ahead

If you have ideas for other functions that supportR could contain, post them as a GitHub Issue and we’ll review them as soon as possible!