The `supportR` package is an amalgam of distinct functions I've written to accomplish small data wrangling, quality control, or visualization tasks. These functions tend to be short and narrowly defined. Because each one was written to solve a specific problem, they also tend not to be inter-related or united by a common theme. If this vignette feels somewhat scattered because of that, I hope it doesn't negatively affect how informative it is!

This vignette describes the main functions of `supportR` using the examples included in each function.
library(supportR)
In order to demonstrate the summarizing function(s), we'll use some example data from Dr. Allison Horst's `palmerpenguins` R package.
# Load library
library(palmerpenguins)
# Glimpse the penguins dataset
dplyr::glimpse(penguins)
#> Rows: 344
#> Columns: 8
#> $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
#> $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
#> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex <fct> male, female, female, NA, female, male, female, male…
#> $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
With that data loaded, we can use the `summary_table` function to quickly get group-wise summaries and retrieve generally useful summary statistics.
# Summarize the data
summary_table(data = penguins, groups = c("species", "island"),
              response = "bill_length_mm", drop_na = TRUE)
#> species island mean std_dev sample_size std_error
#> 1 Adelie Biscoe 38.98 2.48 44 0.37
#> 2 Adelie Dream 38.50 2.47 56 0.33
#> 3 Adelie Torgersen 38.95 3.03 52 0.42
#> 4 Chinstrap Dream 48.83 3.34 68 0.41
#> 5 Gentoo Biscoe 47.50 3.08 124 0.28
The `groups` argument supports a vector of all of the column names to group by while `response` must be a single numeric column. The `drop_na` argument allows group combinations that result in an `NA` to be automatically dropped (i.e., if a penguin didn't have an island listed, that row would be dropped). The mean, standard deviation (SD), sample size, and standard error (SE) are all returned to facilitate easy figure creation. There is also a `round_digits` argument that lets you specify how many digits you'd like to retain for the mean, SD, and SE.
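For example, to retain just one digit instead (the specific value here is arbitrary):

# Same summary as above, but rounding to one digit
summary_table(data = penguins, groups = c("species", "island"),
              response = "bill_length_mm", drop_na = TRUE, round_digits = 1)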
I do write a lot of `tidyverse`-style lengthy pipes (`%>%`) that create and remove potentially many columns without creating intermediary objects. This does have the potential for accidentally dropping desired columns without noticing (especially when using `dplyr::select` to implicitly exclude columns). I've created the `diff_check` function to, in part, quickly confirm that this isn't happening.
Generally speaking, `diff_check` just compares two vectors and reports back what is in the first but not the second (i.e., what is "lost") and what is in the second but not the first (i.e., what is "gained").
# Make two vectors
vec1 <- c("x", "a", "b")
vec2 <- c("y", "z", "a")

# Compare them!
diff_check(old = vec1, new = vec2)
#> Following element(s) found in old object but not new:
#> [1] "b" "x"
#> Following element(s) found in new object but not old:
#> [1] "y" "z"
`diff_check` also includes the optional logical arguments `sort` and `return` which, when set to `TRUE`, will respectively sort the differences in both vectors and return them as a two-element list. As I said above, I most often feed `diff_check` the vector of column names returned by `names` before and after a pipe to ensure that I know which columns were lost/gained by a set of operations.
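Here is a minimal sketch of that workflow (the dropped column is arbitrary):

# Store the column names before the operation(s)
cols_before <- names(penguins)

# Perform an operation that implicitly drops a column
penguins_v2 <- dplyr::select(penguins, -year)

# Confirm that only the expected column was lost
diff_check(old = cols_before, new = names(penguins_v2))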
This package also includes the function `num_check` that identifies all values of a column that would be coerced to `NA` if `as.numeric` was run on the column.
# Make a dataframe with non-numbers in a number column
fish <- data.frame('species' = c('salmon', 'bass', 'halibut', 'eel'),
                   'count' = c(1, '14x', '_23', 12))

# Use `num_check` to identify non-numbers
num_check(data = fish, col = "count")
#> [1] "14x" "_23"
Once these non-numbers are identified, you can handle them in whatever way you feel is most appropriate. `num_check` is intended only to flag these for your attention, not to attempt a fix using a method you may or may not support.
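For instance, one possible (entirely hypothetical) fix is to strip out the offending characters before coercing; whether that is appropriate depends on your data:

# Remove anything that isn't a digit or decimal point, then coerce
fish$count_fixed <- as.numeric(gsub(pattern = "[^0-9.]", replacement = "",
                                    x = fish$count))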
`date_check`, on the other hand, does the same thing but checks a column for entries that would be coerced to `NA` by `as.Date` instead. Note that if a date is sufficiently badly formatted, `as.Date` will throw an error instead of coercing to `NA`, so `date_check` will do the same thing.
# Make a dataframe including malformed dates
sites <- data.frame('site' = c("LTR", "GIL", "PYN", "RIN"),
                    'visit' = c('2021-01-01', '2021-01-0w', '1990', '2020-10-xx'))

# Now we can use our function to identify bad dates
date_check(data = sites, col = 'visit')
#> [1] "2021-01-0w" "1990" "2020-10-xx"
Both `num_check` and `date_check` have an expanded version (`multi_num_check` and `multi_date_check`, respectively) that uses a `col_vec` argument instead of `col`. `col_vec` accepts a vector of column names and checks all of those columns for non-numbers or bad dates.
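As a quick sketch of the multi-column variant (the dataframe here is hypothetical):

# Make a dataframe with two suspect columns
fish_v2 <- data.frame('count' = c('1', '14x', '12'),
                      'weight_g' = c('8', 'seven', '5'))

# Check both columns in a single call
multi_num_check(data = fish_v2, col_vec = c("count", "weight_g"))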
Finally (for now) is the descriptively named `date_format_guess` function. This function checks a column of dates (stored as characters!) and tries to guess the format of the date (i.e., month/day/year, day/month/year, etc.).

It can make a more informed guess if there is a grouping column because it can use the frequency of the numbers within each group to guess whether a given number is the day or the month. This is based on the assumption that sampling occurs more often within months than across them, so if the "left" number is repeated more often than the "right" one, it is likely the month, while the less frequent number (i.e., the one with more unique values in that position) is likely the day.
If you are uncomfortable with that assumption (totally fine!), you can set `groups` to `FALSE` and it will do the more commonplace assessments (i.e., if a number is >12 it is the day, etc.).
# Make a dataframe with dates in various formats and a grouping column
my_df <- data.frame('data_enterer' = c('person A', 'person B',
                                       'person B', 'person B',
                                       'person C', 'person D',
                                       'person E', 'person F',
                                       'person G'),
                    'bad_dates' = c('2022.13.08', '2021/2/02',
                                    '2021/2/03', '2021/2/04',
                                    '1899/1/15', '10-31-1901',
                                    '26/11/1901', '08.11.2004',
                                    '6/10/02'))

# Now we can invoke the function!
date_format_guess(data = my_df, date_col = "bad_dates",
group_col = "data_enterer", return = "dataframe")
#> Returning dataframe of data format guesses
#> data_enterer bad_dates format_guess
#> 1 person A 2022.13.08 year/day/month
#> 2 person B 2021/2/02 year/month/day
#> 3 person B 2021/2/03 year/month/day
#> 4 person B 2021/2/04 year/month/day
#> 5 person C 1899/1/15 year/month/day
#> 6 person D 10-31-1901 month/day/year
#> 7 person E 26/11/1901 day/month/year
#> 8 person F 08.11.2004 FORMAT UNCERTAIN
#> 9 person G 6/10/02 FORMAT UNCERTAIN
# If preferred, do it without groups and return a vector
date_format_guess(data = my_df, date_col = "bad_dates",
groups = FALSE, return = "vector")
#> Defining `groups` is strongly recommended! If none exist, consider adding a single artificial group shared by all rows then re-run this function
#> Returning vector of data format guesses
#> [1] "year/day/month" "FORMAT UNCERTAIN" "FORMAT UNCERTAIN" "FORMAT UNCERTAIN"
#> [5] "year/month/day" "month/day/year" "day/month/year" "FORMAT UNCERTAIN"
#> [9] "FORMAT UNCERTAIN"
Note that dates that cannot be guessed by my function will return “FORMAT UNCERTAIN” so that you can handle them using your knowledge of the system (or by returning to your raw data if need be).
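A minimal sketch of triaging those entries (re-using the dataframe output from above):

# Store the guesses as a dataframe
guess_df <- date_format_guess(data = my_df, date_col = "bad_dates",
                              group_col = "data_enterer", return = "dataframe")

# Subset to the rows that need manual attention
guess_df[guess_df$format_guess == "FORMAT UNCERTAIN", ]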
I've created a set of custom `ggplot2` theme elements to guarantee that all of my figures share similar aesthetics. Feel free to use `theme_lyon` if you have similar preferences!
# Load ggplot2
library(ggplot2)
# Create a plot
ggplot(penguins, aes(x = species, y = body_mass_g, fill = species)) +
geom_boxplot(outlier.shape = 24) +
theme_lyon()
#> Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).
`theme_lyon` makes a number of changes to a `ggplot2` plot's default aesthetics; see `?theme_lyon` for the full list of theme elements it modifies.
I've also created `nms_ord` and `pcoa_ord` for Non-Metric Multi-Dimensional Scaling (NMS) and Principal Coordinates Analysis (PCoA) ordinations, respectively.
# Load data from the `vegan` package
utils::data("varespec", package = 'vegan')
resp <- varespec

# Make a column to split the data into 4 groups
factor_4lvl <- c(rep.int("Trt_1", (nrow(resp)/4)),
                 rep.int("Trt_2", (nrow(resp)/4)),
                 rep.int("Trt_3", (nrow(resp)/4)),
                 rep.int("Trt_4", (nrow(resp)/4)))

# And combine them into a single data object
data <- cbind(factor_4lvl, resp)

# Actually perform multidimensional scaling
mds <- vegan::metaMDS(data[-1], autotransform = FALSE,
                      expand = FALSE, k = 2, try = 10)

# With the scaled object and original dataframe we can use this function
nms_ord(mod = mds, groupcol = data$factor_4lvl,
title = '4-Level NMS', leg_pos = 'topright',
leg_cont = c('1', '2', '3', '4'))
`pcoa_ord` has the same syntax as `nms_ord` but it expects an object created by `ape::pcoa` rather than `vegan::metaMDS`.
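A minimal sketch of that workflow (the choice of distance metric here is arbitrary):

# Calculate a distance matrix from the response data
dist_mat <- vegan::vegdist(data[-1], method = "kulczynski")

# Perform the Principal Coordinates Analysis
pcoa_mod <- ape::pcoa(dist_mat)

# Create the ordination with the same syntax as `nms_ord`
pcoa_ord(mod = pcoa_mod, groupcol = data$factor_4lvl,
         title = '4-Level PCoA', leg_pos = 'topright',
         leg_cont = c('1', '2', '3', '4'))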
`array_melt` allows users to 'melt' an array of dimensions X, Y, and Z into a dataframe containing columns "x", "y", "z", and "value", where "value" is whatever was stored at those coordinates in the array.
# Make data to fill the array
vec1 <- c(5, 9, 3)
vec2 <- c(10:15)

# Create dimension names (x = col, y = row, z = which matrix)
x_vals <- c("Col_1", "Col_2", "Col_3")
y_vals <- c("Row_1", "Row_2", "Row_3")
z_vals <- c("Mat_1", "Mat_2")

# Make an array from these components
g <- array(data = c(vec1, vec2), dim = c(3, 3, 2),
           dimnames = list(x_vals, y_vals, z_vals))

# "Melt" the array into a dataframe
array_melt(array = g)
#> z y x value
#> 1 Mat_1 Col_1 Row_1 5
#> 2 Mat_1 Col_1 Row_2 10
#> 3 Mat_1 Col_1 Row_3 13
#> 4 Mat_1 Col_2 Row_1 9
#> 5 Mat_1 Col_2 Row_2 11
#> 6 Mat_1 Col_2 Row_3 14
#> 7 Mat_1 Col_3 Row_1 3
#> 8 Mat_1 Col_3 Row_2 12
#> 9 Mat_1 Col_3 Row_3 15
#> 10 Mat_2 Col_1 Row_1 5
#> 11 Mat_2 Col_1 Row_2 10
#> 12 Mat_2 Col_1 Row_3 13
#> 13 Mat_2 Col_2 Row_1 9
#> 14 Mat_2 Col_2 Row_2 11
#> 15 Mat_2 Col_2 Row_3 14
#> 16 Mat_2 Col_3 Row_1 3
#> 17 Mat_2 Col_3 Row_2 12
#> 18 Mat_2 Col_3 Row_3 15
`crop_tri` also exists under this category and provides a hopefully straightforward way of dropping one "triangle" of a dataframe / matrix. It also includes a `drop_diag` argument that accepts a logical for whether to drop the diagonal of the data object. See `?crop_tri` for a complete explanation of the syntax.
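Here is a minimal sketch of that idea; note that the `drop_tri = "upper"` argument is my assumption about the syntax, so double-check it against `?crop_tri`:

# Make a small symmetric matrix (a correlation matrix is a common case)
cor_mat <- cor(penguins[, c("bill_length_mm", "bill_depth_mm", "flipper_length_mm")],
               use = "complete.obs")

# Drop the redundant triangle along with the diagonal
crop_tri(data = cor_mat, drop_tri = "upper", drop_diag = TRUE)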
Finally, I've written `rmd_export`, which knits and exports a given R Markdown file both locally and to a user-designated Google Drive folder. Note that you MUST authenticate your R session with the `googledrive` package so that it has permission to access the Drive folder you supply. I recommend running `googledrive::drive_auth()` and doing the authentication "dance" before using `rmd_export` to ensure that this doesn't cause issues for you.
# Authorize R to interact with GoogleDrive
googledrive::drive_auth()

# Use `rmd_export()` to knit and export an .Rmd file
rmd_export(rmd = "my_markdown.Rmd",
in_path = file.path("Folder in my WD with the .Rmd named in `rmd`"),
out_path = file.path("Folder in my WD to save the knit file to"),
out_name = "desired name for output",
out_type = "html",
drive_link = "<Full Google Drive link>")
If you have ideas for other functions that `supportR` could contain, post them as a GitHub Issue and we'll review them as soon as possible!