Reproducible Datasets


In this case, we will create the dataset from a file. We add the mandatory Title attribute and the mandatory dataset identifier, dataset_id. Because the iris dataset has no obvious identifier column, the dataset() constructor will use the row names as unique observation identifiers. We do not define a dimension column; Species is defined as an attribute, and all other variables are defined as measurements.

In a tidy dataset, all measurements share a single unit of measure. In this case, the unit is millimeters, and we pass it to the dataset as an optional parameter.

The .f parameter names the function that will read in the dataset. Any function can be used, provided that it is installed on the user’s computer. The ... ellipsis must contain every parameter that the file reading function needs for a successful call. In this case, we will use a single parameter, the path to the file.

# Write the built-in iris data to a temporary CSV file.
temp_path <- file.path(tempdir(), "iris.csv")
write.csv(iris, file = temp_path, row.names = FALSE)

# Read the file back as a dataset; `file` is passed on to utils::read.csv().
iris_ds <- read_dataset(
  dataset_id = "iris_dataset",
  obs_id = NULL,
  dimensions = NULL,
  measures = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
  attributes = "Species",
  Title = "Iris Dataset",
  unit = list(code = "MM", label = "millimeter"),
  .f = "utils::read.csv",
  file = temp_path
)
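How a function named as a string in .f and the arguments collected in ... might be combined can be sketched in base R. The helper call_reader() and the file name iris_demo.csv below are hypothetical and are not part of the package; the sketch only assumes that .f is either a bare function name or a string of the form "pkg::fun".

```r
# Hypothetical helper (not part of the dataset package): resolve a function
# given as a string and call it with the arguments collected in `...`.
call_reader <- function(.f, ...) {
  parts <- strsplit(.f, "::", fixed = TRUE)[[1]]
  fn <- if (length(parts) == 2) {
    # "pkg::fun": look the function up among the package's exports.
    # This stops with an error if the package is not installed.
    getExportedValue(parts[1], parts[2])
  } else {
    # "fun": look the function up on the search path.
    match.fun(.f)
  }
  fn(...)
}

demo_path <- file.path(tempdir(), "iris_demo.csv")
write.csv(iris, file = demo_path, row.names = FALSE)

# `file = demo_path` travels through `...` to utils::read.csv().
iris_df <- call_reader("utils::read.csv", file = demo_path)
dim(iris_df)
```

Resolving the function through its package namespace is one way to make the requirement explicit that the reader function must be installed: the call fails early, before any file is touched.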

The attributes of the dataset contain the creation date and the function call that created the dataset. The related items contain the bibliographic reference to the creator function. Because we used a local file, the entire path is recorded for reproducibility. A similar function will be defined for internet URLs.
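The idea of storing the creation date, the creating call, and a bibliographic reference alongside the data can be illustrated with plain attributes on a data frame. This is a minimal sketch under stated assumptions, not the package's internal implementation: the attribute names Call and RelatedItem and the file name iris_provenance.csv are made up for the illustration.

```r
demo_path <- file.path(tempdir(), "iris_provenance.csv")
write.csv(iris, file = demo_path, row.names = FALSE)

# bquote() substitutes the value of demo_path into the call, so the
# recorded call contains the entire path rather than a variable name.
read_call <- bquote(utils::read.csv(file = .(demo_path)))
iris_df <- eval(read_call)

# Attach the provenance metadata (attribute names "Call" and
# "RelatedItem" are assumptions made for this sketch).
attr(iris_df, "Date") <- Sys.Date()
attr(iris_df, "Call") <- read_call
attr(iris_df, "RelatedItem") <- utils::citation("utils")

# The stored call can be evaluated again to repeat the import.
iris_again <- eval(attr(iris_df, "Call"))
```

Because the call is stored as a language object with the path already filled in, it can be printed for documentation or evaluated again, as long as the file still exists at that location.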


Let’s apply a very simple function to this dataset: summary, which in this case dispatches to the summary method of the base R data.frame class.

iris_summary <- use_function(dataset = iris_ds, .f = "summary")

The new dataset, iris_summary, shows that it was created by a read.csv call and that the summary method then gave it its current form and contents.

knitr::kable(attr(iris_summary, "Date"))

The dataset class maintains the entire processing history and all the bibliographic references of the functions that contributed to its creation. [Their versions will be recorded, too.] This makes objects of the dataset class fully reproducible.
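A processing history of this kind could be kept as a small table that grows by one row for each function applied. The sketch below is only an illustration in base R: new_step(), apply_with_history(), and the History attribute are hypothetical names, and recording the R version stands in for the package versions mentioned above.

```r
# Hypothetical helper: describe one processing step as a one-row data frame.
new_step <- function(.f) {
  data.frame(
    step = .f,
    time = format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
    r_version = as.character(getRversion()),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: apply the function named in .f to x and append
# the step to the history carried over from x.
apply_with_history <- function(x, .f, ...) {
  result <- do.call(.f, list(x, ...))
  previous <- attr(x, "History")
  attr(result, "History") <- if (is.null(previous)) {
    new_step(.f)
  } else {
    rbind(previous, new_step(.f))
  }
  result
}

# Start from a data frame whose history already holds the import step.
iris_df <- iris
attr(iris_df, "History") <- new_step("utils::read.csv")

iris_summary_demo <- apply_with_history(iris_df, "summary")
attr(iris_summary_demo, "History")$step
```

The result of summary() is a new object, so the history has to be copied over explicitly; this is the reason the helper reads the attribute from its input rather than from the result.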