rrefine

VP Nagraj

2021-10-03

Introduction

OpenRefine (formerly Google Refine) is a popular, open source data cleaning software1. rrefine enables users to programmatically trigger data transfer between R and OpenRefine. Using the functions avaialable in this package, you can import, export or delete a project in OpenRefine directly from an R script. There are several client libraries for automating OpenRefine tasks via Python, nodeJS and Ruby2. rrefine extends this functionality to R users.

Installation

The latest version of rrefine is availabe on Github and can be installed via devtools by using the following:

# install.packages("devtools")
devtools::install_github("vpnagraj/rrefine")
library(rrefine)

lateformeeting

rrefine includes a sample “dirty” data set to illustrate its features. This object (lateformeeting) is a simulated data frame that holds 63 observations of dates, days of the week, numbers of hours slept and indicators of whether or not the subject was on time for work. The data are recorded in inconsistent formats and will require cleaning in order to be parsed correctly by R. You can take a look at how messy things are below:

theDate what.day.whas.it sleephours was.i.on.time.for.work
1/8/99 Friday 7 Yes
Janury 15 1999 friday 5 No
1/25/1999 Monday 8 No
1/28/1999 thursday 8 Yes
2/9/99 Tuesday six Yes
2/11/1999 Thursday 7 No
February 15, 1999 Monday 8 Yes
2/19/1999 Friday 10 Yes
3/3/1999 Wednesday 5 n
3/4/1999 Thursday 9 Yes
3/5/1999 Fridday 8 Y
March 8, 1999 Monday 11 Yes
3/15/99 Monday 8 yep
3/16/1999 Tuesday 8 Yes
3/17/1999 Wedensday 6.5 N
4/1/99 Thursday 7 Yes
4/12/1999 Monday 8 Yes
4/20/1999 Tuesday 10 Yes
4/27/99 Tuesday 7 Yes
5/5/1999 Wednesday 6 N
5/12/1999 Wednseday 9 Yes
5/13/99 Thursday 7 Yes
5/17/1999 monday 7 and a half Y
May 25 1999 Tuesday 8 YES
5/27/1999 Thursday 8 No
6/1/1999 tuesday 7 Yes
6/7/1999 Monday 10 Yes
6/11/1999 Friday 6 NO
6/15/99 Tuesday 8 Yes
6/24/1999 thursday 7.5 Yes
7/1/1999 Thursday 7 No
July 6 1999 Wednesday 5 Yes
7/12/1999 Monday 8 Y
7/20/1999 tuesday 9 Yes
7/21/99 Wednesday 6 No
7/23/1999 fridya 8 Yes
8/3/1999 Tuesday 7 Yes
8/5/1999 Thursday 6 Y
8/16/1999 Modnay 8 Yes
Aug 27 1999 Friday 4 Yes
August 31, 1999 Teusday 7 yep
September 1, 1999 Wednesday 11 Yes
9/10/1999 Friday 7 Yes
9/20/1999 Monday 7 No
9/23/1999 thursdayy 7.5 Yes
9/28/1999 Monday 8 Yes
10/1/99 Friday 7 Y
10/4/1999 Monday 6 Yes
10/5/1999 tuesday 9 Yes
10/14/1999 Thursday 8 Yes
10/22/1999 Friday 7 No
10/25/99 Monday 7 Yes
10/27/1999 Wednesday 8 Y
11/2/1999 Tuesday 7 Yes
11/3/1999 Wednseday 8 Yes
11/8/1999 monday 5 Yes
11/18/99 Thursday 8 No
12/1/1999 Wednesday 7 Yes
12/3/1999 Friday 6 N
12/6/1999 Monday 6.5 Yes
12/13/1999 Monday 9 Yes
12/16/1999 Thursday 7 Yes
12/21/99 Tuesday 8 Y

refine_upload()

While all of the data cleaning here could be performed using R, this is a typical scenario for OpenRefine users. The first step to creating a new project is to make sure OpenRefine is installed and running locally at http://127.0.0.1:3333/3. The refine_upload() function allows you to pass the contents of a comma separted value (csv) along with a project name (optional) and an argument to automatically open the browser in which OpenRefine is running. The example below demonstrates this workflow using the lateformeeting sample data:

write.csv(lateformeeting, file = "lateformeeting.csv", row.names = FALSE)
refine_upload(file = "lateformeeting.csv", project.name = "lfm_cleanup", open.browser = TRUE)

With the project uploaded, you can perform any of the desired clean-up procedures in OpenRefine.

refine_export()

Once you’ve cleaned up the data in OpenRefine you can pull it back into R for plotting, modeling, etc. by using refine_export(). This function will accept either the project name or the numerical unique identifier. It is only necessary to use both if there are multiple projects with the same name in your OpenRefine application. Note that here, the data is exported directly into R as a data frame, and you can assign it to a new object.

lfm_clean <- refine_export(project.name = "lfm_cleanup")
lfm_clean
date dotw hours.slept on.time
1999-01-08T00:00:00Z Friday 7.0 Yes
1999-01-15T00:00:00Z Friday 5.0 No
1999-01-25T00:00:00Z Monday 8.0 No
1999-01-28T00:00:00Z Thursday 8.0 Yes
1999-02-09T00:00:00Z Tuesday 6.0 Yes
1999-02-11T00:00:00Z Thursday 7.0 No
1999-02-15T00:00:00Z Monday 8.0 Yes
1999-02-19T00:00:00Z Friday 10.0 Yes
1999-03-03T00:00:00Z Wednesday 5.0 No
1999-03-04T00:00:00Z Thursday 9.0 Yes
1999-03-05T00:00:00Z Friday 8.0 Yes
1999-03-08T00:00:00Z Monday 11.0 Yes
1999-03-15T00:00:00Z Monday 8.0 Yes
1999-03-16T00:00:00Z Tuesday 8.0 Yes
1999-03-17T00:00:00Z Wednesday 6.5 No
1999-04-01T00:00:00Z Thursday 7.0 Yes
1999-04-12T00:00:00Z Monday 8.0 Yes
1999-04-20T00:00:00Z Tuesday 10.0 Yes
1999-04-27T00:00:00Z Tuesday 7.0 Yes
1999-05-05T00:00:00Z Wednesday 6.0 No
1999-05-12T00:00:00Z Wednesday 9.0 Yes
1999-05-13T00:00:00Z Thursday 7.0 Yes
1999-05-17T00:00:00Z Monday 7.5 Yes
1999-05-25T00:00:00Z Tuesday 8.0 Yes
1999-05-27T00:00:00Z Thursday 8.0 No
1999-06-01T00:00:00Z Tuesday 7.0 Yes
1999-06-07T00:00:00Z Monday 10.0 Yes
1999-06-11T00:00:00Z Friday 6.0 No
1999-06-15T00:00:00Z Tuesday 8.0 Yes
1999-06-24T00:00:00Z Thursday 7.5 Yes
1999-07-01T00:00:00Z Thursday 7.0 No
1999-07-06T00:00:00Z Wednesday 5.0 Yes
1999-07-12T00:00:00Z Monday 8.0 Yes
1999-07-20T00:00:00Z Tuesday 9.0 Yes
1999-07-21T00:00:00Z Wednesday 6.0 No
1999-07-23T00:00:00Z Friday 8.0 Yes
1999-08-03T00:00:00Z Tuesday 7.0 Yes
1999-08-05T00:00:00Z Thursday 6.0 Yes
1999-08-16T00:00:00Z Monday 8.0 Yes
1999-08-27T00:00:00Z Friday 4.0 Yes
1999-08-31T00:00:00Z Tuesday 7.0 Yes
1999-09-01T00:00:00Z Wednesday 11.0 Yes
1999-09-10T00:00:00Z Friday 7.0 Yes
1999-09-20T00:00:00Z Monday 7.0 No
1999-09-23T00:00:00Z Thursday 7.5 Yes
1999-09-28T00:00:00Z Monday 8.0 Yes
1999-10-01T00:00:00Z Friday 7.0 Yes
1999-10-04T00:00:00Z Monday 6.0 Yes
1999-10-05T00:00:00Z Tuesday 9.0 Yes
1999-10-14T00:00:00Z Thursday 8.0 Yes
1999-10-22T00:00:00Z Friday 7.0 No
1999-10-25T00:00:00Z Monday 7.0 Yes
1999-10-27T00:00:00Z Wednesday 8.0 Yes
1999-11-02T00:00:00Z Tuesday 7.0 Yes
1999-11-03T00:00:00Z Wednesday 8.0 Yes
1999-11-08T00:00:00Z Monday 5.0 Yes
1999-11-18T00:00:00Z Thursday 8.0 No
1999-12-01T00:00:00Z Wednesday 7.0 Yes
1999-12-03T00:00:00Z Friday 6.0 No
1999-12-06T00:00:00Z Monday 6.5 Yes
1999-12-13T00:00:00Z Monday 9.0 Yes
1999-12-16T00:00:00Z Thursday 7.0 Yes
1999-12-21T00:00:00Z Tuesday 8.0 Yes

From there the clean data is available for analyses that couldn’t have been performed in its original format.

refine_delete()

To clean up your OpenRefine workspace you can delete projects using refine_delete(). Just like refine_export() it’s possible to pass either a project name or unique identifier to this function. And it is only necessary to use both if there are multiple projects with the same name.

refine_delete(project.name = "lfm_cleanup")

References


  1. https://openrefine.org/↩︎

  2. https://github.com/OpenRefine/OpenRefine/wiki/Documentation-For-Developers#known-client-libraries-for-refine↩︎

  3. https://openrefine.org/download.html↩︎