The explore package simplifies Exploratory Data Analysis (EDA). Get faster insights with less code!
There are three ways to use the package:
Interactive data exploration (univariate, bivariate, multivariate)
Generate an Automated Report with one line of code. The target can be binary, categorical or numeric.
Manual exploration using an easy-to-remember set of tidy functions. The package introduces four main verbs: explore() to graphically explore a variable or table, describe() to describe a variable or table, explain_tree() to create a simple decision tree that explains a target, and report() to generate an automated report of all variables.
explore package on GitHub: https://github.com/rolkra/explore
As the explore functions fit well into the tidyverse, we load the dplyr package as well.
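A minimal setup sketch, assuming both packages are already installed from CRAN:

```r
# dplyr provides the pipe (%>%) and verbs such as filter() and select()
library(dplyr)
# explore provides explore(), describe(), explain_tree() and report()
library(explore)
```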
Explore your dataset (in this case the iris dataset) in one line of code:
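The one-liner could look like the sketch below. The `interactive()` guard is an addition for illustration, so that the app is only started from a live R session and not from a script or a report being knitted.

```r
library(dplyr)
library(explore)

# Calling explore() on a whole table (no variable given) starts the interactive app
if (interactive()) {
  iris %>% explore()
}
```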
This launches a Shiny app in which you can inspect individual variables, explore their relation to a target (binary / categorical / numerical), grow a decision tree or create a fully automated report of all variables with a few mouse clicks.
You can choose any variable as a target, provided it is binary (0/1, FALSE/TRUE or “no”/“yes”), categorical or numeric.
Create a rich HTML report of all variables with one line of code:
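A sketch of such a call, assuming that `report()` accepts `output_file` and `output_dir` and that rmarkdown and pandoc are available to render the HTML. The file name and the temporary directory are illustrative choices.

```r
library(dplyr)
library(explore)

# Write the report to a temporary directory so the working directory stays clean
iris %>% report(output_file = "report.html", output_dir = tempdir())
```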
Or you can simply add a target and create the report. In this case we use a binary target, but a categorical or numerical target would work as well.
If you use a binary target, the parameter split = FALSE will give you a different view on the data.
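Both variants might be written as follows. The column `is_versicolor` is a helper introduced here for illustration (it is not part of iris), and the output file names are arbitrary. As above, rendering assumes rmarkdown and pandoc are installed.

```r
library(dplyr)
library(explore)

# Hypothetical binary target: 1 for versicolor, 0 for every other species
data <- iris %>%
  mutate(is_versicolor = ifelse(Species == "versicolor", 1, 0))

# Report of all variables in relation to the binary target
data %>% report(output_file = "report_target.html",
                output_dir = tempdir(),
                target = is_versicolor)

# Same report with split = FALSE for the alternative view
data %>% report(output_file = "report_target_nosplit.html",
                output_dir = tempdir(),
                target = is_versicolor,
                split = FALSE)
```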
Grow a decision tree with one line of code:
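For example, with the categorical column Species as the target (a sketch; the tree is drawn as a plot):

```r
library(dplyr)
library(explore)

# Explain Species using all other columns of iris
iris %>% explain_tree(target = Species)
```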
You can grow a decision tree with a binary target too.
Or using a numerical target. The syntax stays the same.
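The binary and the numerical case can be sketched like this. `is_versicolor` is again a helper column created only for this illustration, and Species is dropped in the binary case because it would explain the derived target trivially.

```r
library(dplyr)
library(explore)

# Hypothetical binary target derived from Species
data <- iris %>%
  mutate(is_versicolor = ifelse(Species == "versicolor", 1, 0))

# Binary target
data %>%
  select(-Species) %>%
  explain_tree(target = is_versicolor)

# Numerical target, same syntax
iris %>% explain_tree(target = Sepal.Length)
```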
You can control the growth of the tree using the parameters maxdepth, minsplit and cp.
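The call below is a sketch that assumes the growth controls are the rpart-style arguments `maxdepth`, `minsplit` and `cp`. The values are illustrative only, not recommendations.

```r
library(dplyr)
library(explore)

# Allow a deeper tree (maxdepth) and smaller nodes before splitting (minsplit);
# cp = 0 means no complexity penalty is applied when deciding on splits
iris %>% explain_tree(target = Species, maxdepth = 4, minsplit = 10, cp = 0)
```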
Explore your table with one line of code to see which type of variables it contains.
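A sketch using `explore_tbl()`, which plots an overview of the table:

```r
library(dplyr)
library(explore)

# Overview plot of the variables in the table
iris %>% explore_tbl()
```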
You can also use describe_tbl() if you just need the main facts without visualisation.
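The text-only variant could be called like this (a sketch):

```r
library(dplyr)
library(explore)

# Prints a short text summary of the table instead of drawing a plot
iris %>% describe_tbl()
```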
Explore a variable with one line of code. You don’t have to care if a variable is numerical or categorical.
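As a sketch, the same call is used for a categorical and for a numerical column:

```r
library(dplyr)
library(explore)

# Categorical variable (factor)
iris %>% explore(Species)

# Numerical variable, identical syntax
iris %>% explore(Sepal.Length)
```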
Explore a variable and its relationship with a binary target with one line of code. You don’t have to care if a variable is numerical or categorical.
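A possible call is sketched below. The binary column `is_versicolor` is a helper created for illustration and does not exist in iris.

```r
library(dplyr)
library(explore)

# Hypothetical binary target: 1 for versicolor, 0 otherwise
data <- iris %>%
  mutate(is_versicolor = ifelse(Species == "versicolor", 1, 0))

# Distribution of Sepal.Length, split by the binary target
data %>% explore(Sepal.Length, target = is_versicolor)
```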
Using split = FALSE will change the plot to %target:
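Sketched with the same illustrative helper column `is_versicolor` (an assumption, not part of iris):

```r
library(dplyr)
library(explore)

# Hypothetical binary target: 1 for versicolor, 0 otherwise
data <- iris %>%
  mutate(is_versicolor = ifelse(Species == "versicolor", 1, 0))

# split = FALSE switches to the %target view
data %>% explore(Sepal.Length, target = is_versicolor, split = FALSE)
```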
The target can have more than two levels:
Or the target can even be numeric:
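Both cases can be sketched with columns that already exist in iris:

```r
library(dplyr)
library(explore)

# Categorical target with three levels
iris %>% explore(Sepal.Length, target = Species)

# Numeric target
iris %>% explore(Sepal.Length, target = Petal.Length)
```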
To use a high number of variables with explore_all() in an R Markdown file, it is necessary to set a meaningful fig.width and fig.height in the chunk header. The function total_fig_height() helps to automatically set fig.height:
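A sketch of how this could look. The chunk header is shown as a comment because it belongs to the R Markdown file, not to the R code itself, and the `fig.width` value is an arbitrary choice.

```r
library(dplyr)
library(explore)

# Total height needed to plot every variable of iris
h <- total_fig_height(iris)
h

# In the R Markdown chunk header this would be written as, for example:
#   {r, fig.width=10, fig.height=total_fig_height(iris)}
iris %>% explore_all()
```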
If you use a target:
fig.height=total_fig_height(iris, target = Species)
You can control total_fig_height() with the parameters ncols (number of plot columns) and size (height of one plot).
Explore correlation between two variables with one line of code:
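For example, passing two numeric variables to `explore()` (a sketch):

```r
library(dplyr)
library(explore)

# Relationship between two numeric variables
iris %>% explore(Sepal.Length, Petal.Length)
```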
You can add a target too:
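Sketched with an illustrative binary helper column `is_versicolor` (an assumption, created here and not part of iris):

```r
library(dplyr)
library(explore)

# Hypothetical binary target: 1 for versicolor, 0 otherwise
data <- iris %>%
  mutate(is_versicolor = ifelse(Species == "versicolor", 1, 0))

# Two variables plus a target
data %>% explore(Sepal.Length, Petal.Length, target = is_versicolor)
```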
If you use explore to explore a variable and want to set lower and upper limits for values, you can use the min_val and max_val parameters. All values below min_val will be set to min_val. All values above max_val will be set to max_val.
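A sketch with illustrative limits of 4.5 and 7 for Sepal.Length:

```r
library(dplyr)
library(explore)

# Values below 4.5 are treated as 4.5, values above 7 as 7
iris %>% explore(Sepal.Length, min_val = 4.5, max_val = 7)
```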
explore uses auto-scale by default. To deactivate it, use the parameter auto_scale = FALSE.
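For instance (a sketch):

```r
library(dplyr)
library(explore)

# Plot Sepal.Length without automatic scaling of the axis
iris %>% explore(Sepal.Length, auto_scale = FALSE)
```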
Describe your data in one line of code:
```r
iris %>% describe()
#> # A tibble: 5 x 8
#>   variable     type     na na_pct unique   min  mean   max
#>   <chr>        <chr> <int>  <dbl>  <int> <dbl> <dbl> <dbl>
#> 1 Sepal.Length dbl       0      0     35   4.3  5.84   7.9
#> 2 Sepal.Width  dbl       0      0     23   2    3.06   4.4
#> 3 Petal.Length dbl       0      0     43   1    3.76   6.9
#> 4 Petal.Width  dbl       0      0     22   0.1  1.2    2.5
#> 5 Species      fct       0      0      3  NA   NA     NA
```
The result is a data frame in which each row is a variable of your data. You can use filter from dplyr for quick checks:
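Two quick checks, sketched with the column names `na` and `unique` from the describe() output. The thresholds are arbitrary examples.

```r
library(dplyr)
library(explore)

# Variables that contain missing values
with_na <- iris %>% describe() %>% filter(na > 0)
with_na

# Variables with fewer than 5 distinct values; for iris this keeps only Species
few_levels <- iris %>% describe() %>% filter(unique < 5)
few_levels
```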
You can use describe for describing variables too. You don’t need to care if a variable is numerical or categorical. The output is text.
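A sketch for one categorical and one numerical variable:

```r
library(dplyr)
library(explore)

# Text description of a categorical variable
iris %>% describe(Species)

# Same call for a numerical variable
iris %>% describe(Sepal.Length)
```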
Create a Data Dictionary of a dataset (Markdown File data_dict.md)
Add title, detailed descriptions and change default filename
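Both steps might look like the sketch below. It assumes the function is `data_dict_md()` with the arguments `title`, `description`, `output_file` and `output_dir`, and that descriptions are passed as a data frame with the columns `variable` and `description`. The title, description text and file name are made up for illustration.

```r
library(explore)

# Default file name data_dict.md, written to a temporary directory
data_dict_md(iris, output_dir = tempdir())

# One row per variable that should get a detailed description
description <- data.frame(
  variable = c("Species"),
  description = c("Species of iris flower")
)

# Custom title, descriptions and file name
data_dict_md(iris,
             title = "iris flower data set",
             description = description,
             output_file = "data_dict_iris.md",
             output_dir = tempdir())
```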
To clean a variable you can use clean_var. With one line of code you can rename a variable, replace NA values and set a minimum and maximum for the value.
```r
iris %>%
  clean_var(Sepal.Length,
            min_val = 4.5,
            max_val = 7.0,
            na = 5.8,
            name = "sepal_length") %>%
  describe()
#> # A tibble: 5 x 8
#>   variable     type     na na_pct unique   min  mean   max
#>   <chr>        <chr> <int>  <dbl>  <int> <dbl> <dbl> <dbl>
#> 1 sepal_length dbl       0      0     26   4.5  5.81   7
#> 2 Sepal.Width  dbl       0      0     23   2    3.06   4.4
#> 3 Petal.Length dbl       0      0     43   1    3.76   6.9
#> 4 Petal.Width  dbl       0      0     22   0.1  1.2    2.5
#> 5 Species      fct       0      0      3  NA   NA     NA
```
The explore package comes with a set of easy-to-remember functions to connect to, read from and write to a data warehouse (DWH) using ODBC.
```r
# connect to a dwh (odbc DSN must be defined)
dwh <- dwh_connect("DWH_DSN")

# if you need to pass user and password
dwh <- dwh_connect("DWH_DSN",
                   user = "myuser",
                   pwd = rstudioapi::askForPassword())

# read table from a dwh
data <- dwh_read_table(dwh, "db.tablename")

# read data from a dwh using sql
data <- dwh_read_data(dwh, sql = "select * from db.tablename")

# disconnect from dwh
dwh_disconnect(dwh)
```
To write large data to a dwh you can use dwh_fastload(). It connects to a dwh, writes the data and disconnects.
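A hedged sketch of such an upload. The argument names `dsn` and `table` are assumptions about the function's interface, and "DWH_DSN" and "db.tablename" are the same placeholders used in the connection example. The call is wrapped in a helper (a hypothetical name) that is defined but not run, since it needs a configured ODBC data source.

```r
library(explore)

# Hypothetical helper: uploads iris to a placeholder table.
# Nothing is sent until upload_iris() is actually called.
upload_iris <- function() {
  dwh_fastload(iris, dsn = "DWH_DSN", table = "db.tablename")
}
```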