# Create and Analyse a Dataset

This article provides an example of how to use owidR to create a country level dataset consisting of multiple variables. It does this in the context of trying to answer the research question: does higher internet use lead to higher levels of democracy? This is based on research done by (citation) We will do a similar analysis using data gathered from Our World in Data. This analysis is only intended to be an example of how to use owidR and not a robust research paper. The article assumes some basic knowledge of the tidyverse, especially dplyr and ggplot2. If you aren’t familiar with either of these you should still be able to follow along but I would recommend reading R for Data Science. An understanding of basic R is essential.

To begin we’ll load owidR using the library() function. We’ll also load dplyr, ggplot2, plm as well as texreg which we’ll be using to do the analysis.

library(owidR)
library(dplyr)
library(ggplot2)
library(plm)
library(texreg)

Searching for and importing data using owidR is very easy. First, we can search for data on a topic using owid_search(). We’ll start by searching for data to use as our outcome variable: internet To do this I enter the keyword “internet” as the argument in owid_search().

owid_search("internet")
#> No internet connection available: returning blank tibble

When running this line of code around 10 datasets about the internet are returned. Let’s use the dataset with the title: Share of the population using the Internet. The corresponding chart_id to this data is: “Share of the population using the Internet”. Using the chart_id as an argument to owid() imports that data into R, assigning it to an object called internet. We use the rename argument to give the value column a shorter a clean name.

internet <- owid("share-of-individuals-using-the-internet", rename = "internet_use")
#> No internet connection available: returning blank tibble
internet
#> # A tibble: 1 × 3
#>   entity year  value
#>   <lgl>  <lgl> <lgl>
#> 1 NA     NA    NA

We can find information about the source of data by using owid_source(), with the owid dataset object as the argument. This gives us the original publisher of the data as well as a link to the data. For some datasets additional information about how the variables is calculated is also provided. Using view_chart() takes you to the Our World in Data webpage for that dataset, where there is also additional information and a pretty graph.

owid_source(internet)
#> owid object had not connected to ourworldindata.org
#> NULL
view_chart(internet)

To create simple plots to see how internet use has changed over time simply use owid_plot(), filtering to give the World total. Given that this function is a wrapper around ggplot2 you can use normal ggplot2 functions to further manipulate the graph. I’m going to add a title using labs() and change the y axis scale so that it starts from 0 (this makes the graph clearer to interpret given that the value is a percentage, otherwise small variations can appear large).

owid_plot(internet, filter = "World") +
labs(title = "Share of the World Population using the Internet") +
scale_y_continuous(limits = c(0, 100))
#> owid object had not connected to ourworldindata.org

We can see how internet use varies between countries by creating a choropleth map using owid_map. It shows that, in 2018, there is still a large variation in the level of internet use in countries, with many African countries having particularly low use.

owid_map(internet, year = 2017) +
labs(title = "Share of Population Using the Internet, 2017")
#> owid object had not connected to ourworldindata.org

It’s also possible to compare countries level of internet use across time, again using owid_plot(). By using the argument summarise = FALSE, owid_plot() will show individual countries instead of aggregating them into the total. You can then use the filter argument to select which countries you want to be displayed.

owid_plot(internet, summarise = FALSE, filter = c("United Kingdom", "Spain", "Russia", "Egypt", "Nigeria")) +
labs(title = "Share of Population with Using the Internet") +
scale_y_continuous(limits = c(0, 100), labels = scales::label_number(suffix = "%")) # The labels argument allows you to make it clear that the value is a percentage
#> owid object had not connected to ourworldindata.org

Now let’s get data on democracy, first searching for a data source and then importing it using owid(). Using that data we’ll do some similar exploration to what we did with internet use data.

owid_search("democrac")
democracy <- owid("electoral-democracy")
#> No internet connection available: returning blank tibble
democracy
#> # A tibble: 1 × 3
#>   entity year  value
#>   <lgl>  <lgl> <lgl>
#> 1 NA     NA    NA

owid_source(democracy)
#> owid object had not connected to ourworldindata.org
#> NULL

owid_map(democracy, palette = "YlGn") +
labs(title = "Electoral Democracy")
#> owid object had not connected to ourworldindata.org

So we’ve done some nice exploratory analysis and produced some pretty graphs, but now let’s get into some more in depth analysis. We’ll use a fixed effect (FE) regression analysis to estimate the average effect that an increase in internet use has on democracy within a country. If you aren’t familiar with FE regression this article explain its purpose well and this chapter from Introduction to Econometrics with R shows how to implement it in R. To estimate the effect of internet use we’re going to use a within-unit fixed effects model.

This model will require us to adjust for confounding factors, so we’ll need some extra data. I’m going to use data on variables that I think might be confounding the relationship between internet use and democracy. These are: GDP per Capita, Government Expenditure, Age Dependency and Unemployment. There are almost certainly more confounding factors so feel free to use owid_search() to find data on other variables you think might be confounders and add them to the analysis.

gdp <- owid("gdp-per-capita-worldbank", rename = "gdp")
#> No internet connection available: returning blank tibble

gov_exp <- owid("total-gov-expenditure-gdp-wdi", rename = "gov_exp")
#> No internet connection available: returning blank tibble

age_dep <- owid("age-dependency-ratio-of-working-age-population", rename = "age_dep")
#> No internet connection available: returning blank tibble

unemployment <- owid("unemployment-rate", rename = "unemp")
#> No internet connection available: returning blank tibble

In order to create an FE model, all these separate dataframes now need to combined into one. To do this I’m going to use the left_join() function from dplyr and create a new dataframe called data that combines all the other dataframes.

data <- internet %>%
left_join(democracy) %>%
left_join(gdp) %>%
left_join(gov_exp) %>%
left_join(age_dep) %>%
left_join(unemployment)
#> Joining, by = c("entity", "year", "value")
#> Joining, by = c("entity", "year", "value")
#> Joining, by = c("entity", "year", "value")
#> Joining, by = c("entity", "year", "value")
#> Joining, by = c("entity", "year", "value")

Now that we have a combined dataset we can get to the analysis. First, let’s use ggplot2 create a graph to see the correlation between internet access and democracy in 2015.

data %>%
filter(year == 2015) %>%
ggplot(aes(internet_use, electdem_vdem_owid)) +
geom_point(colour = "#57677D") +
geom_smooth(method = "lm", colour = "#DC5E78") +
labs(title = "Relationship Between Internet Use and Polity IV Score", x = "Internet Use", y = "Polity IV") +
theme_owid()
#> Warning in theme_owid(): importing fonts requires an internet connection please
#> use import_fonts = FALSE
#> Error in FUN(X[[i]], ...): object 'internet_use' not found

There appears to be some relationship but this could easily be explained by countries with higher development also being more democratic and not actually the result of internet access. That’s why we control for GDP and the other confounders. Next, we’ll create two models, one with just internet use and democracy, and the other with the confounders added.

fe_model <- plm(electdem_vdem_owid ~ internet_use, data,
effect = c("individual"), index = "entity")
#> Error in 1:Ti[i]: NA/NaN argument

fe_model_2 <- plm(electdem_vdem_owid ~ internet_use + gdp + gov_exp + age_dep + unemp, data,
effect = c("individual"), index = "entity")
#> Error in 1:Ti[i]: NA/NaN argument

htmlreg(list(fe_model, fe_model_2))
#> Error in "list" %in% class(l)[1]: object 'fe_model' not found

You can see that internet use has a significant positive effect in the first model, but once the confounders are added the effect is insignificant. This means that our model provides no evidence that internet use has an effect on democracy. However, feel free to play around with this data yourself and see if you get a different result when other variables are used.