This article provides an example of how to use
owidR to create a country level dataset consisting of multiple variables. It does this in the context of trying to answer the research question: does higher internet use lead to higher levels of democracy? This is based on research done by (citation) We will do a similar analysis using data gathered from Our World in Data. This analysis is only intended to be an example of how to use owidR and not a robust research paper. The article assumes some basic knowledge of the
ggplot2. If you aren’t familiar with either of these you should still be able to follow along but I would recommend reading R for Data Science. An understanding of basic R is essential.
To begin we’ll load
owidR using the
library() function. We’ll also load
plm as well as
texreg which we’ll be using to do the analysis.
library(owidR) library(dplyr) library(ggplot2) library(plm) library(texreg)
Searching for and importing data using
owidR is very easy. First, we can search for data on a topic using
owid_search(). We’ll start by searching for data to use as our outcome variable: internet To do this I enter the keyword “internet” as the argument in
When running this line of code around 10 datasets about the internet are returned. Let’s use the dataset with the title: Share of the population using the Internet. The corresponding chart_id to this data is: “Share of the population using the Internet”. Using the chart_id as an argument to
owid() imports that data into R, assigning it to an object called internet. We use the rename argument to give the value column a shorter a clean name.
<- owid("share-of-individuals-using-the-internet", rename = "internet_use") internet internet#> # A tibble: 7,246 × 4 #> entity code year internet_use #> * <chr> <chr> <int> <dbl> #> 1 Afghanistan AFG 1990 0 #> 2 Afghanistan AFG 1991 0 #> 3 Afghanistan AFG 1992 0 #> 4 Afghanistan AFG 1993 0 #> 5 Afghanistan AFG 1994 0 #> 6 Afghanistan AFG 1995 0 #> 7 Afghanistan AFG 2001 0.00472 #> 8 Afghanistan AFG 2002 0.00456 #> 9 Afghanistan AFG 2003 0.0879 #> 10 Afghanistan AFG 2004 0.106 #> # … with 7,236 more rows
We can find information about the source of data by using
owid_source(), with the owid dataset object as the argument. This gives us the original publisher of the data as well as a link to the data. For some datasets additional information about how the variables is calculated is also provided. Using
view_chart() takes you to the Our World in Data webpage for that dataset, where there is also additional information and a pretty graph.
owid_source(internet) #> Dataset Name: International Telecommunication Union (via World Bank) #> #> Published By: World Development Indicators - World Bank (2021.07.30) #> #> Link: http://data.worldbank.org/data-catalog/world-development-indicators
To create simple plots to see how internet use has changed over time simply use
owid_plot(), filtering to give the World total. Given that this function is a wrapper around
ggplot2 you can use normal
ggplot2 functions to further manipulate the graph. I’m going to add a title using
labs() and change the y axis scale so that it starts from 0 (this makes the graph clearer to interpret given that the value is a percentage, otherwise small variations can appear large).
owid_plot(internet, filter = "World") + labs(title = "Share of the World Population using the Internet") + scale_y_continuous(limits = c(0, 100)) #> Loading required namespace: showtext
We can see how internet use varies between countries by creating a choropleth map using
owid_map. It shows that, in 2018, there is still a large variation in the level of internet use in countries, with many African countries having particularly low use.
owid_map(internet, year = 2017) + labs(title = "Share of Population Using the Internet, 2017")
It’s also possible to compare countries level of internet use across time, again using
owid_plot(). By using the argument
summarise = FALSE,
owid_plot() will show individual countries instead of aggregating them into the total. You can then use the
filter argument to select which countries you want to be displayed.
owid_plot(internet, summarise = FALSE, filter = c("United Kingdom", "Spain", "Russia", "Egypt", "Nigeria")) + labs(title = "Share of Population with Using the Internet") + scale_y_continuous(limits = c(0, 100), labels = scales::label_number(suffix = "%")) # The labels argument allows you to make it clear that the value is a percentage
Now let’s get data on democracy, first searching for a data source and then importing it using
owid(). Using that data we’ll do some similar exploration to what we did with internet use data.
<- owid("political-regime-updated2016-distinction-democracies-and-full-democracies", democracy rename = "polity") %>% filter(year %in% 1960:2020) democracy#> # A tibble: 8,817 × 4 #> entity code year polity #> <chr> <chr> <int> <int> #> 1 Afghanistan AFG 1960 -10 #> 2 Afghanistan AFG 1961 -10 #> 3 Afghanistan AFG 1962 -10 #> 4 Afghanistan AFG 1963 -10 #> 5 Afghanistan AFG 1964 -7 #> 6 Afghanistan AFG 1965 -7 #> 7 Afghanistan AFG 1966 -7 #> 8 Afghanistan AFG 1967 -7 #> 9 Afghanistan AFG 1968 -7 #> 10 Afghanistan AFG 1969 -7 #> # … with 8,807 more rows owid_source(democracy) #> Dataset Name: Political Regime (OWID based on Polity IV and Wimmer & Min) #> #> Published By: Our World In Data combined two datasets: Wimmer and Min (2006) for information on whether a country was colonized; Center for Systemic Peace for a measure of the political regime. #> #> Link: #> #> Polity 2 Measure ranges from -10 (autocracy) to +10 (full democracy). If a country was colonized in a given year is encoded as -20. In cases in which there was data from both Min and Wimmer and also Polity IV the Polity IV data is shown. owid_map(democracy, palette = "YlGn") + labs(title = "Political Regime (Polity IV)")
So we’ve done some nice exploratory analysis and produced some pretty graphs, but now let’s get into some more in depth analysis. We’ll use a fixed effect (FE) regression analysis to estimate the average effect that an increase in internet use has on democracy within a country. If you aren’t familiar with FE regression this article explain its purpose well and this chapter from Introduction to Econometrics with R shows how to implement it in R. To estimate the effect of internet use we’re going to use a within-unit fixed effects model.
This model will require us to adjust for confounding factors, so we’ll need some extra data. I’m going to use data on variables that I think might be confounding the relationship between internet use and democracy. These are: GDP per Capita, Government Expenditure, Age Dependency and Unemployment. There are almost certainly more confounding factors so feel free to use
owid_search() to find data on other variables you think might be confounders and add them to the analysis.
<- owid("gdp-per-capita-worldbank", rename = "gdp") gdp <- owid("total-gov-expenditure-gdp-wdi", rename = "gov_exp") gov_exp <- owid("age-dependency-ratio-of-working-age-population", rename = "age_dep") age_dep <- owid("unemployment-rate", rename = "unemp")unemployment
In order to create an FE model, all these separate dataframes now need to combined into one. To do this I’m going to use the
left_join() function from
dplyr and create a new dataframe called
data that combines all the other dataframes.
<- internet %>% data left_join(democracy) %>% left_join(gdp) %>% left_join(gov_exp) %>% left_join(age_dep) %>% left_join(unemployment) #> Joining, by = c("entity", "code", "year") #> Joining, by = c("entity", "code", "year") #> Joining, by = c("entity", "code", "year") #> Joining, by = c("entity", "code", "year") #> Joining, by = c("entity", "code", "year")
Now that we have a combined dataset we can get to the analysis. First, let’s use
ggplot2 create a graph to see the correlation between internet access and democracy in 2015.
%>% data filter(year == 2015) %>% ggplot(aes(internet_use, polity)) + geom_point(colour = "#57677D") + geom_smooth(method = "lm", colour = "#DC5E78") + labs(title = "Relationship Between Internet Use and Polity IV Score", x = "Internet Use", y = "Polity IV") + theme_owid() #> `geom_smooth()` using formula 'y ~ x' #> Warning: Removed 90 rows containing non-finite values (stat_smooth). #> Warning: Removed 90 rows containing missing values (geom_point).
There appears to be some relationship but this could easily be explained by countries with higher development also being more democratic and not actually the result of internet access. That’s why we control for GDP and the other confounders. Next, we’ll create two models, one with just internet use and democracy, and the other with the confounders added.
<- plm(polity ~ internet_use, data, fe_model effect = c("individual"), index = "entity") <- plm(polity ~ internet_use + gdp + gov_exp + age_dep + unemp, data, fe_model_2 effect = c("individual"), index = "entity") htmlreg(list(fe_model, fe_model_2))
|Model 1||Model 2|
|p < 0.001; p < 0.01; p < 0.05|
You can see that internet use has a significant positive effect in the first model, but once the confounders are added the effect is insignificant. This means that our model provides no evidence that internet use has an effect on democracy. However, feel free to play around with this data yourself and see if you get a different result when other variables are used.