fplot plots distributions, automatically adjusting its parameters for the user. The syntax relies on formulas, so conditional or weighted distributions can be drawn with minimal effort. The many arguments are automatically adjusted to produce the most meaningful (and hopefully beautiful!) graphs for the data at hand.

Although the core of the package concerns distributions, fplot also offers functions to plot trends (possibly conditional) and conditional boxplots.

Since the main purpose of this package is to produce publication-ready graphs, you can globally relabel all the variables to be displayed with setFplot_dict. In this vignette this is done once:

setFplot_dict(c(Origin = "Exporting Country",
                Destination = "Importing Country",
                Euros = "Exports Value in €",
                jnl_top_25p = "Pub. in Top 25% journal",
                jnl_top_5p = "Publications in Top 5% journal",
                journal = "Journal",
                institution = "U.S. Institution",
                Petal.Length = "Petal Length"))

1 Distributions

The function plot_distr plots distributions. Its syntax is as follows:

plot_distr(weight(s) ~ variable | moderator, data)

This function is very versatile: its default behavior depends entirely on the data it is given.

1.1 Categorical data

We will use data on publications extracted from the Microsoft Academic Graph.

library(fplot)
data(us_pub_econ, package = "fplot")

This data corresponds to publications from U.S. institutions in the field of economics over the period 1985-1990. Here is a sample of the data:

paper_id  year  institution                        journal                                   jnl_top_25p  jnl_top_5p
865948    1988  university of california berkeley  university of san francisco law review    0            0
2027349   1990  university of wisconsin madison    climatic change                           1            0
2299436   1988  columbia university                marine resource economics                 0            0
2558483   1988  yale university                    journal of accounting auditing & finance  0            0
2672463   1989  university of michigan             administration in social work             0            0
2860349   1985  university of kansas               administration in social work             0            0

What’s the distribution of publications across institutions?

# When there is only one variable, you can use a vector
plot_distr(us_pub_econ$institution)

What do we notice? First, only the 15 most frequent institutions are plotted. Since institution is a categorical variable, plotting all 444 distinct values would have cluttered the graph and made it unreadable. The number of bars plotted can be increased with the argument nbins.

Although only the top institutions are plotted, we still have information on the remaining ones: the last (broken) bar reports the number of publications they account for. This bar can be dropped with the argument other = FALSE.
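As a minimal sketch of the two arguments just mentioned, the calls below change the number of bars and drop the "other" bar. The value 20 is an arbitrary choice made for illustration, not a recommendation from the package.

library(fplot)
data(us_pub_econ, package = "fplot")

# Show 20 institutions instead of the default 15 (20 is an arbitrary value)
plot_distr(us_pub_econ$institution, nbins = 20)

# Same default graph, without the last (broken) bar for the remaining institutions
plot_distr(us_pub_econ$institution, other = FALSE)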

We can also notice that, beyond the shape of the distribution, the number of observations is reported on top of each bar to give a quick sense of the totals. Further, the size of the numbers always fits the width of the bars. The argument top manages the information displayed on top of the bars: it can be numbers (top = "nb"), fractions (top = "frac"), or nothing (top = "none"). The default depends on the type of data.

Finally, the labels on the x-axis: i) all appear and ii) are truncated meaningfully. Since the institution variable is categorical, there is no way to deduce which institution each bar represents unless its label is displayed explicitly. The labels are further truncated at 20 characters (argument trunc), and the truncation method tries to keep as much information from the labels as possible (trunc.method = "auto"). Other truncation methods are available, such as trimming the right (resp. the middle) of the label with trunc.method = "right" (resp. trunc.method = "mid"). To display horizontal labels instead, you can use labels.tilted = FALSE.
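The display arguments described above can be combined in a single call. The following is only a sketch: the truncation length of 15 is an arbitrary value chosen for illustration, and how readable the result is will depend on the size of the graphics device.

library(fplot)
data(us_pub_econ, package = "fplot")

# Fractions instead of counts on top of the bars
plot_distr(us_pub_econ$institution, top = "frac")

# Labels truncated at 15 characters (arbitrary value), trimmed on the right
plot_distr(us_pub_econ$institution, trunc = 15, trunc.method = "right")

# Horizontal labels instead of tilted ones
plot_distr(us_pub_econ$institution, labels.tilted = FALSE)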

Now let’s weight the production of the institutions by the quality of the journals:

plot_distr(jnl_top_5p ~ institution, us_pub_econ)

We added the weight directly on the left-hand side of the formula. Notice that Harvard now ranks 1st (it was 3rd when not weighting by quality).
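To see what the weighting changes, the ranking can be cross-checked in base R. This sketch assumes that a weighted bar is the sum of the weight within each institution; that is an assumption about how the weights are aggregated, not something stated above.

library(fplot)
data(us_pub_econ, package = "fplot")

# Unweighted: number of publications per institution
n_pub <- table(us_pub_econ$institution)

# Weighted (assumption: weights are summed): top 5% publications per institution
n_top5 <- tapply(us_pub_econ$jnl_top_5p, us_pub_econ$institution, sum)

# Top 3 institutions under each ranking
head(sort(n_pub, decreasing = TRUE), 3)
head(sort(n_top5, decreasing = TRUE), 3)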

We can add several weights at the same time:

plot_distr(1 + jnl_top_25p + jnl_top_5p ~ institution, us_pub_econ)

As we can see, the more we weight by quality, the more the distribution is skewed towards the top institutions.

We now illustrate conditional graphs by asking: in which journals do the 3 most productive institutions publish?

plot_distr(~ journal | institution, us_pub_econ)
#> The 3 moderators were chosen based on frequency.

What happened? The journals are displayed on the x-axis since they are the “variable”, and since we added the institution as a moderator, the data is broken down by institution. Not all 444 institutions are shown, however: by default only the three most frequent moderator values (here institutions) are reported. Whenever such an automatic selection takes place, a message is displayed. The user can increase the number of moderator values reported with the argument mod.select, or use it to provide the specific values to be shown.
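A short sketch of the two uses of mod.select follows. The count of 4 is an arbitrary choice, and the two institution names are taken from the data sample shown earlier; whether the package accepts other forms for this argument is not covered here.

library(fplot)
data(us_pub_econ, package = "fplot")

# Report 4 institutions instead of the default 3 (4 is an arbitrary value)
plot_distr(~ journal | institution, us_pub_econ, mod.select = 4)

# Report two specific institutions (names taken from the data sample)
plot_distr(~ journal | institution, us_pub_econ,
           mod.select = c("yale university", "columbia university"))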

The 7 most important journals are displayed for each institution and the scale corresponds to the share of the journal within the institution. Instead of displaying the within share, the total share could have been displayed with the option total = TRUE.
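The difference between the within share and the total share can be illustrated in base R on a tiny data set. The data below is hypothetical, and the sketch assumes that "within share" means counts normalized per institution while "total share" means counts normalized over all observations.

# Hypothetical toy data: institution A has 3 publications, B has 1
toy <- data.frame(
  institution = c("A", "A", "A", "B"),
  journal     = c("j1", "j1", "j2", "j1")
)

# Journal-by-institution counts
counts <- table(toy$journal, toy$institution)

# Within share: each institution (column) sums to 1
within_share <- prop.table(counts, margin = 2)

# Total share: all cells together sum to 1
total_share <- prop.table(counts)

within_share
total_share

With the within share, the single publication of institution B counts for its full bar, whereas with the total share it only counts for one quarter of all publications.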

Here the numbers on top of the bars come in handy for comparisons across institutions. Importantly, because the journal names are long, it would have been impossible to display them in a standard way. As we can see, they are first truncated (meaningfully) and then tilted. In the end, the graph is fully informative.

By default, when the data is split, the bar containing the “other” information is not displayed to save space. But of course, we can add it with the appropriate option:

# Previous graph + "other" column
plot_distr(~ journal | institution, us_pub_econ, other = TRUE)
#> The 3 moderators were chosen based on frequency.