Histograms (and bar plots) are common tools to visualize a single variable. The x axis is often used to locate the bins and the y axis is for the counts. Density plots can be considered as the smoothed version of the histogram.
A boxplot is another way to visualize one-dimensional data; its five summary statistics can be read directly off the plot. Unlike histograms and density plots, however, a boxplot can accommodate two variables: groups (often on the x axis) and the y values (on the y axis). In ggplot2, geom_histogram and geom_density only accept one variable, x or y (swapped). Providing both positions is forbidden. Inspired by the boxplot (geom_boxplot in ggplot2), we create the functions geom_histogram_, geom_bar_ and geom_density_, which can accommodate both variables, just like the boxplot.
Consider the mtcars data set, and suppose that we are interested in the distribution of gear (number of gears) given cyl (number of cylinders).
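A minimal sketch of the kind of call that produces a chart like Figure 1, assuming the ggplot2 and ggmulti packages are installed. The object name p and the axis labels are our own choices, and the exact arguments used for the figure may differ.

```r
library(ggplot2)
library(ggmulti)

# Both positions are supplied: cyl on the x axis, gear on the y axis.
# Wrapping each in factor() marks the variable as discrete.
p <- ggplot(mtcars, mapping = aes(x = factor(cyl), y = factor(gear))) +
  geom_bar_() +
  labs(x = "cyl", y = "gear")
```

Printing p draws the chart.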
From Figure 1, we can tell that:
Compare vertically: given the number of cylinders, read off the gears
Most 8-cylinder cars have a 3-gear transmission; none of them has a 4-gear transmission.
Most 4-cylinder cars have a 4-gear transmission.
Compare horizontally: given the number of gears, read off the cylinders
Most 3-gear cars have an 8-cylinder engine.
Most 4-gear cars have a 4-cylinder engine, some have a 6-cylinder engine, and none has an 8-cylinder engine.
5-gear cars can have a 4-, 6- or 8-cylinder engine. However, compared with the other two transmissions, 5 gears is not a common choice.
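The same comparisons can be checked numerically. The sketch below uses base R only; the names tab, by_cyl and by_gear are ours. It is a cross-check of the reading above, not part of the plotting functions.

```r
# Cross-tabulate cylinders (rows) against gears (columns).
tab <- table(cyl = mtcars$cyl, gear = mtcars$gear)
tab

# Vertical comparison: share of each gear count within a cylinder group.
by_cyl <- prop.table(tab, margin = 1)
round(by_cyl, 2)

# Horizontal comparison: share of each cylinder group within a gear count.
by_gear <- prop.table(tab, margin = 2)
round(by_gear, 2)
```

Each row of by_cyl and each column of by_gear sums to 1, so the largest entry in a row or column identifies the most common choice.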
Suppose now that we are interested in the distribution of mpg (miles per gallon) with respect to cyl (as the “x” axis) and gear (as the “fill”).
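A sketch of a call matching this description, assuming ggplot2 and ggmulti are installed. The object name p and the labels are ours, and the arguments used for the original figure may differ.

```r
library(ggplot2)
library(ggmulti)

# cyl (discrete) goes on the x axis, mpg (continuous) on the y axis,
# and gear sets the fill colour of the bars.
p <- ggplot(mtcars,
            mapping = aes(x = factor(cyl), y = mpg, fill = factor(gear))) +
  geom_histogram_() +
  labs(x = "cyl", fill = "gear")
```

Printing p draws one histogram of mpg per cylinder group.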
From Figure 2, we can easily tell that as the number of cylinders rises, the miles per gallon drops significantly. Moreover, the number of six-cylinder cars in our data is much smaller than that of the other two groups. In addition, 8-cylinder cars have either a 3-gear or a 5-gear transmission (consistent with the conclusion drawn above).
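These three observations can be confirmed directly from the data. This is a base R sketch; the variable names are ours.

```r
# Mean miles per gallon within each cylinder group.
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
round(mpg_by_cyl, 1)

# Number of cars in each cylinder group.
n_by_cyl <- table(mtcars$cyl)
n_by_cyl

# Gear counts that actually occur among 8-cylinder cars.
gears_8 <- sort(unique(mtcars$gear[mtcars$cyl == 8]))
gears_8
```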
geom_histogram_ is typically used when one variable is discrete and the other is continuous, while geom_bar_ accommodates two discrete variables. The former relies on stat = bin_ and the latter on stat = count_. However, if we convert the factor of interest to numeric in geom_bar_, there is no difference between the output of a bar plot and that of a histogram. Hence, the function geom_hist_ is provided to simplify the process. It understands both cases, so users can simply call geom_hist_ to create either a bar plot or a histogram.
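The equivalence between counting a discrete variable and binning it as a number can be illustrated with base R alone. This is a sketch of the idea, not of how the package implements it; the break points are an assumption chosen so that each bin contains exactly one gear value.

```r
# gear is discrete, but it is stored as a number in mtcars.
bar_counts <- table(mtcars$gear)

# Treat gear as numeric and bin it with unit-width bins centred on 3, 4, 5.
hist_counts <- hist(mtcars$gear, breaks = seq(2.5, 5.5, by = 1),
                    plot = FALSE)$counts

# The bar heights and the histogram counts coincide.
all(as.integer(bar_counts) == hist_counts)
```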
We could also draw density plots side by side to better convey the data of interest. With geom_density_, both summaries can be displayed simultaneously in one chart.
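One way to read "both summaries" is a histogram layer with a density layer drawn over it. The sketch below makes that assumption and requires ggplot2 and ggmulti; the alpha value and the object name p are our own choices.

```r
library(ggplot2)
library(ggmulti)

# Histogram and density of mpg for each cylinder group, in one chart.
# alpha makes the density layer translucent so the bars stay visible.
p <- ggplot(mtcars,
            mapping = aes(x = factor(cyl), y = mpg, fill = factor(gear))) +
  geom_histogram_() +
  geom_density_(alpha = 0.3) +
  labs(x = "cyl", fill = "gear")
```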
scaleY is used to set the scale of each density (or bar). The default, “data”, indicates that the area of each density is proportional to the count of its group.
For example, the area of the 8-cylinder group is approximately twice that of the 6-cylinder group.
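The factor of two follows from the group sizes. If areas are proportional to counts, the ratio of two areas equals the ratio of the two counts, which base R can confirm (variable names are ours):

```r
# Number of cars in each cylinder group.
n <- table(mtcars$cyl)
n

# Ratio of the 8-cylinder count to the 6-cylinder count.
ratio_8_to_6 <- n[["8"]] / n[["6"]]
ratio_8_to_6
```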
If only one variable is provided to geom_density_() (the same applies to geom_bar_()), the original function geom_density() is executed automatically, so the result is identical to calling geom_density() directly. However, if we take a look at such a chart, we realize that the area of each group is 1. In other words, the total area is 3. geom_density_ has a parameter called asOne. If it is set to TRUE, the total density area is 1 and the area of each group is proportional to its own count.
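The two normalizations can be compared numerically. The sketch below uses base R's density() and a trapezoidal rule, so it illustrates the areas described above rather than the package's internals; the helper area() and the other names are ours.

```r
# Trapezoidal rule for the area under a curve given by points (x, y).
area <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

# mpg values split by cylinder group.
groups <- split(mtcars$mpg, mtcars$cyl)

# One kernel density estimate per group; each has area close to 1.
dens <- lapply(groups, density)
unit_areas <- sapply(dens, function(d) area(d$x, d$y))
sum(unit_areas)   # close to 3, one unit of area per group

# Weight each density by its group's share of the data, which is the
# behaviour the text describes for asOne = TRUE.
w <- lengths(groups) / nrow(mtcars)
mix_areas <- unit_areas * w
sum(mix_areas)    # close to 1
```

After weighting, the 8-cylinder area is about twice the 6-cylinder area, in line with the group counts.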
Note that when we set position in geom_density_, we should use the underscore form, that is, “stack_”, “dodge_” or “dodge2_” (instead of “stack”, “dodge” or “dodge2”).
In geom_density_, we can stack the densities on top of each other by setting position = 'stack_' (the default is position = 'identity_').
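A sketch of a stacked call, assuming ggplot2 and ggmulti are installed. The data mapping reuses the mtcars example from earlier, and the object name p_stack is ours.

```r
library(ggplot2)
library(ggmulti)

# Densities of mpg within each cylinder group, with the gear groups
# stacked on top of each other instead of overlaid.
p_stack <- ggplot(mtcars,
                  mapping = aes(x = factor(cyl), y = mpg,
                                fill = factor(gear))) +
  geom_density_(position = "stack_") +
  labs(x = "cyl", fill = "gear")
```

Omitting the position argument gives the overlaid default described above.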
Dodging preserves the vertical position of a geom while adjusting the horizontal position (the default position of