5 Bars & histograms
The add_bars() and add_histogram() functions wrap the bar and histogram plotly.js trace types. The main difference between them is that bar traces require bar heights (both x and y), whereas histogram traces require just a single variable, and plotly.js handles binning in the browser.15 Perhaps confusingly, both of these functions can be used to visualize the distribution of either a numeric or a discrete variable. So, essentially, the only difference between them is where the binning occurs.
Figure 5.1 compares the default binning algorithm in plotly.js to a few different algorithms available in R via the hist() function. Although plotly.js can customize histogram bins via xbins/ybins, R has diverse facilities for estimating the optimal number of bins in a histogram that we can easily leverage.16 The hist() function alone allows us to reference 3 famous algorithms by name (Sturges 1926; Freedman and Diaconis 1981; Scott 1979), but there are also packages (e.g. the histogram package) which extend this interface to incorporate more methodology (Mildenberger, Rozenholc, and Zasada 2009). The price_hist() function below wraps the hist() function to obtain the binning results, and maps those bins to a plotly version of the histogram using add_bars().
library(plotly)

# binning done in the browser by plotly.js
p1 <- plot_ly(diamonds, x = ~price) %>%
  add_histogram(name = "plotly.js")

# binning done in R by hist(), then drawn as bars
price_hist <- function(method = "FD") {
  h <- hist(diamonds$price, breaks = method, plot = FALSE)
  plot_ly(x = h$mids, y = h$counts) %>% add_bars(name = method)
}

subplot(
  p1, price_hist(), price_hist("Sturges"), price_hist("Scott"),
  nrows = 4, shareX = TRUE
)
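The xbins attribute mentioned above can also be set directly when the binning is left to plotly.js. The following is a minimal sketch of that idea; the start, end, and size values are illustrative assumptions chosen to cover the range of diamond prices, not recommended settings.

```r
library(plotly)

# Fixed-width bins computed by plotly.js in the browser.
# start, end, and size are illustrative values (in the units of price).
p_bins <- plot_ly(diamonds, x = ~price) %>%
  add_histogram(
    name = "plotly.js, size = 500",
    xbins = list(start = 0, end = 20000, size = 500)
  )
p_bins
```

Setting the bins this way keeps the binning in the browser, so all observed prices are still sent to it; only the bin boundaries change.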
Figure 5.2 demonstrates two ways of creating a basic bar chart. Although the visual results are the same, it’s worth noting the difference in implementation. The add_histogram() function sends all of the observed values to the browser and lets plotly.js perform the binning. It takes more human effort to perform the binning in R, but doing so has the benefit of sending less data and requiring less computational work of the web browser. In this case, we have only about 50,000 records, so there is not much of a difference in page load times or page size. However, with 1 million records, page load time more than doubles and page size nearly doubles.17
library(dplyr)

p1 <- plot_ly(diamonds, x = ~cut) %>%
  add_histogram()

p2 <- diamonds %>%
  count(cut) %>%
  plot_ly(x = ~cut, y = ~n) %>%
  add_bars()

subplot(p1, p2) %>% hide_legend()
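To get a rough sense of why binning in R sends less data, one can compare the number of values each approach hands to the browser. This sketch only counts rows; it does not measure page size or load time, and the variable names are hypothetical.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data

# add_histogram() ships one value of cut per observation
n_raw <- nrow(diamonds)

# add_bars() after count() ships one (category, count) pair per level of cut
binned <- count(diamonds, cut)
n_binned <- nrow(binned)

c(raw = n_raw, binned = n_binned)
```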
5.1 Multiple numeric distributions
It is often useful to see how the numeric distribution changes with respect to a discrete variable. When using bars to visualize multiple numeric distributions, I recommend plotting each distribution on its own axis using a small multiples display, rather than trying to overlay them on a single axis.18 Chapter 13, and specifically Section 13.1.2.3, discusses small multiples in more detail, but Figure 13.9 demonstrates how it can be done with plot_ly() and subplot(). Note how the one_plot() function defines what to display on each panel, then a split-apply-recombine (i.e., split(), lapply(), subplot()) strategy is employed to generate the trellis display.
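# no trace type is given, so plotly infers a histogram from the single x variable
one_plot <- function(d) {
  plot_ly(d, x = ~price) %>%
    add_annotations(
      ~unique(clarity), x = 0.5, y = 1,
      xref = "paper", yref = "paper", showarrow = FALSE
    )
}

# one panel per clarity level, combined into a trellis display
diamonds %>%
  split(.$clarity) %>%
  lapply(one_plot) %>%
  subplot(nrows = 2, shareX = TRUE, titleX = FALSE) %>%
  hide_legend()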
5.2 Multiple discrete distributions
Visualizing multiple discrete distributions is difficult. The subtle complexity is due to the fact that both counts and proportions are important for understanding multi-variate discrete distributions. Figure 5.4 presents diamond counts, divided by both their cut and clarity, using a grouped bar chart.
Figure 5.4 is useful for comparing the number of diamonds by clarity, given a type of cut. For instance, within “Ideal” diamonds, a clarity of “VS2” is most popular, “SI1” is second most popular, and “I1” is the least popular. The distribution of clarity within “Ideal” diamonds seems to be fairly similar to other diamonds, but it’s hard to make this comparison using raw counts. Figure 5.5 makes this comparison easier by showing the relative frequency of diamonds by clarity, given a cut.
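A grouped bar chart of counts by cut and clarity, like the one described for Figure 5.4, could be produced along the following lines. This is a sketch under the assumption that the binning is left to plotly.js; it is not necessarily the code behind the figure.

```r
library(plotly)

# One histogram trace per clarity level; plotly.js counts diamonds within each cut.
# Multiple bar-like traces on a shared discrete axis are drawn side by side
# (barmode = "group") by default.
p_grouped <- plot_ly(diamonds, x = ~cut, color = ~clarity) %>%
  add_histogram()
p_grouped
```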
# number of diamonds by cut and clarity (n)
cc <- count(diamonds, cut, clarity)
# number of diamonds by cut (nn)
cc2 <- left_join(cc, count(cc, cut, wt = n, name = 'nn'), by = "cut")
# relative frequency of clarity within each cut, shown as stacked bars
cc2 %>%
  mutate(prop = n / nn) %>%
  plot_ly(x = ~cut, y = ~prop, color = ~clarity) %>%
  add_bars() %>%
  layout(barmode = "stack")
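The same conditional proportions can also be computed without a join, by grouping on cut. The sketch below is one alternative, with a quick check that the proportions within each cut sum to one; the props object is a hypothetical name introduced for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data

# proportion of each clarity level within each cut
props <- diamonds %>%
  count(cut, clarity) %>%
  group_by(cut) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# sanity check: proportions should sum to 1 within every cut
props %>%
  group_by(cut) %>%
  summarise(total = sum(prop))
```

Passing props to plot_ly() with x = ~cut, y = ~prop, and color = ~clarity should give the same stacked bars as above.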
This type of plot, also known as a spine plot, is a special case of a mosaic plot. In a mosaic plot, you can scale both bar widths and heights according to discrete distributions. For mosaic plots, I recommend using the ggmosaic package (Jeppson, Hofmann, and Cook 2016), which implements a custom ggplot2 geom designed for mosaic plots that we can convert to plotly via ggplotly(). Figure 5.6 shows a mosaic plot of cut by clarity. Notice how the bar widths are scaled proportional to the cut frequency.
library(ggmosaic)

p <- ggplot(data = cc) +
  geom_mosaic(aes(weight = n, x = product(cut), fill = clarity))
ggplotly(p)
References
Freedman, D., and P. Diaconis. 1981. “On the Histogram as a Density Estimator: L2 Theory.” Zeitschrift Für Wahrscheinlichkeitstheorie Und Verwandte Gebiete 57: 453–76.
Jeppson, Haley, Heike Hofmann, and Di Cook. 2016. Ggmosaic: Mosaic Plots in the ’Ggplot2’ Framework. http://github.com/haleyjeppson/ggmosaic.
Mildenberger, Thoralf, Yves Rozenholc, and David Zasada. 2009. Histogram: Construction of Regular and Irregular Histograms with Different Options for Automatic Choice of Bins. https://CRAN.R-project.org/package=histogram.
Scott, David W. 1979. “On Optimal and Data-Based Histograms.” Biometrika 66: 605–10.
Sturges, Herbert A. 1926. “The Choice of a Class Interval.” Journal of the American Statistical Association 21 (153): 65–66. https://doi.org/10.1080/01621459.1926.10502161.
15. As we’ll see in Section 16.1, and specifically Figure 16.6, using a ‘statistical’ trace type like add_histogram() enables statistical graphical queries.
16. Optimal in this context is the number of bins which minimizes the distance between the empirical histogram and the underlying density.
17. These tests were run on Google Chrome and loaded a page with a single bar chart. See https://www.webpagetest.org/result/160924_DP_JBX for add_histogram() and https://www.webpagetest.org/result/160924_QG_JA1 for add_bars().
18. It’s much easier to visualize multiple numeric distributions on a single axis using lines.