2.3 Bars & histograms

The add_bars() and add_histogram() functions wrap the bar and histogram plotly.js trace types. The main difference between them is that bar traces require bar heights (both x and y), whereas histogram traces require just a single variable, and plotly.js handles binning in the browser.7 And perhaps confusingly, both of these functions can be used to visualize the distribution of either a numeric or a discrete variable. So, essentially, the only difference between them is where the binning occurs.

Figure 2.22 compares the default binning algorithm in plotly.js to a few different algorithms available in R via the hist() function. Although plotly.js has the ability to customize histogram bins via xbins/ybins, R has diverse facilities for estimating the optimal number of bins in a histogram that we can easily leverage.8 The hist() function alone allows us to reference 3 famous algorithms by name (Sturges 1926); (Freedman and Diaconis 1981); (Scott 1979), but there are also packages (e.g. the histogram package) which extend this interface to incorporate more methodology (Mildenberger, Rozenholc, and Zasada. 2009). The price_hist() function below wraps the hist() function to obtain the binning results, and map those bins to a plotly version of the histogram using add_bars().

p1 <- plot_ly(diamonds, x = ~price) %>% add_histogram(name = "plotly.js")

price_hist <- function(method = "FD") {
  h <- hist(diamonds$price, breaks = method, plot = FALSE)
  plot_ly(x = h$mids, y = h$counts) %>% add_bars(name = method)
}

subplot(
  p1, price_hist(), price_hist("Sturges"),  price_hist("Scott"),
  nrows = 4, shareX = TRUE
)

Figure 2.22: plotly.js’s default binning algorithm versus R’s hist() default

Figure 2.23 demonstrates two ways of creating a basic bar chart. Although the visual results are the same, its worth noting the difference in implementation. The add_histogram() function sends all of the observed values to the browser and lets plotly.js perform the binning. It takes more human effort to perform the binning in R, but doing so has the benefit of sending less data, and requiring less computation work of the web browser. In this case, we have only about 50,000 records, so there is much of a difference in page load times or page size. However, with 1 Million records, page load time more than doubles and page size nearly doubles.9

p1 <- plot_ly(diamonds, x = ~cut) %>% add_histogram()

p2 <- diamonds %>%
  dplyr::count(cut) %>%
  plot_ly(x = ~cut, y = ~n) %>% 
  add_bars()

subplot(p1, p2) %>% hide_legend()

Figure 2.23: Number of diamonds by cut.

2.3.1 Multiple numeric distributions

It is often useful to see how the numeric distribution changes with respect to a discrete variable. When using bars to visualize multiple numeric distributions, I recommend plotting each distribution on its own axis, rather than trying to overlay them on a single axis.10. This is where the subplot() infrastructure, and its support for trellis displays, comes in handy. Figure 2.24 shows a trellis display of diamond price by diamond color. Note how the one_plot() function defines what to display on each panel, then a split-apply-recombine strategy is employed to generate the trellis display.

one_plot <- function(d) {
  plot_ly(d, x = ~price) %>%
    add_annotations(
      ~unique(clarity), x = 0.5, y = 1, 
      xref = "paper", yref = "paper", showarrow = FALSE
    )
}

diamonds %>%
  split(.$clarity) %>%
  lapply(one_plot) %>% 
  subplot(nrows = 2, shareX = TRUE, titleX = FALSE) %>%
  hide_legend()