2.1 Scatter traces

A plotly visualization is composed of one or more traces, and every trace has a type. The default trace type, “scatter”, can draw a wide range of geometries, and powers many of the add_*() functions, such as add_markers(), add_lines(), add_paths(), add_segments(), add_ribbons(), and add_polygons(). Among other things, these functions make assumptions about the mode of the scatter trace, but any valid attribute listed under the scatter section of the figure reference may be used to override the defaults.
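As a rough sketch of overriding those defaults, the block below passes an extra scatter attribute (marker) through add_markers(), and sets the mode explicitly through add_trace(). The attribute values are illustrative choices for this sketch, not recommendations from the figure reference.

```r
library(plotly)

p <- plot_ly(mpg, x = ~cty, y = ~hwy)

# add_markers() assumes mode = "markers"; other scatter attributes, such as
# `marker`, are passed through to the trace (values chosen for illustration)
p_markers <- add_markers(p, marker = list(size = 10, opacity = 0.5))

# add_trace() makes no assumption about the mode, so it is set explicitly here
p_both <- add_trace(p, type = "scatter", mode = "lines+markers")
```

Printing p_markers or p_both renders the corresponding figure.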

The plot_ly() function has a number of arguments that make it easier to scale data values to visual aesthetics (e.g., color/colors, symbol/symbols, linetype/linetypes, size/sizes). These arguments are unique to the R package and dynamically determine which objects in the figure reference to populate (e.g., marker.color vs line.color). Generally speaking, the singular form of the argument defines the domain of the scale (data) and the plural form defines the range of the scale (visuals). To make it easier to alter default visual aesthetics (e.g., change all points from blue to black), “AsIs” values (values wrapped with the I() function) are interpreted as values that already live in visual space, and thus do not need to be scaled. The next section on scatterplots explores detailed use of the color/colors, symbol/symbols, and size/sizes arguments. The section on line plots explores detailed use of the linetype/linetypes arguments.
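A minimal sketch of the distinction between data values and “AsIs” values follows. The object names (black, p_mapped, p_asis) are made up for this illustration.

```r
library(plotly)

# I() only adds the "AsIs" class; the underlying value is unchanged
black <- I("black")
inherits(black, "AsIs")

# Data-mapped color: the values of `cyl` are scaled through a colorscale
p_mapped <- plot_ly(mpg, x = ~cty, y = ~hwy, color = ~cyl)

# AsIs color: the value is already a color, so no scaling takes place
p_asis <- plot_ly(mpg, x = ~cty, y = ~hwy, color = black)
```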

2.1.1 Scatterplots

The scatterplot is useful for visualizing the correlation between two quantitative variables. If you supply numeric vectors for x and y in plot_ly(), it defaults to a scatterplot, but you can also be explicit about adding a layer of markers/points via the add_markers() function. A common problem with scatterplots is overplotting, meaning that multiple observations occupy the same (or similar) x/y locations. There are a few ways to combat overplotting, including alpha transparency, hollow symbols, and 2D density estimation. Figure 2.1 shows how alpha transparency and hollow symbols can provide an improvement over the default.

library(plotly)

subplot(
  plot_ly(mpg, x = ~cty, y = ~hwy, name = "default"),
  plot_ly(mpg, x = ~cty, y = ~hwy) %>% 
    add_markers(alpha = 0.2, name = "alpha"),
  plot_ly(mpg, x = ~cty, y = ~hwy) %>% 
    add_markers(symbol = I(1), name = "hollow")
)

Figure 2.1: Three versions of a basic scatterplot
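Figure 2.1 does not cover the third remedy, 2D density estimation. One possible sketch uses the histogram2d and histogram2dcontour trace types, which bin the x/y pairs and display counts; note that this is a binned approximation rather than a kernel density estimate, and the choice of these two trace types is an assumption of this example.

```r
library(plotly)

p <- plot_ly(mpg, x = ~cty, y = ~hwy)

# Both layers summarise how many observations fall in each x/y bin,
# so heavily overplotted regions show up as high counts
fig <- subplot(
  add_histogram2d(p),
  add_histogram2dcontour(p)
)
```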

In Figure 2.1, hollow circles are specified via symbol = I(1). By default, the symbol argument (as well as the color/size/linetype arguments) assumes value(s) are “data”, which need to be mapped to a visual palette (provided by symbols). Wrapping values with the I() function notifies plot_ly() that these values should be taken “AsIs”. If you compare the result of plot(1:25, 1:25, pch = 1:25) to Figure 2.2, you’ll see that plot_ly() can translate R’s plotting characters (pch), but you can also use plotly.js’ symbol syntax, if you desire.

subplot(
  plot_ly(x = 1:25, y = 1:25, symbol = I(1:25), name = "pch"),
  plot_ly(mpg, x = ~cty, y = ~hwy, symbol = ~cyl, 
          symbols = 1:3, name = "cyl")
)

Figure 2.2: Specifying symbol in a scatterplot
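For comparison, here is a small sketch that uses plotly.js symbol names instead of pch codes. The particular names chosen for open_syms are just one possible palette with one entry per distinct value of cyl.

```r
library(plotly)

# plotly.js symbol names, one per distinct value of cyl (4, 5, 6, 8)
open_syms <- c("circle-open", "square-open", "diamond-open", "x")

fig <- subplot(
  # a single symbol, taken "AsIs"
  plot_ly(mpg, x = ~cty, y = ~hwy, symbol = I("circle-open"),
          name = "circle-open"),
  # a discrete variable mapped onto a palette of symbol names
  plot_ly(mpg, x = ~cty, y = ~hwy, symbol = ~factor(cyl),
          symbols = open_syms)
)
```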

When mapping a numeric variable to symbol, it creates only one trace, so no legend is generated. If you do want one trace per symbol, make sure the variable you’re mapping is a factor, as Figure 2.3 demonstrates. When plotting multiple traces, the default plotly.js color scale will apply, but you can set the color of every trace generated from this layer with color = I("black"), or similar.

p <- plot_ly(mpg, x = ~cty, y = ~hwy, alpha = 0.3) 
subplot(
  add_markers(p, symbol = ~cyl, name = "A single trace"),
  add_markers(p, symbol = ~factor(cyl), color = I("black"))
)

Figure 2.3: Mapping symbol to a factor

The color argument adheres to similar rules as symbol:

  • If numeric, color produces one trace, but a colorbar is also generated to aid the decoding of colors back to data values. The colorbar() function can be used to customize the appearance of this automatically generated guide. The default colorscale is viridis, a perceptually-uniform colorscale (even when converted to black-and-white), and perceivable even to those with common forms of color blindness (Data Science 2016).

  • If discrete, color produces one trace per value, meaning a legend is generated. If an ordered factor, the default colorscale is viridis (Garnier 2016); otherwise, it is the “Set2” palette from the RColorBrewer package (Neuwirth 2014).

p <- plot_ly(mpg, x = ~cty, y = ~hwy, alpha = 0.5)
subplot(
  add_markers(p, color = ~cyl, showlegend = FALSE) %>% 
    colorbar(title = "Viridis"),
  add_markers(p, color = ~factor(cyl))
)

Figure 2.4: A numeric color mapping versus a discrete color mapping.

There are a number of ways to alter the default colorscale via the colors argument. This argument accepts: (1) a color brewer palette name (see the row names of RColorBrewer::brewer.pal.info for valid names), (2) a vector of colors to interpolate, or (3) a color interpolation function like colorRamp() or scales::colour_ramp(). Although this grants a lot of flexibility, one should be conscious of using a sequential colorscale for numeric variables (and ordered factors), as shown in Figure 2.5, and a qualitative colorscale for discrete variables, as shown in Figure 2.6.

col1 <- c("#132B43", "#56B1F7")
col2 <- viridisLite::inferno(10)
col3 <- colorRamp(c("red", "white", "blue"))
subplot(
  add_markers(p, color = ~cyl, colors = col1) %>%
    colorbar(title = "ggplot2 default"),
  add_markers(p, color = ~cyl, colors = col2) %>% 
    colorbar(title = "Inferno"),
  add_markers(p, color = ~cyl, colors = col3) %>% 
    colorbar(title = "colorRamp")
) %>% hide_legend()

Figure 2.5: Three variations on a numeric color mapping

col1 <- "Pastel1"
col2 <- colorRamp(c("red", "blue"))
col3 <- c(`4` = "red", `5` = "black", `6` = "blue", `8` = "green")
subplot(
  add_markers(p, color = ~factor(cyl), colors = col1),
  add_markers(p, color = ~factor(cyl), colors = col2),
  add_markers(p, color = ~factor(cyl), colors = col3)
) %>% hide_legend()

Figure 2.6: Three variations on a discrete color mapping
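To make the interpolation forms of the colors argument more concrete, the base R sketch below inspects what a colorRamp() function returns and previews an interpolated vector of colors. The use of colorRampPalette() for the preview is a choice made for this example, not a statement about how plot_ly() interpolates internally.

```r
# colorRamp() returns a function mapping numbers in [0, 1] to RGB values
ramp <- colorRamp(c("red", "white", "blue"))
rgb_vals <- ramp(c(0, 0.5, 1))   # one row per input; columns are R, G, B

# Convert the RGB rows to hex codes for easier reading
hex <- rgb(rgb_vals, maxColorValue = 255)

# Preview five colors interpolated between two endpoints
preview <- colorRampPalette(c("#132B43", "#56B1F7"))(5)
```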

For scatterplots, the size argument controls the area of markers (unless otherwise specified via sizemode), and must be a numeric variable. The sizes argument controls the minimum and maximum size of circles, in pixels:

subplot(
  add_markers(p, size = ~cyl, name = "default"),
  add_markers(p, size = ~cyl, sizes = c(1, 500), name = "custom")
)

Figure 2.7: Controlling the size range via sizes (measured in pixels).

2.1.1.1 3D scatterplots

To make a 3D scatterplot, just add a z attribute:

plot_ly(mpg, x = ~cty, y = ~hwy, z = ~cyl) %>%
  add_markers(color = ~cyl)

Figure 2.8: A 3D scatterplot

2.1.1.2 Scatterplot matrices

Scatterplot matrices can be made via plot_ly() and subplot(), but ggplotly() has a special method for translating ggmatrix objects from the GGally package to plotly objects (Schloerke et al. 2016). These objects are essentially a matrix of ggplot objects and are the underlying data structure which powers higher level functions in GGally, such as ggpairs() – a function for creating a generalized pairs plot (Emerson et al. 2013). The generalized pairs plot can be motivated as a generalization of the scatterplot matrix with support for categorical variables and different visual representations of the data powered by the grammar of graphics. Figure 2.9 shows an interactive version of the generalized pairs plot made via ggpairs() and ggplotly(). In Linking views without shiny, we explore how this framework can be extended to enable linked brushing in the generalized pairs plot.

pm <- GGally::ggpairs(iris)
ggplotly(pm)

Figure 2.9: An interactive version of the generalized pairs plot made via the ggpairs() function from the GGally package
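For the plot_ly() and subplot() route mentioned above, a bare-bones sketch might look like the following. It builds one scatterplot per pair of variables and hands the list to subplot(); axis titles and shared axes are left out, and the choice of three iris columns is arbitrary.

```r
library(plotly)

vars <- c("Sepal.Length", "Sepal.Width", "Petal.Length")

# One scatterplot per (x, y) pair of variables
panels <- list()
for (yv in vars) {
  for (xv in vars) {
    panels[[length(panels) + 1]] <- plot_ly(
      x = iris[[xv]], y = iris[[yv]],
      type = "scatter", mode = "markers", showlegend = FALSE
    )
  }
}

# Arrange the panels in a square grid
spm <- subplot(panels, nrows = length(vars))
```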

2.1.2 Dotplots & error bars

A dotplot is similar to a scatterplot, except instead of two numeric axes, one is categorical. The usual goal of a dotplot is to compare value(s) on a numerical scale over numerous categories. In this context, dotplots are preferable to pie charts since comparing position along a common scale is much easier than comparing angle or area (Cleveland and McGill 1984; Bostock 2010). Furthermore, dotplots can be preferable to bar charts, especially when comparing values within a narrow range far away from 0 (Few 2006). Also, when presenting point estimates and the uncertainty associated with those estimates, bar charts tend to exaggerate the difference in point estimates and lose focus on uncertainty (Messing 2012).

A popular application for dotplots (with error bars) is the so-called “coefficient plot” for visualizing the point estimates of coefficients and their standard errors. The coefplot() function in the coefplot package (Lander 2016) and the ggcoef() function in the GGally package both produce coefficient plots for many types of model objects in R using ggplot2, which we can translate to plotly via ggplotly(). Since these packages use points and segments to draw the coefficient plots, the hover information is not the best, and it’d be better to use error objects. Figure 2.10 uses the tidy() function from the broom package (Robinson 2016) to obtain a data frame with one row per model coefficient, and produces a coefficient plot with error bars along the x-axis.

library(dplyr)  # for arrange(), desc(), and mutate()

m <- lm(Sepal.Length ~ Sepal.Width * Petal.Length * Petal.Width, data = iris)
# to order categories sensibly, arrange by estimate, then coerce to a factor
d <- broom::tidy(m) %>% 
  arrange(desc(estimate)) %>%
  mutate(term = factor(term, levels = term))
# `array` supplies one error bar length per point
plot_ly(d, x = ~estimate, y = ~term) %>%
  add_markers(error_x = ~list(array = std.error)) %>%
  layout(margin = list(l = 200))

Figure 2.10: A coefficient plot
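As a variation on the same idea, the sketch below builds the coefficient table with base R instead of broom, and draws approximate 95% intervals rather than one standard error. The 1.96 multiplier is a normal-approximation assumption introduced for this example, and the bar lengths are passed through the array attribute of error_x, which takes one length per point.

```r
library(plotly)

m <- lm(Sepal.Length ~ Sepal.Width * Petal.Length * Petal.Width, data = iris)

# Base R route to the estimates and standard errors
cf <- coef(summary(m))
d <- data.frame(
  term = rownames(cf),
  estimate = cf[, "Estimate"],
  std.error = cf[, "Std. Error"]
)

# Order terms by estimate so the categorical axis is sorted sensibly
d <- d[order(d$estimate, decreasing = TRUE), ]
d$term <- factor(d$term, levels = d$term)

# Approximate 95% intervals (1.96 is an assumption of this sketch)
p_ci <- plot_ly(d, x = ~estimate, y = ~term) %>%
  add_markers(error_x = ~list(array = 1.96 * std.error)) %>%
  layout(margin = list(l = 200))
```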

2.1.3 Line plots

This section surveys useful applications of add_lines() and add_paths(). The only difference between these functions is that add_lines() connects x/y pairs from left to right, instead of the order in which the data appears. Both functions understand the color, linetype, and alpha attributes, as well as groupings defined by group_by().
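The difference is easiest to see with data that is not sorted by x. In this small sketch (the three-row data frame is made up for illustration), add_paths() follows the row order while add_lines() connects the points from left to right.

```r
library(plotly)

# x values deliberately out of order
d <- data.frame(x = c(3, 1, 2), y = c(1, 2, 3))
p <- plot_ly(d, x = ~x, y = ~y)

fig <- subplot(
  add_paths(p, name = "add_paths()"),  # follows row order: x = 3, 1, 2
  add_lines(p, name = "add_lines()")   # connects left to right: x = 1, 2, 3
)
```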

Figure 1.2 uses group_by() to plot one line per city in the txhousing dataset using a single trace. Since there can only be one tooltip per trace, hovering over that plot does not reveal useful information. Although plotting many traces can be computationally expensive, it is necessary in order to display better information on hover. Since the color argument produces one trace per value when the variable (here, city) is discrete, hovering on Figure 2.11 reveals the top ~10 cities at a given x value. Since 46 colors are too many to perceive in a single plot, Figure 2.11 also restricts the set of possible colors to black.

plot_ly(txhousing, x = ~date, y = ~median) %>%
  add_lines(color = ~city, colors = "black", alpha = 0.2)

Figure 2.11: Median house sales with one trace per city.

Generally speaking, it’s hard to perceive more than 8 different colors/linetypes/symbols in a given plot, so sometimes we have to filter data to use these effectively. Here we use the dplyr package to find the top 5 cities in terms of average monthly sales (top5), then filter the original data to contain just these cities via semi_join(). As Figure 2.12 demonstrates, once the data is filtered, mapping city to color or linetype is trivial. The color palette can be altered via the colors argument, and follows the same rules as scatterplots. The linetype palette can be altered via the linetypes argument, and accepts R’s lty values or plotly.js dash values.

library(dplyr)
top5 <- txhousing %>%
  group_by(city) %>%
  summarise(m = mean(sales, na.rm = TRUE)) %>%
  arrange(desc(m)) %>%
  top_n(5, m)

p <- semi_join(txhousing, top5, by = "city") %>%
  plot_ly(x = ~date, y = ~median)

subplot(
  add_lines(p, color = ~city),
  add_lines(p, linetype = ~city),
  shareX = TRUE, nrows = 2
)

Figure 2.12: Using color and/or linetype to differentiate groups of lines.
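To illustrate the linetypes argument, the sketch below supplies plotly.js dash names, one per city. It recomputes the top five cities with base R so that it runs on its own; the particular dash names and their order are arbitrary choices for this example.

```r
library(plotly)

# Top 5 cities by average monthly sales, computed with base R
avg_sales <- tapply(txhousing$sales, txhousing$city, mean, na.rm = TRUE)
top5_names <- names(sort(avg_sales, decreasing = TRUE))[1:5]
d <- txhousing[txhousing$city %in% top5_names, ]

# One plotly.js dash value per city
dashes <- c("solid", "dash", "dot", "dashdot", "longdash")

p_dash <- plot_ly(d, x = ~date, y = ~median) %>%
  add_lines(linetype = ~city, linetypes = dashes)
```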

2.1.3.1 Density plots

In Bars & histograms, we leveraged a number of algorithms in R for computing the “optimal” number of bins for a histogram, via hist(), and routing those results to add_bars(). We can leverage the density() function for computing kernel density estimates in a similar way, routing the results to add_lines(), as is done in Figure 2.13.
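Before routing anything to add_lines(), it can help to look at what density() returns. The sketch below inspects a single estimate with the default kernel; the area check is a crude Riemann sum added for illustration only.

```r
library(plotly)

dens <- density(txhousing$median, na.rm = TRUE)

# The estimate is evaluated on a grid of equally spaced x values
# (512 points by default), with the estimated density in y
n_grid <- length(dens$x)

# A density should integrate to roughly 1 over the grid
area <- sum(dens$y) * diff(dens$x[1:2])

# The x/y vectors can be passed straight to a line layer
p_dens <- plot_ly(x = dens$x, y = dens$y) %>% add_lines(name = "gaussian")
```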

kerns <- c("gaussian", "epanechnikov", "rectangular", 
          "triangular", "biweight", "cosine", "optcosine")
p <- plot_ly()
for (k in kerns) {
  d <- density(txhousing$median, kernel = k, na.rm = TRUE)
  p <- add_lines(p, x = d$x, y = d$y, name = k)
}
layout(p, xaxis = list(title = "Median monthly price"))