Preface
Figures should be autogenerated as part of the data analysis pipeline (which should itself be automated). Editing graphs after the fact in manual programs like Illustrator is bad practice because the edits are time-consuming, hard to reproduce consistently, and have to be redone by hand whenever the underlying data change.
For the same reason, interactive plotting software like Excel and GraphPad Prism 9 is a poor choice. It is much better to generate figures from code, using software like R’s ggplot2.
Coordinate Systems and Axes
Axes are either linear or nonlinear (most commonly log scale).
A log scale can be produced either by log-transforming the data and plotting the transformed values on a linear axis, or by keeping the original data and plotting them on a log-scaled axis.
- A linear scale overemphasizes ratios above 1 and compresses ratios below 1 into the interval between 0 and 1. Therefore, ratios should be displayed on a log scale.
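The effect of the log transform on ratios can be checked numerically. The sketch below uses only Python's standard library; the ratio values are made up for illustration, and it computes positions rather than drawing a plot.

```python
import math

# Hypothetical ratios for illustration: each value is paired with its reciprocal.
ratios = [0.1, 0.5, 1.0, 2.0, 10.0]


def log10_transform(values):
    """Return log10 of each value; all values must be positive."""
    return [math.log10(v) for v in values]


log_ratios = log10_transform(ratios)

# On a linear axis, 0.1 and 0.5 are squeezed between 0 and 1 while 2 and 10
# spread over a far wider range. After the transform, a ratio and its
# reciprocal sit at equal distances on either side of 0 (the position of ratio 1).
for r, lr in zip(ratios, log_ratios):
    print(f"ratio {r:>5} -> log10 {lr:+.3f}")
```

Plotting `log_ratios` on a linear axis corresponds to the first approach described above (transforming the data); a plotting library's log-scaled axis corresponds to the second.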
Polar coordinates are typically useful for displaying periodic data, such as values for each day of the year, because the end of one cycle sits next to the start of the next.
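As a rough illustration of the days-in-a-year case, the following sketch maps a day number to an angle and then to x-y coordinates on a circle. The 365-day year and the function names are assumptions made for this example.

```python
import math

DAYS_IN_YEAR = 365  # assumption: a non-leap year


def day_to_angle(day_of_year):
    """Map day 1..365 to an angle in radians, with day 1 at angle 0."""
    return 2 * math.pi * (day_of_year - 1) / DAYS_IN_YEAR


def polar_to_xy(radius, angle):
    """Convert polar coordinates to Cartesian coordinates for plotting."""
    return radius * math.cos(angle), radius * math.sin(angle)


# December 31 and January 1 are far apart on a linear axis,
# but they are neighbors on the circle.
x_first, y_first = polar_to_xy(1.0, day_to_angle(1))
x_last, y_last = polar_to_xy(1.0, day_to_angle(365))
gap = math.dist((x_first, y_first), (x_last, y_last))
print(f"distance between day 1 and day 365 on the unit circle: {gap:.4f}")
```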
Color
There are three fundamental use cases for color in data visualizations: (i) we can use color to distinguish groups of data from each other; (ii) we can use color to represent data values; and (iii) we can use color to highlight. The types of colors we use and the way in which we use them are quite different for these three cases.
For qualitative data (categories with no intrinsic order), a qualitative color scale is used. The ColorBrewer project provides a nice selection of qualitative color scales (such as Dark2); the ggplot2 hue scale, the default in ggplot2, is another example.
A sequential color scale needs a sequence of colors that clearly indicates (i) which values are larger or smaller than others and (ii) how distant two specific values are from each other. The second point implies that the color scale needs to be perceived to vary uniformly across its entire range. The ColorBrewer Blues scale is a single-hue scale, while the Heat and Viridis scales are multi-hue scales.
A diverging color scale stitches two sequential scales together at a common midpoint, which usually gets a light, neutral color. The ColorBrewer PiYG and CARTO Earth scales are examples.
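The idea of joining two sequential ramps at a midpoint can be sketched in a few lines. The endpoint colors here are arbitrary choices for illustration, not the PiYG or Earth scales, and plain linear interpolation in RGB is an assumption made for brevity: it does not give a perceptually uniform scale, so real palettes are designed in perceptual color spaces.

```python
def hex_to_rgb(code):
    """Convert 'RRGGBB' or '#RRGGBB' to an (r, g, b) tuple of ints."""
    code = code.lstrip("#")
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(rgb):
    """Convert an (r, g, b) tuple back to '#RRGGBB'."""
    return "#" + "".join(f"{round(c):02X}" for c in rgb)


def lerp_color(a, b, t):
    """Linearly interpolate between colors a and b for t in [0, 1]."""
    return tuple(x + (y - x) * t for x, y in zip(a, b))


# Hypothetical endpoint and midpoint colors.
LOW = hex_to_rgb("#0072B2")
MID = hex_to_rgb("#FFFFFF")
HIGH = hex_to_rgb("#D55E00")


def diverging_color(value, vmin, vmid, vmax):
    """Map a value to a color: LOW..MID below the midpoint, MID..HIGH above it."""
    value = max(vmin, min(vmax, value))  # clamp to the data range
    if value <= vmid:
        t = (value - vmin) / (vmid - vmin)
        return rgb_to_hex(lerp_color(LOW, MID, t))
    t = (value - vmid) / (vmax - vmid)
    return rgb_to_hex(lerp_color(MID, HIGH, t))


for v in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(v, diverging_color(v, -1.0, 0.0, 1.0))
```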
Color can also be used to highlight specific data points in the graph. Accent color scales serve this purpose: they combine a set of subdued colors with a matching set of stronger accent colors.
Directory of Visualizations
Bar plots or dots to show amounts.
- If the x-axis labels are too long, swap the x and y axes
- Dots should be used if the bars take up most of the graph with little variation between each one
Grouped or stacked bars to showcase 2+ sets of categories. Can also be done using a heatmap.
Histograms, density plots, cumulative density plots, and quantile-quantile plots show the distribution of a single data set.
Boxplots, violin plots, strip charts, and sina plots are great for visualizing multiple distributions at once.
- Violins can reveal bimodal distributions and are better suited to larger data sets.
- A sina plot is often the better choice, though: it combines a boxplot or violin with the jittered individual data points.
- For a more in-depth comparison of a smaller number of distributions, overlapping density plots or ridgeline plots are good. Ridgeline plots also work well for comparing two trends over time, such as the voting patterns of the members of two different parties.
Pie charts can show proportions. For multiple proportions, grouped bars work well.
To represent x-y relationships, scatterplots, bubble charts, and slope graphs work well. With large numbers of data points, regular scatterplots become uninformative due to overplotting. In that case, density contours, 2D bins, hex bins, or a correlogram work better.
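To make the 2D binning idea concrete, here is a minimal sketch that counts points per square bin, which is the quantity a 2D-bin plot maps to color. The data are randomly generated for illustration, and square bins are used instead of hexagons to keep the code short.

```python
import math
import random
from collections import Counter


def bin_2d(points, bin_width):
    """Count points per square bin; keys are (column, row) bin indices."""
    counts = Counter()
    for x, y in points:
        counts[(math.floor(x / bin_width), math.floor(y / bin_width))] += 1
    return counts


# Hypothetical data: 10,000 points from a 2D standard normal distribution.
rng = random.Random(42)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(10_000)]

counts = bin_2d(points, bin_width=0.5)
densest_bin, densest_count = counts.most_common(1)[0]
print(f"{len(counts)} occupied bins; densest bin {densest_bin} holds {densest_count} points")
```

Drawing one colored tile per bin replaces thousands of overlapping dots with a few hundred cells, so dense regions stay readable.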
Line graphs show how a variable changes over time. A smoothed line can represent the trend in a larger data set.
Choropleth maps show geographic regions colored by data values. Alternatively, cartograms or cartogram heatmaps can be used (cartograms distort the regions according to a data value).
Uncertainty can be plotted using error bars. For line graphs, confidence or graded confidence bands can be used.
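The notes do not say which quantity the error bars should show; common choices are the standard deviation, the standard error, or a confidence interval. The sketch below assumes the standard error of the mean and uses made-up measurements.

```python
import math
import statistics


def mean_with_standard_error(values):
    """Return (mean, standard error of the mean) for a sample of at least 2 values."""
    n = len(values)
    mean = statistics.fmean(values)
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return mean, standard_error


# Hypothetical measurements for one group.
measurements = [4.0, 5.0, 6.0]
mean, se = mean_with_standard_error(measurements)

# An error bar spanning one standard error on either side of the mean.
lower, upper = mean - se, mean + se
print(f"mean = {mean:.2f}, error bar from {lower:.2f} to {upper:.2f}")
```

Whatever quantity is chosen, the figure caption should state it, since the same bar length means very different things for a standard deviation and a standard error.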
Heatmaps are used if a dot or bar graph becomes too saturated with data.
A histogram bins the data and shows the count in each bin. The bin width is a free choice, so always explore multiple bin widths to determine which one fits your data best.
- However, histograms are increasingly being replaced with density plots, which estimate a smooth curve from the data using kernel density estimation (commonly with a Gaussian kernel).
- To visualize several distributions at once, kernel density plots will generally work better than histograms.
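Both ideas, binning with different widths and Gaussian kernel density estimation, can be sketched with the standard library. The sample values and the bandwidth of 5 are assumptions chosen for illustration; in practice a plotting library would pick a bandwidth for you.

```python
import math


def histogram_counts(values, bin_width):
    """Return {left_bin_edge: count}, with bins aligned to multiples of bin_width."""
    start = math.floor(min(values) / bin_width) * bin_width
    counts = {}
    for v in values:
        left = start + math.floor((v - start) / bin_width) * bin_width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))


def gaussian_kde(values, bandwidth):
    """Return a density function: the average of Gaussian bumps centered on the data."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)

    return density


# Hypothetical sample.
ages = [3, 7, 12, 18, 21, 22, 25, 31, 38, 45, 52, 67]

# The same data look quite different at different bin widths.
for width in (10, 20):
    print(width, histogram_counts(ages, width))

density = gaussian_kde(ages, bandwidth=5)
print(f"estimated density at age 22: {density(22):.4f}")
```

A larger bandwidth plays the same role as a wider bin: it smooths away detail, so it is worth trying several values here too.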
Visualize proportions described by more than two categorical variables with a parallel sets plot
If data contains 2+ quantitative variables with a potential relationship…
- 2 variables: scatterplots
- 3+ variables: bubble chart (a third variable sets the dot size, which can be somewhat confusing), scatterplot matrix, or correlogram (a visualization of correlation coefficients)
- many variables: dimension reduction, for example with principal components analysis
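A correlogram is drawn from a matrix of pairwise correlation coefficients. The sketch below computes Pearson coefficients by hand for a few made-up columns; the column names and values are purely illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical data set with three quantitative variables.
data = {
    "height": [1, 2, 3, 4, 5],
    "weight": [2, 4, 6, 8, 10],
    "score": [2, 1, 4, 3, 5],
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

In a correlogram, each cell of this matrix becomes a colored tile or circle, usually on a diverging color scale centered at zero.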
For paired data sets (measuring a change in value over 2 or more points across time)
- for small data sets: slope graphs
For data sets over time, use line plots or multiple line plots
Visualizing Trends
- Smooth the series to capture patterns in the data while removing minor noise (for example, moving averages of stock prices).
- LOESS and splines are popular smoothing approaches, but they are more resource intensive.
- If there are a lot of data points, partial transparency and jittering are beneficial for scatterplots. Contour lines or binning the data into hexagons are alternatives.
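The moving average and jittering mentioned above are simple enough to sketch directly. This example uses a trailing (not centered) window and uniform random jitter, both of which are choices made for illustration; LOESS and splines are not implemented here, and the price series is made up.

```python
import random


def moving_average(values, window):
    """Average each run of `window` consecutive values; returns len(values) - window + 1 points."""
    if not 1 <= window <= len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window for i in range(len(values) - window + 1)]


def jitter(values, amount, seed=0):
    """Shift each value by a uniform random offset in [-amount, amount]."""
    rng = random.Random(seed)  # fixed seed keeps the figure reproducible
    return [v + rng.uniform(-amount, amount) for v in values]


# Hypothetical daily prices.
prices = [10, 12, 11, 15, 14, 18, 17]
smoothed = moving_average(prices, window=3)
print([round(s, 2) for s in smoothed])

# Jitter separates points that would otherwise be drawn on top of each other.
ratings = [1, 1, 1, 2, 2, 3]
print([round(r, 2) for r in jitter(ratings, amount=0.2)])
```

Seeding the random generator matters for the autogenerated-figure workflow: without it, every pipeline run would place the jittered points differently.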
Color Usage
- Don’t use color to encode too much or irrelevant information. Generally, qualitative color scales work best with 3-5 different categories, which keeps the cognitive load low.
- If there are a lot of categories, you can instead color by a coarser grouping (such as the region a state is in) and then label each dot on the scatterplot individually with text.
- Don’t use too many colors.
- A good color palette (the Okabe-Ito palette, given as hex codes):
- Orange: E69F00
- Cyan: 56B4E9
- Green: 009E73
- Yellow: F0E442
- Blue: 0072B2
- Red: D55E00
- Pink: CC79A7
- Black: 000000
- If there is a clear visual ordering in the data, match it in the legend or just add a text label next to the curve
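In an automated pipeline, it helps to define the palette above once and reuse it in every figure. The sketch below stores the hex codes from the list, converts them to RGB, and assigns colors to categories; the helper names and the example categories are hypothetical.

```python
# Hex codes copied from the list above, keyed by the names used there.
PALETTE = {
    "orange": "E69F00",
    "cyan": "56B4E9",
    "green": "009E73",
    "yellow": "F0E442",
    "blue": "0072B2",
    "red": "D55E00",
    "pink": "CC79A7",
    "black": "000000",
}


def hex_to_rgb(code):
    """Convert 'RRGGBB' to an (r, g, b) tuple with components in 0..255."""
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))


def assign_colors(categories):
    """Give each category its own palette color; refuse if there are too many."""
    if len(categories) > len(PALETTE):
        raise ValueError("more categories than palette colors; group or label them instead")
    return {cat: "#" + code for cat, code in zip(categories, PALETTE.values())}


# Hypothetical categories.
colors = assign_colors(["West", "South", "Midwest", "Northeast"])
print(colors)
print(hex_to_rgb(PALETTE["orange"]))
```

Raising an error rather than recycling colors enforces the advice above: once the categories outnumber the colors, a different encoding is needed.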
Figure Labels
Figure titles are best placed under the graph, at the start of the caption. The caption should begin with the figure label followed by the title (for example, "Figure 22.2: title here"); the title does not need to be a complete sentence.
Appearances
Don’t use outlined bars; fill them in.
Don’t distinguish the lines in a line graph by dashes or other line-style variations. Use different colors, ideally with a shaded area underneath.
Use solid points rather than open points in general.
Light shading is beneficial for box plots.
Avoid 3D, except for topographical maps and protein modeling.
To tell a story with data, use a conflict-and-resolution structure if possible.
Don’t overcomplicate your figures; an overly complex figure may fail to communicate anything.