Data science: Why bar (dynamite) plots are terrible (use ggbeeswarm instead)

Why are barplots or dynamite plots so bad? Comparing four different types of plots: barplot, boxplot, violin plot, and geom_quasirandom plot

Author: Hause Lin

Published: Feb. 22, 2019

Dynamite or bar plots must die?

A recent open letter to journal editors said that “dynamite plots must die”.

Statisticians have been pointing out the problem with dynamite plots¹, also known as bar and line graphs, for years. Karl Broman lists them as one of the top ten worst graphs². The problem has even been documented in the peer reviewed literature.

One of my professional pet peeves is dynamite plots. Sometimes they are incorrectly referred to as bar plots. Dynamite plots do not have a formal name because they are not a part of conventional statistical graphics that should be used in reporting scientific results. But they are everywhere! ~ Tatsuki Koyama (Vanderbilt Biostatistics)

Graphics reveal data, communicate complex ideas and dependencies with clarity, precision and efficiency. ~ Edward Tufte

Many others have also suggested we need to plot better [1, 2]. Here, I simulate datasets to compare barplots (dynamite plots), boxplots, violin plots, and another kind of plot I prefer (geom_quasirandom() from ggbeeswarm).

# load packages
library(data.table); library(tidyverse); library(gganimate)
library(ggbeeswarm); library(rmarkdown); library(knitr)
# note: mean_cl_normal (used in the plots below) also needs the Hmisc package to be installed
theme_set(theme_bw())  # set overall ggplot theme

Randomly generate 10 datasets with same mean and sd

Within each dataset, we have data from two groups, with 50 data points (e.g., subjects) per group. One group has mean 100 (sd = 15) and the other has mean 105 (sd = 45).

# rnormFixed() wraps R's rnorm() function:
# it draws random data from a normal distribution, then rescales the draws
# so that the sample mean and sd are exactly the requested values
rnormFixed <- function(n, mean, sd) {
    mean + sd * as.vector(scale(rnorm(n)))  # as.vector() drops the matrix attributes added by scale()
}

set.seed(1) # to ensure results/figures can be replicated
n <- 50 # datapoints per group
datasets <- 10

# generate datasets
dt1 <- lapply(1:datasets, # loop iterator 
              function(x) {data.table(
                scores = c(rnormFixed(n = n, mean = 100, sd = 15), # mean of group 1
                           rnormFixed(n = n, mean = 105, sd = 45)), # mean of group 2
                group = rep(x = c("group 1", "group 2"), each = n), # condition label
                dataset = x)})
dt1 <- bind_rows(dt1) # bind list containing separate datasets into a dataframe 
dt1

         scores   group dataset
   1:  86.88524 group 1       1
   2: 101.50100 group 1       1
   3:  83.11131 group 1       1
   4: 126.96975 group 1       1
   5: 104.13269 group 1       1
  ---                          
 996:  96.99936 group 2      10
 997:  73.10838 group 2      10
 998:  92.53620 group 2      10
 999:  47.75614 group 2      10
1000:  80.82894 group 2      10
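The rescaling inside rnormFixed() is easy to check by hand. The sketch below is a minimal check in base R, assuming the same definition as above; the names plain and fixed are only illustrative. scale() standardizes the draws to sample mean 0 and sample sd 1, so the shifted and stretched result has exactly the requested sample mean and sd, whereas an ordinary rnorm() draw only matches them approximately.

```r
# Same idea as rnormFixed() above, repeated here so the check runs on its own
rnormFixed <- function(n, mean, sd) {
    # scale() standardizes the draws to sample mean 0 and sample sd 1
    mean + sd * as.vector(scale(rnorm(n)))
}

set.seed(1)
plain <- rnorm(n = 50, mean = 100, sd = 15)       # ordinary draw: sample statistics vary
fixed <- rnormFixed(n = 50, mean = 100, sd = 15)  # rescaled draw: sample statistics are exact

c(mean = mean(plain), sd = sd(plain))  # close to 100 and 15, but not exact
c(mean = mean(fixed), sd = sd(fixed))  # 100 and 15, up to floating-point error
```

This is also why the summary table in the next step shows identical values in every row for a given group.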

Compute mean and standard deviation for each dataset and group.

dt1[, .(scores = mean(scores), scores_stdev = sd(scores)), by = .(dataset, group)]

    dataset   group scores scores_stdev
 1:       1 group 1    100           15
 2:       1 group 2    105           45
 3:       2 group 1    100           15
 4:       2 group 2    105           45
 5:       3 group 1    100           15
 6:       3 group 2    105           45
 7:       4 group 1    100           15
 8:       4 group 2    105           45
 9:       5 group 1    100           15
10:       5 group 2    105           45
11:       6 group 1    100           15
12:       6 group 2    105           45
13:       7 group 1    100           15
14:       7 group 2    105           45
15:       8 group 1    100           15
16:       8 group 2    105           45
17:       9 group 1    100           15
18:       9 group 2    105           45
19:      10 group 1    100           15
20:      10 group 2    105           45
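Identical means and standard deviations do not imply identical data. As a rough illustration, the base-R sketch below regenerates data in the same way as above and then summarises each dataset with a few statistics other than the mean and sd. The names sim and summ are my own, and I make no claim that the numbers match dt1 exactly.

```r
rnormFixed <- function(n, mean, sd) mean + sd * as.vector(scale(rnorm(n)))

set.seed(1)
n <- 50
# 10 simulated datasets, two groups each, stacked into one data frame
sim <- do.call(rbind, lapply(1:10, function(d) {
    data.frame(scores = c(rnormFixed(n, 100, 15), rnormFixed(n, 105, 45)),
               group = rep(c("group 1", "group 2"), each = n),
               dataset = d)
}))

# mean and sd are fixed by construction; min, median, and max are not
summ <- aggregate(scores ~ dataset + group, data = sim,
                  FUN = function(x) c(mean = mean(x), sd = sd(x),
                                      min = min(x), median = median(x), max = max(x)))
summ <- do.call(data.frame, summ)  # flatten the matrix column into regular columns
print(summ, digits = 4)
```

The mean and sd columns are constant within each group, while the minimum, median, and maximum change from dataset to dataset. That variation is exactly what a bar with an error bar cannot show.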

R code to generate four figures below

Compare barplot, boxplot, violin plot, and plot with geom_quasirandom() from ggbeeswarm package.

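# barplot: bar at the group mean, error bar for the 95% confidence interval
# (fun and linewidth assume ggplot2 >= 3.4.0; older versions use fun.y and size)
plot_bar <- ggplot(dt1, aes(group, scores)) +
    stat_summary(fun = mean, geom = 'bar') +
    stat_summary(fun.data = mean_cl_normal, geom = 'errorbar', width = 0, linewidth = 1.1) +
    transition_states(dataset) +  # animate across the 10 datasets
    labs(x = 'Sample {closest_state}')
plot_bar
# anim_save("./attachments/plot_bar.gif", plot_bar)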

# boxplot, with the group mean (horizontal dash) and 95% confidence interval overlaid
# (fun assumes ggplot2 >= 3.3.0; older versions use fun.y)
plot_boxplot <- ggplot(dt1, aes(group, scores)) +
    geom_boxplot() +
    stat_summary(fun = mean, geom = 'point', shape = 95, size = 6) +
    stat_summary(fun.data = mean_cl_normal, geom = 'errorbar', width = 0) +
    transition_states(dataset) +
    labs(x = 'Sample {closest_state}')
plot_boxplot
# anim_save("./attachments/plot_boxplot.gif", plot_boxplot)

# violin plot, with the group mean (horizontal dash) and 95% confidence interval overlaid
# (fun and linewidth assume ggplot2 >= 3.4.0; older versions use fun.y and size)
plot_violin <- ggplot(dt1, aes(group, scores)) +
    geom_violin() +
    stat_summary(fun = mean, geom = 'point', shape = 95, size = 6) +
    stat_summary(fun.data = mean_cl_normal, geom = 'errorbar', width = 0, linewidth = 1.1) +
    transition_states(dataset) +
    labs(x = 'Sample {closest_state}')
plot_violin
# anim_save("./attachments/plot_violin.gif", plot_violin)

# geom_quasirandom plot: every data point, plus the group mean and 95% confidence interval
# (fun and linewidth assume ggplot2 >= 3.4.0; older versions use fun.y and size)
plot_dotdist <- ggplot(dt1, aes(group, scores)) +
    geom_quasirandom(alpha = 0.2) +
    stat_summary(fun = mean, geom = 'point', shape = 95, size = 6) +
    stat_summary(fun.data = mean_cl_normal, geom = 'errorbar', width = 0, linewidth = 1.1) +
    transition_states(dataset) +
    labs(x = 'Sample {closest_state}')
plot_dotdist
# anim_save("./attachments/plot_dotdist.gif", plot_dotdist)
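Animations are not always practical, for example in a PDF. A static alternative, sketched here under my own assumptions, lays the datasets out side by side with facet_wrap() instead of animating them with transition_states(). The data are regenerated so the block runs on its own, only four datasets are shown to keep the figure small, and the names sim and plot_facets are illustrative. It needs ggplot2 (3.3.0 or later for the fun argument) and ggbeeswarm.

```r
library(ggplot2)
library(ggbeeswarm)

rnormFixed <- function(n, mean, sd) mean + sd * as.vector(scale(rnorm(n)))

set.seed(1)
n <- 50
# four simulated datasets, two groups each
sim <- do.call(rbind, lapply(1:4, function(d) {
    data.frame(scores = c(rnormFixed(n, 100, 15), rnormFixed(n, 105, 45)),
               group = rep(c("group 1", "group 2"), each = n),
               dataset = paste("Sample", d))
}))

# one panel per dataset: raw points plus a horizontal dash at each group mean
plot_facets <- ggplot(sim, aes(group, scores)) +
    geom_quasirandom(alpha = 0.3) +
    stat_summary(fun = mean, geom = "point", shape = 95, size = 6) +
    facet_wrap(~ dataset, nrow = 1) +
    labs(x = NULL, y = "scores")
plot_facets
```

Swapping geom_quasirandom() for a bar layer in the same layout would produce four identical panels, which makes the same point as the animation.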

Comparing bar plots, boxplots, violin plots, and geom_quasirandom() plots

Note what happens in the four figures below.

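Barplots hide the data. Because the means and standard deviations are identical across the 10 datasets, the bars and error bars never change, even though the underlying data points do.
Boxplots are slightly more informative: they show the median, the quartiles, and any outliers.
Violin plots are also informative: they show the shape of each distribution.
geom_quasirandom() from the ggbeeswarm package plots every data point, arranged so that the shape of the distribution is visible. These plots are the most informative.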
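Why the bars and error bars stay the same in every dataset can be shown with a little arithmetic. The error bars come from mean_cl_normal, which, as far as I understand it, wraps smean.cl.normal() from Hmisc and draws a t-based 95% confidence interval, mean ± t × sd / sqrt(n). Every term in that formula is fixed by construction in these simulations, so the interval is fixed too. The helper name ci_normal below is hypothetical and written in base R rather than calling Hmisc.

```r
# t-based confidence interval computed from summary statistics alone
ci_normal <- function(mean, sd, n, conf = 0.95) {
    half_width <- qt((1 + conf) / 2, df = n - 1) * sd / sqrt(n)
    c(lower = mean - half_width, mean = mean, upper = mean + half_width)
}

ci_group1 <- ci_normal(mean = 100, sd = 15, n = 50)  # same in every dataset
ci_group2 <- ci_normal(mean = 105, sd = 45, n = 50)  # three times wider, same in every dataset
ci_group1
ci_group2
```

Since the function never sees the raw scores, any two datasets with the same mean, sd, and n produce the same bar and the same error bar.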

Footnotes

¹ simply statistics site
² Check out the top 10 worst graphs here

References

[1] Drummond, G. and Vowler, S., 2011. Show the data, don't conceal them. Adv Physiol Educ, Vol 35(2), pp. 130–132.
[2] Weissgerber, T., Milic, N., Winham, S. and Garovic, V., 2015. Beyond bar and line graphs: time for a new data presentation paradigm. PLoS Biol., Vol 13(4), pp. e1002128.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/hauselin/rtutorialsite, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Lin (2019, Feb. 22). Data science: Why bar (dynamite) plots are terrible (use ggbeeswarm instead). Retrieved from https://hausetutorials.netlify.com/posts/2019-02-22-why-we-should-never-use-barplots-use-geomquasirandom-instead/

BibTeX citation

@misc{lin2019why,
  author = {Lin, Hause},
  title = {Data science: Why bar (dynamite) plots are terrible (use ggbeeswarm instead)},
  url = {https://hausetutorials.netlify.com/posts/2019-02-22-why-we-should-never-use-barplots-use-geomquasirandom-instead/},
  year = {2019}
}
