Draw your assumptions 🔀

Causal thinking with DAGs, confounders, colliders and simulation

statistics
causality
DAG
ggdag
simulation
Correlation is not causation –but why, exactly, and what would be? Learn the three atoms of causal structure (fork, chain, collider), draw them as DAGs with ggdag, and simulate each one to see when adjusting for a variable helps and when it manufactures bias.
Author

Nelson Amaya

Published

July 4, 2026

Modified

July 5, 2026

“Correlation is not causation but it sure is a hint.”
–Edward Tufte

PART I: The question regression can’t answer alone

The previous session ended on a warning: a regression coefficient describes an association. But the questions worth money and lives are about intervention: what happens to sales if we cut the price? To health if we take the drug? Association answers “what do I see?”; causation answers “what if I act?” –and no formula converts one into the other¹.

What does convert one into the other is an ingredient statistics cannot supply: your assumptions about how the world works. The modern discipline is to draw those assumptions as a graph –a DAG (directed acyclic graph): variables as nodes, arrows meaning “causes”, no cycles allowed. Once drawn, the graph tells you –mechanically– which variables you must adjust for and, just as important, which ones you must leave alone.

Every causal structure, however monstrous, is built from three atoms. Let’s draw them with ggdag and then –this track’s signature move– simulate each one to see what it does to a regression.

library(tidyverse)
library(ggdag)

fork     <- ggdag::dagify(x ~ z, y ~ z, coords = ggdag::time_ordered_coords())
chain    <- ggdag::dagify(m ~ x, y ~ m, coords = ggdag::time_ordered_coords())
collider <- ggdag::dagify(c ~ x, c ~ y, coords = ggdag::time_ordered_coords())

list(Fork = fork, Chain = chain, Collider = collider) |>
  purrr::imap(\(dag, name) {
    ggdag::ggdag(dag, node_size = 14, text_size = 4) +
      ggdag::theme_dag() +
      labs(title = name)
    }) |>
  patchwork::wrap_plots(nrow = 1)
  1. patchwork::wrap_plots(): patchwork glues the three plots side by side. Fork: z causes both x and y. Chain: x causes y through m. Collider: x and y both cause c –the arrows collide.
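
The claim that the graph tells you mechanically what to adjust for can be checked on the three atoms themselves. A minimal sketch, assuming the dagitty package is installed (ggdag builds on it, and dagify() returns a dagitty object): adjustmentSets() lists the minimal sets that identify the total effect of x on y, and dseparated() asks whether x and y are independent given a set of variables. The object names ending in _sets are illustrative.

```r
library(ggdag)    # dagify() builds the graphs
library(dagitty)  # adjustmentSets() and dseparated() read them

# The same three atoms as above, without plotting coordinates
fork     <- ggdag::dagify(x ~ z, y ~ z)
chain    <- ggdag::dagify(m ~ x, y ~ m)
collider <- ggdag::dagify(c ~ x, c ~ y)

# Minimal adjustment sets for the total effect of x on y
fork_sets     <- dagitty::adjustmentSets(fork,     exposure = "x", outcome = "y")
chain_sets    <- dagitty::adjustmentSets(chain,    exposure = "x", outcome = "y")
collider_sets <- dagitty::adjustmentSets(collider, exposure = "x", outcome = "y")

print(fork_sets)
print(chain_sets)
print(collider_sets)

# In the collider, x and y are d-separated until c is conditioned on
collider_untouched <- dagitty::dseparated(collider, "x", "y", list())
collider_adjusted  <- dagitty::dseparated(collider, "x", "y", "c")
print(c(untouched = collider_untouched, adjusted_for_c = collider_adjusted))
```

For the fork the answer should be z. For the chain and the collider it should be the empty set: leave m and c alone when you want the total effect of x on y.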

PART II: The fork –confounding, the classic villain

Ice cream sales correlate with drowning deaths. The fork explains it: summer (z) causes both. In a fork, x and y correlate without any arrow between them –and the fix is to adjust for the confounder. Watch it in about a dozen lines:

set.seed(44)

fork_world <- tibble(
  summer    = rbinom(2000, 1, 0.5),
  ice_cream = 10 + 5 * summer + rnorm(2000),
  drownings =  2 + 3 * summer + rnorm(2000)
  )

lm(drownings ~ ice_cream, data = fork_world) |>
  broom::tidy() |> dplyr::filter(term == "ice_cream")

lm(drownings ~ ice_cream + summer, data = fork_world) |>
  broom::tidy() |> dplyr::filter(term == "ice_cream")
  1. summer: a coin flip. Is it summer?
  2. ice_cream: depends on summer –note it does not depend on drownings.
  3. drownings: depends on summer –and not on ice cream. We built this world; we know the true effect of ice cream on drowning is exactly zero.
  4. First lm(): the naive regression finds a strong, “significant” effect. It is pure confounding.
  5. Second lm(): adjust for the fork and the coefficient collapses to ~0 –the truth we wired in. Adjustment worked because the graph said it would.
# A tibble: 1 × 5
  term      estimate std.error statistic p.value
  <chr>        <dbl>     <dbl>     <dbl>   <dbl>
1 ice_cream    0.517   0.00947      54.6       0
# A tibble: 1 × 5
  term      estimate std.error statistic p.value
  <chr>        <dbl>     <dbl>     <dbl>   <dbl>
1 ice_cream  0.00289    0.0223     0.130   0.897
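
The 0.517 in the naive regression is not noise: it follows from the numbers wired into fork_world. A back-of-the-envelope check, assuming only the simulation's own parameters (a fair coin for summer, slopes of 5 and 3, unit-variance noise). The population slope of a simple regression is Cov(x, y) / Var(x), and here the only thing x and y share is summer. The variable names are illustrative.

```r
# Parameters copied from fork_world
p_summer   <- 0.5
var_summer <- p_summer * (1 - p_summer)   # variance of a 0/1 coin flip

# ice_cream = 10 + 5 * summer + noise; drownings = 2 + 3 * summer + noise
cov_xy <- 5 * 3 * var_summer              # covariance induced by the shared cause
var_x  <- 5^2 * var_summer + 1            # variance of ice_cream (noise variance is 1)

expected_naive_slope <- cov_xy / var_x
round(expected_naive_slope, 3)            # 3.75 / 7.25 = 0.517
```

The confounded coefficient is therefore predictable from the graph plus the strengths of the two arrows leaving summer, with no arrow from ice cream to drownings required.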

This is the trap behind most “X linked to Y” headlines: wine drinkers live longer (income is the fork), private schools outperform (parental resources), coffee “causes” whatever it causes this week. The knee-jerk fix –“control for everything!”– seems to follow. It doesn’t. Meet the atom that punishes it.

PART III: The collider –where adjusting creates bias

In a collider, x and y are truly independent, but both cause c. Leave c alone and all is well. Adjust for it –or select your sample on it– and you manufacture a correlation out of nothing.

The classic example: are good-looking actors less talented? Suppose looks and talent are utterly unrelated, but either gets you into Hollywood:

set.seed(44)

hollywood <- tibble(
  looks  = rnorm(5000),
  talent = rnorm(5000),
  famous = (looks + talent + rnorm(5000, sd = 0.5)) > 1.5
  )

lm(talent ~ looks, data = hollywood) |>
  broom::tidy() |> dplyr::filter(term == "looks")

lm(talent ~ looks, data = dplyr::filter(hollywood, famous)) |>
  broom::tidy() |> dplyr::filter(term == "looks")
  1. looks, talent: independent by construction. The correlation between looks and talent is zero in this world.
  2. famous: fame is the collider. You get in on looks, talent, or luck.
  3. First lm(): in the full population, no relationship, correctly.
  4. Second lm(): among the famous only, a strong negative effect appears from thin air. Among people who cleared the bar, being gorgeous means you needed less talent to get in –selection did the distorting, no villain required.
# A tibble: 1 × 5
  term  estimate std.error statistic p.value
  <chr>    <dbl>     <dbl>     <dbl>   <dbl>
1 looks  -0.0162    0.0141     -1.15   0.251
# A tibble: 1 × 5
  term  estimate std.error statistic  p.value
  <chr>    <dbl>     <dbl>     <dbl>    <dbl>
1 looks   -0.592    0.0303     -19.6 3.14e-69
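hollywood |>
  ggplot(aes(looks, talent)) +
  geom_point(aes(color = famous), alpha = 0.3, size = 1) +
  # Grey line: fit to the whole population, famous or not
  geom_smooth(method = "lm", se = FALSE, linewidth = 1, color = "grey40") +
  # Orange line: fit to the famous only, i.e. after selecting on the collider
  geom_smooth(
    data = dplyr::filter(hollywood, famous),
    method = "lm", se = FALSE, linewidth = 1, color = "#F75431"
    ) +
  scale_color_manual(values = c("grey70", "#F75431")) +
  labs(
    title = "Collider bias, a.k.a. Berkson's paradox",
    subtitle = "No relationship in the population (grey). A strong negative one among the selected (orange).",
    color = "Famous"
    ) +
  theme_minimal()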

This is not a curiosity –it is everywhere your data was filtered before you got it: hospital patients (admission is a collider), survey respondents (responding is), hired employees, published papers, surviving companies. If your dataset exists because its rows cleared a bar, Berkson is already in it.
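
The Hollywood simulation selects on the collider. Putting famous in the formula, the other route Part III mentions, should bias the estimate as well. A sketch that rebuilds the same world with plain vectors: the names slope_unadjusted and slope_adjusted are illustrative, and the exact numbers depend on the seed.

```r
set.seed(44)
n <- 5000

# Same data-generating process as the hollywood tibble
looks  <- rnorm(n)
talent <- rnorm(n)
famous <- (looks + talent + rnorm(n, sd = 0.5)) > 1.5

# Leave the collider alone: the slope on looks should sit near zero
slope_unadjusted <- coef(lm(talent ~ looks))[["looks"]]

# "Control for" the collider: a negative slope on looks should appear
slope_adjusted <- coef(lm(talent ~ looks + famous))[["looks"]]

round(c(unadjusted = slope_unadjusted, adjusted = slope_adjusted), 3)
```

Nobody was filtered out here: all 5000 rows stay in both regressions. Adding the collider as a covariate is enough to create the spurious negative association.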

Note: You’ve seen these before, animated

The interactive explainers in session 4 of the workshop let you drag the strength of a collider and an omitted variable and watch the bias move. This session is the theory those toys were built on. The chain (x → m → y), the third atom, gets an exercise below: adjusting for a mediator blocks the very effect you’re trying to measure –a third distinct way “controlling for more” backfires.

The moral of the two experiments, and arguably of the whole causal revolution: whether to adjust for a variable is not a statistical question. The same + z in your formula is the cure in a fork and the poison in a collider –and only the DAG, i.e. your drawn assumptions, can tell you which world you’re in².

PART IV: Earning the arrow

So how do you ever get to say “causes”? By design, in descending order of purity:

  • Randomization. Flip the coin yourself. Random assignment cuts every incoming arrow to the treatment –no forks left, nothing to adjust. It’s the reason A/B tests and clinical trials are the gold standard, and why sample() is secretly the most powerful causal tool in R.
  • Natural experiments. When you can’t randomize, hunt for situations where the world almost did: policies applied to some regions and not others (difference-in-differences), arbitrary thresholds like exam cutoffs (regression discontinuity), lottery-like exposures (instrumental variables). Each is a chapter of Causal Inference: The Mixtape and The Effect –both free, both excellent, both with R code.
  • Adjustment with a defended DAG. The weakest but most common: observational data plus a graph you are willing to draw in public. The graph doesn’t make it true –it makes it criticizable, which is what science runs on.
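
The randomization bullet can be simulated in the same spirit as the fork. A hedged sketch with a made-up drug world: baseline_health, the true effect of 2 and every other coefficient are assumptions chosen for illustration, not taken from any study. In the observational version healthier people are more likely to take the drug (a fork); in the randomized version sample() decides who gets it.

```r
set.seed(44)
n <- 2000
true_effect <- 2                      # hypothetical effect of the drug

baseline_health <- rnorm(n)           # the would-be confounder

# Observational world: healthier people are more likely to take the drug
drug_obs    <- rbinom(n, 1, plogis(2 * baseline_health))
outcome_obs <- 1 + true_effect * drug_obs + 3 * baseline_health + rnorm(n)

# Randomized world: sample() shuffles an equal split of 0s and 1s,
# which cuts the arrow from baseline_health to the drug
drug_rct    <- sample(rep(c(0, 1), each = n / 2))
outcome_rct <- 1 + true_effect * drug_rct + 3 * baseline_health + rnorm(n)

# Naive regressions, no adjustment in either world
naive_obs <- coef(lm(outcome_obs ~ drug_obs))[["drug_obs"]]
naive_rct <- coef(lm(outcome_rct ~ drug_rct))[["drug_rct"]]

round(c(truth = true_effect, observational = naive_obs, randomized = naive_rct), 2)
```

The observational estimate should overshoot the truth by a wide margin, while the randomized one should land close to 2 without baseline_health ever entering the formula.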

What you may never do is run lm(y ~ x + everything) and read causality off the stars. Now you know the two reasons why, and you can simulate both from scratch.

Tip: Exercises 🏋️
  1. Simulate the third atom, the chain: x causes m (m = 2*x + noise), m causes y (y = 3*m + noise). Regress y ~ x, then y ~ x + m. The total effect of x is 6 –which regression finds it, and what does the other one estimate instead?
  2. Draw last week’s news. Take one “X linked to Y” headline, draw a DAG with at least one plausible fork, and use ggdag_adjustment_set() to state what the study would need to have adjusted for.
  3. Break the fork fix: in the ice-cream world, adjust for a noisy version of summer (summer_reported = ifelse(runif(2000) < 0.8, summer, 1 - summer)). How much of the confounding comes back? (Measurement error makes “we controlled for it” a matter of degree.)
  4. Randomize: rebuild the ice-cream world but assign ice_cream with rnorm(2000) –independent of summer, as an experiment would. Show the naive regression is now unbiased, no adjustment needed.

Next –and last– in the track: Think Bayes, where beliefs become distributions and data updates them.


Footnotes

  1. This is Judea Pearl’s “ladder of causation”, from The Book of Why –the best non-technical entry to this whole subject.

  2. ggdag::ggdag_adjustment_set(dag, exposure = "x", outcome = "y") automates the deduction: give it your graph and it plots the set(s) of variables to adjust for. The graph is your responsibility; the graph-reading is mechanical.

Citation

BibTeX citation:
@online{amaya2026,
  author = {Amaya, Nelson},
  title = {Draw Your Assumptions 🔀},
  date = {2026-07-04},
  url = {https://r4dev.netlify.app/sessions_thinking/04-causal/04-causal},
  langid = {en}
}
For attribution, please cite this work as:
Amaya, Nelson. 2026. “Draw Your Assumptions 🔀.” July 4. https://r4dev.netlify.app/sessions_thinking/04-causal/04-causal.