Draw your assumptions before drawing your conclusions 🔀

Causal thinking with DAGs –plus an interactive gallery of DiD, RDD, IV, matching, fixed effects and synthetic control

statistics

causality

DAG

ggdag

simulation

natural experiments

Correlation is not causation –but why, exactly, and what would be? Learn the three atoms of causal structure (fork, chain, collider), simulate each one to see when adjusting helps and when it manufactures bias, then step into an interactive gallery of six research designs –difference-in-differences, regression discontinuity, instrumental variables, matching, fixed effects and synthetic control– where a slider breaks each identifying assumption and you watch the estimate fall apart in real time.

Author

Nelson Amaya

Published

July 4, 2026

Modified

July 18, 2026

“No causes in; no causes out.”
–Nancy Cartwright

Level: Advanced · Time: ~90 min · Prerequisites: All models are wrong · Tools: ggdag, dplyr, interactive OJS explainers (no installation needed)

You will learn to

Draw a causal diagram (DAG) and identify its three basic atoms: fork, chain, collider.
Simulate a confounder and see adjusting for it fix a biased estimate.
Simulate a collider and see adjusting for it create bias that wasn’t there.
Recognise post-treatment bias: adjusting for something measured after the treatment.
Explore six natural-experiment research designs (DiD, RDD, IV, matching, fixed effects, synthetic control) and see each one’s identifying assumption break in an interactive gallery.

PART I: The question regression can’t answer alone

The previous session ended on a warning: a regression coefficient describes an association. But the questions worth money and lives are about intervention: what happens to sales if we cut the price? To health if we take the drug? Association answers “what do I see?”; causation answers “what if I act?” –and no formula converts one into the other¹.

What does convert one into the other is an ingredient statistics cannot supply: your assumptions about how the world works. This session’s title is Miguel Hernán’s motto², and it is the whole method in one sentence. The modern discipline is to draw those assumptions as a graph –a DAG (directed acyclic graph): variables as nodes, arrows meaning “causes”, no cycles allowed. Once drawn, the graph tells you –mechanically– which variables you must adjust for and, just as important, which ones you must leave alone.

Every causal structure, however monstrous, is built from three atoms. Let’s draw them with ggdag and then –this track’s signature move– simulate each one to see what it does to a regression.

library(tidyverse)
library(ggdag)

fork     <- ggdag::dagify(x ~ z, y ~ z, coords = ggdag::time_ordered_coords())
chain    <- ggdag::dagify(m ~ x, y ~ m, coords = ggdag::time_ordered_coords())
collider <- ggdag::dagify(c ~ x, c ~ y, coords = ggdag::time_ordered_coords())

list(Fork = fork, Chain = chain, Collider = collider) |>
  purrr::imap(\(dag, name) {
    ggdag::ggdag(dag, node_size = 14, text_size = 4) +
      ggdag::theme_dag() +
      labs(title = name)
    }) |>
  patchwork::wrap_plots(nrow = 1)

1: patchwork glues the three plots side by side. Fork: z causes both x and y. Chain: x causes y through m. Collider: x and y both cause c –the arrows collide.

PART II: The fork –confounding, the classic villain

Ice cream sales correlate with drowning deaths. The fork explains it: summer (z) causes both. In a fork, x and y correlate without any arrow between them –and the fix is to adjust for the confounder. Here is the atom wearing its story:

Now watch it in twelve lines:

set.seed(44)

fork_world <- tibble(
  summer    = rbinom(2000, 1, 0.5),
  ice_cream = 10 + 5 * summer + rnorm(2000),
  drownings =  2 + 3 * summer + rnorm(2000)
  )

lm(drownings ~ ice_cream, data = fork_world) |>
  broom::tidy() |> dplyr::filter(term == "ice_cream")

lm(drownings ~ ice_cream + summer, data = fork_world) |>
  broom::tidy() |> dplyr::filter(term == "ice_cream")

1: A coin flip: is it summer?
2: Ice cream depends on summer –note it does not depend on drownings.
3: Drownings depend on summer –and not on ice cream. We built this world; we know the true effect of ice cream on drowning is exactly zero.
4: The naive regression finds a strong, “significant” effect. It is pure confounding.
5: Adjust for the fork and the coefficient collapses to ~0 –the truth we wired in. Adjustment worked because the graph said it would.

# A tibble: 1 × 5
  term      estimate std.error statistic p.value
  <chr>        <dbl>     <dbl>     <dbl>   <dbl>
1 ice_cream    0.517   0.00947      54.6       0
# A tibble: 1 × 5
  term      estimate std.error statistic p.value
  <chr>        <dbl>     <dbl>     <dbl>   <dbl>
1 ice_cream  0.00289    0.0223     0.130   0.897

This is the trap behind most “X linked to Y” headlines: wine drinkers live longer (income is the fork), private schools outperform (parental resources), coffee “causes” whatever it causes this week. The knee-jerk fix –“control for everything!”– seems to follow. It doesn’t. Meet the atom that punishes it.

PART III: The collider –where adjusting creates bias

In a collider, x and y are truly independent, but both cause c. Leave c alone and all is well. Adjust for it –or select your sample on it– and you manufacture a correlation out of nothing.

The classic example: are good-looking actors less talented? Suppose looks and talent are utterly unrelated, but either gets you into Hollywood:

set.seed(44)

hollywood <- tibble(
  looks  = rnorm(5000),
  talent = rnorm(5000),
  famous = (looks + talent + rnorm(5000, sd = 0.5)) > 1.5
  )

lm(talent ~ looks, data = hollywood) |>
  broom::tidy() |> dplyr::filter(term == "looks")

lm(talent ~ looks, data = dplyr::filter(hollywood, famous)) |>
  broom::tidy() |> dplyr::filter(term == "looks")

1: Independent by construction: the correlation between looks and talent is zero in this world.
2: Fame is the collider: you get in on looks, talent, or luck.
3: In the full population: no relationship, correctly.
4: Among the famous only: a strong negative association appears from thin air. Among people who cleared the bar, being gorgeous means you needed less talent to get in –selection did the distorting, no villain required.

# A tibble: 1 × 5
  term  estimate std.error statistic p.value
  <chr>    <dbl>     <dbl>     <dbl>   <dbl>
1 looks  -0.0162    0.0141     -1.15   0.251
# A tibble: 1 × 5
  term  estimate std.error statistic  p.value
  <chr>    <dbl>     <dbl>     <dbl>    <dbl>
1 looks   -0.592    0.0303     -19.6 3.14e-69

hollywood |>
  ggplot(aes(looks, talent, color = famous)) +
  geom_point(alpha = 0.3, size = 1) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
  scale_color_manual(values = c("grey70", "#F75431")) +
  labs(
    title = "Collider bias, a.k.a. Berkson's paradox",
    subtitle = "No relationship in the population (grey). A strong negative one among the selected (orange).",
    color = "Famous"
    ) +
  theme_minimal()

This is not a curiosity –it is everywhere your data was filtered before you got it: hospital patients (admission is a collider), survey respondents (responding is), hired employees, published papers, surviving companies. If your dataset exists because its rows cleared a bar, Berkson is already in it.

You’ve seen these before, animated

The interactive explainers in session 4 of the workshop let you drag the strength of a collider and an omitted variable and watch the bias move. This session is the theory those toys were built on. The chain (x → m → y), the third atom, gets an exercise below: adjusting for a mediator blocks the very effect you’re trying to measure –a third distinct way “controlling for more” backfires.

The moral of the two experiments, and arguably of the whole causal revolution: whether to adjust for a variable is not a statistical question. The same + z in your formula is the cure in a fork and the poison in a collider –and only the DAG, i.e. your drawn assumptions, can tell you which world you’re in³.

PART IV: Earning the arrow

So how do you ever get to say “causes”? By design, in descending order of purity:

Randomization. Flip the coin yourself. Random assignment cuts every incoming arrow to the treatment –no forks left, nothing to adjust. It’s the reason A/B tests and clinical trials are the gold standard, and why sample() is secretly the most powerful causal tool in R.
Natural experiments and panel designs. When you can’t randomize, hunt for situations where the world almost did: policies applied to some regions and not others (difference-in-differences), arbitrary thresholds like exam cutoffs (regression discontinuity), lottery-like exposures (instrumental variables), the same units observed over and over (fixed effects), a lone treated unit rebuilt from a weighted blend of untreated ones (synthetic control), and honest adjustment when treatment goes to observably different units (matching). Each is covered in Causal Inference: The Mixtape and The Effect –both free, both excellent, both with worked R code. And every one of them gets its own interactive machine in Part V, right below.
Adjustment with a defended DAG. The weakest but most common: observational data plus a graph you are willing to draw in public. The graph doesn’t make it true –it makes it criticizable, which is what science runs on.

Randomization deserves its own picture, because it is the only design that removes arrows instead of arguing about them –the coin flip owns the treatment, so nothing else can:

How to ruin a perfect experiment: post-treatment bias

Randomization guards the front door –but there is a back door you can open yourself, after the coin flip did its job. A post-treatment variable is anything measured after treatment that treatment itself can affect. Adjust for one, and you can bias a flawless RCT⁴.

Say a job-training program is randomly assigned and raises earnings by exactly 2 (thousand, per year –we wire it in). Trainees can also earn a certificate, which depends on the training and on unobserved motivation. A well-meaning analyst reasons: “let’s compare people with the same qualifications” and controls for the certificate. Look at the graph before the code –the certificate is our collider atom, grown downstream of the treatment:

drawDAG({
  id: "dag-posttreat",
  w: 375, h: 175,
  nodes: [
    {id: "D", x: 60, y: 120, label: "training (randomized!)"},
    {id: "C", x: 170, y: 120, label: "certificate --post-treatment"},
    {id: "U", x: 227, y: 40, dashed: true, label: "motivation (unseen)", ldy: -24},
    {id: "Y", x: 285, y: 120, label: "earnings"}
  ],
  edges: [
    {from: "D", to: "C"},
    {from: "U", to: "C", color: C.ghost, dashed: true},
    {from: "U", to: "Y", color: C.ghost, dashed: true},
    {from: "D", to: "Y", width: 2.4, bend: 52},
    {from: "C", to: "Y", color: C.danger, dashed: true, opacity: 0.65,
     label: "adjusting here links D to U", ly: 20}
  ]
})

set.seed(44)

rct <- tibble(
  training    = rbinom(5000, 1, 0.5),
  motivation  = rnorm(5000),
  certificate = as.integer(0.8 * training + motivation
                           + rnorm(5000, sd = 0.5) > 0.5),
  earnings    = 30 + 2 * training + 3 * motivation + rnorm(5000, sd = 2)
  )

lm(earnings ~ training, data = rct) |>
  broom::tidy() |> dplyr::filter(term == "training")

lm(earnings ~ training + certificate, data = rct) |>
  broom::tidy() |> dplyr::filter(term == "training")

1: A genuine experiment: the coin flip owns the treatment.
2: Motivation is unobserved –the analyst never sees this column.
3: The post-treatment variable: getting certified takes training or motivation. D → C ← U –the collider atom, wearing a lanyard.
4: Earnings: exactly +2 for training, +3 per unit of motivation. The certificate itself pays nothing.
5: The clean comparison: randomization works, and the estimate lands on the truth, ~2.
6: “Controlling for qualifications” cuts the estimate to a third of the truth –in a randomized experiment, with an honest analyst, using the most routine line of R imaginable.

# A tibble: 1 × 5
  term     estimate std.error statistic  p.value
  <chr>       <dbl>     <dbl>     <dbl>    <dbl>
1 training     1.99     0.103      19.2 1.75e-79
# A tibble: 1 × 5
  term     estimate std.error statistic  p.value
  <chr>       <dbl>     <dbl>     <dbl>    <dbl>
1 training    0.698    0.0876      7.96 2.05e-15

Where did the effect go? Conditioning on the certificate opened the collider: among certificate-holders, the untrained must be unusually motivated (they got certified without the program’s help). Check:

rct |>
  dplyr::filter(certificate == 1) |>
  dplyr::summarise(mean_motivation = mean(motivation), .by = training)

1: Among the certified, the control group is the more motivated one –so “comparing like with like” actually compares trained-and-ordinary against untrained-and-driven, and hands part of the training effect to motivation. The same logic biases the certificate-free stratum in the same direction.

# A tibble: 2 × 2
  training mean_motivation
     <int>           <dbl>
1        1           0.583
2        0           0.993

You have seen both halves of this before: the certificate is partly a mediator (blocking a path the effect travels through) and partly a collider with the unseen (opening a path that was shut). Either half alone is enough; real post-treatment variables are usually both. And it is everywhere: the gender wage gap “controlling for occupation” (occupation is post-treatment to discrimination), drug trials that condition on a side effect or analyse “only those who complied”, school-effect studies that control for end-of-year test scores. The defense is a question so simple it fits in a code review: when was this variable measured? If the answer is “after treatment”, it does not go on the right-hand side –randomization bought you a clean estimate of the total effect, and adjusting downstream sells it back.

What you may never do is run lm(y ~ x + everything) and read causality off the stars. Now you know the three reasons why –the fork you must adjust for, the collider you must not, and the post-treatment trap that smuggles a collider into your own experiment– and you can simulate all of them from scratch.

PART V: The gallery –break the assumption, watch the estimate 🎛️

Reading about natural experiments is one thing; feeling where they break is another. Each machine below wires up a world where the true effect is known, because we built it –and then estimates that effect the way a real study would. Next to every plot sits the design’s DAG, and next to the DAG, sliders.

Here is the trick, and it is Hernán’s motto run in reverse: every design in this gallery works because of an arrow that isn’t there. Parallel trends is a missing arrow. Exclusion is a missing arrow. The red slider on each machine draws the forbidden arrow into existence –you will see it appear on the DAG– and the readout tiles show the estimate drifting away from the truth as you do⁵.

tiles = (items) => htl.html`<div class="ci-tiles">${items.map(d => htl.html`<div class="ci-tile" style="border-top-color:${d.color ?? "var(--bs-border-color)"}"><div class="lab">${d.lab}</div><div class="val">${d.val}</div></div>`)}</div>`

drawDAG = ({id, w = 340, h = 170, nodes, edges}) => {
  const R = 17;
  const N = new Map(nodes.map(d => [d.id, d]));
  const parts = edges
    .filter(e => (e.opacity ?? 1) > 0.03)
    .map((e, i) => {
      const s = N.get(e.from), t = N.get(e.to);
      const rs = s.hidden ? 2 : R + 3, rt = t.hidden ? 2 : R + 8;
      const dx = t.x - s.x, dy = t.y - s.y, L = Math.hypot(dx, dy);
      const ux = dx / L, uy = dy / L;
      const x1 = s.x + ux * rs, y1 = s.y + uy * rs;
      const x2 = t.x - ux * rt, y2 = t.y - uy * rt;
      const bend = e.bend ?? 0;
      const cx = (x1 + x2) / 2 - uy * bend, cy = (y1 + y2) / 2 + ux * bend;
      const mid = {x: (x1 + 2 * cx + x2) / 4, y: (y1 + 2 * cy + y2) / 4};
      const col = e.color ?? "currentColor";
      return htl.svg`<g opacity=${e.opacity ?? 1}>
        <marker id="${id}-arr${i}" viewBox="0 0 10 10" refX="8" refY="5" markerWidth="6.5" markerHeight="6.5" orient="auto-start-reverse"><path d="M0,0L10,5L0,10z" fill=${col}></path></marker>
        <path d="M${x1},${y1} Q${cx},${cy} ${x2},${y2}" fill="none" stroke=${col} stroke-width=${e.width ?? 1.8} stroke-dasharray=${e.dashed ? "5 4" : null} marker-end="url(#${id}-arr${i})"></path>
        ${e.label ? htl.svg`<text x=${mid.x + (e.lx ?? 0)} y=${mid.y + (e.ly ?? -7)} fill=${col} font-size="10.5" font-weight="600" text-anchor="middle">${e.label}</text>` : null}
      </g>`;
    });
  const dots = nodes.filter(n => !n.hidden).map(n => htl.svg`<g>
      <circle cx=${n.x} cy=${n.y} r=${R} fill="var(--bs-body-bg, white)" stroke=${n.dashed ? "var(--bs-secondary-color, #888)" : "currentColor"} stroke-width="1.6" stroke-dasharray=${n.dashed ? "4 3" : null}></circle>
      <text x=${n.x} y=${n.y + 4.5} text-anchor="middle" font-size="13" font-weight="700" fill=${n.dashed ? "var(--bs-secondary-color, #888)" : "currentColor"}>${n.id}</text>
      ${n.label ? htl.svg`<text x=${n.x} y=${n.y + (n.ldy ?? R + 13)} text-anchor="middle" font-size="10" fill="var(--bs-secondary-color, #888)">${n.label}</text>` : null}
    </g>`);
  return htl.svg`<svg viewBox="0 0 ${w} ${h}" style="max-width:${w}px;width:100%;height:auto;overflow:visible">${parts}${dots}</svg>`;
}

Difference-in-differences (DiD)

A job-training program arrives in one region at \(t = 6\) and never reaches its neighbor. Comparing the regions after the program is confounded (they differ for a hundred reasons); comparing the treated region before vs after is confounded too (things trend over time anyway). DiD’s move: compare the changes –how much the treated region moved, minus how much the control moved.

That subtraction is only an effect under one assumption: absent treatment, both regions would have moved in parallel. Level differences are fine –the arrow \(G \to Y\) can exist. What must not exist is an arrow from group to trend. Drag the red slider and draw it.

tiles([
  {lab: "truth (wired in)", val: fmt(did_effect), color: "currentColor"},
  {lab: "DiD estimate", val: fmt(didEst), color: C.ctrl},
  {lab: "bias", val: d3.format("+.2f")(didEst - did_effect),
   color: Math.abs(didEst - did_effect) > 0.3 ? C.danger : C.ghost}
])

did = {
  const noise = d3.randomNormal.source(d3.randomLcg(0.4242))(0, 0.9);
  const rows = [];
  for (const g of ["Control", "Treated"]) {
    for (let t = 1; t <= 10; t++) {
      for (let i = 0; i < 30; i++) {
        const y = (g === "Treated" ? 4 : 1) + 0.5 * t
          + (g === "Treated" ? did_drift * (t - 1) : 0)
          + (g === "Treated" && t >= 6 ? did_effect : 0)
          + noise();
        rows.push({g, t, y});
      }
    }
  }
  return rows;
}

didMeans = {
  const rows = [];
  for (const [g, ts] of d3.rollups(did, v => d3.mean(v, d => d.y), d => d.g, d => d.t)) {
    for (const [t, m] of ts) rows.push({g, t, m});
  }
  return rows.sort((a, b) => a.t - b.t);
}

Try this. With the violation at 0, drag the true effect around: the estimate tracks the truth, because the dashed counterfactual –“treated, had nothing happened”– is exactly right. Now set the differential trend to \(+0.4\): the treated region was rising faster anyway, the counterfactual is drawn too flat, and DiD hands the extra trend to the program. Then look left of the dashed line: the pre-treatment lines are visibly not parallel. That is why every serious DiD paper opens with a plot of pre-trends –it is the one part of the assumption the data lets you audit.
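
The size of that bias can be checked by hand. The sketch below is plain JavaScript (Node or a browser console, not an OJS cell) and assumes a noise-free copy of the two-region world: ten periods, treatment from \(t = 6\). The helper names `makeWorld` and `didEstimate` are hypothetical, and the page's own `didEst` may be computed differently.

```javascript
// Noise-free two-region world; all names here are illustrative assumptions.
function makeWorld({ effect, drift }) {
  const rows = [];
  for (const g of ["Control", "Treated"]) {
    for (let t = 1; t <= 10; t++) {
      const treated = g === "Treated";
      const y = (treated ? 4 : 1)             // level difference: allowed
        + 0.5 * t                              // common trend
        + (treated ? drift * (t - 1) : 0)      // differential trend: the violation
        + (treated && t >= 6 ? effect : 0);    // true effect, switched on at t = 6
      rows.push({ g, t, y });
    }
  }
  return rows;
}

const mean = (xs) => xs.reduce((s, v) => s + v, 0) / xs.length;

// DiD = (treated post - treated pre) - (control post - control pre)
function didEstimate(rows, start = 6) {
  const cell = (g, post) =>
    mean(rows.filter((r) => r.g === g && (r.t >= start) === post).map((r) => r.y));
  return (cell("Treated", true) - cell("Treated", false))
       - (cell("Control", true) - cell("Control", false));
}

const clean = didEstimate(makeWorld({ effect: 2, drift: 0 }));
const drifting = didEstimate(makeWorld({ effect: 2, drift: 0.4 }));
console.log(clean.toFixed(2), drifting.toFixed(2)); // 2.00 4.00
```

With five pre-periods and five post-periods, the average post period sits five steps after the average pre period, so a differential trend of 0.4 per period adds 5 × 0.4 = 2 to the estimate: a true effect of 2 is reported as 4.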

Regression discontinuity (RDD)

A scholarship goes to everyone scoring at or above a cutoff, and to no one below. Students at 59.9 and 60.1 are, for all practical purposes, the same person on two sides of a bureaucratic line –so any jump in later outcomes exactly at the cutoff is the scholarship’s doing. The DAG shows why this works: the running variable \(X\) may affect the outcome however it likes, as long as its effect is smooth at the cutoff; only \(D\) is allowed to jump there.

The threat here is not a hidden arrow but a bad model of the smooth part. You estimate the jump by fitting a line on each side within a bandwidth around the cutoff. If the world is curved and your window is wide, straight lines mis-fit the curvature and the mistake lands in the jump.
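
A minimal sketch of that estimator in plain JavaScript (not an OJS cell), under stated assumptions: the world is noise-free, "curvature" is modelled as a cubic term, and the helpers `olsLine`, `rddWorld` and `rddEstimate` are hypothetical names rather than the page's own functions.

```javascript
// Ordinary least squares for y = a + b * x; returns intercept a and slope b.
function olsLine(pts) {
  const n = pts.length;
  const mx = pts.reduce((s, p) => s + p.x, 0) / n;
  const my = pts.reduce((s, p) => s + p.y, 0) / n;
  let sxy = 0, sxx = 0;
  for (const p of pts) {
    sxy += (p.x - mx) * (p.y - my);
    sxx += (p.x - mx) ** 2;
  }
  const b = sxy / sxx;
  return { a: my - b * mx, b };
}

// Noise-free world: smooth in x except for a jump of `effect` at x = 0.
// The cubic term is an assumed stand-in for "curvature".
function rddWorld({ effect, curvature }) {
  const pts = [];
  for (let i = -40; i <= 40; i++) {
    const x = i / 4;
    pts.push({ x, y: 0.4 * x + 0.01 * curvature * x ** 3 + (x >= 0 ? effect : 0) });
  }
  return pts;
}

// Fit one line per side inside the bandwidth; the jump is the intercept gap at 0.
function rddEstimate(pts, bw) {
  const below = olsLine(pts.filter((p) => p.x < 0 && p.x >= -bw));
  const above = olsLine(pts.filter((p) => p.x >= 0 && p.x <= bw));
  return above.a - below.a;
}

const straight = rddEstimate(rddWorld({ effect: 3, curvature: 0 }), 10);
const wide = rddEstimate(rddWorld({ effect: 3, curvature: -1 }), 10);
const narrow = rddEstimate(rddWorld({ effect: 3, curvature: -1 }), 2);
console.log({ straight, wide, narrow });
```

In a straight world the estimate recovers the wired-in jump of 3 at any bandwidth. In the curved world the wide window misses badly and the narrow one stays close, at the cost of using far fewer points.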

tiles([
  {lab: "truth (wired in)", val: fmt(rdd_effect), color: "currentColor"},
  {lab: "RDD estimate", val: fmt(rddFit.est), color: C.ctrl},
  {lab: "bias", val: d3.format("+.2f")(rddFit.est - rdd_effect),
   color: Math.abs(rddFit.est - rdd_effect) > 0.35 ? C.danger : C.ghost},
  {lab: "points in window", val: `${rddFit.n} / 280`, color: C.ghost}
])

rddPts = {
  const ux = d3.randomUniform.source(d3.randomLcg(0.111))(-10, 10);
  const ne = d3.randomNormal.source(d3.randomLcg(0.222))(0, 1.1);
  return d3.range(280).map(() => ({x: ux(), e: ne()}));
}

rddFit = {
  const L = olsFit(rdd.filter(d => d.side === "Below cutoff" && d.used));
  const R = olsFit(rdd.filter(d => d.side === "Above cutoff" && d.used));
  return {L, R, est: R.a - L.a, n: rdd.filter(d => d.used).length};
}

Plot.plot({
  width: Math.min(width, 780),
  height: 420,
  x: {domain: [-10.5, 10.5], label: "Running variable (distance to cutoff)"},
  y: {domain: [-8, 17], label: "Outcome", grid: true},
  color: {legend: true, domain: ["Below cutoff", "Above cutoff"], range: [C.ctrl, C.treat]},
  marks: [
    Plot.rect([{}], {x1: -rdd_bw, x2: rdd_bw, y1: -8, y2: 17, fill: "currentColor", fillOpacity: 0.05}),
    Plot.ruleX([0], {stroke: "currentColor", strokeDasharray: "3 3", opacity: 0.55}),
    Plot.dot(rdd.filter(d => !d.used), {x: "x", y: "y", fill: C.ghost, opacity: 0.18, r: 2.6, title: d => `x = ${fmt(d.x)}, y = ${fmt(d.y)} (outside bandwidth)`}),
    Plot.dot(rdd.filter(d => d.used), {x: "x", y: "y", fill: "side", opacity: 0.55, r: 2.6, title: d => `x = ${fmt(d.x)}, y = ${fmt(d.y)}`}),
    Plot.line([{x: -rdd_bw, y: rddFit.L.a - rddFit.L.b * rdd_bw}, {x: 0, y: rddFit.L.a}], {x: "x", y: "y", stroke: C.ctrl, strokeWidth: 2.6}),
    Plot.line([{x: 0, y: rddFit.R.a}, {x: rdd_bw, y: rddFit.R.a + rddFit.R.b * rdd_bw}], {x: "x", y: "y", stroke: C.treat, strokeWidth: 2.6}),
    Math.abs(rddFit.est) > 0.15
      ? Plot.arrow([{}], {x1: 0, y1: rddFit.L.a, x2: 0, y2: rddFit.R.a, stroke: "currentColor", strokeWidth: 1.6})
      : null,
    Plot.text([{x: 0.35, y: (rddFit.L.a + rddFit.R.a) / 2}], {x: "x", y: "y", text: [`jump = ${fmt(rddFit.est)}`], textAnchor: "start", fontWeight: 700, fontSize: 11.5, fill: "currentColor"})
  ]
})

Try this. With curvature at 0, the bandwidth barely matters –straight lines fit a straight world at any width. Now set curvature to \(-1\) and the bandwidth to 10: the lines chase the curve, miss it most exactly at the cutoff, and the bias tile lights up. Shrink the bandwidth and watch the bias drain away –along with the points-in-window count. Narrow is honest but noisy, wide is precise but wrong: that trade-off is the entire craft of RDD, and choosing the bandwidth well is what packages like rdrobust are for.

Instrumental variables (IV)

You want the effect of \(X\) on \(Y\), but an unobserved \(U\) pushes on both –you can’t adjust for what you can’t see, so OLS is doomed. The IV escape: find a variable \(Z\) that (1) genuinely moves \(X\) –relevance, the thick arrow– and (2) touches \(Y\) through \(X\) only –exclusion, the missing arrow. A lottery, a draft number, distance to the nearest college. Then whatever wiggle \(Z\) induces in \(Y\) must have traveled via \(X\), and the ratio of the two wiggles –\(Z\)’s effect on \(Y\) divided by \(Z\)’s effect on \(X\)– is the causal effect of \(X\) on \(Y\), however strong the unseen confounding.
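
The ratio can be worked through exactly on a toy world. This is a sketch in plain JavaScript (not an OJS cell) under a loud assumption: instead of random draws, \(Z\) and \(U\) take the values −1 and +1 in a balanced 2 × 2 design, so they are exactly uncorrelated and there is no noise. The names `cov`, `ivWorld`, `ols` and `iv` are illustrative.

```javascript
// Covariance of two accessors over the same points (divides by n).
function cov(pts, fa, fb) {
  const n = pts.length;
  const ma = pts.reduce((s, p) => s + fa(p), 0) / n;
  const mb = pts.reduce((s, p) => s + fb(p), 0) / n;
  return pts.reduce((s, p) => s + (fa(p) - ma) * (fb(p) - mb), 0) / n;
}

// Noise-free world with a true effect of 2. z and u form a balanced 2 x 2
// design, so they are exactly uncorrelated (an assumption made here so the
// arithmetic is exact).
function ivWorld({ strength, conf, excl }) {
  const pts = [];
  for (const z of [-1, 1]) {
    for (const u of [-1, 1]) {
      const x = strength * z + conf * u;       // relevance: z moves x
      const y = 2 * x + conf * u + excl * z;   // excl != 0 draws the forbidden arrow
      pts.push({ z, u, x, y });
    }
  }
  return pts;
}

const ols = (pts) => cov(pts, (p) => p.x, (p) => p.y) / cov(pts, (p) => p.x, (p) => p.x);
const iv  = (pts) => cov(pts, (p) => p.z, (p) => p.y) / cov(pts, (p) => p.z, (p) => p.x);

const valid = ivWorld({ strength: 1, conf: 1, excl: 0 });
const leaky = ivWorld({ strength: 1, conf: 1, excl: 0.5 });
console.log(ols(valid), iv(valid), iv(leaky)); // 2.5 2 2.5
```

OLS absorbs the confounder and reports 2.5. The IV ratio returns exactly 2 while exclusion holds, and drifts to 2 + excl / strength = 2.5 once \(Z\) reaches \(Y\) directly. The same formula explains why a weak instrument (small `strength`) magnifies even a small exclusion violation.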

The true effect below is fixed at 2, and both estimators run 120 times on fresh samples, so you can see not just where each one lands but how much it scatters.

{
  const med = w => d3.median(ivEsts.filter(d => d.which === w), d => d.est);
  const qs = ivEsts.filter(d => d.which === "IV").map(d => d.est).sort(d3.ascending);
  return tiles([
    {lab: "truth (wired in)", val: "2.00", color: "currentColor"},
    {lab: "OLS (median of 120)", val: fmt(med("OLS")), color: C.treat},
    {lab: "IV (median of 120)", val: fmt(med("IV")), color: C.ctrl},
    {lab: "IV spread (IQR)", val: fmt(d3.quantile(qs, 0.75) - d3.quantile(qs, 0.25)), color: C.ghost}
  ]);
}

ivBase = {
  const rn = d3.randomNormal.source(d3.randomLcg(0.333))(0, 1);
  const ru = d3.randomUniform.source(d3.randomLcg(0.444))(-1, 1);
  return d3.range(120).map(() => ({
    jit: ru(),
    draws: d3.range(60).map(() => ({z: rn(), u: rn(), e1: rn(), e2: rn()}))
  }));
}

ivEsts = {
  const truth = 2;
  const rows = [];
  for (const rep of ivBase) {
    const pts = rep.draws.map(d => {
      const x = iv_strength * d.z + iv_conf * d.u + 0.5 * d.e1;
      const y = truth * x + iv_conf * d.u + iv_excl * d.z + 0.5 * d.e2;
      return {z: d.z, x, y};
    });
    const cov = (fa, fb) => {
      const ma = d3.mean(pts, fa), mb = d3.mean(pts, fb);
      return d3.mean(pts, p => (fa(p) - ma) * (fb(p) - mb));
    };
    rows.push({which: "OLS", est: cov(p => p.x, p => p.y) / cov(p => p.x, p => p.x), jit: rep.jit});
    rows.push({which: "IV", est: cov(p => p.z, p => p.y) / cov(p => p.z, p => p.x), jit: rep.jit});
  }
  return rows;
}

Try this. With the defaults, OLS piles up around 2.5 –confounded, tightly and confidently wrong– while IV scatters honestly around the truth. Crank confounding to 2: OLS gets worse, IV doesn’t care. Now weaken the instrument to 0.1 and watch IV’s cloud detonate across the axis (check the IQR tile) –a weak instrument trades OLS’s bias for variance so wild the cure rivals the disease. Finally, restore strength and drag the exclusion violation: the red arrow appears, and IV –still narrow, still confident– slides off the truth. An IV that violates exclusion doesn’t fail loudly; it lies precisely.

Matching

Treated units are simply different: a training program enrolls the already-motivated, a new drug goes to the sickest. Comparing raw group means is hopeless. Matching’s move: for every treated unit, find the untreated unit that looks most like it on the observables, and compare each pair⁶. The plot below draws every match as a thin line –treated points paired with their nearest control on \(x\), unmatched controls fading into the background.

The identifying assumption even has “assumption” in its name: the conditional independence assumption, better known as selection on observables. Treatment may depend on \(x\) as strongly as it likes –you can see \(x\), so you can match on it. What must not exist is an arrow from anything unobserved into both treatment and outcome. That is not a technique; it is a claim about the world. Drag the red slider and make it false.

tiles([
  {lab: "truth (wired in)", val: fmt(m_effect), color: "currentColor"},
  {lab: "naive gap in means", val: fmt(mMatch.naive), color: C.treat},
  {lab: "matching estimate", val: fmt(mMatch.att), color: C.ctrl},
  {lab: "matching bias", val: d3.format("+.2f")(mMatch.att - m_effect),
   color: Math.abs(mMatch.att - m_effect) > 0.3 ? C.danger : C.ghost}
])

mMatch = {
  const T = mWorld.filter(d => d.D), Cc = mWorld.filter(d => !d.D);
  const pairs = T.map(t => {
    let best = Cc[0], bd = Infinity;
    for (const c of Cc) {
      const dd = Math.abs(c.x - t.x);
      if (dd < bd) { bd = dd; best = c; }
    }
    return {t, c: best};
  });
  return {
    pairs,
    att: d3.mean(pairs, p => p.t.y - p.c.y),
    naive: d3.mean(T, d => d.y) - d3.mean(Cc, d => d.y),
    used: new Set(pairs.map(p => p.c.i))
  };
}

Try this. With the defaults, treatment chases high \(x\): the naive gap badly overstates the effect (treated units were headed for high outcomes anyway), while matching –comparing like with like– lands on the truth. Crank selection on \(x\) to 2: naive gets worse, matching doesn’t blink; watch the match lines stretch as overlap thins. Now the red slider: selection on the unobservable biases naive and matched alike, and nothing on screen warns you –the match lines look exactly as tidy as before. That silence is the point: “we matched on everything we had” defends nothing. It is the DAG, not the algorithm, doing the work.

Fixed effects

Follow the same units over time –five firms, 24 quarters each– and something wonderful happens: every confounder that doesn’t change within a unit (management quality, location, culture) gets absorbed by the unit’s own average. That is the fixed-effects (within) estimator⁷: subtract each unit’s own means from \(x\) and \(y\), and only within-unit movement remains.

Below, each firm’s baseline is negatively related to its typical \(x\) –so the pooled cloud can slope one way while the truth inside every firm goes the other. Flip the toggle to perform the FE move yourself, and then remember its blind spot: a confounder that moves over time within a unit survives demeaning untouched.
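
The demeaning step is small enough to do by hand. The sketch below is plain JavaScript (not an OJS cell) on a hypothetical, noise-free panel of three firms with three observations each. The names `slope`, `panel` and `demean` are illustrative, not the page's own.

```javascript
// Slope of the least-squares line through (x, y) pairs.
function slope(pts) {
  const n = pts.length;
  const mx = pts.reduce((s, p) => s + p.x, 0) / n;
  const my = pts.reduce((s, p) => s + p.y, 0) / n;
  const sxy = pts.reduce((s, p) => s + (p.x - mx) * (p.y - my), 0);
  const sxx = pts.reduce((s, p) => s + (p.x - mx) ** 2, 0);
  return sxy / sxx;
}

// Three hypothetical firms: inside each one, y rises by exactly +1 per unit
// of x, but firms with a higher typical x start from a lower baseline.
const panel = [];
for (const [unit, xbar] of [["A", -1], ["B", 0], ["C", 1]]) {
  const baseline = -3 * xbar;
  for (const dev of [-1, 0, 1]) {
    const x = xbar + dev;
    panel.push({ unit, x, y: baseline + 1 * x });
  }
}

// Within transformation: subtract each unit's own mean from x and from y.
function demean(rows) {
  const byUnit = new Map();
  for (const r of rows) {
    if (!byUnit.has(r.unit)) byUnit.set(r.unit, []);
    byUnit.get(r.unit).push(r);
  }
  const out = [];
  for (const group of byUnit.values()) {
    const mx = group.reduce((s, r) => s + r.x, 0) / group.length;
    const my = group.reduce((s, r) => s + r.y, 0) / group.length;
    for (const r of group) out.push({ unit: r.unit, x: r.x - mx, y: r.y - my });
  }
  return out;
}

const pooledSlope = slope(panel);
const withinSlope = slope(demean(panel));
console.log(pooledSlope, withinSlope); // -0.5 1
```

The pooled line slopes downward (−0.5) even though every firm's own relationship is +1. After demeaning, the between-firm differences are gone and the slope is exactly the within-firm effect.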

tiles([
  {lab: "truth (wired in)", val: fmt(fe_effect), color: "currentColor"},
  {lab: "pooled OLS slope", val: fmt(feEst.pooled), color: C.treat},
  {lab: "fixed-effects slope", val: fmt(feEst.within), color: C.ctrl},
  {lab: "FE bias", val: d3.format("+.2f")(feEst.within - fe_effect),
   color: Math.abs(feEst.within - fe_effect) > 0.25 ? C.danger : C.ghost}
])

feBase = {
  const rn = d3.randomNormal.source(d3.randomLcg(0.666))(0, 1);
  return d3.range(5).map(i => ({
    unit: "ABCDE"[i],
    xbar: i - 2,
    obs: d3.range(24).map(() => ({v: rn(), w: rn(), e: rn()}))
  }));
}

feData = {
  const rows = [];
  for (const u of feBase) {
    const alpha = -fe_between * 1.3 * u.xbar;
    for (const o of u.obs) {
      const x = u.xbar + 0.6 * o.v + 0.5 * fe_tvc * o.w;
      rows.push({unit: u.unit, x, y: alpha + fe_effect * x + fe_tvc * o.w + 0.4 * o.e});
    }
  }
  return rows;
}

{
  const pts = fe_view ? feEst.dem : feData;
  const b = fe_view ? feEst.within : feEst.pooled;
  const a = fe_view ? 0 : feEst.pooledA;
  const xd = fe_view ? [-2.4, 2.4] : [-4, 4];
  const yd = fe_view ? [-4, 4] : [-9.5, 9.5];
  return Plot.plot({
    width: Math.min(width, 780),
    height: 420,
    x: {domain: xd, label: fe_view ? "x, within-unit (demeaned)" : "x", grid: true},
    y: {domain: yd, label: fe_view ? "Outcome, within-unit (demeaned)" : "Outcome"},
    color: {legend: true, domain: ["A", "B", "C", "D", "E"], range: feColors},
    marks: [
      Plot.line([{x: xd[0], y: a + b * xd[0]}, {x: xd[1], y: a + b * xd[1]}], {x: "x", y: "y", stroke: "currentColor", strokeWidth: 2, strokeDasharray: "6 4", opacity: 0.75, clip: true}),
      Plot.dot(pts, {x: "x", y: "y", fill: "unit", opacity: 0.65, r: 3.4, clip: true, title: d => `firm ${d.unit}: x = ${fmt(d.x)}, y = ${fmt(d.y)}`}),
      !fe_view ? Plot.dot(feEst.means, {x: "mx", y: "my", fill: "unit", r: 9, clip: true}) : null,
      !fe_view ? Plot.text(feEst.means, {x: "mx", y: "my", text: "unit", fill: "white", fontWeight: 700, fontSize: 11, clip: true}) : null,
      Plot.text([{x: xd[1] * 0.97, y: a + b * xd[1] * 0.97}], {x: "x", y: "y", text: [`${fe_view ? "FE" : "pooled"} slope = ${fmt(b)}`], textAnchor: "end", dy: -12, fontWeight: 700, fontSize: 11.5, fill: "currentColor"})
    ]
  });
}

Try this. With the defaults the pooled slope is far from the truth –push unit-level confounding to 2.5 and it changes sign: a within-firm effect of \(+1\) masquerades as a negative relationship, a full Simpson’s paradox. Flip the FE toggle: each firm’s cluster snaps to the origin, the between-firm variation vanishes, and the slope through what remains is the truth. Now the red slider: a time-varying confounder rides inside each firm’s own movement, and the FE slope drifts off while the toggle still looks reassuring. Fixed effects kill confounders that hold still –nothing kills the ones that move.

One state passes a policy; no other single state resembles it. Synthetic control’s move –born with California’s Prop 99 tobacco study⁸– is to build the missing control: find non-negative weights, summing to one, that make a blend of untreated “donor” units track the treated unit’s pre-policy path as closely as possible. If the blend shadowed the treated unit for years before the policy, its path after the policy is a credible counterfactual –and the gap between the lines is the effect.

The assumption: the treated unit lives inside the donor pool’s span –it must be interpolated, never extrapolated. The audit is built in: the pre-policy fit. Drag the red slider to make the treated unit move like no donor does, and watch the pre-fit RMSE tile confess before the estimate lies.
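The two numbers this audit turns on, the pre-policy fit and the post-policy gap, can be computed by hand on a toy example. The sketch below is an illustration under stated assumptions, not the machine's code: the donor paths, the six-period window, the effect of 2 and the helper names `blend` and `audit` are all hypothetical, and the weights are supplied by hand rather than found by an optimizer.

```javascript
// Hypothetical toy data: two donors over six periods (four pre-policy, two post).
const donors = [
  [1, 2, 3, 4, 5, 6],
  [3, 3, 5, 5, 7, 7]
];
const nPre = 4;
const trueEffect = 2;

// Treated unit: an even blend of the donors, plus the effect after the policy.
const treatedPath = donors[0].map((v, t) =>
  0.5 * v + 0.5 * donors[1][t] + (t >= nPre ? trueEffect : 0));

// Weighted blend of donor paths for a given weight vector.
const blend = w => donors[0].map((_, t) =>
  w.reduce((s, wk, k) => s + wk * donors[k][t], 0));

// Pre-policy fit (RMSE) and mean post-policy gap for a candidate set of weights.
function audit(w) {
  const synth = blend(w);
  const diffs = treatedPath.map((y, t) => y - synth[t]);
  const pre = diffs.slice(0, nPre);
  const post = diffs.slice(nPre);
  return {
    rmse: Math.sqrt(pre.reduce((s, d) => s + d * d, 0) / pre.length),
    gap: post.reduce((s, d) => s + d, 0) / post.length
  };
}

const goodFit = audit([0.5, 0.5]); // the blend that built the treated unit
const badFit = audit([1, 0]);      // leans on one donor only
console.log(goodFit, badFit);
```

With the weights that built the treated unit, the pre-policy RMSE is zero and the gap equals the wired-in effect of 2. With the one-donor weights the RMSE rises to about 0.79 and the gap drifts to 2.75: the poor pre-fit shows up alongside the biased estimate.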


tiles([
  {lab: "truth (wired in)", val: fmt(sc_effect), color: "currentColor"},
  {lab: "synth estimate (post gap)", val: fmt(scFit.gap), color: C.ctrl},
  {lab: "bias", val: d3.format("+.2f")(scFit.gap - sc_effect),
   color: Math.abs(scFit.gap - sc_effect) > 0.35 ? C.danger : C.ghost},
  {lab: "pre-policy fit (RMSE)", val: fmt(scFit.rmse),
   color: scFit.rmse > 0.6 ? C.danger : C.ghost}
])


htl.html`<small style="color: var(--bs-secondary-color)">Donor weights found by the optimizer: ${scFit.w.map(v => fmt(v)).join(" · ")} &nbsp;(non-negative, sum to 1 --most donors get <em>zero</em>, a signature of the method)</small>`


scBase = {
  const rn = d3.randomNormal.source(d3.randomLcg(0.777))(0, 1);
  const ru = d3.randomUniform.source(d3.randomLcg(0.888))(0, 1);
  return {
    donors: d3.range(6).map(() => ({
      a: 4 + 6 * ru(), b: 0.05 + 0.65 * ru(), c: 0.2 + 1.6 * ru(),
      noise: d3.range(16).map(() => rn())
    })),
    wStar: [0.35, 0.3, 0.2, 0.15, 0, 0],
    tNoise: d3.range(16).map(() => rn())
  };
}


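scFit = {
  // Pre-policy periods (array indices 0 to 9); the weights are fitted on these only.
  const pre = d3.range(0, 10);
  const D = scSeries.donors.map(d => d.vals);
  const yT = scSeries.treated;
  const J = D.length;
  // Euclidean projection onto the simplex (weights >= 0, summing to 1):
  // sort descending, find the threshold theta, then shift and clip at zero.
  const proj = v => {
    const u = [...v].sort((a, b) => b - a);
    const css = [];
    let s = 0, rho = 0;
    for (let i = 0; i < u.length; i++) { s += u[i]; css.push(s); }
    for (let i = 0; i < u.length; i++) if (u[i] - (css[i] - 1) / (i + 1) > 0) rho = i;
    const theta = (css[rho] - 1) / (rho + 1);
    return v.map(x => Math.max(0, x - theta));
  };
  // Start from equal weights, then run projected gradient descent.
  let w = new Array(J).fill(1 / J);
  for (let it = 0; it < 4000; it++) {
    // Gradient of the mean squared pre-policy gap with respect to each weight.
    const g = new Array(J).fill(0);
    for (const t of pre) {
      const r = d3.sum(w.map((wk, k) => wk * D[k][t])) - yT[t];
      for (let k = 0; k < J; k++) g[k] += (2 / pre.length) * r * D[k][t];
    }
    // Fixed-size gradient step, then project back onto the simplex.
    w = proj(w.map((wk, k) => wk - 0.0004 * g[k]));
  }
  // Synthetic path over all 16 periods, built from the fitted weights.
  const synth = d3.range(0, 16).map(t => d3.sum(w.map((wk, k) => wk * D[k][t])));
  return {
    w, synth,
    // Estimated effect: mean treated-minus-synthetic gap over the post-policy indices 10 to 15.
    gap: d3.mean(d3.range(10, 16), t => yT[t] - synth[t]),
    // Pre-policy fit: root mean squared gap over the fitting window.
    rmse: Math.sqrt(d3.mean(pre, t => (yT[t] - synth[t]) ** 2))
  };
}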
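The `proj` helper inside `scFit` is what keeps the weights non-negative and summing to one after every gradient step. As a standalone sketch of the same logic (the name `projectToSimplex` and the example vectors are chosen here for illustration and are not part of the gallery), it can be tried on a vector that breaks both constraints:

```javascript
// Euclidean projection onto the probability simplex {w : w_k >= 0, sum of w_k = 1}.
function projectToSimplex(v) {
  const u = [...v].sort((a, b) => b - a); // entries in descending order
  let running = 0, theta = 0;
  u.forEach((ui, i) => {
    running += ui;
    const candidate = (running - 1) / (i + 1);
    // Keep the threshold from the last position that still passes the test.
    if (ui - candidate > 0) theta = candidate;
  });
  // Shift every entry by the threshold and clip negatives to zero.
  return v.map(x => Math.max(0, x - theta));
}

// A vector that sums to 1.1 and has a negative entry.
const projected = projectToSimplex([0.5, 0.8, -0.2]);
console.log(projected); // approximately [0.35, 0.65, 0]

// A vector already on the simplex comes back unchanged (up to rounding).
const unchanged = projectToSimplex([0.2, 0.3, 0.5]);
console.log(unchanged);
```

The negative entry is clipped to exactly zero rather than merely shrunk, which is one reason the fitted weights in the machine tend to contain exact zeros.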


{
  const donorLines = scSeries.donors.flatMap(d => d.vals.map((v, i) => ({j: d.j, t: i + 1, y: v})));
  const tr = scSeries.treated.map((v, i) => ({t: i + 1, y: v}));
  const sy = scFit.synth.map((v, i) => ({t: i + 1, y: v}));
  const ymax = 28 + 8 * sc_unique;
  return Plot.plot({
    width: Math.min(width, 780),
    height: 420,
    x: {domain: [1, 18.4], label: "Time", ticks: d3.range(1, 17, 3)},
    y: {domain: [0, ymax], label: "Outcome", grid: true},
    marks: [
      Plot.ruleX([10.5], {stroke: "currentColor", strokeDasharray: "3 3", opacity: 0.5}),
      Plot.text([{x: 10.5, y: ymax * 0.97}], {x: "x", y: "y", text: ["policy →"], dx: 6, textAnchor: "start", fill: "currentColor", opacity: 0.6, fontSize: 11}),
      Plot.line(donorLines, {x: "t", y: "y", z: "j", stroke: C.ghost, strokeWidth: 1, opacity: 0.35}),
      Plot.text([{t: 2.2, y: donorLines[1].y}], {x: "t", y: "y", text: ["donor pool"], dy: -12, fill: C.ghost, fontSize: 11}),
      Plot.line(tr, {x: "t", y: "y", stroke: C.treat, strokeWidth: 2.6}),
      Plot.line(sy, {x: "t", y: "y", stroke: C.ctrl, strokeWidth: 2.2, strokeDasharray: "6 4"}),
      Plot.text([tr.at(-1)], {x: "t", y: "y", text: ["treated"], dx: 8, textAnchor: "start", fill: C.treat, fontWeight: 600, fontSize: 11.5}),
      Plot.text([sy.at(-1)], {x: "t", y: "y", text: ["synthetic"], dx: 8, dy: 12, textAnchor: "start", fill: C.ctrl, fontWeight: 600, fontSize: 11.5}),
      Math.abs(tr.at(-1).y - sy.at(-1).y) > 0.4
        ? Plot.arrow([{}], {x1: 16.6, y1: sy.at(-1).y, x2: 16.6, y2: tr.at(-1).y, stroke: "currentColor", strokeWidth: 1.6})
        : null,
      Plot.text([{x: 16.75, y: (tr.at(-1).y + sy.at(-1).y) / 2}], {x: "x", y: "y", text: [`gap = ${fmt(scFit.gap)}`], textAnchor: "start", fontWeight: 700, fontSize: 11.5, fill: "currentColor"})
    ]
  });
}

Try this. With the violation at 0, the dashed synthetic line shadows the treated unit for ten periods, then the paths split at the policy line by exactly the effect you dialed in –and check the weights under the tiles: the optimizer found the recipe (and gave most donors zero). Drag the true effect around; the pre-period fit never moves, because the effect hasn’t happened yet. Now the red slider: the treated unit grows a personality no blend of donors can imitate. The pre-fit RMSE tile turns red first –synthetic control is unusual among these six designs in that its key assumption leaves fingerprints in plain sight. An analyst who reports the gap while hiding a bad pre-fit isn’t unlucky; they’re lying with a chart.

Six designs, one lesson –the same lesson as the fork and the collider, at higher stakes: the estimate is only as good as the arrows you assumed away, and the data can never fully audit them for you. Draw your assumptions before drawing your conclusions.

Exercises 🏋️

Simulate the third atom, the chain: x causes m (m = 2*x + noise), m causes y (y = 3*m + noise). Regress y ~ x, then y ~ x + m. The total effect of x is 6 –which regression finds it, and what does the other one estimate instead?
Draw last week’s news. Take one “X linked to Y” headline, draw a DAG with at least one plausible fork, and use ggdag_adjustment_set() to state what the study would need to have adjusted for.
Break the fork fix: in the ice-cream world, adjust for a noisy version of summer (summer_reported = ifelse(runif(2000) < 0.8, summer, 1 - summer)). How much of the confounding comes back? (Measurement error makes “we controlled for it” a matter of degree.)
Randomise: rebuild the ice-cream world but assign ice_cream with rnorm(2000) –independent of summer, as an experiment would. Show the naive regression is now unbiased, no adjustment needed.
Dose the post-treatment poison: in the rct world, make the certificate depend only on training (drop motivation from line 3). Is adjusting for it still biased –and in which direction? Then make it depend only on motivation. Which atom is each variant, and which regression is safe in each world?
Reverse-engineer the DiD machine: with the differential trend at \(d\), the bias settles at almost exactly \(5d\). Why 5? (Hint: treatment starts at \(t=6\) of 10 periods –compare the average pre-period to the average post-period.) Then rebuild the whole machine in R with tibble() + lm(y ~ group * post) and confirm.
In the RDD machine, set curvature to \(-1\) and find the bandwidth where you’d stop trusting the estimate. Now simulate the same world in R and fit lm(y ~ x * above) inside your chosen window –does your jump match the machine’s?
Set the IV machine to strength 0.1 and confounding 2. Between confidently-biased OLS and wildly-scattered IV, which would you rather see published, and what would you demand from the authors either way? (This is not rhetorical –“weak instruments” is a named disease with named tests.)
In the matching machine, set selection on the unobservable to 0.8: naive and matched are now both wrong –are they wrong in the same direction, and why? (Read it off the DAG.) Then simulate the same world in R and match properly with MatchIt –does method = "nearest" agree with the machine?
Prove the fixed-effects machine to yourself in R: simulate five units with unit effects correlated with x, then show lm(y ~ x + factor(unit)) and the demeaned lm(I(y - ave(y, unit)) ~ I(x - ave(x, unit))) give the same slope –and that fixest::feols(y ~ x | unit) matches both.
The synthetic-control machine’s pre-fit RMSE is a diagnostic you can check before believing anything. Crank the uniqueness slider slowly: at what RMSE would you personally refuse to report the estimate? Defend your threshold, then read how the real method picks donors and predictors in the Mixtape chapter.

Key takeaways

Every causal structure is built from three atoms: fork (confounder), chain (mediator), collider.
Adjusting for a confounder removes bias; adjusting for a collider creates it – the same action, opposite effect, depending on the atom.
Post-treatment variables (measured after the treatment) are dangerous to adjust for, whichever atom they turn out to be.
Every natural-experiment design (DiD, RDD, IV, matching, fixed effects, synthetic control) rests on one identifying assumption – know what would have to be true for the estimate to be believed.


Footnotes

¹ This is Judea Pearl’s “ladder of causation”, from The Book of Why –the best non-technical entry to this whole subject.
² It is the subtitle of Hernán’s free HarvardX course Causal Diagrams: Draw Your Assumptions Before Your Conclusions, and the working philosophy of the book he co-wrote with James Robins, Causal Inference: What If –free, rigorous, and the standard reference once you outgrow this session.
³ ggdag::ggdag_adjustment_set(dag, exposure = "x", outcome = "y") automates the deduction: give it your graph and it returns the set(s) of variables to adjust for. The graph is your responsibility; the graph-reading is mechanical.
⁴ The definitive paper carries the moral in its title: Montgomery, Nyhan & Torres (2018), “How Conditioning on Posttreatment Variables Can Ruin Your Experiment and What to Do About It”. Their audit found the practice in roughly half of the experimental papers in top political-science journals.
⁵ This gallery is a love letter to two masters of the genre: Nick Huntington-Klein’s animated causal graphs, which showed how each design moves, and Kristoffer Magnusson’s interactive visualizations, which set the bar for how one should look. The interactives are written in Observable JS inside this very .qmd –the same {ojs} engine as the toggles in session 4 of the workshop, no server required.
⁶ The visual is a direct homage to Nick Huntington-Klein’s matching animation; the theory is the Mixtape’s matching and subclassification chapter. In real work, don’t hand-roll nearest neighbors like this machine does –use MatchIt, which also handles propensity scores, calipers and diagnostics.
⁷ The Mixtape’s panel data chapter covers the theory; in R the modern tool is fixest –feols(y ~ x | unit) and you’re done. The demeaning toggle below is Nick Huntington-Klein’s fixed-effects animation rebuilt as a switch.
⁸ Abadie, Diamond & Hainmueller’s California study is the canonical application; the Mixtape’s synthetic control chapter walks it end to end. In R: tidysynth (tidy interface) or the original Synth. The machine below solves for the weights live –projected gradient descent on the simplex, right in your browser– every time you move a slider.

Citation

BibTeX citation:

@online{amaya2026,
  author = {Amaya, Nelson},
  title = {Draw Your Assumptions Before Drawing Your Conclusions 🔀},
  date = {2026-07-04},
  url = {https://r4dev.netlify.app/sessions_thinking/04-causal/04-causal},
  langid = {en}
}

For attribution, please cite this work as:

Amaya, Nelson. 2026. “Draw Your Assumptions Before Drawing Your Conclusions 🔀.” July 4. https://r4dev.netlify.app/sessions_thinking/04-causal/04-causal.