Permutation, the bootstrap, and what a p-value actually says
statistics
inference
bootstrap
infer
p-value
All of hypothesis testing is one idea: build a world where nothing is going on, simulate it, and ask if your data would be surprising there. Learn it by shuffling with the tidyverse, then industrialize it with infer.
Author
Nelson Amaya
Published
July 2, 2026
Modified
July 5, 2026
âTo consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.â âRonald A. Fisher
PART I: One idea wearing twenty names
Open any statistics textbook and you find a zoo: t-tests, chi-squared tests, ANOVA, Mann-Whitney, each with its own formula, table and existence conditions. Here is the secret the zoo hides1: they are all the same test, and you already have the tools to run it, because the previous session taught you to simulate sampling distributions.
The one test:
Compute a statistic from your data âa difference in means, a correlation, anything.
Build a null world: a version of reality where the effect youâre testing does not exist.
Simulate that world many times, computing the same statistic each time.
Ask: where does my real statistic fall in that distribution of nothing-going-on?
If your number sits comfortably in the middle of what nothing-going-on produces, you have learned nothing. If it sits in the far tail, the boring explanation strains âthat tail probability is the infamous p-value.
Letâs run it on real penguins, imported straight from the web as this workshop likes it:
Now the null world. If species had nothing to do with bill length, the species labels would be arbitrary stickers âso we can build that world by shuffling the stickers and leaving the bills alone:
Click me!
set.seed(2024)null_world <-tibble(replicate =1:5000) |> dplyr::mutate(diff = purrr::map_dbl(replicate, \(i) { penguins |> dplyr::mutate(species =sample(species)) |> dplyr::summarise(m =mean(bill_length_mm), .by = species) |> dplyr::summarise(d = m[species =="Chinstrap"] - m[species =="Adelie"]) |> dplyr::pull(d) }) )null_world |>ggplot(aes(diff)) +geom_histogram(bins =60, fill ="#226F7F") +geom_vline(xintercept = observed_diff, color ="#F75431", linewidth =1) +labs(x ="Difference in means when species labels are shuffled",y =NULL,title ="5,000 worlds where species doesn't matter",subtitle ="The orange line is what we actually observed. It is not even close." ) +theme_minimal()
1
sample(species) with no other arguments shuffles the column âevery penguin keeps its bill, gets a random label. This single line is the null hypothesis, written in R instead of Greek.
The verdict needs no formula: in 5,000 label-shuffled worlds, the biggest difference noise ever produced is a fraction of what we observed. The p-value is the share of null worlds at least as extreme as reality:
Click me!
mean(abs(null_world$diff) >=abs(observed_diff))
[1] 0
Zero out of 5,0002. A t-test would have told you the same thing through a formula ânow you know what the formula is estimating.
PART III: The same machine, industrialized âinfer
The infer package (tidymodels family) turns the four steps into four verbs âspecify(), hypothesize(), generate(), calculate()â so the code reads like the logic:
Click me!
library(infer)set.seed(2024)null_dist <- penguins |> infer::specify(bill_length_mm ~ species) |> infer::hypothesize(null ="independence") |> infer::generate(reps =5000, type ="permute") |> infer::calculate(stat ="diff in means",order =c("Chinstrap", "Adelie"))null_dist |> infer::visualize() + infer::shade_p_value(obs_stat = observed_diff, direction ="two-sided")
1
The statisticâs recipe: outcome ~ grouping.
2
The null world: species and bill length are independent.
3
Build it 5,000 times, by permutation âour shuffle, industrialized.
4
Which statistic to track. Change "diff in means" to "diff in medians", "t", "Chisq"⌠and you have replaced half the textbook zoo without learning a new formula.
Bootstrap: the other side of the same coin
Permutation destroys a relationship to test it. The bootstrap does the reverse: it preserves the data and resamples it (with replacement) to measure how much your estimate wobbles âa confidence interval, without algebra:
Same pipeline, two changes: no hypothesize() (weâre estimating, not testing) and type = "bootstrap" (resample rows with replacement instead of shuffling labels).
Read it as: âif we could rerun this study endlessly, 95% of intervals built this way would capture the true difference.â Notice what the interval gives you that the p-value doesnât: a magnitude with uncertainty, not just a verdict. Prefer it in almost every writeup.
PART IV: What the p-value is not
The p-value may be the most misquoted number in science, enough that the American Statistical Association issued a formal statement about it. Tattoo the correct reading somewhere: the probability of data at least this extreme, assuming nothing is going on. It is not:
The probability the null hypothesis is true âthat question needs Bayes, the last stop of this track.
The probability the finding replicates.
A measure of importance âwith n large enough, a difference of 0.1mm in bill length becomes âsignificantâ while mattering to no penguin.
And one misuse you can now demonstrate to yourself, rather than take on faith. Test twenty hypotheses where nothing âtruly nothingâ is going on:
Both groups drawn from the same distribution: the null hypothesis is true by construction, twenty times. Yet significance arrives anyway âat 5% error per test, twenty tests make a false alarm more likely than not. This is the green jelly bean problem, and it is why âwe tested many things and report the ones that workedâ is not evidence, itâs arithmetic.
The jelly-bean trap doesnât require running twenty literal tests. Choosing which outcome to measure, which subgroups to inspect, which observations count as outliers âeach choice is a fork, and enough forks guarantee a âfindingâ. The defense is the discipline this workshop has preached from session 1: decide your analysis before looking, write it as code, version it. A committed script is a pre-registration you gave yourself.
TipExercises đď¸
Rerun the permutation test with stat = "diff in medians". Does the conclusion survive a change of statistic? (Robust findings usually do.)
Use the full penguins data to test whether Gentoo and Adelie differ in flipper_length_mm, start to finish with infer.
Shrink the data: dplyr::slice_sample(n = 10, by = species) before the pipeline. What happens to the width of the bootstrap interval? Connect this to the \(\sqrt{n}\) rule from the previous session.
Simulate the forking-paths trap: generate one fake dataset with a null effect and 10 outcome variables; test all ten; report âthe significant oneâ. Then explain, in one sentence, to your former self why this felt like science.
Next in the track: All models are wrong puts uncertainty inside the most-used model in the world âthe regression line.
The framing is Allen Downeyâs, from his classic post âThere is still only one testâ. The zoo of named tests exists because before computers, each special case needed its own hand-derivable algebra. You have a computer.âŠď¸
Which we honestly report as p < 1/5000, not p = 0 âour simulation just wasnât big enough to see an event that rare.âŠď¸
Citation
BibTeX citation:
@online{amaya2026,
author = {Amaya, Nelson},
title = {There Is Only One Test đ},
date = {2026-07-02},
url = {https://r4dev.netlify.app/sessions_thinking/02-inference/02-inference},
langid = {en}
}