Making data move 🚶

Learn to build animations from static graphs

gganimate

causal inference

ggplot2

ggimage

ggpp

treemap

Transform static plots into animations using gganimate. Examples include millenia of chery tree blossoms, COVID-19 aftermath and why causation does not imply correlation.

Author

Nelson Amaya

Published

July 31, 2022

Modified

November 22, 2024

“The greatest value of a picture is when it forces us to notice what we never expected to see”
–John Tukey

PART I: Cherry tree blossoms

People in Kyoto have been keeping track of the day the cherry trees blossom since the year 812. You can find the data here.

Sakura

Let’s first plot it and add a smoothing LOESS curve. We’ll use ggimages to change the point to an image of a cherry blossom and ggpp to add a plot inside of another plot that focuses attention on one part, with the geom_plot() function. This way we can focus attention on a particular part of the graph.

Notice that during the XX century the cherries started to blossom way earlier than they did for over a millenia.

Click me!

library(tidyverse)
library(ggimage)
library(ggpp)
library(readxl)

# First we upload the data and create a date variable ####
sakura_data <- readxl::read_excel(paste0(getwd(),"/KyotoFullFlower7.xlsx"), skip = 25) |> # <1> #"/sessions_workshop/04-animate/",
  tidyr::drop_na() |>
  dplyr::rename("year"=1,"blossom"=2) |>
  dplyr::mutate(year = lubridate::make_date(year))
  
# Now we create a plot for only the XX century.  ####
sakura_xx <- tibble(x=lubridate::as_date("1800-01-01"), y=60,
                    # I save the plot as a list
                    plot = list(sakura_data |>
                                  dplyr::filter(year>1900) |>
                                  ggplot(aes(x=year, y=blossom))+
                                  geom_point()+
                                  geom_smooth(method="lm", color="red")+
                                  ylim(c(85,110))+
                                  scale_x_date(date_breaks = "5 years", date_labels = "%Y")+
                                  labs(y=NULL, x=NULL)+
                                  theme_classic()+
                                  theme(axis.text.x = element_text(angle=90),
                                        text = element_text(size=7))
                                )
                    )

# Now I create the plot ####
sakura <- sakura_data |> 
  dplyr::mutate(sakura = paste0(getwd(),"/sakura.jpg")) |> #"/sessions_workshop/04-animate/",
  ggplot(aes(x=year, y=blossom))+ 
  geom_plot(data = sakura_xx, aes(x,y,label=plot))+
  annotate(geom="rect", xmin=lubridate::as_date("1900-01-01"), xmax=lubridate::as_date("2022-01-01"),
           ymin=82,ymax=110,
           linetype="dotted", fill="grey80", color="black")+
  annotate(geom="rect", xmin=lubridate::as_date("1300-01-01"), xmax=lubridate::as_date("1800-01-01"),
           ymin=60,ymax=90,
           linetype="dotted", fill=NULL, color="black")+
  geom_image(aes(image=sakura), size=0.02)+
  geom_smooth(method="loess", color="pink")+
  ylim(c(60,130))+
  scale_x_date(date_breaks = "50 years", date_labels = "%Y")+
  labs(y="Days to blossom from January 1st",
       x=NULL,
       caption = "Source: NOAA paleo phenology data and \nHistorical Series of Phenological data for Cherry Tree Flowering at Kyoto City\n by Aono and Kazui, 2008; Aono and Saito, 2010.")+
  theme_classic()+
  theme(axis.text.x = element_text(angle=90))

sakura

2: Drop years without dates
3: Rename variables
4: Transform variable to a date using make_date() function from lubridate package
5: We first create a tibble with the coordinates where we want our inside plot to be, and a ggplot object saved as a list (all ggplot objects are lists!). We customise the plot, as always
6: To use ggimages we need to create a variable with the location of the image file in our directory. The image can vary per observation.
7: We use the geom_plot() geometry to embed a plot within a plot, using the label aesthetic to place the plot and data from the tibble we just created.
8: We create some annotations to box the plot within the other plot
9: We change the shape of the point for an image stored locally or in a URL using the geom_image() geometry and the image aesthetic.
10: We add a LOESS curve and some additional features

A thousand years of cherry tree blossoms in Kyoto

Handle dates with lubridate

Dates are notoriously annoying to handle, but the lubridate package makes it way easier.

Sakura animated

To introduce animation with gganimate package, let’s drop the inference from the smooth curve and add two lines of code to the ggplot: transition_time(), which will create one frame per year in the data, and shadow_mark(), which will keep old values once they are shown.

What do these functions mean? How do they work? How do you export the animations?

Click me!

library(tidyverse)
library(lubridate)
library(ggimage)
library(gganimate)

sakura_anim <- sakura_data |> 
  dplyr::mutate(sakura = paste0(getwd(), "/sakura.jpg")) |> #"/sessions/04-animate",
  dplyr::mutate(y=lubridate::year(year)) |>
  ggplot(aes(x=year, y=blossom))+
  geom_image(aes(image=sakura), size=0.02)+
  ylim(c(80,130))+
  scale_x_date(date_breaks = "10 years", date_labels = "%Y")+
  labs(y="Days to blossom from January 1st",
       x=NULL,
       subtitle = "Year {frame_time}")+
  theme_classic()+ 
  theme(axis.text.x = element_blank())+
  gganimate::transition_time(as.integer(y))+
  gganimate::shadow_mark()

11: We create a variable that indicates where the image we plan to use as a point is stored
12: Create a separate year variable for the animation using the year() function
13: We create out ggplot
14: We use the {frame_time} option within the brackets so the animation saves the year for each frame
15: We define the transition_time() as an integer of years, which will be our animation variable. as.integer() makes sure the dates don’t include decimals.
16: We add shadow_mark() so old points are displayed as well.

Always check the package vignettes!

Check out the gganimate references to learn what options best suit your project

PART II: COVID-19 & C02 emissions

Now let’s turn to another topics. First we’ll create an animation at what has happened over the past few years with the COVID-19 pandemic, then we’ll turn to pollution.

COVID-19 data from OWID

First, we’re going to use the data from the COVID-19 repository from Our World in Data available in GitHub.

Click me!

library(tidyverse)

owid_covid <- readr::read_csv("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv")

COVID-19, a grim movie

Now we’re going to paint a grim picture: The evolution of COVID-19 related deaths per million inhabitants, in each of our countries, in 2020 and 2021.

We will first create ggplot objects for each year using a simple loop and store them in a list. Then we’ll add a few simple gganimate parameters and voila, animation done. The key parameter is the one that defines the transition between different moments transition_time() where we add our time variable. The rest is just decoration to smooth the change between one state and the next one, showing each new state while keeping the old ones, and define how each state is introduced.

Click me!

library(tidyverse)
library(RColorBrewer)
library(lubridate) # Useful for handling dates
library(gganimate) # Animate ggplot objects

# Create empty lists ####
covid_plot <- list()
covid_animation <- list()

# Loop ####
for(i in c(2020:2021)) {

 # Step 1:  Create plot for each year 
 covid_plot[[paste0("year_",i)]] <- owid_covid %>%
  # Format date
  dplyr::mutate(date = lubridate::as_date(date),
                n = row_number(),
                year = lubridate::year(date)) |>
  # Filter countries using the %in% operator
  dplyr::filter(iso_code %in% c("COL","ITA","FRA","USA"), year %in% i) |>
  ggplot(aes(x=date,y=total_deaths_per_million,color=location))+
  geom_point(alpha=0.7)+
  labs(title = i,
       # Notice the bracket term, which will show the frame rate for the animation we define below
       subtitle = "Month: {frame_time}",
       x=NULL,
       y=NULL)+
  scale_color_manual(values=c("gold2","navy","limegreen","red4"))+
  ylim(c(0,3000))+
  theme_classic()+
  guides(fill=guide_legend(ncol=1))+
  theme(legend.position = "top",
        legend.title = element_blank(),
        legend.text = element_text(size = 5),
        axis.text.x = element_text(angle = 90,size = 5),
        axis.text.y = element_text(size = 5)
        )
   
# Step 2: Animate plot for each year
 covid_animation[[paste0("year_",i)]] <- covid_plot[[paste0("year_",i)]]  +
  # Add the animation
  gganimate::transition_time(as.integer(lubridate::month(date)))+
  # Add shadow to show previous points 
  gganimate::shadow_mark()+
  # Enter options for each new state/time
  gganimate::enter_fade()
}

Even though Italy was the locus for the spread of the virus in Europe, the speed of deaths tapered off before the end of the first year of the pandemic. Same for France. In Colombia the deaths grew very quickly, exponentially, and ended up being higher than the two European countries.

CO2 emissions

We find the repository from the Our World in Data and import the data.

Now we use the package treemapify to build a treemap of the distribution of emissions through time using the geom_treemap() geometry. We place the the inferno option for the fill in the viridis package and done.

Click me!

library(treemapify)
library(tidyverse)
library(janitor)
library(gganimate)
library(viridis)

owid_co2 <- readr::read_csv("https://raw.githubusercontent.com/owid/owid-datasets/master/datasets/Annual%20share%20of%20CO2%20emissions%20(OWID%20based%20on%20GCP%2C%202017)/Annual%20share%20of%20CO2%20emissions%20(OWID%20based%20on%20GCP%2C%202017).csv") |>
  janitor::clean_names() |>
  dplyr::rename(co2=3)

co2_anim <- owid_co2 |>
  dplyr::filter(year>1899) |>
  ggplot(aes(area = co2,fill=co2, label=entity)) +
  treemapify::geom_treemap()+
  treemapify::geom_treemap_text(
    fontface = "italic", 
    colour = "white", 
    place = "centre",
    grow = TRUE)+
  scale_fill_viridis_c(option = "inferno", name = "CO2 emissions (% total)")+
  labs(title = "{closest_state}", caption = "Source: OWID based on GPC 2017")+
  theme(legend.position = "top")+
  gganimate::transition_states(as.integer(year))+
  ease_aes("linear")

Figure 3: C02 Emissions in the XX century

Bonus track: Correlation, causation –and why data never speaks for itself

🛠️ Section under construction 🛠

“While correlation does not imply causation, causation does imply correlation” wrote Daniel Kahneman in his latest book Noise. Mr Kahneman, I’m afraid you are wrong. By a mile. Let’s see why.

We only need to prove that a causal relationship might not reflect a correlation. This is surprisingly easy to do, conceptually and statistically. Let’s first see a simple simulation and then look at a nice example from Scott Cunningham’s Causal Inference: The Mixtape”.

The simple setup below shows two random variables \(X\) and \(Y\) that represent 10 000 independent tosses of a fair coin. Now let’s build another variable \(Z\) that takes the value of 1 when \(X\) and \(Y\) both land the same result, heads of tails. This means that \(Z\) is, by definition, causally dependent on both \(X\) and \(Y\), as when both variables are equal, \(Z\) is “turned on”.

Click me!

X <- rbinom(seq(1:1e4),1,0.5)
Y <- rbinom(seq(1:1e4),1,0.5)
Z <- ifelse(X==Y,1,0)

Now what do you think is the correlation between \(X\) and \(Z\)? and what about the correlation between \(Y\) and \(Z\)? Let’s calculate them both:

correlations	values
Between X and Y	-0.01477349
Between X and Z	-0.00201390
Between Y and Z	0.01256885

No, causation does not imply correlation

How the hell are these correlations bordering 0 if these variables are actually causally determined? Aren’t these correlations supposed to be high, close to 1 or -1? What is going on?

Run the code again. Make the samples bigger. This is not a fluke.

Well, it turns out that this is a case of a collider, and one of the many reasons why you cannot ignore causal inference theory.

‘You like DAGs?’ - Mickey, Snatch (2000)

Animate to understand: Causal inference

🛠️ Under construction 🛠

Click me!

library(dagitty)
library(ggdag)

dagify(
  W ~ Y,
  W ~ Z,
  coords = list(x = c(Y = 2, Z = 1, W = 1.5),
                y = c(Y = 1, Z = 1, W = 2))
) |>
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_edges() +
  geom_dag_point(color = "grey80", size = 10) +
  geom_dag_text() +
  theme_void()

A spurious correlation is born: Collider bias

Click me!

library(tidyverse)
library(gganimate)
library(ggthemes)

# Create collider ####
collider_data <- tibble(X = rnorm(200)+1,
                        Y = rnorm(200)+1,
                        time="1") %>%
  dplyr::mutate(C = ifelse((X + Y + rnorm(200)/2)>2,1,0)) %>%
  dplyr::group_by(C) %>%
  dplyr::mutate(mean_X=mean(X),mean_Y=mean(Y)) %>%
  dplyr::ungroup() 
  #dplyr::mutate(Y_hat = predict(lm(Y~X,data=collider_data)))

cor(collider_data$X,collider_data$Y)

[1] 0.003829323

Click me!

# Calculate correlations ####
collider_before_cor <- paste("1. Start with raw data, ignoring C. Correlation between X and Y: ",round(cor(collider_data$X,collider_data$Y),3),sep='')

collider_after_cor <- paste("7. Analyze what's left! Correlation between X and Y controlling for C: ",round(cor(collider_data$X-collider_data$mean_X,collider_data$Y-collider_data$mean_Y),3),sep='')

# Add animation steps: 2 (demean X), 3 (demean X and Y) and 4 (label)
collider_full <- dplyr::bind_rows(
  #Step 1: Raw data only
  collider_data %>% 
    dplyr::mutate(mean_X=NA,mean_Y=NA,C=0,time=collider_before_cor),
  #Step 2: Raw data only
  collider_data %>% 
    dplyr::mutate(mean_X=NA,mean_Y=NA,time='2. Separate data by the values of C.'),
  #Step 3: Add x-lines
  collider_data %>% 
    dplyr::mutate(mean_Y=NA,time='3. Figure out what differences in X are explained by C'),
  #Step 4: X de-meaned 
  collider_data %>% 
    dplyr::mutate(X = X - mean_X,mean_X=0,mean_Y=NA,time="4. Remove differences in X explained by C"),
  #Step 5: Remove X lines, add Y
  collider_data %>% 
    dplyr::mutate(X = X - mean_X,mean_X=NA,time="5. Figure out what differences in Y are explained by C"),
  #Step 6: Y de-meaned
  collider_data %>% 
    dplyr::mutate(X = X - mean_X,Y = Y - mean_Y,mean_X=NA,mean_Y=0,time="6. Remove differences in Y explained by C"),
  #Step 7: Raw demeaned data only
  collider_data %>% 
    dplyr::mutate(X = X - mean_X,Y = Y - mean_Y,mean_X=NA,mean_Y=NA,time=collider_after_cor)) |>
  dplyr::mutate(C = factor(C))

collider_anim <- collider_full |>
  ggplot(aes(y=Y,x=X,color=C))+
  geom_point(aes(shape=C), size=6)+
  geom_vline(aes(xintercept=mean_X))+
  geom_hline(aes(yintercept=mean_Y))+
  guides(color=guide_legend(title="C"))+
  scale_color_brewer(palette="Set1")+
  scale_shape_manual(values = c(18,19))+
  labs(title = 'Inventing a Correlation Between X and Y by Controlling for Collider C \n{next_state}')+
  theme(legend.position = "none")+
  gganimate::transition_states(time,
                    transition_length=c(1,12,32,12,32,12,12),
                    state_length=c(500,500,500,500,500,500,500),
                    wrap=FALSE)+
  gganimate::ease_aes('sine-in-out')+
  gganimate::exit_fade()+
  gganimate::enter_fade()

Figure 4: Collider bias, or how a spurious correlations is born

🏗 Practice 4: Make it move

Easy

Make an animation of any data we’ve used so far
Save it as a .gif and share it with the group

Intermediate

Choose one causal graph from Nick Huntington Klein’s GitHub page and replicate it.
Describe it

Citation

BibTeX citation:

@online{amaya2022,
  author = {Amaya, Nelson},
  title = {Making Data Move 🚶},
  date = {2022-07-31},
  url = {https://r4dev.netlify.app/sessions_workshop/04-animate/04-animate},
  langid = {en}
}

For attribution, please cite this work as:

Amaya, Nelson. 2022. “Making Data Move 🚶.” July 31, 2022. https://r4dev.netlify.app/sessions_workshop/04-animate/04-animate.