Sometimes you need to bring data straight from a website. Scrape IMDb movie rankings, U2 lyrics and untranslatable words.
Author
Nelson Amaya
Published
July 31, 2022
Modified
November 22, 2024
If it doesn’t exist on the internet, it doesn’t exist. –Kenneth Goldsmith
PART I: Pick it, grab it
Although you’ve learned to use APIs and dedicated packages to get information from online sources, sometimes you just want to take a table from Wikipedia, or some text from a webpage, and bring it into R.
To do this you need two tools:
The rvest package, which helps you read HTML pages and extract information from them (a quick install snippet follows this list).
The SelectorGadget extension for your browser, which will help you select the object you want to import.
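The one-time setup is minimal. Note that the Selector Gadget is a browser extension, so it is installed from your browser's web store, not from R:

```r
# Install and load rvest (also bundled with the tidyverse)
install.packages("rvest")
library(rvest)
```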
How does the Selector Gadget work? You click on elements of an HTML page until the objects you want are highlighted in green.
Imagine you want to extract text from the Home page of R4DEV, but you only want one paragraph. Clicking on elements until only what you want is highlighted in green, as shown below, tells you which element you’ll scrape. The selected element, p:nth-child(4), will appear in the gadget’s bar, and you can copy it to use in your code as shown below.
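As a minimal sketch of how the copied selector plugs into rvest (the URL below is a placeholder; swap in the actual Home page address):

```r
library(rvest)

# Placeholder URL for illustration: replace with the R4DEV Home page
"https://example.com/r4dev-home" |>
  rvest::read_html() |>
  # The selector copied from the Selector Gadget
  rvest::html_elements("p:nth-child(4)") |>
  # Extract the text of the selected paragraph
  rvest::html_text2()
```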
Top movies according to IMDb
IMDb has a list of the best ranked movies, which we’ll pull straight from the Wikipedia entry using rvest. With the Selector Gadget, we click on the objects of the webpage we want and bring the selectors into the code that pulls the information and transforms it into text.
We will extract the title of each movie, the year, the ranking and the country.
The recipe, step by step:
1. Pull the 4 chosen objects by separating their selectors with commas, inside the quotations.
2. Make the result into text.
3. Turn it into a tibble.
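A sketch of those steps in code. The URL and the CSS selectors below are placeholders for illustration; replace them with the Wikipedia entry you are scraping and the selectors you copied from the Selector Gadget:

```r
library(tidyverse)
library(rvest)

# Placeholder URL and selectors: replace with the real ones
imdb_raw <- "https://en.wikipedia.org/wiki/IMDb" |>
  rvest::read_html() |>
  # 1. Pull the 4 chosen objects, separating their selectors with commas
  rvest::html_elements(".title, .year, .ranking, .country") |>
  # 2. Make the result into text
  rvest::html_text2() |>
  # 3. Turn it into a tibble
  tibble::as_tibble()
```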
A little HTML is enough
Learn a little HTML; it will help you quickly identify the information you want to retrieve from a website. Read the Get started section of rvest.
Now we do a few tricks to turn this information into a database we can use, and build a searchable table with reactable, which should be your go-to package for interactive tables.
The wrangling, step by step:
1. Create a new variable that identifies each type of row, by repeating a vector across all observations.
2. Use pivot_wider() to make the rows into columns. Because we have lots of variables with the same values (years, for instance), we need another trick: values_fn = list turns each result into a list-column.
3. As if you were pulling down a curtain, unnest() unpacks that first row of lists into a database with all the results we want.
4. Change the score to a numeric value using across(), which is used to modify multiple columns at once.
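Put together, the wrangling might look like this. It assumes the scraped text sits in a one-column tibble called imdb_raw (as sketched above), with rows cycling through title, year, ranking and country:

```r
library(tidyverse)

imdb_movies <- imdb_raw |>
  # 1. Identify each type of row by repeating a vector across all observations
  dplyr::mutate(names = rep(c("title", "year", "ranking", "country"),
                            times = dplyr::n() / 4)) |>
  # 2. Make the rows into columns; values_fn = list stores repeats as list-columns
  tidyr::pivot_wider(names_from = names, values_from = value,
                     values_fn = list) |>
  # 3. Pull down the curtain: unnest the list-columns into a full database
  tidyr::unnest(cols = everything()) |>
  # 4. Turn the numeric columns into numbers with across()
  dplyr::mutate(dplyr::across(c(year, ranking), as.numeric))

# A searchable interactive table
reactable::reactable(imdb_movies, searchable = TRUE)
```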
Rejoinder: Feeling U2
Now that we know how to get data from websites, carry out text analysis and visualise, let’s put all three together for something concrete: a sentiment analysis of all U2 lyrics.
First, we get all the lyrics, which are collected on U2’s website. We go to the lyrics page and retrieve the URL of each song. There are over 240 lyrics, and the URLs are not sequential: they include large random numbers. So after extracting the element we want with html_elements(), we use html_attr() to retrieve the URL of each lyric. Then we pass all the URLs through a loop and save the results into an empty list, as we did before.
```r
library(tidyverse)
library(rvest)
library(tidytext)

# Extract all song lyrics URLs ####
u2_urls <- "https://www.u2.com/music/lyrics" |>
  # First step as before
  rvest::read_html() |>
  # We select the element we want, which is the link to every song
  rvest::html_elements(".lyricItem--link") |>
  # We extract the link to each song by picking an attribute, which for links is href
  rvest::html_attr("href") |>
  tibble::as_tibble() |>
  # We keep only one variable, which is the URL of each song
  dplyr::transmute(url = str_c("https://www.u2.com", value))

# Create an empty list and save all song lyrics using a loop ####
u2_lyrics <- list()

for (i in u2_urls$url) {
  u2_lyrics[[i]] <- rvest::read_html(i) |>
    # Extract what we want
    rvest::html_elements("p") |>
    # Make it into text using html_text2
    rvest::html_text2() |>
    as_tibble()
}
```
Second, we do a little wrangling to turn the list into a tibble, our favorite layout and format for rectangular datasets. We use imap() to name each element in the list, then map_df() to bind them all together. This leaves us with a dataset where each row holds the lyrics of one song.
Third, we tokenise by words, remove stopwords and join the sentiment lexicon, just as we did with books before, with a twist: we’ll use both words and n-grams as tokens. For words, we calculate the frequency for each value of the sentiment lexicon and feed it into ggplot; for flair, we add the most frequent word for each sentiment value at the top of the graph, plotting the frequency of all words in the lyrics. For n-grams, we simply count the expressions that appear most often across the entire set of lyrics.
```r
library(tidyverse)
library(tidytext)

u2_lyrics |>
  # Add a variable inside of every tibble in the list with the url, using imap
  purrr::imap(~mutate(.x, url = .y)) |>
  # Bind all lyrics into a data frame
  purrr::map_df(bind_rows) |>
  dplyr::filter(!str_detect(value, "No lyrics")) |>
  # Unnest words
  tidytext::unnest_tokens(input = "value", output = "text", token = "words") |>
  # Remove stopwords
  dplyr::anti_join(stopwords::stopwords("en") |> as_tibble(),
                   by = c("text" = "value")) |>
  # Join sentiment lexicon
  dplyr::inner_join(tidytext::get_sentiments("afinn"),
                    by = c("text" = "word")) |>
  # Calculate most frequent word by sentiment lexicon
  dplyr::group_by(text) |>
  dplyr::mutate(word_freq = n()) |>
  dplyr::ungroup() |>
  # Summarise ####
  dplyr::group_by(value) |>
  dplyr::mutate(n = n(),
                p = n / sum(n),
                c = case_when(value < 0 ~ "Negative",
                              value == 0 ~ "Neutral",
                              value > 0 ~ "Positive") |> factor()) |>
  # Feed to ggplot
  ggplot(aes(x = value, y = n, fill = c, color = c)) +
  geom_col() +
  # Notice how the data is filtered inside of the geom ####
  geom_text(data = . %>%
              dplyr::group_by(value) %>%
              dplyr::slice_max(n = 1, order_by = word_freq) %>%
              dplyr::distinct(text, .keep_all = TRUE),
            aes(y = 700000, label = text), angle = 60, size = 5) +
  geom_vline(xintercept = 0) +
  scale_x_continuous(breaks = seq(from = -5, to = 5, by = 1)) +
  scale_y_continuous(limits = c(0, 750000)) +
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1") +
  labs(title = "U2 lyrics: A feel-good band, mostly",
       subtitle = "Sentiment analysis based on 241 song lyrics.\nMost frequent words by sentiment level at the top",
       x = "Sentiment lexicon: AFINN",
       y = "Word frequency") +
  theme_void() +
  theme(legend.position = "none")
```
```r
library(tidyverse)
library(tidytext)
library(ggwordcloud)

u2_lyrics |>
  # Identify each list element with the url using imap
  purrr::imap(~mutate(.x, url = .y)) |>
  # Bind all lyrics into a data frame
  purrr::map_df(bind_rows) |>
  dplyr::filter(!str_detect(value, "No lyrics")) |>
  # Unnest into trigrams
  tidytext::unnest_tokens(input = "value", output = "ngram",
                          token = "ngrams", n = 3) |>
  dplyr::count(ngram, sort = TRUE) |>
  dplyr::distinct_all() |>
  dplyr::slice_max(order_by = n, n = 50) |>
  ggplot() +
  ggwordcloud::geom_text_wordcloud(aes(label = ngram, size = n, color = n)) +
  scale_size_area(max_size = 12) +
  scale_color_gradient(low = "pink", high = "red4") +
  theme_minimal()
```
More ggplot2 extensions
You now know how to work with ggplot2 and some extensions like ggridges or ggiraph, but there are many more you might want to check out. The gallery of over a hundred extensions is here; some of my favorites are listed below:
ggdist provides stats and geoms for visualizing distributions and uncertainty.
ggExtra lets you add marginal density plots or histograms to ggplot2 scatterplots, as in the sketch after this list.
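For instance, a minimal sketch of ggExtra in action, using a built-in dataset purely for illustration:

```r
library(ggplot2)
library(ggExtra)

# An example scatterplot on the built-in mtcars data
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme_minimal()

# Add marginal histograms to the scatterplot
ggExtra::ggMarginal(p, type = "histogram")
```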
Untranslatable words
The Eunoia site holds words and concepts in multiple languages that don’t translate. Getting all the words and definitions is not easy: each time you enter the site, a random set of words pops up. There is no easy way to download the full list of words, but we can still do it.
How many words are there, and how can we scrape them? As of late 2021 there were over 500; by 2023 the number had grown to over 700.
We’ll attack this task with computational brute force: we scrape the site over and over inside a while loop that keeps running until no new words turn up.
The code takes time to run, but it does the trick.
```r
library(rvest)
library(tidyverse)

# URL ####
eunoia_url <- "https://eunoia.world/"

# Step 1 - Create an empty list and an empty data frame to store the results ####
eunoia <- list()
eunoia_df <- tibble()

# Step 2 - Keep scraping the website until we have ~700 different words ####
# nrow() returns a plain number, which while() needs as a condition
while (nrow(eunoia_df) <= 690) {
  # i) Loop 100 scrapes of the website and save them into the list
  for (i in 1:100) {
    # A - Double brackets save each scrape as an element of the list
    eunoia[[i]] <- eunoia_url |>
      rvest::read_html() |>
      rvest::html_elements("td:nth-child(3) , td:nth-child(2) , td:nth-child(1)") |>
      rvest::html_text() |>
      tibble::as_tibble() |>
      dplyr::mutate(names = rep(c("word", "description", "language"),
                                times = n() / 3)) |>
      tidyr::pivot_wider(values_from = value, names_from = names,
                         values_fn = list) |>
      tidyr::unnest(cols = everything())

    # B - Bind all the results into a data frame and remove repeated words
    eunoia_df <- eunoia |>
      purrr::map_df(bind_rows) |>
      dplyr::distinct_all()
  }
}
```
Now let’s check all the downloaded untranslatable words in a reactable table, using a bootswatch theme, aggregating all words by language group with groupBy() and adding a bit of column styling.
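A rough sketch of such a table, assuming the scraped words ended up in eunoia_df with columns word, description and language, as built above:

```r
library(reactable)

reactable(
  eunoia_df,
  searchable = TRUE,
  # Aggregate all words by language groups
  groupBy = "language",
  # A bit of column styling
  columns = list(
    word = colDef(style = list(fontWeight = "bold")),
    description = colDef(minWidth = 250)
  )
)
```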
reactable tables can be very fancy
You can customise these tables a lot. Check out these examples and re-use any elements you like to style your tables.
Scrape any Wikipedia table and plot the result in R
Intermediate
Scrape the lyrics for Bob Dylan, Pearl Jam or another artist you can find online. Re-create the sentiment analysis workflow for their lyrics.
The U2 page with lyrics also shows every time each song has been played during a tour since the late 80s. Scrape the tour dates and visualise song popularity through time.