March 28, 2018
Text Analysis of Tim Ferriss Podcast Episodes Using R
I recently came across a great respository of transcripts from Tim Ferriss’ podcast, courtesy of transcripts.io. I don’t actually listen to Ferriss’ podcast (too heavily monetized for my taste), but I know that many do and he gets great guests. I thought I’d analyze the text from the transcripts a bit and ended up using Term Frequency/Inverse Document Frequencys to identify the main topics of each episode.
I used the great Rvest package to scrape the data from the web, and then used tidytext to analyze it. I won’t go into depth on these but there are great resources out there to learn about them; I recommend this post from Justin Law and Jordan Rosenblum for Rvest and this great site from Julia Silge and David Robinson for tidytext.
This is a good introductory web scraping / text analysis project, and is especially easy to start out with because the transcript repository is structured very simply and is straightforward to scrape.
If you have any questions or would like the original data, let me know via email (joe at this website) or twitter.
R Packages Used
Load the packages we’ll use:
library(rvest)
## Loading required package: xml2
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidytext)
library(ggplot2)
library(stringr)
library(tidyr)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(babynames)
library(readr)
##
## Attaching package: 'readr'
## The following object is masked from 'package:rvest':
##
## guess_encoding
Using http://transcripts.io for the transcripts, thanks very much for putting that together!
We’ll use the rvest package to pull the data into R. I use the SelectorGadget chrome extension to find css elements which is incredibly useful.
Scraping the Data
# save the url; this is the page with links to each episode
url <- "https://www.transcripts.io/transcripts/tim_ferriss_show/"
# read the html of the url
tran <- read_html(url)
# start html session
session <- html_session(url)
# all the episode names. Found ".post-link" using the SelectorGadget chrome extension
episode_names <- tran %>%
html_nodes('.post-link') %>%
html_text()
Now that we have a list of the episode links, I loop through each episode and extract the text and dates, and create a dataframe
# new empty data frame which we will then bind all the episode information to iteratively
ep_df <- data_frame()
# looping through each episode
## Not run:
for (i in episode_names){
ep_name <- i # setting the episode name to the link title
ep_date <- session %>% # pulling the date
follow_link(i) %>% # following each episode link
html_nodes("time") %>% # again found the node using SelectorGadget
html_text()
ep_text <- session %>% # pulling actual transcript text
follow_link(i) %>%
html_nodes("p") %>%
html_text() %>%
data_frame() %>% # making this a dataframe, mutate() to ad the other variables
mutate(ep_name = ep_name,
ep_date = ep_date)
ep_df <- rbind(ep_df, ep_text) # binding them all together
Sys.sleep(3) # Sys.sleep(3) pauses the loop for 3s so as not to overwhelm website's server
}
Cleaning the data
Now have a dataframe of all of the text from these episodes. It’s messy, so first will remove some of the stuff that we don’t want– “tweet” and stuff like that which snuck into the transcription
#rename first column, this is annoying because it's called "."
ep_df$text <- ep_df$.
ep_df <- ep_df %>%
select(2:4)
# trim white spice from text, removes lurking spaces at the end of
ep_df$text <- trimws(ep_df$text, which = c("both"))
# remove "Tweet", "Listen", "Link", etc. which snuck into the episode text
ep_df <- ep_df %>%
filter(text != "Listen") %>%
filter(text != "Tweet") %>%
filter(text != "Link") %>%
filter(text != "[applause]")
head(ep_df)
Analysis using tidytext
Now that we have a reasonably clean dataframe, going to tokenize and look at common words etc
# tokenized words, using tidytext package. This breaks the dataframe up by word
ep_words <- ep_df %>%
unnest_tokens(word, text)
head(ep_words)
Calculating term frequnecy/inverse document frequencies (tf-idfs). This is a metric often used in search engines to determine a word’s uniqueness / importance to a body of text
# calculating the tf-idf for each word, each year
tfs <- ep_words %>%
count(ep_name, word, sort = TRUE) %>%
ungroup() %>%
bind_tf_idf(word, ep_name, n)
# removing names from the dataset, because they generally have the highest tf-idf score for a given episode. Thanks Hadley Wickham for the babynames package!
names <- babynames %>%
filter(year > 1980) %>% # taking names from the past 40ish years
rename(word = name) %>% # renaming the column to "word" to make the anti-join simple
mutate(word = tolower(word)) # making the names lowercase
topics <- tfs %>%
anti_join(names, by = "word") %>% # removingt he names
arrange(-tf_idf)
# removing certain unusual names that don't appear in the babynames dataset.
topics <- topics %>%
filter(word != "tripleh") %>%
filter(word != "naval") %>%
filter(word != "jocko")
# writing a csv to save this locally write.csv(topics, "tf_topics.csv")
We have a list of top terms by tf_idf for each episode, and can now visualize these
Showing highest TF-IDF by episode, this is a useful way of doing topic analysis.
I bring in the data here from local sources so as to not have the above code run in the demo.
# I cleaned up the episode titles by hand a little in google sheets to make them clearer because they weren't all in the same format, easy because there were only ~20 episodes
topics <- read_csv("~/RProjects/tf_topics.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_integer(),
## ep_name = col_character(),
## word = col_character(),
## n = col_integer(),
## tf = col_double(),
## idf = col_double(),
## tf_idf = col_double(),
## name = col_character()
## )
tf_ep_titles_cleaned <- read_csv("~/RProjects/tf_ep_titles_cleaned.csv")
## Parsed with column specification:
## cols(
## name = col_character(),
## ep_name = col_character()
## )
# joining the topics dataset with the cleaned up names
topics <- topics %>%
inner_join(tf_ep_titles_cleaned)
## Joining, by = c("ep_name", "name")
head(topics)
## # A tibble: 6 x 8
## X1 ep_name word n tf idf tf_idf name
## <int> <chr> <chr> <int> <dbl> <dbl> <dbl> <chr>
## 1 1 #244: The Quiet Master… bitco… 149 0.00527 2.22 0.0117 Naval …
## 2 2 Ep 57: Pavel Tsatsouli… kettl… 8 0.00240 3.61 0.00866 Pavel …
## 3 3 Ep 65: Supplements, Bl… gluco… 39 0.00339 2.22 0.00754 Dr. Pe…
## 4 4 Ep 65: Supplements, Bl… vo2 24 0.00209 3.61 0.00753 Dr. Pe…
## 5 5 Ep 65: Supplements, Bl… insul… 34 0.00296 2.51 0.00743 Dr. Pe…
## 6 6 #187: Jocko Willink on… leade… 29 0.00310 2.22 0.00689 Jocko …
Plotting using ggplot2
Plotting the tf-idfs
# now, we can group the episodes by name and word, and find the mean tf_idf for each word, and plot them.
topics %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
group_by(name, word) %>%
summarise(freq = mean(tf_idf)) %>%
arrange(-freq) %>%
top_n(3) %>%
ungroup %>%
ggplot(aes(x = word, y = freq)) +
geom_col(show.legend = FALSE, fill = "navyblue") +
labs(x = NULL,
y = "Topic Frequency",
title = "Tim Ferris Podcast Episode Topics") +
facet_wrap(~name, ncol = 5, scales = "free") +
coord_flip() + # flip coordinates so that words are easily readable
scale_y_continuous(breaks = c()) # gets rid of the tf-idf values which aren't particularly meaningful
## Selecting by freq
Thanks for reading! Let me know if you would like the data or if you have any recommendations for how to extend this data or clean up the analysis.
// add bootstrap table styles to pandoc tables function bootstrapStylePandocTables() { $('tr.header').parent('thead').parent('table').addClass('table table-condensed'); } $(document).ready(function () { bootstrapStylePandocTables(); });