I recently worked on data for a story with Charlie Warzel at BuzzFeed about media coverage of Facebook over time. I am really interested in the decline in public opinion of Big Tech, and I wanted to quantify it.
Getting Data from the NYT API
I started by using the New York Times API to pull articles about Facebook. It is a really fun, useful API that provides access to archived articles, best-seller lists, movie reviews, and comments, among other things. Pulling the data was time-intensive because I was making a lot of requests to the API and often had to wait a few hours between requests so as not to be denied access.
I used the jsonlite package in R to access the API. It was pretty straightforward technically, and I followed this great tutorial, which was tremendously helpful. You can check out my code on GitHub if you’re interested.
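For readers who want to try something similar, here is a minimal sketch of paging through the Article Search endpoint with jsonlite. It is not the code from the repository: the helper names (build_nyt_url, fetch_page), the NYT_API_KEY environment variable, the number of pages, and the pause length are all assumptions made for illustration, so check the current API documentation for the real rate limits before running it.

```r
library(jsonlite)

# Hypothetical helper: build an Article Search URL for one page of results.
build_nyt_url <- function(query, page, api_key) {
  paste0(
    "https://api.nytimes.com/svc/search/v2/articlesearch.json",
    "?q=", utils::URLencode(query, reserved = TRUE),
    "&page=", page,
    "&api-key=", api_key
  )
}

# Hypothetical helper: fetch one page and return the articles as a data frame.
fetch_page <- function(query, page, api_key) {
  res <- fromJSON(build_nyt_url(query, page, api_key), flatten = TRUE)
  res$response$docs
}

# Assumption: the key is stored in an environment variable of this name.
api_key <- Sys.getenv("NYT_API_KEY")

if (nzchar(api_key)) {
  pages <- lapply(0:2, function(p) {
    Sys.sleep(12)  # assumed pause between calls; tune to the published limits
    fetch_page("facebook", p, api_key)
  })
  articles <- rbind_pages(pages)  # stack the per-page data frames
}
```

Nothing is requested unless a key is set, so the script is safe to source while experimenting with the URL builder.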
It feels like public sentiment towards Big Tech has soured in the past 2 years or so, accelerated by the Cambridge Analytica scandal, revelations of Russian interference in the 2016 election and the increasingly prevalent view that social media is bad for mental health.
I decided to use sentiment analysis to see if I could detect a negative trend over time in the Times’ coverage of Facebook. I expected the early coverage (2004 – 2008ish) to be relatively glowing (startups! college! growth!). Then I assumed there would be some less flattering coverage as the company prepared for the 2012 IPO; I remember seeing lots of coverage questioning the valuation and whether it would ever be profitable. Finally, I expected coverage over the last 2 years to be predominantly negative.
I was excited to see that the data reflected my intuition. Here’s the monthly sentiment:
Sentiment Analysis Process
The actual sentiment analysis techniques I used are really straightforward. Using the tidytext library, I load a “sentiment dictionary,” which is a list of words manually rated by their sentiment. In the dictionary I used, called “afinn”, the words are rated from -5 to +5, with more positive words closer to +5 and more negative words closer to -5.
From there, I “tokenize” the summaries of the NYT articles, meaning I break them up into individual words. This can be done in one line of code using the tidytext package.
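As a rough illustration of that tokenizing step, the sketch below runs tidytext's unnest_tokens() on two made-up summaries. The articles data frame and its column names are assumptions for the example, not the real dataset.

```r
library(dplyr)
library(tidytext)

# Made-up stand-in for the article summaries (not real NYT data).
articles <- tibble(
  id = 1:2,
  summary = c(
    "Facebook wins praise for strong growth.",
    "Facebook faces a privacy scandal."
  )
)

# One row per word; unnest_tokens() lowercases and strips punctuation by default.
tokens <- articles %>%
  unnest_tokens(word, summary)

print(tokens)
```

Each row of the result keeps the article id next to a single word, which is the shape the later join needs.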
After the summaries are tokenized, I perform an inner join with the afinn sentiment dictionary, thus assigning sentiment values to the words in my dataset.
From there, I just average the sentiments in each article and am left with a rough proxy for the article’s overall sentiment. I can group by day, week, month, year etc. to get a picture of the trends over time.
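The join and averaging steps might look something like the sketch below. To keep it runnable without downloading anything, it uses a tiny hand-made lexicon with invented scores in place of the real dictionary; with tidytext you would normally load the real one via get_sentiments("afinn"), which in recent versions returns the columns word and value. The articles, dates, and column names here are assumptions for illustration.

```r
library(dplyr)
library(tidytext)

# Made-up articles (not real NYT data).
articles <- tibble(
  id = 1:3,
  pub_date = as.Date(c("2018-03-05", "2018-03-20", "2018-04-02")),
  summary = c(
    "Facebook wins praise for strong growth.",
    "Users are angry about the scandal.",
    "Scandal deepens despite strong growth."
  )
)

# Stand-in lexicon with invented scores; swap in get_sentiments("afinn").
mini_lexicon <- tibble(
  word  = c("praise", "strong", "growth", "angry", "scandal"),
  value = c(3, 2, 2, -3, -3)
)

# Tokenize, keep only words found in the lexicon, average per article.
article_sentiment <- articles %>%
  unnest_tokens(word, summary) %>%
  inner_join(mini_lexicon, by = "word") %>%
  group_by(id, pub_date) %>%
  summarise(sentiment = mean(value), .groups = "drop")

# Roll the article scores up to a monthly average.
monthly <- article_sentiment %>%
  mutate(month = format(pub_date, "%Y-%m")) %>%
  group_by(month) %>%
  summarise(sentiment = mean(sentiment), .groups = "drop")

print(monthly)
```

Note that the inner join silently drops every word that is not in the lexicon, so an article with no rated words disappears from the result entirely. That is one reason this is only a rough proxy.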
I know there are machine learning approaches to sentiment analysis which can be more accurate; I’d be curious to hear if you’ve experimented with them and what your thoughts on them are.
Bigrams and Other Textual Analysis
Beyond sentiment, I wanted to see what some of the themes in the news coverage of Facebook were. It was fascinating to see media coverage evolve from treating FB as a tech company to treating it almost as a sovereign political entity. I used text mining techniques like bigram visualization and tf-idf, which you can learn about at the wonderful website https://tidytextmining.com.
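To give a flavor of those two techniques, here is a small sketch that splits made-up summaries into bigrams, counts them by year, and scores them with tf-idf, treating each year as a "document". The data and the choice of year as the grouping are assumptions for the example; the actual analysis may have been set up differently.

```r
library(dplyr)
library(tidytext)

# Made-up summaries tagged with a year (not real NYT data).
articles <- tibble(
  year = c(2008, 2018),
  summary = c(
    "College students join the social network",
    "Cambridge Analytica scandal hits the social network"
  )
)

# Split into overlapping two-word phrases and count them within each year.
bigram_counts <- articles %>%
  unnest_tokens(bigram, summary, token = "ngrams", n = 2) %>%
  count(year, bigram, sort = TRUE)

# tf-idf up-weights bigrams that are distinctive to a single year.
bigram_tf_idf <- bigram_counts %>%
  bind_tf_idf(bigram, year, n) %>%
  arrange(desc(tf_idf))

print(bigram_tf_idf)
```

A bigram that shows up in every year, such as "social network" here, gets a tf-idf of zero, while one tied to a single year, such as "cambridge analytica", floats to the top.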
Here are some of the charts I made. You can check out the code I used in this analysis on GitHub, and please reach out to me if you have any questions. I’m available by email at joe @ this website, or on Twitter @jhovde2121.
Bigrams, by year
Bigram Graph
I really like how reading the above graph feels like reading a brief history of America over the past 10 years. Pretty remarkable that a company founded by college students in 2004 could have such a massive impact on the world.
Thanks for reading!