How to Use R to Scrape Tweets: Super Tuesday 2016

Super Tuesday 2016 has come and gone, and while we have most of the election results, what was the American public saying on Twitter?

The twitteR package for R lets you pull tweets from Twitter’s API and use them for sentiment analysis. The Plotly chart below shows what the Twitter-verse was saying about the candidates as last night’s poll results came in.

A basic tutorial is below, but if you want to see my full code including the charts, you can check it out on my GitHub page. For the chart below, I scraped 20,000 tweets that mentioned each candidate and then ran them through a dictionary of positive and negative words to calculate the sentiment score, excluding neutral comments.
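The multi-candidate scrape can be sketched roughly like this. The handles and the helper function here are my own illustration, not the exact code (which is on GitHub), and the sketch assumes the twitteR package is loaded and OAuth has been set up as described below:

```r
# Rough sketch of the per-candidate scrape; assumes setup_twitter_oauth()
# has already been run and the twitteR package is loaded.
candidates <- c('@BernieSanders', '@HillaryClinton',
                '@realDonaldTrump', '@tedcruz', '@marcorubio')

scrape_feed <- function(handle, n = 20000) {
  tweets <- searchTwitter(handle, n = n)      # hit Twitter's Search API
  sapply(tweets, function(t) t$getText())     # keep just the tweet text
}

# feeds <- lapply(candidates, scrape_feed)    # not run here: needs API keys
```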

[Plotly chart: Super Tuesday 2016 sentiment scores by candidate]

Basic Method for Using twitteR

First, you need a Twitter account with your phone number attached. After you’ve got that, go to Twitter’s apps page and create an application. Add the name and description of your app along with a website; the website can be a placeholder. There’s also a field for a callback URL, but that’s optional.

Sentiment Analysis with R

Grab your API keys and access tokens from Twitter; you’ll need them for the R script.

library(twitteR)
library(ROAuth)
library(httr)

# Set API Keys
api_key <- "xxxxxxxxxxxxxxxxxxxx"
api_secret <- "xxxxxxxxxxxxxxxxxxxx"
access_token <- "xxxxxxxxxxxxxxxxxxxx"
access_token_secret <- "xxxxxxxxxxxxxxxxxxxx"
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)

By now you should be in. Time to grab some data.

# Grab latest tweets
tweets_sanders <- searchTwitter('@BernieSanders', n=1500)

# Loop over tweets and extract text
library(plyr)
feed_sanders = laply(tweets_sanders, function(t) t$getText())

Now you’ve got a bunch of text data for Bernie Sanders, so how do we decide what’s a “good” tweet and a “bad” tweet? This is where I turned to the Hu and Liu Opinion Lexicon, a list of 6800 positive and negative words compiled by Bing Liu and Minqing Hu of the University of Illinois at Chicago.

Unpack the Opinion Lexicon into your working directory and you should be ready to roll.

# Read in dictionary of positive and negative words
yay = scan('opinion-lexicon-English/positive-words.txt',
                  what='character', comment.char=';')
boo = scan('opinion-lexicon-English/negative-words.txt',
                  what='character', comment.char=';')
# Add a few twitter-specific negative phrases
bad_text = c(boo, 'wtf', 'epicfail', 'douchebag')
good_text = c(yay, 'upgrade', ':)', '#iVoted', 'voted')

Now you’ve got your list of tweets and your list of opinionated words. The next step is to score each tweet by how many of the “good” and “bad” words show up in it.
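The core idea can be seen in miniature before we get to the full function: split a cleaned-up tweet into words, match() them against each dictionary, and subtract the negative hits from the positive ones. The toy word lists below are just for illustration:

```r
# Toy example of the match-based scoring: +1 per positive word, -1 per negative
words <- c('i', 'love', 'this', 'terrible', 'awful', 'debate')
good  <- c('love', 'great', 'win')
bad   <- c('terrible', 'awful', 'lose')

# match() returns a position or NA; !is.na() turns that into TRUE/FALSE,
# and sum() treats TRUE/FALSE as 1/0
score <- sum(!is.na(match(words, good))) - sum(!is.na(match(words, bad)))
score  # 1 positive hit - 2 negative hits = -1
```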

For this we’ll need a giant R function full of gsub() and match() calls. Thanks to Jeff Breen for the function this is based on.

score.sentiment = function(sentences, good_text, bad_text, .progress='none')
{
    require(plyr)
    require(stringr)
    # we got a vector of sentences. plyr will handle a list
    # or a vector as an "l" for us
    # we want a simple array of scores back, so we use
    # "l" + "a" + "ply" = "laply":
    scores = laply(sentences, function(sentence, good_text, bad_text) {
        
        # clean up sentences with R's regex-driven global substitute, gsub():
        sentence = gsub('[[:punct:]]', '', sentence)
        sentence = gsub('[[:cntrl:]]', '', sentence)
        sentence = gsub('\\d+', '', sentence)
        # remove emojis and other non-ASCII characters
        # (sub='' drops them instead of turning the whole tweet into NA)
        sentence <- iconv(sentence, 'UTF-8', 'ASCII', sub='')
        sentence = tolower(sentence)        
        # split into words. str_split is in the stringr package
        word.list = str_split(sentence, '\\s+')
        # sometimes a list() is one level of hierarchy too much
        words = unlist(word.list)
        
        # compare our words to the dictionaries of positive & negative terms
        pos.matches = match(words, good_text)
        neg.matches = match(words, bad_text)
        
        # match() returns the position of the matched term or NA
        # we just want a TRUE/FALSE:
        pos.matches = !is.na(pos.matches)
        neg.matches = !is.na(neg.matches)
        
        # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
        score = sum(pos.matches) - sum(neg.matches)
        
        return(score)
    }, good_text, bad_text, .progress=.progress )
    
    scores.df = data.frame(score=scores, text=sentences)
    return(scores.df)
}

The good news about this obnoxiously long function is that it spits out a nice data frame that can be manipulated very easily.

# Call the function and return a data frame
feelthabern <- score.sentiment(feed_sanders, good_text, bad_text, .progress='text')
feelthabern$name <- 'Sanders'
# Cut the text, just gets in the way
plotdat <- feelthabern[c("name", "score")]
# Remove neutral values of 0
plotdat <- plotdat[!plotdat$score == 0, ]

# Nice little quick plot (qplot comes from ggplot2)
library(ggplot2)
qplot(factor(score), data=plotdat, geom="bar", 
      fill=factor(name),
      xlab = "Sentiment Score")
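With score.sentiment() run for each candidate’s feed and the data frames stacked, comparing candidates by mean sentiment is a one-liner with aggregate(). The scores below are made up for illustration; only the Sanders feed was scored in the tutorial above:

```r
# Hypothetical stacked results for two candidates (toy scores)
toy_scores <- data.frame(
  name  = c('Sanders', 'Sanders', 'Clinton', 'Clinton'),
  score = c(2, -1, 1, 1)
)

# Mean sentiment score per candidate
aggregate(score ~ name, data = toy_scores, FUN = mean)
#      name score
# 1 Clinton   1.0
# 2 Sanders   0.5
```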
