Patrick O. Perry, NYU Stern School of Business
We will use the following R packages.
library("LiblineaR")
library("Matrix")
library("nnet") # for multinom
library("tm")
To ensure consistent runs, we set the seed before performing any analysis.
set.seed(0)
We will look at tweets from the first GOP debate. These were collected and annotated by CrowdFlower.
tweet <- read.csv("GOP_REL_ONLY.csv")
tweet$text <- as.character(tweet$text) # don't treat 'text' as a factor
# IMPORTANT: we make 'Neutral' the reference class for the sentiment
tweet$sentiment <- relevel(tweet$sentiment, ref="Neutral")
summary(tweet)
candidate candidate.confidence relevant_yn relevant_yn.confidence
No candidate mentioned:7491 Min. :0.2222 yes:13871 Min. :0.3333
Donald Trump :2813 1st Qu.:0.6742 1st Qu.:1.0000
Jeb Bush : 705 Median :1.0000 Median :1.0000
Ted Cruz : 637 Mean :0.8557 Mean :0.9273
Ben Carson : 404 3rd Qu.:1.0000 3rd Qu.:1.0000
Mike Huckabee : 393 Max. :1.0000 Max. :1.0000
(Other) :1428
sentiment sentiment.confidence subject_matter
Neutral :3142 Min. :0.1860 None of the above :8148
Negative:8493 1st Qu.:0.6517 FOX News or Moderators :2900
Positive:2236 Median :0.6813 Religion : 407
Mean :0.7569 Foreign Policy : 366
3rd Qu.:1.0000 Women's Issues (not abortion though): 362
Max. :1.0000 Racial issues : 353
(Other) :1335
subject_matter.confidence candidate_gold name relevant_yn_gold
Min. :0.2222 :13843 msgoddessrises : 55 :13839
1st Qu.:0.6413 No candidate mentioned: 10 RT0787 : 42 yes: 32
Median :1.0000 Donald Trump : 6 b140tweet : 34
Mean :0.7828 Mike Huckabee : 3 jojo21 : 33
3rd Qu.:1.0000 Jeb Bush : 2 SupermanHotMale: 32
Max. :1.0000 Marco Rubio : 2 EusebiaAq : 30
(Other) : 5 (Other) :13645
retweet_count sentiment_gold subject_matter_gold
Min. : 0.0 :13856 :13853
1st Qu.: 0.0 Negative : 10 FOX News or Moderators : 4
Median : 2.0 Negative\nNeutral: 1 Abortion : 3
Mean : 45.8 Positive : 3 Abortion\nWomen's Issues (not abortion though): 2
3rd Qu.: 44.0 Positive\nNeutral: 1 Immigration : 2
Max. :4965.0 Religion : 2
(Other) : 5
text tweet_coord tweet_created
Length:13871 :13850 2015-08-07 08:44:44 -0700: 8
Class :character [41.7599487, -72.7024307] : 2 2015-08-07 09:29:29 -0700: 8
Mode :character [21.97850292, -159.34894421]: 1 2015-08-07 08:39:31 -0700: 7
[25.07852081, -80.44294696] : 1 2015-08-07 09:37:23 -0700: 7
[26.26744224, -80.20737825] : 1 2015-08-07 09:47:57 -0700: 7
[29.58026, -95.2541389] : 1 2015-08-07 09:52:47 -0700: 7
(Other) : 15 (Other) :13827
tweet_id tweet_location user_timezone
Min. :6.295e+17 :3912 :4403
1st Qu.:6.295e+17 USA : 239 Eastern Time (US & Canada):3474
Median :6.297e+17 Washington, DC: 128 Central Time (US & Canada):1943
Mean :6.296e+17 New York, NY : 120 Pacific Time (US & Canada):1655
3rd Qu.:6.297e+17 Texas : 88 Atlantic Time (Canada) : 461
Max. :6.297e+17 New York : 75 Quito : 432
(Other) :9309 (Other) :1503
These tweets have been hand-labeled by crowdworkers.
Here are the first few tweets:
head(tweet$text)
[1] "RT @NancyLeeGrahn: How did everyone feel about the Climate Change question last night? Exactly. #GOPDebate"
[2] "RT @ScottWalker: Didn't catch the full #GOPdebate last night. Here are some of Scott's best lines in 90 seconds. #Walker16 http://t.co/ZSfF…"
[3] "RT @TJMShow: No mention of Tamir Rice and the #GOPDebate was held in Cleveland? Wow."
[4] "RT @RobGeorge: That Carly Fiorina is trending -- hours after HER debate -- above any of the men in just-completed #GOPdebate says she's on …"
[5] "RT @DanScavino: #GOPDebate w/ @realDonaldTrump delivered the highest ratings in the history of presidential debates. #Trump2016 http://t.co…"
[6] "RT @GregAbbott_TX: @TedCruz: \"On my first day I will rescind every illegal executive action taken by Barack Obama.\" #GOPDebate @FoxNews"
# Form the document-term matrix of unigram counts (bigrams come later)
corpus <- VCorpus(VectorSource(tweet$text))
control <- list(tolower = TRUE, removePunctuation = TRUE,
removeNumbers = TRUE, wordLengths=c(1, Inf))
dtm <- DocumentTermMatrix(corpus, control=control)
dtm <- sparseMatrix(dtm$i, dtm$j, x = dtm$v, dim=dim(dtm),
dimnames=dimnames(dtm))
To compare the methods, we will use a random sample of 80% of the dataset for training and the remaining 20% for testing.
train_ix <- sample(nrow(dtm), floor(0.8 * nrow(dtm)))
train <- logical(nrow(dtm))
train[train_ix] <- TRUE
test <- !train
In the naive Bayes method, we just predict the same sentiment probabilities for all tweets. We learn these probabilities from the training data. We can do this by fitting a multinomial logit model with no covariates:
(nb <- multinom(sentiment ~ 1, tweet, subset=train))
# weights: 6 (2 variable)
initial value 12190.201955
final value 10323.084189
converged
Call:
multinom(formula = sentiment ~ 1, data = tweet, subset = train)
Coefficients:
(Intercept)
Negative 0.9887498
Positive -0.3547696
Residual Deviance: 20646.17
AIC: 20650.17
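The fitted intercepts map to class probabilities through the multinomial logit (softmax) transform, with the reference class (Neutral) fixed at zero; a quick check, using the coefficients printed above (output not shown):
# softmax of the fitted intercepts; Neutral is the reference class with intercept 0
eta <- c(Neutral=0, Negative=0.9887498, Positive=-0.3547696)
exp(eta) / sum(exp(eta))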
Here are the class probabilities:
predict(nb, newdata=data.frame(row.names=1), "probs")
Neutral Negative Positive
0.2278316 0.6123821 0.1597863
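These are just the empirical class frequencies in the training sample, which we can confirm directly (output not shown):
# empirical sentiment frequencies in the training sample; the intercept-only
# model's fitted probabilities reproduce these
prop.table(table(tweet$sentiment[train]))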
The simplest sentiment detection methods are based on counting the numbers of positive and negative words in the texts. For these counts, we use Bing Liu's lists of positive and negative sentiment words (distributed as a RAR archive).
pos_words <- scan("positive-words.txt", character(), comment.char=";")
neg_words <- scan("negative-words.txt", character(), comment.char=";")
Here are some of the words from Liu's lists:
# Positive words
sample(pos_words, 30)
[1] "flawless" "fortunate" "posh" "overtakes" "rapid" "dotingly"
[7] "god-send" "award" "dignity" "cashback" "free" "refunded"
[13] "evenly" "reasonable" "invulnerable" "unrivaled" "breakthroughs" "pardon"
[19] "wholesome" "bolster" "multi-purpose" "honest" "cure-all" "well-managed"
[25] "admiringly" "affable" "sharper" "nourishment" "immaculate" "astutely"
# Negative words
sample(neg_words, 30)
[1] "infamous" "victimize" "frightening" "feckless" "pander"
[6] "condescension" "unreachable" "inflict" "forgetfulness" "garbage"
[11] "paucity" "oppose" "inconsistency" "accusing" "scars"
[16] "bristle" "wreak" "assail" "reproachful" "oppressiveness"
[21] "annoyingly" "partisans" "blasphemous" "blinding" "stuffy"
[26] "inessential" "bothered" "adverse" "denied" "persecution"
We form vectors with weights for the positive and negative words:
vocab <- colnames(dtm)
nvocab <- length(vocab)
pos_wt <- numeric(nvocab)
pos_wt[match(pos_words, vocab, 0)] <- 1
neg_wt <- numeric(nvocab)
neg_wt[match(neg_words, vocab, 0)] <- 1
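As a quick check (output not shown), we can count how many lexicon words actually appear in the tweet vocabulary:
# number of lexicon words present in the document-term matrix vocabulary
sum(pos_wt) # positive words found in the vocabulary
sum(neg_wt) # negative words found in the vocabulary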
We then form features for each tweet, counting the number of positive and negative words.
tweet$pos_count <- as.numeric(dtm %*% pos_wt)
tweet$neg_count <- as.numeric(dtm %*% neg_wt)
We get weights on these features using a multinomial logistic model:
(dict <- multinom(sentiment ~ pos_count + neg_count, tweet, subset=train))
# weights: 12 (6 variable)
initial value 12190.201955
iter 10 value 9828.090585
iter 20 value 9730.272442
iter 20 value 9730.272392
iter 20 value 9730.272378
final value 9730.272378
converged
Call:
multinom(formula = sentiment ~ pos_count + neg_count, data = tweet,
subset = train)
Coefficients:
(Intercept) pos_count neg_count
Negative 0.6077960 0.1291945 0.7762683
Positive -0.8320178 0.7074497 -0.1689209
Residual Deviance: 19460.54
AIC: 19472.54
This model uses the prior class probabilities to inform the predictions, which helps when the training-set sentiment distribution is representative of the test set.
Here are some predictions from the model:
# positive words only
predict(dict, newdata=data.frame(pos_count=1, neg_count=0), "probs")
Neutral Negative Positive
0.2517297 0.5260237 0.2222466
predict(dict, newdata=data.frame(pos_count=2, neg_count=0), "probs")
Neutral Negative Positive
0.1934605 0.4600146 0.3465250
predict(dict, newdata=data.frame(pos_count=10, neg_count=0), "probs")
Neutral Negative Positive
0.001916385 0.012809427 0.985274188
# negative words only
predict(dict, newdata=data.frame(pos_count=0, neg_count=1), "probs")
Neutral Negative Positive
0.18661507 0.74479749 0.06858744
predict(dict, newdata=data.frame(pos_count=0, neg_count=5), "probs")
Neutral Negative Positive
0.011082590 0.986844915 0.002072495
# both types of words
predict(dict, newdata=data.frame(pos_count=10, neg_count=1), "probs")
Neutral Negative Positive
0.002223457 0.032300162 0.965476381
predict(dict, newdata=data.frame(pos_count=10, neg_count=5), "probs")
Neutral Negative Positive
0.001831347 0.593557447 0.404611206
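For a tweet containing no lexicon words at all, the prediction is driven entirely by the intercepts, which reflect the base rates among such tweets in the training data (output not shown):
# no sentiment words: the prediction is determined by the intercepts alone
predict(dict, newdata=data.frame(pos_count=0, neg_count=0), "probs")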
For a simpler predictor, we can force the coefficients on pos_count and neg_count to have the same absolute value using the following method:
(dict_eq <- multinom(sentiment ~ I(pos_count - neg_count), tweet,
subset=train))
# weights: 9 (4 variable)
initial value 12190.201955
final value 9903.034703
converged
Call:
multinom(formula = sentiment ~ I(pos_count - neg_count), data = tweet,
subset = train)
Coefficients:
(Intercept) I(pos_count - neg_count)
Negative 1.013631 -0.2386327
Positive -0.606270 0.4866102
Residual Deviance: 19806.07
AIC: 19814.07
It is of course possible to use other covariates to aid with the predictions (we will not do this here).
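For instance, one could add the (log) retweet count as an extra covariate. The sketch below shows one such model (not fit here); the name dict_rt and the covariate choice are purely illustrative:
# hypothetical extension (not part of this analysis): add log retweet count
dict_rt <- multinom(sentiment ~ pos_count + neg_count + log1p(retweet_count),
                    tweet, subset=train)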
The main problem with dictionary methods is that many texts do not contain the sentiment lexicon words:
# Raw counts (rows: pos_count, columns: neg_count):
table(tweet$pos_count, tweet$neg_count)
0 1 2 3 4 5 6 7 9
0 4973 1982 549 106 30 1 0 0 0
1 2821 1003 331 69 12 1 1 1 1
2 965 445 83 26 5 2 0 0 0
3 295 87 25 2 0 0 0 0 0
4 27 12 1 3 0 0 0 0 0
5 9 2 0 0 0 0 0 0 0
6 1 0 0 0 0 0 0 0 0
# Relative counts in % (rows: pos_count, columns: neg_count):
round(100 * table(tweet$pos_count, tweet$neg_count) / nrow(tweet), 2)
0 1 2 3 4 5 6 7 9
0 35.85 14.29 3.96 0.76 0.22 0.01 0.00 0.00 0.00
1 20.34 7.23 2.39 0.50 0.09 0.01 0.01 0.01 0.01
2 6.96 3.21 0.60 0.19 0.04 0.01 0.00 0.00 0.00
3 2.13 0.63 0.18 0.01 0.00 0.00 0.00 0.00 0.00
4 0.19 0.09 0.01 0.02 0.00 0.00 0.00 0.00 0.00
5 0.06 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00
6 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
For 35.85% of all texts, we just predict sentiment using the prior probabilities. This is clearly undesirable.
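We can compute this fraction directly (output not shown):
# fraction of tweets with no positive and no negative lexicon words
mean(tweet$pos_count == 0 & tweet$neg_count == 0)
To do better, we next build unigram and bigram features and fit regularized logistic regression models with LiblineaR.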
# Compute bigrams
BigramTokenizer <- function(x) {
unlist(lapply(ngrams(words(x), 2), paste, collapse = " "),
use.names = FALSE)
}
control2 <- control
control2$tokenize <- BigramTokenizer
dtm2 <- DocumentTermMatrix(corpus, control=control2)
dtm2 <- sparseMatrix(dtm2$i, dtm2$j, x = dtm2$v, dim=dim(dtm2),
dimnames=dimnames(dtm2))
x <- cbind(dtm, dtm2) # predictors: unigrams and bigrams
df <- colSums(x > 0) # + remove rare terms
x <- x[,df >= 5]
df <- colSums(x > 0) # + use tf-idf scaling
x <- t(t(x) * log(1 + nrow(x) / df))
x <- as.matrix(x) # + convert to dense matrix (needed for LiblineaR)
y <- tweet$sentiment # response: sentiment
# Choose cost by cross-validation
do_fit <- function(x, y, type, costs=10^c(-6, -3, 0, 3, 6)) {
best_cost <- NA
best_acc <- 0
for (co in costs) {
acc <- LiblineaR(data=x, target=y, type=type, cost=co, bias=TRUE,
cross=5)
cat("Results for C=", co, " : ",acc," accuracy.\n",sep="")
if (acc > best_acc) {
best_cost <- co
best_acc <- acc
}
}
LiblineaR(data=x, target=y, type=type, cost=best_cost, bias=TRUE)
}
ix <- train & y != "Neutral" # training tweets with polar (non-Neutral) sentiment
# l2-regularized logistic regression (type = 0)
fit_l2 <- do_fit(x[ix,], y[ix], type=0)
Results for C=1e-06 : 0.7934174 accuracy.
Results for C=0.001 : 0.8388189 accuracy.
Results for C=1 : 0.8457049 accuracy.
Results for C=1000 : 0.8284314 accuracy.
Results for C=1e+06 : 0.8295985 accuracy.
# l1-regularized logistic regression (type = 6)
fit_l1 <- do_fit(x[ix,], y[ix], type=6)
Results for C=1e-06 : 0.6764706 accuracy.
Results for C=0.001 : 0.7930672 accuracy.
Results for C=1 : 0.8425537 accuracy.
Results for C=1000 : 0.8161765 accuracy.
Results for C=1e+06 : 0.8005369 accuracy.
pred_l2 <- predict(fit_l2, x, proba=TRUE)$probabilities
# the fit only assigns probabilities to Negative/Positive; pad Neutral with 0
# and reorder the columns to match levels(y)
pred_l2 <- cbind(Neutral=0, pred_l2)[,levels(y)]
pred_l1 <- predict(fit_l1, x, proba=TRUE)$probabilities
pred_l1 <- cbind(Neutral=0, pred_l1)[,levels(y)]
# 0-1 loss: 1 when the most probable predicted class differs from the truth
loss <- function(pred, y) {
    as.numeric(apply(pred, 1, which.max) != as.integer(y))
}
# misclassification rates on the training and test sets, overall and
# restricted to the polar (non-Neutral) tweets
risk <- function(pred, y, train, test) {
    neut <- y == "Neutral"
    l <- loss(pred, y)
    c(train=mean(l[train]),
      train_polar=mean(l[train & !neut]),
      test=mean(l[test]),
      test_polar=mean(l[test & !neut]))
}
# Naive Bayes
risk(predict(nb, newdata=tweet, "probs"), y, train, test)
train train_polar test test_polar
0.3876172 0.2069328 0.3881081 0.2142527
# Dictionary
risk(predict(dict, newdata=tweet, "probs"), y, train, test)
train train_polar test test_polar
0.3911319 0.2114846 0.3909910 0.2179547
# l1-regularized logistic
risk(pred_l1, y, train, test)
train train_polar test test_polar
0.24576424 0.02322596 0.33981982 0.15224433
# l2-regularized logistic
risk(pred_l2, y, train, test)
train train_polar test test_polar
0.24441240 0.02147526 0.34054054 0.15316983
sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets base
other attached packages:
[1] tm_0.6-2 NLP_0.1-8 nnet_7.3-11 Matrix_1.2-3 LiblineaR_1.94-2
[6] RColorBrewer_1.1-2 knitr_1.12.3
loaded via a namespace (and not attached):
[1] codetools_0.2-14 lattice_0.20-33 digest_0.6.8 slam_0.1-32 grid_3.2.3
[6] formatR_1.1 magrittr_1.5 evaluate_0.8 stringi_1.0-1 tools_3.2.3
[11] stringr_1.0.0 parallel_3.2.3 methods_3.2.3