Patrick O. Perry, NYU Stern School of Business
We will use the following R packages.
library("LiblineaR")
library("Matrix")
library("nnet") # for multinom
library("tm")
To ensure consistent runs, we set the seed before performing any analysis.
set.seed(0)
We will look at tweets from the first GOP debate. These were collected and annotated by CrowdFlower.
tweet <- read.csv("GOP_REL_ONLY.csv")
tweet$text <- as.character(tweet$text) # don't treat 'text' as a factor
# IMPORTANT: we make 'Neutral' the reference class for the sentiment
tweet$sentiment <- relevel(tweet$sentiment, ref="Neutral")
summary(tweet)
candidate candidate.confidence relevant_yn relevant_yn.confidence
No candidate mentioned:7491 Min. :0.2222 yes:13871 Min. :0.3333
Donald Trump :2813 1st Qu.:0.6742 1st Qu.:1.0000
Jeb Bush : 705 Median :1.0000 Median :1.0000
Ted Cruz : 637 Mean :0.8557 Mean :0.9273
Ben Carson : 404 3rd Qu.:1.0000 3rd Qu.:1.0000
Mike Huckabee : 393 Max. :1.0000 Max. :1.0000
(Other) :1428
sentiment sentiment.confidence subject_matter
Neutral :3142 Min. :0.1860 None of the above :8148
Negative:8493 1st Qu.:0.6517 FOX News or Moderators :2900
Positive:2236 Median :0.6813 Religion : 407
Mean :0.7569 Foreign Policy : 366
3rd Qu.:1.0000 Women's Issues (not abortion though): 362
Max. :1.0000 Racial issues : 353
(Other) :1335
subject_matter.confidence candidate_gold name relevant_yn_gold
Min. :0.2222 :13843 msgoddessrises : 55 :13839
1st Qu.:0.6413 No candidate mentioned: 10 RT0787 : 42 yes: 32
Median :1.0000 Donald Trump : 6 b140tweet : 34
Mean :0.7828 Mike Huckabee : 3 jojo21 : 33
3rd Qu.:1.0000 Jeb Bush : 2 SupermanHotMale: 32
Max. :1.0000 Marco Rubio : 2 EusebiaAq : 30
(Other) : 5 (Other) :13645
retweet_count sentiment_gold subject_matter_gold
Min. : 0.0 :13856 :13853
1st Qu.: 0.0 Negative : 10 FOX News or Moderators : 4
Median : 2.0 Negative\nNeutral: 1 Abortion : 3
Mean : 45.8 Positive : 3 Abortion\nWomen's Issues (not abortion though): 2
3rd Qu.: 44.0 Positive\nNeutral: 1 Immigration : 2
Max. :4965.0 Religion : 2
(Other) : 5
text tweet_coord tweet_created
Length:13871 :13850 2015-08-07 08:44:44 -0700: 8
Class :character [41.7599487, -72.7024307] : 2 2015-08-07 09:29:29 -0700: 8
Mode :character [21.97850292, -159.34894421]: 1 2015-08-07 08:39:31 -0700: 7
[25.07852081, -80.44294696] : 1 2015-08-07 09:37:23 -0700: 7
[26.26744224, -80.20737825] : 1 2015-08-07 09:47:57 -0700: 7
[29.58026, -95.2541389] : 1 2015-08-07 09:52:47 -0700: 7
(Other) : 15 (Other) :13827
tweet_id tweet_location user_timezone
Min. :6.295e+17 :3912 :4403
1st Qu.:6.295e+17 USA : 239 Eastern Time (US & Canada):3474
Median :6.297e+17 Washington, DC: 128 Central Time (US & Canada):1943
Mean :6.296e+17 New York, NY : 120 Pacific Time (US & Canada):1655
3rd Qu.:6.297e+17 Texas : 88 Atlantic Time (Canada) : 461
Max. :6.297e+17 New York : 75 Quito : 432
(Other) :9309 (Other) :1503
These tweets have been hand-labeled by crowdworkers.
Here are the first few tweets:
head(tweet$text)
[1] "RT @NancyLeeGrahn: How did everyone feel about the Climate Change question last night? Exactly. #GOPDebate"
[2] "RT @ScottWalker: Didn't catch the full #GOPdebate last night. Here are some of Scott's best lines in 90 seconds. #Walker16 http://t.co/ZSfF…"
[3] "RT @TJMShow: No mention of Tamir Rice and the #GOPDebate was held in Cleveland? Wow."
[4] "RT @RobGeorge: That Carly Fiorina is trending -- hours after HER debate -- above any of the men in just-completed #GOPdebate says she's on …"
[5] "RT @DanScavino: #GOPDebate w/ @realDonaldTrump delivered the highest ratings in the history of presidential debates. #Trump2016 http://t.co…"
[6] "RT @GregAbbott_TX: @TedCruz: \"On my first day I will rescind every illegal executive action taken by Barack Obama.\" #GOPDebate @FoxNews"
# Form the document-term matrix of unigram counts (bigrams come later)
corpus <- VCorpus(VectorSource(tweet$text))
control <- list(tolower = TRUE, removePunctuation = TRUE,
removeNumbers = TRUE, wordLengths=c(1, Inf))
dtm <- DocumentTermMatrix(corpus, control=control)
dtm <- sparseMatrix(dtm$i, dtm$j, x = dtm$v, dim=dim(dtm),
dimnames=dimnames(dtm))
To compare the methods, we will use a random sample of 80% of the dataset for training and the remaining 20% for testing.
train_ix <- sample(nrow(dtm), floor(0.8 * nrow(dtm)))
train <- logical(nrow(dtm))
train[train_ix] <- TRUE
test <- !train
In the naive Bayes method, we just predict the same sentiment probabilities for all tweets. We learn these probabilities from the training data. We can do this by fitting a multinomial logit model with no covariates:
(nb <- multinom(sentiment ~ 1, tweet, subset=train))
# weights: 6 (2 variable)
initial value 12190.201955
final value 10323.084189
converged
Call:
multinom(formula = sentiment ~ 1, data = tweet, subset = train)
Coefficients:
(Intercept)
Negative 0.9887498
Positive -0.3547696
Residual Deviance: 20646.17
AIC: 20650.17
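The fitted intercepts map to class probabilities through the multinomial logit (softmax) transform, with the reference class (Neutral) fixed at zero; a quick check, using the coefficients printed above (output not shown):
# softmax of the fitted intercepts; Neutral is the reference class with intercept 0
eta <- c(Neutral=0, Negative=0.9887498, Positive=-0.3547696)
exp(eta) / sum(exp(eta))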
Here are the class probabilities:
predict(nb, newdata=data.frame(row.names=1), "probs")
Neutral Negative Positive
0.2278316 0.6123821 0.1597863
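These are just the empirical class frequencies in the training sample, which we can confirm directly (output not shown):
# empirical sentiment frequencies in the training sample; the intercept-only
# model's fitted probabilities reproduce these
prop.table(table(tweet$sentiment[train]))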
The simplest sentiment detection methods are based on counting the numbers of positive and negative words in the texts. For these counts, we use Bing Liu's lists of positive and negative sentiment words (distributed as a RAR archive).
pos_words <- scan("positive-words.txt", character(), comment.char=";")
neg_words <- scan("negative-words.txt", character(), comment.char=";")
Here are some of the words from Liu's lists:
# Positive words
sample(pos_words, 30)
[1] "flawless" "fortunate" "posh" "overtakes" "rapid" "dotingly"
[7] "god-send" "award" "dignity" "cashback" "free" "refunded"
[13] "evenly" "reasonable" "invulnerable" "unrivaled" "breakthroughs" "pardon"
[19] "wholesome" "bolster" "multi-purpose" "honest" "cure-all" "well-managed"
[25] "admiringly" "affable" "sharper" "nourishment" "immaculate" "astutely"
# Negative words
sample(neg_words, 30)
[1] "infamous" "victimize" "frightening" "feckless" "pander"
[6] "condescension" "unreachable" "inflict" "forgetfulness" "garbage"
[11] "paucity" "oppose" "inconsistency" "accusing" "scars"
[16] "bristle" "wreak" "assail" "reproachful" "oppressiveness"
[21] "annoyingly" "partisans" "blasphemous" "blinding" "stuffy"
[26] "inessential" "bothered" "adverse" "denied" "persecution"
We form vectors with weights for the positive and negative words:
vocab <- colnames(dtm)
nvocab <- length(vocab)
pos_wt <- numeric(nvocab)
pos_wt[match(pos_words, vocab, 0)] <- 1
neg_wt <- numeric(nvocab)
neg_wt[match(neg_words, vocab, 0)] <- 1
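As a quick check (output not shown), we can count how many lexicon words actually appear in the tweet vocabulary:
# number of lexicon words present in the document-term matrix vocabulary
sum(pos_wt) # positive words found in the vocabulary
sum(neg_wt) # negative words found in the vocabulary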
We then form features for each tweet, counting the number of positive and negative words.
tweet$pos_count <- as.numeric(dtm %*% pos_wt)
tweet$neg_count <- as.numeric(dtm %*% neg_wt)
We get weights on these features using a multinomial logistic model:
(dict <- multinom(sentiment ~ pos_count + neg_count, tweet, subset=train))
# weights: 12 (6 variable)
initial value 12190.201955
iter 10 value 9828.090585
iter 20 value 9730.272442
iter 20 value 9730.272392
iter 20 value 9730.272378
final value 9730.272378
converged
Call:
multinom(formula = sentiment ~ pos_count + neg_count, data = tweet,
subset = train)
Coefficients:
(Intercept) pos_count neg_count
Negative 0.6077960 0.1291945 0.7762683
Positive -0.8320178 0.7074497 -0.1689209
Residual Deviance: 19460.54
AIC: 19472.54
This model uses the prior class probabilities to inform the predictions, which helps when the training-set sentiment distribution is representative of the test set.
Here are some predictions from the model:
# positive words only
predict(dict, newdata=data.frame(pos_count=1, neg_count=0), "probs")
Neutral Negative Positive
0.2517297 0.5260237 0.2222466
predict(dict, newdata=data.frame(pos_count=2, neg_count=0), "probs")
Neutral Negative Positive
0.1934605 0.4600146 0.3465250
predict(dict, newdata=data.frame(pos_count=10, neg_count=0), "probs")
Neutral Negative Positive
0.001916385 0.012809427 0.985274188
# negative words only
predict(dict, newdata=data.frame(pos_count=0, neg_count=1), "probs")
Neutral Negative Positive
0.18661507 0.74479749 0.06858744
predict(dict, newdata=data.frame(pos_count=0, neg_count=5), "probs")
Neutral Negative Positive
0.011082590 0.986844915 0.002072495
# both types of words
predict(dict, newdata=data.frame(pos_count=10, neg_count=1), "probs")
Neutral Negative Positive
0.002223457 0.032300162 0.965476381
predict(dict, newdata=data.frame(pos_count=10, neg_count=5), "probs")
Neutral Negative Positive
0.001831347 0.593557447 0.404611206
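For a tweet containing no lexicon words at all, the prediction is driven entirely by the intercepts, which reflect the base rates among such tweets in the training data (output not shown):
# no sentiment words: the prediction is determined by the intercepts alone
predict(dict, newdata=data.frame(pos_count=0, neg_count=0), "probs")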
For a simpler predictor, we can force the coefficients on pos_count and neg_count to have the same absolute value using the following method:
(dict_eq <- multinom(sentiment ~ I(pos_count - neg_count), tweet,
subset=train))
# weights: 9 (4 variable)
initial value 12190.201955
final value 9903.034703
converged
Call:
multinom(formula = sentiment ~ I(pos_count - neg_count), data = tweet,
subset = train)
Coefficients:
(Intercept) I(pos_count - neg_count)
Negative 1.013631 -0.2386327
Positive -0.606270 0.4866102
Residual Deviance: 19806.07
AIC: 19814.07
It is of course possible to use other covariates to aid with the predictions (we will not do this here).
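For instance, one could add the (log) retweet count as an extra covariate. The sketch below shows one such model (not fit here); the name dict_rt and the covariate choice are purely illustrative:
# hypothetical extension (not part of this analysis): add log retweet count
dict_rt <- multinom(sentiment ~ pos_count + neg_count + log1p(retweet_count),
                    tweet, subset=train)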
The main problem with dictionary methods is that many texts do not contain the sentiment lexicon words:
# Raw counts (rows: pos_count, columns: neg_count):
table(tweet$pos_count, tweet$neg_count)
0 1 2 3 4 5 6 7 9
0 4973 1982 549 106 30 1 0 0 0
1 2821 1003 331 69 12 1 1 1 1
2 965 445 83 26 5 2 0 0 0
3 295 87 25 2 0 0 0 0 0
4 27 12 1 3 0 0 0 0 0
5 9 2 0 0 0 0 0 0 0
6 1 0 0 0 0 0 0 0 0
# Relative counts in % (rows: pos_count, columns: neg_count):
round(100 * table(tweet$pos_count, tweet$neg_count) / nrow(tweet), 2)
0 1 2 3 4 5 6 7 9
0 35.85 14.29 3.96 0.76 0.22 0.01 0.00 0.00 0.00
1 20.34 7.23 2.39 0.50 0.09 0.01 0.01 0.01 0.01
2 6.96 3.21 0.60 0.19 0.04 0.01 0.00 0.00 0.00
3 2.13 0.63 0.18 0.01 0.00 0.00 0.00 0.00 0.00
4 0.19 0.09 0.01 0.02 0.00 0.00 0.00 0.00 0.00
5 0.06 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00
6 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
For 35.85% of all texts, we just predict sentiment using the prior probabilities. This is clearly undesirable.
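We can compute this fraction directly (output not shown):
# fraction of tweets with no positive and no negative lexicon words
mean(tweet$pos_count == 0 & tweet$neg_count == 0)
To do better, we next build unigram and bigram features and fit regularized logistic regression models with LiblineaR.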
# Compute bigrams
BigramTokenizer <- function(x) {
unlist(lapply(ngrams(words(x), 2), paste, collapse = " "),
use.names = FALSE)
}
control2 <- control
control2$tokenize <- BigramTokenizer
dtm2 <- DocumentTermMatrix(corpus, control=control2)
dtm2 <- sparseMatrix(dtm2$i, dtm2$j, x = dtm2$v, dim=dim(dtm2),
dimnames=dimnames(dtm2))
x <- cbind(dtm, dtm2) # predictors: unigrams and bigrams
df <- colSums(x > 0) # + remove rare terms
x <- x[,df >= 5]
df <- colSums(x > 0) # + use tf-idf scaling
x <- t(t(x) * log(1 + nrow(x) / df))
x <- as.matrix(x) # + convert to dense matrix (needed for LiblineaR)
y <- tweet$sentiment # response: sentiment
# Choose cost by cross-validation
do_fit <- function(x, y, type, costs=10^c(-6, -3, 0, 3, 6)) {
best_cost <- NA
best_acc <- 0
for (co in costs) {
acc <- LiblineaR(data=x, target=y, type=type, cost=co, bias=TRUE,
cross=5)
cat("Results for C=", co, " : ",acc," accuracy.\n",sep="")
if (acc > best_acc) {
best_cost <- co
best_acc <- acc
}
}
LiblineaR(data=x, target=y, type=type, cost=best_cost, bias=TRUE)
}
ix <- train & y != "Neutral" # training tweets with polar (non-Neutral) sentiment
# l2-regularized logistic regression (type = 0)
fit_l2 <- do_fit(x[ix,], y[ix], type=0)
Results for C=1e-06 : 0.7934174 accuracy.
Results for C=0.001 : 0.8388189 accuracy.
Results for C=1 : 0.8457049 accuracy.
Results for C=1000 : 0.8284314 accuracy.
Results for C=1e+06 : 0.8295985 accuracy.
# l1-regularized logistic regression (type = 6)
fit_l1 <- do_fit(x[ix,], y[ix], type=6)
Results for C=1e-06 : 0.6764706 accuracy.
Results for C=0.001 : 0.7930672 accuracy.
Results for C=1 : 0.8425537 accuracy.
Results for C=1000 : 0.8161765 accuracy.
Results for C=1e+06 : 0.8005369 accuracy.
pred_l2 <- predict(fit_l2, x, proba=TRUE)$probabilities
# the fit only assigns probabilities to Negative/Positive; pad Neutral with 0
# and reorder the columns to match levels(y)
pred_l2 <- cbind(Neutral=0, pred_l2)[,levels(y)]
pred_l1 <- predict(fit_l1, x, proba=TRUE)$probabilities
pred_l1 <- cbind(Neutral=0, pred_l1)[,levels(y)]
# 0-1 loss: 1 when the most probable predicted class differs from the truth
loss <- function(pred, y) {
    as.numeric(apply(pred, 1, which.max) != as.integer(y))
}
# misclassification rates on the training and test sets, overall and
# restricted to the polar (non-Neutral) tweets
risk <- function(pred, y, train, test) {
    neut <- y == "Neutral"
    l <- loss(pred, y)
    c(train=mean(l[train]),
      train_polar=mean(l[train & !neut]),
      test=mean(l[test]),
      test_polar=mean(l[test & !neut]))
}
# Naive Bayes
risk(predict(nb, newdata=tweet, "probs"), y, train, test)
train train_polar test test_polar
0.3876172 0.2069328 0.3881081 0.2142527
# Dictionary
risk(predict(dict, newdata=tweet, "probs"), y, train, test)
train train_polar test test_polar
0.3911319 0.2114846 0.3909910 0.2179547
# l1-regularized logistic
risk(pred_l1, y, train, test)
train train_polar test test_polar
0.24576424 0.02322596 0.33981982 0.15224433
# l2-regularized logistic
risk(pred_l2, y, train, test)
train train_polar test test_polar
0.24441240 0.02147526 0.34054054 0.15316983
sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets base
other attached packages:
[1] tm_0.6-2 NLP_0.1-8 nnet_7.3-11 Matrix_1.2-3 LiblineaR_1.94-2
[6] RColorBrewer_1.1-2 knitr_1.12.3
loaded via a namespace (and not attached):
[1] codetools_0.2-14 lattice_0.20-33 digest_0.6.8 slam_0.1-32 grid_3.2.3
[6] formatR_1.1 magrittr_1.5 evaluate_0.8 stringi_1.0-1 tools_3.2.3
[11] stringr_1.0.0 parallel_3.2.3 methods_3.2.3