SentimentAnalysis Vignette

Stefan Feuerriegel

Nicolas Proellochs

2018-04-09

The SentimentAnalysis package introduces a powerful toolchain facilitating the sentiment analysis of textual contents in R. This implementation utilizes various existing dictionaries, such as QDAP, Harvard IV and Loughran-McDonald. Furthermore, it can also create customized dictionaries. The latter function uses LASSO regularization as a statistical approach to select relevant terms based on an exogenous response variable. Finally, all methods can be easily compared using built-in evaluation routines.

Introduction

Sentiment analysis is a research branch located at the heart of natural language processing (NLP), computational linguistics and text mining. It refers to any measures by which subjective information is extracted from textual documents. In other words, it extracts the polarity of the expressed opinion in a range spanning from positive to negative. As a result, one may also refer to sentiment analysis as opinion mining (Pang and Lee 2008).

Applications in research

Sentiment analysis has received great traction lately (K. Ravi and Ravi 2015; Pang and Lee 2008), which we explore in the following. Current research in finance and the social sciences utilizes sentiment analysis to understand human decisions in response to textual materials. This immediately reveals manifold implications for practitioners, as well as those involved in the fields of finance research and the social sciences: researchers can use R to extract text components that are relevant for readers and test their hypotheses on this basis. By the same token, practitioners can measure which wording actually matters to their readership and enhance their writing accordingly (Pröllochs, Feuerriegel, and Neumann 2015). We demonstrate below the added benefits in two case studies drawn from finance and the social sciences.

Applications in practice

Several applications demonstrate the uses of sentiment analysis for organizations and enterprises:

Methods for sentiment analysis

As sentiment analysis is applied to a broad variety of domains and textual sources, research has devised various approaches to measuring sentiment. A recent literature overview (Pang and Lee 2008) provides a comprehensive, domain-independent survey.

On the one hand, machine learning approaches are preferred when one strives for high prediction performance. However, machine learning usually works as a black box, thereby making interpretations difficult. On the other hand, dictionary-based approaches generate lists of positive and negative words. The respective occurrences of these words are then combined into a single sentiment score. Therefore, the underlying decisions become traceable and researchers can understand the factors that result in a specific sentiment.
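
The dictionary-based idea described above can be sketched in a few lines of base R. This is a minimal illustration, not the package's implementation: the word lists are hypothetical, and the scoring rule (positive hits minus negative hits, divided by the number of tokens) is an assumption chosen for simplicity.

```r
# Hypothetical mini-dictionary; real dictionaries contain far more entries
positive_words <- c("good", "great", "excellent")
negative_words <- c("bad", "poor", "miserable")

# Score one document: (positive hits - negative hits) / number of tokens
score_document <- function(text) {
  clean  <- tolower(gsub("[[:punct:]]", "", text))
  tokens <- strsplit(clean, "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) == 0) return(0)
  n_pos <- sum(tokens %in% positive_words)
  n_neg <- sum(tokens %in% negative_words)
  (n_pos - n_neg) / length(tokens)
}

docs <- c("That book was excellent.",
          "The service was bad and the food was poor",
          "It rained today")
scores <- vapply(docs, score_document, numeric(1), USE.NAMES = FALSE)
scores
```

Because every score is a simple count over known words, one can always trace a result back to the terms that caused it.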

In addition, SentimentAnalysis allows one to generate tailored dictionaries. These are customized to a specific domain, improve prediction performance compared to pure dictionaries and allow full interpretability. Details of this methodology can be found in (Pröllochs, Feuerriegel, and Neumann 2015).

In the process of performing sentiment analysis, one must convert the running text into a machine-readable format. This is achieved by executing a series of preprocessing operations. First, the text is tokenized into single words, followed by what are common preprocessing steps: stopword removal, stemming, removal of punctuation and conversion to lower-case. These operations are also conducted by default in SentimentAnalysis, but can be adapted to one’s personal needs.
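
For illustration, the preprocessing steps listed above can be approximated in base R as follows. The stopword list is a small hypothetical one, and stemming is left out because it requires an additional package; the sketch is not the routine used inside SentimentAnalysis.

```r
# Illustrative stopword list (an assumption, far shorter than real lists)
stopwords_small <- c("the", "a", "is", "was", "this", "in")

preprocess <- function(text, stopwords = stopwords_small) {
  text   <- tolower(text)                          # conversion to lower-case
  text   <- gsub("[[:punct:]]+", " ", text)        # removal of punctuation
  tokens <- strsplit(trimws(text), "\\s+")[[1]]    # tokenization into single words
  tokens[!(tokens %in% stopwords) & nzchar(tokens)]  # stopword removal
}

tokens <- preprocess("The service in this restaurant was miserable.")
tokens
```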

Setup of the SentimentAnalysis package

Even though sentiment analysis has received great traction lately, the available tools are not yet living up to the needs of researchers. The SentimentAnalysis package is intended to partially close this gap and offer capabilities that most research demands.

First, install the SentimentAnalysis package from CRAN. Afterwards, one merely needs to load it as follows.

# install.packages("SentimentAnalysis")
library(SentimentAnalysis)
## 
## Attaching package: 'SentimentAnalysis'
## The following object is masked from 'package:base':
## 
##     write

Brief demonstration

# Analyze a single string to obtain a binary response (positive / negative)
sentiment <- analyzeSentiment("Yeah, this was a great soccer game for the German team!")
convertToBinaryResponse(sentiment)$SentimentQDAP
## [1] positive
## Levels: negative positive
# Create a vector of strings
documents <- c("Wow, I really like the new light sabers!",
               "That book was excellent.",
               "R is a fantastic language.",
               "The service in this restaurant was miserable.",
               "This is neither positive or negative.",
               "The waiter forget about my dessert -- what poor service!")

# Analyze sentiment
sentiment <- analyzeSentiment(documents)

# Extract dictionary-based sentiment according to the QDAP dictionary
sentiment$SentimentQDAP
## [1]  0.3333333  0.5000000  0.5000000 -0.3333333  0.0000000 -0.4000000
# View sentiment direction (i.e. positive, neutral and negative)
convertToDirection(sentiment$SentimentQDAP)
## [1] positive positive positive negative neutral  negative
## Levels: negative neutral positive
response <- c(+1, +1, +1, -1, 0, -1)

compareToResponse(sentiment, response)
## Warning in cor(sentiment, response): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(sentiment, response): the standard deviation is zero
##                              WordCount SentimentGI NegativityGI
## cor                        -0.18569534   0.9900115  -0.99748901
## cor.t.statistic            -0.37796447  14.0440465 -28.16913204
## cor.p.value                -0.37796447  14.0440465 -28.16913204
## lm.t.value                 -0.37796447  14.0440465 -28.16913204
## r.squared                   0.03448276   0.9801228   0.99498433
## RMSE                        3.82970843   0.4501029   1.18665418
## MAE                         3.33333333   0.4000000   1.10000000
## Accuracy                    0.66666667   1.0000000   0.66666667
## Precision                          NaN   1.0000000          NaN
## Sensitivity                 0.00000000   1.0000000   0.00000000
## Specificity                 1.00000000   1.0000000   1.00000000
## F1                          0.00000000   0.5000000   0.00000000
## BalancedAccuracy            0.50000000   1.0000000   0.50000000
## avg.sentiment.pos.response  3.25000000   0.3333333   0.08333333
## avg.sentiment.neg.response  4.00000000  -0.6333333   0.63333333
##                            PositivityGI SentimentHE NegativityHE
## cor                           0.9429542   0.4152274 -0.083045480
## cor.t.statistic               5.6647055   0.9128709 -0.166666667
## cor.p.value                   5.6647055   0.9128709 -0.166666667
## lm.t.value                    5.6647055   0.9128709 -0.166666667
## r.squared                     0.8891626   0.1724138  0.006896552
## RMSE                          0.7136240   0.8416254  0.922958207
## MAE                           0.6666667   0.7500000  0.888888889
## Accuracy                      0.6666667   0.6666667  0.666666667
## Precision                           NaN         NaN          NaN
## Sensitivity                   0.0000000   0.0000000  0.000000000
## Specificity                   1.0000000   1.0000000  1.000000000
## F1                            0.0000000   0.0000000  0.000000000
## BalancedAccuracy              0.5000000   0.5000000  0.500000000
## avg.sentiment.pos.response    0.4166667   0.1250000  0.083333333
## avg.sentiment.neg.response    0.0000000   0.0000000  0.000000000
##                            PositivityHE SentimentLM NegativityLM
## cor                           0.3315938   0.7370455  -0.40804713
## cor.t.statistic               0.7029595   2.1811142  -0.89389841
## cor.p.value                   0.7029595   2.1811142  -0.89389841
## lm.t.value                    0.7029595   2.1811142  -0.89389841
## r.squared                     0.1099545   0.5432361   0.16650246
## RMSE                          0.8525561   0.7234178   0.96186547
## MAE                           0.8055556   0.6333333   0.92222222
## Accuracy                      0.6666667   0.8333333   0.66666667
## Precision                           NaN   1.0000000          NaN
## Sensitivity                   0.0000000   0.5000000   0.00000000
## Specificity                   1.0000000   1.0000000   1.00000000
## F1                            0.0000000   0.3333333   0.00000000
## BalancedAccuracy              0.5000000   0.7500000   0.50000000
## avg.sentiment.pos.response    0.2083333   0.2500000   0.08333333
## avg.sentiment.neg.response    0.0000000  -0.1000000   0.10000000
##                            PositivityLM RatioUncertaintyLM SentimentQDAP
## cor                           0.6305283                 NA     0.9865356
## cor.t.statistic               1.6247248                 NA    12.0642877
## cor.p.value                   1.6247248                 NA    12.0642877
## lm.t.value                    1.6247248                 NA    12.0642877
## r.squared                     0.3975659                 NA     0.9732526
## RMSE                          0.7757911          0.9128709     0.5398902
## MAE                           0.7222222          0.8333333     0.4888889
## Accuracy                      0.6666667          0.6666667     1.0000000
## Precision                           NaN                NaN     1.0000000
## Sensitivity                   0.0000000          0.0000000     1.0000000
## Specificity                   1.0000000          1.0000000     1.0000000
## F1                            0.0000000          0.0000000     0.5000000
## BalancedAccuracy              0.5000000          0.5000000     1.0000000
## avg.sentiment.pos.response    0.3333333          0.0000000     0.3333333
## avg.sentiment.neg.response    0.0000000          0.0000000    -0.3666667
##                            NegativityQDAP PositivityQDAP
## cor                           -0.94433955      0.9429542
## cor.t.statistic               -5.74114834      5.6647055
## cor.p.value                   -5.74114834      5.6647055
## lm.t.value                    -5.74114834      5.6647055
## r.squared                      0.89177719      0.8891626
## RMSE                           1.06840137      0.7136240
## MAE                            1.01111111      0.6666667
## Accuracy                       0.66666667      0.6666667
## Precision                             NaN            NaN
## Sensitivity                    0.00000000      0.0000000
## Specificity                    1.00000000      1.0000000
## F1                             0.00000000      0.0000000
## BalancedAccuracy               0.50000000      0.5000000
## avg.sentiment.pos.response     0.08333333      0.4166667
## avg.sentiment.neg.response     0.36666667      0.0000000
compareToResponse(sentiment, convertToBinaryResponse(response))
##                            WordCount SentimentGI NegativityGI PositivityGI
## Accuracy                   0.6666667   1.0000000   0.66666667    0.6666667
## Precision                        NaN   1.0000000          NaN          NaN
## Sensitivity                0.0000000   1.0000000   0.00000000    0.0000000
## Specificity                1.0000000   1.0000000   1.00000000    1.0000000
## F1                         0.0000000   0.5000000   0.00000000    0.0000000
## BalancedAccuracy           0.5000000   1.0000000   0.50000000    0.5000000
## avg.sentiment.pos.response 3.2500000   0.3333333   0.08333333    0.4166667
## avg.sentiment.neg.response 4.0000000  -0.6333333   0.63333333    0.0000000
##                            SentimentHE NegativityHE PositivityHE
## Accuracy                     0.6666667   0.66666667    0.6666667
## Precision                          NaN          NaN          NaN
## Sensitivity                  0.0000000   0.00000000    0.0000000
## Specificity                  1.0000000   1.00000000    1.0000000
## F1                           0.0000000   0.00000000    0.0000000
## BalancedAccuracy             0.5000000   0.50000000    0.5000000
## avg.sentiment.pos.response   0.1250000   0.08333333    0.2083333
## avg.sentiment.neg.response   0.0000000   0.00000000    0.0000000
##                            SentimentLM NegativityLM PositivityLM
## Accuracy                     0.8333333   0.66666667    0.6666667
## Precision                    1.0000000          NaN          NaN
## Sensitivity                  0.5000000   0.00000000    0.0000000
## Specificity                  1.0000000   1.00000000    1.0000000
## F1                           0.3333333   0.00000000    0.0000000
## BalancedAccuracy             0.7500000   0.50000000    0.5000000
## avg.sentiment.pos.response   0.2500000   0.08333333    0.3333333
## avg.sentiment.neg.response  -0.1000000   0.10000000    0.0000000
##                            RatioUncertaintyLM SentimentQDAP NegativityQDAP
## Accuracy                            0.6666667     1.0000000     0.66666667
## Precision                                 NaN     1.0000000            NaN
## Sensitivity                         0.0000000     1.0000000     0.00000000
## Specificity                         1.0000000     1.0000000     1.00000000
## F1                                  0.0000000     0.5000000     0.00000000
## BalancedAccuracy                    0.5000000     1.0000000     0.50000000
## avg.sentiment.pos.response          0.0000000     0.3333333     0.08333333
## avg.sentiment.neg.response          0.0000000    -0.3666667     0.36666667
##                            PositivityQDAP
## Accuracy                        0.6666667
## Precision                             NaN
## Sensitivity                     0.0000000
## Specificity                     1.0000000
## F1                              0.0000000
## BalancedAccuracy                0.5000000
## avg.sentiment.pos.response      0.4166667
## avg.sentiment.neg.response      0.0000000
plotSentimentResponse(sentiment$SentimentQDAP, response)
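
To make the evaluation output above easier to read, the following sketch recomputes three of the reported metrics by hand for the SentimentQDAP scores. It assumes the usual textbook definitions of correlation, RMSE and MAE; the variable names are chosen here for illustration.

```r
# Values copied from the SentimentQDAP output and the response vector above
predicted <- c(0.3333333, 0.5, 0.5, -0.3333333, 0, -0.4)
response  <- c(+1, +1, +1, -1, 0, -1)

correlation <- cor(predicted, response)        # Pearson correlation
rmse <- sqrt(mean((predicted - response)^2))   # root mean squared error
mae  <- mean(abs(predicted - response))        # mean absolute error

round(c(cor = correlation, RMSE = rmse, MAE = mae), 4)
```

The resulting values agree with the cor, RMSE and MAE rows of the SentimentQDAP column printed by compareToResponse().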

The SentimentAnalysis package removes much of the manual effort here: it recognizes that the user has passed a vector of strings and thus automatically performs a set of default preprocessing operations from text mining. It tokenizes each document and finally converts the input into a document-term matrix. All of these operations are undertaken without manual specification. The analyzeSentiment() routine also accepts other input formats in case the user has already performed a preprocessing step or wants to implement a specific set of operations.

Functionality

The following sections present the functionality in terms of working with different input formats and the underlying dictionaries.

Interface

The SentimentAnalysis package provides interfaces with several input formats, among which are vectors of strings, corpus objects and document-term matrices.

We provide examples in the following.

Vector of strings

documents <- c("This is good",
               "This is bad",
               "This is inbetween")
convertToDirection(analyzeSentiment(documents)$SentimentQDAP)
## [1] positive negative neutral 
## Levels: negative neutral positive

Corpus object

library(tm)
## Loading required package: NLP
corpus <- VCorpus(VectorSource(documents))
convertToDirection(analyzeSentiment(corpus)$SentimentQDAP)
## [1] positive negative neutral 
## Levels: negative neutral positive

Document-term matrix

dtm <- preprocessCorpus(corpus)
convertToDirection(analyzeSentiment(dtm)$SentimentQDAP)
## [1] positive negative neutral 
## Levels: negative neutral positive

Since the package can work directly with a document-term matrix, this allows one to use customized preprocessing operations in the first place. Afterwards, one can utilize the SentimentAnalysis package for the computation of sentiment scores. For instance, one can replace the stopwords with those from a different list, or even perform tailored synonym merging, among other options. By default, the package uses the built-in routines transformIntoCorpus() to convert the input into a Corpus object and preprocessCorpus() to convert it into a DocumentTermMatrix.
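
As a sketch of such a customized pipeline, the following code builds a document-term matrix with the tm package and passes it to analyzeSentiment(). The stopword list my_stopwords is hypothetical, and the chosen steps are only one possible configuration; stemming is deliberately omitted here.

```r
library(tm)
library(SentimentAnalysis)

documents <- c("This is good",
               "This is bad",
               "This is inbetween")

# Hypothetical custom stopword list replacing the default one
my_stopwords <- c("this", "is")

corpus <- VCorpus(VectorSource(documents))
corpus <- tm_map(corpus, content_transformer(tolower))  # lower-case
corpus <- tm_map(corpus, removePunctuation)             # strip punctuation
corpus <- tm_map(corpus, removeWords, my_stopwords)     # custom stopwords

# wordLengths = c(1, Inf) keeps short terms in the matrix
dtm <- DocumentTermMatrix(corpus, control = list(wordLengths = c(1, Inf)))

# The package computes sentiment directly from the document-term matrix
sentiment <- analyzeSentiment(dtm)
convertToDirection(sentiment$SentimentQDAP)
```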

Built-in dictionaries

The SentimentAnalysis package ships with three different dictionaries:

All of them can be manually inspected and even accessed as follows:

# Make dictionary available in the current R environment
data(DictionaryHE)
# Display the internal structure 
str(DictionaryHE)
## List of 2
##  $ negative: chr [1:85] "below" "challenge" "challenged" "challenges" ...
##  $ positive: chr [1:105] "above" "accomplish" "accomplished" "accomplishes" ...
# Access dictionary as an object of type SentimentDictionary
dict.HE <- loadDictionaryHE()
# Print summary statistics of dictionary
summary(dict.HE)
## Dictionary type:  binary (positive / negative)
## Total entries:    97
## Positive entries: 53 (54.64%)
## Negative entries: 44 (45.36%)
data(DictionaryLM)
str(DictionaryLM)
## List of 3
##  $ negative   : chr [1:2355] "abandon" "abandoned" "abandoning" "abandonment" ...
##  $ positive   : chr [1:354] "able" "abundance" "abundant" "acclaimed" ...
##  $ uncertainty: chr [1:297] "abeyance" "abeyances" "almost" "alteration" ...

Dictionary functions

The SentimentAnalysis package distinguishes between three different types of dictionaries. All of them differ by the data they store, which ultimately also controls which methods of sentiment analysis one can apply. The dictionaries are as follows:

SentimentDictionaryWordlist

d <- SentimentDictionaryWordlist(c("uncertain", "possible", "likely"))
summary(d)
## Dictionary type:  word list (single set)
## Total entries:    3
# Alternative call
d <- SentimentDictionary(c("uncertain", "possible", "likely"))
summary(d)
## Dictionary type:  word list (single set)
## Total entries:    3

SentimentDictionaryBinary

d <- SentimentDictionaryBinary(c("increase", "rise", "more"),
                               c("fall", "drop"))
summary(d)
## Dictionary type:  binary (positive / negative)
## Total entries:    5
## Positive entries: 3 (60%)
## Negative entries: 2 (40%)
# Alternative call
d <- SentimentDictionary(c("increase", "rise", "more"),
                         c("fall", "drop"))
summary(d)
## Dictionary type:  binary (positive / negative)
## Total entries:    5
## Positive entries: 3 (60%)
## Negative entries: 2 (40%)

SentimentDictionaryWeighted

d <- SentimentDictionaryWeighted(c("increase", "decrease", "exit"),
                                 c(+1, -1, -10),
                                 rep(NA, 3))
summary(d)
## Dictionary type:  weighted (words with individual scores)
## Total entries:    3
## Positive entries: 1 (33.33%)
## Negative entries: 2 (66.67%)
## Neutral entries:  0 (0%)
## 
## Details
## Average score:      -3.333333
## Median:             -1
## Min:                -10
## Max:                1
## Standard deviation: 5.859465
## Skewness:           -0.6155602
# Alternative call
d <- SentimentDictionary(c("increase", "decrease", "exit"),
                         c(+1, -1, -10),
                         rep(NA, 3))
summary(d)                         
## Dictionary type:  weighted (words with individual scores)
## Total entries:    3
## Positive entries: 1 (33.33%)
## Negative entries: 2 (66.67%)
## Neutral entries:  0 (0%)
## 
## Details
## Average score:      -3.333333
## Median:             -1
## Min:                -10
## Max:                1
## Standard deviation: 5.859465
## Skewness:           -0.6155602

Dictionary generation

The following example shows how the SentimentAnalysis package can extract statistically relevant textual drivers based on an exogenous response variable. The details of this method are presented in (Pröllochs, Feuerriegel, and Neumann 2015), while we provide a brief summary here. Let $y$ denote a response variable in the form of a vector. Furthermore, variables $x_1, \ldots, x_n$ give the number of occurrences of word $i$ in a document. The methodology then estimates a linear model $y = \beta_0 + \sum_{i=1}^{n} \beta_i x_i + \varepsilon$ with intercept $\beta_0$ and coefficients $\beta_1, \ldots, \beta_n$. The estimation routine is based on LASSO regularization, which implicitly performs variable selection. In so doing, it sets some of the coefficients to exactly zero. The remaining words can then be ranked by polarity according to their coefficient.
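
For reference, the LASSO estimation described above can be written in its standard form as follows. The notation is an assumption made here for illustration: $y_d$ is the response of document $d = 1, \ldots, N$, $x_{d,i}$ is the number of occurrences of word $i$ in document $d$, and $\lambda \geq 0$ is the regularization parameter. The scaling constant $1/(2N)$ follows a common software convention and may differ from the one used in the cited paper.

```latex
\min_{\beta_0, \beta_1, \ldots, \beta_n} \;
\frac{1}{2N} \sum_{d=1}^{N} \Bigl( y_d - \beta_0 - \sum_{i=1}^{n} \beta_i \, x_{d,i} \Bigr)^2
+ \lambda \sum_{i=1}^{n} \lvert \beta_i \rvert
```

The absolute-value penalty is what drives some coefficients to exactly zero; a larger $\lambda$ removes more words from the dictionary.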

# Create a vector of strings
documents <- c("This is a good thing!",
               "This is a very good thing!",
               "This is okay.",
               "This is a bad thing.",
               "This is a very bad thing.")
response <- c(1, 0.5, 0, -0.5, -1)

# Generate dictionary with LASSO regularization
dict <- generateDictionary(documents, response)

dict
## Type: weighted (words with individual scores)
## Intercept: 5.55333e-05
## -0.51 bad
##  0.51 good
summary(dict)
## Dictionary type:  weighted (words with individual scores)
## Total entries:    2
## Positive entries: 1 (50%)
## Negative entries: 1 (50%)
## Neutral entries:  0 (0%)
## 
## Details
## Average score:      -5.251165e-05
## Median:             -5.251165e-05
## Min:                -0.5119851
## Max:                0.5118801
## Standard deviation: 0.7239821
## Skewness:           0

In practice, users have several options for fine-tuning. Among these, they can disable the intercept and fix it to zero, or standardize the response variable $y$. In addition, it is possible to replace the LASSO with any variant of the elastic net, simply by changing the argument alpha.
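
To clarify the role of alpha, the elastic net replaces the pure LASSO penalty with a mixture of an absolute-value and a squared penalty. The form below assumes the parameterization commonly used by the glmnet package, with coefficients $\beta_1, \ldots, \beta_n$ and regularization parameter $\lambda$; whether SentimentAnalysis uses exactly this scaling is an assumption.

```latex
\lambda \Bigl( \alpha \sum_{i=1}^{n} \lvert \beta_i \rvert
+ \frac{1 - \alpha}{2} \sum_{i=1}^{n} \beta_i^2 \Bigr),
\qquad 0 \leq \alpha \leq 1
```

Under this parameterization, $\alpha = 1$ corresponds to the LASSO and $\alpha = 0$ to ridge regression, which shrinks coefficients but does not set them to zero.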

Finally, one can save and reload dictionaries using read() and write() as follows:

write(dict, file="dictionary.dict")
dict <- read("dictionary.dict")

Performance evaluation

Ultimately, several routines allow one to explore the generated dictionary further. On the one hand, a simple overview can be displayed by means of the summary() routine. On the other hand, a kernel density estimation can also visualize the distribution of positive and negative words. For instance, one can identify whether the opinionated words are skewed to either end of the polarity scale. Lastly, the compareDictionaries() routine can compare the generated dictionary to dictionaries from the literature. It automatically computes various metrics, among which are the overlap and the correlation.

compareDictionaries(dict,
                    loadDictionaryQDAP())
## Comparing: wordlist vs weighted
## 
## Total unique words: 4213
## Matching entries: 2 (0.0004747211%)
## Entries with same classification: 0 (0%)
## Entries with different classification: 2 (0.0004747211%)
## Correlation between scores of matching entries: 1
## $totalUniqueWords
## [1] 4213
## 
## $totalSameWords
## [1] 2
## 
## $ratioSameWords
## [1] 0.0004747211
## 
## $numWordsEqualClass
## [1] 0
## 
## $numWordsDifferentClass
## [1] 2
## 
## $ratioWordsEqualClass
## [1] 0
## 
## $ratioWordsDifferentClass
## [1] 0.0004747211
## 
## $correlation
## [1] 1
sentiment <- predict(dict, documents)
compareToResponse(sentiment, response)
##                            Dictionary
## cor                         0.9486833
## cor.t.statistic             5.1961524
## cor.p.value                 5.1961524
## lm.t.value                  5.1961524
## r.squared                   0.9000000
## RMSE                        0.2330104
## MAE                         0.2000111
## Accuracy                    1.0000000
## Precision                   1.0000000
## Sensitivity                 1.0000000
## Specificity                 1.0000000
## F1                          0.5714286
## BalancedAccuracy            1.0000000
## avg.sentiment.pos.response  0.4511680
## avg.sentiment.neg.response -0.6767520
plotSentimentResponse(sentiment, response)
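
The kernel density estimation mentioned at the start of this section can be sketched with base R alone. The scores below are hypothetical and merely stand in for the word scores of a generated dictionary; the sketch does not use the package's own plotting routine.

```r
# Hypothetical dictionary scores, for illustration only
scores <- c(-0.44, -0.29, -0.05, 0.02, 0.29, 0.44)

# Gaussian kernel with the default bandwidth
kde <- density(scores)

plot(kde, main = "Distribution of dictionary scores", xlab = "Score")
abline(v = 0, lty = 2)  # dashed line separates negative from positive scores
```

A density that is clearly heavier on one side of the dashed line would indicate that the opinionated words are skewed towards that end of the polarity scale.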

The following example demonstrates how a calculated dictionary can be used for predicting the sentiment of out-of-sample data. In addition, the code then evaluates the prediction performance by comparing it to the built-in dictionaries.

test_documents <- c("This is neither good nor bad",
                    "What a good idea!",
                    "Not bad")
test_response <- c(0, 1, 1)

pred <- predict(dict, test_documents)

compareToResponse(pred, test_response)
##                              Dictionary
## cor                        5.922189e-05
## cor.t.statistic            5.922189e-05
## cor.p.value                5.922189e-05
## lm.t.value                 5.922189e-05
## r.squared                  3.507232e-09
## RMSE                       8.523018e-01
## MAE                        6.666521e-01
## Accuracy                   3.333333e-01
## Precision                  0.000000e+00
## Sensitivity                         NaN
## Specificity                3.333333e-01
## F1                         0.000000e+00
## BalancedAccuracy                    NaN
## avg.sentiment.pos.response 1.457684e-05
## avg.sentiment.neg.response          NaN
plotSentimentResponse(pred, test_response)

compareToResponse(analyzeSentiment(test_documents), test_response)
## Warning in cor(sentiment, response): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(sentiment, response): the standard deviation is zero
##                             WordCount SentimentGI NegativityGI
## cor                        -0.8660254 -0.18898224   0.18898224
## cor.t.statistic            -1.7320508 -0.19245009   0.19245009
## cor.p.value                -1.7320508 -0.19245009   0.19245009
## lm.t.value                 -1.7320508 -0.19245009   0.19245009
## r.squared                   0.7500000  0.03571429   0.03571429
## RMSE                        1.8257419  1.19023807   0.60858062
## MAE                         1.3333333  0.83333333   0.44444444
## Accuracy                    1.0000000  0.66666667   1.00000000
## Precision                         NaN  0.00000000          NaN
## Sensitivity                       NaN         NaN          NaN
## Specificity                 1.0000000  0.66666667   1.00000000
## F1                          0.0000000  0.00000000   0.00000000
## BalancedAccuracy                  NaN         NaN          NaN
## avg.sentiment.pos.response  2.0000000 -0.16666667   0.44444444
## avg.sentiment.neg.response        NaN         NaN          NaN
##                            PositivityGI SentimentHE NegativityHE
## cor                         -0.18898224 -0.18898224           NA
## cor.t.statistic             -0.19245009 -0.19245009           NA
## cor.p.value                 -0.19245009 -0.19245009           NA
## lm.t.value                  -0.19245009 -0.19245009           NA
## r.squared                    0.03571429  0.03571429           NA
## RMSE                         0.67357531  0.67357531    0.8164966
## MAE                          0.61111111  0.61111111    0.6666667
## Accuracy                     1.00000000  1.00000000    1.0000000
## Precision                           NaN         NaN          NaN
## Sensitivity                         NaN         NaN          NaN
## Specificity                  1.00000000  1.00000000    1.0000000
## F1                           0.00000000  0.00000000    0.0000000
## BalancedAccuracy                    NaN         NaN          NaN
## avg.sentiment.pos.response   0.27777778  0.27777778    0.0000000
## avg.sentiment.neg.response          NaN         NaN          NaN
##                            PositivityHE SentimentLM NegativityLM
## cor                         -0.18898224 -0.18898224   0.18898224
## cor.t.statistic             -0.19245009 -0.19245009   0.19245009
## cor.p.value                 -0.19245009 -0.19245009   0.19245009
## lm.t.value                  -0.19245009 -0.19245009   0.19245009
## r.squared                    0.03571429  0.03571429   0.03571429
## RMSE                         0.67357531  1.19023807   0.60858062
## MAE                          0.61111111  0.83333333   0.44444444
## Accuracy                     1.00000000  0.66666667   1.00000000
## Precision                           NaN  0.00000000          NaN
## Sensitivity                         NaN         NaN          NaN
## Specificity                  1.00000000  0.66666667   1.00000000
## F1                           0.00000000  0.00000000   0.00000000
## BalancedAccuracy                    NaN         NaN          NaN
## avg.sentiment.pos.response   0.27777778 -0.16666667   0.44444444
## avg.sentiment.neg.response          NaN         NaN          NaN
##                            PositivityLM RatioUncertaintyLM SentimentQDAP
## cor                         -0.18898224                 NA   -0.18898224
## cor.t.statistic             -0.19245009                 NA   -0.19245009
## cor.p.value                 -0.19245009                 NA   -0.19245009
## lm.t.value                  -0.19245009                 NA   -0.19245009
## r.squared                    0.03571429                 NA    0.03571429
## RMSE                         0.67357531          0.8164966    1.19023807
## MAE                          0.61111111          0.6666667    0.83333333
## Accuracy                     1.00000000          1.0000000    0.66666667
## Precision                           NaN                NaN    0.00000000
## Sensitivity                         NaN                NaN           NaN
## Specificity                  1.00000000          1.0000000    0.66666667
## F1                           0.00000000          0.0000000    0.00000000
## BalancedAccuracy                    NaN                NaN           NaN
## avg.sentiment.pos.response   0.27777778          0.0000000   -0.16666667
## avg.sentiment.neg.response          NaN                NaN           NaN
##                            NegativityQDAP PositivityQDAP
## cor                            0.18898224    -0.18898224
## cor.t.statistic                0.19245009    -0.19245009
## cor.p.value                    0.19245009    -0.19245009
## lm.t.value                     0.19245009    -0.19245009
## r.squared                      0.03571429     0.03571429
## RMSE                           0.60858062     0.67357531
## MAE                            0.44444444     0.61111111
## Accuracy                       1.00000000     1.00000000
## Precision                             NaN            NaN
## Sensitivity                           NaN            NaN
## Specificity                    1.00000000     1.00000000
## F1                             0.00000000     0.00000000
## BalancedAccuracy                      NaN            NaN
## avg.sentiment.pos.response     0.44444444     0.27777778
## avg.sentiment.neg.response            NaN            NaN

Configuration of preprocessing

When desired, one can implement a tailored preprocessing stage that adapts to specific needs. The following code snippets demonstrate such adaptation. In particular, the SentimentAnalysis package ships a function ngram_tokenize() in order to extract $n$-grams from the corpus. This does not affect the results of the built-in dictionaries but rather changes the features used as part of dictionary generation.

corpus <- VCorpus(VectorSource(documents))
tdm <- TermDocumentMatrix(corpus, 
                          control=list(wordLengths=c(1,Inf), 
                                       tokenize=function(x) ngram_tokenize(x, char=FALSE, 
                                                                           ngmin=1, ngmax=2)))
rownames(tdm)
##  [1] "a"           "a bad"       "a good"      "a very"      "bad"        
##  [6] "bad thing."  "good"        "good thing!" "is"          "is a"       
## [11] "is okay."    "okay."       "thing!"      "thing."      "this"       
## [16] "this is"     "very"        "very bad"    "very good"
dict <- generateDictionary(tdm, response)
summary(dict)
## Dictionary type:  weighted (words with individual scores)
## Total entries:    7
## Positive entries: 3 (42.86%)
## Negative entries: 4 (57.14%)
## Neutral entries:  0 (0%)
## 
## Details
## Average score:      2.136424e-06
## Median:             -5.891262e-05
## Min:                -0.4368759
## Max:                0.4380331
## Standard deviation: 0.3016221
## Skewness:           0.004124293
dict
## Type: weighted (words with individual scores)
## Intercept: -3.230431e-05
## -0.44 bad
## -0.29 very bad
## -0.00 bad thing.
## -0.00 thing.
##  0.00 good thing!
##  0.29 a good
##  0.44 good
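
To make the feature set above more concrete, the following base-R sketch builds word-level unigrams and bigrams by hand. The helper word_ngrams() is hypothetical and not part of SentimentAnalysis; it assumes plain whitespace tokenization and lowercasing, so it mirrors the kind of terms listed in rownames(tdm) rather than the exact behavior of ngram_tokenize().

```r
# Hypothetical helper (not part of SentimentAnalysis): word-level n-grams.
# Assumes ngmin <= ngmax and simple whitespace tokenization.
word_ngrams <- function(text, ngmin = 1, ngmax = 2) {
  tokens <- strsplit(tolower(text), "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]          # drop empty tokens from leading spaces
  out <- character(0)
  for (n in seq(ngmin, ngmax)) {
    if (length(tokens) < n) next            # text too short for this n
    starts <- seq_len(length(tokens) - n + 1)
    grams <- vapply(starts,
                    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
                    character(1))
    out <- c(out, grams)
  }
  out
}

# Unigrams first, then bigrams:
# this, is, a, good, thing!, this is, is a, a good, good thing!
word_ngrams("This is a good thing!")
```

Punctuation stays attached to the tokens here, which is why terms such as "thing!" and "good thing!" appear as separate features.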

Performance optimization

Once the user has decided on a preferred rule, the analyzeSentiment() routine can be restricted to calculate only the rules of interest. This is done by overriding the default value of the argument rules, as in the following example:

sentiment <- analyzeSentiment(documents,
                              rules=list("SentimentLM"=list(ruleSentiment, loadDictionaryLM())))
sentiment
##   SentimentLM
## 1         0.5
## 2         0.5
## 3         0.0
## 4        -0.5
## 5        -0.5

Language support and extensibility

SentimentAnalysis can be adapted for use with languages other than English. This requires changes at two points: the language argument, which controls the preprocessing, and the dictionary, which must contain terms in the target language.

The following example demonstrates how SentimentAnalysis can be adapted to work with a sample in German. Here, we supply one positive and one negative document in the variable documents. Afterwards, we introduce a very small dictionary of positive and negative words, which is stored in dictionaryGerman. Finally, we use analyzeSentiment() to perform a sentiment analysis with two changes: first, we supply language="german" to ensure that all preprocessing operations are carried out for the German language. Second, we define a custom rule named GermanSentiment that uses the customized dictionary from the previous step.

documents <- c("Das ist ein gutes Resultat",
               "Das Ergebnis war schlecht")
dictionaryGerman <- SentimentDictionaryBinary(c("gut"), 
                                              c("schlecht"))

sentiment <- analyzeSentiment(documents,
                              language="german",
                              rules=list("GermanSentiment"=list(ruleSentiment, dictionaryGerman)))
sentiment
##   GermanSentiment
## 1             0.0
## 2            -0.5
convertToBinaryResponse(sentiment$GermanSentiment)
## [1] positive negative
## Levels: negative positive

Similarly, one can implement a dictionary with custom sentiment scores.

woorden <- c("goed","slecht")
scores <- c(0.8,-0.5)
dictionaryDutch <- SentimentDictionaryWeighted(woorden, scores)
documents <- "dit is heel slecht"
sentiment <- analyzeSentiment(documents,
                              language="dutch",
                              rules=list("DutchSentiment"=list(ruleLinearModel, dictionaryDutch)))
sentiment
##   DutchSentiment
## 1           -0.5
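
The score above can be reproduced by hand under a simplifying assumption: that a weighted dictionary scores a document by summing the scores of the dictionary terms it contains, with no intercept. The helper weighted_score() below is hypothetical and skips the package's preprocessing (stemming, stopword removal); it is a sketch of the idea, not of ruleLinearModel() itself.

```r
# Hypothetical helper (not part of SentimentAnalysis).
# Assumption: score = sum of the scores of all matched dictionary terms.
weighted_score <- function(text, words, scores) {
  tokens <- strsplit(tolower(text), "[^[:alpha:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  # match() returns NA for tokens outside the dictionary; these are ignored
  sum(scores[match(tokens, words)], na.rm = TRUE)
}

dict_words  <- c("goed", "slecht")
dict_scores <- c(0.8, -0.5)

weighted_score("dit is heel slecht", dict_words, dict_scores)  # -0.5
```

Only "slecht" is found in the dictionary, so its score of -0.5 is the document score under this assumption.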

Notes:

Worked examples

The following example shows the usage of SentimentAnalysis in an applied setting. More precisely, we utilize Reuters oil-related news from the tm package.

library(tm)
data("crude")

# Analyze sentiment
sentiment <- analyzeSentiment(crude)

# Count positive and negative news releases
table(convertToBinaryResponse(sentiment$SentimentLM))
## 
## negative positive 
##       16        4
# News releases with highest and lowest sentiment
crude[[which.max(sentiment$SentimentLM)]]$meta$heading
## [1] "HOUSTON OIL <HO> RESERVES STUDY COMPLETED"
crude[[which.min(sentiment$SentimentLM)]]$meta$heading
## [1] "DIAMOND SHAMROCK (DIA) CUTS CRUDE PRICES"
# View summary statistics of sentiment variable
summary(sentiment$SentimentLM)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.08772 -0.04366 -0.02341 -0.02953 -0.01375  0.00000
# Visualize distribution of standardized sentiment variable
hist(scale(sentiment$SentimentLM))

# Compute cross-correlation 
cor(sentiment[, c("SentimentLM", "SentimentHE", "SentimentQDAP")])
##               SentimentLM SentimentHE SentimentQDAP
## SentimentLM     1.0000000   0.2769878     0.4769730
## SentimentHE     0.2769878   1.0000000     0.6141075
## SentimentQDAP   0.4769730   0.6141075     1.0000000
# Crude oil news from 1987-02-26 to 1987-03-02
datetime <- do.call(c, lapply(crude, function(x) x$meta$datetimestamp))

plotSentiment(sentiment$SentimentLM)

plotSentiment(sentiment$SentimentLM, x=datetime, cumsum=TRUE)

Word counting

SentimentAnalysis can also be used to count the words in documents with the help of countWords().

# count words (without stopwords)
countWords(documents)
##   WordCount
## 1         3
# count all words (including stopwords)
countWords(documents, removeStopwords=FALSE)
##   WordCount
## 1         4

Note: The package has a built-in rule ruleWordCount(), which is used for the “WordCount” column when calling analyzeSentiment(). However, this rule is likely to return different results from countWords() because it is subject to the preprocessing rules of analyzeSentiment(). By default, that preprocessing removes stopwords, excludes words with three or fewer letters and might apply a sparsity operation. Hence, one should always use countWords() when working with word counts.
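
The effect of such preprocessing on word counts can be sketched in base R. The helper count_tokens() and its one-word stopword list are hypothetical and only illustrate why two counting routes can disagree; they do not reproduce the stopword lists or filters used by countWords() or analyzeSentiment().

```r
# Hypothetical helper (not part of SentimentAnalysis): count tokens after
# optional stopword removal and a minimum word length filter.
count_tokens <- function(text, stopwords = character(0), min_length = 1) {
  tokens <- strsplit(tolower(text), "[^[:alpha:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  tokens <- tokens[!(tokens %in% stopwords)]
  sum(nchar(tokens) >= min_length)
}

text <- "dit is heel slecht"
count_tokens(text)                                    # 4: all words
count_tokens(text, stopwords = "is")                  # 3: "is" removed
count_tokens(text, stopwords = "is", min_length = 4)  # 2: "dit" is too short
```

Each additional filter lowers the count, which is why a word count taken after preprocessing should not be compared directly with a raw count.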

Outlook

The current version leaves open avenues for further enhancement. In the future, we see the following items as being potentially subject to improvements:

We cordially invite everyone to contribute source code, dictionaries and further demos.

License

SentimentAnalysis is released under the MIT License.

Copyright (c) 2016 Stefan Feuerriegel & Nicolas Pröllochs

References

Feinerer, Ingo, Kurt Hornik, and David Meyer. 2008. “Text Mining Infrastructure in R.” Journal of Statistical Software 25 (5): 1–54.

Henry, Elaine. 2008. “Are Investors Influenced by How Earnings Press Releases Are Written?” Journal of Business Communication 45 (4): 363–407.

Loughran, Tim, and Bill McDonald. 2011. “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks.” Journal of Finance 66 (1): 35–65.

Pang, Bo, and Lillian Lee. 2008. “Opinion Mining and Sentiment Analysis.” Foundations and Trends in Information Retrieval 2 (1): 1–135. doi:10.1561/1500000011.

Pröllochs, Nicolas, Stefan Feuerriegel, and Dirk Neumann. 2015. “Generating Domain-Specific Dictionaries Using Bayesian Learning.” In 23rd European Conference on Information Systems (ECIS). Münster, Germany. doi:10.2139/ssrn.2522884.

———. 2016. “Negation Scope Detection in Sentiment Analysis: Decision Support for News-Driven Trading.” Decision Support Systems. doi:10.1016/j.dss.2016.05.009.

Ravi, Kumar, and Vadlamani Ravi. 2015. “A Survey on Opinion Mining and Sentiment Analysis: Tasks, Approaches and Applications.” Knowledge-Based Systems 89: 14–46. doi:10.1016/j.knosys.2015.06.015.

Tetlock, Paul C. 2007. “Giving Content to Investor Sentiment: The Role of Media in the Stock Market.” Journal of Finance 62 (3): 1139–68.