Quanteda

Introduction

In many of the ethical arguments I’ve heard, what has stood out is the idea that there are moral beliefs, that some are true and others are not, and that there is unending disagreement as to which is which. What I recall hearing less often is the insistence that what makes a moral claim true is that “it just feels right.”1 If we focus on this claim, that it just feels right, then we need to focus on the term sentiment. In a moral context, when we use the term sentiment, we mean something close to the following: our moral judgments ooze with sentiment. Passion about values, rightness, and wrongness are things that we feel. You’ve likely already noted that just because something feels a certain way does not mean that it is necessarily that way. As such, there is a conflict between normative claims and descriptive ones in this arena. This project is not concerned with normative claims about right and wrong. It is not even concerned with descriptive claims about right and wrong. Rather, the purpose here is merely to look at how claims about rightness and wrongness are framed in literature. For this, it will be useful to use sentiment analysis. To this end, I will begin with this tutorial on quanteda.

Quanteda is an R package for managing and analyzing textual data.2 In this tutorial, we will look at how to import textual data into our project and at some basic operations on this data afforded by quanteda. These will include constructing a corpus, tokens objects, and a document-feature matrix, and conducting more advanced operations on these. Finally, we will look at text scaling and at document classification using both Naive Bayes and topic models.
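As a rough preview of the objects named above, the sketch below builds a corpus, a tokens object, and a document-feature matrix from two made-up sentences. The example texts and the object names (txt, corp, toks, dfmat) are illustrative assumptions, not data from this project, and the code assumes quanteda is already installed (installation is covered in this tutorial).

```r
library(quanteda)

# Two hypothetical documents, named so they are easy to identify
txt <- c(doc1 = "Honesty is right.",
         doc2 = "Lying is wrong, and lying hurts.")

# Corpus: the documents plus any document-level metadata
corp <- corpus(txt)

# Tokens: each document split into words, with punctuation removed
toks <- tokens(corp, remove_punct = TRUE)

# Document-feature matrix: one row per document, one column per word
dfmat <- dfm(toks)
print(dfmat)
```

Each step takes the previous object as its input, which is the order the rest of the tutorial follows.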

Installing R and RStudio

To get started, you will first need to download and install both base R and RStudio. R is a programming language used primarily for statistical analysis, though it can be used for other applications as well.

You can install it by following the directions found here:

https://cran.r-project.org/

First, you will want to download the version of R for your operating system:

  1. Linux (one of the following: debian, fedora, redhat, suse, ubuntu)

  2. macOS

  3. Windows

Next, while not strictly necessary, RStudio is the most efficient way to learn, so you will want to download and install it, specifically the RStudio Desktop version. The RStudio Desktop link will take you to a download page.

Choose your operating system (Windows, macOS, or one of the Linux distributions) and click on the file next to it.

Installation of Packages

Quanteda

Next we will install quanteda.

When you open RStudio, there are traditionally four windows. At the top left is your editing environment (if you do not see this initially, double-click in the blank space next to “Console” and “Terminal”). Below that are your “Console” and “Terminal” environments. The Console is the environment in which you will enter all of your R commands, such as the ones below. The Terminal allows you to control your Mac or Windows computer from the command line. Here is a helpful page on its use for Mac and another for Windows; however, you will likely not need these, as the Terminal environment in RStudio performs all the same functions.

Type the command below into your console:

install.packages("quanteda")

There are three additional packages that we will use throughout the tutorial:

install.packages("quanteda.textmodels")
install.packages("quanteda.textstats")
install.packages("quanteda.textplots")

Readtext, Devtools, SpaCyr

We will use the following package to read in different types of textual data.

install.packages("readtext")

We will be using several datasets provided by quanteda. These are available in quanteda.corpora, but it will need to be installed from GitHub using the function install_github() from the devtools package.

install.packages("devtools") # get devtools to install quanteda.corpora
devtools::install_github("quanteda/quanteda.corpora")

Although this tutorial does not cover syntactical analysis, it will be helpful to install spacyr, a package that does part-of-speech tagging, entity recognition, and dependency parsing. It provides an interface to the spaCy library and works well with quanteda.

install.packages("spacyr")

Putting them to use

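# Load the packages installed above
require(quanteda)
require(quanteda.textmodels)
require(quanteda.textstats)
require(quanteda.textplots)
require(readtext)
require(devtools)
require(quanteda.corpora)
# newsmap and seededlda are not installed above; if needed, install them
# first with install.packages(c("newsmap", "seededlda"))
require(newsmap)
require(seededlda)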

Vectors and Other Objects

Vectors are objects that contain a set of values. For instance, 2, 3, 4 is a numeric vector, and "apples", "oranges", "bananas" is a character vector.

We can store vectors and print them out later.
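The code that produced the output below is not shown. A minimal sketch that would store and index such vectors follows; the values are inferred from the printed output and from the names vec_num and vec_char used later in this tutorial, and the particular indexing calls are assumptions.

```r
# Store a numeric vector and a character vector
# (values inferred from the printed output; the original code is not shown)
vec_num <- c(1, 5, 6, 3)
vec_char <- c("apple", "banana", "mandarin", "melon")
print(vec_num)
print(vec_char)

# Select elements by position
print(vec_char[2])        # second element
print(vec_num[1])         # first element
print(vec_char[c(1, 3)])  # first and third elements
print(vec_char[1:3])      # first through third elements
```

Square brackets select elements by position, and a vector of positions such as c(1, 3) or 1:3 selects several elements at once.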

## [1] 1 5 6 3
## [1] "apple"    "banana"   "mandarin" "melon"
## [1] "banana"
## [1] 1
## [1] "apple"    "mandarin"
## [1] "apple"    "banana"   "mandarin"
## [1] 1 5 6 3
## [1] "apple"    "banana"   "mandarin" "melon"
## [1] "apple"
## [1] 1
## [1] "apple"    "mandarin"
## [1] "apple"    "banana"   "mandarin"

Functions on Vectors

We can also perform functions, including arithmetical or logical operations, on vectors once we’ve stored them. Afterwards we can print out the result.

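# Add 2 to every element of the vector
vec_num2 <- vec_num + 2
print(vec_num2)

# Test each element against 2; the result is a logical vector
vec_num3 <- vec_num > 2
print(vec_num3)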
## [1] 3 7 8 5
## [1] FALSE  TRUE  TRUE  TRUE

Dataframe and Matrix

A dataframe is a set of vectors combined into a dataset. Vectors can only be combined in a dataframe if they have the same length, though vectors of different types, character and numerical for instance, can be combined. A matrix is like a dataframe in that it represents ‘rectangular’ data. However, a matrix holds a single type of data, whereas a dataframe can hold columns of different types.
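The difference in how the two structures treat mixed types can be checked directly. The sketch below uses hypothetical vectors (labels and nums, not objects from this tutorial) and base R only.

```r
# Two hypothetical vectors of equal length but different types
labels <- c("apple", "banana", "mandarin")
nums <- c(1, 5, 6)

# A dataframe keeps each column's own type
df_mixed <- data.frame(label = labels, count = nums)
print(class(df_mixed$count))

# A matrix holds a single type, so the numbers are converted to text
mat_mixed <- cbind(labels, nums)
print(typeof(mat_mixed))
```

Because the matrix converts the numbers to character strings, arithmetic on that column is no longer possible without converting back, which is one reason to prefer a dataframe for mixed data.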

A dataframe

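# Combine the two vectors into a dataframe, one column per vector
dat_fruit <- data.frame(name = vec_char, count = vec_num)
print(dat_fruit)

# With no dimensions given, matrix() returns a single column
mat <- matrix(c(1:8))
print(mat)

# Fill a matrix with two rows, column by column
mat <- matrix(c(1, 3, 6, 8, 3, 5, 2, 7), nrow = 2)
print(mat)

# Label the columns with the fruit names
colnames(mat) <- vec_char
print(mat)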
name      count
apple         1
banana        5
mandarin      6
melon         3
##      [,1] [,2] [,3] [,4]
## [1,]    1    6    3    2
## [2,]    3    8    5    7
##      apple banana mandarin melon
## [1,]     1      6        3     2
## [2,]     3      8        5     7

A matrix


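# Build a dataframe from the names and the matrix; the two matrix rows
# are recycled to match the four names
dat_fruit <- data.frame(name = vec_char, count = mat)

dat_fruit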
##       name count.apple count.banana count.mandarin count.melon
## 1    apple           1            6              3           2
## 2   banana           3            8              5           7
## 3 mandarin           1            6              3           2
## 4    melon           3            8              5           7

Importing Data

There are several ways of importing data into your project. But first you need to gather your data. In the drive folder, I’ve made a number of folders.
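As one hedged illustration of importing tabular text data, the sketch below writes a tiny CSV file to a temporary location and reads it back with base R. The column names Artist, Title, and Lyrics match the output shown in this section; the rows, the file path, and the object names (path_csv, dat_example, dat_songs) are hypothetical placeholders rather than the project's actual data. The readtext package installed earlier can read such files as well; read.csv() is used here only to keep the sketch free of dependencies.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
# The rows are hypothetical placeholders, not the project's data.
path_csv <- tempfile(fileext = ".csv")
dat_example <- data.frame(
  Artist = c("Artist A", "Artist B"),
  Title  = c("Song one", "Song two"),
  Lyrics = c("First placeholder lyric.", "Second placeholder lyric.")
)
write.csv(dat_example, path_csv, row.names = FALSE)

# Read the file back in and inspect its columns and size
dat_songs <- read.csv(path_csv, stringsAsFactors = FALSE)
print(names(dat_songs))
print(nrow(dat_songs))
```

Checking the column names and the number of rows right after import is a quick way to catch an empty or misread file before building a corpus from it.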

## [1] "Artist" "Title"  "Lyrics"
## [1] Artist Title  Lyrics
## <0 rows> (or 0-length row.names)
## [1] Artist Title  Lyrics
## <0 rows> (or 0-length row.names)
Artist Title Lyrics
## NULL

Corpus Objects

A corpus object is built from a dataframe in which one column, initially named “text”, typically holds the data that we will analyze. With our data imported, the next step is to make a corpus object and view a summary of it.
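The code that builds the corpus used in this section is not shown. A minimal sketch of the step, assuming a hypothetical two-row dataframe (dat_example) rather than the project's data, might look like this:

```r
library(quanteda)

# Hypothetical dataframe; the document text sits in the "text" column
dat_example <- data.frame(
  doc_id = c("doc1", "doc2"),
  text   = c("Honesty feels right.", "Lying feels wrong."),
  Source = c("example A", "example B"),
  stringsAsFactors = FALSE
)

# corpus() looks for the document text in the "text" column by default;
# the remaining columns are kept as document variables
corp_example <- corpus(dat_example)

# Summarize the corpus and count its documents
print(summary(corp_example))
print(ndoc(corp_example))
```

If the text lives in a differently named column, the text_field argument of corpus() can point to it instead.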

What we see here is that our corpus has 164 documents in total. To see what these look like, we can view the first few, the last few, or the ones in the middle.


head(corp_inaug)
tail(corp_inaug)

corp_inaug[50:60]

Next we will organize it visually. What we are doing here is creating columns and telling R what they will be for us. For instance, our document identification variables will be Title, Text, and Source.

References


  1. See @prin04 and @slot10.

  2. See @beno18.