library(rvest)
library(qdap)
library(SnowballC)
If you run into an error loading ‘qdap’, update your Java installation, making sure it matches your R build (32-bit or 64-bit). If you end up with strange characters in your text, change the character encoding using the iconv() function.
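A minimal sketch of the encoding fix with base R’s iconv() (the sample string and the latin1 source encoding are assumptions; match them to your own file):

```r
# A string containing latin1-encoded bytes (stand-in for imported speech text)
raw_text <- "caf\xe9 society"

# Re-encode to UTF-8 so downstream text functions behave
utf_text <- iconv(raw_text, from = "latin1", to = "UTF-8")

# Or drop any characters that have no ASCII equivalent
ascii_text <- iconv(utf_text, from = "UTF-8", to = "ASCII", sub = "")
```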
https://rpubs.com/Custer/text 1/17
9/18/2017 Make Text-Mining Great Again
This is where the first ‘qdap’ function comes into play, qprep(). This function is a wrapper for a number of other replacement functions; using it will speed up pre-processing, but it should be used with caution if more detailed analysis is required. The functions it passes through are as follows:
1. bracketX() - removes brackets and the text inside them
2. replace_abbreviation() - replaces abbreviations
3. replace_number() - converts numbers to words, e.g. 100 becomes one hundred
4. replace_symbol() - converts symbols to words, e.g. @ becomes ‘at’
This chunk of code does the above and also replaces contractions, removes the top 100 stopwords and strips
the text of unwanted characters. Note that we will keep the period and the question marks to assist in sentence
creation.
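The chunk itself isn’t reproduced here, so below is a sketch of what it might look like; the sample sentence is invented, and the donStrip name mirrors the object used later:

```r
library(qdap)

donRaw <- "The U.S. won't accept this (applause) for 100% of Americans."

donPrep  <- qprep(donRaw)                     # brackets, abbreviations, numbers, symbols
donPrep  <- replace_contraction(donPrep)      # e.g. won't -> will not
donPrep  <- rm_stopwords(donPrep, Top100Words, separate = FALSE)
donStrip <- strip(donPrep, char.keep = c("?", "."))  # keep . and ? for sentence splitting
```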
One of the things you can (and should) do is fill spaces between words that belong together, such as a person’s name, which keeps them as a single token in the analysis. The ‘keep’ list below provides an example of this; it will be used in the space_fill() function. You could include several others.
It is also now time to put both speeches into one dataframe, consisting of the text for each respective
candidate.
keep <- c("United States", "Hillary Clinton", "Donald Trump", "middle class", "Supreme Court")
donFill <- data.frame(space_fill(donStrip, keep))
donFill$candidate <- "Trump"
colnames(donFill)[1] <- "text"
hillFill <- data.frame(space_fill(hillStrip, keep))
hillFill$candidate <- "Clinton"
colnames(hillFill)[1] <- "text"
df1 <- rbind(donFill, hillFill)
Critical to any analysis with the ‘qdap’ package is to put the text into sentences with the sentSplit() function. It
also creates the ‘tot’ variable or ‘turn of talk’ index, which is something that would be important for analyzing
the debates.
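A sketch of that step (toy data stands in for the df1 built above):

```r
library(qdap)

df1 <- data.frame(text = c("We will win. We will rebuild.", "I am with you."),
                  candidate = c("Trump", "Clinton"),
                  stringsAsFactors = FALSE)

# One row per sentence; sentSplit also adds the 'tot' (turn of talk) index
df2 <- sentSplit(df1, "text")
str(df2)
```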
str(df2)
We’ve now come to the point where stemming would typically be implemented. That is, reducing a word to its root, e.g. stems, stemming, stemmer all become stem. However, I’m not necessarily a big fan of it anymore and
believe it should be applied judiciously. A number of highly experienced text miners have helped me correct the
error of my former auto-stemming ways. Also, ‘qdap’ has some flexibility in comparing stemmed text versus
non-stemmed text as we shall soon see.
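For reference, the SnowballC package loaded earlier handles the stemming itself:

```r
library(SnowballC)

# The Porter stemmer reduces inflected forms to a common root
wordStem(c("stems", "stemming", "stemmer"), language = "porter")
```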
3. Preliminary Analysis
I’ll start out with the standard word frequency analysis. As is usually the case with ‘qdap’, there are a number
of options to accomplish a task. On your own have a look at the bag_o_words() and word_count() functions.
Here I create a df of the 25 most frequent terms by candidate and compare that data in a plot.
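One way to build those data frames is qdap’s freq_terms(); a sketch with toy data standing in for df2 (the donFreq name is an assumption; hillFreq is the object plotted below):

```r
library(qdap)

df2 <- data.frame(text = c("jobs jobs trade law", "children families jobs"),
                  candidate = c("Trump", "Clinton"),
                  stringsAsFactors = FALSE)

# 25 most frequent terms for each candidate
donFreq  <- freq_terms(df2$text[df2$candidate == "Trump"], top = 25)
hillFreq <- freq_terms(df2$text[df2$candidate == "Clinton"], top = 25)
```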
plot(hillFreq)
No surprise that Trump hits “trade”, “violence”, “immigration” and “law”. Hillary likes to talk about “us” and “me”
(real shock there). Nothing about children or families?
You can create a word frequency matrix, which provides the counts for each word by speaker.
## Clinton Trump
## abandon 0 1
## abandoned 1 1
## able 2 2
## abolish 0 1
## abroad 1 2
## crosser 0 1
## crossings 0 1
## crucial 1 0
## crushed 0 1
## crying 0 1
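That matrix comes from qdap’s wfm() (word frequency matrix) function; a sketch with toy data:

```r
library(qdap)

df2 <- data.frame(text = c("abandon the crossings", "abandon nothing"),
                  candidate = c("Trump", "Clinton"),
                  stringsAsFactors = FALSE)

# Rows are words, columns are speakers, cells are counts
wordMat <- wfm(df2$text, df2$candidate)
wordMat
```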
Of course we need to include the obligatory word cloud. In this case, I will use stemmed words.
There you have it, children and families now appear. Quite a heavy burden being engaged in what former
Assistant Director of the FBI, James Kallstrom, characterized as a criminal foundation AND caring for families
and children. Now that is leadership!
But I digress. A great function is ‘word_associate()’ and building word clouds based on that association. Let’s
give “terror” a try.
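A sketch of the call (toy data stands in for df2; the cloud colors are arbitrary choices):

```r
library(qdap)

df2 <- data.frame(text = c("we will defeat terrorism.", "fight terror together."),
                  candidate = c("Trump", "Clinton"),
                  stringsAsFactors = FALSE)

# Find sentences containing "terror" variants, grouped by candidate,
# and draw an association wordcloud for each speaker
word_associate(df2$text, df2$candidate,
               match.string = "terror",
               wordcloud = TRUE,
               cloud.colors = c("blue", "red", "gray70"))
```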
## 1 6 Trump 6 attacks our police terrorism our cities threaten our very life.
## 2 82 Trump 82 plan begin safety home means safe neighborhoods secure borders protection terrorism.
## 3 114 Trump 114 task our new administration liberate our citizens crime terrorism lawlessness threatens communities.
## 4 133 Trump 133 once again france victim brutal islamic terrorism.
## 5 139 Trump 139 only weeks ago orlando florida forty nine wonderful americans savagely murdered islamic terrorist.
## 6 140 Trump 140 terrorist targeted our lgbt community.
## 8 145 Trump 145 instead must work our allies share our goal destroying isis stamping islamic terror.
## 9 147 Trump 147 lastly must immediately suspend immigration any nation compromised terrorism until such proven vetting mechanisms put place.
## 10 328 Clinton 328 work americans our allies fight terrorism.
## 11 596 Clinton 596 should working responsible gun owners pass common sense reforms keep guns hands criminals terrorists others us harm.
##
## Match Terms
## ===========
##
## List 1:
## terrorism, terrorist, terror, terrorists
##
Comprehensive word statistics are available. Here is a plot of the stats available in the package. The plot loses some of its visual appeal with just two speakers, but it should stimulate your interest nonetheless. A complete explanation of the stats is available under ?word_stats.
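The plot comes from word_stats(); a sketch, with toy data standing in for df2 (the x1 name matches the object inspected further down):

```r
library(qdap)

df1 <- data.frame(text = c("We will win. Believe me.", "I am with you. Always."),
                  candidate = c("Trump", "Clinton"),
                  stringsAsFactors = FALSE)
df2 <- sentSplit(df1, "text")

# Word statistics by candidate; 'tot' is the turn-of-talk index from sentSplit()
x1 <- word_stats(df2$text, df2$candidate, tot = df2$tot)
plot(x1)
```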
## Warning: attributes are not identical across measure variables; they will
## be dropped
The breakdown in the count of sentences and words is interesting: Hillary used a hundred more sentences, but only two hundred more words. I’m curious as to what questions they asked and how they incorporated them.
truncdf(x1$raw)
OK, we’ve learned that rows 473 and 474 should be thrown out. It also looks like we have the classic use of an anaphora by Trump, which is the technique of repeating the first word or words of several consecutive sentences. I think Churchill used it quite a bit, e.g. “We shall not flag or fail. We shall go on to the end. We shall fight in France, we shall…”
df2[c(161:163), 3]
df2[c(473:474), 3]
4. Advanced Analysis
This is where it gets fun with ‘qdap’. You can tag the text by parts of speech. Check out ?pos and have a look
at the vignette for further explanation: https://trinker.github.io/qdap/vignettes/qdap_vignette.html
Be advised that this takes some time, which you can track with a progress bar. Notice Clinton’s use and
Trump’s lack of use of interjections.
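A sketch of the tagging step (pos_by() relies on the openNLP backend, hence the Java dependency noted at the start; toy data stands in for df2, and posBy is an assumed name):

```r
library(qdap)

df2 <- data.frame(text = c("We will rebuild the country.", "Wow, I really believe that!"),
                  candidate = c("Trump", "Clinton"),
                  stringsAsFactors = FALSE)

# Parts-of-speech counts by candidate; slow, with a progress bar by default
posBy <- pos_by(df2$text, df2$candidate)
plot(posBy)
```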
Readability scores (measures of speech complexity) are available. I won’t go into the details as I discuss this in
my book and detailed information is in the ‘qdap’ vignette.
automated_readability_index(df2$text, df2$candidate)
Diversity stats are a measure of language “richness”, or rather, how expansive a speaker’s vocabulary is. The results indicate similar use of vocabulary, certainly not unusual given the assistance of professional speech writers.
diversity(df2$text, df2$candidate)
Formality contextualizes the text by comparing formal parts of speech (noun, adjective, preposition and article)
versus contextual parts of speech (pronoun, verb, adverb, interjection). A plot for analysis is available. Scores
closer to 100 are more formal and those closer to 1 are more contextual.
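The form object plotted below might be built like this (toy data stands in for df2):

```r
library(qdap)

df2 <- data.frame(text = c("The administration proposes comprehensive reform.",
                           "I really think we can do it!"),
                  candidate = c("Trump", "Clinton"),
                  stringsAsFactors = FALSE)

# Formality score per candidate, on the 0-100 scale described above
form <- formality(df2$text, df2$candidate)
```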
plot(form)
Polarity measures sentence sentiment. A plot is available. What we see is that, on average, Trump was slightly
more negative.
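A sketch of the polarity step (toy data stands in for df2; pol is an assumed name):

```r
library(qdap)

df2 <- data.frame(text = c("crime and violence threaten us.", "we are stronger together."),
                  candidate = c("Trump", "Clinton"),
                  stringsAsFactors = FALSE)

# Average sentence polarity per candidate; plot() shows the distribution
pol <- polarity(df2$text, df2$candidate)
plot(pol)
```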
The lexical dispersion plot allows one to see how a word occurs throughout the text, making it interesting for seeing how topics shift over the course of a speech. Note that you can also include freq_terms should you so choose.
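A sketch of the call (the chosen terms are illustrative; toy data stands in for df2):

```r
library(qdap)

df2 <- data.frame(text = c("terror threatens jobs.", "jobs for america."),
                  candidate = c("Trump", "Clinton"),
                  stringsAsFactors = FALSE)

# Where each term appears across the running text, by candidate
dispersion_plot(df2$text, c("terror", "jobs", "america"),
                grouping.var = df2$candidate)
```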
Finally, an example of a gradient wordcloud, which produces one wordcloud colored by a binary grouping
variable. Let’s do one with words not stemmed and one with stemming included.
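A sketch of both versions with gradient_cloud() (toy data stands in for df2):

```r
library(qdap)

df2 <- data.frame(text = c("jobs jobs trade law", "children families jobs"),
                  candidate = c("Trump", "Clinton"),
                  stringsAsFactors = FALSE)

# One cloud, colored by the two-level candidate variable
gradient_cloud(df2$text, df2$candidate)              # unstemmed
gradient_cloud(df2$text, df2$candidate, stem = TRUE) # stemmed
```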
There you have it. Now go find text data, manipulate text data, analyze text data and make text-mining great
again.