RPubs - Text-Mining With Rvest and Qdap

9/18/2017 Make Text-Mining Great Again
Make Text-Mining Great Again

Cory Lesmeister
November 2, 2016
1. Gather the Text

The speeches are available from http://politico.com. I used the ‘rvest’ package to identify
the text and bring it into R. The only other package needed is ‘qdap’. Note that I used the Chrome extension
‘SelectorGadget’ to find the CSS selector for the relevant text.
library(rvest)
library(qdap)
library(SnowballC)
If you run into an error loading ‘qdap’, update your Java version, making sure its architecture matches that of R (32-bit or 64-bit).
donHTML <- read_html("http://www.politico.com/story/2016/07/full-transcript-donald-trump-nomination-acceptance-speech-at-rnc-225974")

hillHTML <- read_html("http://www.politico.com/story/2016/07/full-text-hillary-clintons-dnc-speech-226410")
SelectorGadget facilitates selecting the right html nodes
donNode <- html_nodes(donHTML, "style~ p")

hillNode <- html_nodes(hillHTML, "style~ p")
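To see what a selector such as "style~ p" picks up, here is a minimal sketch on a made-up HTML fragment. The markup and the object names (toyHTML, toyNode, toyText) are hypothetical and not taken from the Politico pages. The `~` combinator selects every `p` element that follows a `style` element under the same parent.

```r
library(rvest)

# Hypothetical markup: two paragraphs share a parent with a <style> tag,
# a third paragraph sits in a different part of the page.
toyHTML <- read_html(
  "<html><body>
     <div><style>p {color: black;}</style><p>First</p><p>Second</p></div>
     <div id='footer'><p>Outside</p></div>
   </body></html>"
)

# Same call pattern as above, using the general-sibling selector
toyNode <- html_nodes(toyHTML, "style ~ p")
toyText <- html_text(toyNode)
toyText
```

Only the paragraphs that are siblings of the `style` element should come back, which is why the selector is a handy way to skip navigation and footer text.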
2. Prepare the Text

You can explore the text as you wish using html_text(). We will need to put the text into a dataframe, but there
are some cleaning tasks that need to be done first.
donText <- html_text(donNode)

donText <- sub("Remarks as prepared for delivery according to a draft obtained by POLITICO Thursday afternoon.", '', donText)
donText <- sub("Story Continued Below", '', donText)
hillText <- html_text(hillNode)
hillText <- sub("Hillary Clinton's speech at the Democratic National Convention, as prepared for delivery:", '', hillText)
hillText <- sub("Story Continued Below", '', hillText)
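As a small illustration of this kind of cleanup, here is a sketch on made-up paragraphs (the vector `paras` is hypothetical). It uses fixed = TRUE so the pattern is matched literally rather than as a regular expression, and it drops any paragraph that is left empty, which is an extra step I am assuming is useful rather than something the code above does.

```r
# Toy paragraphs (made up) containing the same kind of page residue
paras <- c("Story Continued Below",
           "We will win. Story Continued Below",
           "Jobs, jobs.")

# fixed = TRUE treats the pattern as a literal string, so "." is not a wildcard
cleaned <- sub("Story Continued Below", "", paras, fixed = TRUE)

# Trim leftover whitespace and drop paragraphs that are now empty
cleaned <- trimws(cleaned)
cleaned <- cleaned[nzchar(cleaned)]
cleaned
```

Note that sub() replaces only the first match in each element; gsub() would be the choice if a marker could appear more than once in a paragraph.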
If you end up with strange characters in your text, change the character encoding using the iconv() function.
The code below should do the trick.
donText <- iconv(donText, "latin1", "ASCII", "")

hillText <- iconv(hillText, "latin1", "ASCII", "")
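Here is a minimal sketch of what the conversion does, using a made-up string. I am assuming UTF-8 input for the example; the `from` argument should match whatever encoding your text actually has. With sub = "", characters that cannot be represented in ASCII are dropped instead of producing NA.

```r
# A made-up UTF-8 string with an accented letter and curly quotes
x <- "caf\u00e9 \u201cquoted\u201d text"

# sub = "" asks iconv() to drop anything it cannot represent in ASCII
ascii <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")

# Check that only printable ASCII characters remain
ascii_only <- !grepl("[^\\x20-\\x7E]", ascii, perl = TRUE)
ascii_only
```

Dropping characters is lossy, so it is worth eyeballing a few converted paragraphs before moving on.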
Then collapse the paragraphs of each speech into a single character string.
donText <- paste(donText, collapse = " ")

hillText <- paste(hillText, collapse = " ")
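A quick sketch of what collapsing does, on a made-up vector of paragraphs (`paras` and `oneString` are illustrative names). I use a single space as the separator here.

```r
# Three made-up paragraphs, one element each
paras <- c("First paragraph.", "Second paragraph.", "Third paragraph.")

# collapse joins all elements into one string with the given separator
oneString <- paste(paras, collapse = " ")

length(paras)      # number of elements before collapsing
length(oneString)  # number of elements after collapsing
oneString
```

Working with one string per speech keeps the later cleaning steps simple, since sentence splitting happens afterwards anyway.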
This is where the first ‘qdap’ function comes into play, qprep(). This function is a wrapper for a number of other
replacement functions and using it will speed pre-processing, but it should be used with caution if more detailed
analysis is required. The functions it passes through are as follows:
1. bracketX() - apply bracket removal
2. replace_abbreviation() - changes abbreviations
3. replace_number() - numbers to words e.g. 100 becomes one hundred
4. replace_symbol() - symbols become words e.g. @ becomes ‘at’
This chunk of code does the above and also replaces contractions, removes the top 100 stopwords and strips
the text of unwanted characters. Note that we will keep the period and the question marks to assist in sentence
creation.
donPrep <- qprep(donText)

hillPrep <- qprep(hillText)
donPrep <- replace_contraction(donPrep)

hillPrep <- replace_contraction(hillPrep)
donRm <- rm_stopwords(donPrep, Top100Words, separate = F)

hillRm <- rm_stopwords(hillPrep, Top100Words, separate = F)
donStrip <- strip(donRm, char.keep = c("?", "."))

hillStrip <- strip(hillRm, char.keep = c("?", "."))
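For intuition about what the stopword removal and stripping steps do, here is a rough base-R approximation on a made-up sentence. It is not how ‘qdap’ implements rm_stopwords() or strip(), and the stopword vector `stops` is a tiny hypothetical stand-in for Top100Words.

```r
# Tiny hypothetical stopword list (a stand-in for qdap's Top100Words)
stops <- c("we", "our", "to", "is", "the", "of", "and")

txt <- "We will lead our country back to safety. Is that clear?"

# Lower-case, then keep only letters, spaces, periods and question marks
low <- tolower(txt)
low <- gsub("[^a-z?. ]", "", low)

# Split into words and compare them to the stopword list,
# ignoring any end mark attached to the word
words <- strsplit(low, " +")[[1]]
bare  <- gsub("[?.]", "", words)
kept  <- words[!(bare %in% stops)]

result <- paste(kept, collapse = " ")
result
```

The end marks survive the cleaning, which is what later allows the text to be split into sentences.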
One of the things you can and should do is fill the spaces between words that belong together, such as a
person’s name, so that they stay together in the analysis. The ‘keep’ list below provides an example of this and
will be used in the space_fill() function. You could include several others.
It is also now time to put both speeches into one dataframe, consisting of the text for each respective
candidate.
keep <- c("United States", "Hillary Clinton", "Donald Trump", "middle class", "Supreme Court")
donFill <- data.frame(space_fill(donStrip, keep))
donFill$candidate <- "Trump"
colnames(donFill)[1] <- "text"
hillFill <- data.frame(space_fill(hillStrip, keep))
hillFill$candidate <- "Clinton"
colnames(hillFill)[1] <- "text"
df1 <- rbind(donFill, hillFill)
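The idea behind space filling can be sketched in base R. The helper fill_spaces() below is a hypothetical stand-in written for illustration, not the ‘qdap’ implementation; it uses the same "~~" filler that shows up in the sentence-split output further down.

```r
# Hypothetical helper: join the words of each phrase with a filler
fill_spaces <- function(x, phrases, sep = "~~") {
  for (p in phrases) {
    joined <- gsub(" ", sep, p, fixed = TRUE)  # "United States" -> "United~~States"
    x <- gsub(p, joined, x, fixed = TRUE)      # replace every occurrence in the text
  }
  x
}

keepToy <- c("United States", "Supreme Court")
filled  <- fill_spaces("accept nomination presidency United States.", keepToy)

# One row per speech, with a label for the speaker
toyDF <- data.frame(text = filled, candidate = "Trump", stringsAsFactors = FALSE)
toyDF
```

Because the phrase no longer contains a space, later word counts treat it as a single token.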
Critical to any analysis with the ‘qdap’ package is putting the text into sentences with the sentSplit() function. It
also creates the ‘tot’ variable, or ‘turn of talk’ index, which would be important for analyzing
the debates.
df2 <- sentSplit(df1, "text")
## Warning in sentSplit(df1, "text"): The following problems were detected:

## non character, missing ending punctuation, indicating incomplete
##
## *Consider running `check_text`
str(df2)
## Classes 'sent_split', 'qdap_df', 'sent_split_text_var:text' and 'data.frame': 660 obs. of 3 variables:
## $ candidate: chr "Trump" "Trump" "Trump" "Trump" ...
## $ tot : chr "1.1" "1.2" "1.3" "1.4" ...
## $ text : chr "friends delegates fellow americans humbly gratefully accept nomination presidency United~~States." "together lead our party back white house lead our country back safety prosperity peace." "country generosity warmth." "also country law order." ...
## - attr(*, "text.var")= chr "text"
## - attr(*, "qdap_df_text.var")= chr "text"
We’ve come to the point, I think, where stemming would normally be implemented, that is, reducing a word to its
root, e.g. stems, stemming, stemmer all become stem. However, I’m not necessarily a big fan of it anymore and
believe it should be applied judiciously. A number of highly experienced text miners have helped me correct the
error of my former auto-stemming ways. Also, ‘qdap’ has some flexibility in comparing stemmed text versus
non-stemmed text, as we shall soon see.
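If you want to see what a stemmer does before deciding, here is a small sketch using wordStem() from the ‘SnowballC’ package loaded earlier. The word list is made up. Results depend on the stemming algorithm, so it is worth inspecting the output rather than assuming every related form collapses to the same root.

```r
library(SnowballC)

# A few made-up words to inspect
words <- c("stems", "stemming", "stemmer", "families", "children")

# Snowball English stemmer; other languages are available
stems <- wordStem(words, language = "english")

# Side-by-side view of each word and its stem
data.frame(word = words, stem = stems)
```

A table like this makes it easy to spot stems that merge words you would rather keep apart, which is the main argument for applying stemming judiciously.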
3. Preliminary Analysis
I’ll start out with the standard word frequency analysis. As is usually the case with ‘qdap’, there are a number
of options to accomplish a task. On your own, have a look at the bag_o_words() and word_count() functions.
Here I create a data frame of the 25 most frequent terms by candidate and compare that data in a plot.
freq <- freq_terms(df2$text)

plot(freq)
donFreq <- df2[df2$candidate == "Trump", ]

donFreq <- freq_terms(donFreq$text)
hillFreq <- df2[df2$candidate == "Clinton", ]
hillFreq <- freq_terms(hillFreq$text)
# par(mfrow=c(1,2))
plot(donFreq)
plot(hillFreq)
No surprise that Trump hits “trade”, “violence”, “immigration” and “law”. Hillary likes to talk about “us” and “me”
(real shock there). Nothing about children or families?
You can create a word frequency matrix, which provides the counts for each word by speaker.
wordMat <- wfm(df2$text, df2$candidate)

wordMat[c(1:5, 350:354), ]
## Clinton Trump
## abandon 0 1
## abandoned 1 1
## able 2 2
## abolish 0 1
## abroad 1 2
## crosser 0 1
## crossings 0 1
## crucial 1 0
## crushed 0 1
## crying 0 1
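The same kind of word-by-speaker count can be sketched in base R with table(). This is only an illustration of the idea on a made-up data frame (`toy`), not a replacement for wfm().

```r
# Made-up sentences with a speaker label
toy <- data.frame(
  candidate = c("Trump", "Clinton", "Trump"),
  text = c("law order law", "children families", "trade law"),
  stringsAsFactors = FALSE
)

# Split each sentence into words, keeping track of who said them
tokens <- strsplit(toy$text, " +")
long <- data.frame(
  word = unlist(tokens),
  candidate = rep(toy$candidate, lengths(tokens)),
  stringsAsFactors = FALSE
)

# Rows are words, columns are speakers, cells are counts
toyMat <- table(long$word, long$candidate)
toyMat
```

Words a speaker never used show up as zeros, just as in the wfm() output above.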
Of course we need to include the obligatory word cloud. In this case, I will use stemmed words.
trans_cloud(df2$text, df2$candidate, stem = T, min.freq = 10)
There you have it, children and families now appear. Quite a heavy burden being engaged in what former
Assistant Director of the FBI, James Kallstrom, characterized as a criminal foundation AND caring for families
and children. Now that is leadership!
But I digress. A great function is ‘word_associate()’ and building word clouds based on that association. Let’s
give “terror” a try.
word_associate(df2$text, df2$candidate, match.string = "terror", wordcloud = T)
## row group unit text
## 1 6 Trump 6 attacks our police terrorism our cities threaten our very life.
## 2 82 Trump 82 plan begin safety home means safe neighborhoods secure borders protection terrorism.
## 3 114 Trump 114 task our new administration liberate our citizens crime terrorism lawlessness threatens communities.
## 4 133 Trump 133 once again france victim brutal islamic terrorism.
## 5 139 Trump 139 only weeks ago orlando florida forty nine wonderful americans savagely murdered islamic terrorist.
## 6 140 Trump 140 terrorist targeted our lgbt community.
## 7 142 Trump 142 protect us terrorism need focus three things.
## 8 145 Trump 145 instead must work our allies share our goal destroying isis stamping islamic terror.
## 9 147 Trump 147 lastly must immediately suspend immigration any nation compromised terrorism until such proven vetting mechanisms put place.
## 10 328 Clinton 328 work americans our allies fight terrorism.
## 11 596 Clinton 596 should working responsible gun owners pass common sense reforms keep guns hands criminals terrorists others us harm.
##
## Match Terms
## ===========
##
## List 1:
## terrorism, terrorist, terror, terrorists
##
No commentary needed as “res ipsa loquitur”.
Comprehensive word statistics are available. Here is a plot of the stats available in the package. The plot loses
some of its visual appeal with just two speakers, but it should stimulate your interest nonetheless. A complete
explanation of the stats is available under ?word_stats.
ws <- word_stats(df2$text, df2$candidate, rm.incomplete = T)
## Warning in end_inc(dataframe = DF, text.var = text.var, ...): 17 incomplete sentence items removed
plot(ws, label = T, lab.digits = 2)
## Warning: attributes are not identical across measure variables; they will
## be dropped
The breakdown in the count of sentences and words is interesting. Hillary used a hundred more sentences, but
only two hundred more words. I’m curious as to what questions they asked and how they incorporated them.
x1 <- question_type(df2$text, grouping.var = df2$candidate)

x1
## candidate tot.quest where does huh unknown

## 1 Clinton 18 0 1(5.56%) 2(11.11%) 15(83.33%)
## 2 Trump 7 3(42.86%) 1(14.29%) 0 3(42.86%)
truncdf(x1$raw)
## candidate raw.text n.row endmark strip.text q.type

## 1 Trump Our econom 38 ? our econo unknown
## 2 Trump Yet show? 46 ? yet show unknown
## 3 Trump After four 64 ? after fou unknown
## 4 Trump Every acti 131 ? every act does
## 5 Trump Where sanc 161 ? where san where
## 8 Clinton Stay true 313 ? stay true unknown
## 9 Clinton Really? 349 ? really unknown
## 10 Clinton Alone fix? 350 ? alone fix unknown
## 11 Clinton Forgetting 351 ? forgettin unknown
## 12 Clinton Know commu 365 ? know comm unknown
## 13 Clinton Lot looked 369 ? lot looke unknown
## 14 Clinton Big idea? 425 ? big idea unknown
## 15 Clinton Idea real? 427 ? idea real unknown
## 16 Clinton Know? 472 ? know unknown
## 17 Clinton ? 473 ? huh
## 18 Clinton ? 474 ? huh
## 19 Clinton Going done 534 ? going don unknown
## 20 Clinton Going brea 535 ? going bre unknown
## 21 Clinton Sales pitc 544 ? sales pit unknown
## 22 Clinton Put faith 545 ? put faith unknown
## 23 Clinton Ask yourse 579 ? ask yours does
## 24 Clinton Ask just s 598 ? ask just unknown
## 25 Clinton Offering? 625 ? offering unknown
OK, we’ve learned that rows 473 and 474 should be thrown out. It also looks like we have the classic use of
anaphora by Trump, which is the technique of repeating the first word or words of several consecutive
sentences. I think Churchill used it quite a bit, e.g. “We shall not flag or fail. We shall go on to the end. We shall
fight in France, we shall…”
df2[c(161:163), 3]
## [1] "where sanctuary kate steinle?"

## [2] "where sanctuary children mary ann sabine jamiel?"
## [3] "where sanctuary americans brutally murdered suffered horribly?"
df2[c(473:474), 3]
## [1] "?" "?"
df2 <- df2[c(-473,-474), ]
4. Advanced Analysis
This is where it gets fun with ‘qdap’. You can tag the text by parts of speech. Check out ?pos and have a look
at the vignette for further explanation: https://trinker.github.io/qdap/vignettes/qdap_vignette.html
Be advised that this takes some time, which you can track with a progress bar. Notice Clinton’s use and
Trump’s lack of use of interjections.
posbydf <- pos_by(df2$text, grouping.var = df2$candidate)

names(posbydf)
## [1] "text" "POStagged" "POSprop" "POSfreq"

## [5] "POSrnp" "percent" "zero.replace" "pos.by.freq"
## [9] "pos.by.prop" "pos.by.rnp"
plot(posbydf, values = T, digits = 2)
Readability scores (measures of speech complexity) are available. I won’t go into the details as I discuss this in
my book and detailed information is in the ‘qdap’ vignette.
automated_readability_index(df2$text, df2$candidate)
## candidate word.count sentence.count character.count Automated_Readability_Index

## 1 Clinton 2636 391 15155 9.020
## 2 Trump 2349 267 14616 12.276
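The index can be reproduced by hand from the counts in the output above. A minimal sketch, assuming the standard Automated Readability Index formula, 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43. The helper name ari() and the data frame `counts` are mine, not part of ‘qdap’.

```r
# Standard ARI formula (helper name is illustrative)
ari <- function(chars, words, sentences) {
  4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43
}

# Counts copied from the automated_readability_index() output above
counts <- data.frame(
  candidate       = c("Clinton", "Trump"),
  word.count      = c(2636, 2349),
  sentence.count  = c(391, 267),
  character.count = c(15155, 14616)
)

counts$ari <- round(
  ari(counts$character.count, counts$word.count, counts$sentence.count), 3
)
counts[, c("candidate", "ari")]
```

The gap between the two scores comes mostly from sentence length: Trump averages close to nine words per sentence after cleaning, Clinton fewer than seven.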
Diversity stats are a measure of language “richness”, or rather, how expansive a speaker’s vocabulary is. The
results indicate similar use of vocabulary, certainly not unusual given the assistance of professional
speechwriters.
diversity(df2$text, df2$candidate)
## candidate wc simpson shannon collision berger_parker brillouin

## 1 Clinton 2636 0.997 6.609 5.842 0.028 6.060
## 2 Trump 2349 0.997 6.613 5.708 0.040 6.032
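To get a feel for what some of these columns measure, here is a sketch on a made-up four-word text. I am assuming the usual textbook definitions of the Shannon, Simpson and Berger-Parker indices; the ‘qdap’ implementation may differ in detail, so treat this as an illustration rather than a reproduction of diversity().

```r
# Made-up text: one word used twice, two words used once
words <- c("jobs", "jobs", "trade", "law")

n <- as.numeric(table(words))  # count per distinct word
N <- sum(n)                    # total number of words
p <- n / N                     # share of each distinct word

shannon       <- -sum(p * log(p))                      # higher = more even usage
simpson       <- 1 - sum(n * (n - 1)) / (N * (N - 1))  # chance two draws differ
berger_parker <- max(n) / N                            # share of the top word

c(shannon = shannon, simpson = simpson, berger_parker = berger_parker)
```

Read this way, a Berger-Parker value of a few percent, as in the output above, means that no single word dominates either speech.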
Formality contextualizes the text by comparing formal parts of speech (noun, adjective, preposition and article)
versus contextual parts of speech (pronoun, verb, adverb, interjection). A plot for analysis is available. Scores
closer to 100 are more formal and those closer to 0 are more contextual.
form <- formality(df2$text, df2$candidate)

form
## candidate word.count formality

## 1 Trump 2363 66.55
## 2 Clinton 2651 60.68
plot(form)
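The arithmetic behind a formality score can be sketched as follows. I am assuming the commonly cited F-measure, 50 * ((formal - contextual) / total + 1); the helper f_score() and the counts passed to it are hypothetical, and ‘qdap’ may count the total differently (see ?formality).

```r
# Hypothetical helper implementing the assumed F-measure
f_score <- function(n_formal, n_contextual,
                    n_total = n_formal + n_contextual) {
  50 * ((n_formal - n_contextual) / n_total + 1)
}

mixed      <- f_score(60, 40)   # more formal than contextual words
contextual <- f_score(0, 100)   # only contextual words
formal     <- f_score(100, 0)   # only formal words

c(mixed = mixed, contextual = contextual, formal = formal)
```

Under this definition a text with equal numbers of formal and contextual words scores 50, so both speeches sit on the formal side of the scale.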
Polarity measures sentence sentiment. A plot is available. What we see is that, on average, Trump was slightly
more negative.
pol <- polarity(df2$text, df2$candidate)

plot(pol)
## Warning: `show_guide` has been deprecated. Please use `show.legend` instead.
## Warning: `show_guide` has been deprecated. Please use `show.legend` instead.
The lexical dispersion plot allows one to see where a word occurs throughout the text. It is an interesting way
to see how topics change over time. Note that you can also include freq_terms should you so choose.
dispersion_plot(df2$text, c("immigration", "jobs", "trade", "children"), df2$candidate)
Finally, an example of a gradient wordcloud, which produces one wordcloud colored by a binary grouping
variable. Let’s do one with words not stemmed and one with stemming included.
gradient_cloud(df2$text, df2$candidate, min.freq = 12, stem = F)
gradient_cloud(df2$text, df2$candidate, min.freq = 15, stem = T)
There you have it. Now go find text data, manipulate text data, analyze text data and make text-mining great
again.
