
Source: https://rpubs.com/sgeletta/95577



Text Mining With R and the "tm" Package by Simon
Last updated almost 2 years ago

Data Science Capstone

Simon Geletta
Saturday, July 25, 2015

Milestone Report
Introduction and Objectives
The main goal of this report is to demonstrate the competency achieved in working with unstructured
data to produce a structured set of records that can then be used for statistical modeling. The first
step in any such task is to learn, as far as possible, what the raw data (the document corpus)
contain and to separate the useful information from the not-so-useful. Because running the code
while preparing this document for publication on RPubs took an unreasonably long time, I am forced
to base this report on a 10% sample of the data that was provided. The idea is to offer this as
evidence of what I will do with the entire data set at the end of the capstone project.
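
The report does not show how the 10% sample was drawn. As a minimal sketch, assuming a simple random sample of lines was taken from each file, it could look like the following. The helper name sample_lines(), its arguments, and the seed value are hypothetical, and a small temporary file stands in for the real data.

```r
# Minimal sketch: keep a random fraction of the lines of a text file.
# sample_lines(), fraction, and seed are illustrative names, not from the report.
sample_lines <- function(path, fraction = 0.10, seed = 1234) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  set.seed(seed)                               # fixed seed: same sample on every run
  n_keep <- round(length(lines) * fraction)    # number of lines to retain
  keep <- sort(sample.int(length(lines), size = n_keep))  # preserve original order
  lines[keep]
}

# Demonstration on a temporary 200-line file standing in for a corpus file
demo_path <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:200), demo_path)
demo_sample <- sample_lines(demo_path)
length(demo_sample)
```

Fixing the seed makes the sample reproducible, so results computed on the subset can be regenerated when the document is re-knit.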

Methods
The first task is to download the raw resources used for the analytics tasks, chiefly the three data
sources en_US.blogs.txt, en_US.news.txt, and en_US.tweets.txt. In addition, a list of bad/profane
words was obtained, to be used later to exclude those words from the analysis. The raw data were
downloaded in compressed form from
http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
and uncompressed locally. The bad/profane words were downloaded from
https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en
and stored locally as en_bws.txt. The following chunk of code shows how the files were acquired.

## download and unzip the capstone data set, unless the archive is already present
dtsrc <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("coursera-swiftkey.zip")) {
  download.file(dtsrc, destfile = "coursera-swiftkey.zip")
  unzip("coursera-swiftkey.zip")
}
## list of bad/profane words download from github
bwsrc1<-"https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscen
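
The text above says the bad-word list is later used to exclude words from the analysis, but the code for that step is not part of this excerpt. A minimal base-R sketch of one way to do it, assuming one entry per line in en_bws.txt and simple whitespace tokenization, might be the following. The helper remove_listed_words() is hypothetical, and two placeholder entries in a temporary file stand in for the real list.

```r
# Minimal sketch: drop listed words from lower-cased, whitespace-split text.
# remove_listed_words() is an illustrative helper, not the report's own code.
remove_listed_words <- function(text, bad_words) {
  tokens <- strsplit(tolower(text), "[[:space:]]+")
  cleaned <- lapply(tokens, function(tok) {
    tok[nzchar(tok) & !(tok %in% bad_words)]   # keep non-empty, non-listed tokens
  })
  vapply(cleaned, paste, character(1), collapse = " ")
}

# Placeholder list standing in for en_bws.txt (assumed: one entry per line)
bw_path <- tempfile(fileext = ".txt")
writeLines(c("darn", "heck"), bw_path)
bad_words <- tolower(readLines(bw_path, warn = FALSE))

cleaned <- remove_listed_words(c("Well darn it", "What the heck"), bad_words)
cleaned
```

This sketch matches single tokens only. Multi-word entries in the list would need phrase matching, and punctuation attached to a word would prevent a match unless it is stripped first.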

