
An Introduction to Data Mining with R

Yanchang Zhao
http://www.RDataMining.com

6 September 2013

Questions

Do you know data mining and techniques for it?
Have you used R before?
Have you used R in your data mining research or projects?

Outline
Introduction
Classification with R
Clustering with R
Association Rule Mining with R
Text Mining with R
Time Series Analysis with R
Social Network Analysis with R
R and Hadoop
Online Resources

What is R?
R is a free software environment for statistical computing and graphics (http://www.r-project.org/). It can be easily extended with the 4,728 packages available on CRAN (http://cran.r-project.org/) as of 6 September 2013, and many more packages are provided on Bioconductor (http://www.bioconductor.org/), R-Forge (http://r-forge.r-project.org/), GitHub (https://github.com/), etc. R manuals are available on CRAN (http://cran.r-project.org/manuals.html):

An Introduction to R
The R Language Definition
R Data Import/Export
...

Why R?

R is widely used in both academia and industry. R was ranked no. 1 again in the KDnuggets 2013 poll on Top Languages for analytics, data mining and data science (http://www.kdnuggets.com/2013/08/languages-for-analytics-data-mining-data-science.html). The CRAN Task Views (http://cran.r-project.org/web/views/) provide collections of packages for different tasks:

Machine learning & statistical learning
Cluster analysis & finite mixture models
Time series analysis
Multivariate statistics
Analysis of spatial data
...


Classification with R

Decision trees: rpart, party
Random forest: randomForest, party
SVM: e1071, kernlab
Neural networks: nnet, neuralnet, RSNNS
Performance evaluation: ROCR
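ROCR is listed above but not demonstrated in the slides that follow. As a hedged sketch (the scores and labels below are made up for illustration, not taken from the iris example), a typical ROCR evaluation builds a prediction object from scores and true labels, then derives an ROC curve and the AUC:

```r
# a minimal ROCR sketch with hypothetical binary-classifier output
library(ROCR)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)  # predicted probabilities
labels <- c(1, 1, 1, 0, 1, 0, 0, 0)                  # true classes
pred <- prediction(scores, labels)
perf <- performance(pred, "tpr", "fpr")  # ROC curve: TPR vs FPR
plot(perf)
auc <- performance(pred, "auc")@y.values[[1]]
auc  # area under the ROC curve
```

The same pattern applies to real classifier output, e.g. class probabilities from predict() on a test set.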

The Iris Dataset


# iris data
str(iris)

## 'data.frame': 150 obs. of 5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: ...

# split into training and test datasets
set.seed(1234)
ind <- sample(2, nrow(iris), replace=T, prob=c(0.7, 0.3))
iris.train <- iris[ind==1, ]
iris.test <- iris[ind==2, ]

Build a Decision Tree

# build a decision tree
library(party)
iris.formula <- Species ~ Sepal.Length + Sepal.Width +
                Petal.Length + Petal.Width
iris.ctree <- ctree(iris.formula, data=iris.train)

plot(iris.ctree)

[Figure: the fitted decision tree. It splits first on Petal.Length (<= 1.9 vs > 1.9), then on Petal.Width (<= 1.7 vs > 1.7) and Petal.Length (<= 4.4 vs > 4.4); each terminal node shows a barplot of class proportions (n = 40, 21, 19 and 32).]

Prediction

# predict on test data
pred <- predict(iris.ctree, newdata = iris.test)
# check prediction result
table(pred, iris.test$Species)

##
## pred         setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         12         2
##   virginica       0          0        14
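The confusion matrix above can be summarised into a single accuracy figure with base R. A small sketch, rebuilding the matrix from the counts printed on this slide (rows are predictions, columns are true species):

```r
# confusion matrix as printed above (rows: predicted, columns: actual)
tab <- matrix(c(10,  0,  0,
                 0, 12,  2,
                 0,  0, 14),
              nrow = 3, byrow = TRUE,
              dimnames = list(pred   = c("setosa", "versicolor", "virginica"),
                              actual = c("setosa", "versicolor", "virginica")))
# accuracy = correct predictions (diagonal) / all predictions
accuracy <- sum(diag(tab)) / sum(tab)
accuracy  # 36 correct out of 38 test cases
```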


Clustering with R

k-means: kmeans(), kmeansruns()
k-medoids: pam(), pamk()
Hierarchical clustering: hclust(), agnes(), diana()
DBSCAN: fpc
BIRCH: birch

Names followed by () are functions; the others are packages. (kmeansruns() and pamk() are provided by package fpc.)
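Of the functions listed, hclust() ships with base R, so hierarchical clustering needs no extra package. A minimal sketch on the iris data (the sample of 40 rows is an arbitrary choice to keep the dendrogram readable, not from the slides):

```r
# hierarchical clustering of a sample of the iris data
set.seed(2835)
idx <- sample(1:nrow(iris), 40)
iris3 <- iris[idx, -5]                     # drop the Species column
hc <- hclust(dist(iris3), method = "ave")  # average linkage
plot(hc, labels = iris$Species[idx])
# cut the dendrogram into 3 clusters and compare with the species
groups <- cutree(hc, k = 3)
table(groups, iris$Species[idx])
```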

k-means Clustering

set.seed(8953)
iris2 <- iris
# remove class IDs
iris2$Species <- NULL
# k-means clustering
iris.kmeans <- kmeans(iris2, 3)
# check result
table(iris$Species, iris.kmeans$cluster)

##
##               1  2  3
##   setosa      0 50  0
##   versicolor  2  0 48
##   virginica  36  0 14

# plot clusters and their centers
plot(iris2[c("Sepal.Length", "Sepal.Width")], col=iris.kmeans$cluster)
points(iris.kmeans$centers[, c("Sepal.Length", "Sepal.Width")],
       col=1:3, pch="*", cex=5)

[Figure: scatter plot of Sepal.Length vs Sepal.Width, points coloured by cluster, with the three cluster centers marked by large "*" symbols.]

Density-based Clustering

library(fpc)
iris2 <- iris[-5] # remove class IDs
# DBSCAN clustering
ds <- dbscan(iris2, eps = 0.42, MinPts = 5)
# compare clusters with original class IDs
table(ds$cluster, iris$Species)

##
##     setosa versicolor virginica
##   0      2         10        17
##   1     48          0         0
##   2      0         37         0
##   3      0          3        33

# 1-3: clusters; 0: outliers or noise
plotcluster(iris2, ds$cluster)

[Figure: discriminant-coordinate projection of the clustering; points are labelled 1-3 by cluster, with 0 marking outliers/noise.]


Association Rule Mining with R

Association rules: apriori(), eclat() in package arules
Sequential patterns: arulesSequences
Visualisation of associations: arulesViz

The Titanic Dataset


load("./data/titanic.raw.rdata")
dim(titanic.raw)
## [1] 2201    4

idx <- sample(1:nrow(titanic.raw), 8)
titanic.raw[idx, ]
##      Class    Sex   Age Survived
## 501    3rd   Male Adult       No
## 477    3rd   Male Adult       No
## 674    3rd   Male Adult       No
## 766   Crew   Male Adult       No
## 1485   3rd Female Adult       No
## 1388   2nd Female Adult       No
## 448    3rd   Male Adult       No
## 590    3rd   Male Adult       No

Association Rule Mining

# find association rules with the APRIORI algorithm
library(arules)
rules <- apriori(titanic.raw, control=list(verbose=F),
                 parameter=list(minlen=2, supp=0.005, conf=0.8),
                 appearance=list(rhs=c("Survived=No", "Survived=Yes"),
                                 default="lhs"))
# sort rules
quality(rules) <- round(quality(rules), digits=3)
rules.sorted <- sort(rules, by="lift")
# have a look at rules
# inspect(rules.sorted)
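To make the thresholds above concrete: for a rule X => Y, support = P(X and Y), confidence = P(Y | X), and lift = confidence / P(Y). A small hand computation in base R for the rule {Class=2nd, Age=Child} => {Survived=Yes}, using counts from the standard Titanic table (2,201 passengers, 711 survivors, 24 second-class children, all of whom survived — treat these counts as an assumption if your copy of the data differs):

```r
# support, confidence and lift computed from raw counts
n      <- 2201  # total passengers
n.rhs  <- 711   # Survived=Yes
n.lhs  <- 24    # Class=2nd & Age=Child
n.both <- 24    # Class=2nd & Age=Child & Survived=Yes
support    <- n.both / n               # P(lhs and rhs)
confidence <- n.both / n.lhs           # P(rhs | lhs)
lift       <- confidence / (n.rhs / n) # confidence relative to base rate
round(c(support, confidence, lift), 3)  # 0.011 1.000 3.096
```

These match the quality measures reported by apriori() for this rule on the next slide.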

#    lhs                                  rhs             support confidence  lift
# 1  {Class=2nd,Age=Child}             => {Survived=Yes}    0.011      1.000 3.096
# 2  {Class=2nd,Sex=Female,Age=Child}  => {Survived=Yes}    0.006      1.000 3.096
# 3  {Class=1st,Sex=Female}            => {Survived=Yes}    0.064      0.972 3.010
# 4  {Class=1st,Sex=Female,Age=Adult}  => {Survived=Yes}    0.064      0.972 3.010
# 5  {Class=2nd,Sex=Male,Age=Adult}    => {Survived=No}     0.070      0.917 1.354
# 6  {Class=2nd,Sex=Female}            => {Survived=Yes}    0.042      0.877 2.716
# 7  {Class=Crew,Sex=Female}           => {Survived=Yes}    0.009      0.870 2.692
# 8  {Class=Crew,Sex=Female,Age=Adult} => {Survived=Yes}    0.009      0.870 2.692
# 9  {Class=2nd,Sex=Male}              => {Survived=No}     0.070      0.860 1.271
# 10 {Class=2nd, ...

library(arulesViz)
plot(rules, method = "graph")

[Figure: graph for 12 rules; vertex width encodes support (0.006-0.192), colour encodes lift (1.222-3.096). Itemsets such as {Class=2nd,Age=Child}, {Class=1st,Sex=Female} and {Class=Crew,Sex=Female} connect to {Survived=Yes}, while {Class=3rd,Sex=Male} and {Class=2nd,Sex=Male,Age=Adult} connect to {Survived=No}.]


Text Mining with R

Text mining: tm
Topic modelling: topicmodels, lda
Word cloud: wordcloud
Twitter data access: twitteR

Fetch Twitter Data


## retrieve tweets from the user timeline of @rdatamining
library(twitteR)
# tweets <- userTimeline('rdatamining')
load(file = "./data/rdmTweets.RData")
(nDocs <- length(tweets))
## [1] 320

strwrap(tweets[[320]]$text, width = 50)
## [1] "An R Reference Card for Data Mining is now"
## [2] "available on CRAN. It lists many useful R"
## [3] "functions and packages for data mining"
## [4] "applications."

# convert tweets to a data frame
df <- do.call("rbind", lapply(tweets, as.data.frame))

Text Cleaning
library(tm)
# build a corpus
myCorpus <- Corpus(VectorSource(df$text))
# convert to lower case
myCorpus <- tm_map(myCorpus, tolower)
# remove punctuation & numbers
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
# remove URLs
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
myCorpus <- tm_map(myCorpus, removeURL)
# keep 'r' and 'big' by removing them from the stopword list
myStopwords <- setdiff(stopwords("english"), c("r", "big"))
# remove stopwords
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

Stemming
# keep a copy of corpus
myCorpusCopy <- myCorpus
# stem words
myCorpus <- tm_map(myCorpus, stemDocument)
# stem completion
myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = myCorpusCopy)
# replace "miners" with "mining", because "mining" was
# first stemmed to "mine" and then completed to "miners"
myCorpus <- tm_map(myCorpus, gsub, pattern="miners", replacement="mining")

strwrap(myCorpus[320], width=50)
## [1] "r reference card data mining now available cran"
## [2] "list used r functions package data mining"
## [3] "applications"

Frequent Terms

myTdm <- TermDocumentMatrix(myCorpus,
                            control=list(wordLengths=c(1, Inf)))
# inspect frequent words
(freq.terms <- findFreqTerms(myTdm, lowfreq=20))

##  [1] "analysis"     "big"      "computing"
##  [4] "data"         "examples" "mining"
##  [7] "network"      "package"  "position"
## [10] "postdoctoral" "r"        "research"
## [13] "slides"       "social"   "tutorial"
## [16] "university"   "used"

Associations

# which words are associated with 'r'?
findAssocs(myTdm, "r", 0.2)
## examples     code  package
##     0.32     0.29     0.20

# which words are associated with 'mining'?
findAssocs(myTdm, "mining", 0.25)
##           data         mahout recommendation           sets
##           0.47           0.30           0.30           0.30
##       supports       frequent        itemset
##           0.30           0.26           0.26

Network of Terms
library(graph)
library(Rgraphviz)
plot(myTdm, term=freq.terms, corThreshold=0.1, weighting=T)

[Figure: network of the frequent terms (data, mining, analysis, r, package, research, position, postdoctoral, university, tutorial, social, network, used, big, examples, computing, slides), with edges linking correlated terms.]

Word Cloud
library(wordcloud)
m <- as.matrix(myTdm)
freq <- sort(rowSums(m), decreasing=T)
wordcloud(words=names(freq), freq=freq, min.freq=4, random.order=F)

[Figure: word cloud of the tweet corpus, dominated by data, mining, r, analysis, package, research, position, slides, used and big, surrounded by many smaller terms.]

Topic Modelling
library(topicmodels)
set.seed(123)
myLda <- LDA(as.DocumentTermMatrix(myTdm), k=8)
terms(myLda, 5)

##      Topic 1       Topic 2    Topic 3  Topic 4
## [1,] "data"        "r"        "r"      "research"
## [2,] "mining"      "package"  "time"   "position"
## [3,] "big"         "examples" "series" "data"
## [4,] "association" "used"     "users"  "university"
## [5,] "rules"       "code"     "talk"   "postdoctoral"
##      Topic 5     Topic 6     Topic 7    Topic 8
## [1,] "mining"    "group"     "data"     "analysis"
## [2,] "data"      "data"      "r"        "network"
## [3,] "slides"    "used"      "mining"   "social"
## [4,] "modelling" "software"  "analysis" "text"
## [5,] "tools"     "kdnuggets" "book"     "slides"


Time Series Analysis with R

Time series decomposition: decomp(), decompose(), arima(), stl()
Time series forecasting: forecast
Time series clustering: TSclust
Dynamic Time Warping (DTW): dtw
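None of these is demonstrated in the slides, so here is a minimal sketch using only base R: seasonal-trend decomposition of the built-in AirPassengers series with stl(). (The log transform and s.window="periodic" are common choices for this series, not settings taken from the slides.)

```r
# seasonal-trend decomposition of the classic AirPassengers series
f <- stl(log(AirPassengers), s.window = "periodic")
plot(f)  # panels: data, seasonal, trend, remainder
# by construction the three components sum back to the (log) series
resid.check <- rowSums(f$time.series) - as.numeric(log(AirPassengers))
max(abs(resid.check))  # essentially zero
```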


Social Network Analysis with R

Packages: igraph, sna
Centrality measures: degree(), betweenness(), closeness(), transitivity()
Clusters: clusters(), no.clusters()
Cliques: cliques(), largest.cliques(), maximal.cliques(), clique.number()
Community detection: fastgreedy.community(), spinglass.community()
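A small sketch of the igraph functions listed above, on a toy network (the five-vertex edge list below is made up for illustration):

```r
library(igraph)
# a toy undirected network: a triangle A-B-C with a tail C-D-E
g <- graph.formula(A-B, A-C, B-C, C-D, D-E)
degree(g)        # degree centrality of each vertex
betweenness(g)   # betweenness centrality
transitivity(g)  # global clustering coefficient
no.clusters(g)   # number of connected components
```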


R and Hadoop
Packages: RHadoop, RHive

RHadoop (https://github.com/RevolutionAnalytics/RHadoop/wiki) is a collection of 3 R packages:

rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster
rhdfs - connect to the Hadoop Distributed File System (HDFS)
rhbase - connect to the NoSQL HBase database

You can play with it on a single PC (in standalone or pseudo-distributed mode), and code developed there will also work on a cluster of PCs (in fully-distributed mode)!

Step by step to set up my first R Hadoop system:
http://www.rdatamining.com/tutorials/rhadoop

An Example of MapReduce with R


library(rmr2)
map <- function(k, lines) {
  words.list <- strsplit(lines, "\\s")
  words <- unlist(words.list)
  return(keyval(words, 1))
}
reduce <- function(word, counts) {
  keyval(word, sum(counts))
}
wordcount <- function(input, output = NULL) {
  mapreduce(input = input, output = output,
            input.format = "text",
            map = map, reduce = reduce)
}
## Submit job
out <- wordcount(in.file.path, out.file.path)

From Jeffrey Breen's presentation on Using R with Hadoop:
http://www.revolutionanalytics.com/news-events/free-webinars/2013/using-r-with-hadoop/
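For intuition, the same word count can be sketched in plain R without Hadoop: the "map" step splits each line into (word, 1) pairs and the "reduce" step sums the counts per word (the sample lines below are made up):

```r
lines <- c("data mining with r", "r and hadoop", "mining big data")
# "map": split each line into words, one count of 1 per word
words <- unlist(strsplit(lines, "\\s"))
# "reduce": sum the counts for each distinct word
counts <- tapply(rep(1, length(words)), words, sum)
counts  # e.g. data=2, mining=2, r=2, with=1, ...
```

rmr2 distributes exactly this pattern: the map function runs on chunks of the input across the cluster, and Hadoop groups the emitted key-value pairs by key before the reduce function sums them.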


Online Resources
RDataMining website: http://www.rdatamining.com

  R Reference Card for Data Mining
  R and Data Mining: Examples and Case Studies

RDataMining Group on LinkedIn (3100+ members): http://group.rdatamining.com

RDataMining on Twitter (1200+ followers): http://twitter.com/rdatamining

Free online courses: http://www.rdatamining.com/resources/courses

Online documents: http://www.rdatamining.com/resources/onlinedocs

The End

Thanks! Email: yanchang(at)rdatamining.com

