2. Libraries Used
> library(tm)
> library(SnowballC)
> library(wordcloud)
> library(rpart)
> library(rpart.plot)
> library(randomForest)
> library(NLP)
> library(RColorBrewer)
4. Data Cleaning
> # Strip extra whitespace, if any; on inspection it is usually not present.
> corpus = tm_map(corpus, stripWhitespace)
Warning message:
In tm_map.SimpleCorpus(corpus, stripWhitespace) :
transformation drops documents
> # The description text appears to be grammatically correct English, so stemming could help.
> corpus = tm_map(corpus, stemDocument)
Warning message:
In tm_map.SimpleCorpus(corpus, stemDocument) :
transformation drops documents
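The two cleaning steps above can be sketched end to end on a tiny corpus. This is a minimal illustration, not the original pipeline: the `texts` vector is a hypothetical stand-in for the description column, and the lowercasing step is an added assumption. The corpus is built with `VCorpus` here; since the warning shown above is raised from `tm_map.SimpleCorpus`, going through a `VCorpus` should avoid that particular message.

```r
library(tm)         # corpus handling and tm_map transformations
library(SnowballC)  # stemmer used by stemDocument

# Hypothetical stand-in for the description column of the data set.
texts <- c("Running   shoes for  runners", "A better  way to cook meals")

# VCorpus rather than SimpleCorpus (an assumption, see the note above).
corpus <- VCorpus(VectorSource(texts))
corpus <- tm_map(corpus, content_transformer(tolower))  # assumed extra step
corpus <- tm_map(corpus, stripWhitespace)               # collapse repeated spaces
corpus <- tm_map(corpus, stemDocument)                  # reduce words to stems

# Pull the cleaned text back out as a character vector for inspection.
cleaned <- sapply(corpus, as.character)
print(cleaned)
```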
5. Word Cloud After Cleaning
> # To work with the words seen in the word cloud, a document-term matrix (DTM) has to be built so that each word becomes a numeric column the models can use.
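As a rough sketch of that step, a DTM can be built with `DocumentTermMatrix` and converted to a data frame for modelling. The `texts` vector, the 0.99 sparsity threshold, and the name `description_demo` are illustrative assumptions; the original code that produced `description` is not shown in this section.

```r
library(tm)

# Hypothetical, already-cleaned descriptions.
texts <- c("shoe runner", "cook meal fast", "runner cook")
corpus <- VCorpus(VectorSource(texts))

# One row per document, one column per term, cells hold term counts.
dtm <- DocumentTermMatrix(corpus)

# Drop very rare terms; the 0.99 threshold is only illustrative.
sparse <- removeSparseTerms(dtm, 0.99)

# Convert to a data frame so rpart, glm and randomForest can use it.
description_demo <- as.data.frame(as.matrix(sparse))
colnames(description_demo) <- make.names(colnames(description_demo))
print(description_demo)
```

The dependent variable (`deal` in the models that follow) would then be added to this data frame as an extra column.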
7. CART, Random Forest, Logistic regression on Data with dependent variable as Deal [BEFORE RATIO]
CART:
> library(rpart)
> library(rpart.plot)
> SharkdataCART = rpart(deal ~ ., data=description, method="class")
CART Diagram:
Prediction:
> #prediction using this model
> predictCART = predict(SharkdataCART, newdata=description, type="class")
> CART <- table(description$deal, predictCART)
Accuracy:
> AccuracyCart = sum(diag(CART))/sum(CART)
> AccuracyCart
[1] 0.6808081
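The accuracy formula used above, correct predictions on the diagonal of the confusion matrix divided by all predictions, can be checked on a small hand-made example. The vectors below are made up for illustration. Fixing the factor levels is an added precaution so the table stays square even if a model never predicts one of the classes; the majority-class share is included only as a reference point for reading an accuracy value.

```r
# Made-up labels and predictions, 8 cases.
actual    <- c(0, 0, 0, 1, 1, 1, 1, 0)
predicted <- c(0, 1, 0, 1, 1, 0, 1, 0)

# Fixed levels keep the table 2 x 2 whatever the predictions are.
lv <- c(0, 1)
cm <- table(actual    = factor(actual, levels = lv),
            predicted = factor(predicted, levels = lv))

# Diagonal cells are the correct predictions.
accuracy <- sum(diag(cm)) / sum(cm)

# Share of the most common class: what always guessing it would score.
baseline <- max(table(actual)) / length(actual)

print(accuracy)  # 0.75
print(baseline)  # 0.5
```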
Logistic Regression:
> set.seed(123)
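> # family = binomial makes this a logistic regression; glm() defaults to gaussian.
> SharkDataLogistic = glm(deal ~ ., data = description, family = binomial)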
Prediction:
Performance Analysis:
> # performance analysis
> LogisticRegression <- table(description$deal, predictLogistic > 0.5)
Accuracy:
> #Accuracy
> AccuracyLogisticRegression = sum(diag(LogisticRegression))/sum(LogisticRegression)
> AccuracyLogisticRegression
[1] 0.6747475
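The prediction step for the logistic model is not shown in this section, so the following is only a sketch of how fitting, predicting, and thresholding at 0.5 might fit together. It uses simulated data: `demo`, `x1`, and `x2` are hypothetical names, not columns of the original data set. With `family = binomial`, `predict` needs `type = "response"` to return probabilities instead of log-odds.

```r
set.seed(123)

# Simulated stand-in data: two numeric predictors and a 0/1 outcome.
n <- 200
demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
demo$deal <- rbinom(n, 1, plogis(0.8 * demo$x1 - 0.5 * demo$x2))

# Logistic regression: binomial family with the default logit link.
fit <- glm(deal ~ ., data = demo, family = binomial)

# type = "response" returns probabilities in [0, 1].
prob <- predict(fit, newdata = demo, type = "response")

# Classify as a deal when the predicted probability exceeds 0.5.
cm <- table(actual = demo$deal, predicted = prob > 0.5)
accuracy <- sum(diag(cm)) / sum(cm)
print(cm)
print(accuracy)
```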
Random Forest:
> library(randomForest)
> set.seed(123)
> SharkDataRF = randomForest(deal ~ ., data=description)
Prediction:
Performance Analysis:
Accuracy:
> # Accuracy
> AccuracyRF = sum(diag(RandomForest))/sum(RandomForest)
> AccuracyRF
[1] 0.5575758
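The code that builds the `RandomForest` table is missing from this section. A minimal sketch of what such a step could look like follows, again on simulated data with hypothetical names (`demo`, `x1`, `x2`). It assumes the outcome is stored as a factor so that `randomForest` fits a classification forest; the original may have kept `deal` numeric and thresholded the predictions instead. Calling `predict` without `newdata` returns out-of-bag predictions.

```r
library(randomForest)
set.seed(123)

# Simulated stand-in data; a factor outcome gives a classification forest.
n <- 200
demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
demo$deal <- factor(rbinom(n, 1, plogis(0.8 * demo$x1 - 0.5 * demo$x2)))

rf <- randomForest(deal ~ ., data = demo)

# Without newdata, predict() returns out-of-bag class predictions.
pred_oob <- predict(rf)

cm <- table(actual = demo$deal, predicted = pred_oob)
accuracy <- sum(diag(cm)) / sum(cm)
print(cm)
print(accuracy)

# Numeric variable importance; varImpPlot(rf) draws the same information.
imp <- importance(rf)
print(imp)
```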
Variable Importance Plot:
9. CART, Random Forest, Logistic regression on Data with dependent variable as Deal and after adding one more variable "ratio" [AFTER RATIO]
CART
CART Diagram:
> prp(SharktDataCartRatio, extra=2)
Performance:
Accuracy:
> AccuracyAfterRatioCart = sum(diag(CARTAfterRatio))/sum(CARTAfterRatio)
> AccuracyAfterRatioCart
[1] 0.6809080
Logistic Regression
Prediction:
> predictLogisticRatio = predict(SharkDatalogisticRatio, newdata=description, type="response")
Performance Analysis:
Random Forest:
Prediction:
Performance Analysis:
> RandomForestRatio <- table(description$deal, predictRFRatio >= 0.5)
Accuracy:
VarImpPlot:
All three models improved slightly in prediction accuracy after adding ratio, with CART improving the least in my model.
Logistic regression seems to be the most accurate of the three. Adding more variables might boost accuracy further, perhaps close to 80 percent, which could make it the best model.