
Project Approach: [Index]

1. Import the dataset (.csv).
2. Import the libraries to be used.
3. Make a corpus from the Description column.
4. Clean the data.
5. Wordcloud.
6. Build the Document-Term Matrix.
7. Build the CART tree, the variable importance diagram and accuracy figures through Random Forest and Logistic Regression. [BEFORE]
8. Add Ratio as a variable.
9. Build the CART tree, the variable importance diagram and accuracy figures through Random Forest and Logistic Regression after adding the Ratio variable. [AFTER]
10. Conclusion.

1. Import Dataset.

> Sharkdata = read.csv("C:/Users/uidk3333/Desktop/Data Science/Dataset.csv")
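
A quick structural check right after the import can confirm that the columns used later in this report (description, deal, askedFor, valuation) are present. A minimal sketch, not part of the original transcript:

> str(Sharkdata)    # column names and types
> nrow(Sharkdata)   # number of pitches loaded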

2. Libraries Used.
> library(tm)
> library(SnowballC)
> library(wordcloud)
> library(rpart)
> library(rpart.plot)
> library(randomForest)
> library(NLP)
> library(RColorBrewer)

3. Make Corpus and extract Description Column.

> corpus = Corpus(VectorSource(Sharkdata$description))
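
To confirm the corpus was built correctly, one can inspect its first document; a small sketch, not in the original transcript:

> inspect(corpus[1])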

4. Data Cleaning

> # clean up the data by removing punctuation such as full stops, question marks and exclamation marks
> corpus = tm_map(corpus, removePunctuation)
Warning message:
In tm_map.SimpleCorpus(corpus, removePunctuation) :
transformation drops documents

> # remove stopwords, using the English stopword list plus a few extra common words
> corpus = tm_map(corpus, removeWords, c(stopwords("english"), "the", "and", "can", "of"))
Warning message:
In tm_map.SimpleCorpus(corpus, removeWords, c(stopwords("english"), :
transformation drops documents

> # strip any extra whitespace (usually little remains at this point)
> corpus = tm_map(corpus, stripWhitespace)
Warning message:
In tm_map.SimpleCorpus(corpus, stripWhitespace) :
transformation drops documents

> # the descriptions appear to be well-formed English with correct grammar, so stemming should help
> corpus = tm_map(corpus, stemDocument)
Warning message:
In tm_map.SimpleCorpus(corpus, stemDocument) :
transformation drops documents
5. Wordcloud After Cleaning
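
The wordcloud itself appears only as an image in the original report. A sketch of a call that produces such a plot with the wordcloud package (the parameter values here are illustrative, not taken from the report):

> wordcloud(corpus, max.words = 100, random.order = FALSE, colors = brewer.pal(8, "Dark2"))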

6. Document Term Matrix

> # to work with the terms seen in the wordcloud, a Document-Term Matrix is built so that each term gets a numeric count per document

> frequencies = DocumentTermMatrix(corpus)


> frequencies

<<DocumentTermMatrix (documents: 495, terms: 3559)>>


Non-/sparse entries: 9734/1751971
Sparsity : 99%
Maximal term length: 21
Weighting : term frequency (tf)
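
Before trimming sparse terms, it can help to look at which terms occur frequently; a sketch (the threshold of 20 is illustrative, not from the original transcript):

> findFreqTerms(frequencies, lowfreq = 20)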

Since sparsity is 99%, the rarest terms are removed:


> sparseData = removeSparseTerms(frequencies, 0.995)
Build the description data frame from the sparse matrix:
> description = as.data.frame(as.matrix(sparseData))
Give standard column names:
> colnames(description) = make.names(colnames(description))

Add deal as the dependent variable:


> # as per problem statement we need to add a dependent variable "deal"

> description$deal = Sharkdata$deal
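
A quick check of the final data frame and of the class balance of the dependent variable; a sketch, not in the original transcript:

> dim(description)
> table(description$deal)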

7. CART, Random Forest and Logistic Regression on the data with deal as the dependent variable. [BEFORE RATIO]

CART:
> library(rpart)
> library(rpart.plot)
> SharkdataCART = rpart(deal ~ ., data=description, method="class")

CART Diagram:
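
The tree diagram appears as an image in the original report; it was presumably drawn with prp() from rpart.plot, as in the after-ratio section:

> prp(SharkdataCART, extra=2)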

Prediction:
> #prediction using this model
> predictCART = predict(SharkdataCART, newdata=description, type="class")
> CART <- table(description$deal, predictCART)
Accuracy:
> AccuracyCart = sum(diag(CART))/sum(CART)
> AccuracyCart
[1] 0.6808081

Logistic Regression:

> set.seed(123)
> SharkDataLogistic = glm(deal~., data = description)
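
Note that glm() defaults to the Gaussian family, so the call above is effectively a linear probability model. A conventional logistic fit would pass family = binomial; a sketch with a hypothetical name (this is not the model behind the accuracy reported below):

> SharkDataLogisticBin = glm(deal ~ ., data = description, family = binomial)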

Prediction:

> # Predictions on this model


> predictLogistic = predict(SharkDataLogistic, newdata=description)

Performance Analysis:
> # performance analysis
> LogisticRegression <- table(description$deal, predictLogistic> 0.5)

Accuracy:

> #Accuracy
> AccuracyLogisticRegression = sum(diag(LogisticRegression))/sum(LogisticRegression)
> AccuracyLogisticRegression
[1] 0.6747475

Random Forest:

> library(randomForest)
> set.seed(123)
> SharkDataRF = randomForest(deal ~ ., data=description)
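
Since the predictions below are numeric and compared against a 0.5 cut-off, this forest appears to be running in regression mode. Note also that predict.randomForest() has no data argument; when no newdata is supplied it returns the out-of-bag predictions, which helps explain why the accuracy below is lower than the in-sample CART and logistic figures. Treating deal as a factor would fit a classification forest instead; a sketch with a hypothetical name (not the model behind the reported numbers):

> SharkDataRFClass = randomForest(as.factor(deal) ~ ., data = description)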

Prediction:

> # Predictions using this model.


> predictRF = predict(SharkDataRF, data=description)

Performance Analysis:

> # performance analysis


> RandomForest <- table(description$deal, predictRF>= 0.5)

Accuracy:
> # Accuracy
> AccuracyRF = sum(diag(RandomForest))/sum(RandomForest)
> AccuracyRF
[1] 0.5575758
Variable Importance Plot:

> #variable importance plot


> varImpPlot(SharkDataRF,main='Variable Importance Graph',type=2)

8. Add Ratio as one more variable


> description$ratio = Sharkdata$askedFor/Sharkdata$valuation
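
A quick look at the new variable can confirm it is the asked-for amount divided by the valuation; a sketch, not in the original transcript:

> summary(description$ratio)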

9. CART, Random Forest and Logistic Regression on the data with deal as the dependent variable, after adding the "ratio" variable. [AFTER RATIO]

CART:

> SharktDataCartRatio = rpart(deal ~ ., data=description, method="class")

CART Diagram:

> prp(SharktDataCartRatio, extra=2)
Performance:

> # Evaluate the performance of the CART model


> predictCARTRatio = predict(SharktDataCartRatio, newdata=description, type="class")
> CARTAfterRatio <- table(description$deal, predictCARTRatio)

Accuracy:
> AccuracyAfterRatioCart = sum(diag(CARTAfterRatio))/sum(CARTAfterRatio)
> AccuracyAfterRatioCart
[1] 0.6809080

Logistic Regression:

> SharkDatalogisticRatio = glm(deal~., data = description)

Prediction:
> predictLogisticRatio = predict(SharkDatalogisticRatio, newdata=description)

Performance Analysis:

> LogisticRegressionAfterRatio <- table(description$deal, predictLogisticRatio >= 0.5)
Accuracy:
> AccuracyLogisticRegAfterRatio = sum(diag(LogisticRegressionAfterRatio))/sum(LogisticRegressionAfterRatio)
> AccuracyLogisticRegAfterRatio
[1] 0.7151515

Random Forest:

> SharkDataRFRatio = randomForest(deal ~ ., data=description)

Prediction:

> predictRFRatio = predict(SharkDataRFRatio, data=description)

Performance Analysis:
> RandomForestRatio <- table(description$deal, predictRFRatio>= 0.5)

Accuracy:

> AccuracyRFAfterRatio = sum(diag(RandomForestRatio))/sum(RandomForestRatio)


> AccuracyRFAfterRatio
[1] 0.5676768

VarImpPlot:

> varImpPlot(SharkDataRFRatio, main='Variable Importance Plot After Ratio being added', type=2)
10. Conclusion

MODEL                 With Description only          With Description and Ratio
                      [Prediction Accuracy BEFORE]   [Prediction Accuracy AFTER]
CART                  68.08 %                        68.09 %
Logistic Regression   67.47 %                        71.51 %
Random Forest         55.75 %                        56.76 %

All three models improved slightly in prediction accuracy after the ratio variable was added; CART showed the smallest improvement in my model.

Logistic Regression is the most accurate of the three. Adding more variables might boost accuracy further, perhaps close to 80 percent, which would make it the best model.
