
Project Approach: [Index]

1. Import the dataset (.csv).
2. Import the libraries to be used.
3. Make a corpus from the Description column.
4. Clean the data.
5. Wordcloud.
6. Build the Document-Term Matrix.
7. Build the CART tree, the variable importance diagram and accuracy figures through Random Forest and Logistic Regression. [BEFORE]
8. Add Ratio as a variable.
9. Build the CART tree, the variable importance diagram and accuracy figures through Random Forest and Logistic Regression after adding the Ratio variable. [AFTER]
10. Conclusion.

1. Import Dataset.

> Sharkdata = read.csv("C:/Users/uidk3333/Desktop/Data Science/Dataset.csv")
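
A quick structural check right after the import can confirm that the columns used later in this report (description, deal, askedFor, valuation) are present. A minimal sketch, not part of the original transcript:

> str(Sharkdata)    # column names and types
> nrow(Sharkdata)   # number of pitches loaded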

2. Libraries Used.
> library(tm)
> library(SnowballC)
> library(wordcloud)
> library(rpart)
> library(rpart.plot)
> library(randomForest)
> library(NLP)
> library(RColorBrewer)

3. Make Corpus and extract Description Column.

> corpus = Corpus(VectorSource(Sharkdata$description))
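
To confirm the corpus was built correctly, one can inspect its first document; a small sketch, not in the original transcript:

> inspect(corpus[1])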

4. Data Cleaning

> # clean up the data by removing punctuation such as full stops, question marks and exclamation marks
> corpus = tm_map(corpus, removePunctuation)
Warning message:
In tm_map.SimpleCorpus(corpus, removePunctuation) :
transformation drops documents

> # remove stopwords, using the English stopword list plus a few extra common words
> corpus = tm_map(corpus, removeWords, c(stopwords("english"), "the", "and", "can", "of"))
Warning message:
In tm_map.SimpleCorpus(corpus, removeWords, c(stopwords("english"), :
transformation drops documents

> # strip any extra whitespace (usually little remains at this point)
> corpus = tm_map(corpus, stripWhitespace)
Warning message:
In tm_map.SimpleCorpus(corpus, stripWhitespace) :
transformation drops documents

> # the descriptions appear to be well-formed English with correct grammar, so stemming should help
> corpus = tm_map(corpus, stemDocument)
Warning message:
In tm_map.SimpleCorpus(corpus, stemDocument) :
transformation drops documents
5. Wordcloud After Cleaning
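
The wordcloud itself appears only as an image in the original report. A sketch of a call that produces such a plot with the wordcloud package (the parameter values here are illustrative, not taken from the report):

> wordcloud(corpus, max.words = 100, random.order = FALSE, colors = brewer.pal(8, "Dark2"))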

6. Document Term Matrix

> # to work with the terms seen in the wordcloud, a Document-Term Matrix is built so that each term gets a numeric count per document

> frequencies = DocumentTermMatrix(corpus)


> frequencies

<<DocumentTermMatrix (documents: 495, terms: 3559)>>


Non-/sparse entries: 9734/1751971
Sparsity : 99%
Maximal term length: 21
Weighting : term frequency (tf)
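
Before trimming sparse terms, it can help to look at which terms occur frequently; a sketch (the threshold of 20 is illustrative, not from the original transcript):

> findFreqTerms(frequencies, lowfreq = 20)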

Since sparsity is 99%, the rarest terms are removed:


> sparseData = removeSparseTerms(frequencies, 0.995)
Build the description data frame from the sparse matrix:
> description = as.data.frame(as.matrix(sparseData))
Give standard column names:
> colnames(description) = make.names(colnames(description))

Add deal as the dependent variable:


> # as per problem statement we need to add a dependent variable "deal"

> description$deal = Sharkdata$deal
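
A quick check of the final data frame and of the class balance of the dependent variable; a sketch, not in the original transcript:

> dim(description)
> table(description$deal)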

7. CART, Random Forest and Logistic Regression on the data with deal as the dependent variable. [BEFORE RATIO]

CART:
> library(rpart)
> library(rpart.plot)
> SharkdataCART = rpart(deal ~ ., data=description, method="class")

CART Diagram:
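
The tree diagram appears as an image in the original report; it was presumably drawn with prp() from rpart.plot, as in the after-ratio section:

> prp(SharkdataCART, extra=2)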

Prediction:
> #prediction using this model
> predictCART = predict(SharkdataCART, newdata=description, type="class")
> CART <- table(description$deal, predictCART)
Accuracy:
> AccuracyCart = sum(diag(CART))/sum(CART)
> AccuracyCart
[1] 0.6808081

Logistic Regression:

> set.seed(123)
> SharkDataLogistic = glm(deal~., data = description)
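
Note that glm() defaults to the Gaussian family, so the call above is effectively a linear probability model. A conventional logistic fit would pass family = binomial; a sketch with a hypothetical name (this is not the model behind the accuracy reported below):

> SharkDataLogisticBin = glm(deal ~ ., data = description, family = binomial)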

Prediction:

> # Predictions on this model


> predictLogistic = predict(SharkDataLogistic, newdata=description)

Performance Analysis:
> # performance analysis
> LogisticRegression <- table(description$deal, predictLogistic> 0.5)

Accuracy:

> #Accuracy
> AccuracyLogisticRegression = sum(diag(LogisticRegression))/sum(LogisticRegression)
> AccuracyLogisticRegression
[1] 0.6747475

Random Forest:

> library(randomForest)
> set.seed(123)
> SharkDataRF = randomForest(deal ~ ., data=description)
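
Since the predictions below are numeric and compared against a 0.5 cut-off, this forest appears to be running in regression mode. Note also that predict.randomForest() has no data argument; when no newdata is supplied it returns the out-of-bag predictions, which helps explain why the accuracy below is lower than the in-sample CART and logistic figures. Treating deal as a factor would fit a classification forest instead; a sketch with a hypothetical name (not the model behind the reported numbers):

> SharkDataRFClass = randomForest(as.factor(deal) ~ ., data = description)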

Prediction:

> # Predictions using this model.


> predictRF = predict(SharkDataRF, data=description)

Performance Analysis:

> # performance analysis


> RandomForest <- table(description$deal, predictRF>= 0.5)

Accuracy:
> # Accuracy
> AccuracyRF = sum(diag(RandomForest))/sum(RandomForest)
> AccuracyRF
[1] 0.5575758
Variable Importance Plot:

> #variable importance plot


> varImpPlot(SharkDataRF,main='Variable Importance Graph',type=2)

8. Add Ratio as one more variable


> description$ratio = Sharkdata$askedFor/Sharkdata$valuation
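
A quick look at the new variable can confirm it is the asked-for amount divided by the valuation; a sketch, not in the original transcript:

> summary(description$ratio)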

9. CART, Random Forest and Logistic Regression on the data with deal as the dependent variable, after adding the "ratio" variable. [AFTER RATIO]

CART:

> SharktDataCartRatio = rpart(deal ~ ., data=description, method="class")

CART Diagram:

> prp(SharktDataCartRatio, extra=2)
Performance:

> # Evaluate the performance of the CART model


> predictCARTRatio = predict(SharktDataCartRatio, newdata=description, type="class")
> CARTAfterRatio <- table(description$deal, predictCARTRatio)

Accuracy:
> AccuracyAfterRatioCart = sum(diag(CARTAfterRatio))/sum(CARTAfterRatio)
> AccuracyAfterRatioCart
[1] 0.6809080

Logistic Regression:

> SharkDatalogisticRatio = glm(deal~., data = description)

Prediction:
> predictLogisticRatio = predict(SharkDatalogisticRatio, newdata=description)

Performance Analysis:

> LogisticRegressionAfterRatio <- table(description$deal, predictLogisticRatio >= 0.5)
Accuracy:
> AccuracyLogisticRegAfterRatio = sum(diag(LogisticRegressionAfterRatio))/sum(LogisticRegressionAfterRatio)
> AccuracyLogisticRegAfterRatio
[1] 0.7151515

Random Forest:

> SharkDataRFRatio = randomForest(deal ~ ., data=description)

Prediction:

> predictRFRatio = predict(SharkDataRFRatio, data=description)

Performance Analysis:
> RandomForestRatio <- table(description$deal, predictRFRatio>= 0.5)

Accuracy:

> AccuracyRFAfterRatio = sum(diag(RandomForestRatio))/sum(RandomForestRatio)


> AccuracyRFAfterRatio
[1] 0.5676768

VarImpPlot:

> varImpPlot(SharkDataRFRatio, main='Variable Importance Plot After Ratio being added', type=2)
10. Conclusion

MODEL                 With Description only          With Description and Ratio
                      [Prediction Accuracy BEFORE]   [Prediction Accuracy AFTER]
CART                  68.08 %                        68.09 %
Logistic Regression   67.47 %                        71.51 %
Random Forest         55.75 %                        56.76 %

All three models improved slightly in prediction accuracy after the ratio variable was added; CART showed the smallest improvement in my model.

Logistic Regression is the most accurate of the three. Adding more variables might boost accuracy further, perhaps close to 80 percent, which would make it the best model.
