Check for outliers and missing values (8 marks)
Answer
a) The summary() function shows, for each variable, the mean, median, minimum, 1st quartile, 3rd quartile and maximum value.
b) The class() function shows the class of the data: the data are in table format, of class "data.frame".
Summary of the Data
From the summary output above, it is clearly evident that there are missing values in the data.
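In the report this check is done in R with summary(). As a hedged illustration, the same missing-value count can be sketched in Python with numpy (the column below is a toy stand-in, not the actual dataset):

```python
import numpy as np

# Toy stand-in for one column of the data, with a single missing entry
prod_qual = np.array([8.5, 8.2, np.nan, 9.2, 6.4])

# Count the missing (NaN) values, which is what summary() reveals in R as NA's
n_missing = int(np.isnan(prod_qual).sum())
print(n_missing)  # one missing value in this toy column
```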
Correlation in Numbers
Correlation Chart with Numbers and Pie Indicators
[Figure: pairs panel of the 12 variables, with scatter plots below the diagonal and pairwise correlations (with pie indicators) above it. The correlations recovered from the chart are:]

            Ecom   TechS  CompR  Adver  ProdL  SaleF  ComPr  Warty  OrdBi  DelSp  Satis
ProdQual   -0.14    0.10   0.11  -0.05   0.48  -0.15  -0.40   0.09   0.10   0.03   0.49
Ecom                0.00   0.14   0.43  -0.05   0.79   0.23   0.05   0.16   0.19   0.28
TechSup                    0.10  -0.06   0.19   0.02  -0.27   0.80   0.08   0.03   0.11
CompRes                           0.20   0.56   0.23  -0.13   0.14   0.76   0.87   0.60
Advertising                             -0.01   0.54   0.13   0.01   0.18   0.28   0.30
ProdLine                                       -0.06  -0.49   0.27   0.42   0.60   0.55
SalesFImage                                            0.26   0.11   0.20   0.27   0.50
ComPricing                                                   -0.24  -0.11  -0.07  -0.21
WartyClaim                                                           0.20   0.11   0.18
OrdBilling                                                                  0.75   0.52
DelSpeed                                                                           0.58
From the scatter plots in the same chart, we can also get a visual check on the normality of the data.
Boxplot to See the Outliers in the Data
[Figure: boxplots of the variables on a common scale (y-axis from 2 to 10).]
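The outlier rule that a boxplot visualises, points lying more than 1.5 times the interquartile range beyond the quartiles, can be sketched as follows (the numbers are illustrative, not the report's data):

```python
import numpy as np

# Illustrative values; one point sits well above the rest
x = np.array([6.1, 6.6, 7.0, 7.2, 7.4, 7.9, 8.1, 10.9])

# Quartiles and the 1.5*IQR whisker limits that a boxplot draws
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the whiskers are flagged as outliers
outliers = x[(x < lower) | (x > upper)]
print(outliers)
```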
When we run a multiple linear regression of Satisfaction on the other 11 variables, the p-values of ProdQual, Ecom and SalesFImage are statistically significant at alpha = 5%, while the other variables are not. The overall R-squared is 80.21%, which means that the 11 variables together explain 80.21% of the variation in the dependent variable Satisfaction. The regression model as a whole is also statistically significant, as its p-value is less than alpha = 5%.
Note: I have run a multiple linear regression here, and I will attach one more Word document with screenshots of simple linear regressions against each independent variable. When we run simple linear regressions with each independent variable, ProdQual, Ecom, CompRes, Advertising, ProdLine, SalesFImage, ComPricing, OrdBilling and DelSpeed are statistically significant (p-value less than alpha = 1%), while TechSup and WartyClaim are not (p-value greater than alpha = 5%).
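As a sketch of what the multiple regression computes, here is a minimal least-squares fit with a manually computed R-squared. The data are synthetic (the report's dataset and coefficients are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: a response driven by the first two of three predictors
n = 100
X = rng.normal(size=(n, 3))
y = 2.0 + 0.8 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)

# Least-squares fit with an explicit intercept column
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# R-squared: the share of variance in y explained by the fitted model
resid = y - A @ beta
r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
print(round(r2, 3))
```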
4) Perform PCA/Factor analysis by extracting 4 factors. Interpret the output and name the factors. (20 marks)
# Principal Component Analysis / Factor Analysis
# We are assuming Satisfaction is the dependent variable,
# so we remove it from the data before PCA/FA
PCA_Mydata <- Mydata[, -12]
# Checking the correlation between the X variables
matrix1 <- cor(PCA_Mydata)
print(matrix1)
[Figure: correlation plot (corrplot) of the 11 predictors (ProdQual, Ecom, TechSup, CompRes, Advertising, ProdLine, SalesFImage, ComPricing, WartyClaim, OrdBilling, DelSpeed), with a colour scale from -1 to +1.]
Bartlett Test, to check whether the data are suitable for Principal Component Analysis (PCA) / Factor Analysis (FA)
The p-value is less than the default significance level (0.05), hence the test suggests that the data are suitable for PCA/FA.
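Bartlett's test of sphericity has a closed form: it compares the determinant of the correlation matrix against that of an identity matrix. A minimal sketch, using a toy 3x3 correlation matrix rather than the report's 11x11 one:

```python
import numpy as np
from scipy import stats

def bartlett_sphericity(R, n):
    """Bartlett's test of sphericity: H0 is that the p x p correlation
    matrix R is the identity (i.e. nothing worth factoring)."""
    p = R.shape[0]
    chi2 = -((n - 1) - (2 * p + 5) / 6.0) * np.log(np.linalg.det(R))
    df = p * (p - 1) // 2
    p_value = stats.chi2.sf(chi2, df)
    return chi2, df, p_value

# Toy correlation matrix with clear structure (not the report's data)
R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])
chi2, df, p_value = bartlett_sphericity(R, n=100)
print(df, p_value < 0.05)
```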
KMO Test, to check whether we have an adequate sample size to go ahead with the analysis
The overall MSA is greater than 0.5, which means we have an adequate number of samples in the dataset, and we can go ahead with the analysis.
Calculation of Eigenvalues and Scree Plot
[Figure: scree plot of the eigenvalues (y-axis from 0.0 to 3.5) against the principal component number.]
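The scree plot is driven by the eigenvalues of the correlation matrix: with p variables the eigenvalues sum to p, so each eigenvalue divided by p is that component's share of variance. A sketch on a toy 4-variable matrix (the report's 11x11 matrix is not reproduced here):

```python
import numpy as np

# Toy correlation matrix with two correlated pairs of variables
R = np.array([[1.0, 0.7, 0.1, 0.0],
              [0.7, 1.0, 0.0, 0.1],
              [0.1, 0.0, 1.0, 0.6],
              [0.0, 0.1, 0.6, 1.0]])

# Eigenvalues in descending order, and the cumulative variance explained
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
cum_var = np.cumsum(eigvals) / eigvals.sum()
print(np.round(eigvals, 2))
print(np.round(cum_var, 2))
```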
The total cumulative variance explained by 4 components is 80%. To get a better division of the variables under the different components, we can set a cut-off for the loadings.
We can clearly see all the separations, but ProdQual has roughly the same explainability in PC2 and PC4, so let us try a varimax rotation and see whether we get a clearer picture.
PCA with rotation gives a better result: we can assign ProdLine to PC4, since it has the higher loading there, and the variables are now divided into 4 components based on their respective nature.
Performance of Factor Analysis
As per the factor diagram, the factors do not give a clear picture of the separation of variables, so now we will rotate and check the results.
VARIMAX ROTATION OF FACTORS
Naming of Factors
As per the varimax rotation, the variables separate properly based on their attributes, and we can name the factors accordingly:
1) Factor 1 - Customer Service: DelSpeed, CompRes and OrdBilling
2) Factor 2 - Marketing: SalesFImage, Ecom and Advertising
3) Factor 3 - Customer Support/Technical Support: WartyClaim and TechSup
4) Factor 4 - Product Goodwill/Product Value: ProdLine, ProdQual and ComPricing
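Varimax rotation itself can be sketched in a few lines: it seeks an orthogonal rotation of the loadings that pushes each loading toward 0 or +/-1, which is what makes the factor-to-variable assignment readable. The loadings below are a toy example, not the report's output:

```python
import numpy as np

def varimax(L, n_iter=100, tol=1e-8):
    """Varimax rotation of a loadings matrix L (variables x factors),
    via the standard SVD-based iteration."""
    p, k = L.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(n_iter):
        Lr = L @ R
        u, s, vt = np.linalg.svd(
            L.T @ (Lr ** 3 - Lr @ np.diag(np.sum(Lr ** 2, axis=0)) / p))
        R = u @ vt                 # best orthogonal rotation for this step
        d_new = s.sum()
        if d_new < d * (1 + tol):  # stop when the criterion stops improving
            break
        d = d_new
    return L @ R

# Toy loadings: two groups of variables, blurred by mixing
L = np.array([[ 0.8, 0.4],
              [ 0.7, 0.5],
              [-0.4, 0.7],
              [-0.5, 0.8]])
rotated = varimax(L)
print(np.round(rotated, 2))
```

Because the rotation is orthogonal, each variable's communality (row sum of squared loadings) is unchanged; only the split across factors becomes cleaner.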
5) Perform multiple linear regression with customer satisfaction as the dependent variable and the four factors as independent variables. Comment on the model output and validity. Your remarks should make it meaningful for everybody. (20 marks)
Multiple Linear Regression with Satisfaction as the Dependent Variable and the 4 Factors as Independent Variables - Creation of the Model on 70% of the Data (Train)
Model Validity
There are many ways to validate the model; a few of them are mentioned below.
Comparing the R-squared of the train and test data - we calculate the R-squared value manually for the test data, since summary() does not report an R-squared for predictions.
There are two ways to compare the R-squared of train and test. The model is valid if the R-squared values for the train and test data are close, or if the R-squared of the test data is better.
1. Test R-squared value for the predicted data (Method 1)
In this case the R-squared for the test data (0.7434612) is better than, and close to, that of the train data (0.6741), hence the model is valid.
With the second method, the R-squared for the test data (0.7605972) is better than that of the train data (0.6741), hence the model is valid.
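The "manual" out-of-sample R-squared described above is just 1 - SSE/SST on the held-out observations. A sketch with illustrative numbers (the report's actual train/test split is not reproduced):

```python
import numpy as np

# Illustrative held-out actuals and model predictions
y_test = np.array([6.9, 7.5, 8.1, 6.2, 7.8])
y_pred = np.array([7.0, 7.3, 8.0, 6.5, 7.6])

# R2_test = 1 - SSE/SST, both computed on the test observations
sse = np.sum((y_test - y_pred) ** 2)
sst = np.sum((y_test - y_test.mean()) ** 2)
r2_test = 1 - sse / sst
print(round(r2_test, 3))  # prints 0.917
```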
Calculate RMSE (Root Mean Square Error)
When the RMSE lies in the low range of the distribution of the dependent variable, i.e. either below the minimum or between the minimum and the 1st quartile, the model is considered a good model.
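RMSE is the typical size of a prediction error, in the units of the dependent variable. A sketch with illustrative numbers (not the report's data):

```python
import numpy as np

# Illustrative held-out actuals and model predictions
y_test = np.array([6.9, 7.5, 8.1, 6.2, 7.8])
y_pred = np.array([7.0, 7.3, 8.0, 6.5, 7.6])

# Root mean square error over the test observations
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
print(round(rmse, 3))  # prints 0.195
```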
Calculating MAPE (Mean Absolute Percentage Error)
We get a MAPE of 0.07505267, i.e. 7.5%, which means that our model's accuracy is about 92.5%. With such high accuracy, and with the test R-squared close to the train R-squared, the model can be stated as valid and does not appear to be over-fitted.
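MAPE is the mean of |error|/|actual|, so a MAPE of 0.075 means the predictions are off by about 7.5% on average. A sketch with illustrative numbers (not the report's data):

```python
import numpy as np

# Illustrative held-out actuals and model predictions
y_test = np.array([6.9, 7.5, 8.1, 6.2, 7.8])
y_pred = np.array([7.0, 7.3, 8.0, 6.5, 7.6])

# Mean absolute percentage error over the test observations
mape = np.mean(np.abs((y_test - y_pred) / y_test))
print(round(mape, 3))  # prints 0.026
```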
The regression model is statistically significant and there is a linear relationship between the variables. We can reject the null hypothesis and accept the alternative: the null hypothesis that all the betas are zero is rejected, and at least one beta is non-zero.