Check for outliers and missing values (8 marks)
Answer
a) The summary() function shows, for each variable, the mean, median, minimum, 1st quartile, 3rd quartile and maximum value.
b) The class() function shows the class of the data: the data are in table format, of class "data.frame".
Summary of the Data
From the summary output above, it is clearly evident that there are missing values in the data.
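In the report this check is done in R with summary(). As a hedged illustration, the same missing-value count can be sketched in Python with numpy (the column below is a toy stand-in, not the actual dataset):

```python
import numpy as np

# Toy stand-in for one column of the data, with a single missing entry
prod_qual = np.array([8.5, 8.2, np.nan, 9.2, 6.4])

# Count the missing (NaN) values, which is what summary() reveals in R as NA's
n_missing = int(np.isnan(prod_qual).sum())
print(n_missing)  # one missing value in this toy column
```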
Correlation in Numbers
Correlation Chart with Numbers and Pie Indicators
[Figure: pairs panel of the 12 variables, with scatter plots below the diagonal and pairwise correlations (with pie indicators) above it. The correlations recovered from the chart are:]

            Ecom   TechS  CompR  Adver  ProdL  SaleF  ComPr  Warty  OrdBi  DelSp  Satis
ProdQual   -0.14    0.10   0.11  -0.05   0.48  -0.15  -0.40   0.09   0.10   0.03   0.49
Ecom                0.00   0.14   0.43  -0.05   0.79   0.23   0.05   0.16   0.19   0.28
TechSup                    0.10  -0.06   0.19   0.02  -0.27   0.80   0.08   0.03   0.11
CompRes                           0.20   0.56   0.23  -0.13   0.14   0.76   0.87   0.60
Advertising                             -0.01   0.54   0.13   0.01   0.18   0.28   0.30
ProdLine                                       -0.06  -0.49   0.27   0.42   0.60   0.55
SalesFImage                                            0.26   0.11   0.20   0.27   0.50
ComPricing                                                   -0.24  -0.11  -0.07  -0.21
WartyClaim                                                           0.20   0.11   0.18
OrdBilling                                                                  0.75   0.52
DelSpeed                                                                           0.58
From the scatter plots in the same chart, we can also get a visual check on the normality of the data.
Boxplot to See the Outliers in the Data
[Figure: boxplots of the variables on a common scale (y-axis from 2 to 10).]
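The outlier rule that a boxplot visualises, points lying more than 1.5 times the interquartile range beyond the quartiles, can be sketched as follows (the numbers are illustrative, not the report's data):

```python
import numpy as np

# Illustrative values; one point sits well above the rest
x = np.array([6.1, 6.6, 7.0, 7.2, 7.4, 7.9, 8.1, 10.9])

# Quartiles and the 1.5*IQR whisker limits that a boxplot draws
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the whiskers are flagged as outliers
outliers = x[(x < lower) | (x > upper)]
print(outliers)
```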
When we run a multiple linear regression of Satisfaction on the other 11 variables, the p-values of ProdQual, Ecom and SalesFImage are statistically significant at alpha = 5%, while the other variables are not. The overall R-squared is 80.21%, which means that the 11 variables together explain 80.21% of the variation in the dependent variable Satisfaction. The regression model as a whole is also statistically significant, as its p-value is less than alpha = 5%.
Note: I have run a multiple linear regression here, and I will attach one more Word document with screenshots of simple linear regressions against each independent variable. When we run simple linear regressions with each independent variable, ProdQual, Ecom, CompRes, Advertising, ProdLine, SalesFImage, ComPricing, OrdBilling and DelSpeed are statistically significant (p-value less than alpha = 1%), while TechSup and WartyClaim are not (p-value greater than alpha = 5%).
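As a sketch of what the multiple regression computes, here is a minimal least-squares fit with a manually computed R-squared. The data are synthetic (the report's dataset and coefficients are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: a response driven by the first two of three predictors
n = 100
X = rng.normal(size=(n, 3))
y = 2.0 + 0.8 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)

# Least-squares fit with an explicit intercept column
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# R-squared: the share of variance in y explained by the fitted model
resid = y - A @ beta
r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
print(round(r2, 3))
```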
4) Perform PCA/Factor analysis by extracting 4 factors. Interpret the output and name the factors. (20 marks)
# Principal Component Analysis / Factor Analysis
# We are assuming Satisfaction is the dependent variable,
# so we remove it from the data before PCA/FA
PCA_Mydata <- Mydata[, -12]
# Checking the correlation between the X variables
matrix1 <- cor(PCA_Mydata)
print(matrix1)
[Figure: correlation plot (corrplot) of the 11 predictors (ProdQual, Ecom, TechSup, CompRes, Advertising, ProdLine, SalesFImage, ComPricing, WartyClaim, OrdBilling, DelSpeed), with a colour scale from -1 to +1.]
Bartlett Test, to check whether the data are suitable for Principal Component Analysis (PCA) / Factor Analysis (FA)
The p-value is less than the default significance level (0.05), hence the test suggests that the data are suitable for PCA/FA.
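Bartlett's test of sphericity has a closed form: it compares the determinant of the correlation matrix against that of an identity matrix. A minimal sketch, using a toy 3x3 correlation matrix rather than the report's 11x11 one:

```python
import numpy as np
from scipy import stats

def bartlett_sphericity(R, n):
    """Bartlett's test of sphericity: H0 is that the p x p correlation
    matrix R is the identity (i.e. nothing worth factoring)."""
    p = R.shape[0]
    chi2 = -((n - 1) - (2 * p + 5) / 6.0) * np.log(np.linalg.det(R))
    df = p * (p - 1) // 2
    p_value = stats.chi2.sf(chi2, df)
    return chi2, df, p_value

# Toy correlation matrix with clear structure (not the report's data)
R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])
chi2, df, p_value = bartlett_sphericity(R, n=100)
print(df, p_value < 0.05)
```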
KMO Test, to check whether we have an adequate sample size to go ahead with the analysis
The overall MSA is greater than 0.5, which means we have an adequate number of samples in the dataset, and we can go ahead with the analysis.
Calculation of Eigenvalues and Scree Plot
[Figure: scree plot of the eigenvalues (y-axis from 0.0 to 3.5) against the principal component number.]
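The scree plot is driven by the eigenvalues of the correlation matrix: with p variables the eigenvalues sum to p, so each eigenvalue divided by p is that component's share of variance. A sketch on a toy 4-variable matrix (the report's 11x11 matrix is not reproduced here):

```python
import numpy as np

# Toy correlation matrix with two correlated pairs of variables
R = np.array([[1.0, 0.7, 0.1, 0.0],
              [0.7, 1.0, 0.0, 0.1],
              [0.1, 0.0, 1.0, 0.6],
              [0.0, 0.1, 0.6, 1.0]])

# Eigenvalues in descending order, and the cumulative variance explained
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
cum_var = np.cumsum(eigvals) / eigvals.sum()
print(np.round(eigvals, 2))
print(np.round(cum_var, 2))
```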
The total cumulative variance explained by 4 components is 80%. To get a better division of the variables under the different components, we can set a cut-off for the loadings.
We can clearly see all the separations, but ProdQual has roughly the same explainability in PC2 and PC4, so let us try a varimax rotation and see whether we get a clearer picture.
PCA with rotation gives a better result: we can assign ProdLine to PC4, since it has the higher loading there, and the variables are now divided into 4 components based on their respective nature.
Performance of Factor Analysis
As per the factor diagram, the factors do not give a clear picture of the separation of variables, so now we will rotate and check the results.
VARIMAX ROTATION OF FACTORS
Naming of Factors
As per the varimax rotation, the variables separate properly based on their attributes, and we can name the factors accordingly:
1) Factor 1 - Customer Service: DelSpeed, CompRes and OrdBilling
2) Factor 2 - Marketing: SalesFImage, Ecom and Advertising
3) Factor 3 - Customer Support/Technical Support: WartyClaim and TechSup
4) Factor 4 - Product Goodwill/Product Value: ProdLine, ProdQual and ComPricing
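Varimax rotation itself can be sketched in a few lines: it seeks an orthogonal rotation of the loadings that pushes each loading toward 0 or +/-1, which is what makes the factor-to-variable assignment readable. The loadings below are a toy example, not the report's output:

```python
import numpy as np

def varimax(L, n_iter=100, tol=1e-8):
    """Varimax rotation of a loadings matrix L (variables x factors),
    via the standard SVD-based iteration."""
    p, k = L.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(n_iter):
        Lr = L @ R
        u, s, vt = np.linalg.svd(
            L.T @ (Lr ** 3 - Lr @ np.diag(np.sum(Lr ** 2, axis=0)) / p))
        R = u @ vt                 # best orthogonal rotation for this step
        d_new = s.sum()
        if d_new < d * (1 + tol):  # stop when the criterion stops improving
            break
        d = d_new
    return L @ R

# Toy loadings: two groups of variables, blurred by mixing
L = np.array([[ 0.8, 0.4],
              [ 0.7, 0.5],
              [-0.4, 0.7],
              [-0.5, 0.8]])
rotated = varimax(L)
print(np.round(rotated, 2))
```

Because the rotation is orthogonal, each variable's communality (row sum of squared loadings) is unchanged; only the split across factors becomes cleaner.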
5) Perform multiple linear regression with customer satisfaction as the dependent variable and the four factors as independent variables. Comment on the model output and validity. Your remarks should make it meaningful for everybody. (20 marks)
Multiple Linear Regression with Satisfaction as the Dependent Variable and the 4 Factors as Independent Variables - Creation of the Model on 70% of the Data (Train)
Model Validity
There are many ways to validate the model; a few of them are mentioned below.
Comparing the R-squared of the train and test data - we calculate the R-squared value manually for the test data, since summary() does not report an R-squared for predictions.
There are two ways to compare the R-squared of train and test. The model is valid if the R-squared values for the train and test data are close, or if the R-squared of the test data is better.
1. Test R-squared value for the predicted data (Method 1)
In this case the R-squared for the test data (0.7434612) is better than, and close to, that of the train data (0.6741), hence the model is valid.
With the second method, the R-squared for the test data (0.7605972) is better than that of the train data (0.6741), hence the model is valid.
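The "manual" out-of-sample R-squared described above is just 1 - SSE/SST on the held-out observations. A sketch with illustrative numbers (the report's actual train/test split is not reproduced):

```python
import numpy as np

# Illustrative held-out actuals and model predictions
y_test = np.array([6.9, 7.5, 8.1, 6.2, 7.8])
y_pred = np.array([7.0, 7.3, 8.0, 6.5, 7.6])

# R2_test = 1 - SSE/SST, both computed on the test observations
sse = np.sum((y_test - y_pred) ** 2)
sst = np.sum((y_test - y_test.mean()) ** 2)
r2_test = 1 - sse / sst
print(round(r2_test, 3))  # prints 0.917
```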
Calculate RMSE (Root Mean Square Error)
When the RMSE lies in the low range of the distribution of the dependent variable, i.e. either below the minimum or between the minimum and the 1st quartile, the model is considered a good model.
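RMSE is the typical size of a prediction error, in the units of the dependent variable. A sketch with illustrative numbers (not the report's data):

```python
import numpy as np

# Illustrative held-out actuals and model predictions
y_test = np.array([6.9, 7.5, 8.1, 6.2, 7.8])
y_pred = np.array([7.0, 7.3, 8.0, 6.5, 7.6])

# Root mean square error over the test observations
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
print(round(rmse, 3))  # prints 0.195
```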
Calculating MAPE (Mean Absolute Percentage Error)
We get a MAPE of 0.07505267, i.e. 7.5%, which means that our model's accuracy is about 92.5%. With such high accuracy, and with the test R-squared close to the train R-squared, the model can be stated as valid and does not appear to be over-fitted.
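MAPE is the mean of |error|/|actual|, so a MAPE of 0.075 means the predictions are off by about 7.5% on average. A sketch with illustrative numbers (not the report's data):

```python
import numpy as np

# Illustrative held-out actuals and model predictions
y_test = np.array([6.9, 7.5, 8.1, 6.2, 7.8])
y_pred = np.array([7.0, 7.3, 8.0, 6.5, 7.6])

# Mean absolute percentage error over the test observations
mape = np.mean(np.abs((y_test - y_pred) / y_test))
print(round(mape, 3))  # prints 0.026
```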
The regression model is statistically significant and there is a linear relationship between the variables. We can reject the null hypothesis and accept the alternative: the null hypothesis that all the betas are zero is rejected, and at least one beta is non-zero.