
RESEARCH PROJECT

FOR INTERNAL ASSESSMENT

EFFORTS BY -
 KUNWARDEEP SINGH
(175055)
 SANMEET SINGH
(175050)
 KARAM SINGH
(175041)
FACTOR ANALYSIS
 In this analysis we use the responses of 312 prisoners to the Measure of
Criminal Social Identity (Boduszek et al., 2012). The scale comprises 8 items
designed to measure social identity as a criminal. Different factor-analytic
solutions have been reported for criminal social identification: some authors
suggest that the scale measures one factor, some suggest a two-factor
solution, whereas others state that it measures three correlated factors:
cognitive centrality, in-group affect and in-group ties.

OUTPUT & INTERPRETATIONS

 The means for each of the items appear reasonable, as each item is
measured on a 5-point Likert scale: no values are above 5 or below 1. The
standard deviations are all similar, suggesting that there are no outliers
for any of the items.
 The ‘Analysis N’ shows the number of valid cases. There are 9 missing
values, since the full sample included 312 prisoners (so the Analysis N is 303).
 Communalities can be thought of as the R² for each variable included in
the analysis, using the factors as IVs and the item as the DV.
 A communality represents the proportion of variance of each item that is
explained by the factors.
 This is calculated for the initial solution and then after extraction; the
two values are reported in the Initial and Extraction columns.
 The Initial Eigenvalues show that the first 3 factors are meaningful, as
they have eigenvalues > 1. Factors 1, 2 and 3 explain 51.08%, 23.01% and
16.37% of the variance respectively, a cumulative total of 90.46% (an
acceptable total). The Extraction Sums of Squared Loadings provide
similar information based only on the extracted factors.
 The scree plot shows three relatively high eigenvalues (factors 1, 2
and 3). Retain the factors that lie above the ‘bend’ – the point at
which the curve of decreasing eigenvalues changes from a steep line
to a flat, gradual slope.

 The Factor Matrix presents information from the initial unrotated
solution. The values are weights that relate each item (or variable) to
the respective factor. All items have high(ish) positive weights on the
first factor.
 At this stage the solution has not taken into account the correlation
between the three factors; the subsequent output is more readily
interpretable.
 The Goodness-of-fit Test determines whether the sample data (correlations)
are likely to have arisen from three correlated factors. Here we want the
probability value of the chi-square statistic to be greater than the chosen
alpha (generally 0.05). Based on our results, the three-factor model is a
good description of the data.

 The Pattern Matrix shows the factor loadings for the rotated
solution. Factor loadings are similar to regression weights (or slopes)
and indicate the strength of the association between the variables
and the factors. The solution has been rotated to achieve an
interpretable structure.
 When the factors are uncorrelated, the Pattern Matrix and the
Structure Matrix should be the same.
 The Structure Matrix shows the correlations between the factors and
the items for the rotated solution.

 The Factor Correlation Matrix shows that factors 1, 2 and 3 are
statistically correlated.
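
For readers who want to reproduce a comparable three-factor oblique solution outside SPSS, below is a minimal sketch using Python's factor_analyzer package. The file name mcsi_items.csv and the DataFrame layout (one column per item) are hypothetical stand-ins for the 8-item response data, and oblimin is one oblique rotation choice, not necessarily the one used in the SPSS run.

```python
# Sketch of a three-factor oblique (oblimin) maximum-likelihood solution,
# analogous to the SPSS output discussed above. The data file is a
# hypothetical stand-in for the 8 MCSI items.
import pandas as pd
from factor_analyzer import FactorAnalyzer

items = pd.read_csv("mcsi_items.csv")  # hypothetical file: 8 item columns

fa = FactorAnalyzer(n_factors=3, rotation="oblimin", method="ml")
fa.fit(items)

eigenvalues, _ = fa.get_eigenvalues()   # for the Kaiser criterion / scree plot
print("Eigenvalues:", eigenvalues)
print("Communalities:", fa.get_communalities())
print("Pattern matrix (rotated loadings):")
print(fa.loadings_)
print("Factor correlation matrix:")      # available for oblique rotations
print(fa.phi_)
```

The eigenvalues support the eigenvalue > 1 check and the scree plot, while loadings_ and phi_ correspond to the Pattern Matrix and Factor Correlation Matrix discussed above.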
DISCRIMINANT ANALYSIS
 Here we consider a model targeting potential buyers who may purchase a
bike on the basis of the following features, on which each buyer is scored:
 Durability
 Performance
 Looks

OUTPUT & INTERPRETATION


 Analysis Case Processing Summary – This table summarizes the
analysis dataset in terms of valid and excluded cases. The reasons why
SPSS might exclude an observation from the analysis are listed here,
and the number (“N”) and percent of cases falling into each category
(valid or one of the exclusions) are presented. In this example, all of
the observations in the dataset are valid.
 Group Statistics – This table presents the distribution of observations
into the two groups of the dependent variable (purchase / would not
purchase). We can see the number of observations falling into each group.
In this model, we are using the default weight of 1 for each observation in
the dataset, so the weighted number of observations in each group equals
the unweighted number.
 Test of Equality of Group Means – This helps check whether there is a
significant difference in each of our predictor variables across the two
categorical groups.
As we can see, the significance value for Looks of the bike (0.747) is
greater than .05, so we fail to reject the null hypothesis of equal group
means for this predictor. Hence, we can conclude that Looks is not
contributing statistically to discriminating the dependent variable,
i.e. buying the bike.
 Box’s M test checks the assumption of equality of the variance-covariance
matrices across groups.
 The p value of the test is .062, which is greater than .05, so the null
hypothesis is not rejected; the assumption of equal variance-covariance
matrices is tenable.
 Eigenvalue – These are the eigenvalues of the matrix product of the
inverse of the within-group sums-of-squares and cross-product matrix
and the between-groups sums-of-squares and cross-product
matrix. These eigenvalues are related to the canonical correlations and
describe how much discriminating ability a function possesses. The
magnitudes of the eigenvalues are indicative of the functions’
discriminating abilities. See superscript e for underlying calculations.
 Wilks’ Lambda – Wilks’ Lambda is one of the multivariate statistics
calculated by SPSS. It is the product of the values of
(1 − canonical correlation²) across the functions.
INTERPRETATION: Our canonical correlation is 0.814, so the Wilks’
Lambda testing this canonical correlation is 1 − 0.814² = 0.338.
 Standardized Canonical Discriminant Function Coefficients – These
coefficients can be used to calculate the discriminant score for a given
case. The score is calculated in the same manner as a predicted value
from a linear regression, using the standardized coefficients and the
standardized variables.
For example, let Durability, Performance and Looks be the standardized
versions of our discriminating variables. Then, for each case, the
function score would be calculated using the following equation:

Z = 0.56 × Durability + 0.26 × Performance − 0.30 × Looks

(The standardized function carries no constant term, since all the
variables are standardized to mean zero.)

 Structure Matrix – This is the canonical structure, also known as
canonical loading or discriminant loading, of the discriminant
functions. It represents the correlations between the observed
variables (the three continuous discriminating variables) and the
dimensions created by the unobserved discriminant functions.
 Functions at Group Centroids – These are the means of the discriminant
function scores by group for each function calculated. Here, the centroid
for the purchase group is 1.365 and for the non-purchase group is −1.365;
a case is classified into the group whose centroid its discriminant score
is closer to.

 Classification Processing Summary – This is similar to the Analysis Case
Processing Summary (see superscript a), but in this table, “Processed”
cases are those that were successfully classified based on the analysis.
The reasons why an observation may not have been processed are listed here.
INTERPRETATION: All of the observations in the dataset were successfully
classified.

 Prior Probabilities for Groups – This is the distribution of observations
into the groups used as a starting point in the analysis. The default prior
distribution is an equal allocation into the groups.
INTERPRETATION: SPSS allows users to specify different priors with
the PRIORS subcommand.

 Predicted Group Membership – These are the predicted frequencies of
groups from the analysis. The numbers going down each column indicate
how many were correctly and incorrectly classified.
For example, of the 20 cases that were predicted to be in
the purchase group, 18 were correctly predicted, and 2 were incorrectly
predicted (those 2 cases were actually in the would-not-purchase group).
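
The same two-group analysis can be sketched with scikit-learn's LinearDiscriminantAnalysis. This is only an illustration: the file bike_buyers.csv and the column names (Durability, Performance, Looks, Buy) are assumed, and sklearn reports unstandardized scalings rather than SPSS's standardized coefficients.

```python
# Sketch of the two-group discriminant analysis described above.
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix

df = pd.read_csv("bike_buyers.csv")  # hypothetical data file
X = df[["Durability", "Performance", "Looks"]]
y = df["Buy"].to_numpy()             # 1 = purchase, 0 = would not purchase

lda = LinearDiscriminantAnalysis().fit(X, y)

scores = lda.transform(X).ravel()    # one discriminant function for two groups
print("Function coefficients (unstandardized scalings):", lda.scalings_.ravel())
print("Group centroids:", scores[y == 1].mean(), scores[y == 0].mean())

# Analogue of the Predicted Group Membership table
print(confusion_matrix(y, lda.predict(X)))
```

The two group means of the discriminant scores correspond to the Functions at Group Centroids table, and the confusion matrix plays the role of the Predicted Group Membership table.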
LOGISTIC REGRESSION

OUTPUT & INTERPRETATION

 N – This is the number of cases in each category.
 Percent – This is the percent of cases in each category.
 Included in Analysis – This row gives the number and percent of cases that
were included in the analysis. Because we have no missing data in our
example data set, this also corresponds to the total number of cases.
 Missing Cases – This row gives the number and percent of missing
cases. By default, if there is a missing value for any variable in the model,
the entire case is excluded from the analysis.
 Total – This is the sum of the cases that were included in the analysis and
the missing cases. In our example, 45 + 0 = 45.

 The Beginning Block shows the results of the analysis without any of the
independent variables in the model. This serves as a baseline for later
comparison with the model that includes the predictor variables.
 In the Classification Table, the overall percentage of correctly classified
cases is 55.6 per cent. In this block, SPSS simply classified every case
into the larger outcome category (predicting that all cases would
TERMINATE the disease), because that category contained the higher
percentage of cases.
 Block 1 is where the model (the set of predictor variables) is tested. The
Omnibus Tests of Model Coefficients give an overall indication of how well
the model performs over and above the Block 0 results, which were obtained
with none of the predictors entered into the model. This is referred to as
a ‘goodness of fit’ test.
 For this set of results, we want a highly significant value (the Sig. value
should be less than .05). In this case, the value is .000 (which really
means p<.0005). Therefore, the model (with our set of variables used as
predictors) is better than SPSS’s original guess shown in Block 0, which
assumed that everyone would report termination of the disease. The
chi-square value, which we report in our results, is 28.552 with 3
degrees of freedom.

 The results shown in the table headed Hosmer and Lemeshow Test also
support the model as being worthwhile.
 For the Hosmer-Lemeshow Goodness of Fit Test poor fit is indicated by a
significance value less than .05, so to support the model we actually want
a value greater than .05. In our case, the chi-square value for the Hosmer
and Lemeshow Test is 5.375 with a significance level of .614. This value is
larger than .05, therefore indicating support for the model.
 The Cox & Snell R Square and the Nagelkerke R Square values provide an
indication of the amount of variation in the dependent variable explained
by the model (from a minimum value of 0 to a maximum of approximately
1).
 These are described as pseudo R square statistics, rather than the true R
square values that we will see provided in the multiple regression output.
The two values are .470 and .629, suggesting that between 47.0 and 62.9 per
cent of the variability is explained by this set of variables.

 The Classification Table provides us with an indication of how well the
model is able to predict the correct category (terminate/not terminate)
for each case.
 We can compare this with the Classification Table shown for Block 0, to
see how much improvement there is when the predictor variables are
included in the model. The model correctly classified 84.4 per cent of cases
overall, an improvement over the 55.6 per cent in Block 0.
 We were able to correctly classify 75 per cent of the people who do not
want to terminate the disease.
 92 per cent of the people who do want to terminate the disease were
correctly identified (the model’s sensitivity, treating ‘terminate’ as the
positive category).

 The Variables in the Equation table gives us information about the
contribution or importance of each of our predictor variables. The test
used here is known as the Wald test.
 Predictors with Sig. values less than .05 contribute significantly to the
predictive ability of the model. In this case, we have two significant
variables (avdiscl, p = .004; sympsev, p = .011).
 In this example, then, the factors that significantly influence whether a
person wants to terminate the disease are symptom severity and avoidance
of personal disclosure; gender was also entered as a predictor but did not
reach significance.
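
A comparable logistic regression can be run with statsmodels, whose summary reports coefficients with z statistics (their squares correspond to SPSS's Wald chi-square values) and McFadden's pseudo R² (a different pseudo R² from Cox & Snell or Nagelkerke). The file disease_study.csv and the variable names below are assumed stand-ins for the study data.

```python
# Sketch of the logistic regression described above.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("disease_study.csv")  # hypothetical data file
X = sm.add_constant(df[["gender", "sympsev", "avdiscl"]])
y = df["terminate"]                     # 1 = terminate, 0 = not terminate

model = sm.Logit(y, X).fit()
print(model.summary())                  # coefficients, z (Wald) statistics, p-values
print("McFadden pseudo R2:", model.prsquared)

# Analogue of the Classification Table (0.5 cut-off)
predicted = (model.predict(X) >= 0.5).astype(int)
print("Overall % correctly classified:", 100 * (predicted == y).mean())
```

The 0.5 cut-off mirrors SPSS's default classification rule; the block-0 baseline corresponds to predicting the modal category for every case.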
CLUSTER ANALYSIS

HIERARCHICAL CLUSTER ANALYSIS


 STAGE 1 – Clusters 14 and 16 are the first to join, with a coefficient of
1. Here both ‘clusters’ are in reality individual observations; we know
this because of the ‘0’s in the ‘Stage Cluster First Appears’ columns.
This cluster appears again at stage 14, when observation 5 joins it.
 STAGE 24 – Clusters 1 and 4 are joined, so all 25 observations are now
in a single cluster; this final stage has the highest coefficient, 400.240.

 In the icicle plot, the dark columns that merge first are 14 and 16,
forming the first cluster. Similarly, towards the top of the plot all
portions are shaded, as every observation has become part of one big
cluster.
 The agglomeration schedule is hard to interpret on its own, so let us
look at the Dendrogram.
 The dendrogram tells us about the homogeneity of the observations
within a particular cluster.
 To obtain a four-cluster solution we cut the dendrogram at a distance
where it splits into four branches. The final meaning of the four
clusters will be determined in the non-hierarchical analysis.
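
A minimal sketch of this hierarchical step in Python uses SciPy; shoppers.csv is an assumed stand-in for the 25 observations, and Ward linkage is one common agglomeration method, not necessarily the one chosen in the SPSS run.

```python
# Sketch of the hierarchical (agglomerative) clustering step.
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

data = pd.read_csv("shoppers.csv")  # hypothetical file: 25 observations

# Each row of Z is one stage of the agglomeration schedule:
# [cluster i, cluster j, joining coefficient (distance), new cluster size]
Z = linkage(data, method="ward")

dendrogram(Z)   # visual counterpart of the SPSS dendrogram / icicle plot
plt.show()

labels = fcluster(Z, t=4, criterion="maxclust")  # cut to a four-cluster solution
print(labels)
```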
NON-HIERARCHICAL CLUSTER ANALYSIS

 K-Means clustering works iteratively: you start with an initial set of
cluster ‘centres’ and then modify them until the change between two
iterations is small enough.
 Case assignment is repeated until no cluster centre changes appreciably
and convergence is achieved.
 In our case, convergence is achieved after the 2nd iteration.

 In the Final Cluster Centers table, the customers (respondents) have been
assigned to clusters; using it we can describe the clusters.
 CLUSTER 1 – It differs from the other clusters in its low means on new
products (V1), bargaining power (V4) and daily-need products (V5); overall,
every rating is relatively lower.
 CLUSTER 2 – It is most distinguished by relatively higher means on
buying the product (V1) and eating food at the shopping mall (V3).
 CLUSTER 3 – It is distinguished by relatively higher means for
discounted products (V2), bargaining over products (V4) and similarly
priced products (V6).
 So this segment believes that customers like shopping at the mall,
bargaining over products, eating at the food court and comparing prices.

 If the p-value for a variable is greater than α, that variable does not
contribute much to the separation of the clusters.
 INTERPRETATION – Our non-hierarchical results suggest that the cluster
solution adequately discriminates among the observations.
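
The non-hierarchical step can be sketched with scikit-learn's KMeans, again assuming the hypothetical shoppers.csv with variables V1-V6; n_clusters=3 matches the three clusters described above.

```python
# Sketch of the K-Means (non-hierarchical) clustering step.
import pandas as pd
from sklearn.cluster import KMeans

data = pd.read_csv("shoppers.csv")  # hypothetical file with V1-V6 columns

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

print("Iterations until convergence:", km.n_iter_)   # cf. '2nd iteration' above
print("Final cluster centers:")
print(pd.DataFrame(km.cluster_centers_, columns=data.columns))
print("Cluster sizes:", pd.Series(km.labels_).value_counts().sort_index().to_dict())
```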
TWO-STEP CLUSTER ANALYSIS

 MODEL SUMMARY – Using the 6 input variables, the two-step algorithm
produced the cluster solution summarized here.
 CLUSTER QUALITY – The silhouette chart illustrates the overall strength
of the model; in our case it is about 0.5, in the ‘fair’ range.
 INTERPRETATION – No cluster in our data set is more than 2 times as
large as any other cluster.
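
scikit-learn has no direct counterpart to SPSS's TwoStep algorithm, but a loosely analogous idea, letting an information criterion (BIC) choose the number of clusters automatically, can be sketched with Gaussian mixtures. This assumes the six inputs are continuous; shoppers.csv remains a hypothetical file.

```python
# BIC-based selection of the number of clusters, loosely analogous to
# SPSS's TwoStep auto-clustering (not the same algorithm).
import pandas as pd
from sklearn.mixture import GaussianMixture

data = pd.read_csv("shoppers.csv")  # hypothetical file with the 6 inputs

bic = {}
for k in range(1, 7):
    gm = GaussianMixture(n_components=k, random_state=0).fit(data)
    bic[k] = gm.bic(data)  # lower BIC = better fit/complexity trade-off

best_k = min(bic, key=bic.get)
print("BIC by number of clusters:", bic)
print("Selected number of clusters:", best_k)
```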
ANOVA (ONE-WAY & TWO-WAY)

 ONE-WAY ANOVA

Researchers want to test a new anti-anxiety medication.
They split participants into three conditions (0mg, 50mg,
100mg), then ask them to rate their anxiety level on a scale of
1-10. Are there any differences between the three
conditions? Use alpha = 0.05.

0 mg 50 mg 100 mg
9.00 7.00 4.00
8.00 6.00 3.00
7.00 6.00 2.00
8.00 7.00 3.00
8.00 8.00 4.00
9.00 7.00 3.00
7.00 6.00 2.00
8.00 7.00 4.00
9.00 6.00 3.00
8.00 8.00 3.00
OUTPUT & INTERPRETATION

 In the Descriptive Statistics box, the mean for 0mg is 8.10, the mean
for 50mg is 6.80 and the mean for 100mg is 3.10. The standard deviation
for 0mg is 0.73786, the standard deviation for 50mg is 0.78881 (its
standard error is 0.24944) and the standard deviation for 100mg is
0.73786. The number of participants in each condition (N) is 10.

 This table shows the output of the ANOVA analysis and whether there is
a statistically significant difference between our group means. We can
see that the significance value is 0.000 (i.e., p = .000), which is below
0.05; therefore, there is a statistically significant difference in mean
anxiety rating between the three dose conditions. This is good to know,
but we do not yet know which of the specific groups differed. Luckily, we
can find this out in the Multiple Comparisons table, which contains the
results of the Tukey post hoc test.

 From the results so far, we know that there are statistically
significant differences between the groups as a whole. The Multiple
Comparisons table shows which groups differed from each other. The
Tukey post hoc test is generally the preferred test for post hoc
comparisons after a one-way ANOVA, but there are many others.
 We can see from the table that there is a statistically significant
difference in anxiety ratings between the 50mg and 100mg groups
(p = .02), as well as between the 0mg and 100mg groups (p = .002).
The 0mg and 50mg groups also differed significantly (p = .000): since
each of these p-values is below .05, all three pairwise comparisons
are significant.
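
Because the raw anxiety ratings are tabulated above, this one-way ANOVA and the Tukey comparisons can be reproduced directly outside SPSS; here is a minimal sketch with SciPy and statsmodels.

```python
# Reproducing the one-way ANOVA and Tukey post hoc test from the
# anxiety data tabulated above.
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

mg0   = [9, 8, 7, 8, 8, 9, 7, 8, 9, 8]   # 0 mg column
mg50  = [7, 6, 6, 7, 8, 7, 6, 7, 6, 8]   # 50 mg column
mg100 = [4, 3, 2, 3, 4, 3, 2, 4, 3, 3]   # 100 mg column

f_stat, p_value = f_oneway(mg0, mg50, mg100)
print(f"F = {f_stat:.3f}, p = {p_value:.6f}")

# Tukey HSD pairwise comparisons, analogous to SPSS's Multiple Comparisons table
scores = np.concatenate([mg0, mg50, mg100])
groups = np.repeat(["0mg", "50mg", "100mg"], 10)
print(pairwise_tukeyhsd(scores, groups, alpha=0.05))
```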
 TWO-WAY ANOVA
The effective life (in hours) of batteries is compared by material type (1, 2 or 3)
and operating temperature: Low (-10˚C), Medium (20˚C) or High (45˚C). Twelve
batteries are randomly selected from each material type and are then randomly
allocated to each temperature level. The resulting life of all 36 batteries is shown
below:

OUTPUT & INTERPRETATION


 Descriptive Statistics – These provide the mean scores, standard
deviations and N for each subgroup.
 Levene’s Test of Equality of Error Variances – This provides a test of
one of the assumptions underlying analysis of variance. Check the Sig.
value: we want this to be greater than .05, and therefore not significant.
INTERPRETATION – A significant result (Sig. value less than .05) would
suggest that the variance of our dependent variable across the groups is
not equal. Here the Sig. level is .529; as this is larger than .05, we
conclude that we have not violated the homogeneity of variances
assumption.

 Tests of Between-Subjects Effects – This gives the actual result of the
two-way ANOVA, namely whether either of the two independent variables
or their interaction is statistically significant.
INTERPRETATION – The particular rows we are interested in are the
"MATERIAL", "TEMP" and "MATERIAL*TEMP" rows. These inform us whether our
independent variables (the "MATERIAL" and "TEMP" rows) and their
interaction (the "MATERIAL*TEMP" row) have a statistically significant
effect on the dependent variable, "LIFE". It is important to look at the
"MATERIAL*TEMP" interaction first, as this determines how the main
effects can be interpreted. You can see from the "Sig." column that we
have a statistically significant interaction at the p = .019 level. You
may also wish to report the results of "MATERIAL" and "TEMP", but these
need to be interpreted in the context of the interaction result. We can
see from the table that both main effects are also statistically
significant: MATERIAL (p = .002) and TEMP (p < .0005).

 Material Comparison – There is some repetition of the results in this
table, but regardless of which row we read from, we are interested in the
pairwise differences between (1) material types 1 and 2, (2) material
types 2 and 3, and (3) material types 1 and 3. From the results, we can
see that there is a statistically significant difference between all
three pairs (p < .05).
 PLOTS – The profile plot shows the estimated marginal means of battery
life for each material type across the temperature levels. This plot is
very useful for visually inspecting the relationship among our variables.
 Results from the Two-Way ANOVA –
 A two-way between-groups analysis of variance was conducted to
explore the life of batteries as a function of TEMPERATURE and
MATERIAL.
 Batteries were classified by material type (1, 2 or 3) and operating
temperature (-10˚C, 20˚C or 45˚C).
 There was a statistically significant interaction between MATERIAL
and TEMPERATURE on battery life, F(4, 27) = 3.460, p = .019.
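
A sketch of the same two-way ANOVA using statsmodels' formula interface. Since the raw battery table is not reproduced here, the file battery_life.csv and its columns (LIFE, MATERIAL, TEMP) are assumed stand-ins.

```python
# Sketch of a two-way ANOVA with interaction, analogous to the
# Tests of Between-Subjects Effects table above.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("battery_life.csv")  # hypothetical file: LIFE, MATERIAL, TEMP

# C(...) treats MATERIAL and TEMP as categorical factors;
# '*' expands to both main effects plus their interaction.
model = smf.ols("LIFE ~ C(MATERIAL) * C(TEMP)", data=df).fit()
print(anova_lm(model, typ=2))         # F and p for MATERIAL, TEMP, MATERIAL*TEMP
```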
