Professor Melcon
STA106
A.
The data reflect the number of times a helicopter was called for an emergency during certain periods. We assume that an emergency call for one helicopter does not affect another helicopter's chance of being called.
B.
C.
The outliers are located at observations 1, 4, 15, 16, 67, 69, and 80. Removing these data points removes roughly 9% of the data, which is an extreme amount to lose. However, removing these outliers appears to satisfy the ANOVA assumptions based on the normal QQ plot, the errors-vs-groups plot, and the boxplot. Because these graphs are subjective, we use the Shapiro-Wilk and Brown-Forsythe tests to support our claim.
Applying the Shapiro-Wilk test gives a p-value of .0124, which rejects the null hypothesis; thus, the errors are not normally distributed. Applying the Brown-Forsythe test gives a p-value of .1718, which fails to reject the null hypothesis, so we conclude that the variances are equal.
In conclusion, the data with outliers removed has equal variances, but the errors are still not normally distributed; therefore, the removed-outlier model also violates the ANOVA assumptions.
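The two diagnostic tests used throughout this report can be sketched in Python with scipy. This is a minimal sketch on hypothetical placeholder data, not the helicopter data; note that scipy's `levene` with `center='median'` is the Brown-Forsythe test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical stand-in for the study data: three groups of observations.
groups = [rng.normal(10, 2, 25), rng.normal(12, 2, 25), rng.normal(11, 2, 25)]

# Residuals (errors) = each observation minus its group mean.
residuals = np.concatenate([g - g.mean() for g in groups])

# Shapiro-Wilk: H0 = the errors are normally distributed.
# A small p-value (< 0.05) rejects H0, giving evidence of non-normality.
sw_stat, sw_p = stats.shapiro(residuals)

# Brown-Forsythe: Levene's test centered at the group medians.
# H0 = the group variances are equal; a large p-value fails to reject H0.
bf_stat, bf_p = stats.levene(*groups, center='median')

print(f"Shapiro-Wilk p = {sw_p:.4f}, Brown-Forsythe p = {bf_p:.4f}")
```

In both tests, failing to reject the null hypothesis is the outcome consistent with the ANOVA assumptions.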
Criterion        Transformed   Transform/Remove
PPCC             -0.2568       -0.1464
Shapiro-Wilk     -0.2322        0.4070
Log-likelihood   -0.3961       -0.1392
Both the original data and the removed-outlier data failed to satisfy the ANOVA assumptions. Therefore, we apply a Box-Cox transformation to the original data. Afterwards, we find the best λ from the Shapiro-Wilk, PPCC, and log-likelihood criteria; the log-likelihood gives the lowest value.
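The Box-Cox step can be sketched with scipy, which estimates λ either by maximizing the log-likelihood (`method='mle'`) or the probability-plot correlation coefficient (`method='pearsonr'`). The data below is a hypothetical placeholder, not the helicopter data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical positive, right-skewed data (Box-Cox requires y > 0).
y = rng.lognormal(mean=2.0, sigma=0.5, size=100)

# Lambda chosen by maximizing the Box-Cox log-likelihood.
lam_mle = stats.boxcox_normmax(y, method='mle')

# Lambda chosen by maximizing the PPCC.
lam_ppcc = stats.boxcox_normmax(y, method='pearsonr')

# Apply the transformation with a chosen lambda.
y_trans = stats.boxcox(y, lmbda=lam_mle)

print(f"lambda (MLE) = {lam_mle:.4f}, lambda (PPCC) = {lam_ppcc:.4f}")
```

The Shapiro-Wilk-based choice of λ would be found by a similar grid search over λ, maximizing the Shapiro-Wilk statistic of the transformed data.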
From the boxplot, it appears that there is one outlier in each of the lower and upper tails. The errors-by-group plot also suggests that the group variances are unequal.
Applying the Shapiro-Wilk test gives a p-value of .0673, which fails to reject the null hypothesis, so the errors are plausibly normally distributed. Applying the Brown-Forsythe test gives a p-value of .6597, which also fails to reject the null hypothesis, so the variances are equal.
Finally, we check the transformed and outlier-removed data against the ANOVA assumptions. The Shapiro-Wilk test gives a p-value of .2572, which fails to reject the null hypothesis, so the errors are normally distributed. The Brown-Forsythe test gives a p-value of .2606, which also fails to reject the null hypothesis; therefore, the population variances are equal.
D.
In conclusion, the transformed and outlier-removed data satisfies the ANOVA assumptions. Since the original data, the removed-outlier data, and the transformed data all failed the ANOVA assumptions, the transformed and outlier-removed data is considered the best model. Despite the better fit, there are still drawbacks: the data is now harder to interpret and even harder to revert back to the original scale. Even though the model has improved, removing the outliers also removes potentially important information contributed by those outliers.
Interpreting transformed data helps us visualize equal variances and normality of errors. There are times when the original data cannot express what we are looking for; therefore, changing the data, whether by transformation or by removing outliers, improves our understanding of the subject.
I.
The data come from a random sample of technology workers in Seattle and San Francisco. The data covers three job titles: data scientist, software engineer, and bioinformatics engineer, together with their annual salaries. By comparing this small sample of jobs, we can draw inferences about the population annual salaries for these jobs.
The approaches we plan to use are the ANOVA test, F-test, hypothesis testing, Shapiro-Wilk, Brown-Forsythe, transformation of data, removal of outliers, coefficient of partial determination, interaction, confidence intervals, and QQ normality plots.
II.
Assumptions:
1. All subjects are randomly sampled / independent.
2. All levels of factor A (Profession) are independent.
3. All levels of factor B (Region) are independent.
4.
From the interaction plot, the Data scientist and Bioinformatics engineer lines are parallel to each other across their respective regions. However, the Data scientist and Software engineer lines decrease for the Seattle region. From the interaction plot, we therefore suspect that there is an interaction.
III.
The histogram of the salaries is roughly symmetric; therefore, the histogram suggests that the data is normally distributed. The boxplot also shows the data to be roughly normal. However, Bioinformatics in Seattle contains an outlier, which affects the data and may violate the ANOVA assumptions. As shown in the boxplot, two groups appear larger than the others: Data scientist in Seattle and Data scientist in San Francisco. From the boxplot, it appears that the data violates the normality and equal-variance assumptions. Despite the outlier, we plan to leave it in the data.
IV.
From our assumption that there is an interaction, the model should look like this:
Y_ijk = μ + α_i + β_j + (αβ)_ij + ε_ijk
From our interaction plot, we assume our model contains an interaction. To support our claim, we test the interaction model against the no-interaction model.
Conditional percentage of reduction in error:
Test statistics:

          AB         A+B        A          B          Empty
SSE       15252.93   16058.34   17764.09   39872.94   41578.69
P-value   .0532      .0006      2.2*10
From the coefficient of partial determination (partial R²), we find that there does not exist an interaction when we test for the interaction effect. When we add an interaction effect to a model with factor A (profession) and factor B (region) effects, the overall reduction in error is about 5%. The F-test gives a p-value of .0532, which fails to reject the null hypothesis, so we conclude that there is no interaction effect. Since adding an interaction effect does not meaningfully reduce the error, we should use the no-interaction model.
From the partial R² values, when we add factor A (profession) to a model with factor B (region), the overall reduction in error is roughly 60%, and when we add factor B to a model with factor A, the overall reduction in error is roughly 10%. The respective F-tests also reject their null hypotheses, meaning that both the factor A and factor B effects exist.
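The partial R² values follow directly from the SSE table; a quick check of the arithmetic:

```python
# SSE values from the model-comparison table above.
sse = {'AB': 15252.93, 'A+B': 16058.34, 'A': 17764.09,
       'B': 39872.94, 'Empty': 41578.69}

def partial_r2(sse_reduced, sse_full):
    """Proportional reduction in error from adding terms to the reduced model."""
    return (sse_reduced - sse_full) / sse_reduced

# Adding the interaction AB to the additive model A+B: about 5%.
r2_ab = partial_r2(sse['A+B'], sse['AB'])

# Adding factor A (profession) to a model with factor B (region): about 60%.
r2_a = partial_r2(sse['B'], sse['A+B'])

# Adding factor B (region) to a model with factor A (profession): about 10%.
r2_b = partial_r2(sse['A'], sse['A+B'])

print(f"R2(AB|A,B) = {r2_ab:.3f}, R2(A|B) = {r2_a:.3f}, R2(B|A) = {r2_b:.3f}")
```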
df(SSE) = n_T - a - b + 1 (the error degrees of freedom for the no-interaction model).
The multiplier can be one of Bonferroni, Tukey, or Scheffe:
Bonferroni: B = 2.684
Tukey: T = 2.374
Scheffe: S = 2.48
Bonferroni is a good multiplier for any type of confidence interval. Tukey is a good
multiplier for pairwise confidence intervals. Finally, Scheffe is a good multiplier for contrast
confidence intervals. We plan to use Tukey for the pairwise confidence intervals and Scheffe for
the contrasts. If we wanted to compare them equivalently, we would use Bonferroni, since it can
be used in any type of confidence interval.
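These multipliers can be computed with scipy. The α, number of intervals g, number of group means k, and error degrees of freedom below are hypothetical placeholders, since the report does not restate the sample sizes at this point; with the actual values they would reproduce the numbers quoted above.

```python
import numpy as np
from scipy import stats

def bonferroni_mult(alpha, g, df_error):
    # t multiplier with alpha split across g simultaneous intervals.
    return stats.t.ppf(1 - alpha / (2 * g), df_error)

def tukey_mult(alpha, k, df_error):
    # Studentized-range multiplier for all pairwise comparisons of k means.
    return stats.studentized_range.ppf(1 - alpha, k, df_error) / np.sqrt(2)

def scheffe_mult(alpha, df_num, df_error):
    # Scheffe multiplier covering all possible contrasts.
    return np.sqrt(df_num * stats.f.ppf(1 - alpha, df_num, df_error))

# Hypothetical example: alpha = 0.05, g = 6 intervals, k = 6 means, df_error = 54.
print(bonferroni_mult(0.05, 6, 54),
      tukey_mult(0.05, 6, 54),
      scheffe_mult(0.05, 5, 54))
```

Note that the Tukey multiplier grows with the number of means k, and the Bonferroni multiplier grows with the number of intervals g, which is why the cheapest valid multiplier depends on the type of comparison.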
Contrast (including pairwise):

      Seattle    San Francisco
BE     2.438      -2.438
DS     1.1493     -1.1493
SE    -3.5874     3.5874

From the table above, the estimates in each row and column sum to zero, and their small magnitudes make sense, because there is no significant interaction effect.
Best model:
Y_ijk = μ + α_i + β_j + ε_ijk (the no-interaction model)
Confidence Intervals:
In the first confidence interval, the true annual salary of a Bioinformatics engineer is lower than that of the Data scientist.
In the second confidence interval, the true annual salary of a Bioinformatics engineer is lower than that of the Software engineer.
In the third confidence interval, the true annual salary of a Data scientist is higher than that of the Software engineer.
In the fourth confidence interval, the true annual salary in San Francisco is not significantly different from that in Seattle.
In the fifth confidence interval, the true average annual salary of the Bioinformatics engineer and Data scientist together is higher than that of the Software engineer.
In the sixth confidence interval, the true average annual salary of the Bioinformatics engineer and Software engineer together is lower than that of the Data scientist.
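Each of these intervals has the same form, point estimate ± multiplier × standard error. A minimal sketch for a pairwise interval; the means, MSE, sample sizes, and multiplier below are hypothetical placeholders, not the salary data:

```python
import math

def pairwise_ci(mean_i, mean_j, mse, n_i, n_j, multiplier):
    """CI for mu_i - mu_j: (ybar_i - ybar_j) +/- multiplier * SE."""
    se = math.sqrt(mse * (1.0 / n_i + 1.0 / n_j))
    diff = mean_i - mean_j
    return diff - multiplier * se, diff + multiplier * se

# Hypothetical example: two group means, an MSE from the ANOVA table,
# 10 observations per group, and the Tukey multiplier quoted above.
lo, hi = pairwise_ci(95.0, 110.0, 280.0, 10, 10, 2.374)
print(f"({lo:.2f}, {hi:.2f})")
```

An interval that excludes zero corresponds to a significant pairwise difference; an interval containing zero, as in the fourth interval above, corresponds to no significant difference.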
V.
We tested the interaction model against the no-interaction model and found, from the partial R² and the F-test, that the better model is the no-interaction model. It appears that the Bioinformatics engineer has the lowest salary of the three professions and the Data scientist has the highest. The boxplot made it appear that one region had higher wages than the other; in actuality, both regions provide similar salaries.
VI.
The ANOVA assumptions might have been violated; therefore, our conclusions may not be dependable. From the boxplot, histogram, and interaction plots, it seemed as though there was an interaction effect in the model. However, from the ANOVA test and the partial R² test, the best model is the no-interaction model. Overall, the region does not drive differences in salary; rather, the profession is the main cause of salary differences. Therefore, the profession with the highest annual salary is the Data scientist.