Professor Melcon
STA106
A.
The data reflect the number of times a helicopter was called for an emergency during certain periods. We assume that an emergency call for one helicopter does not affect another helicopter's chance of being called.
B.
C.
The outliers are located at observations 1, 4, 15, 16, 67, 69, and 80. Removing these data points removes roughly 9% of the data, which is an extreme amount to lose. However, removing these outliers appears to satisfy the ANOVA assumptions based on the normal QQ plot, the errors-vs-groups plot, and the boxplot. Because these graphs are subjective, we use the Shapiro-Wilk and Brown-Forsythe tests to support our claim.
Applying the Shapiro-Wilk test gives a p-value of .0124, which rejects the null hypothesis; thus, the errors are not normally distributed. Applying the Brown-Forsythe test gives a p-value of .1718, which fails to reject the null hypothesis, so we conclude that the variances are equal.
In conclusion, the data with outliers removed has equal variances, but the errors are still not normally distributed; therefore, the removed-outlier model also violates the ANOVA assumptions.
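The two diagnostic tests used throughout this report can be sketched in Python with scipy. This is a minimal sketch on hypothetical placeholder data, not the helicopter data; note that scipy's `levene` with `center='median'` is the Brown-Forsythe test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical stand-in for the study data: three groups of observations.
groups = [rng.normal(10, 2, 25), rng.normal(12, 2, 25), rng.normal(11, 2, 25)]

# Residuals (errors) = each observation minus its group mean.
residuals = np.concatenate([g - g.mean() for g in groups])

# Shapiro-Wilk: H0 = the errors are normally distributed.
# A small p-value (< 0.05) rejects H0, giving evidence of non-normality.
sw_stat, sw_p = stats.shapiro(residuals)

# Brown-Forsythe: Levene's test centered at the group medians.
# H0 = the group variances are equal; a large p-value fails to reject H0.
bf_stat, bf_p = stats.levene(*groups, center='median')

print(f"Shapiro-Wilk p = {sw_p:.4f}, Brown-Forsythe p = {bf_p:.4f}")
```

In both tests, failing to reject the null hypothesis is the outcome consistent with the ANOVA assumptions.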
Criterion        Transformed   Transform/Remove
PPCC             -0.2568       -0.1464
Shapiro-Wilk     -0.2322        0.4070
Log-likelihood   -0.3961       -0.1392
Both the original data and the removed-outlier data failed to satisfy the ANOVA assumptions. Therefore, we apply a Box-Cox transformation to the original data. Afterwards, we find the best λ from the Shapiro-Wilk, PPCC, and log-likelihood criteria; the log-likelihood gives the lowest value.
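The Box-Cox step can be sketched with scipy, which estimates λ either by maximizing the log-likelihood (`method='mle'`) or the probability-plot correlation coefficient (`method='pearsonr'`). The data below is a hypothetical placeholder, not the helicopter data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical positive, right-skewed data (Box-Cox requires y > 0).
y = rng.lognormal(mean=2.0, sigma=0.5, size=100)

# Lambda chosen by maximizing the Box-Cox log-likelihood.
lam_mle = stats.boxcox_normmax(y, method='mle')

# Lambda chosen by maximizing the PPCC.
lam_ppcc = stats.boxcox_normmax(y, method='pearsonr')

# Apply the transformation with a chosen lambda.
y_trans = stats.boxcox(y, lmbda=lam_mle)

print(f"lambda (MLE) = {lam_mle:.4f}, lambda (PPCC) = {lam_ppcc:.4f}")
```

The Shapiro-Wilk-based choice of λ would be found by a similar grid search over λ, maximizing the Shapiro-Wilk statistic of the transformed data.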
From the boxplot, it appears that there is one outlier in each of the lower and upper tails. The errors-by-group plot also suggests that the group variances are unequal.
Applying the Shapiro-Wilk test gives a p-value of .0673, which fails to reject the null hypothesis, so the errors are plausibly normally distributed. Applying the Brown-Forsythe test gives a p-value of .6597, which also fails to reject the null hypothesis, so the variances are equal.
Finally, we check the transformed and outlier-removed data against the ANOVA assumptions. The Shapiro-Wilk test gives a p-value of .2572, which fails to reject the null hypothesis, so the errors are normally distributed. The Brown-Forsythe test gives a p-value of .2606, which also fails to reject the null hypothesis; therefore, the population variances are equal.
D.
In conclusion, the transformed and outlier-removed data satisfies the ANOVA assumptions. Since the original data, the removed-outlier data, and the transformed data all failed the ANOVA assumptions, the transformed and outlier-removed data is considered the best model. Despite the better fit, there are still drawbacks: the data is now harder to interpret and even harder to revert back to the original scale. Even though the model has improved, removing the outliers also removes potentially important information contributed by those outliers.
Interpreting transformed data helps us visualize equal variances and normality of errors. There are times when the original data cannot express what we are looking for; therefore, changing the data, whether by transformation or by removing outliers, improves our understanding of the subject.
I.
The data come from a random sample of technology workers in Seattle and San Francisco. The data covers three job titles: data scientist, software engineer, and bioinformatics engineer, together with their annual salaries. By comparing this small sample of jobs, we can draw inferences about the population annual salaries for these jobs.
The approaches we plan to use are the ANOVA test, F-test, hypothesis testing, Shapiro-Wilk, Brown-Forsythe, transformation of data, removal of outliers, coefficient of partial determination, interaction, confidence intervals, and QQ normality plots.
II.
Assumptions:
1. All subjects are randomly sampled / independent.
2. All levels of factor A (Profession) are independent.
3. All levels of factor B (Region) are independent.
4.
From the interaction plot, the Data scientist and Bioinformatics engineer lines are parallel to each other across their respective regions. However, the Data scientist and Software engineer lines decrease for the Seattle region. From the interaction plot, we therefore suspect that there is an interaction.
III.
The histogram of the salaries is roughly symmetric; therefore, the histogram suggests that the data is normally distributed. The boxplot also shows the data to be roughly normal. However, Bioinformatics in Seattle contains an outlier, which affects the data and may violate the ANOVA assumptions. As shown in the boxplot, two groups appear larger than the others: Data scientist in Seattle and Data scientist in San Francisco. From the boxplot, it appears that the data violates the normality and equal-variance assumptions. Despite the outlier, we plan to leave it in the data.
IV.
From our assumption that there is an interaction, the model should look like this:
Y_ijk = μ + α_i + β_j + (αβ)_ij + ε_ijk
From our interaction plot, we assume our model contains an interaction. To support our claim, we test the interaction model against the no-interaction model.
Conditional percentage of reduction in error:
Test statistics:

          AB         A+B        A          B          Empty
SSE       15252.93   16058.34   17764.09   39872.94   41578.69
P-value   .0532      .0006      2.2*10
From the coefficient of partial determination (partial R²), we find that there does not exist an interaction when we test for the interaction effect. When we add an interaction effect to a model with factor A (profession) and factor B (region) effects, the overall reduction in error is about 5%. The F-test gives a p-value of .0532, which fails to reject the null hypothesis, so we conclude that there is no interaction effect. Since adding an interaction effect does not meaningfully reduce the error, we should use the no-interaction model.
From the partial R² values, when we add factor A (profession) to a model with factor B (region), the overall reduction in error is roughly 60%, and when we add factor B to a model with factor A, the overall reduction in error is roughly 10%. The respective F-tests also reject their null hypotheses, meaning that both the factor A and factor B effects exist.
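The partial R² values follow directly from the SSE table; a quick check of the arithmetic:

```python
# SSE values from the model-comparison table above.
sse = {'AB': 15252.93, 'A+B': 16058.34, 'A': 17764.09,
       'B': 39872.94, 'Empty': 41578.69}

def partial_r2(sse_reduced, sse_full):
    """Proportional reduction in error from adding terms to the reduced model."""
    return (sse_reduced - sse_full) / sse_reduced

# Adding the interaction AB to the additive model A+B: about 5%.
r2_ab = partial_r2(sse['A+B'], sse['AB'])

# Adding factor A (profession) to a model with factor B (region): about 60%.
r2_a = partial_r2(sse['B'], sse['A+B'])

# Adding factor B (region) to a model with factor A (profession): about 10%.
r2_b = partial_r2(sse['A'], sse['A+B'])

print(f"R2(AB|A,B) = {r2_ab:.3f}, R2(A|B) = {r2_a:.3f}, R2(B|A) = {r2_b:.3f}")
```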
df(SSE) = n_T - a - b + 1 (the error degrees of freedom for the no-interaction model).
The multiplier can be one of Bonferroni, Tukey, or Scheffe:
Bonferroni: B = 2.684
Tukey: T = 2.374
Scheffe: S = 2.48
Bonferroni is a good multiplier for any type of confidence interval. Tukey is a good
multiplier for pairwise confidence intervals. Finally, Scheffe is a good multiplier for contrast
confidence intervals. We plan to use Tukey for the pairwise confidence intervals and Scheffe for
the contrasts. If we wanted to compare them equivalently, we would use Bonferroni, since it can
be used in any type of confidence interval.
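These multipliers can be computed with scipy. The α, number of intervals g, number of group means k, and error degrees of freedom below are hypothetical placeholders, since the report does not restate the sample sizes at this point; with the actual values they would reproduce the numbers quoted above.

```python
import numpy as np
from scipy import stats

def bonferroni_mult(alpha, g, df_error):
    # t multiplier with alpha split across g simultaneous intervals.
    return stats.t.ppf(1 - alpha / (2 * g), df_error)

def tukey_mult(alpha, k, df_error):
    # Studentized-range multiplier for all pairwise comparisons of k means.
    return stats.studentized_range.ppf(1 - alpha, k, df_error) / np.sqrt(2)

def scheffe_mult(alpha, df_num, df_error):
    # Scheffe multiplier covering all possible contrasts.
    return np.sqrt(df_num * stats.f.ppf(1 - alpha, df_num, df_error))

# Hypothetical example: alpha = 0.05, g = 6 intervals, k = 6 means, df_error = 54.
print(bonferroni_mult(0.05, 6, 54),
      tukey_mult(0.05, 6, 54),
      scheffe_mult(0.05, 5, 54))
```

Note that the Tukey multiplier grows with the number of means k, and the Bonferroni multiplier grows with the number of intervals g, which is why the cheapest valid multiplier depends on the type of comparison.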
Contrast (including pairwise):

      Seattle    San Francisco
BE     2.438      -2.438
DS     1.1493     -1.1493
SE    -3.5874     3.5874

From the table above, the estimates in each row and column sum to zero, and their small magnitudes make sense, because there is no significant interaction effect.
Best model:
Y_ijk = μ + α_i + β_j + ε_ijk (the no-interaction model)
Confidence Intervals:
In the first confidence interval, the true annual salary of a Bioinformatics engineer is lower than that of the Data scientist.
In the second confidence interval, the true annual salary of a Bioinformatics engineer is lower than that of the Software engineer.
In the third confidence interval, the true annual salary of a Data scientist is higher than that of the Software engineer.
In the fourth confidence interval, the true annual salary in San Francisco is not significantly different from that in Seattle.
In the fifth confidence interval, the true average annual salary of the Bioinformatics engineer and Data scientist together is higher than that of the Software engineer.
In the sixth confidence interval, the true average annual salary of the Bioinformatics engineer and Software engineer together is lower than that of the Data scientist.
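Each of these intervals has the same form, point estimate ± multiplier × standard error. A minimal sketch for a pairwise interval; the means, MSE, sample sizes, and multiplier below are hypothetical placeholders, not the salary data:

```python
import math

def pairwise_ci(mean_i, mean_j, mse, n_i, n_j, multiplier):
    """CI for mu_i - mu_j: (ybar_i - ybar_j) +/- multiplier * SE."""
    se = math.sqrt(mse * (1.0 / n_i + 1.0 / n_j))
    diff = mean_i - mean_j
    return diff - multiplier * se, diff + multiplier * se

# Hypothetical example: two group means, an MSE from the ANOVA table,
# 10 observations per group, and the Tukey multiplier quoted above.
lo, hi = pairwise_ci(95.0, 110.0, 280.0, 10, 10, 2.374)
print(f"({lo:.2f}, {hi:.2f})")
```

An interval that excludes zero corresponds to a significant pairwise difference; an interval containing zero, as in the fourth interval above, corresponds to no significant difference.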
V.
We tested the interaction model against the no-interaction model and found, from the partial R² and the F-test, that the better model is the no-interaction model. It appears that the Bioinformatics engineer has the lowest salary of the three professions and the Data scientist has the highest. The boxplot made it appear that one region had higher wages than the other; in actuality, both regions provide similar salaries.
VI.
The ANOVA assumptions might have been violated; therefore, our conclusions may not be dependable. From the boxplot, histogram, and interaction plots, it seemed as though there was an interaction effect in the model. However, from the ANOVA test and the partial R² test, the best model is the no-interaction model. Overall, the region does not drive differences in salary; rather, the profession is the main cause of salary differences. Therefore, the profession with the highest annual salary is the Data scientist.