Alyssa Winn
Skittles Statistical Analysis
Term Project
At the beginning of the semester each student in the class purchased a 2.17 ounce bag
of Skittles that we would use as a sample of the total population of Skittles produced.
Throughout the semester we worked individually and as a group to analyze the data
from our individual bag and from all bags of candies collected by the class. As we
learned statistical concepts we applied them at various stages of the project. As a group
we collaborated to analyze the data and apply confidence intervals and hypothesis
tests. As an individual I had to draw conclusions from the data and demonstrate my
understanding of the process.
The following is the combination of my individual and my group’s work during the five
stages of the project.
The expected proportion for each of Red, Orange, Yellow, Green, and Purple is 20%. This is
based on the assumption that each color has an equal chance of appearing. In reality, even though the
Skittles are distributed by standardized processes and machinery, some variability will still be
introduced. Therefore, it is highly unlikely that each color will account for exactly 20% of any given 2.17oz bag.
*Inserting the graphs into this document caused distortion, so they may be more difficult to read. The program used to create the
graphs only allowed colors to be randomly assigned, so the colors shown on the pie chart don’t match the candy colors.
Yes, the data represents a random sampling of 2.17oz bags of Skittles, at least within the Salt
Lake City, Utah area. The population represented by this sample is all 2.17oz bags of Skittles
available for purchase. The bags were presumably purchased from various (and somewhat
unique) stores by each member of the class, though likely these stores were conveniently
accessible for each student. The results could perhaps be distorted if the production process,
delivery process, or availability of 2.17oz bags of Skittles were different for this geographic region
and, in particular, for the stores that were most convenient to the students. A better
representation of the population would likely come from purchasing 2.17oz bags of Skittles from
different geographic locations and different stores, varying the dates and times of purchase leading
up to the assignment. This would probably provide a better sampling of the population since it
increases the chances of purchasing bags of Skittles from different production groups.
My Bag (count per color): 16, 6, 13, 14, 11 (total: 60)
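As a rough illustration of how far a single bag can drift from the expected 20% per color, the counts from the "My Bag" row can be converted to proportions. This is only a sketch: the row does not label which count belongs to which color, so the counts are indexed by position, and the variable names are illustrative.

```python
# Counts from the "My Bag" row, in the order they appear there.
# The row does not say which color each count belongs to, so positions are used.
bag_counts = [16, 6, 13, 14, 11]

total = sum(bag_counts)             # total candies in the bag
expected = 1 / len(bag_counts)      # equal share per color: 0.20

# Observed share of each color and its gap from the expected 20%.
proportions = [count / total for count in bag_counts]
deviations = [p - expected for p in proportions]

for position, (count, p, d) in enumerate(zip(bag_counts, proportions, deviations), start=1):
    print(f"color {position}: count={count:2d}  proportion={p:.3f}  deviation={d:+.3f}")
```

The smallest count (6 of 60) is half of the expected share, which fits the observation that one bag on its own is not evenly split.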
Before starting this project I had the expectation that each color would be represented rather
equally in each bag. However, in my bag that wasn’t the case. So I then thought that if what began as
equal amounts of the colors was then divided amongst the bags, each bag may have different
proportions of each color, but when all the samples were brought back together they would reflect
the equal proportions of colors in the total population. Our group’s Pareto chart reflects that
concept because the frequency of each color is relatively similar. In the initial PDF of data from
the class there were two sets of outliers where the frequency of the colors was significantly higher
than the other occurrences. This is probably due to the students purchasing the incorrect bag
size. If those numbers hadn’t been omitted in our Modified Data PDF, they would have created
misleading graphs because the colors wouldn’t be shown from the same bag size. They would also
have affected the mean and median later in our project.
Graphs that describe categorical data need to help compare the data without attaching
numerical values. It is best to use a Pareto diagram or bar graph when representing categorical
data because it is easy to compare the number of occurrences of each category side by
side. Pie charts can also represent categorical data.
Graphs that describe quantitative data can vary depending on how the values are being
interpreted. A histogram is best used with quantitative data because it can describe the spread
and center and identify the different classes within the data. A stem and leaf plot can only be used
with quantitative data because it seeks to organize the numerical values. The same can be said
about scatter plots and time-series graphs.
The three requirements that must be met to construct a confidence interval for a population
proportion are:
1. The sample was obtained through a simple random sample since several students
obtained a 2.17oz bag of Skittles from various and (at least somewhat) unique locations.
2. np̂(1 − p̂) ≥ 10, where n = 4394, p̂ = 0.205, and 1 − p̂ = 0.795
4394 × 0.205 × (1 − 0.205) = 716.112
716.112 ≥ 10 ✓ Verified
3. Skittles were sampled from 74 bags out of millions sold. It is therefore reasonable to
assume that the sample size is less than 5% of the population size (n ≤ 0.05N).
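The two numeric requirements in the list above can be checked with a few lines of Python. This is a minimal sketch using the values stated in the list; the variable names are illustrative, and the population size N is not known, so the code only reports how large N would need to be for the 5% condition to hold.

```python
# Values taken from the requirements list above.
n = 4394        # total candies sampled
p_hat = 0.205   # sample proportion

# Requirement 2: n * p_hat * (1 - p_hat) must be at least 10.
success_failure = n * p_hat * (1 - p_hat)
requirement_2_met = success_failure >= 10

# Requirement 3: n <= 0.05 * N, which is equivalent to N >= n / 0.05.
# N itself is unknown ("millions sold"), so only the threshold is computed.
min_population = n / 0.05

print(f"n * p_hat * (1 - p_hat) = {success_failure:.3f} (>= 10: {requirement_2_met})")
print(f"5% condition holds if the population has at least {min_population:,.0f} candies")
```

Under these numbers, any population of roughly 88,000 candies or more satisfies the third requirement, so "millions sold" clears it comfortably.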
The 99% confidence interval is (0.189, 0.221).
Lower and upper bounds:
\[ \hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]
where $\alpha = 0.01$ and $z_{\alpha/2} = z_{0.005} = 2.5758$.
Lower:
\[ 0.205 - 2.5758 \sqrt{\frac{0.205(1-0.205)}{4394}} = 0.189 \]
Upper:
\[ 0.205 + 2.5758 \sqrt{\frac{0.205(1-0.205)}{4394}} = 0.221 \]
The margin of error is equal to
\[ \frac{\text{upper limit} - \text{lower limit}}{2} = \frac{0.221 - 0.189}{2} = 0.016 \text{ or } 1.6\% \]
The confidence interval, 0.205 ± 0.016, indicates that if a large number of different samples were
obtained, we would expect 99% of the resulting intervals to contain the population proportion of
yellow candies out of all candies.
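The 99% interval for the proportion can be reproduced with Python's standard library. This is a sketch under the same inputs as the calculation above (n = 4394, p̂ = 0.205); the variable names are illustrative, and the critical value is computed from the standard normal distribution rather than read from a table.

```python
from math import sqrt
from statistics import NormalDist

n = 4394            # total candies in the class sample
p_hat = 0.205       # sample proportion of yellow candies
confidence = 0.99

# Critical value z_{alpha/2} for a two-sided interval.
alpha = 1 - confidence
z = NormalDist().inv_cdf(1 - alpha / 2)

# Standard error of the sample proportion, then the margin of error.
standard_error = sqrt(p_hat * (1 - p_hat) / n)
margin = z * standard_error

lower = p_hat - margin
upper = p_hat + margin

print(f"z = {z:.4f}")
print(f"margin of error = {margin:.3f}")
print(f"99% CI = ({lower:.3f}, {upper:.3f})")
```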
Construct a 90% confidence interval estimate for the population mean number of candies per bag
Sample mean number of candies per bag ($\bar{x}$):
\[ \bar{x} = \frac{\sum \text{candies in each bag}}{\text{number of bags}} = \frac{4394}{74} = 59.4 \text{ candies per bag} \]
Since we have the sample mean, we will construct a confidence interval for a population mean
(μ) :
The two requirements that must be met to construct a confidence interval for a population mean
are:
1. The sample was obtained through a simple random sample since several students
obtained a 2.17oz bag of Skittles from various and (at least somewhat) unique locations.
2. n = 74 ≥ 30 ✓Verified
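The 90% confidence interval is (58.8, 59.9).
Lower and upper bounds:
\[ \bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}} \]
where $\alpha = 0.10$, $t_{\alpha/2} = t_{0.05} = 1.6660$ (with $n - 1 = 73$ degrees of freedom), and $s = 2.812412$.
Lower:
\[ 59.4 - 1.6660 \times \frac{2.812412}{\sqrt{74}} \approx 58.8 \]
Upper:
\[ 59.4 + 1.6660 \times \frac{2.812412}{\sqrt{74}} \approx 59.9 \]
The bounds are computed with the unrounded sample mean, 4394/74 ≈ 59.378, which is shown here rounded to 59.4.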
The margin of error is equal to:
\[ \frac{\text{upper limit} - \text{lower limit}}{2} = \frac{59.9 - 58.8}{2} = 0.55 \text{ candies per bag} \]
This confidence interval, 59.4 ± 0.55, indicates that if a large number of different samples were
obtained, we would expect 90% of the resulting intervals to contain the population mean number of
candies per bag.
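The 90% interval for the mean can be checked the same way. This sketch takes the critical value t = 1.6660 and the standard deviation s = 2.812412 directly from the calculation above (the standard library has no t distribution, so the value is not recomputed); the variable names are illustrative.

```python
from math import sqrt

total_candies = 4394   # candies across all bags
n_bags = 74            # number of bags in the class sample
s = 2.812412           # sample standard deviation, as reported above
t_crit = 1.6660        # t_{0.05} critical value, as reported above

# Sample mean, kept unrounded for the interval calculation.
x_bar = total_candies / n_bags

# Margin of error and interval bounds for the population mean.
margin = t_crit * s / sqrt(n_bags)
lower = x_bar - margin
upper = x_bar + margin

print(f"sample mean = {x_bar:.3f}")
print(f"margin of error = {margin:.3f}")
print(f"90% CI = ({lower:.1f}, {upper:.1f})")
```

The margin computed directly is slightly smaller than the 0.55 obtained from the rounded bounds, because (59.9 − 58.8)/2 carries the rounding of both endpoints.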
Individual Work for Part 4:
In a paragraph, explain in general the purpose and meaning of a confidence interval.
Confidence intervals provide a range of values that is likely to contain the population parameter. The
confidence level gives the percentage of intervals, constructed from repeated samples, that would contain
the population parameter. A higher confidence level means the range of values that could include the
unknown parameter is wider, which results in an increased margin of error. If the sample size is increased,
more data is available and the confidence interval becomes narrower. Essentially, the goal of constructing
a confidence interval is to better estimate the true population parameter and to see how well our sample
reflects the population.
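The two effects described in this paragraph (a higher confidence level widens the interval, a larger sample narrows it) can be seen numerically. The sketch below reuses the proportion example from Part 4 (p̂ = 0.205, n = 4394); the helper function `proportion_margin` and the quadrupled sample size are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def proportion_margin(p_hat: float, n: int, confidence: float) -> float:
    """Margin of error for a proportion at the given confidence level."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(p_hat * (1 - p_hat) / n)

p_hat, n = 0.205, 4394

# Same sample, increasing confidence levels: the margin grows.
margin_90 = proportion_margin(p_hat, n, 0.90)
margin_95 = proportion_margin(p_hat, n, 0.95)
margin_99 = proportion_margin(p_hat, n, 0.99)

# Same confidence level, four times the sample size: the margin halves.
margin_99_larger_sample = proportion_margin(p_hat, 4 * n, 0.99)

print(f"90%: {margin_90:.4f}  95%: {margin_95:.4f}  99%: {margin_99:.4f}")
print(f"99% with 4n: {margin_99_larger_sample:.4f}")
```

Because the margin scales with 1/√n, quadrupling the sample size cuts the margin of error in half.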
Part 5 - Reflection
The final part of our course project was to complete a reflection on what we learned and
how it can be applied to other classes or our future career.
I was explaining to my husband that it is much easier to use a concept you just
learned on a problem that has a similar structure, but so much harder when you
have to determine the right application to interpret the data and then be able to
conceptualize the statistics. This project was like a baby step toward a real world
application of statistics.
The biggest takeaway from the project was that it’s not just about the data but
how I, as the author/researcher, interpreted and explained how the statistics
support my argument. I am a political science and sociology major so it was
incredibly helpful to learn how to identify misleading graphs and how I should
present data to support my conclusion. In my other classes I am already applying
the concepts I have learned. With statistics I’m able to analyze data to make
inferences on cultural changes in society. It allows me to look at raw data to
make my own claim instead of taking another person’s interpretation as truth.