Statistics
Statistics is the science of
conducting studies to collect,
organize, summarize, present,
analyze, and draw conclusions from
data.
• The branch of statistics that involves
the collection, organization,
summarization, and presentation of
data is called descriptive statistics.
– Mean
– Median
– Mode
– Midrange
– Weighted mean
Mean
• Is the average of a group of numbers
• Applicable for interval and ratio data,
not applicable for nominal or ordinal
data
• Affected by each value in the data set,
including extreme values
• Computed by summing all values in the
data set and dividing the sum by the
number of values in the data set
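The computation described in the last bullet can be written compactly. The symbols below are standard notation, not taken from the slides: N is the size of a population, n the size of a sample, and X stands for each data value.

```latex
\mu = \frac{\sum X}{N}
\qquad\text{(population mean)}
\qquad\qquad
\bar{X} = \frac{\sum X}{n}
\qquad\text{(sample mean)}
```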
Example
The table represents the number of miles
run per week for a sample of 20
runners.
Median
• Middle value in an ordered array of
numbers.
• Applicable for ordinal, interval, and ratio
data
• Not applicable for nominal data
• Unaffected by extremely large and
extremely small values.
Find the median
1. The number of rooms in the downtown
hotels in Pittsburgh
292, 300, 311, 401, 595, 618, 713
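As a quick check of the hotel-room example, here is a minimal Python sketch. The variable names are illustrative. It finds the middle value by hand (the count is odd, so there is a single middle value) and compares it with the standard library's `statistics.median`.

```python
from statistics import median

# Number of rooms in the downtown Pittsburgh hotels (from the example)
rooms = [292, 300, 311, 401, 595, 618, 713]

# Step 1: arrange the data in order (this list is already ordered)
ordered = sorted(rooms)

# Step 2: with an odd number of values, the median is the single middle value
middle_index = len(ordered) // 2
median_by_hand = ordered[middle_index]

# Cross-check with the standard library
median_rooms = median(rooms)

print(median_by_hand, median_rooms)  # both are 401
```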
Mode - 109
Bimodal: a data set with 2 modes
Find the modal class for the
frequency distribution of miles that
20 runners ran in one week
Class          Frequency
5.5 - 10.5     1
10.5 - 15.5    2
15.5 - 20.5    3
20.5 - 25.5    5
25.5 - 30.5    4
30.5 - 35.5    3
35.5 - 40.5    2
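The modal class is the class with the largest frequency. A small sketch of that lookup in Python, with the frequency distribution typed in as (class boundaries, frequency) pairs; the data layout is an illustrative choice.

```python
# Frequency distribution of miles run by 20 runners: ((lower, upper), frequency)
distribution = [
    ((5.5, 10.5), 1),
    ((10.5, 15.5), 2),
    ((15.5, 20.5), 3),
    ((20.5, 25.5), 5),
    ((25.5, 30.5), 4),
    ((30.5, 35.5), 3),
    ((35.5, 40.5), 2),
]

# The modal class is the entry whose frequency is the largest
modal_class, modal_frequency = max(distribution, key=lambda entry: entry[1])

print(modal_class, modal_frequency)  # (20.5, 25.5) 5
```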
Midrange
Find the midrange of the NFL signing
bonuses, given that the bonuses, in millions
of dollars, are
18.0, 14.0, 34.5, 10, 12.4, 10
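The midrange is the sum of the lowest and highest values divided by 2. A minimal Python sketch for the signing-bonus data, assuming that standard definition:

```python
# NFL signing bonuses, in millions of dollars (from the example)
bonuses = [18.0, 14.0, 34.5, 10, 12.4, 10]

# Midrange = (lowest value + highest value) / 2
lowest = min(bonuses)
highest = max(bonuses)
midrange = (lowest + highest) / 2

print(midrange)  # 22.25, i.e. $22.25 million
```

Because only the two extreme values are used, a single unusually large bonus such as 34.5 pulls the midrange well above most of the data.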
Weighted Mean
A student received an A in English
Composition 1 (3 credits), a C in
Psychology I (3 credits), a B in Biology
(4 credits), and a D in PE (2 credits).
Assuming A = 4 grade points
B = 3 grade points
C = 2 grade points
D = 1 grade point
find the student's grade point average.
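Course                   Credits (w)   Grade (X)
English Composition 1    3             A (4 points)
Psychology I             3             C (2 points)
Biology                  4             B (3 points)
PE                       2             D (1 point)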
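The grade point average is a weighted mean: each grade's point value X is multiplied by its credits w, and the sum of the products is divided by the total credits. A short sketch using the courses from the problem statement (the dictionary layout is illustrative):

```python
# Grade points assumed in the problem statement
grade_points = {"A": 4, "B": 3, "C": 2, "D": 1}

# (course, credits w, letter grade) as given in the problem
courses = [
    ("English Composition 1", 3, "A"),
    ("Psychology I", 3, "C"),
    ("Biology", 4, "B"),
    ("PE", 2, "D"),
]

# Weighted mean = sum(w * X) / sum(w)
total_weighted = sum(credits * grade_points[grade] for _, credits, grade in courses)
total_credits = sum(credits for _, credits, _ in courses)
gpa = total_weighted / total_credits

print(total_weighted, total_credits, round(gpa, 1))  # 32 12 2.7
```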
[Figure: number lines for two data sets, marked from 10 to 60 and from 25 to 45]
Measures of Variation
• Range
• Variance
• Standard Deviation
Range
Comparison of outdoor paint
Find the range of the two brands of
paint.
For brand A, the range is R = 60 - 10
= 50 months
For brand B, the range is R = 45 - 25
= 20 months
For brand A, 50 months separate the
largest data value from the smallest
data value. For brand B, 20 months separate
the largest data value from the smallest data
value.
Variance
Standard Deviation (Population)
X      X - μ    (X - μ)²
35       0         0
45      10       100
30      -5        25
35       0         0
40       5        25
25     -10       100
Total            250
Since the standard deviation of brand A is
7.1 and the standard deviation of brand B
is 6.5, the data are more variable for brand
A. In summary, when the means are
equal, the larger the variance or
standard deviation is, the more
variable the data are.
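The population variance is the average of the squared deviations from the mean, and the standard deviation is its square root. The sketch below applies this to the six values in the deviation table (35, 45, 30, 35, 40, 25) and reproduces the 6.5 quoted for brand B. It uses Python's `statistics.pvariance` and `statistics.pstdev` as a cross-check.

```python
from math import sqrt
from statistics import pstdev, pvariance

# The six data values from the deviation table (months)
values = [35, 45, 30, 35, 40, 25]

# Step 1: population mean
mean = sum(values) / len(values)

# Step 2: squared deviation of each value from the mean
squared_deviations = [(x - mean) ** 2 for x in values]

# Step 3: population variance = sum of squared deviations / N
variance = sum(squared_deviations) / len(values)

# Step 4: standard deviation = square root of the variance
std_dev = sqrt(variance)

print(mean, sum(squared_deviations), round(variance, 1), round(std_dev, 1))
# 35.0 250.0 41.7 6.5
```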
Variance and Standard Deviation
(sample)
Find the sample variance and standard
deviation for the amount of European auto
sales for a sample of 6 years shown.
The data are in millions of dollars.
Class          f    Xm    f·Xm    f·Xm²
5.5 - 10.5     1     8      8       64
10.5 - 15.5    2    13     26      338
15.5 - 20.5    3    18     54      972
20.5 - 25.5    5    23    115     2645
25.5 - 30.5    4    28    112     3136
30.5 - 35.5    3    33     99     3267
35.5 - 40.5    2    38     76     2888
n = 20
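For grouped data, each class is represented by its midpoint. The sketch below rebuilds the columns of the table (midpoint, frequency times midpoint, frequency times midpoint squared) and then applies the usual shortcut formula for the sample variance of grouped data. The slides do not show the formula, so it is an assumption here.

```python
from math import sqrt

# (lower boundary, upper boundary, frequency) for each class in the table
classes = [
    (5.5, 10.5, 1),
    (10.5, 15.5, 2),
    (15.5, 20.5, 3),
    (20.5, 25.5, 5),
    (25.5, 30.5, 4),
    (30.5, 35.5, 3),
    (35.5, 40.5, 2),
]

# Class midpoint Xm = (lower + upper) / 2
midpoints = [(low + high) / 2 for low, high, _ in classes]
frequencies = [f for _, _, f in classes]

n = sum(frequencies)                                                 # 20
sum_f_xm = sum(f * xm for f, xm in zip(frequencies, midpoints))      # 490
sum_f_xm2 = sum(f * xm**2 for f, xm in zip(frequencies, midpoints))  # 13310

# Shortcut formula (assumed): s^2 = [n*sum(f*Xm^2) - (sum(f*Xm))^2] / [n(n-1)]
sample_variance = (n * sum_f_xm2 - sum_f_xm**2) / (n * (n - 1))
sample_std_dev = sqrt(sample_variance)

print(round(sample_variance, 1), round(sample_std_dev, 1))  # 68.7 8.3
```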
Coefficient of Variation
The mean number of car sales
over a 3-month period is 87,
and the standard deviation is 5. The
mean of the commissions is $5225,
and the standard deviation is $773.
Compare the variations of the two.
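The coefficient of variation expresses the standard deviation as a percentage of the mean, which allows variables with different units (number of sales versus dollars) to be compared. A minimal sketch, assuming the standard definition CVar = (s / mean) · 100%; the function name is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the standard deviation as a percentage of the mean."""
    return std_dev / mean * 100

# Values from the example
cv_sales = coefficient_of_variation(5, 87)            # number of cars sold
cv_commissions = coefficient_of_variation(773, 5225)  # commissions in dollars

print(round(cv_sales, 1), round(cv_commissions, 1))  # 5.7 14.8
```

Since the coefficient of variation is larger for the commissions, the commissions are more variable than the sales.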
Measures of Position
Used to locate the relative position of a
data value in the data set. For example, if
a value is located at the 80th percentile,
it means that 80% of the values in the
distribution fall below it and 20% of the
values fall above it. The median is the 50th
percentile, since one half of the values
fall below it and one half of the values fall
above it.
Measures of Position
To measure the relative position of the
data value in the data set:
• Standard scores
• Quartiles
• Deciles
• Percentiles
There is an old saying that “You cannot
compare apples and oranges”. But with
the use of statistics, it can be done to
some extent. Suppose a student scored
90 on a music test and 45 on an English
exam. Direct comparison of the raw scores is
impossible, since the exams might not be
equivalent in terms of the number of questions,
the value of each question, and so on. However, a
comparison of a relative standard similar to
both can be made. This comparison uses
the mean and standard deviation and is
called the standard score or z score.
Standard Score or Z Score
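In standard notation (the symbols are not taken from the slides), the z score is the distance of a value from the mean, measured in standard deviations:

```latex
z = \frac{X - \bar{X}}{s}
\qquad\text{(sample)}
\qquad\qquad
z = \frac{X - \mu}{\sigma}
\qquad\text{(population)}
```

A positive z score means the value lies above the mean, and a negative one means it lies below.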
Example:
Cheese Substitute
270 180 250 290
130 260 340 310
[Figure: boxplots for real cheese and cheese substitute. Values marked on the cheese substitute plot: 130, 215, 265, 300, 340]
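The five values 130, 215, 265, 300, 340 are the five-number summary of the cheese substitute data: minimum, first quartile, median, third quartile, maximum. The sketch below reproduces them, assuming the common classroom convention that Q1 and Q3 are the medians of the lower and upper halves of the ordered data. Other quartile conventions can give slightly different values.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), with quartiles as medians of the halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # For an odd count, the middle value is left out of both halves
    upper_half = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])

# Cheese substitute data from the example
cheese_substitute = [270, 180, 250, 290, 130, 260, 340, 310]

summary = five_number_summary(cheese_substitute)
print(summary)  # (130, 215.0, 265.0, 300.0, 340)
```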
6. 39 7. 39 8. 40 9. 40 10. 41
Stem and Leaf
• A stem & leaf plot organizes data points by
the place value of the leading digits. When
making a stem & leaf plot, each item of data
is separated into two parts. The “stems”
usually consist of the digits in the greatest
common place value of each item of
data. The “leaves” contain the other digits
of each item of data.
1. For example, suppose you're given the
data set 35, 37, 23, 24, 27, 31, 33, 49,
34, 35, 41, 35, 37, 23, 24, 27, 31, 33, 49,
34, 35, 41.
3 1 3 4 5 7
4 1 9
2. Consider 65, 72, 96, 86, 43, 61, 75, 86,
49, 68, 98, 74, 84, 78, 85, 75,
86, 73.
Stems Leaves
4 3 9
5
6 1 5 8
7 2 3 4 5 5 8
8 4 5 6 6 6
9 6 8
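The plot above can be reproduced with a short Python sketch. It assumes two-digit data, so the tens digit is the stem and the units digit is the leaf. The function name is illustrative.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group two-digit values by tens digit (stem); leaves are the units digits."""
    leaves_by_stem = defaultdict(list)
    for value in sorted(values):
        leaves_by_stem[value // 10].append(value % 10)
    # Include stems with no leaves so that gaps in the data stay visible
    stems = range(min(leaves_by_stem), max(leaves_by_stem) + 1)
    return {stem: leaves_by_stem.get(stem, []) for stem in stems}

data = [65, 72, 96, 86, 43, 61, 75, 86, 49, 68, 98, 74, 84, 78, 85, 75, 86, 73]

plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(stem, " ".join(str(leaf) for leaf in leaves))
```

Keeping the empty stem 5 in the output matters: it shows that no value falls between 50 and 59.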
3. The number of stories in selected
samples of tall buildings in Atlanta and
Philadelphia is shown. Construct a
back-to-back stem and leaf plot, and compare
the distributions.
Atlanta              Philadelphia
55 70 44 36 40       61 40 38 32 30
63 40 44 34 38       58 40 40 25 30
60 47 52 32 32       54 40 36 30 30
50 53 32 28 31       53 39 16 34 33
52 32 34 32 50       50 38 36 39 32
26 29
            Atlanta   Stem   Philadelphia
                       1     6
              9 8 6    2     5
8 6 4 4 2 2 2 2 2 1    3     0 0 0 0 2 2 3 4 6 6 8 8 9 9
          7 4 4 0 0    4     0 0 0 0
        5 3 2 2 0 0    5     0 3 4 8
                3 0    6     1
                  0    7
The buildings in Atlanta have a large
variation in the number of stories per
building. Although both distributions peak
in the 30-to-39-story class,
Philadelphia has more buildings in this
class. Atlanta has more buildings
with 40 or more stories than Philadelphia
does.
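A back-to-back plot shares one column of stems, with one data set's leaves running outward to the left and the other's to the right. The sketch below builds it from the data table as printed and counts the buildings behind the two comparisons made above. Layout widths and names are illustrative.

```python
from collections import defaultdict

# Number of stories, typed in from the data table as printed
atlanta = [55, 70, 44, 36, 40, 63, 40, 44, 34, 38, 60, 47, 52, 32, 32,
           50, 53, 32, 28, 31, 52, 32, 34, 32, 50, 26, 29]
philadelphia = [61, 40, 38, 32, 30, 58, 40, 40, 25, 30, 54, 40, 36, 30, 30,
                53, 39, 16, 34, 33, 50, 38, 36, 39, 32]

def leaves_by_stem(values):
    """Map each tens digit (stem) to the sorted list of units digits (leaves)."""
    groups = defaultdict(list)
    for value in sorted(values):
        groups[value // 10].append(value % 10)
    return groups

atl = leaves_by_stem(atlanta)
phi = leaves_by_stem(philadelphia)

first_stem = min(min(atl), min(phi))
last_stem = max(max(atl), max(phi))
for stem in range(first_stem, last_stem + 1):
    # Left-hand leaves are reversed so they grow outward from the stem
    left = " ".join(str(leaf) for leaf in reversed(atl.get(stem, [])))
    right = " ".join(str(leaf) for leaf in phi.get(stem, []))
    print(f"{left:>20} | {stem} | {right}")

# Counts behind the comparison: (Atlanta, Philadelphia)
count_30s = (len(atl.get(3, [])), len(phi.get(3, [])))
count_40_plus = (sum(v >= 40 for v in atlanta), sum(v >= 40 for v in philadelphia))
print(count_30s, count_40_plus)  # (10, 14) (14, 9)
```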
Another important point to remember is
that the summary statistics (median and
interquartile range) used in exploratory
data analysis are said to be resistant
statistics. A resistant statistic is relatively
less affected by outliers. The mean and
standard deviation are nonresistant
statistics. When a distribution
is skewed or contains outliers, the median
and interquartile range may therefore
summarize the data more accurately than
the mean and standard deviation.
Traditional versus EDA Techniques
Traditional          Exploratory Data Analysis
Discrete variable:
number of people, number of phone
calls, outcome in rolling a die, etc
Continuous variable:
Temperature, a person’s height,
time, a person’s weight, etc
Many continuous variables
have distributions that are bell
shaped, and these are called
approximately normally
distributed variables.
The Normal Distribution
iii) Therefore,
(95%)(4500) = 4275
of the tomatoes can be expected to
weigh from 0.31 lb to 0.91 lb.
The Standard Normal Distribution
A table is used in order to determine the
approximate areas of the standard normal
distribution between the mean 0 and z
standard deviations from the mean.
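The tabled areas can also be computed directly. As a sketch, not the table itself: the area under the standard normal curve between the mean 0 and a value z equals 0.5 · erf(z / √2), where `erf` is the error function from Python's `math` module. The function name is illustrative.

```python
from math import erf, sqrt

def area_between_mean_and_z(z):
    """Area under the standard normal curve between 0 and z (z may be negative)."""
    return 0.5 * erf(abs(z) / sqrt(2))

# A few values as they would appear in a standard normal table
for z in (1.0, 1.96, 2.0):
    print(z, round(area_between_mean_and_z(z), 4))
# 1.0 0.3413
# 1.96 0.475
# 2.0 0.4772
```

Doubling the area for z = 2 gives about 0.954, which is the "approximately 95% within 2 standard deviations" figure used in the empirical rule.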
b. Dogs
c. Camels
Stride length (m)   2.5  3.0  3.2  3.4  3.5  3.8  4.0
Speed (m/s)         2.3  3.9  4.4  5.0  5.5  6.2  7.1
After a relationship between paired data (referred
to as bivariate data) has been
discovered, scientists try to model the
relationship with an equation. One method of
determining a linear relationship is called linear
regression.
The least-squares regression line for a set of bivariate data is the line that
minimizes the sum of the squares of the vertical deviations from each data
point to the line.
Example:
1. Find the equation of the least-squares line for
the ordered pairs: (2.5, 3.4), (3.0, 4.9),
(3.3, 5.5), (3.5, 6.6), (3.8, 7.0), (4.0, 7.7),
(4.2, 8.3), (4.5, 8.7)
x     y     x²      xy
2.5   3.4    6.25    8.50
3.0   4.9    9.00   14.70
3.3   5.5   10.89   18.15
3.5   6.6   12.25   23.10
3.8   7.0   14.44   26.60
4.0   7.7   16.00   30.80
4.2   8.3   17.64   34.86
4.5   8.7   20.25   39.15
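The column sums of this table feed the usual closed-form formulas for the least-squares line y = (slope)·x + (intercept). The slides do not show the formulas, so the ones below are the standard textbook ones and should be read as an assumption about the intended method.

```python
# Ordered pairs (x, y) from the example
pairs = [(2.5, 3.4), (3.0, 4.9), (3.3, 5.5), (3.5, 6.6),
         (3.8, 7.0), (4.0, 7.7), (4.2, 8.3), (4.5, 8.7)]

n = len(pairs)
sum_x = sum(x for x, _ in pairs)       # about 28.8
sum_y = sum(y for _, y in pairs)       # about 52.1
sum_x2 = sum(x * x for x, _ in pairs)  # about 106.72
sum_xy = sum(x * y for x, y in pairs)  # about 195.86

# Standard least-squares formulas (assumed, not shown in the slides):
#   slope     = [n*sum(xy) - sum(x)*sum(y)] / [n*sum(x^2) - (sum(x))^2]
#   intercept = mean(y) - slope * mean(x)
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)
intercept = sum_y / n - slope * sum_x / n

print(f"y = {slope:.2f}x + ({intercept:.2f})")  # y = 2.73x + (-3.32)
```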
Car Rental Companies: