What type of variable is this?
Explanatory vs Response Variables
[Bar chart: Count (0 to 1800) for each Response: Always, Most Times, Sometimes, Rarely, Never]
Summarizing two categorical
variables
Now we want to make some comparisons based
on the gender of the people that answered.
Gender   Always        Most Times    Sometimes     Rarely        Never       Total
Female   915 (62.4%)   276 (18.8%)   167 (11.4%)   84 (5.7%)     25 (1.7%)   1467 (100%)
Male     771 (49.0%)   302 (19.2%)   247 (15.7%)   165 (10.5%)   90 (5.7%)   1575 (100%)
Bar Chart (with counts)
[Bar chart: Frequency of seat belt usage by gender. Counts (0 to 1000) for each Response (Always, Most Times, Sometimes, Rarely, Never), with separate bars for Female and Male]
Bar Chart (with percentages)
[Bar chart: Relative frequency of seat belt usage by gender. Percentage (0 to 70) for each Response (Always, Most Times, Sometimes, Rarely, Never), with separate bars for Female and Male]
Bar chart
Which of the two bar graphs is correct?
Which of the two bar graphs has more important information?
Tables
If we go back to the last table we will see that we have only row percentages and not column percentages. That is, nothing in the table says that out of the 1686 people who always wear seatbelts, 45.7% are male and 54.3% are female.
The reason is that when you summarize two categorical variables, the variable of interest is identified as your response variable (or outcome variable) and is the one that is defined in the columns.
Tables
So when constructing a table we put the:
Explanatory variable in the rows (gender). This is the variable within whose categories we compute the percentages.
Response variable in the columns (frequency of seatbelt usage). This is the variable whose category percentages we compare across all the categories of the explanatory variable.
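As a small illustration of row percentages, the sketch below recomputes the percentages of the seat belt table from its counts. The variable and function names are chosen here for illustration and are not part of the course material.

```python
# Counts from the seat belt table: rows are gender (explanatory variable),
# columns are the responses (response variable).
responses = ["Always", "Most Times", "Sometimes", "Rarely", "Never"]
counts = {
    "Female": [915, 276, 167, 84, 25],
    "Male": [771, 302, 247, 165, 90],
}


def row_percentages(row):
    """Return each count as a percentage of its row total, rounded to 1 decimal."""
    total = sum(row)
    return [round(100 * c / total, 1) for c in row]


# Percentages are computed within each category of the explanatory variable.
row_pct = {gender: row_percentages(row) for gender, row in counts.items()}

for gender, pcts in row_pct.items():
    print(gender, dict(zip(responses, pcts)))
```

Each row sums to about 100%, which is what lets us compare females and males even though the two groups have different sizes.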
Features of Quantitative Data
Every quantitative dataset has the following important characteristics:
Location
Spread
Shape
Outliers
We will see how to calculate these and how to illustrate them using several plots.
Useful details
We denote by n the number of observations in a dataset.
The raw data values are represented by $x_1, x_2, \ldots, x_n$.
Location
Location measures usually indicate the center of the data.
The two most important location measures are:
Mean, the arithmetic average.
Median, the middle value of the data.
Mean
It is denoted by $\bar{x}$.
To calculate the mean we sum all the values in the dataset and then divide by the number of values:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Mean
Example:
I ask 7 students how many credit hours they have this semester. The answers are: 4, 6, 3, 9, 5, 4, 3. Find the mean number of credit hours they have this semester.
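One way to check the answer to this example is the short sketch below, which applies the definition of the mean directly. The names `credit_hours` and `mean` are illustrative choices.

```python
credit_hours = [4, 6, 3, 9, 5, 4, 3]


def mean(values):
    """Sum all the values and divide by the number of values."""
    return sum(values) / len(values)


x_bar = mean(credit_hours)
print(round(x_bar, 2))  # 34 / 7, rounded to two decimals
```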
Median
The median is the middle data value of an ordered sample.
It is denoted by the letter M.
Median
To calculate the median, first order the data from the smallest to the largest value.
Then if n is odd, the median is the $\frac{n+1}{2}$th value in the ordered dataset.
If n is even, the median is the average of the $\frac{n}{2}$th and the $\left(\frac{n}{2}+1\right)$th values in the ordered dataset.
Median
Example 1:
I asked 7 students how many credit hours they have this semester. The answers they gave me were: 4, 6, 3, 9, 5, 4, 3. Find the median number of credit hours they have this semester.
Median
Example 2:
Let's say my sample has 8 students. That is, the data are 4, 6, 3, 9, 5, 4, 3, 6. What is the median now?
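The odd and even cases of the median procedure can be sketched as follows and tried on both examples. This is a minimal illustration with names chosen here; Python uses 0-based positions, so the $\frac{n+1}{2}$th value sits at index `n // 2`.

```python
def median(values):
    """Median by the by-hand procedure: order the data, then take the middle."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the (n+1)/2-th value (1-based) is at 0-based index n // 2
        return ordered[mid]
    # n even: average of the (n/2)-th and (n/2 + 1)-th values (1-based)
    return (ordered[mid - 1] + ordered[mid]) / 2


example_1 = [4, 6, 3, 9, 5, 4, 3]
example_2 = [4, 6, 3, 9, 5, 4, 3, 6]

m1 = median(example_1)
m2 = median(example_2)
print(m1, m2)
```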
Spread
Spread measures usually indicate the variability of the dataset.
The most useful spread measures are:
the range,
the interquartile range,
the standard deviation,
the variance.
Range
The range is the difference between the two extreme values. That is, Range = max - min.
Example:
I asked 7 students how many credit hours they have this semester. The answers they gave me were: 4, 6, 3, 9, 5, 4, 3. Find the range of the number of credit hours they have this semester.
Interquartile Range (IQR)
The interquartile range is the difference between the upper quartile and the lower quartile, that is:
$$IQR = Q_3 - Q_1$$
The lower quartile is the median of the lower half of the ordered data values and is denoted by $Q_1$.
The upper quartile is the median of the upper half of the ordered data values and is denoted by $Q_3$.
When n is odd, the median itself is left out of both halves.
Interquartile Range (IQR)
In this context the median is denoted by $Q_2$.
Example 1:
I asked 7 students how many credit hours they have this semester. The answers they gave me were: 4, 6, 3, 9, 5, 4, 3. Find the interquartile range of the number of credit hours they have this semester.
Interquartile range (IQR)
Example 2:
Let's say my sample has 8 students. That is, the data are 4, 6, 3, 9, 5, 4, 3, 6. What is the interquartile range now?
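The by-hand quartile procedure (median of the lower half and median of the upper half) can be sketched as below. It assumes, as in the boxplot example later in these slides, that the middle value is left out of both halves when n is odd. Statistical software may use a different rule and give different quartiles.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    When n is odd, the middle value is excluded from both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]
    return median(lower), median(upper)


def iqr(values):
    q1, q3 = quartiles(values)
    return q3 - q1


example_1 = [4, 6, 3, 9, 5, 4, 3]
example_2 = [4, 6, 3, 9, 5, 4, 3, 6]

print(quartiles(example_1), iqr(example_1))
print(quartiles(example_2), iqr(example_2))
```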
Interquartile range (IQR)
IMPORTANT:
Throughout this class we will learn to use Minitab. Minitab calculates the lower and upper quartiles using a different procedure. That is, you might find different results than Minitab. When you are asked to do an exercise by hand you should use the procedure I have just shown; when you are asked to do it in Minitab you should report Minitab's answer.
Standard deviation
Standard deviation is roughly the average distance of the values from the mean.
Let's say I ask 5 students how many credit hours they have this semester and I get the following answers: 4, 4, 4, 4, 4
Here the mean is 4 and the standard deviation is 0.
If I ask another 5 students and the answers I get are 3, 3, 4, 5, 5, then
the mean is 4 and the standard deviation is 1.
Standard deviation
How to calculate the standard deviation:
Step 1: Find the mean.
Step 2: For each observation in the sample find the squared distance from the mean.
Step 3: Sum all the squared distances and divide by $n - 1$.
Step 4: Take the square root of the result in Step 3.
Or else use the formula:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
Variance
Variance is just the square of the standard deviation. That is:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Example
I asked 5 students how many credit hours they have this semester. The answers they gave me were: 3, 5, 6, 4, 7. Find the standard deviation and the variance of the number of credit hours they have this semester.
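The four steps for the standard deviation can be followed in code as a check on the hand calculation. This is a minimal sketch with function names chosen here; it divides by $n - 1$, as in the formulas for $s$ and $s^2$.

```python
import math


def sample_variance(values):
    n = len(values)
    x_bar = sum(values) / n                           # Step 1: the mean
    squared = [(x - x_bar) ** 2 for x in values]      # Step 2: squared distances
    return sum(squared) / (n - 1)                     # Step 3: sum, divide by n - 1


def sample_std(values):
    return math.sqrt(sample_variance(values))         # Step 4: square root


credit_hours = [3, 5, 6, 4, 7]
variance = sample_variance(credit_hours)
std_dev = sample_std(credit_hours)
print(variance, round(std_dev, 3))
```

The same functions reproduce the two small cases from the standard deviation slide: the answers 4, 4, 4, 4, 4 give 0 and the answers 3, 3, 4, 5, 5 give 1.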
Shape
Shape describes how the values are distributed.
The easiest way to see this is with histograms.
I will explain first how to create histograms and then how to identify the shape.
Histogram
How to create a histogram:
Step 1: Decide how many equally spaced intervals to use for the horizontal axis. Usually something between 6 and 15 is appropriate.
Step 2: Decide whether you will use frequencies or relative frequencies on the vertical axis.
Step 3: Draw the appropriate number of equally spaced intervals.
Step 4: Determine the frequency or relative frequency of the data values in each interval and draw a bar of the corresponding height.
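The counting part of these steps can be sketched as follows. The speeds below are hypothetical values invented for illustration, not the data behind any figure in these slides, and the choice of 6 intervals of width 10 starting at 50 is also an assumption.

```python
def histogram_counts(values, start, width, n_bins):
    """Count values in n_bins equal-width intervals.

    Interval k covers [start + k*width, start + (k+1)*width); a value exactly
    at the right end of the last interval is counted in the last interval.
    """
    counts = [0] * n_bins
    end = start + n_bins * width
    for v in values:
        k = int((v - start) // width)
        if v == end:
            k = n_bins - 1
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts


# Hypothetical speeds for 24 students (illustrative only).
speeds = [55, 62, 65, 68, 70, 72, 74, 75, 75, 78, 80, 80,
          82, 83, 85, 85, 88, 90, 92, 95, 95, 98, 102, 108]

counts = histogram_counts(speeds, start=50, width=10, n_bins=6)

# Text version of the histogram: frequency as a bar, relative frequency in brackets.
for k, c in enumerate(counts):
    lo = 50 + 10 * k
    print(f"{lo}-{lo + 10}: {'#' * c} ({c / len(speeds):.2f})")
```

Using relative frequencies instead of frequencies changes only the scale of the vertical axis, not the shape of the histogram.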
Histogram
Example:
I ask 24 students to tell me the fastest speed they have ever driven. The answers I get are summarized in the following histogram:
[Histogram: Frequency by Speed, horizontal axis from 50 to 110]
Shape
Now that we know how to construct a histogram, we will see how to identify what type of shape we have.
Shape can be:
Symmetric
Skewed
A symmetric dataset can be bell shaped.
A skewed dataset can be skewed to the left or skewed to the right.
Shape
[Histogram: Frequency by Speed, horizontal axis from 50 to 110]
The above is a symmetric dataset.
It is also bell shaped.
Shape
[Figures: five histograms of Frequency by Speed (horizontal axis from 50 to 110), with panels labeled Speed_1, Speed_1_1, Speed_1_1_1, Speed and Speed_1_2]
This is symmetric, not bell-shaped
This is bimodal
Outliers
There is no formal definition of an outlier. In general, an outlier is a data point that is not consistent with the bulk of the data.
Example:
Credit hours example. I asked 5 students the number of credit hours they have this semester and I got the following answers: 5, 6, 3, 22, 4. Which value can be considered an outlier?
Effect of outliers
What is the effect of outliers?
Outliers generally affect the mean and the standard deviation a lot, while they do not affect the median and the IQR that much. So the median and the IQR are said to be resistant or robust to outliers.
Effect of outliers
I asked 7 students the number of credit hours they are enrolled in. I got the following answers: 4, 6, 3, 9, 5, 4, 3.
Mean: 4.86, standard deviation: 2.116
Median: 4, IQR: 3
Now if I ask another 7 students and I get the following answers: 4, 6, 3, 29, 5, 4, 3
Mean: 7.71, standard deviation: 9.45
Median: 4, IQR: 3
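The comparison above can be reproduced with the sketch below, which uses Python's standard `statistics` module for the mean, sample standard deviation and median, and the by-hand rule (medians of the lower and upper halves) for the IQR. The helper names `iqr` and `summary` are illustrative.

```python
import statistics


def iqr(values):
    """IQR with quartiles taken as medians of the lower and upper halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[-half:])
    return q3 - q1


def summary(values):
    return {
        "mean": round(statistics.mean(values), 2),
        "sd": round(statistics.stdev(values), 2),   # sample sd, divides by n - 1
        "median": statistics.median(values),
        "iqr": iqr(values),
    }


without_outlier = summary([4, 6, 3, 9, 5, 4, 3])
with_outlier = summary([4, 6, 3, 29, 5, 4, 3])

print(without_outlier)
print(with_outlier)
```

Replacing 9 by 29 moves the mean and the standard deviation considerably, while the median and the IQR stay the same.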
Boxplot
The easiest way to identify outliers is by drawing a boxplot.
A boxplot is constructed as follows:
Step 1: Find the quartiles of the dataset.
Step 2: Find the IQR.
Step 3: Put on one axis the numbers from the minimum to the maximum of the data.
Step 4: Draw a box with the lower end at the lower quartile and the upper end at the upper quartile, and a line exactly where the median is.
Boxplot
Step 5: Draw a line that extends from the lower quartile to the smallest data value that is not smaller than $Q_1 - 1.5 \times IQR$.
Step 6: Draw a line that extends from the upper quartile to the largest data value that is not greater than $Q_3 + 1.5 \times IQR$.
Step 7: Mark with an asterisk all the values that are not between the two values $Q_1 - 1.5 \times IQR$ and $Q_3 + 1.5 \times IQR$. Those are the outliers.
Boxplot
Example:
The weights of nine students in pounds were the following: 188.5, 183.0, 194.5, 185.0, 214.0, 203.5, 186.0, 178.5, 109.0.
First we put the data in order: 109.0, 178.5, 183.0, 185.0, 186.0, 188.5, 194.5, 203.5, 214.0
Find the median: the sample size is 9, which is odd, so the (9+1)/2 = 5th value is the median, which is equal to 186.0.
Boxplot
Now take the lower portion of the data and find the lower quartile. That is: 109.0, 178.5, 183.0, 185.0. The median of this part of the data, which has size 4, is the average of the 2nd and the 3rd values. That is: (178.5 + 183.0)/2, which gives us 180.75.
Doing the same thing with the upper portion of the data, 188.5, 194.5, 203.5, 214.0, we have that the upper quartile is (194.5 + 203.5)/2, which gives us 199.0.
Boxplot
Calculate the IQR = 199 - 180.75 = 18.25
Calculate $1.5 \times IQR = 1.5 \times 18.25 = 27.375$
Any value less than $Q_1 - 1.5 \times IQR$, that is 180.75 - 27.375 = 153.375, is an outlier. Here 109.0 is an outlier.
Since the smallest value greater than 153.375 is 178.5, the lower line will extend down to 178.5.
Any value greater than $Q_3 + 1.5 \times IQR$, that is 199.0 + 27.375 = 226.375, is an outlier. Here there are none.
Since the largest value less than 226.375 is 214.0, the upper line will extend up to 214.0.
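The numbers in this example can be checked with the sketch below, which follows the by-hand boxplot steps. The function name `boxplot_stats` and the dictionary keys are illustrative, and the quartile rule is the by-hand one (median of each half, excluding the middle value when n is odd), so software output may differ.

```python
import statistics


def boxplot_stats(values):
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])      # median of the lower half
    q3 = statistics.median(ordered[-half:])     # median of the upper half
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    inside = [v for v in ordered if lower_limit <= v <= upper_limit]
    return {
        "median": statistics.median(ordered),
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "limits": (lower_limit, upper_limit),
        "line_ends": (min(inside), max(inside)),   # where the two lines stop
        "outliers": [v for v in ordered if v < lower_limit or v > upper_limit],
    }


weights = [188.5, 183.0, 194.5, 185.0, 214.0, 203.5, 186.0, 178.5, 109.0]
stats = boxplot_stats(weights)
for name, value in stats.items():
    print(name, value)
```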
Boxplot
Here it is:
[Boxplot of Weight, axis from 100 to 200]
How to handle outliers
Reasons why we have outliers:
The outlier is a legitimate data value and represents natural variability of the variable measured.
A mistake was made while taking the measurement or while entering it into the computer.
The individual in question belongs to a group other than the one of interest.
In the first case we keep the outliers, while in the last two we should discard them.
Other plots
For quantitative data we have seen the histogram and the boxplot. There are three more ways to view information about location, spread and shape. Those are:
Five point summary
Stem and leaf plot
Dotplot
Five point summary
The five point summary consists of the maximum and the minimum values, the upper and lower quartiles, and the median.
Example:
The weights of nine students in pounds were the following: 188.5, 183.0, 194.5, 185.0, 214.0, 203.5, 186.0, 178.5, 109.0
The minimum is 109.0, the maximum is 214.0, the lower quartile is 180.75, the upper quartile is 199 and the median is 186.
Five point summary
Median      186
Quartiles   180.75   199
Extremes    109      214
[Figure: plot with Age axis, from 28 to 88]
Percentiles
The quartiles and the median are special cases of percentiles.
The kth percentile is a number that has k% of the data values at or below it.
Example:
I ask 7 students how many credit hours they have this semester. The answers are: 4, 6, 3, 9, 5, 4, 3. Find the 20th percentile of those numbers.
Percentiles
For the 20th percentile we need at least 20% of the data to be at or below it, that is, one fifth of the data. Since I have 7 numbers, we need 1.4 of those numbers to be at or below that percentile.
If we order them we have 3, 3, 4, 4, 5, 6, 9.
So the 20th percentile is between the first and second numbers, which are both equal to 3. So the 20th percentile is equal to 3.
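One simple way to turn this reasoning into a rule is the nearest-rank convention: take the smallest ordered value that has at least k% of the data at or below it. This convention is an assumption made here for illustration; it agrees with the worked answer above, but other conventions exist and software may interpolate between values.

```python
import math


def percentile_nearest_rank(values, k):
    """Smallest ordered value with at least k% of the data at or below it.

    k is expected to be a number with 0 < k <= 100.
    """
    ordered = sorted(values)
    n = len(ordered)
    rank = math.ceil(k * n / 100)   # e.g. 20% of 7 values is 1.4, so rank 2
    rank = max(rank, 1)
    return ordered[rank - 1]        # ranks are 1-based, list indices 0-based


credit_hours = [4, 6, 3, 9, 5, 4, 3]
p20 = percentile_nearest_rank(credit_hours, 20)
p50 = percentile_nearest_rank(credit_hours, 50)
print(p20, p50)
```

With this rule the 50th percentile of the same data equals the median found earlier.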
Bell Shaped Distributions
Nature seems to be fair most of the time, and most numerical variables seem to follow what is called a bell-shaped symmetric curve.
This curve is so important that it is also called a normal curve or normal distribution.
Normal curve
Characteristics of the normal curve
Each normal curve is characterized by its:
Mean
Standard deviation
If the mean is equal to 0 and the standard deviation is equal to 1, then we have the standard normal curve.
Empirical rule
The empirical rule states that for any bell-shaped curve, approximately:
68% of the values fall within 1 standard deviation of the mean in either direction
95% of the values fall within 2 standard deviations of the mean in either direction
99.7% of the values fall within 3 standard deviations of the mean in either direction
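Empirical rule
With the above empirical rule we have another relationship. Since nearly all the values fall within 3 standard deviations of the mean in either direction, the range spans about 6 standard deviations, so the standard deviation can be approximated by:
$$s \approx \frac{\text{Range}}{6}$$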
Empirical Rule
Example:
I ask 30 students the number of credit hours they have this semester and the answers have mean 6 and standard deviation 1.2. By the empirical rule:
68% of the data will be between ____ and ____.
95% of the data will be between ____ and ____.
99.7% of the data will be between ____ and ____.
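To check your answers to this example, the intervals can be computed as mean ± k standard deviations for k = 1, 2, 3. The sketch below is one way to do that; the function name is an illustrative choice, and the percentages are the approximate ones from the empirical rule.

```python
def empirical_rule_intervals(mean, sd):
    """Return the 1, 2 and 3 standard deviation intervals around the mean."""
    coverage = {1: "68%", 2: "95%", 3: "99.7%"}
    return {label: (mean - k * sd, mean + k * sd) for k, label in coverage.items()}


intervals = empirical_rule_intervals(6, 1.2)
for label, (low, high) in intervals.items():
    print(f"{label} of the data: between {low:.1f} and {high:.1f}")
```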
Standardizing the normal curve
As we said earlier, the mean and standard deviation of the sample characterize the normal curve that we fit to it.
The easiest normal curve to work with is the standard normal curve, the one that has mean 0 and standard deviation 1.
So, we have a way to convert every dataset in order to fit a standard normal curve.
Standardizing the normal curve
To standardize a score we use the following formula:
$$z = \frac{\text{Score} - \text{mean}}{\text{Standard deviation}}$$
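As a quick illustration of the formula, the sketch below standardizes a single score. The score of 8.4 is a hypothetical value chosen here, paired with a mean of 6 and a standard deviation of 1.2 like those in the credit hours example.

```python
def z_score(score, mean, sd):
    """Number of standard deviations the score lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (score - mean) / sd


# Hypothetical score of 8.4 credit hours with mean 6 and standard deviation 1.2.
z = z_score(8.4, 6, 1.2)
print(round(z, 2))
```

A score equal to the mean always standardizes to 0, and a score one standard deviation above the mean standardizes to 1.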