
Chapter 2

Wednesday, May 19th


Raw data
Raw data is a term used for numbers and
category labels that have been collected
but not yet processed in any way.
Examples:
I asked five people how many credits they
have this summer term and I got the
following answers:
5, 3, 6, 4, 9
Variables
A variable is a characteristic that can differ
from one individual to the next.
Examples:
In the previous example, where I am
measuring the credit hours of five students
this semester, the variable of interest is the
credit hours.
I am interested in the average weight of
Penn State students. The variable I am going
to measure is weight.
Observations
An observational unit is a single
individual who participates in a study.
An observation is the value of a single
measurement from one observational unit.
Observations
Examples:
The number of credit hours a student has this
summer term. My observational units are the
students I am asking.
The observations will be the answers I get from
the observational units.
Suppose I am interested in how many hours a lamp
operates before it burns out. What will be my
observational units and what will be my
observations?
Dataset
A dataset is the complete set of raw data, for all
observational units and variables in a survey or
experiment.
Population vs Sample
When all individuals in a population are
measured, the measurements represent
population data.
This is easy to do when the population is small.
If I want to see the number of credits in Stat 200
Section 103, it is easy to ask each student in this
class.
What happens, though, when the population is
large? Let's say I want to see the number of
credits for all 50000 students at Penn State.
Population vs Sample
A sample is the subset of a population that
is being questioned.
When measurements are taken from a
sample, they represent sample data.
The sample size for a study is the total
number of observational units in the
sample.
Parameters and Statistics
A summary measure for an entire
population is called a parameter.
A summary measure computed from sample
data is called a statistic.
Descriptive statistics are the summary
numbers for either a population or a
sample.
Parameters and Statistics
Examples:
I find the average credit hours of Stat 200 Section
103. Since I use all the students in this class, this is
called a parameter.
I take a sample of Penn State students and I find the
average number of credit hours they
have. This is called a statistic because it comes from a
sample.
Since we usually have a collection of interesting numbers
(parameters or statistics), this
collection is simply called descriptive statistics.
Types of variables
A categorical variable consists of group or
category names that don't necessarily
have any logical ordering.
Examples:
Eye color (Brown, Blue, Green, Hazel)
Country of Residence (U.S.A., Canada,
Mexico, China, Korea, Japan, India, Cyprus)
T-shirt size (S, M, L, XL)
Types of variables
A categorical variable that has ordered
categories is called an ordinal variable.
Examples:
T-shirt sizes (S, M, L, XL)
When you are asked to rate an instructor as
(Poor, Almost Poor, Not Good, Good, Very
Good, Almost Excellent, Excellent)
Types of variables
Quantitative variables are recorded as
numerical values, and the data are either
measurements (height, weight) or
counts taken on each individual (# of
siblings, # of classes).
Synonyms:
Measurement variable
Numerical variable
Types of variables
A continuous variable is a subcategory of
quantitative variables and is used for variables
that can take every value within some
interval.
Examples:
Height
Weight
Hand length
Exercise
Measuring household incomes. What type of
variable is this?

If I measure the household income as follows:

1 means <$25000
2 means $25000-$50000
3 means $50000-$75000
4 means >$75000

What type of variable is this?
Explanatory vs Response Variables

The explanatory variable is the variable that
is thought to partially explain the
response variable.
Examples:
Do taller people have larger handspans?
Explanatory variable: height
Response variable: handspan length
Does smoking cause lung cancer?
Explanatory variable: smoking (Yes/No)
Response variable: develops lung cancer (Yes/No)
Summarizing Categorical variables

To summarize a categorical variable:

First count how many individuals fall in each
category
Then calculate the percentages
Percentages are usually more informative
Frequency
Frequency is the number of observations that fall into a
category (synonym for count).
Relative frequency is the percentage of observations in
each category (relative to the total).
A frequency distribution for a categorical variable is a
listing of all categories along with their frequencies.
A relative frequency distribution is a listing of all
categories along with their relative frequencies.
Summarizing one categorical
variable
We asked 3042 12th grade students: "How often do you
wear a seatbelt when driving a car?"
Here we have one categorical variable only.

Response      Count      %
Always         1686   55.4
Most Times      578   19.0
Sometimes       414   13.6
Rarely          249    8.2
Never           115    3.8
Total          3042  100.0
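As a small illustration, the frequency and relative frequency columns of the seatbelt table can be reproduced in Python. This is only a sketch: the variable names (`counts`, `relative`) are chosen here for illustration and are not part of the course material.

```python
# Counts taken from the seatbelt table above
counts = {
    "Always": 1686,
    "Most Times": 578,
    "Sometimes": 414,
    "Rarely": 249,
    "Never": 115,
}

total = sum(counts.values())  # number of students surveyed

# Relative frequency = count / total, expressed as a percentage (1 decimal)
relative = {category: round(100 * count / total, 1)
            for category, count in counts.items()}

for category, count in counts.items():
    print(f"{category:<12}{count:>6}{relative[category]:>7.1f}")
print(f"{'Total':<12}{total:>6}{100.0:>7.1f}")
```

The frequencies are the raw counts, and each relative frequency is the count divided by the total of 3042.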
Summarizing one categorical
variable
Pie charts are useful for summarizing a
single categorical variable if there are not
too many categories
Bar graphs are useful for summarizing
one or two categorical variables
Very useful in making comparisons
Pie chart
[Figure: Pie Chart of Response. Always (1686; 55.4%), Most Times (578; 19.0%), Sometimes (414; 13.6%), Rarely (249; 8.2%), Never (115; 3.8%)]
Bar graph
[Figure: bar graph "Frequency of wearing seatbelts". Horizontal axis: Response (Always, Most Times, Sometimes, Rarely, Never); vertical axis: Counts, 0 to 1800]
Summarizing two categorical
variables
Now we want to make some comparisons based
on the gender of the people that answered.

          Always        Most Times    Sometimes     Rarely        Never       Total
Female    915 (62.4%)   276 (18.8%)   167 (11.4%)    84 (5.7%)    25 (1.7%)   1467 (100%)
Male      771 (49.0%)   302 (19.2%)   247 (15.7%)   165 (10.5%)   90 (5.7%)   1575 (100%)
Bar Chart (with counts)
[Figure: bar chart "Frequency of seat belt usage by gender". Horizontal axis: Response (Always, Most Times, Sometimes, Rarely, Never); vertical axis: Counts, 0 to 1000; bars for Female and Male]
Bar Chart (with percentages)
[Figure: bar chart "Relative frequency of seat belt usage by gender". Horizontal axis: Response (Always, Most Times, Sometimes, Rarely, Never); vertical axis: Percentage, 0 to 70; bars for Female and Male]
Bar chart
Which of the two bar graphs is correct?

Which of the two bar graphs has more
important information?
Tables
If we go back to the last table, we will see that we have
only row percentages and not column percentages. That
is, there is nothing saying that out of the 1686 people
who always wear seatbelts, 45.7% are male and
54.3% are female.
The reason is that when you summarize two categorical
variables, the variable of interest is identified as your
response variable (or outcome variable) and is the
one that is defined in the columns.
Tables
So when constructing a table we put the:
Explanatory variable in the rows (gender).
This is the variable whose categories define the
groups inside which we compute percentages.
Response variable in the columns (frequency
of seatbelt usage). This is the variable
whose categories we want percentages for,
within each category of the
explanatory variable.
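To make the difference between row and column percentages concrete, here is a minimal Python sketch using the gender-by-seatbelt counts from the table above. The function names `row_percentages` and `column_percentages` are illustrative choices, not something defined in the course.

```python
responses = ["Always", "Most Times", "Sometimes", "Rarely", "Never"]

# Rows = explanatory variable (gender), columns = response variable
table = {
    "Female": [915, 276, 167, 84, 25],
    "Male": [771, 302, 247, 165, 90],
}

def row_percentages(table):
    """Percentage of each response within each gender (rows sum to 100)."""
    result = {}
    for group, counts in table.items():
        row_total = sum(counts)
        result[group] = [round(100 * c / row_total, 1) for c in counts]
    return result

def column_percentages(table):
    """Percentage of each gender within each response (columns sum to 100)."""
    groups = list(table)
    n_cols = len(table[groups[0]])
    col_totals = [sum(table[g][j] for g in groups) for j in range(n_cols)]
    return {g: [round(100 * table[g][j] / col_totals[j], 1) for j in range(n_cols)]
            for g in groups}

print(row_percentages(table))
print(column_percentages(table))
```

The row percentages match the ones printed in the table (for example 62.4% of females answer Always), while the column percentages give figures such as 45.7% male among the 1686 students who always wear a seatbelt.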
Features of Quantitative Data
Every quantitative dataset has the following
important characteristics:
Location
Spread
Shape
Outliers
We will see how to calculate these and
how to illustrate them using several plots.
Useful details
We denote with n the number of
observations in a dataset.
The raw data values are represented with
$x_1, x_2, \ldots, x_n$
Location
Location measures usually give
an indication of the center of the data.
The two most important location measures
are:
Mean, the arithmetic average.
Median, the middle value of the data.
Mean
It is denoted by $\bar{x}$.
To calculate the mean, we sum all the
values in the dataset and then divide by
the number of values:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Mean
Example:
I ask 7 students how many credit hours they
have this semester. The answers are: 4, 6, 3,
9, 5, 4, 3. Find the mean number of credit
hours they have this summer term.
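The calculation in this example can be checked with a few lines of Python. This is a sketch only; the name `credits` is chosen here for illustration.

```python
# Credit hours reported by the 7 students in the example
credits = [4, 6, 3, 9, 5, 4, 3]

n = len(credits)          # number of observations
total = sum(credits)      # sum of all values
mean = total / n          # arithmetic average

print(f"n = {n}, sum = {total}, mean = {mean:.2f}")
```

The sum of the seven values is 34, so the mean is 34/7, which is about 4.86.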
Median
The median is the middle data value of a
sample that is ordered.
It is denoted by the letter M.
Median
To calculate the median,
first
you have to order the data from the
smallest to the largest value.
Then if n is odd, the median is the $\frac{n+1}{2}$th value
in the dataset.
If n is even, then the median is the average of the
$\frac{n}{2}$th and the $\left(\frac{n}{2}+1\right)$th values in the dataset.
Median
Example 1:
I asked 7 students how many credit hours
they have this semester. The answers they
gave me were: 4, 6, 3, 9, 5, 4, 3. Find the
median number of credit hours they have this
summer term.
Median
Example 2:
Let's say my sample has 8 students. That is,
the data are 4, 6, 3, 9, 5, 4, 3, 6. What is the
median now?
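The odd and even cases of the median rule can be sketched in Python as follows. The function name `median` is an illustrative choice; the logic follows the by-hand procedure described above.

```python
def median(values):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(values)          # order from smallest to largest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # n odd: the ((n+1)/2)th value, which is index `middle` when counting from 0
        return ordered[middle]
    # n even: average of the (n/2)th and (n/2 + 1)th values
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([4, 6, 3, 9, 5, 4, 3]))      # Example 1, n = 7
print(median([4, 6, 3, 9, 5, 4, 3, 6]))   # Example 2, n = 8
```

For Example 1 the ordered data are 3, 3, 4, 4, 5, 6, 9 and the 4th value is 4. For Example 2 the ordered data are 3, 3, 4, 4, 5, 6, 6, 9 and the average of the 4th and 5th values is 4.5.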
Spread
Spread measures usually give an indication
of the variability of the dataset.
The most useful spread measures are
the range,
the interquartile range,
the standard deviation,
the variance.
Range
Range is the difference between the two
extreme values. That is, Range = max - min.
Example:
I asked 7 students how many credit hours
they have this semester. The answers they
gave me were: 4, 6, 3, 9, 5, 4, 3. Find the
range of the number of credit hours they have
this summer term.
Interquartile Range (IQR)
The interquartile range is the
difference between the upper quartile and the
lower quartile, that is:
$$IQR = Q_3 - Q_1$$
The lower quartile is the median of the lower half of
the ordered data values and is denoted by $Q_1$.
The upper quartile is the median of the upper half of
the ordered data values and is denoted by $Q_3$.
Interquartile Range (IQR)
In this context the median is denoted by $Q_2$.
Example 1:
I asked 7 students how many credit hours
they have this semester. The answers they
gave me were: 4, 6, 3, 9, 5, 4, 3. Find the
interquartile range of the number of credit
hours they have this summer term.
Interquartile range (IQR)
Example 2:
Let's say my sample has 8 students. That is,
the data are 4, 6, 3, 9, 5, 4, 3, 6. What is the
interquartile range now?
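The by-hand quartile procedure can be sketched in Python. One assumption is made explicit here: when n is odd, the median itself is left out of both halves, which matches the nine-student weight example worked later in these notes. The function names are illustrative, and statistical software may use a different quartile rule.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: for odd n the middle value is excluded from both halves
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(upper)

def iqr(values):
    q1, q3 = quartiles(values)
    return q3 - q1

print(quartiles([4, 6, 3, 9, 5, 4, 3]), iqr([4, 6, 3, 9, 5, 4, 3]))
print(quartiles([4, 6, 3, 9, 5, 4, 3, 6]), iqr([4, 6, 3, 9, 5, 4, 3, 6]))
```

With 7 students the halves are 3, 3, 4 and 5, 6, 9, giving quartiles 3 and 6. With 8 students the halves are 3, 3, 4, 4 and 5, 6, 6, 9, giving quartiles 3.5 and 6.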
Interquartile range (IQR)
!!! IMPORTANT !!!
Throughout this class we will learn to use
Minitab. Minitab calculates the lower and upper
quartiles using a different procedure. That is, you
might find different results than Minitab. When
you are asked to do an exercise by hand, you
should use the procedure I have just shown;
when you are asked to do it in Minitab, then you
should report Minitab's answer.
Standard deviation
Standard deviation is roughly the average
distance that the values fall from the mean.
Let's say I ask 5 students how many credit hours
they have this semester and I get the following
answers: 4, 4, 4, 4, 4
Here the mean is 4 and the standard deviation is 0.
If I ask another 5 students and the answers I get
are 3, 3, 4, 5, 5, then
the mean is 4 and the standard deviation is 1.
Standard deviation
How to calculate the standard deviation:
Step 1: Find the mean.
Step 2: For each observation in the sample, find the
squared distance from the mean.
Step 3: Sum all the squared distances and divide by
n - 1.
Step 4: Take the square root of the result in Step 3.
Or else, use the formula:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$
Variance
Variance is just the square of the standard
deviation. That is:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
Example
I asked 5 students how many credit hours
they have this semester. The answers
they gave me were: 3, 5, 6, 4, 7. Find the
standard deviation and the variance of the
number of credit hours they have this
summer term.
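The four steps for the standard deviation translate directly into Python. This is a sketch with illustrative function names (`sample_variance`, `sample_std`); it divides by n - 1 as in the formula above.

```python
import math

def sample_variance(values):
    n = len(values)
    mean = sum(values) / n                                   # Step 1: find the mean
    squared_distances = [(x - mean) ** 2 for x in values]    # Step 2: squared distances
    return sum(squared_distances) / (n - 1)                  # Step 3: sum and divide by n - 1

def sample_std(values):
    return math.sqrt(sample_variance(values))                # Step 4: square root

credits = [3, 5, 6, 4, 7]
print(f"variance = {sample_variance(credits)}")
print(f"standard deviation = {sample_std(credits):.4f}")
```

For 3, 5, 6, 4, 7 the mean is 5, the squared distances are 4, 0, 1, 1, 4 with sum 10, so the variance is 10/4 = 2.5 and the standard deviation is about 1.5811.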
Shape
Shape describes how
the values are distributed.
The easiest way to see this is with
histograms.
I will explain first how to create histograms
and then how to identify the shape.
Histogram
How to create a histogram:
Step 1: Decide how many equally spaced intervals to
use for the horizontal axis. Usually something between 6
and 15 is appropriate.
Step 2: Decide if you will use frequencies or relative
frequencies on the vertical axis.
Step 3: Draw the appropriate number of equally
spaced intervals.
Step 4: Determine the frequency or relative frequency
of the data values in each interval and draw a bar
with the corresponding height.
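The counting part of Step 4 can be sketched in Python. The speeds below are hypothetical values invented for illustration only; they are not the data behind any histogram in these notes. The convention that each interval includes its left endpoint but not its right one is also an assumption.

```python
def histogram_counts(values, start, width, n_bins):
    """Count how many values fall in each interval [left, left + width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)   # which interval the value belongs to
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

# Hypothetical speeds in mph (illustration only)
speeds = [55, 62, 68, 71, 74, 75, 78, 80, 82, 85, 85, 88, 90, 92, 95, 101, 108]

start, width, n_bins = 50, 10, 6            # 6 equally spaced intervals from 50 to 110
counts = histogram_counts(speeds, start, width, n_bins)

# Text version of the histogram: one row per interval, one star per observation
for i, count in enumerate(counts):
    left = start + i * width
    print(f"{left:>3}-{left + width:<3} {'*' * count}")
```

Dividing each count by the number of observations would give the relative frequencies instead.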
Histogram
Example:
I ask 24 students to tell me what was the
fastest speed they have ever driven. The
answers I get are summarized in the following
histogram:
Histogram
[Figure: histogram. Horizontal axis: Speed, 50 to 110; vertical axis: Frequency]
Shape
Now that we know how to construct a
histogram, we will see how to decide
what type of shape we have.
Shape can be:
Symmetric
Skewed
A symmetric dataset can be bell shaped.
A skewed dataset can be skewed to the left or
skewed to the right.
Shape
[Figure: histogram. Horizontal axis: Speed, 50 to 110; vertical axis: Frequency]

The above is a symmetric dataset.
It is also bell shaped.
Shape
[Figure: histogram. Horizontal axis: Speed_1, 50 to 110; vertical axis: Frequency]

This is a symmetric histogram


This is not bell shaped
Shape
[Figure: histogram. Horizontal axis: Speed_1_1, 50 to 110; vertical axis: Frequency]

This is not symmetric


This is skewed to the right
Shape
[Figure: histogram. Horizontal axis: Speed_1_1_1, 50 to 110; vertical axis: Frequency]

This is not symmetric


This is skewed to the left
Mode
Another important measure in a dataset is
the mode.
The mode is the most frequent value in a
dataset.
Example:
I ask 7 students how many credit hours they
have this term and the answers are: 5, 6, 5, 4,
4, 4, 3. What is the mode of this dataset?
Mode
A dataset can have more than one mode.
Example:
I ask 7 students how many credit hours they
have this term and the answers are: 5, 5, 5, 4,
4, 4, 3. What is the mode of this dataset?
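A short Python sketch shows how a dataset can have one mode or several. The function name `modes` is an illustrative choice; it returns every value that ties for the highest frequency.

```python
from collections import Counter

def modes(values):
    """Return all values that occur most often, in increasing order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)

print(modes([5, 6, 5, 4, 4, 4, 3]))   # one mode
print(modes([5, 5, 5, 4, 4, 4, 3]))   # two modes
```

In the first dataset 4 appears three times, more than any other value. In the second, 4 and 5 each appear three times, so both are modes.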
Shape
You can use the mode to define another
aspect of the shape of the distribution.
Using the mode, a distribution can be:
Unimodal if it has just one peak
Bimodal if it has two peaks
Multimodal if it has more than two peaks.
Shape
[Figure: histogram. Horizontal axis: Speed, 50 to 110; vertical axis: Frequency]

This is a symmetric unimodal dataset.


Why not bimodal?
Shape
[Figure: histogram. Horizontal axis: Speed_1_2, 50 to 110; vertical axis: Frequency]

This is symmetric, not bell-shaped.
This is bimodal.
Outliers
There is no formal definition of an outlier. In
general, it is a data point not consistent with the
bulk of the data.
Example:
Credit hours example. I asked 5 students the number
of credit hours they have this semester and I got the
following answers: 5, 6, 3, 22, 4. Which value can be
considered an outlier?
Effect of outliers
What is the effect of outliers?
Outliers generally affect the mean and the
standard deviation a lot, while at the same
time they do not affect the median and the
IQR that much. So the median and the IQR are
said to be resistant or robust to outliers.
Effect of outliers
I asked 7 students the number of credit hours
they are enrolled in. I got the following answers: 4,
6, 3, 9, 5, 4, 3.
Mean: 4.86, standard deviation: 2.116
Median: 4, IQR: 3
Now if I ask another 7 students and I get the
following answers: 4, 6, 3, 29, 5, 4, 3
Mean: 7.71, standard deviation: 9.45
Median: 4, IQR: 3
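The comparison above can be reproduced with Python's standard `statistics` module. The helper `hand_iqr` is an illustrative name; it assumes the by-hand quartile rule from these notes (medians of the lower and upper halves, with the middle value left out when n is odd) and needs at least two values.

```python
import statistics

def hand_iqr(values):
    """IQR using medians of the lower and upper halves of the ordered data."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]      # smallest `half` values
    upper = ordered[-half:]     # largest `half` values
    return statistics.median(upper) - statistics.median(lower)

def summary(values):
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),      # sample standard deviation (divides by n - 1)
        "median": statistics.median(values),
        "iqr": hand_iqr(values),
    }

without_outlier = summary([4, 6, 3, 9, 5, 4, 3])
with_outlier = summary([4, 6, 3, 29, 5, 4, 3])

for label, s in (("9 as largest value", without_outlier), ("29 as largest value", with_outlier)):
    print(f"{label}: mean={s['mean']:.2f}, sd={s['sd']:.2f}, "
          f"median={s['median']}, IQR={s['iqr']}")
```

Replacing 9 with 29 moves the mean and the standard deviation considerably, while the median and the IQR stay at 4 and 3.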
Boxplot
The easiest way to identify outliers is by drawing a
boxplot.
A boxplot is constructed as follows:
Step 1: Find the quartiles of the dataset.
Step 2: Find the IQR.
Step 3: Put on one axis the numbers from the minimum to the
maximum of the data.
Step 4: Draw a box with the lower end at the lower quartile and
the upper end at the upper quartile, and a line exactly where the
median is.
Boxplot
Step 5: Draw a line that extends from the lower
quartile to the smallest data value that is not
smaller than $Q_1 - 1.5 \times IQR$.
Step 6: Draw a line that extends from the upper
quartile to the largest data value that is not
greater than $Q_3 + 1.5 \times IQR$.
Step 7: Mark all the values that are outside
the two values $Q_1 - 1.5 \times IQR$ and $Q_3 + 1.5 \times IQR$
with an asterisk. Those are the outliers.
Boxplot
Example:
The weights of nine students in pounds were the
following: 188.5, 183.0, 194.5, 185.0, 214.0, 203.5,
186.0, 178.5, 109.0.
First we put the data in order: 109.0, 178.5, 183.0,
185.0, 186.0, 188.5, 194.5, 203.5, 214.0
Find the median: the sample size is 9, which is odd, so
the (9+1)/2 = 5th value is the median, which is equal to
186.0.
Boxplot
Now take the lower portion of the data and find the
lower quartile. That is: 109.0, 178.5, 183.0, 185.0.
So the median of this part of the data, with size 4, is
the average of the 2nd and the 3rd value. That is:
(178.5 + 183.0)/2, which gives us 180.75.
Doing the same thing with the upper portion of the
data, 188.5, 194.5, 203.5, 214.0, we have that the
upper quartile is (194.5 + 203.5)/2, which gives us
199.0.
Boxplot
Calculate the IQR = 199 - 180.75 = 18.25
Calculate 1.5 * IQR = 1.5 * 18.25 = 27.375
Any value less than $Q_1 - 1.5 \times IQR$, that is 180.75 -
27.375 = 153.375, is an outlier.
Since the smallest point greater than 153.375 is
178.5, the line will extend down to 178.5.
Any value greater than $Q_3 + 1.5 \times IQR$, that is
199.0 + 27.375 = 226.375, is an outlier.
Since the largest point less than 226.375 is 214.0,
the line will extend up to 214.0.
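The numbers needed to draw this boxplot can be computed with a short Python sketch. The function `boxplot_stats` and its dictionary keys are illustrative names; the quartiles follow the by-hand rule used in this example (the median is excluded from both halves when n is odd).

```python
import statistics

def boxplot_stats(values):
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])     # median of the lower half
    q3 = statistics.median(ordered[-half:])    # median of the upper half
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    inside = [x for x in ordered if lower_limit <= x <= upper_limit]
    return {
        "median": statistics.median(ordered),
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "lower_limit": lower_limit,
        "upper_limit": upper_limit,
        "whisker_low": inside[0],      # smallest value that is not an outlier
        "whisker_high": inside[-1],    # largest value that is not an outlier
        "outliers": [x for x in ordered if x < lower_limit or x > upper_limit],
    }

weights = [188.5, 183.0, 194.5, 185.0, 214.0, 203.5, 186.0, 178.5, 109.0]
for name, value in boxplot_stats(weights).items():
    print(f"{name}: {value}")
```

For the nine weights, the only value outside the limits 153.375 and 226.375 is 109.0, so it is the one marked with an asterisk.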
Boxplot
Here it is:
[Figure: boxplot. Axis: Weight, 100 to 200]
How to handle outliers
Reasons why we have outliers:
The outlier is a legitimate data value and represents
natural variability of the variable measured.
A mistake was made while taking the measurement
or while entering it into the computer.
The individual in question belongs to another group
than the one of interest.
In the first case we keep the outliers, while in the
last two we should discard them.
Other plots
For quantitative data we have seen the
histogram and the boxplot. There are
another three ways that we can use in
order to view information about location,
spread and shape. Those are:
Five point summary
Stem and Leaf plot
Dotplot
Five point summary
The five point summary consists of the maximum
and the minimum values, the upper and lower
quartiles and the median.
Example:
The weights of nine students in pounds were the
following: 188.5, 183.0, 194.5, 185.0, 214.0, 203.5,
186.0, 178.5, 109.0
The minimum is 109.0, the maximum is 214.0, the lower quartile
is 180.75, the upper quartile is 199 and the median is
186.
Five point Summary
Median              186
Quartiles     180.75    199
Extremes    109           214

You have to write it like this. The five
numbers look like a pyramid.
Stem and Leaf plot
Those plots are divided into two parts:
The stem, which contains all but the last one or two
digits of a number
The leaf, which is the remaining one or two digits of
the number
To create a Stem and Leaf plot you have to:
Step 1: Determine the stem values, which also means
the values of the categories
Step 2: Attach all the leaves to the appropriate stem
in increasing order
Stem and Leaf plots
Example:
I ask 12 professors in the Stat department to tell me
their age. The answers are: 27, 40, 46, 85, 77, 54,
52, 49, 64, 62, 69, 48
The stem and leaf diagram will be as follows:

Always write the key
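A stem and leaf diagram for the ages in this example can be built with a short Python sketch. The function name `stem_and_leaf` is illustrative. Two assumptions are made: the stem is the tens digit and the leaf is the units digit, and stems with no leaves (such as 3) are still listed so that gaps in the data remain visible.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the rows of a stem and leaf plot for two-digit whole numbers."""
    leaves = defaultdict(list)
    for value in sorted(values):             # sorting puts the leaves in increasing order
        stem, leaf = divmod(value, 10)       # tens digit is the stem, units digit the leaf
        leaves[stem].append(leaf)
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem} | {row}".rstrip())
    return rows

ages = [27, 40, 46, 85, 77, 54, 52, 49, 64, 62, 69, 48]
for row in stem_and_leaf(ages):
    print(row)
print("Key: 4 | 6 means 46")
```

The last line printed is the key, which tells the reader how to combine a stem and a leaf back into a data value.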


Dotplot
Maybe the simplest plot to create.
Example:
I ask 30 professors at Penn State to tell me
their age. The answers are not listed, but you
can see the dotplot that follows.
[Figure: dotplot "Age of Professors in Penn State". Horizontal axis: Age, 28 to 88]
Percentiles
The quartiles and the median are special cases
of percentiles.
The kth percentile is a number that has k% of
the data values at or below it.
Example:
I ask 7 students how many credit hours they have this
semester. The answers are: 4, 6, 3, 9, 5, 4, 3. Find
the 20th percentile of those numbers.
Percentiles
For the 20th percentile we need at least 20% of the data
to be at or below it, that is, one fifth of the data. Since I
have 7 numbers, that means that we need 1.4 of those
numbers to be at or below that percentile.
If we order them we have the following order: 3, 3,
4, 4, 5, 6, 9
So the 20th percentile is between the first and second
numbers, which are both equal to 3. So the 20th
percentile is equal to 3.
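One simple way to turn this reasoning into code is the nearest-rank convention: take the smallest ordered value that has at least k% of the data at or below it. This is an assumption made for the sketch below; other conventions (including the ones used by statistical software) interpolate between values and can give different answers.

```python
import math

def percentile(values, k):
    """Nearest-rank kth percentile: smallest value with at least k% of data at or below it."""
    ordered = sorted(values)
    n = len(ordered)
    rank = max(math.ceil(k * n / 100), 1)   # 1-based position in the ordered data
    return ordered[rank - 1]

credits = [4, 6, 3, 9, 5, 4, 3]
print(percentile(credits, 20))   # 20% of 7 is 1.4, so we take the 2nd ordered value
```

Here 1.4 is rounded up to 2, and the second value in the ordered list 3, 3, 4, 4, 5, 6, 9 is 3, in agreement with the answer above.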
Bell Shaped Distributions
Nature seems to be fair most of the time,
and most numerical variables seem to
follow what is called a bell-shaped
symmetric curve.
This curve is so important that it is also
called a normal curve or normal
distribution.
Normal curve
Characteristics of normal curve
Each normal curve is characterized by its:
Mean
Standard deviation
If the mean is equal to 0 and the standard
deviation is equal to 1, then we have the
standard normal curve.
Empirical rule
The empirical rule states that for any bell
shaped curve, approximately:
68% of the values fall within 1 standard
deviation of the mean in either direction
95% of the values fall within 2 standard
deviations of the mean in either direction
99.7% of the values fall within 3 standard
deviations of the mean in either direction
Empirical rule
With the above empirical rule we have
another relationship. The standard
deviation can be approximated by:
$$s \approx \frac{\text{Range}}{6}$$
Empirical Rule
Example:
I ask 30 students the number of credit hours
they have this semester and the answers
have mean 6 and standard deviation 1.2. By
the empirical rule:
68% of the data will be between ____ and ____.
95% of the data will be between ____ and ____.
99.7% of the data will be between ____ and ____.
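The intervals in this example are the mean plus or minus 1, 2 and 3 standard deviations. A minimal Python sketch, with the illustrative function name `empirical_rule`, computes them:

```python
def empirical_rule(mean, sd):
    """Return (percentage, lower, upper) for 1, 2 and 3 standard deviations."""
    return [(pct, mean - k * sd, mean + k * sd)
            for k, pct in ((1, 68), (2, 95), (3, 99.7))]

intervals = empirical_rule(6, 1.2)
for pct, lower, upper in intervals:
    print(f"{pct}% of the data will be between {lower:.1f} and {upper:.1f}")
```

With mean 6 and standard deviation 1.2 the intervals are 4.8 to 7.2, 3.6 to 8.4, and 2.4 to 9.6. These are approximations that hold for bell-shaped data.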
Standardizing the normal curve
As we said earlier, the mean and standard
deviation of the sample characterize the
normal curve that we fit to it.
The easiest normal curve to work with is the
standard normal curve, the one that has mean 0
and standard deviation 1.
So, we have a way to convert every dataset in
order to fit a standard normal curve.
Standardizing the normal curve
In order to standardize a score we have
the following procedure:
$$z = \frac{\text{Score} - \text{mean}}{\text{Standard deviation}}$$

The standardized scores are also called
z-scores.
Example
Let's say that the number of credit
hours registered this term follows a normal curve with
mean 6 and standard deviation 1.2.
If someone has 8 credit hours, what is his
standardized score?
z = (8 - 6)/1.2 = 1.6667
If someone has a standardized score of 0, how
many credit hours does he have this semester?
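Standardizing a score and converting a z-score back to the original scale are inverse operations. The sketch below uses illustrative function names (`z_score`, `score_from_z`); the second function simply solves the z-score formula for the score.

```python
def z_score(score, mean, sd):
    """Standardized score: how many standard deviations the score is from the mean."""
    return (score - mean) / sd

def score_from_z(z, mean, sd):
    """Invert the z-score formula: score = mean + z * sd."""
    return mean + z * sd

mean, sd = 6, 1.2
print(f"z for 8 credit hours: {z_score(8, mean, sd):.4f}")
print(f"credit hours for z = 0: {score_from_z(0, mean, sd)}")
```

A standardized score of 0 always corresponds to the mean itself, which is 6 credit hours here.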
