
Chapter 3

Displaying and Summarizing Quantitative Data

Wednesday
Announcements
Office hours for the TAs and myself have been posted on Blackboard under Course Information.
Note that there are no office hours on Fridays.
There is a worksheet posted on Blackboard
under Chapter 3, Lecture. We will likely use this
in class on Friday. Please bring a copy to class or
have it available on your laptop in class. It will
not be handed in.
Make sure you are reading your book. Assigned
readings listed on syllabus.

Wednesday
Announcements
Assignments:
For homework (online and written), make sure to read the Guidelines posted on Blackboard under Course Information.
For written homework, if we cannot read your
handwriting, you may lose points.
You have a written homework and online homework
due next Wednesday, the 10th.
Survey: Complete before Monday at 8am. It is part of
your first lab grade.
Keep working on your Personal Glossary! Link on
Blackboard for each chapter.

Chapter Outline
Review Quantitative Variables
Describing Quantitative Variables Graphically
Describing Quantitative Variables Numerically

Quantitative Variables
Variables with numbers as values.
Age
Weight
Height
Number of siblings

Describing One Quantitative Variable
Distribution of variable
Summary of different values observed for the
variable
Includes the 3 S's: shape, center, and spread
Spread is also known as variation or variability.
Always start with making a picture
Histogram
Stem-and-Leaf Display

Displaying Quantitative Data
Histogram
Stem-and-leaf display

Example: Percent of Population of Hispanic Origin
Who: 50 states
What: % of state's population of Hispanic origin
Where: United States
When: 2000
Why: Looking at demographic changes over
time
How: U.S. Census

Histogram of Percent of Population of Hispanic Origin

Displaying Quantitative Data
Histogram
Divides the values of the variable into equal-width piles (called bins).
Count # of whos belonging to each bin.
Plot bin values on x-axis.
Plot # of whos belonging to each bin on y-axis.
Compare heights of bars: height of a bar = # of whos with values in the range of the bin.
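The binning steps above can be sketched in a few lines of Python. This is only an illustration, not how JMP builds its histograms: the function name `histogram_counts`, the bin width of 10, and the starting value of 0 are choices made for this sketch, and the data are the Barry Bonds home run counts used later in this chapter.

```python
# Minimal sketch: count how many values fall in each equal-width bin.
# The bin start (0), bin width (10), and number of bins (8) are assumptions
# chosen for this illustration.

def histogram_counts(values, low, width, n_bins):
    """Return a list with the number of values in each bin.

    Bin k covers low + k*width up to (but not including) low + (k+1)*width.
    """
    counts = [0] * n_bins
    for v in values:
        k = int((v - low) // width)   # index of the bin that v belongs to
        if 0 <= k < n_bins:           # values outside all bins are ignored
            counts[k] += 1
    return counts


# Barry Bonds home runs per season, 1986-2007 (data from later in this chapter)
home_runs = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42,
             40, 37, 34, 49, 73, 46, 45, 45, 5, 26, 28]

counts = histogram_counts(home_runs, low=0, width=10, n_bins=8)

# Print a sideways text histogram: one row per bin, one * per season.
for k, count in enumerate(counts):
    print(f"{k * 10:2d}-{k * 10 + 9:2d} | {'*' * count}")
```

Changing `width` changes the picture, so it is worth trying more than one bin width before describing the shape.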

Stem-and-Leaf Display
Picture of Distribution.
Generally used for smaller data sets.
Group data like histograms.
Still have original values (or close to it).

Stem-and-Leaf Display
Two columns
Left: Stem
Right: Leaf

Leaf
Contains the last digit of the values.
Arranged in increasing order away from stem.

Stem
Contains the rest of the values.
Usually arranged in increasing order from top to bottom.
JMP does opposite, increasing order from bottom to top!

Key at bottom decodes value of stem and leaf
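A stem-and-leaf display can also be built by hand or with a short script. The sketch below assumes non-negative whole numbers, uses the tens digit(s) as the stem and the ones digit as the leaf, and lists stems in increasing order from top to bottom (the opposite of JMP). The function name `stem_and_leaf` is made up for this example, and the data are the Barry Bonds home run counts used later in this chapter.

```python
# Minimal sketch of a stem-and-leaf display for non-negative whole numbers.
# Stem = all digits except the last; leaf = the last digit.
from collections import defaultdict


def stem_and_leaf(values):
    """Return the display as a list of text lines, smallest stem first."""
    leaves_by_stem = defaultdict(list)
    for v in sorted(values):              # sorting keeps leaves in increasing order
        leaves_by_stem[v // 10].append(v % 10)

    lines = []
    # Include stems with no leaves so gaps in the data stay visible.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = "".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines


# Barry Bonds home runs per season, 1986-2007 (data from later in this chapter)
home_runs = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42,
             40, 37, 34, 49, 73, 46, 45, 45, 5, 26, 28]

lines = stem_and_leaf(home_runs)
for line in lines:
    print(line)
print("Key: 7 | 3 means 73 home runs")
```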

Stem and Leaf Plot of Percent of Population of Hispanic Origin

Interpreting Histograms and Stem and Leaf Displays
Shape
Number of Modes
Symmetry
Outliers

Looking at Distributions - Shape
Shape
How many humps (called modes)?
None = uniform
One = unimodal
Two = bimodal
Three or more = multimodal

Looking at Distributions - Shape
Shape
Is it symmetric?
Symmetric = roughly equal on both sides
Skewed = more values on one side
Skewed to the right: more smaller values; the distribution trails off toward the larger values
Skewed to the left: more larger values; the distribution trails off toward the smaller values

Looking at Distributions - Shape
Shape
Are there any outliers?
Interesting observations in data
Can impact statistical methods

Percent of Population of Hispanic Origin
How many modes?
Skewed which direction?
Are there potential outliers?

Looking at Distributions - Center
Where is the typical value located?
Center
Median
Mean

Looking at Distributions - Spread
How far apart are the values?
Variation (Spread)
Range
Interquartile Range = IQR
Standard deviation = s

Looking at Distributions - Center and Spread (or variation)
2 common options:
First
Median (Center)
Range (Variation)
IQR (Variation)
Second
Mean (Center)
Standard deviation (Variation)

Median
50th percentile
50% of the observations are below the median
50% of the observations are above the median

Median is the middle number


Measures the center of the observations
Different calculation when
n is odd
n is even

Median (n is odd)
Order the data from smallest to largest.
Median is the middle number on the list.
It is the number in position (n+1)/2, counting from the bottom of the ordered list.
Ex: If n=11, median is the (11 + 1)/2 = 6th number from the bottom.
Ex: If n=37, median is the (37 + 1)/2 = 19th number from the bottom.

Example

Year  HR    Year  HR    Year  HR
54    13    62    45    70    38
55    27    63    44    71    47
56    26    64    24    72    34
57    44    65    32    73    40
58    30    66    44    74    20
59    39    67    39    75    12
60    40    68    29    76    10
61    34    69    44

Example (n is odd)
Order the data from smallest to largest.
10 12 13 20 24 26 27 29 30 32 34 34
38 39 39 40 40 44 44 44 44 45 47

Median is the (23+1)/2 = 12th number from the bottom.
Median = 34

Median (n is even)
Order the data from smallest to largest.
Median is the average of the two middle numbers.
(n+1)/2 falls halfway between the positions of these two numbers.
Ex: If n=10, (10 + 1)/2 = 5.5, so the median is the average of the 5th and 6th numbers from the bottom.
Ex: If n=28, (28 + 1)/2 = 14.5, so the median is the average of the 14th and 15th numbers from the bottom.

Barry Bonds

Year  HR    Year  HR    Year  HR
86    16    93    46    00    49
87    25    94    37    01    73
88    24    95    33    02    46
89    19    96    42    03    45
90    33    97    40    04    45
91    25    98    37    05     5
92    34    99    34    06    26
                        07    28

Example (n is even)
Order the data from smallest to largest.
5 16 19 24 25 25 26 28 33 33 34
34 37 37 40 42 45 45 46 46 49 73
(22+1)/2 = 11.5
Median is the average of the 11th and 12th numbers from the bottom, which are 34 and 34.
Median = 34
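Both median rules (n odd and n even) can be checked with a short script. This is a sketch for checking hand calculations; the function name `median` and the variable names are chosen for this example. The two data lists are the home run examples from the slides above.

```python
# Minimal sketch of the median rule from the slides.

def median(values):
    """Middle value if n is odd; average of the two middle values if n is even."""
    ordered = sorted(values)          # order the data from smallest to largest
    n = len(ordered)
    middle = n // 2                   # zero-based index
    if n % 2 == 1:
        # n odd: the (n+1)/2-th number from the bottom
        return ordered[middle]
    # n even: average the two numbers on either side of position (n+1)/2
    return (ordered[middle - 1] + ordered[middle]) / 2


# First example (n = 23, odd)
first_example = [13, 27, 26, 44, 30, 39, 40, 34, 45, 44, 24, 32,
                 44, 39, 29, 44, 38, 47, 34, 40, 20, 12, 10]

# Barry Bonds (n = 22, even)
bonds = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42,
         40, 37, 34, 49, 73, 46, 45, 45, 5, 26, 28]

print(len(first_example), median(first_example))
print(len(bonds), median(bonds))
```

Both medians come out to 34, matching the hand calculations.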

Properties of the Median
Which observations affect the median?
For Barry Bonds, 73 is an outlier.
Does this observation affect the median?

Example of finding the median
The incarceration rate (per 100,000) for each US State in 2008 was recorded.
Below is a histogram of the data.
[Figure: histogram of incarceration rate (per 100,000)]

Incarceration Rates-Highlights
148 185 197 200 209 226 238 247 265 269
286 288 302 309 317 323 340 361 363 366
373 373 376 378 385 387 403 416 432 434
439 443 445 445 448 458 468 472 474 479
495 508 552 556 572 648 648 654 686 867

148 = Maine (1st)
185 = Minnesota (2nd)
247 = Nebraska (8th)
309 = Iowa (14th)
373 = Illinois (21st)
867 = Louisiana (50th)

Incarceration Rates
Compute the median

Range
Measures variation (spread)
Minimum = 0th percentile
Maximum = 100th percentile
Range = Maximum - Minimum
Total variability of the observations

Example - Barry Bonds
Minimum = 5
Maximum = 73
Range = 73 - 5 = 68

Properties of the Range
Which observations affect the range?
For Barry Bonds, 73 is an outlier.
Does this observation affect the range?

IQR (Interquartile Range)
Measures variation (spread)
IQR = Q3 - Q1
25th percentile = Q1
75th percentile = Q3
Variability of the middle 50% of the observations

Finding Q1 and Q3
In general,
Q1 is the median of the lower half of the
ordered observations.
Q3 is the median of the upper half of the
ordered observations.
Actual calculations from textbook and JMP are slightly different.

Example - Barry Bonds
Order the home runs from smallest to largest.
Q1 = Median of Lower Half = 25
Q3 = Median of Upper Half = 45
IQR = 45 - 25 = 20

5-Number Summary
Minimum
Q1
Median
Q3
Maximum

Example - Barry Bonds
Minimum = 5
Q1 = 25
Median = 34
Q3 = 45
Maximum = 73
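The 5-number summary, range, and IQR for Barry Bonds can be reproduced with the "median of each half" rule. This sketch makes one assumption that the slides do not settle: when n is odd, the middle value is left out of both halves (some textbooks include it in both). The function names are chosen for this example, and the data list needs at least two values.

```python
# Minimal sketch: 5-number summary using "median of the lower/upper half".
# Assumption: for odd n the middle value is excluded from both halves.

def median(values):
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]        # smallest half of the observations
    upper_half = ordered[n - half:]    # largest half of the observations
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])


bonds = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42,
         40, 37, 34, 49, 73, 46, 45, 45, 5, 26, 28]

minimum, q1, med, q3, maximum = five_number_summary(bonds)
data_range = maximum - minimum
iqr = q3 - q1

print("5-number summary:", minimum, q1, med, q3, maximum)
print("Range:", data_range)
print("IQR:", iqr)
```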

Incarceration Rates
Compute the quartiles

Median = 386
Q1 =
Q3 =

Incarceration Rates
JMP gives different quartiles
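Software and textbooks do not all use the same rule for quartiles, so small differences are normal. The sketch below compares the "median of each half" rule with the two interpolation methods in Python's `statistics.quantiles` (Python 3.8 or later). It does not claim that either Python method is the one JMP uses; it only illustrates that different conventions give slightly different answers on the incarceration data.

```python
# Minimal sketch: the same data, three quartile conventions.
import statistics

rates = [148, 185, 197, 200, 209, 226, 238, 247, 265, 269,
         286, 288, 302, 309, 317, 323, 340, 361, 363, 366,
         373, 373, 376, 378, 385, 387, 403, 416, 432, 434,
         439, 443, 445, 445, 448, 458, 468, 472, 474, 479,
         495, 508, 552, 556, 572, 648, 648, 654, 686, 867]


def quartiles_by_halves(values):
    """Q1 and Q3 as the medians of the lower and upper halves.

    Assumption: for odd n the middle value is left out of both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[n - half:])
    return q1, q3


q1_halves, q3_halves = quartiles_by_halves(rates)
q1_excl, _, q3_excl = statistics.quantiles(rates, n=4, method="exclusive")
q1_incl, _, q3_incl = statistics.quantiles(rates, n=4, method="inclusive")

print("Median of halves:    Q1 =", q1_halves, " Q3 =", q3_halves)
print("quantiles exclusive: Q1 =", q1_excl, " Q3 =", q3_excl)
print("quantiles inclusive: Q1 =", q1_incl, " Q3 =", q3_incl)
```

Whichever rule is used, the quartiles land in nearly the same place, so the description of the distribution does not change.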

Incarceration Rates
Compute the IQR

Q1 = 302
Q3 = 472
IQR =

Mean
Ordinary average
Add up all observations.
Divide by the number of observations.

Mean
Formula
n observations
y1, y2, y3, ..., yn are the observations.
$$\bar{y} = \frac{y_1 + y_2 + y_3 + \cdots + y_n}{n} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Example
Barry Bonds HRs per season

$$\bar{y} = \frac{y_1 + y_2 + y_3 + \cdots + y_n}{n} = \frac{\sum_{i=1}^{n} y_i}{n}$$
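The mean for the Barry Bonds data can be worked out by following the formula directly: add up all observations, then divide by the number of observations. The variable names below are chosen for this sketch.

```python
# Minimal sketch: mean = (sum of observations) / (number of observations).

bonds = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42,
         40, 37, 34, 49, 73, 46, 45, 45, 5, 26, 28]

total = sum(bonds)      # y1 + y2 + ... + yn
n = len(bonds)          # number of seasons
mean = total / n

print("Sum of home runs:", total)
print("Number of seasons:", n)
print("Mean home runs per season:", round(mean, 2))
```

The total is 762 home runs over 22 seasons, so the mean is about 34.6 home runs per season, close to the median of 34.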

Properties of the Mean
What effect do the observations have on the mean?
For Barry Bonds, 73 is an outlier. What effect does this observation have on the mean?

Standard Deviation
Denoted by letter s.
Measures variability (spread) from mean.
Values closer to mean = smaller contribution to s.
Values far away from mean = larger contribution to s.
s depends on how far away values are on average from the mean.

What is the same?

What is different?

What is one measure of center that we could use to compare to all data points to describe how variable the data points are?

Calculate the deviations from the mean

A                  B                  C
xi    xi - x̄       yi    yi - ȳ       zi    zi - z̄
20                 10                  5
20                 15                 10
20                 20                 15
20                 25                 20
20                 30                 50

How can we convert this to one number to describe the variability?
How do we get rid of the negatives?

Calculate the squared deviations

A                            B                            C
xi   xi - x̄   Squared       yi   yi - ȳ   Squared       zi   zi - z̄   Squared
              deviation                   deviation                   deviation
20      0                   10    -10                     5    -15
20      0                   15     -5                    10    -10
20      0                   20      0                    15     -5
20      0                   25      5                    20      0
20      0                   30     10                    50     30

Now, how can we convert this to one number to describe the variability?

Sum of the squared deviations

A                            B                            C
xi   xi - x̄   Squared       yi   yi - ȳ   Squared       zi   zi - z̄   Squared
              deviation                   deviation                   deviation
20      0         0         10    -10       100           5    -15       225
20      0         0         15     -5        25          10    -10       100
20      0         0         20      0         0          15     -5        25
20      0         0         25      5        25          20      0         0
20      0         0         30     10       100          50     30       900
              Sum = 0                     Sum = 250                   Sum = 1250

Divide by (n-1), not n

A                            B                            C
xi   xi - x̄   Squared       yi   yi - ȳ   Squared       zi   zi - z̄   Squared
              deviation                   deviation                   deviation
20      0         0         10    -10       100           5    -15       225
20      0         0         15     -5        25          10    -10       100
20      0         0         20      0         0          15     -5        25
20      0         0         25      5        25          20      0         0
20      0         0         30     10       100          50     30       900
              Sum = 0                     Sum = 250                   Sum = 1250
              s² =                        s² =                        s² =

Compare s² (variance) to the graphs

Dataset   s²
A         0
B         62.5
C         312.5

In the process we squared the deviations.
How do you un-square a value?

Standard Deviation

Dataset   s²      s
A         0       0
B         62.5    7.91
C         312.5   17.68

What do the standard deviations represent?


How far, on average, each value is from the
mean.
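The table-building steps for datasets A, B, and C (deviations, squared deviations, sum, divide by n-1, square root) can be followed in code. This is a sketch for checking the hand calculations; the function names are chosen for this example.

```python
# Minimal sketch: variance and standard deviation built up step by step.
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(v - mean) ** 2 for v in values]   # removes the negatives
    return sum(squared_deviations) / (n - 1)


def sample_sd(values):
    """Square root of the variance, which 'un-squares' the units."""
    return math.sqrt(sample_variance(values))


datasets = {
    "A": [20, 20, 20, 20, 20],
    "B": [10, 15, 20, 25, 30],
    "C": [5, 10, 15, 20, 50],
}

for name, values in datasets.items():
    print(name, "variance =", sample_variance(values),
          " s =", round(sample_sd(values), 2))
```

All three datasets have the same mean (20), but the standard deviation grows as the values spread farther from it.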

Compare s (standard deviation) to the graphs

Dataset   s²      s
A         0       0
B         62.5    7.91
C         312.5   17.68

Properties of standard deviation
Can the standard deviation be negative?

Standard Deviation
$$s = \sqrt{\frac{(y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}$$

Standard Deviation
Usually calculate using computer or calculator.
Choose n-1 option on calculator.
If calculating by hand, make table.

Standard Deviation of Number of Home Runs per Season for Barry Bonds
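One way to carry out this calculation by computer is sketched below. It applies the n-1 formula directly, checks the result against Python's built-in `statistics.stdev` (which also divides by n-1), and then recomputes s with the 73-home-run season removed to see how much one outlier matters. The function and variable names are chosen for this example.

```python
# Minimal sketch: standard deviation of Barry Bonds' home runs per season,
# with and without the outlier of 73.
import math
import statistics

bonds = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42,
         40, 37, 34, 49, 73, 46, 45, 45, 5, 26, 28]


def sample_sd(values):
    """Standard deviation using the n - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    sum_of_squares = sum((v - mean) ** 2 for v in values)
    return math.sqrt(sum_of_squares / (n - 1))


s_all = sample_sd(bonds)

# Drop the outlier season and recompute.
without_73 = [v for v in bonds if v != 73]
s_without = sample_sd(without_73)

print("s (all seasons):  ", round(s_all, 2))
print("s (built-in):     ", round(statistics.stdev(bonds), 2))
print("s (73 removed):   ", round(s_without, 2))
```

Removing the single value of 73 noticeably lowers s, because values far from the mean make the largest contributions.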

Properties of s
What effect do the observations have on the value of s?
For Barry Bonds, 73 is an outlier. What effect does this observation have on the value of s?

General Properties of s
Can the standard deviation be negative?
Can the standard deviation be 0?

s has the same units as the data.
Variance = s²

Comparing standard deviations
Look at the pairs of graphs on the handout.
For each pair, determine which has the larger standard deviation, or if they are the same.

Comparison of the Mean and Median
Median = 50th percentile (middle number)
Mean = fair share value (balancing point)

Mean vs. Median
Mean and median are generally similar when
Distribution is symmetric with no outliers
Mean and median are generally different when either
Distribution is skewed
Outliers are present

Influence of Outliers on the Mean and Median
Small Example: Income in a small town of 6 people
$25,000 $27,000 $29,000
$35,000 $37,000 $38,000
Mean income is about $31,833
Median income is $32,000

Influence of Outliers on the Mean and Median
Bill Gates moves to town.
$25,000 $27,000 $29,000
$35,000 $37,000 $38,000 $100,000,000
The mean income is $14,313,000
The median income is $35,000
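The small-town income example can be recomputed to see how one extreme value moves the mean but barely moves the median. The variable names are chosen for this sketch; the incomes are the ones listed on the slides.

```python
# Minimal sketch: effect of one outlier on the mean and the median.
import statistics

incomes_before = [25_000, 27_000, 29_000, 35_000, 37_000, 38_000]
incomes_after = incomes_before + [100_000_000]   # Bill Gates moves to town

mean_before = sum(incomes_before) / len(incomes_before)
median_before = statistics.median(incomes_before)

mean_after = sum(incomes_after) / len(incomes_after)
median_after = statistics.median(incomes_after)

print(f"Before: mean = ${mean_before:,.0f}, median = ${median_before:,.0f}")
print(f"After:  mean = ${mean_after:,.0f}, median = ${median_after:,.0f}")
```

The mean jumps from about $31,833 to $14,313,000, while the median only moves from $32,000 to $35,000.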

Influence of Outliers
Summaries not affected by outliers are called ROBUST (or resistant).


Center
Median = Robust
Mean = Not Robust

Variation (Spread)
Range = Not Robust
IQR = Robust
s = Not Robust

Influence of Skewness on the Mean and Median
The observations in the tail influence the mean.
These observations do not (usually) influence the median.
Skewed to the right (tail toward large values)
Mean > median
Skewed to the left (tail toward small values)
Mean < median

Mean vs. Median
Always question when means are reported for skewed data
Income
Housing prices
Course grades

Comparison of Range, IQR and Standard Deviation
Report Range and IQR when you report Median Value
Report Standard Deviation when you report Mean Value

Which summaries are the best?
Five Number Summary
Distribution is skewed
Outliers are present
Mean and Standard Deviation
Distribution is symmetric with no outliers
ALWAYS GET A PICTURE OF YOUR DATA.
