
# Data Presentation

Data type and the displays used for it:
Qualitative data: summary table, bar chart, pie chart
Quantitative data: frequency distribution, histogram
Displaying Qualitative Data
Summary Table or Frequency Table
List categories or classes in a column and the total count (or % or
both) of each category in another column
Data: A, B, A, B, C, A, D, C, C, A, D, B, A, A, D, C, A, B, D, C

Summary/Frequency Table of grades of 20 students

| Grade | # of Students | % of Students |
|-------|---------------|---------------|
| A     | 7             | 35%           |
| B     | 4             | 20%           |
| C     | 5             | 25%           |
| D     | 4             | 20%           |
1. What % of students earned at least a C grade?
2. What is the most frequently observed grade?
Ans. 1. 80%; 2. A
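The summary table and both answers can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative and not part of the course material.

```python
from collections import Counter

# The 20 grades listed above.
grades = "A B A B C A D C C A D B A A D C A B D C".split()

counts = Counter(grades)
n = len(grades)

# One row per category: count and percentage of the total.
for grade in sorted(counts):
    print(f"{grade}  {counts[grade]:2d}  {100 * counts[grade] / n:.0f}%")

# Question 1: share of students with at least a C (grades A, B, or C).
at_least_c = 100 * sum(counts[g] for g in ("A", "B", "C")) / n

# Question 2: the most frequently observed grade.
most_common_grade = counts.most_common(1)[0][0]

print(at_least_c, most_common_grade)
```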
Bar graph

Bar Graph for Qualitative Data
Categories are represented by bars; the height of
each bar is the corresponding frequency or percentage.
All bars have the same width, with equal spaces between
successive bars.

[Figure: bar graph of the grade data. One axis: Grades (A, B, C, D); other axis: Number of Students (0 to 8); bar values A = 7, B = 4, C = 5, D = 4.]
Pie Chart

Pie Chart for Qualitative Data
Categories are represented by slices of a pie;
the size of each slice is proportional to the
percentage of the class frequency
(35% A, 20% B, 25% C, 20% D).
[Figure: pie chart of the grade data with slices A 35%, B 20%, C 25%, D 20%.]
Sample Question

All of the following are characteristics of bar graphs
except
(a) The bars of the graph should be of the same width
(b) Bar graphs are used to depict qualitative or
categorical data
(c) There should be no spaces between bars of the
bar graph

Ans (c)
Sample Question
The following pie chart shows the distribution of students
in a Math course (10% Freshmen, 46% Sophomores,
30% Juniors, 14% Seniors).

What percentage of the class took the course prior
to reaching their senior year?
(a) 44% (b) 86% (c) 54% (d) 14%
Ans. (b)
Displaying Quantitative Data

Ungrouped Frequency Table: Organize data in a
table with two columns

Column 1 (Variable Values)
distinct values (in order of magnitude) of the variable
under consideration
Column 2 (frequency)
the number of times each value is repeated (a third
column that shows percent or proportion of each
class frequency is also recommended).
Example: Number of courses taken by 30 students

Distribution of the # of courses of 30 students

| Number of Courses | Number of Students |
|-------------------|--------------------|
| 3                 | 4                  |
| 4                 | 18                 |
| 5                 | 6                  |
| 6                 | 2                  |

How many students enrolled in at least four courses?
Ans. 26
What percentage of students enrolled in four courses?
Ans. 60%
What percentage of students enrolled in at most five
courses?
Ans. (28/30) × 100% ≈ 93.3%
Data: 3, 4, 5, 6, 3, 4, 5, 6, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4
Grouped Frequency Distribution

Organize data in tabular form with two columns

Column 1: class intervals with class boundaries
divide the range of data into several (about 5 to 15)
intervals preferably with equal width in such a way that
no data value belongs to two different intervals

Column 2: frequency
the number of data values that fall in the corresponding
class interval in column 1
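Example: Grouped Frequency Distribution
Raw Data: 24.0, 26.5, 24.5, 35.6, 27.4, 27.9, 30.3, 37.4,
32.3, 38.7, 14.0, 16.8, 19.1, 20.3, 22.7

| Class interval (boundaries) | Midpoint | Frequency |
|-----------------------------|----------|-----------|
| 13.99 - 18.99               | 16.49    | 2         |
| 18.99 - 23.99               | 21.49    | 3         |
| 23.99 - 28.99               | 26.49    | 5         |
| 28.99 - 33.99               | 31.49    | 2         |
| 33.99 - 38.99               | 36.49    | 3         |

Midpoint = (Upper + Lower Boundaries) / 2
Width = Upper Boundary - Lower Boundary (5 for every class here)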
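The tallying step can be sketched in Python as follows. The rule that a value belongs to a class when it lies above the lower boundary and at or below the upper boundary is an assumption made for the sketch; with these boundaries no data value falls exactly on a boundary, so the choice does not affect the counts.

```python
raw = [24.0, 26.5, 24.5, 35.6, 27.4, 27.9, 30.3, 37.4,
       32.3, 38.7, 14.0, 16.8, 19.1, 20.3, 22.7]

# Class boundaries from the example: five classes of width 5.
boundaries = [13.99, 18.99, 23.99, 28.99, 33.99, 38.99]
classes = list(zip(boundaries[:-1], boundaries[1:]))

# Count the values in each class (assumed convention: lower < x <= upper).
frequencies = [sum(lower < x <= upper for x in raw) for lower, upper in classes]

# Midpoint = (upper + lower boundary) / 2.
midpoints = [round((lower + upper) / 2, 2) for lower, upper in classes]

for (lower, upper), mid, freq in zip(classes, midpoints, frequencies):
    print(f"{lower:.2f} - {upper:.2f}  {mid:.2f}  {freq}")
```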
Histogram for Quantitative Data
Histograms are graphs of the
frequency or relative frequency of
a variable.
Class intervals (with boundaries)
make up the horizontal axis;
The frequencies or relative
frequencies are displayed on the
vertical axis.
McClave, Statistics , 11th ed. Chapter 2:
Methods for Describing Sets of Data
Histogram of Previous Example
[Figure: histogram of the grouped frequency distribution from the previous example. Horizontal axis: Class Boundary (13.99, 18.99, 23.99, 28.99, 33.99, 38.99); vertical axis: Frequency (0 to 5); bar heights 2, 3, 5, 2, 3. There is no space between bars.]
Histogram: Study shapes of the distribution of quantitative data

[Figure: three example histograms, each with # of Students on the vertical axis.]
Symmetric: heights in inches
Skewed to left: test scores of an easy test
Skewed to right: income
What do you see in a Histogram?

Center or location of data
Shape of the distribution of data
skewed, symmetric
Presence of outliers, if any
Presence of multiple peaks, if any, in the data.

Sample Question
Parking times (in nearest minutes) are
recorded for a group of 2000 students. Which
of the following graphs would be most
appropriate to display parking times?

(a) Bar chart
(b) Histogram
(c ) Pie chart
Ans (b)
Summary of Tables/Graphs/Charts for one variable

Frequency Tables, Bar and Pie Chart for categorical data

These tables and charts show relative differences among categories:
a pie chart displays percentages, and a bar chart displays the frequency or
relative frequency of each category.

Grouped and Ungrouped frequency tables and Histogram

Measurement data are often summarized in a grouped frequency table and
count data in an ungrouped frequency table; both are displayed in
histograms.
A histogram shows the shape of the distribution of data.
The choice of intervals greatly affects the shape and may lead to misleading
conclusions.
Bivariate Data
Data on two categorical variables from each unit
e.g., Class Rank (FR SO JR SR) and Employment Status
(Full-time, Part-time, Unemployed) of students

Data on one categorical and one quantitative variable
from each unit
e.g., Class Rank and Age of students

Data on two quantitative variables from each unit
e.g., Height and Weight of students
Presenting Bivariate Data
Cross-Table for two categorical variables from each unit
Collect the class rank (FR, SO, JR, SR) and employment status (full-time,
part-time, unemployed) of 20 students and summarize in a cross-table.

Data set (one row per student):

| Class Rank | Employment Status |
|------------|-------------------|
| Fr         | Full-time         |
| Sr         | Part-time         |
| Jr         | Unemployed        |
| ...        | ...               |

Cross-table of Class Rank by Employment Status:

| Class Rank | Full-time | Part-time | Unemployed | Total |
|------------|-----------|-----------|------------|-------|
| Fr         | 1         | 2         | 1          | 4     |
| So         | 1         | 2         | 1          | 4     |
| Jr         | 3         | 1         | 2          | 6     |
| Sr         | 1         | 3         | 2          | 6     |
| Total      | 6         | 8         | 6          | 20    |

What % of the survey is Sr?
Ans. 30%
What proportion of Sr is unemployed?
Ans. 1/3

Example
In a factory, three machines produced 1000 items.
machine A produces 70% of the items; 1 of every 20 is defective
machine B produces 10%; 1 of every 100 is defective
machine C produces 20%; 2% are defective
Display the information in a cross table.
Cross-table of Machine by Quality of Items:

| Machine | Defective | Non-Defective | Total |
|---------|-----------|---------------|-------|
| A       | 35        | 665           | 700   |
| B       | 1         | 99            | 100   |
| C       | 4         | 196           | 200   |
| Total   | 40        | 960           | 1000  |

How many items are defective?
Ans. 40
What % of defective items is produced by A?
Ans. 87.5%
What % of items produced by A is defective?
Ans. 5%
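The cross-table entries follow directly from the production shares and defect rates stated in the example. The sketch below rebuilds them in Python; the dictionary layout and variable names are illustrative choices.

```python
total_items = 1000

# machine -> (share of production, defect rate), as stated in the example
machines = {"A": (0.70, 1 / 20), "B": (0.10, 1 / 100), "C": (0.20, 0.02)}

table = {}
for name, (share, defect_rate) in machines.items():
    produced = round(total_items * share)
    defective = round(produced * defect_rate)
    table[name] = {"defective": defective,
                   "non_defective": produced - defective,
                   "total": produced}

total_defective = sum(row["defective"] for row in table.values())

# Column percentage: of all defective items, the share that came from A.
pct_defective_from_a = 100 * table["A"]["defective"] / total_defective

# Row percentage: of all items made by A, the share that is defective.
pct_of_a_defective = 100 * table["A"]["defective"] / table["A"]["total"]

print(table, total_defective, pct_defective_from_a, pct_of_a_defective)
```

The two percentages use the same numerator (35) but different denominators, which is why the questions have different answers.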
Presenting Bivariate Data
One categorical and one quantitative variable

Distribution of Age by Class Rank
| Class Rank | Median Age |
|------------|------------|
| Fr         | 19         |
| So         | 20         |
| Jr         | 22         |
| Sr         | 25         |
Presenting Bivariate Data
Two quantitative variables from each unit
Use scatter plot to study the relationship
between two variables
[Figure: three scatter plots. Weight against Height: positive correlation. Heart Failure Rate against Exercise time (in minutes): negative correlation. GPA against Toe Size (mm): no correlation.]
Numerical Measures of Quantitative Data

Properties of quantitative data and their measures:
Location/Central Tendency: mean, median, mode
Dispersion: range, standard deviation, variance
Relative Position: quartile, percentile, z-score
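Summation Notation

Data: $x_1, x_2, \ldots, x_n$

Sum of the values:
$$\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \cdots + x_n$$

Square each value, then sum:
$$\sum_{i=1}^{n} x_i^2 = x_1^2 + x_2^2 + x_3^2 + \cdots + x_n^2$$

Sum the values, then square:
$$\left(\sum_{i=1}^{n} x_i\right)^2 = \left(x_1 + x_2 + x_3 + \cdots + x_n\right)^2$$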
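The difference between the sum of squares and the square of the sum is easy to check numerically. A short Python sketch, using the small data set 1, 2, 3 purely for illustration:

```python
data = [1, 2, 3]

sum_x = sum(data)                            # sum of the values
sum_of_squares = sum(x ** 2 for x in data)   # square each value, then sum
square_of_sum = sum(data) ** 2               # sum the values, then square

print(sum_x, sum_of_squares, square_of_sum)
```

The two quantities are generally different, so the order of squaring and summing matters in the variance formulas.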
Measures of Location/Central Tendency
Look for value(s) around which the data tend to
cluster. Three measures of location or center are
Mean: $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$
Median: middle number when data values are
arranged from low to high or high to low
Mode: most frequently observed value(s)
Mean
Sample Mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Data: $x_1 = 1$, $x_2 = 2$, $x_3 = 3$, $x_4 = 6$
mean = 12/4 = 3.0
Data: $x_1 = -1$, $x_2 = 0$, $x_3 = 3$, $x_4 = 6$
mean = 8/4 = 2.0

Note: Mean is affected by extreme values
Population mean is denoted by $\mu$ (mu)
Median
Median is the middle number in the ordered data
(from low to high or high to low)
For even number of observations, median is the mean of
the two middle values.
Median is not affected by extreme values
Example
Data: 1, 5000, 3; Ordered data: 1, 3, 5000;
Median = 3
Example
Data: 1, 50, 3; Ordered data: 1, 3, 50;
Median = 3
Example
Data:1, 50, 3, -4; Ordered data: -4, 1, 3, 50
Median = (1 + 3)/2 = 2
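The median rule above (sort, then take the middle value or the mean of the two middle values) can be sketched in Python and checked against the standard library. The function name `middle_value` is hypothetical, and the sketch assumes a non-empty data set.

```python
from statistics import median

def middle_value(data):
    """Median of a non-empty data set: sort, then take the middle value
    (or the mean of the two middle values when the count is even)."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

examples = [[1, 5000, 3], [1, 50, 3], [1, 50, 3, -4]]
for data in examples:
    # The hand-written rule and statistics.median agree on each example.
    assert middle_value(data) == median(data)
    print(sorted(data), "-> median", middle_value(data))
```

Replacing 50 with 5000 changes the mean a great deal but leaves the median at 3, which is the sense in which the median is not affected by extreme values.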
Mode
The most frequently observed
value(s). A data set may have
a unique mode (i.e., one value
in the data set),
more than one mode, or
no mode
Example
Data: 1, 5, 3, 3
mode= 3
Example
Data: 1, 5, 3, 3, 5
mode = 3, 5
Example
Data: 1, 5, 3, 3, 5, 1
mode = none
Mode can also be used for categorical data, where you look for the category
with the highest frequency; e.g., among all types of cancer, the one that
occurs most often is the modal cancer.
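The three cases (unique mode, more than one mode, no mode) can be sketched in Python. The function `modes` is a hypothetical helper; it assumes the convention used in these examples, namely that a data set whose distinct values all occur equally often has no mode.

```python
from collections import Counter

def modes(data):
    """Return the list of modes, or [] when there is no mode.

    Assumed convention (taken from the examples above): if the data set has
    several distinct values that all occur equally often, it has no mode.
    """
    if not data:
        return []
    counts = Counter(data)
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return sorted(value for value, c in counts.items() if c == top)

print(modes([1, 5, 3, 3]))        # unique mode
print(modes([1, 5, 3, 3, 5]))     # more than one mode
print(modes([1, 5, 3, 3, 5, 1]))  # no mode
```

The same function works for categorical data, since it only counts how often each value occurs.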
Three Measures of Central Tendency
(Mean, Median, Mode)
Mode is the x-value under the peak in each graph
[Figure: three distribution curves, each with the mode marked under the peak.]
Perfectly symmetric data set: Mean = Median
Few extremely high values in the data set: Mean > Median (rightward skewness)
Few extremely low values in the data set: Mean < Median (leftward skewness)
Sample question
For the scores (x)
1, 4, 3, -3, 0
find $\sum x^2$.

(a) 25 (b) 17 (c) 35 (d) none of the above
Ans: (c)

Sample Questions
Find the mode of the data set 1, 1, 5, 5.
(a) 1
(b) 5
(c ) 1, 5
(d) 0
(e) no mode
Ans. (e) No mode
True or False

If a data set has no mode, then the mode is equal to zero.
Ans. False

In the data set 1, 1, 1, 4, 4, 4, both 1 and 4 appear 3 times. So the mode is 3.
Ans. False

In a distribution that is skewed to the right, mean is greater than the median.
Ans. True
STA-2023 students are expected to study at least 45 minutes every day except Friday
Ans. True

Sample Question
Find the median of the dataset: 1, 4, 0, 5, 8.
(a) 0
(b) 5
(c ) 4
Ans. (c)

Find the median of the dataset: 1, 4, 2, 10, 8, 6.
(a) 2
(b) 10
(c ) 6
(d) 5
Ans. (d)
Numerical Measures of Variability
Measures of variability give us an idea of how
spread out the data are around the center

Data Set I: 10 20 30 (center: 20)
Data Set II: 15 20 25 (center: 20)

The two sets have the same location, but the spread of data values
in set I is greater than that in set II.
Range = Highest value - Lowest value = H - L

Sample Variance ($s^2$):
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Sample standard deviation: $s = \sqrt{s^2}$
Population standard deviation is denoted by $\sigma$ (sigma)

Coefficient of Variation: $cv = 100\,(s/\bar{x})\%$
Example: Mean and Variance
Sample data set: 1, 2, 3
$\sum x = 1 + 2 + 3 = 6$
$\sum x^2 = 1 + 4 + 9 = 14$
Sample Mean $= \sum x / n = 6/3 = 2$
$$s^2 = \frac{\sum x^2 - \dfrac{(\sum x)^2}{n}}{n - 1} = \frac{14 - 36/3}{3 - 1} = \frac{14 - 12}{2} = 1$$
Coefficient of variation (cv) = 100(1/2)% = 50%
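As a check, the definition of the sample variance and the computational formula used in this example give the same result. A minimal Python sketch, with illustrative variable names:

```python
from math import sqrt
from statistics import variance

data = [1, 2, 3]
n = len(data)
mean = sum(data) / n

# Definition: sum of squared deviations from the mean, divided by n - 1.
var_definition = sum((x - mean) ** 2 for x in data) / (n - 1)

# Computational formula: (sum of x^2 - (sum of x)^2 / n) / (n - 1).
var_shortcut = (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

# Coefficient of variation, in percent: 100 * s / mean.
cv_percent = 100 * sqrt(var_definition) / mean

print(var_definition, var_shortcut, variance(data), cv_percent)
```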
Location and dispersion measures from frequency table
Distribution of # of courses taken by 30 students

| Courses (x) | Students (f) |
|-------------|--------------|
| 3           | 4            |
| 4           | 18           |
| 5           | 6            |
| 6           | 2            |

Find mode, median, mean, and variance of the number of courses
(must take both the (x) and (f) columns into consideration)
Mode = ??
Ans. 4
Median = ??
Ans. 4
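Location and dispersion measures from frequency tables
Distribution of # of courses taken by 30 students

| Courses (x) | Students (f) | x·f          | x²·f          |
|-------------|--------------|--------------|---------------|
| 3           | 4            | 12           | 36            |
| 4           | 18           | 72           | 288           |
| 5           | 6            | 30           | 150           |
| 6           | 2            | 12           | 72            |
| Total       | Σf = 30      | Σ(x·f) = 126 | Σ(x²·f) = 546 |

Mean = Σ(x·f)/Σf = 126/30 = 4.2

$$\text{Variance} = \frac{\sum x^2 f - \dfrac{(\sum x f)^2}{\sum f}}{\sum f - 1} = \frac{546 - \dfrac{(126)^2}{30}}{30 - 1} = 0.58$$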
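The frequency-table calculations can be sketched in Python as follows. The dictionary maps each value x to its frequency f; the names are illustrative, and the median is found by expanding the table back into the 30 ordered values.

```python
# Frequency table: number of courses (x) -> number of students (f)
courses = {3: 4, 4: 18, 5: 6, 6: 2}

n = sum(courses.values())                            # sum of f
sum_xf = sum(x * f for x, f in courses.items())      # sum of x*f
sum_x2f = sum(x * x * f for x, f in courses.items()) # sum of x^2*f

mean = sum_xf / n
variance = (sum_x2f - sum_xf ** 2 / n) / (n - 1)

# Mode: the value with the highest frequency.
mode = max(courses, key=courses.get)

# Median: expand the table into ordered data and take the middle.
ordered = [x for x in sorted(courses) for _ in range(courses[x])]
mid = n // 2
median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

print(n, sum_xf, sum_x2f, mean, round(variance, 2), mode, median)
```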

Mean and Variance from grouped frequency tables
Raw Data: 24.0, 26.5, 24.5, 35.6, 27.4, 27.9, 30.3, 37.4,
32.3, 38.7, 14.0, 16.8, 19.1, 20.3, 22.7

| Class interval | Midpoint (x) | Frequency (f) | x·f             | x²·f                 |
|----------------|--------------|---------------|-----------------|----------------------|
| 13.99 - 18.99  | 16.49        | 2             | 32.98           | 543.8402             |
| 18.99 - 23.99  | 21.49        | 3             | 64.47           | 1385.4603            |
| 23.99 - 28.99  | 26.49        | 5             | 132.45          | 3508.6005            |
| 28.99 - 33.99  | 31.49        | 2             | 62.98           | 1983.2402            |
| 33.99 - 38.99  | 36.49        | 3             | 109.47          | 3994.5603            |
| Total          |              | Σf = 15       | Σ(x·f) = 402.35 | Σ(x²·f) = 11415.7015 |

Mean = Σ(x·f)/Σf = 402.35/15 = 26.82; variance = 44.52
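Because every value in a class is replaced by the class midpoint, the grouped mean and variance only approximate the values computed from the raw data. The Python sketch below computes both so they can be compared; the variable names are illustrative.

```python
from statistics import mean, variance

raw = [24.0, 26.5, 24.5, 35.6, 27.4, 27.9, 30.3, 37.4,
       32.3, 38.7, 14.0, 16.8, 19.1, 20.3, 22.7]

# Grouped table: class midpoint (x) -> frequency (f)
grouped = {16.49: 2, 21.49: 3, 26.49: 5, 31.49: 2, 36.49: 3}

n = sum(grouped.values())
sum_xf = sum(x * f for x, f in grouped.items())
sum_x2f = sum(x * x * f for x, f in grouped.items())

grouped_mean = sum_xf / n
grouped_var = (sum_x2f - sum_xf ** 2 / n) / (n - 1)

# The same measures computed directly from the 15 raw values.
raw_mean = mean(raw)
raw_var = variance(raw)

print(f"grouped: mean {grouped_mean:.2f}, variance {grouped_var:.2f}")
print(f"raw:     mean {raw_mean:.2f}, variance {raw_var:.2f}")
```

For these data the raw mean is 26.5, slightly below the grouped mean of 26.82.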
If you add the same constant to all values of a variable, the std dev
does not change
e.g., Std dev of 2, 4, 5 is the same as the std dev of 12,
14, 15. (check at home)

Std dev is dependent on scale factor
If you multiply all values of a variable by c, then the std dev of the new
variable is |c| times the std dev of the original variable
e.g., take
Data set I: 1, 2, 4
Data set II: 2, 4, 8 (multiply data set I by 2)
Std dev of set II is 2 times the std dev of set I (check at home)
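Both "check at home" claims can be verified with the standard library. This is a small sketch; `isclose` is used because floating-point results may differ in the last digit.

```python
from math import isclose
from statistics import stdev

# Shift: adding 10 to every value leaves the std dev unchanged.
shift_unchanged = isclose(stdev([2, 4, 5]), stdev([12, 14, 15]))

# Scale: multiplying every value by 2 doubles the std dev.
scale_doubles = isclose(stdev([2, 4, 8]), 2 * stdev([1, 2, 4]))

print(shift_unchanged, scale_doubles)
```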
Variance is average squared distance from mean
uses all data values
not appropriate if you want to compare the spread of two or more
data sets measured in different units, e.g., do not compare variation
in SAT scores with the variation in ACT scores of a group of students
taking both exams.

Coefficient of variation is a unit-free measure
appropriate for comparing the variability of two or more data sets
measured on different scales or units of measurement.
e.g., Use CV to
compare variation in blood pressure with variation in heart rate of a group of
football fans on game day
compare variation in SAT scores with variation in ACT scores of the same group
of students

Questions

TRUE or FALSE
The median of a data set with three distinct observations is 2
False
Median of data set is also equal to one-half of the range of
the data set
False
The range of a data set with all negative numbers is also
negative
False
The variance of 2, 4, 8 is same as the variance of 82, 84, 88.
True
Dr. Practice is the best instructor for STA 2023.
Ans. True
Question
The mean and standard deviation of test scores of a group of
students are 50 and 10 respectively. Course Instructor wants to
help students in one of two ways:
(i) increase each student's score by 5
(ii) increase each student's score by 10%

What would be the mean and standard deviation under (i)
Ans. Mean = 55, std dev = 10
What would be the mean and standard deviation under (ii)
Ans. Mean = 55, std dev = 11

Numerical Measures of Quantitative Data

Properties of quantitative data and their measures:
Location/Central Tendency: mean, median, mode
Dispersion: range, standard deviation, variance, coefficient of variation
Relative Position: quartile, percentile, z-score

Measures of Relative Position

Percentile
Percentile scores are numbers that divide an
ordered data set into 100 equal parts.
[Figure: data arranged from low to high on the number line and divided into 100 parts of 1% each by the 1st, 2nd, 3rd, ..., 99th percentiles.]
Measures of Relative Position

Quartile
Quartiles are numbers that divide an ordered
data set into four equal parts.
[Figure: data arranged from low to high and divided into four parts of 25% each by the 1st, 2nd, and 3rd quartiles, denoted $Q_1$, $Q_2$ (the median), and $Q_3$.]
What is the relationship between percentile and quartile?
$Q_1$ = 25th percentile, $Q_2$ = 50th percentile, $Q_3$ = 75th percentile
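Python's standard library can compute quartiles as cut points. The sketch below uses `statistics.quantiles` with its default interpolation method; textbooks use several slightly different conventions for $Q_1$ and $Q_3$, so hand-computed values may differ from the library's, while $Q_2$ always coincides with the median. The data set is borrowed from an earlier median question.

```python
from statistics import median, quantiles

data = [1, 4, 2, 10, 8, 6]

# n=4 asks for the three cut points that split the ordered data into
# four equal parts: Q1, Q2, Q3.
q1, q2, q3 = quantiles(data, n=4)

# The second quartile (50th percentile) is the median.
q2_is_median = (q2 == median(data))

print(q1, q2, q3, q2_is_median)
```

Calling `quantiles(data, n=100)` returns the 99 percentile cut points in the same way.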
Relative Position

$$\text{Z-score} = \frac{\text{raw score} - \text{mean}}{\text{standard deviation}} = \frac{X - \mu}{\sigma}$$

A z-score represents the distance between a
given data value $x_0$ and the mean of all data
values, expressed in standard deviations.
If the z-score for $x_0$ is 2, then $x_0$ is 2 standard deviations
above the mean.

All observations with absolute z-scores greater
than three are generally considered outliers.
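A z-score calculation and the outlier rule can be sketched in Python. The function names are hypothetical, the mean of 50 and standard deviation of 10 are borrowed from the earlier test-score question, and the raw scores 75 and 85 are made up for illustration.

```python
def z_score(x, mean, std_dev):
    """Distance of x from the mean, measured in standard deviations."""
    return (x - mean) / std_dev

def is_outlier(x, mean, std_dev, cutoff=3.0):
    """Rule of thumb from the text: |z| greater than 3 marks an outlier."""
    return abs(z_score(x, mean, std_dev)) > cutoff

# Illustrative values: test scores with mean 50 and std dev 10.
print(z_score(75, 50, 10))     # 2.5 standard deviations above the mean
print(is_outlier(75, 50, 10))  # |2.5| is not greater than 3
print(is_outlier(85, 50, 10))  # z = 3.5, so this score is flagged
```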

More on Relative Position

Z-scores and percentiles are
appropriate for comparing the relative position of
scores measured on different scales or units of
measurement.
e.g., to see whether you did better on the SAT or the ACT, look
at the z-scores (or percentiles) of both scores;
higher is better

Interpreting Mean and Standard Deviation
Empirical Rule: For approximately mound-shaped distributions of
data, approximately

68% of the measurements will fall within 1 std dev of the mean
95% of the measurements will fall within 2 std dev of the mean
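99.7% (we will treat this as 100%) of the measurements will fall
within 3 std dev of the mean
For bell-shaped symmetrical distributions, these percentages can
be divided further: on each side of the mean, 34% of the measurements fall
within 1 std dev, 13.5% fall between 1 and 2 std dev, and 2.5% fall
beyond 2 std dev.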
Interpreting Mean and Standard Deviation
Chebyshev's Rule: Valid for any data set
For any number k > 1, at least $(1 - 1/k^2) \times 100\%$ of the
observations will fall within k standard deviations of
the mean

| k | At least $(1 - 1/k^2) \times 100\%$ |
|---|-------------------------------------|
| 2 | At least 75%                        |
| 3 | At least (8/9) × 100% ≈ 88.89%      |
| 4 | At least (15/16) × 100% = 93.75%    |
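The table entries come from evaluating $1 - 1/k^2$. A minimal Python sketch, using exact fractions so that 8/9 and 15/16 appear as in the table; the function name is hypothetical.

```python
from fractions import Fraction

def chebyshev_lower_bound(k):
    """Minimum proportion of observations within k std devs of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's Rule applies only for k > 1")
    return 1 - 1 / Fraction(k) ** 2

for k in (2, 3, 4):
    bound = chebyshev_lower_bound(k)
    print(f"k = {k}: at least {bound} = {float(bound) * 100:.2f}%")
```

The bound holds for any data set, which is why it is weaker than the Empirical Rule figures for mound-shaped data.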
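SAT scores of first year UCF students
Mean = μ = 1175; Std Dev = σ = 120
The distribution is approximately bell-shaped symmetric.

| μ-3σ | μ-2σ | μ-σ  | μ    | μ+σ  | μ+2σ | μ+3σ |
|------|------|------|------|------|------|------|
| 815  | 935  | 1055 | 1175 | 1295 | 1415 | 1535 |

Percentage of scores in each band, from left to right: 2.5% below 935, 13.5% from 935 to 1055, 34% from 1055 to 1175, 34% from 1175 to 1295, 13.5% from 1295 to 1415, 2.5% above 1415.

Find the minimum score of a student who scored among the top
2.5% of students.
Ans. At least 1415
What percentage of students scored between 1055 and 1415?
Ans. Approximately 81.5%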
Example
The number of printing jobs submitted per day
has a bell-shaped symmetric distribution with a
mean of 83 jobs and a standard deviation of 10.
What proportion of the days do the number of jobs
submitted exceed 93?
What proportion of the days do the number of jobs
submitted fall
between 83 and 93?
between 63 and 93?

To answer these questions, first display the
distribution and clearly mark % and scores.
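Display the distribution

| μ-3σ | μ-2σ | μ-σ | μ  | μ+σ | μ+2σ | μ+3σ |
|------|------|-----|----|-----|------|------|
| 53   | 63   | 73  | 83 | 93  | 103  | 113  |

Percentage of days in each band, from left to right: 2.5% below 63, 13.5% from 63 to 73, 34% from 73 to 83, 34% from 83 to 93, 13.5% from 93 to 103, 2.5% above 103.

On what percentage of days does the number of jobs submitted
exceed 93?
Ans. Approximately 16%
On what percentage of days does the number of jobs
submitted fall
between 83 and 93?
Ans. Approximately 34%
between 63 and 93?
Ans. Approximately 81.5%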
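Answers of this kind are obtained by adding up the band percentages between the two scores. The Python sketch below does that for endpoints that sit exactly on whole-standard-deviation marks; the function name and the `None` convention for an open end are assumptions made for the sketch, and the percentages are the rounded figures used in these slides.

```python
INF = float("inf")

# (lower z, upper z, % of data) for a bell-shaped distribution, with the
# tails beyond 2 std devs treated as 2.5% each (99.7% rounded to 100%).
BANDS = [(-INF, -2, 2.5), (-2, -1, 13.5), (-1, 0, 34.0),
         (0, 1, 34.0), (1, 2, 13.5), (2, INF, 2.5)]

def percent_between(low, high, mean, std_dev):
    """Approximate % of data between low and high.

    low/high must lie on whole-sigma marks within 2 std devs of the mean;
    pass None for an open-ended side.
    """
    z_low = -INF if low is None else (low - mean) / std_dev
    z_high = INF if high is None else (high - mean) / std_dev
    return sum(pct for lo, hi, pct in BANDS if lo >= z_low and hi <= z_high)

# Printing jobs: mean 83, std dev 10.
print(percent_between(93, None, 83, 10))  # more than 93 jobs
print(percent_between(83, 93, 83, 10))    # between 83 and 93
print(percent_between(63, 93, 83, 10))    # between 63 and 93
```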
Example
The distribution of the monthly utility bill of a 3-bedroom house using
gas or electric energy had μ = \$125 and σ = \$10.

μ-3σ = 95, μ-2σ = 105, μ-σ = 115, μ = 125, μ+σ = 135, μ+2σ = 145, μ+3σ = 155

What percentage of homes will have a monthly utility bill
between \$95 and \$155?
Ans. At least 88.89% (Chebyshev's Rule with k = 3, since the shape of the distribution is not given)

What percentage of homes will have a monthly utility bill
of less than \$105 or more than \$145?
Ans. At most 25% (at least 75% fall between \$105 and \$145, by Chebyshev's Rule with k = 2)
Chapters 1 and 2 - Summary/ Sample Questions
Statistical Terms
Sample and population, qualitative and quantitative variables,
descriptive and inferential statistics, statistic and parameter
What are the measures of location?
Mean
Median
Mode
What are the measures of dispersion?
Range
Variance
Standard Deviation
Coefficient of Variation
What are the measures of position?
Percentile
Quartile
Z-score
Interpretation/description of data using
Empirical Rule, Chebyshev's Rule, z-scores
Chapter 2 - Sample Questions

Identify the correct answer in each case.

In a data set with five distinct numbers, the value of mode is
(a) zero (b) one (c) there is no mode (d) five.
Ans (c)

Which is not a measure of central tendency or location?
(a) Mean (b) Range (c) Mode (d) Median.
Ans (b)

Which is not a measure of relative position?
(a) Median (b) Quartile (c) standard deviation (d) Percentile
Ans (c)
Sample Questions
TRUE or FALSE
In a symmetric distribution, we expect the values of the mean, median, and mode
to differ greatly from one another.
False
In skewed distributions, the mean is the best measure of the center of the
distribution since it is least affected by extreme observations.
False
For the distribution drawn below, identify
mean, median, and mode.
(a) A = median, B = mode, C = mean
(b) A = mode, B = mean, C = median
(c) A = mode, B = median, C = mean
(d) A = mean, B = mode, C = median
Ans. (c )
Sample Question
The age of patients in an adult care facility averages
75 years and has a standard deviation of five years.
Assume that the distribution of age is bell-shaped
symmetric.
Find the 16th percentile in the age distribution.
(a) 65 years
(b) 70 years
(c ) 75 years
(d) 80 years
Ans. (b) 70 years
Find the median age.
(a) 65 years (b) 70 years (c ) 75 years (d) 80 years

Ans. (c) 75 years
Sample Question (Try in class)

The results of pass-fail exams in math and physics given to 200 students are
displayed in the following cross-table

|           | Pass Physics | Fail Physics | Total |
|-----------|--------------|--------------|-------|
| Pass Math | 45           | 65           | 110   |
| Fail Math | 35           | 55           | 90    |
| Total     | 80           | 120          | 200   |

What percentage of students failed math but passed
physics?
(35/200)100% = 17.5%

What percentage of students failed physics and passed
math?
(65/200)100% = 32.5%
Sample Question
In professional basketball games during 1980-82,
when Larry Bird of the Boston Celtics shot pairs of free
throws, 5 times he missed both, 251 times he made
both, 34 times he made only the first, and 48 times he
made only the second. Display this information in a
cross-table.

| First Shot | Second Shot: Made | Second Shot: Missed | Total |
|------------|-------------------|---------------------|-------|
| Made       | 251               | 34                  | 285   |
| Missed     | 48                | 5                   | 53    |
| Total      | 299               | 39                  | 338   |

What proportion of the time did he make his second shot?
(a) 251/338
(b) 251/285
(c) 285/338
(d) 251/299
(e) 299/338
Ans (e)

Sample Question
The sales record of a shoe company shows that the
mode for the shoe sizes of men's shoes is a size
10. Interpret this result to the company president.

(a) The average shoe size of men is a size 10

(b) Half of the shoes sold to men are larger than a size 10

(c) Half of all men's shoe sizes made by the company
are size 10

(d) The company sold more size 10 men's shoes than any
other men's shoe size

Ans (d)
Sample Question
If nothing is known about the shape of a
distribution, what percentage of data fall within 2
standard deviations of the mean?
(a) At least 75%

(b) At least 100(8/9)%

(c ) Approximately 68%

(d) Approximately 95%

Ans (a)

Sample Question
Which of the following graphs is most appropriate to study
relationships between two quantitative variables?
(a) Bar chart
(b) Pie chart
(c ) Scatterplot
(d) Histogram
Ans (c )
The average out-of-class study time per week for STA-2023 students is 8
hours with a standard deviation of 1.25 hours. You studied only 4 hours the
week prior to test #1. Which of the following statements is a good description?
(a) z-score for my study time is 3.20
(b) My study time is an outlier
(c) Both (a) and (b)
(d) None of the above
Ans (b)
Sample Question
The mean and standard deviation of blood pressure of all adults
is 115 and 15 respectively. Which of the following blood
pressure values are outliers? (For outliers, Z > 3 or Z < -3)
70, 80, 115, 160, 165, 170

(a) 70
(b) 165, 170
(c ) 70, 165, 170
(d) 70, 160, 165, 170

Ans (b)
Sample Questions
TRUE or FALSE
In skewed distributions, the median is the best measure of the center of
the distribution since it is least affected by extreme observations.
True
Both 2 and 5 are modes of the data set 2, 5, 2, 5, 5, 2, 3
True
Median of data set is always equal to second quartile
True
The variance of 1, 2, 3 is same as the variance of 101, 102, 103
True
The 50th percentile of a data set is also equal to $(Q_3 - Q_1)/2$.
False