
Graphical & Tabular Descriptive Techniques

2.1 Types of data and information


A variable - a characteristic of a population or
sample that is of interest to us.

Cereal choice
Capital expenditure
The waiting time for medical services

Data - the actual values of variables

Interval data are numerical observations


Nominal data are categorical observations
Ordinal data are ordered categorical observations

Interval Data
Real numbers, i.e. heights, weights,
prices, etc.
Also referred to as quantitative or
numerical.
Arithmetic operations can be performed
on Interval Data, thus it's meaningful to
talk about 2*Height, or Price + $1, and so
on.

Nominal Data
The values of nominal data are categories.
Nominal data are also called qualitative or
categorical.
For example, responses to questions about marital
status, coded as:
Single = 1, Married = 2, Divorced = 3, Widowed = 4
Because the numbers are arbitrary, arithmetic
operations don't make any sense (e.g. does
Widowed / 2 = Married?!)

More Examples: Nominal Data


Type of Bicycle

Mountain bike, road bike, chopper, folding, BMX.

Ethnicity

White British, Afro-Caribbean, Asian, Chinese, other, etc. (note problems with these categories).

Smoking status

smoker, non-smoker

Ordinal Data
Ordinal Data appear to be categorical in nature, but
their values have an order; a ranking to them:
For example, college course rating system:
poor = 1, fair = 2, good = 3, very good = 4, excellent = 5
While it's still not meaningful to do arithmetic on this data
(e.g. does 2*fair = very good?!), we can say things like:
excellent > poor or fair < very good
That is, order is maintained no matter what numeric
values are assigned to each category.
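
The point about order versus arithmetic can be checked with a small sketch. The second coding below (`coding_b`) and the helper `same_order` are hypothetical, chosen only for illustration.

```python
# Two numeric codings of the same rating scale.
# coding_a is the one on the slide; coding_b is an arbitrary alternative.
coding_a = {"poor": 1, "fair": 2, "good": 3, "very good": 4, "excellent": 5}
coding_b = {"poor": 10, "fair": 20, "good": 35, "very good": 70, "excellent": 100}

def same_order(first, second):
    """Return True if both codings rank every pair of categories the same way."""
    names = list(first)
    return all(
        (first[x] < first[y]) == (second[x] < second[y])
        for x in names
        for y in names
    )

order_preserved = same_order(coding_a, coding_b)

# Whether "2 * fair" equals "very good" depends on the arbitrary coding.
double_fair_a = 2 * coding_a["fair"] == coding_a["very good"]
double_fair_b = 2 * coding_b["fair"] == coding_b["very good"]

print(order_preserved, double_fair_a, double_fair_b)
```

Comparisons such as excellent > poor hold under both codings, while the arithmetic statement holds under one and fails under the other.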

Examples: Ordinal Data
A type of categorical data in which order is
important.
Class of degree - 1st class, 2:1, 2:2, 3rd class,
fail
Degree of illness - none, mild, moderate,
acute, chronic.
Opinion of students about stats classes - very unhappy, unhappy, neutral, happy,
ecstatic!

Types of Data & Information


Data
  Categorical?
    No  -> Interval Data
    Yes -> Categorical Data
             Ordered?
               Yes -> Ordinal Data
               No  -> Nominal Data

Types of data - examples


Interval data

Age    Income
55     75000
42     68000
..     ..

Weight gain
+10
+5
..

Nominal data
With nominal data, all we can do is calculate the proportion
of data that falls into each category.

         IBM    Dell   Compaq   Other   Total
Count    25     11     8        6       50
Share    50%    22%    16%      12%

Exploratory Data Analysis

Exploratory data analysis is the process of using simple math and
pictures to summarize data.

Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Types of data analysis


Knowing the type of data is necessary to properly
select the technique to be used when analyzing data.
Type of analysis allowed for each type of data

Interval data - arithmetic calculations
Nominal data - counting the number of observations in each
category
Ordinal data - computations based on an ordering process

2.2 Graphical Techniques for Nominal Data
The only allowable calculation on nominal
data is to count the frequency of each value
of a variable.
When the raw data can be naturally
categorized in a meaningful manner, we can
display frequencies by

Bar charts emphasize the frequency of occurrences
of the different categories.
Pie charts emphasize the proportion of
occurrences of each category.

Graphical & Tabular Techniques for Nominal Data
First we need to summarize the data in
a table that presents the categories and
their counts called a frequency
distribution.
A relative frequency distribution lists
the categories and the proportion with
which each occurs.
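
As a minimal sketch, both distributions can be computed by counting. The raw list here is rebuilt from the computer-brand counts in the earlier nominal-data example (IBM 25, Dell 11, Compaq 8, Other 6); the variable names are illustrative only.

```python
from collections import Counter

# Raw nominal observations, rebuilt from the brand counts in the earlier example.
observations = ["IBM"] * 25 + ["Dell"] * 11 + ["Compaq"] * 8 + ["Other"] * 6

# Frequency distribution: each category with its count.
frequency = Counter(observations)
total = sum(frequency.values())

# Relative frequency distribution: each category with its proportion.
relative_frequency = {category: count / total for category, count in frequency.items()}

for category, count in frequency.items():
    print(f"{category:<8}{count:>4}{relative_frequency[category]:>8.0%}")
```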

Nominal Data (Tabular Summary)

The Pie Chart


The pie chart is a circle, subdivided into
a number of slices that represent the
various categories.
The size of each slice is proportional to
the percentage corresponding to the
category it represents.

The Pie Chart

Accounting 28.9%
Finance 20.6%
General management 14.2%
Marketing 25.3%
Other 11.1%

Accounting slice: (28.9 / 100)(360°) = 104°
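
The same conversion from percentage to slice angle can be sketched for every category, using the percentages shown on the chart. The variable names are illustrative, and the published percentages are rounded, so the angles add up to slightly more than 360.

```python
# Percentages as printed on the pie chart.
percentages = {
    "Accounting": 28.9,
    "Finance": 20.6,
    "General management": 14.2,
    "Marketing": 25.3,
    "Other": 11.1,
}

# Each slice takes the same share of 360 degrees as its share of the data.
slice_angles = {area: pct / 100 * 360 for area, pct in percentages.items()}

for area, angle in slice_angles.items():
    print(f"{area:<20}{angle:6.1f} degrees")
```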

The Bar Chart


Rectangles represent each category.
The height of the rectangle represents the frequency.
The base of the rectangle is arbitrary
Bar Chart
[Figure: bar chart of Frequency (axis 0 to 80) by Area; the bars are labeled 73, 64, 52, 36 and 28.]

It's all the same information
(based on the same data).
Just a different presentation.

The Bar Chart


Use bar charts also when the order in which
nominal data are presented is meaningful.
[Figure: bar chart, "Total number of new products introduced in North America in the years 1989, ..., 1994"; vertical axis 0 to 20,000, horizontal axis 89 to 94.]

2.3 Graphical Techniques for Interval Data
Example 2.1 Goal: Display & describe
information concerning the monthly bills
of new telephone subscribers

Collect data
Prepare a frequency distribution
Draw a histogram

How many classes to use?

With 200 observations, we should have
between 7 & 10 classes.

Alternatively, we could use Sturges' formula:
Number of class intervals = 1 + 3.3 log(n) = 1 + 3.3(2.3) = 8.6
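
A quick check of Sturges' formula for n = 200, taking log to be the base-10 logarithm (which matches the 2.3 used above). The function name is illustrative.

```python
import math

def sturges_classes(n):
    """Suggested number of class intervals: 1 + 3.3 * log10(n)."""
    return 1 + 3.3 * math.log10(n)

suggested = sturges_classes(200)
print(round(suggested, 1))
```

The result is a guideline rather than a rule; the slides settle on 8 classes.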

Preparing the histogram


Collect data
Bills
42.19
38.45
29.23
89.35
118.04
110.46
0.00
72.88
83.05
.
.
(There are 200 data points.)

Prepare a frequency distribution


How many classes to use?

Number of observations    Number of classes
Less than 50              5-7
50 - 200                  7-9
200 - 500                 9-10
500 - 1,000               10-11
1,000 - 5,000             11-13
5,000 - 50,000            13-17
More than 50,000          17-20

Class width = [Range] / [# of classes]
            = [Largest observation - Smallest observation] / [# of classes]
            = [119.63 - 0] / [8] = 14.95, rounded to 15
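
The class-width arithmetic, as a short sketch. Rounding up to the next whole number is an assumption about how the slide arrives at 15; any convenient width at least as large as the raw value would cover the range.

```python
import math

largest = 119.63   # largest observation
smallest = 0.0     # smallest observation
num_classes = 8

raw_width = (largest - smallest) / num_classes

# Assumption: round up so that 8 classes are sure to cover the whole range.
class_width = math.ceil(raw_width)

print(round(raw_width, 2), class_width)
```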

Building the Histogram


1) Collect the Data
2) Create a frequency distribution for the data
a) Determine the number of classes to use. [8]
b) Determine how large to make each class. [15]
c) Place the data into each class
each item can only belong to one class;
classes contain observations greater than
their lower limits and less than or equal to
their upper limits.
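
Step 2c can be sketched as follows, using only the nine bills listed earlier rather than the full set of 200. The helper `class_index` is hypothetical. One assumption is labeled in the code: a bill of exactly 0 is counted in the first class, even though the rule above says observations are greater than the lower limit.

```python
import math

# The nine bills listed on the slide (the full data set has 200 values).
bills = [42.19, 38.45, 29.23, 89.35, 118.04, 110.46, 0.00, 72.88, 83.05]
class_width = 15
num_classes = 8

def class_index(value):
    """Return the 0-based index of the class (lower, upper] containing value."""
    if value <= 0:
        # Assumption: a bill of exactly 0 is counted in the first class.
        return 0
    return math.ceil(value / class_width) - 1

# Each observation lands in exactly one class.
counts = [0] * num_classes
for bill in bills:
    counts[class_index(bill)] += 1

for i, count in enumerate(counts):
    lower, upper = i * class_width, (i + 1) * class_width
    print(f"{lower:>3} to {upper:<3} {count}")
```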

Drawing the histogram


Draw a Histogram

Interpreting the Histogram


What information is visible from this histogram?

[Figure: histogram of Bills; horizontal axis from 15 to 120 in steps of 15, vertical axis Frequency.]

About half of all the bills are small: 71+37=108
A few bills are in the middle range: 13+9+10=32
Relatively large number of large bills: 18+28+14=60

Additional notes: Relative frequency


It is often preferable to show the relative frequency
(proportion) of observations falling into each class,
rather than the frequency itself.
Class relative frequency = Class frequency / Total number of observations

Relative frequencies are especially important when
comparing two or more histograms
the numbers of observations in the samples studied are
different
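
A minimal sketch of the relative-frequency calculation, using the eight class frequencies of the telephone bills histogram (71, 37, 13, 9, 10, 18, 28, 14). The variable names are illustrative.

```python
# Class frequencies for the telephone bills, in class order.
class_frequencies = [71, 37, 13, 9, 10, 18, 28, 14]
total = sum(class_frequencies)

# Class relative frequency = class frequency / total number of observations.
relative = [frequency / total for frequency in class_frequencies]

for frequency, share in zip(class_frequencies, relative):
    print(f"{frequency:>3}  {share:.3f}")
```

Because each value is a proportion, histograms built from samples of different sizes can be compared on the same vertical scale.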

Additional notes: Class width


It is generally best to use equal class widths, but
sometimes unequal class widths are called for.
Unequal class widths are used when the
frequency associated with some classes is too
low. Then, several classes are combined to form a
wider and more populated class.
It is possible to form an open-ended class at the
higher end or lower end of the histogram.

Shapes of histograms

Negatively skewed
Positively skewed

Modal classes
A modal class is the one with the largest
number of observations.

A unimodal histogram

The modal class

Modal classes
A bimodal histogram

A modal class

A modal class

Bell shaped histograms


Many statistical techniques require that the
population be bell shaped.
Drawing the histogram helps verify the shape of
the population in question

Interpreting histograms
Example 2.2: Selecting an investment

An investor is considering investing in one
out of two investments.
The returns on these investments were
recorded.
From the two histograms, how can the
investor interpret the following?
Expected returns
The spread of the returns (the risk involved with
each investment)

Comparing two Histograms


[Figure: histograms of "Return on investment A" and "Return on investment B"; horizontal axis from -15 to 75, vertical axis from 0 to 18; the center of each histogram is marked.]

Interpretation: The center of the returns of Investment A
is slightly lower than that for Investment B.

Comparing two Histograms


[Figure: histograms of "Return on investment A" and "Return on investment B", sample size = 50 each; horizontal axis from -15 to 75; the values 17, 34, 46 are marked on A and 16, 26, 43 on B.]

Interpretation: The spread of returns for Investment A
is less than that for Investment B.

Comparing two Histograms


[Figure: histograms of "Return on investment A" and "Return on investment B"; horizontal axis from -15 to 75, vertical axis from 0 to 18.]

Interpretation: Both histograms are slightly positively
skewed. There is a possibility of large returns.

Conclusion: two Histograms


Example 2.2: Conclusion

It seems that investment A is better, because:
Its expected return is only slightly below that of
investment B.
The risk from investing in A is smaller.
The possibility of having a high rate of return exists
for both investments.

Another example: comparing two histograms
Example 2.3: Comparing students' performance

Students' performance in two statistics classes was
compared.
The two classes differed in their teaching emphasis:
Class A - mathematical analysis and development of
theory.
Class B - applications and computer-based
analysis.
The final mark for each student in each course was
recorded.
Draw histograms and interpret the results.

Comparing two histograms


[Figure: two histograms of final marks, "Marks (Manual)" and "Marks (Computer)"; horizontal axis from 50 to 100, vertical axis Frequency from 0 to 40.]

The mathematical emphasis
creates two groups, and a
larger spread.

Stem & Leaf Display


Retains information about individual observations that
would normally be lost in the creation of a histogram.
Split each observation into two parts, a stem and a leaf:
For example, observation value 48.19:
There are several ways to split it up.
We could split it at the decimal point (and round):
Stem 48, Leaf 2
Or split it at the tens position (still rounding):
Stem 4, Leaf 8
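
Both ways of splitting 48.19 can be sketched with integer division. The function names are hypothetical, and the sketch assumes non-negative observations.

```python
def split_at_decimal(value):
    """Stem = whole units, leaf = first decimal digit (value rounded to 0.1)."""
    return divmod(round(value * 10), 10)

def split_at_tens(value):
    """Stem = tens, leaf = units digit (value rounded to a whole number)."""
    return divmod(round(value), 10)

decimal_split = split_at_decimal(48.19)
tens_split = split_at_tens(48.19)
print(decimal_split, tens_split)
```

Rounding before splitting keeps the leaf to a single digit, so a value such as 48.96 moves up to stem 49, leaf 0 under the first rule.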

Stem & Leaf Display


Continue this process for all the observations. Then,
use the stems for the classes and each leaf
becomes part of the histogram (based on
Example 2.4 data) as follows
Stem  Leaf
0     0000000000111112222223333345555556666666778888999999
1     000001111233333334455555667889999
2     0000111112344666778999
3     001335589
4     124445589
5     33566
6     3458
7     022224556789
8     334457889999
9     00112222233344555999
10    001344446699
11    124557889

The length of each line represents the frequency
of the class defined by the stem.
We still have access to the original data points' values!

Cumulative Relative Frequencies:
first class: .355
next class: .355 + .185 = .540
:
:
last class: .930 + .070 = 1.00

Ogives
Ogives are cumulative relative frequency
distributions.
Example 2.1 - continued
Cumulative relative frequency for telephone bills

Class      Frequency   Cumulative frequency   Cumulative relative frequency
0-15       71          71                     71/200 = .355
15-30      37          108                    108/200 = .540
30-45      13          121                    121/200 = .605
45-60      9           130                    130/200 = .650
60-75      10          140                    140/200 = .700
75-90      18          158                    158/200 = .790
90-105     28          186                    186/200 = .930
105-120    14          200                    200/200 = 1.000

[Figure: ogive of Bills; horizontal axis from 15 to 120, with the cumulative relative frequencies .355, .540, .605, .650, .700, .790, .930 and 1.000 plotted at the upper class limits.]

Constructing an ogive
1) Calculate the relative frequencies.
2) Calculate the cumulative relative frequencies by
adding the current class relative frequency to the
previous class cumulative relative frequency.
3) Graph the cumulative relative frequencies
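
Steps 1 and 2 can be sketched with a running sum over the telephone-bill class frequencies. Plotting (step 3) is left out, and the variable names are illustrative.

```python
from itertools import accumulate

# Class frequencies for the telephone bills, in class order.
class_frequencies = [71, 37, 13, 9, 10, 18, 28, 14]
total = sum(class_frequencies)

# Step 1: relative frequencies.
relative = [frequency / total for frequency in class_frequencies]

# Step 2: each cumulative value is the previous cumulative value
# plus the current class relative frequency.
cumulative_relative = list(accumulate(relative))

for share, running in zip(relative, cumulative_relative):
    print(f"{share:.3f}  {running:.3f}")
```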

Why draw an ogive?


The ogive can be
used to answer
questions like:
What telephone
bill value is at the
50th percentile?

Around $27 (it falls in the 15-30 class, where the cumulative
relative frequency rises from .355 to .540).
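
One rough way to read a percentile off the ogive is linear interpolation between the plotted points. This is a sketch under two assumptions: the ogive is drawn with straight segments between upper class limits, and the classes are 15 wide and end at 120. The function name is hypothetical.

```python
# Points of the ogive: class limits and cumulative relative frequencies.
class_limits = [0, 15, 30, 45, 60, 75, 90, 105, 120]
cumulative_relative = [0.0, 0.355, 0.540, 0.605, 0.650, 0.700, 0.790, 0.930, 1.000]

def ogive_value(p):
    """Approximate the bill value at cumulative proportion p (0 <= p <= 1)."""
    for i in range(1, len(class_limits)):
        low_c, high_c = cumulative_relative[i - 1], cumulative_relative[i]
        if p <= high_c:
            low_x, high_x = class_limits[i - 1], class_limits[i]
            # Move along the segment in proportion to how far p is past low_c.
            return low_x + (p - low_c) / (high_c - low_c) * (high_x - low_x)
    return class_limits[-1]

median_bill = ogive_value(0.50)
print(round(median_bill, 2))
```

The result is an approximate reading only, since the individual values within each class are not used.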

Ogive App.


Another App.
Pareto Rule (20/80)


2.5. Describing relationships between two variables (bivariate data)
First, determine the type of variables.
To compare two nominal variables,
use contingency tables with bar/pie
charts
To compare two interval variables, use
scatter diagrams.

A Contingency Table for two Nominal variables
A sample of newspaper readers was asked to report which newspaper they read:
Globe and Mail (1), Post (2), Star (3), or Sun (4), and to indicate whether they
were a blue-collar worker (1), a white-collar worker (2), or a professional (3).

Note how this reader is cross-classified according to both
variables.
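
Cross-classification amounts to counting each (newspaper, occupation) pair. The six responses below are hypothetical and are not the survey data; only the codes follow the scheme described above.

```python
from collections import Counter

newspapers = {1: "Globe and Mail", 2: "Post", 3: "Star", 4: "Sun"}
occupations = {1: "Blue collar", 2: "White collar", 3: "Professional"}

# Hypothetical (newspaper code, occupation code) pairs for six readers.
responses = [(1, 3), (2, 2), (4, 1), (3, 1), (1, 3), (2, 2)]

# Count how many readers fall into each cell of the table.
cell_counts = Counter(responses)

# Rows are newspapers, columns are occupations.
table = [[cell_counts[(n, o)] for o in occupations] for n in newspapers]

print(f"{'':<16}" + "".join(f"{name:<14}" for name in occupations.values()))
for code, row in zip(newspapers, table):
    print(f"{newspapers[code]:<16}" + "".join(f"{count:<14}" for count in row))
```

Dividing each column by its total would give the column relative frequencies that the slides compare across occupations.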

Contingency Table
Interpretation: The relative frequencies in columns 2 and 3 are similar,
but there are large differences between columns 1 and 2 and between
columns 1 and 3.
This tells us that blue-collar workers tend to read different newspapers
from both white-collar workers and professionals, and that white-collar
workers and professionals are quite similar in their newspaper choice.


Graphing a contingency table


Professionals tend to read the Globe & Mail more than
twice as often as the Star or Sun.

The Relationship Between Two Interval Variables
A scatter diagram plots one variable against
the other.
The independent variable is labeled X while
the other, dependent variable, is labeled Y.
For example: A real estate agent
wants to study the relationship
between house price and house size
X variable: Size
Y variable: Price

Size   Price
23     315
24     229
26     335
27     261
..     ..

Scatter Diagram
It appears that there is in fact a relationship: the greater
the house size, the greater the selling price.

Typical Patterns of Scatter Diagrams


Positive linear relationship

No relationship

Negative nonlinear relationship


This is a weak linear relationship.
A non linear relationship seems to
fit the data better.

Negative linear relationship

Nonlinear (concave) relationship

Line Chart
Observations measured at the
same point in time are called
cross-sectional data.
Observations measured at
successive points in time are
called time-series data.


Time-series data are graphed on a
line chart, which plots the value
of the variable on the vertical
axis against the time periods on
the horizontal axis.


Example: plot the total amounts of U.S. income
tax for the years 1987 to 2002.

Line Chart
From 1987 to 1992, the tax was fairly flat. Starting in 1993, there was a rapid
increase in taxes until 2001. Finally, there was a downturn in 2002.

Summary

                                     Interval Data                                 Nominal Data
Single Set of Data                   Histogram, Ogive, or Stem-and-Leaf Display    Frequency and Relative Frequency Tables, Bar and Pie Charts
Relationship Between Two Variables   Scatter Diagram                               Contingency Table, Bar Charts
