
Topic Notes

QM161

Topic 1 Summarising Sample Data

Learning Outcomes

In this topic, you will learn:

the types of data used in business;


to construct tables and charts for numerical data;
to describe the properties of central tendency, variation, and shape in numerical
data;
to construct and interpret a boxplot; and
to compute descriptive summary measures for a population.


1.1 Types of Data

The LSS text classifies data into two types, categorical data and numerical data. For
numerical data the difference between discrete and continuous data is emphasised. It is
important that you are able to recognise whether the measurement of a random variable
uses discrete or continuous units of measurement. Why? Because the type of data often
governs the choice of methodology used in subsequent analysis.

You should now read LSS, Sections LGS1 to LGS3 and Sections 1.1 to 1.3.
It is interesting to consider what level of information is contained in statistical data. A
commonly used classification of data is based on the level of measurement.
Statistical data come in four different types:

(i) Nominal

These data are often also known as categorical data. Here, elements of a population or
sample are measured according to categories. For example, people are classified as male
or female. Voters may be classified as Liberal/National, Labor, Democrat, etc.
It is important to realise that if numbers are used, for example,
Liberal/National = 1
Labor = 2
Democrat = 3
this is only a coding device. Such numbers cannot be manipulated using the ordinary
rules of arithmetic.

(ii) Ordinal

This is a higher level of measurement than nominal, in that an ordering is implied. For
example, workers may be classified on job performance as:
excellent = 1
good = 2
average = 3
poor = 4
Clearly the numbers 1, 2, 3 and 4 are still a coding device, but now 1 is 'better than' 2,
which is better than 3, and so on.
The rules of arithmetic still cannot be applied, and it is particularly important to note that
good (= 2) cannot be taken to mean twice as good as poor (= 4).

(iii) Interval

These data can be recognised by the fact that the 'zero' is arbitrarily chosen for
convenience. Two examples are IQ and temperature. With temperature, 0 does not mean
'no temperature'. It is simply a point on the temperature scale assigned the value zero,
and other temperatures are measured relative to that point. Many of the rules of
arithmetic apply, but some do not. For example, if on day 1 the temperature was 15° and
on day 2 it was 30°, it is perfectly sensible to say the average temperature was 22.5°.
Again, we can certainly say that day 2 was 15° warmer than day 1. However, we cannot
say day 2 was twice as hot as day 1.

(iv) Ratio

This is the highest level of measurement. The origin is not arbitrary, but is unique, and
has a definite meaning. Take, for example, the amount of small change you have in your
purse or pocket at the moment. If you have $5 and your next-door neighbour has $20
then several implications follow immediately.
(a) He/she has more money (order);
(b) He/she has $15 more (interval);
(c) He/she has four times as much money (ratio).
The 'zero', in this case, is not arbitrary, but has a definite meaning, namely 'broke'!


1.2 Describing Data with Graphs

Almost always the first task you will face in a statistical investigation is to get some 'feel'
for the data.
In the following sections I will be giving you some clues on a few basic techniques. In
order to do this I will be using a set of data, which will supplement the examples you will
find in LSS. These data are taken from the Environmental Protection Agency (EPA) and
give 100 observations on car miles per gallon.
TABLE 1: EPA MILEAGE RATINGS ON 100 CARS¹

36.3  32.7  40.5  36.2  38.5  36.3  41.0  37.0  37.1  39.9
41.0  37.3  36.5  37.9  39.0  36.8  31.8  37.2  40.3  36.9
36.9  41.2  37.6  36.0  35.5  32.5  37.3  40.7  36.7  32.9
37.1  36.6  33.9  37.9  34.8  36.4  33.1  37.4  37.0  33.8
44.9  32.9  40.2  35.9  38.6  40.5  37.0  37.1  33.9  39.8
36.8  36.5  36.4  38.2  39.4  36.6  37.6  37.8  40.1  34.0
30.0  33.2  37.7  38.3  35.3  36.1  37.0  35.9  38.0  36.8
37.2  37.4  37.7  35.7  34.4  38.2  38.7  35.6  35.2  35.0
42.1  37.5  40.0  35.6  38.8  38.4  39.0  36.7  34.8  38.1
36.7  33.6  34.2  35.1  39.7  39.3  35.8  34.5  39.5  36.9

Presented in this form, the data convey almost nothing. I am now going to introduce two
graphical techniques that will help to reveal the basic features of the data. Both these
methods show how the data are distributed across the range of observations.

1.2.1 Frequency Tables and Histograms


A histogram is a graphical representation of a 'frequency table'. The basic idea is to
divide the range of observations into mutually exclusive 'classes' or 'groups', and record
the number of observations falling in each class. You will find detailed instructions for
constructing a frequency table in LSS, pp. 75-83. The key to obtaining a useful
representation of the data is the choice of the number of classes. It is important to realise
that there is no 'correct' number. We will always be guided by the pragmatic ideal of
ending up with a table (and later, a histogram) which gives us the greatest feel for the
data. If the number of classes is either too large or too small, the main features of the
data will be concealed.

¹ McClave, J. T. and Dietrich, F. H., A First Course in Statistics, 3rd ed., Dellen Macmillan, London, 1989, p. 15.


Let us now consider the EPA data in some detail. As a start we will somewhat arbitrarily
divide the data into 10 classes. As you see from LSS, p.77, the class width (CW) should
satisfy

$$ CW > \frac{\text{range}}{\text{number of classes}} = \frac{H - L}{K} $$

where H represents the highest number;
      L represents the lowest number; and
      K represents the number of classes.

For these data, H = 44.9, L = 30.0 and K = 10, so

$$ CW > \frac{44.9 - 30.0}{10} = 1.49 $$

To make the construction of the frequency table easier, we choose the class interval of
1.5.
We set up a frequency table with 10 classes, each of width 1.5, so that every observation
falls into one, but only one, class. The lower limit of the first class should be less than the
minimum observation, 30.0. We thus define the first class as 29.96 to 31.45, the second
as 31.46 to 32.95, etc. The tenth class is from 43.46 to 44.95, which includes the
maximum value, 44.9. We obtain the following frequency table:


TABLE 2: FREQUENCY TABLE OF EPA DATA

Class          Class Midpoint   Frequency   Relative Frequency
29.96-31.45    30.7             1           0.01
31.46-32.95    32.2             5           0.05
32.96-34.45    33.7             9           0.09
34.46-35.95    35.2             14          0.14
35.96-37.45    36.7             33          0.33
37.46-38.95    38.2             18          0.18
38.96-40.45    39.7             12          0.12
40.46-41.95    41.2             6           0.06
41.96-43.45    42.7             1           0.01
43.46-44.95    44.2             1           0.01
Total                           100         1.00

The columns of the table are defined as follows:

(a) The class midpoint (or class centre) is the 'middle number', and is the average of the lower and upper
class limits. So, for the first class, the midpoint is (29.96 + 31.45)/2 = 30.705 (shown as 30.7),
and for the second class it is (31.46 + 32.95)/2 = 32.205 (shown as 32.2), etc. Notice
that the class midpoint increases by 1.5, which is the class width.
(b) The frequency is simply the number of observations that are in the particular class.
For example, 12 of the EPA mileages lie in the range 38.96 to 40.45. The
frequencies must add up to the total number of observations - here 100.
(c) The relative frequency is the frequency expressed relative to the total number of
observations. 12 out of the 100 (or 0.12 of the observations) lie in the class 38.96
to 40.45. As a general rule

$$ \text{relative frequency} = \frac{\text{frequency}}{\text{total number of observations}} $$

Relative frequencies must always add up to 1.00.
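
If you want to check a frequency table by machine rather than by hand, the following minimal sketch (my own illustration in Python, assuming numpy is installed; it is not part of LSS) builds the class boundaries, midpoints, frequencies and relative frequencies. Only the first ten EPA values from Table 1 are typed in here; with all 100 observations it reproduces Table 2.

    import numpy as np

    # Illustrative subset: the first ten EPA mileages from Table 1.
    # Replace with all 100 observations to reproduce Table 2.
    data = np.array([36.3, 32.7, 40.5, 36.2, 38.5, 36.3, 41.0, 37.0, 37.1, 39.9])

    k = 10                                    # chosen number of classes
    cw = 1.5                                  # class width, rounded up from (H - L)/K = 1.49
    edges = 29.96 + cw * np.arange(k + 1)     # class boundaries 29.96, 31.46, ..., 44.96

    freq, _ = np.histogram(data, bins=edges)  # number of observations in each class
    rel_freq = freq / data.size               # frequency / total number of observations
    midpoints = (edges[:-1] + edges[1:]) / 2  # class midpoints of the 1.5-wide intervals

    for lo, hi, m, f, rf in zip(edges[:-1], edges[1:], midpoints, freq, rel_freq):
        print(f"{lo:5.2f} to {hi:5.2f}   midpoint {m:5.2f}   freq {f:3d}   rel freq {rf:4.2f}")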


Excel instructions can be found in LSS EG2.2 pp. 124-126.
I used Excel to produce the following version of Table 2 and a histogram based on the
frequency table (see Figure 1).


Table 2: Frequency Table of the EPA Data (Excel output)

Bin      Frequency   Percentage
31.45    1           1.00%
32.95    5           5.00%
34.45    9           9.00%
35.95    14          14.00%
37.45    33          33.00%
38.95    18          18.00%
40.45    12          12.00%
41.95    6           6.00%
43.45    1           1.00%
44.95    1           1.00%

Figure 1: Histogram of the EPA mileage data

A glance at the histogram clearly reveals the most important characteristics of the data,
as follows:
(i)  The observations are centred at around 37 mpg.
(ii) They distribute themselves approximately 'symmetrically' about the centre.

A histogram may be constructed using EXCEL/PHStat. Instructions are given in EG2.2 and
EG2.4 of LSS.
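
If Excel/PHStat is not available, a histogram can also be drawn with a short script. The sketch below is a minimal Python illustration (matplotlib and numpy assumed; not part of LSS). It uses the same ten classes as Table 2, so with the full 100 EPA observations it should produce a chart like Figure 1.

    import numpy as np
    import matplotlib.pyplot as plt

    # Illustrative subset of the EPA mileages (first row of Table 1);
    # use all 100 observations to reproduce Figure 1.
    data = [36.3, 32.7, 40.5, 36.2, 38.5, 36.3, 41.0, 37.0, 37.1, 39.9]

    edges = 29.96 + 1.5 * np.arange(11)   # the ten classes of width 1.5 used in Table 2

    plt.hist(data, bins=edges, edgecolor="black")
    plt.xlabel("Miles per gallon")
    plt.ylabel("Frequency")
    plt.title("Histogram of the EPA mileage data")
    plt.show()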

You should now read LSS, pp. 93-97.


1.2.2 Stem-and-Leaf Diagrams


The stem-and-leaf diagram conveys very similar information to the histogram, with one
important difference. Given a frequency table or a histogram, it is not possible to recover
the original observations. Once observations have been put in classes, information is lost,
as all observations in a class are treated as being equal to the class centre. For example,
with the EPA mileage data, the five observations 32.7, 31.8, 32.5, 32.9 and 32.9 are all in
the class 31.45 to 32.95 and treated as being equal to the value 32.2 (the class centre).
With a stem-and-leaf diagram, the original observations can always be recovered.

You should now read LSS, pp. 92-93.

We now use EXCEL to obtain a stem-and-leaf diagram for the EPA data. The instructions
are in LSS, EG2.4, p. 128.
The 100 observations were typed into column A. The title EPA was entered in the first cell,
so the data were in the range A2:A101.
In the Stem-and-Leaf dialogue box, I entered the following:
Variable Cell Range: A1:A101
Click First Cell Contains Label
Click Autocalculate Stem Unit (This will need to be set for different data sets.)
Output Title: EPA DATA
Click Summary Statistics
The output is shown below.
The comments made with respect to the histogram still apply. In addition, we can see
that the observation 44.9 may be an 'outlier' - that is, it may not be typical. The original
observations can be recovered. For example, if we look at the fourth line, six
observations are represented, namely 33.1, 33.2, 33.6, 33.8 and 33.9 (twice).
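
PHStat is not the only way to obtain such a display. As a rough illustration (my own Python sketch, not the LSS procedure), each observation can be split into a stem (the whole number of miles per gallon) and a leaf (the tenths digit); the full 100 observations give essentially the display reproduced below.

    from collections import defaultdict

    # Illustrative subset of the EPA data; run with all 100 values
    # to obtain the full display.
    data = [36.3, 32.7, 40.5, 36.2, 38.5, 36.3, 41.0, 37.0, 37.1, 39.9]

    leaves = defaultdict(list)
    for x in sorted(data):
        stem = int(x)               # stem unit is 1 mpg, e.g. 36 for 36.3
        leaf = round(10 * x) % 10   # the tenths digit, e.g. 3 for 36.3
        leaves[stem].append(str(leaf))

    for stem in range(min(leaves), max(leaves) + 1):
        print(f"{stem:2d} | {''.join(leaves[stem])}")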


EPA DATA
Stem unit: 1

30 | 0
31 | 8
32 | 5799
33 | 126899
34 | 024588
35 | 01235667899
36 | 01233445566777888999
37 | 000011122334456677899
38 | 0122345678
39 | 00345789
40 | 0123557
41 | 002
42 | 1
43 |
44 | 9

Statistics
Sample Size      100
Mean             36.994
Median           37
Std. Deviation   2.417897
Minimum          30
Maximum          44.9

1.3 Using Descriptive Measures

1.3.1 Introduction
In the previous sections, we explored a few of the common graphical methods that can
be used to discover the main features of a set of data.
We may define a descriptive measure as 'a single number computed from the sample
data that provides information about the data'. As you will discover, one of the reasons
statistics is such a powerful science is that often one or two descriptive measures can
capture an enormous amount of information about the data, which could contain
hundreds, or even thousands, of observations. To start with, we are going to look at two
different types of descriptive measures, called measures of 'central tendency' and
measures of 'variation'.

1.3.2 A Digression - Sigma Notation


When we have a sample with n observations on some variable of interest, it is helpful (for
algebraic calculations) to denote the individual observations by
X1, X2, . . . , Xn,
where Xi represents the value for the i-th observation. The letter, i, is called the 'subscript'
of X. With the EPA data, X1 = 36.3, X2 = 32.7, . . . , X100 = 36.9. When we analyse the data,
we are often going to want to add the observations, or add the squares of the
observations, and so on.
The sum of the values, X1 + X2 + ... + Xn, is important in our discussion below. We
represent the sum of a set of numbers by using what is called the 'sigma notation', as
follows:

$$ \sum_{i=1}^{n} X_i = X_1 + X_2 + \dots + X_n . $$

This notation should be understood as follows:

'Σ' (the upper-case Greek letter, sigma) stands for 'sum of'.

'$\sum_{i=1}^{n} X_i$' simply represents 'the sum of all the numbers Xi, where i takes all the
integer values from 1 to n (inclusive)'.

The expression $\sum_{i=4}^{6} X_i = X_4 + X_5 + X_6$ represents the sum of the values of Xi, where the
subscript, i, has values from 4 to 6 (inclusive). Generally, the summation notation is used
only when the values of the subscript go from 1 to n, as above.

The sum of the squares of all the Xi-values is conveniently represented by $\sum_{i=1}^{n} X_i^2$, that is,
by definition,

$$ \sum_{i=1}^{n} X_i^2 = X_1^2 + X_2^2 + \dots + X_n^2 . $$

In our subsequent work, we often abbreviate $\sum_{i=1}^{n} X_i$ to ΣXi, when it is obvious that the sum
of all the Xi-values is involved. This simply means 'sum all the Xs'. Note that
$\sum_{i=1}^{n} X_i$, $\sum_{j=1}^{n} X_j$ and $\sum_{k=1}^{n} X_k$ are exactly the same thing. Using different subscripts does not
change the fact that the sum of all the values is involved.
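
If it helps to see the notation in action, here is a tiny Python illustration (my own sketch; the variable names are arbitrary) of the sums just described:

    # Six values, X1 to X6, purely for illustration (taken from Table 1).
    X = [36.3, 32.7, 40.5, 36.2, 38.5, 36.3]
    n = len(X)

    sum_X = sum(X)                                 # sum of X_i for i = 1 to n
    sum_X_squared = sum(x ** 2 for x in X)         # sum of X_i^2 for i = 1 to n
    partial = sum(X[i - 1] for i in range(4, 7))   # X_4 + X_5 + X_6 (Python lists count from 0)

    print(sum_X, sum_X_squared, partial)

    # The letter used for the subscript makes no difference:
    assert sum(X[i] for i in range(n)) == sum(X[j] for j in range(n))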

You should now read LSS, Appendix A.4 pp. 698-701.


1.3.3 Measures of Central Tendency of Sample Data


A measure of central tendency is a number that, in some sense, represents the 'middle' of
the sample data. It is a value around which the observations 'cluster'. Three measures are
often used, known as the sample mean, the sample median and the sample mode. In
these notes only the sample mean is discussed. This measure is by far the most
important in our later work. You should, however, read LSS on the sample median and the
sample mode and know how to obtain these measures.
The sample mean
The sample mean of n numbers is simply the arithmetic average, which is defined as the
sum of the numbers, divided by n. We denote it by X̄ (called 'X bar'). Thus,

$$ \bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i . $$

For the EPA data: n = 100 and $\sum_{i=1}^{100} X_i = 3699.4$, so

$$ \bar{X} = \frac{3699.4}{100} = 36.994 . $$

The sample mean of the EPA mileages is approximately 37 mpg. This is entirely
consistent with what we observed from the histogram. To calculate some summary
statistics for the EPA data we use the Data | Data analysis command in EXCEL (see LSS,
EG3.1 and EG3.2). We will be referring to this summary in the next section. The summary
is:


EPA
Mean                 36.994
Standard Error       0.24179
Median               37
Mode                 37
Standard Deviation   2.417897
Sample Variance      5.846226
Kurtosis             0.769923
Skewness             0.050909
Range                14.9
Minimum              30
Maximum              44.9
Sum                  3699.4
Count                100

Note: LSS uses the term "mean" rather than "sample mean" in the above data summary.
However, using the word "sample" before "mean" is recommended when a set of sample
data are being summarised. Later the word "mean" is used for something different from,
but related to, the "sample mean".
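
The same kind of summary can be produced outside Excel. Below is a minimal Python sketch (using only the standard library; the layout and variable names are mine, and skewness and kurtosis are omitted for brevity) that mirrors the Data Analysis output:

    import math
    import statistics as st

    # Illustrative subset of the EPA data; run with all 100 observations
    # to reproduce the summary shown above.
    data = [36.3, 32.7, 40.5, 36.2, 38.5, 36.3, 41.0, 37.0, 37.1, 39.9]
    n = len(data)

    summary = {
        "Mean": st.mean(data),
        "Standard Error": st.stdev(data) / math.sqrt(n),  # S / sqrt(n)
        "Median": st.median(data),
        "Mode": st.mode(data),
        "Standard Deviation": st.stdev(data),             # uses the n - 1 divisor, like Excel
        "Sample Variance": st.variance(data),
        "Range": max(data) - min(data),
        "Minimum": min(data),
        "Maximum": max(data),
        "Sum": sum(data),
        "Count": n,
    }
    for name, value in summary.items():
        print(f"{name:20s} {value}")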

You should now read LSS, Section 3.1.

1.3.4 Measures of Variability of Sample Data


A second very important feature of a set of data is the degree of variability. Two sets of
data may have the same sample mean, yet one may show very little variation while the
other varies a lot. In other words, the observations in the first set might be almost the
same, whereas the observations in the second set exhibit large changes from observation
to observation.
Let us make a start on measuring variability by defining the deviation of the i-th
observation as the difference between Xi and the sample mean, X̄. That is,

$$ D_i = X_i - \bar{X} . $$

The average of the deviations for all values in a data set is represented by


$$ \bar{D} = \frac{1}{n}\sum_{i=1}^{n} D_i . $$

Unfortunately, this is not useful as a measure of the variability of the observations, because

$$ \sum_{i=1}^{n} D_i = \sum_{i=1}^{n} (X_i - \bar{X}) $$

is always zero, no matter which data set is involved.
The reason that the average deviation does not work is that a deviation may be positive
(if Xi is greater than X̄) or negative (if Xi is less than X̄), and the positive and negative
effects always cancel out.
An alternative approach uses the squares of the deviations (which are never negative).
This leads to the idea of measuring variability by the average of the squared deviations,
that is,

$$ \text{Variability} = \frac{1}{n}\sum_{i=1}^{n} D_i^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2 . $$

You will see that in equation (3.6) of LSS on p. 143, the variability measure suggested has
n-1 as the divisor. This is a 'better measure', as discussed later.
This has now led us to a definition of what is known as the sample variance, always
denoted by S², and given by

$$ S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2 . $$

A second measure, known as the sample standard deviation, is the square root of S²
(denoted, naturally, by S), given by

$$ S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2} . $$

The sample variance can be interpreted as an average of the squared differences in a data
set from their sample mean. The sample standard deviation can be thought of, somewhat
imprecisely, as an average deviation, which was our original intuitive approach to
measuring variability. When statisticians use the word 'standard', they are indicating
something similar to an average.
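
As a quick check of these definitions, the short Python sketch below (illustrative only; any small sample would do) verifies that the deviations sum to zero and computes S² and S with the n - 1 divisor:

    import math

    # A small sample for illustration (EPA values from Table 1).
    X = [36.3, 32.7, 40.5, 36.2, 38.5, 36.3, 41.0, 37.0, 37.1, 39.9]
    n = len(X)

    x_bar = sum(X) / n                    # the sample mean
    deviations = [x - x_bar for x in X]   # D_i = X_i - X bar
    print(round(sum(deviations), 10))     # always 0, apart from rounding error

    S2 = sum(d ** 2 for d in deviations) / (n - 1)   # the sample variance
    S = math.sqrt(S2)                                # the sample standard deviation
    print(S2, S)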


The sample variance S² will often be used in later statistical inference. However, the
sample standard deviation is much easier to interpret, because it is in the same units of
measurement as the original X-values.
The sample standard deviation of the EPA data is 2.418 mpg (see the above summary
statistics). This implies that the 'average' deviation of an observation from the sample
mean, 36.994, is 2.418 mpg.
On the other hand, the sample variance is (2.418)² mpg² = 5.847 mpg². This does not
have an easy interpretation. The important point at this stage is that both measures
contain the same information, in the sense that if we know one, we can always find the
other.

Calculating S²
When computing the value of S² with a hand calculator, the recommended calculating
formula is

$$ S^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2 / n\right] . $$

This is much more convenient to use than the formula in equation (3.6) of LSS. The two
quantities, ΣXi and ΣXi², need to be calculated from the data prior to using the above
formula. For example, for the EPA data: n = 100, ΣXi = 3699.4 and ΣXi² = 137,434.38.
Thus

$$ S^2 = \frac{1}{99}\left[137{,}434.38 - (3699.4)^2 / 100\right] = 5.846226 . $$

Hence, S = 2.42, correct to two decimal places.
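
The hand-calculator formula is easy to verify by machine. The short Python sketch below (my own illustration) plugs in the two EPA sums quoted above and recovers the same sample variance and standard deviation as the Excel summary:

    import math

    n = 100
    sum_x = 3699.4               # sum of the 100 EPA mileages
    sum_x_squared = 137_434.38   # sum of their squares

    S2 = (sum_x_squared - sum_x ** 2 / n) / (n - 1)   # shortcut formula for the sample variance
    S = math.sqrt(S2)

    print(round(S2, 6), round(S, 2))   # 5.846226 and 2.42, as in the notes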


The concept of relative variability, as measured by the coefficient of variation (CV), is
quite important. Read about this in LSS, p. 146.

You should now read LSS, pp. 141-145.


1.3.5 A Measure of Relative Position - the Z-score


In this section, we concentrate on a particular observation, rather than the whole data set,
and answer the question 'where does this particular observation come relative to the
whole sample?' We answer this by calculating the Z-score of the observation. We define Zi,
the Z-score of Xi, by the formula

$$ Z_i = \frac{X_i - \bar{X}}{S} . $$

Suppose that three observations from the EPA data are denoted by
X1 = 40.1,   X2 = 32.7,   X3 = 37.1.

(These are not the first three observations of the data in Table 1 of Section 1.2.)
From the formula above, the Z-score of X1 is

$$ Z_1 = \frac{40.1 - 36.994}{2.418} = 1.28 . $$

Similarly,

$$ Z_2 = \frac{32.7 - 36.994}{2.418} = -1.78 \quad \text{and} \quad Z_3 = \frac{37.1 - 36.994}{2.418} = 0.04 . $$

These Z-scores enable us to make the following comments about X1, X2, and X3.
(a) Because Z1 and Z3 are positive, X1 and X3 are larger than the sample mean.
    Because Z2 is negative, X2 is less than X̄.
(b) The magnitude of the Z-score tells us how many sample standard deviations the
    observation is away from the sample mean. Thus X1 is about 1.3 sample standard
    deviations greater than X̄, whereas X2 is about 1.78 sample standard deviations
    less than X̄.
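
These calculations are easy to reproduce. The sketch below (Python, purely illustrative) standardises the three observations using the sample mean and sample standard deviation quoted above:

    x_bar = 36.994   # sample mean of the EPA data
    S = 2.418        # sample standard deviation (rounded)

    observations = [40.1, 32.7, 37.1]
    z_scores = [(x - x_bar) / S for x in observations]

    for x, z in zip(observations, z_scores):
        print(f"X = {x:4.1f}   Z = {z:5.2f}")   # 1.28, -1.78 and 0.04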
To every Xi there is a corresponding Z-score. So, for example, for the EPA mileage data,
there are 100 Z-scores. An interesting question is: 'What are the characteristics of the
sample of Z-scores?'


Using EXCEL to generate the Z-scores, the following histogram, sample mean and sample
standard deviation of the 100 Z-scores are obtained.

Histogram of the 100 Z-scores (Frequency plotted against Z-score, with bins running from about -3.1 to 3.3)

Z-scores
Mean                 4.949E-16
Standard Error       0.1
Median               0.002316
Mode                 0.002316
Standard Deviation   0.9999999
Sample Variance      0.9999998
Kurtosis             0.7696473
Skewness             0.0504045
Range                6.1622831
Minimum              -2.8927163
Maximum              3.2695668
Sum                  4.949E-14
Count                100

Note the use of E format: E-16 means that you should move the decimal point 16 places
to the left.

Note:
a) The shape of the histogram for the Z-scores is the same as that for the X-values.
b) The sample mean is Z̄ = 0, and the sample standard deviation is SZ = 1. This
property of the sample mean and sample standard deviation is true for every
sample of Z-scores. For this reason, Z-scores can be regarded as standardising any
set of data to a comparable basis. The Z-scores are very important in later work on
statistical inference.
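
Property (b) is easy to confirm empirically. The Python sketch below (illustrative; any data set could be substituted) standardises a small sample and checks that the resulting Z-scores have sample mean 0 and sample standard deviation 1:

    import statistics as st

    # Any sample will do; here, ten EPA values from Table 1.
    X = [36.3, 32.7, 40.5, 36.2, 38.5, 36.3, 41.0, 37.0, 37.1, 39.9]

    x_bar, S = st.mean(X), st.stdev(X)
    Z = [(x - x_bar) / S for x in X]

    print(round(st.mean(Z), 12))    # 0.0, apart from rounding error (compare 4.949E-16 above)
    print(round(st.stdev(Z), 12))   # 1.0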

You should now read LSS, pp. 147-148.


1.3.6 The Empirical Rule


On p. 163 of LSS, a rule is given that relates the sample mean and the sample standard
deviation. This Empirical Rule is important for future work. It comes in the form of three
statements that apply to samples with symmetric, 'bell-shaped', histograms. The EPA
mileage data exhibit this characteristic. The rule is stated in slightly different terms to
LSS.
For a sufficiently large data set that has a bell-shaped distribution:
1. Approximately 68% of observations in a data set lie within one sample
standard deviation of the sample mean.
2. Approximately 95% of observations in a data set lie within two sample
standard deviations of the sample mean.
3. Virtually all observations in a data set lie within three sample standard
deviations of the sample mean.
This rule can also be stated in terms of Z-scores, as follows:
1. Approximately 68% of Z-scores lie between -1 and +1.
2. Approximately 95% of Z-scores lie between -2 and +2.
3. Virtually all Z-scores lie between -3 and +3.
For the EPA data, we now revisit the question 'where does the observation, X1=40.1, occur
relative to the whole sample?' We know that its Z-score is Z=1.28. This tells us, by the
Empirical Rule, that it is somewhere in the top 16% of the sample. As discussed later, the
Z-scores enable us to make far more precise statements than this, but for the moment,
the Empirical Rule is helpful in gaining a feel for the data.
When the data are not bell-shaped, we can use the Chebyshev Rule instead. See p. 164 of LSS.
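
To get a feel for how well the rule works, the Python sketch below (illustrative only; it simulates a bell-shaped sample rather than reading in the EPA file) counts the proportion of observations lying within one, two and three sample standard deviations of the sample mean:

    import random
    import statistics as st

    random.seed(1)
    # A simulated bell-shaped sample (the EPA data behave similarly).
    data = [random.gauss(37, 2.4) for _ in range(10_000)]

    x_bar, S = st.mean(data), st.stdev(data)
    for k in (1, 2, 3):
        within = sum(abs(x - x_bar) <= k * S for x in data) / len(data)
        print(f"within {k} sample standard deviation(s): {within:.1%}")   # roughly 68%, 95%, 99.7%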


1.3.7 Boxplots


A boxplot is a diagrammatic depiction of the data set. The box's edges are the lower
quartile (Q1) and the upper quartile (Q3) and it contains the median. Thus, the box
encloses the middle 50% of the data. The tails extend out from the box to the largest and
the smallest data values. To construct a boxplot we need five summary numbers from the
data: the smallest value, the lower quartile, the median, the upper quartile and the largest
value. Using Excel/PHStat (see pp. 181-182) and the EPA data we obtain the following
graphical depiction of the data:
Five-Number Summary (EPA data)
Minimum          30
First Quartile   35.6
Median           37
Third Quartile   38.4
Maximum          44.9

Boxplot of the EPA data (horizontal axis: miles per gallon, from 20 to 50)

Figure 3.5 on p. 159 in LSS illustrates how the boxplot relates to different shaped
distributions. Examining the EPA boxplot, we can observe that the data set is fairly
symmetrically distributed, with a slight tendency towards right skewness.
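
Outside Excel/PHStat, the five-number summary and a boxplot can be obtained with a few lines of Python (a sketch only; numpy and matplotlib are assumed, and quartile conventions differ slightly between packages, so the quartiles may not match PHStat to the last decimal):

    import numpy as np
    import matplotlib.pyplot as plt

    # Illustrative subset of the EPA data; use all 100 values
    # to reproduce the five-number summary above.
    data = [36.3, 32.7, 40.5, 36.2, 38.5, 36.3, 41.0, 37.0, 37.1, 39.9]

    q1, median, q3 = np.percentile(data, [25, 50, 75])
    print("Minimum        ", min(data))
    print("First Quartile ", q1)
    print("Median         ", median)
    print("Third Quartile ", q3)
    print("Maximum        ", max(data))

    plt.boxplot(data, vert=False)   # horizontal boxplot, as in the PHStat figure
    plt.xlabel("Miles per gallon")
    plt.show()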

You should now read LSS, pp. 154-159.

1.4 A Final Comment

All the information in Chapters 1 to 3 of LSS is interesting and potentially useful.


However, for examination purposes, only the items covered in these notes, or referred to
for reading, are examinable. Please do not take this as encouragement not to read widely.
