
ADDITIONAL MATERIALS DOCUMENT FOR STATISTICS 30001 COURSE

Bocconi University, 2017/18

Index

1. Introduction
2. Frequency distributions for categorical and numerical variables
3. The frequency density
4. Analysis of contingency tables
5. Simpson's paradox: gender discrimination in admissions to postgraduate courses
6. The shape of the distribution
7. Calculation of the median from grouped data

1. Introduction

The purpose of this short note is to complement the descriptive statistics content of the course
textbook 1. The textbook will be referred to in the following paragraphs, highlighting the
essential steps needed for a more comprehensive understanding of the topics.

In Section 2 we will cover some aspects related to the construction of frequency distributions and
to the cumulative frequency function; Section 3 is devoted to the graphical representation of
quantitative variables using histograms and ogives, especially when classes have different width;
Section 4, by exploring, through contingency tables, the relationship between two variables,
illustrates the concept of sub-populations; Section 5 shows an example of a misinterpretation of
conditional distributions (Simpson's paradox). Finally, in Section 6, we discuss the shapes of
distributions and their graphical representation through boxplots.

1P. Newbold, W.L.Carlson, B. Thorne (2013), Statistics for Business and Economics, Pearson - Prentice Hall, Global
edition, 8th ed. References will be to that textbook.

2. Frequency distributions for categorical and numerical variables 2

Data can be summarized using frequency distributions. A frequency distribution is a table that is
used to organize raw data. The following table is an example of a frequency distribution:

Category      Absolute frequencies
Insurance      1
Automotive     5
Banking        5
Commercial     1
Financial      1
Petroleum      7

In general, looking at categorical data (nominal and ordinal) or at discrete quantitative data, we
note that values are repeated several times, contrary to what happens with continuous quantitative
data. In the first three cases it is therefore natural to associate with each category or value the
number of times it appears in the data; this number is called the category's (or value's) absolute
frequency. The collection of the categories and their frequencies is called the frequency
distribution.

More precisely, consider N observed values 3 of a categorical or discrete quantitative variable X.


The table that collects the k categories xi of a variable X and the associated frequencies fi
(i = 1,…,k) is called its frequency distribution. The symbols x1, x2, ..., xk (k ≤ N) indicate the
different values 4, so the frequency distribution can be written as:

Category   Absolute frequencies (fi)
x1         f1
...        ...
xk         fk

where fi indicates the number of times (absolute frequency) that the value xi appears. Clearly, the
constraint

f1 + f2 + ... + fk = N

2 See paragraph 1.3 and 1.5 of Newbold et al. (2013).


3 The textbook distinguishes between “N” (number of observations in a population) and “n” (number of observations
in a sample). In descriptive statistics that distinction is irrelevant and so we only use N.
4 For discrete quantitative variables the values xi are placed in increasing order, i.e. xi < xi+1, i = 1, 2, ..., k−1.

must hold. We define the relative frequency of category i, denoted by pi, as the ratio

pi = fi / N,   with   p1 + p2 + ... + pk = 1.

The relative frequency represents the proportion of the total frequency associated with a specific
category; it makes it easy to compare the weight that different categories or values of the variable
have in the data, or to compare the distributions of populations of different sizes 5.

The frequency distribution can also appear in its relative frequencies form:

Category   Relative frequencies (pi)
x1         p1
...        ...
xk         pk

Relative frequencies often appear as percentages (i.e., after multiplication by 100).
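As an illustration, a frequency distribution can be built in a few lines of Python. The raw list of industries below is a hypothetical dataset, chosen so that its counts match the example table above:

```python
from collections import Counter

# Hypothetical raw data: industry sectors of N = 20 companies,
# consistent with the counts of the example table
data = ["Banking", "Automotive", "Banking", "Petroleum", "Insurance",
        "Petroleum", "Automotive", "Banking", "Petroleum", "Petroleum",
        "Automotive", "Commercial", "Petroleum", "Automotive", "Banking",
        "Petroleum", "Financial", "Automotive", "Banking", "Petroleum"]

N = len(data)
absolute = Counter(data)                             # f_i: absolute frequencies
relative = {x: f / N for x, f in absolute.items()}   # p_i = f_i / N

assert sum(absolute.values()) == N                   # the f_i must sum to N
assert abs(sum(relative.values()) - 1) < 1e-9        # the p_i must sum to 1
```

The two closing assertions simply check the constraints stated above for absolute and relative frequencies.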

Regarding graphical representations 6, in case of an ordinal (categorical) variable, bar chart is used
more frequently than the pie chart.

For quantitative variables 7 it is also possible to introduce an additional function, called cumulative
frequency distribution.

By definition, this function is the set of all ordered pairs [x, F(x)], where x is any real value and
F(x) expresses the relative frequency with which values that are less than or equal to x are
observed, or

F(x)=Fr{X ≤ x},

where "Fr" indicates "relative frequency" 8.

5 Refer to Section 3 of this note to explore the concept of a comparison between distributions in sub-populations by
calculation of conditional frequency distributions.
6 Refer to Section 1.3 of Newbold et al. (2013).
7 Even for ordinal qualitative variables, although the representation is used relatively seldom.
8 Refer to the book E. Castagnoli, M. Cigola, L. Peccati (2009), Probability. A Brief Introduction, EGEA, par. 4.2,
for an accurate definition of the cumulative distribution function and of its analytical properties. From the formal
point of view, there are no differences between the cumulative distribution function of a random variable and that of
a statistical variable.

In the case of discrete quantitative variables, it is easy to determine the value of the cumulative
frequency distribution function F corresponding to the category xj:

F(xj) = p1 + p2 + ... + pj   (the sum of the pi for i = 1, ..., j)

For example, consider the following frequency distribution for the discrete quantitative variable
number of family members, observed over 30,000 families:

Category   fi     pi       Cumulative frequencies (Fi)
1          2304   0.0768   0.0768
2          9105   0.3035   0.3803
3          9525   0.3175   0.6978
4          7377   0.2459   0.9437
5          1266   0.0422   0.9859
6          345    0.0115   0.9974
7          78     0.0026   1.0000

To calculate the cumulative frequency function corresponding to category 3 of the variable
number of family members, one simply adds up the relative frequencies of all the categories less
than or equal to 3:

F(x3) = F(3) = p1 + p2 + p3 = 0.0768 + 0.3035 + 0.3175 = 0.6978.

Between consecutive values taken by the variable with positive frequency, the cumulative
function is constant, as no frequency falls between such values.

Therefore, for this example we can write the following analytical expression for F(x):

F(x) = 0        for x < 1
F(x) = 0.0768   for 1 ≤ x < 2
F(x) = 0.3803   for 2 ≤ x < 3
F(x) = 0.6978   for 3 ≤ x < 4
F(x) = 0.9437   for 4 ≤ x < 5
F(x) = 0.9859   for 5 ≤ x < 6
F(x) = 0.9974   for 6 ≤ x < 7
F(x) = 1        for x ≥ 7

Since the frequencies are concentrated on specific values, one has

Fr{X ≤ xi} ≠ Fr{X < xi} whenever pi > 0.

From its analytical expression, we can build the graphical representation of the cumulative
frequency function for discrete quantitative variables:

Figure 2.1. Example of the cumulative distribution function of a discrete quantitative variable
(step function; horizontal axis: number of family members, vertical axis: cumulative frequency).

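The step function of Figure 2.1 can also be evaluated numerically. The following Python sketch builds F from the frequency table above (`cdf` is a hypothetical helper name, not from the textbook):

```python
from itertools import accumulate
import bisect

# Frequency distribution of "number of family members" from the example above
values = [1, 2, 3, 4, 5, 6, 7]
p = [0.0768, 0.3035, 0.3175, 0.2459, 0.0422, 0.0115, 0.0026]

F = list(accumulate(p))   # cumulative relative frequencies F(x_i)

def cdf(x):
    """F(x) = Fr{X <= x}: a step function, constant between observed values."""
    j = bisect.bisect_right(values, x)   # number of observed values <= x
    return F[j - 1] if j > 0 else 0.0
```

For instance, `cdf(3)` reproduces the value 0.6978 computed above, and `cdf(3.5) == cdf(3)` because the function is constant between consecutive observed values.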
In the case of a quantitative variable with a large number of categories (i.e. distinct values), a
frequency distribution based on interval classes has to be built, and the cumulative frequency
distribution is graphically represented by the ogive, or cumulative frequency curve, as presented
in Section 1.5 of the textbook (and further explained in the next section).

In the case of nominal (categorical) variables, cumulative frequencies generally have no meaning,
because the categories of the variable have no intrinsic order. However, as shown in the
textbook 9, the process that leads to the construction of the Pareto diagram includes the
computation of a cumulative percentage. This percentage is nothing but the cumulative frequency
obtained after sorting the categories in descending order of their absolute frequency (not by the
values of the variable!).

9 See paragraph 1.3 of Newbold et al. (2013).

3. The frequency density

As shown in the textbook, in order to graphically represent quantitative variables with a large
number of different values, it is necessary to build the distribution of frequencies after collecting
the values into interval classes 10.
For continuous quantitative variables, and for discrete quantitative variables whose very large
number of distinct values makes them similar in nature to continuous ones, representing the data
for every distinct value is not useful. It is instead appropriate to group the data into intervals
(classes).

Intervals (classes) can have different width: in this case using the graphical representation
described in the textbook can lead to errors (misleading histograms 11).

How should the width of the intervals be determined? In the majority of cases, and in the absence
of specific indications provided by the aims of the analysis, one typically chooses intervals of
constant width w (following the procedure illustrated in the textbook), or intervals with (at least
approximately) equal frequency. The choice of the size of the intervals is largely arbitrary, and
different choices generate distributions that may be quite different from each other. Grouping into
classes of small width w provides greater detail but reduces the degree of summarization.

As an example, consider the variable head of the household’s age, recoded into 6 and 3 classes
(Table 3.1 and 3.2, respectively) of equal widths. We also show the effect of recoding the data
into 3 classes of different sizes but approximately equal frequency (Table 3.3). The distribution in
Table 3.1 preserves more detailed information on the variable, while the choice between
representations of Table 3.2 and Table 3.3 should be based on the aims of the analysis. Although
the choice of classes with equal size tends to be more intuitive, the distribution with classes that
have almost the same frequency highlights which values are the most frequent (in this case, the
class [49 ; 62 ) is the one with the smallest class size) and may help improve the readability of
subsequent analyses. Constructing intervals with a constant width may not be useful when there
are extreme classes with respect to the majority of observed cases: by adopting classes of equal
width one may obtain several intervals with zero frequencies.

10 See paragraph 1.5 of Newbold et al. (2013).


11 See paragraph 1.6 of Newbold et al. (2013).

Table 3.1. Variable head of the household's age in 6 classes of equal width

Lower bound   Upper bound   fi     pi
23.00         33.67          68     8.71%
33.67         44.33         137    17.54%
44.33         55.00         162    20.74%
55.00         65.67         257    32.91%
65.67         76.33         125    16.00%
76.33         87.00          32     4.10%

Table 3.2. Variable head of the household's age in 3 classes of equal width

Lower bound   Upper bound   fi     pi
23.00         44.33         205    26.25%
44.33         65.67         419    53.65%
65.67         87.00         157    20.10%

Table 3.3. Variable head of the household's age recoded into 3 classes, with frequencies that are
almost equal

Lower bound   Upper bound   fi     pi
23.00         49.00         251    32.14%
49.00         62.00         262    33.55%
62.00         87.00         268    34.31%

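The two recoding strategies compared above can be sketched in Python. The ages below are randomly generated for illustration only (they are not the actual survey data behind Tables 3.1-3.3):

```python
import random

# Hypothetical ages: 781 values in [23, 87], drawn from a skewed
# triangular distribution centred near 55 (illustration only)
random.seed(1)
ages = [random.triangular(23, 87, 55) for _ in range(781)]

lo, hi, k = 23.0, 87.0, 3

# Strategy 1 - equal width: cut points at constant distance w = (hi - lo) / k
w = (hi - lo) / k
equal_width_bounds = [lo + i * w for i in range(k + 1)]

# Strategy 2 - (approximately) equal frequency: cut points at empirical quantiles
s = sorted(ages)
equal_freq_bounds = [s[0]] + [s[len(s) * i // k] for i in range(1, k)] + [s[-1]]

# Class counts under the equal-frequency recoding
# (the single maximum value is excluded by the strict upper bound)
counts = [sum(1 for a in ages
              if equal_freq_bounds[i] <= a < equal_freq_bounds[i + 1])
          for i in range(k)]
```

With k = 3, the equal-width cut points land at 44.33 and 65.67 (as in Table 3.2), while the quantile-based cut points produce three classes of nearly identical frequency.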
The graphical representation of the distribution of continuous variables classified into intervals is
the histogram. For its construction, the intervals are reported on the horizontal axis; over each
interval, a rectangle is drawn with an area proportional to the relative frequency of that class.
Generally, the height of each rectangle is determined by dividing the relative frequency by the
width of the interval, wi = xi+1 − xi; in this way the area of each rectangle is equal to the relative
frequency of the corresponding interval class. The height of each rectangle is called the frequency
density (ci), which may be interpreted as the amount of frequency per unit interval.

Below (Tables 3.4 and 3.5) we show the calculation of the frequency density function for the
variables presented in Table 3.2 and in Table 3.3. Round parentheses indicate that the bound of
the interval is not included in the class; square brackets indicate that the bound of the interval is
included in the class.

Table 3.4. Frequency density of the variable head of the household's age recoded into 3 classes of
equal width

Class of measurement   pi       wi      ci = pi/wi   Fi
[23; 44.33)            0.2625   21.33   0.0123       0.2625
[44.33; 65.67)         0.5365   21.33   0.0251       0.7990
[65.67; 87]            0.2010   21.33   0.0094       1.0000

Table 3.5. Frequency density of the variable head of the household's age recoded into 3 classes with
approximately equal frequencies

Class of measurement   pi       wi   ci = pi/wi   Fi
[23; 49)               0.3214   26   0.0124       0.3214
[49; 62)               0.3355   13   0.0258       0.6569
[62; 87]               0.3431   25   0.0137       1.0000

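The densities in Tables 3.4 and 3.5 follow directly from ci = pi/wi. For instance, for the equal-frequency classes of Table 3.5:

```python
# Frequency densities c_i = p_i / w_i for the classes of Table 3.5
bounds = [23, 49, 62, 87]        # class bounds: [23; 49), [49; 62), [62; 87]
p = [0.3214, 0.3355, 0.3431]     # relative frequencies

w = [bounds[i + 1] - bounds[i] for i in range(len(p))]   # widths: 26, 13, 25
c = [round(pi / wi, 4) for pi, wi in zip(p, w)]          # densities

# The narrowest class, [49; 62), has the highest density
assert w[1] == min(w) and c[1] == max(c)
```

The final assertion checks the point made in the text: the class with the smallest width is the one with the highest frequency density.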
The histograms corresponding to the two frequency distributions are shown in Figure 3.1 and in
Figure 3.2, respectively. For histograms constructed from classes of different width, the vertical
axis must indicate the frequency densities. If all classes have the same width, the vertical axis
may indicate either the frequency densities (as illustrated in Figure 3.1) or the relative or absolute
frequencies (as shown in the textbook): in the latter case, in fact, the proportionality between the
areas and the relative frequencies of the classes is maintained.

Since Figure 3.2 represents a histogram of classes with approximately equal frequencies, the area
of each rectangle (equal to the relative frequency) is approximately the same; the central class,
which is the narrowest class, has the highest frequency density.

Histograms can also be produced using the Excel macros available on the e-learning platform.

Figure 3.1. Histogram of the variable head of the household's age classified into 3 classes of equal
width (horizontal axis: head of household's age, vertical axis: density).

Figure 3.2. Histogram of the same variable recoded into 3 classes with roughly equal frequencies
(horizontal axis: head of household's age, vertical axis: density).

The cumulative frequency function, defined in the previous paragraph, can also be constructed
for grouped data; its graph is the cumulative frequency curve, or ogive, represented in Figures 3.3
and 3.4.

Figure 3.3. Ogive of the variable head of the household's age in 3 classes of equal width
(horizontal axis: age, from 0 to 100; vertical axis: cumulative frequency Fi, from 0 to 1).

Figure 3.4. Ogive of the variable recoded into 3 classes having approximately equal frequency
(horizontal axis: age, from 0 to 100; vertical axis: cumulative frequency Fi, from 0 to 1).

As illustrated in the previous graphs, the graph of the cumulative frequency function is formed by
segments connecting the values Fi of the cumulative frequencies reported in the corresponding
tables; the slope of each segment is equal to the frequency density of the corresponding class.

From the cumulative frequency function one may easily reconstruct the relative frequencies.
Given two values a and b, such that a < b:

Fr{a < X ≤ b} = F(b) − F(a). 12
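For example, using the cumulative values of Table 3.4, the relative frequency of the central class can be recovered as F(65.67) − F(44.33):

```python
# Cumulative frequencies at the class bounds of Table 3.4
F = {23.0: 0.0, 44.33: 0.2625, 65.67: 0.7990, 87.0: 1.0}

# Fr{44.33 < X <= 65.67} = F(65.67) - F(44.33)
freq_central = F[65.67] - F[44.33]
```

The result, 0.5365, is exactly the relative frequency of the class [44.33; 65.67) in Table 3.4.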

4. Analysis of contingency tables

Sections 1.3. and 1.5. of the textbook describe how to illustrate data through side-by-side bar
charts, stacked bar charts, or frequency distributions relating to two categorical variables (or
quantitative discrete variables with a small number of values/categories). To study the
relationship between two variables it is useful to start from some concepts related to joint
frequency distributions.

For two categorical variables 13 such joint distributions show how the observations are distributed
among all the combinations of values of the two categorical (ordinal or nominal) variables.
Univariate frequency distributions cannot provide such information. In the following example a
sample of 750 companies is summarized in terms of Geographic location and Company's industry
type in a joint (absolute) frequency distribution:

Table 4.1. Contingency table of the absolute frequencies of Geographic location and Industry.

12 Note that when the quantitative variable is continuous one has Fr{a < X ≤ b} = Fr{a < X < b}, since
Fr{X = b} = 0 for any value b.

13 The same reasoning can be applied to a quantitative variable with a small number of values/categories.

Geographic location \ Industry type   Manufacturing   Services   Research of science & technology   Other
East                                  100             50         50                                 50
North                                 50              95         45                                 60
West                                  65              70         75                                 40

The table, called a contingency table, contains all of the joint absolute frequencies.
Given two categorical variables (or quantitative variables with a small number of
values/categories), then the joint frequency distribution of the absolute frequencies can be written
in general as follows:

X\Y y1 … yc Total
x1 f11 … f1c R1
… … … … …
xr fr1 … frc Rr
Total C1 … Cc N

where r is the number of rows in the table and corresponds to the number of categories of the first
variable; c is the number of columns in the table and corresponds to the number of categories of
the second variable; fij is the absolute joint frequency corresponding to each pair of categories
(xi;yj), with i = 1,…,r and j = 1,…,c.
The last row and the last column show the Totals, i.e. the sums of the absolute frequencies by
column and by row, respectively. These frequencies correspond to the univariate absolute
frequency distribution of each variable, and are called marginal (absolute) frequencies.

From the contingency table it is thus possible to obtain the univariate frequency distributions of X
and Y by adding the joint frequencies by row and by column, respectively. These univariate
distributions, called marginal distributions, are shown in the last column and in the last row of the
contingency table, respectively. The bottom-right cell contains the total number of observations, N.

Back to the example, the following table illustrates the calculation of the marginal distributions of
the two variables:

Table 4.2. Contingency table with the absolute joint frequencies and absolute marginal frequencies
of Geographic location and Industry

Geographic location \ Industry type   Manufacturing   Services   Research of science & technology   Other   Total
East                                  100             50         50                                 50      250
North                                 50              95         45                                 60      250
West                                  65              70         75                                 40      250
Total                                 215             215        170                                150     750

As an alternative, a contingency table may contain the joint relative frequencies.

The joint relative frequencies pij are obtained by dividing every joint absolute frequency fij by the
total number of observations: pij = fij/N. For the example we obtain the following table:

Table 4.3. Contingency table with joint relative frequencies and marginal relative frequencies for
Geographic location and Industry

Geographic location \ Industry type   Manufacturing   Services   Research of science & technology   Other   Total
East                                  0.133           0.067      0.067                              0.067   0.333
North                                 0.067           0.127      0.060                              0.080   0.333
West                                  0.087           0.093      0.100                              0.053   0.333
Total                                 0.287           0.287      0.227                              0.200   1.000

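The joint relative frequencies of Table 4.3 can be computed directly from the absolute frequencies of Table 4.1, as sketched below:

```python
# Absolute joint frequencies of Table 4.1 (rows: East, North, West;
# columns: Manufacturing, Services, Research of science & technology, Other)
f = [[100, 50, 50, 50],
     [50, 95, 45, 60],
     [65, 70, 75, 40]]

N = sum(map(sum, f))                         # total number of observations
p = [[fij / N for fij in row] for row in f]  # p_ij = f_ij / N
```

Rounding to three decimals reproduces the entries of Table 4.3 (e.g. p[0][0] = 100/750 ≈ 0.133 for East & Manufacturing).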
The joint (absolute or relative) frequencies allow one to obtain some information about the
relationship that exists between the two variables. We note that the percentage of companies that
are both in the manufacturing industry and located in the East is 13.3%, the percentage of
companies in the services sector and located in the North is 12.7%, while the percentage of
companies in the Research of science & technology industry and located in the West is 10%.
Comparing these proportions with the others in the table leads us to suspect that the two variables
are "related": it may seem that manufacturing and services companies are more often located in
the East and in the North, while Research of science & technology companies are more often in
the West, since the corresponding absolute (and consequently relative) frequencies are higher
than the others in the table. This conclusion is, however, not accurate, as the marginal frequency
of Research of science & technology is lower than those of Manufacturing and Services. An
additional step is then necessary in order to properly analyze the association between the two
variables.

In particular, we should compare the distribution of companies among the three geographic areas
by taking into account companies of only one sector (e.g. Research of science & technology) and
then compare these conditional frequencies with those obtained in the other sectors. The
conditional frequencies of the other two sectors (Manufacturing and Services) have to be built in
the same way, that is, by considering only the companies present in each sector.

The conditional frequencies in the rows or columns represent the relative frequencies of a variable
(say, for the category xi of X) for a fixed category of the other variable (say, the category yj of Y),
and they are defined as follows 14:

Fr {X = xi | Y = yj} = fij / Cj ,

which we call the conditional frequency of X = xi given Y = yj.

Symmetrically, if we look at the frequency of a category yj of Y for a fixed category xi of X, then
we may define the conditional frequency of Y = yj given X = xi as:

Fr {Y = yj | X = xi} = fij / Ri .

The collection of all the conditional frequencies corresponding to the same fixed conditioning
category (whether xi or yj) is called a conditional frequency distribution.
For example, the conditional frequency distributions of Geographic location given the different
Industries are reported within the columns of the following table:

Table 4.4. Conditional distributions of Geographic location given Industry.

Geographic location \ Industry type   Manufacturing   Services   Research of science & technology   Other   Total
East                                  0.465           0.233      0.294                              0.333   0.333
North                                 0.233           0.442      0.265                              0.400   0.333
West                                  0.302           0.326      0.441                              0.267   0.333
Total                                 1.000           1.000      1.000                              1.000   1.000

Symmetrically, the following are the conditional distributions of Industry given Geographic location.

Table 4.5. Conditional distributions of Industry given Geographic location.

Geographic location \ Industry type   Manufacturing   Services   Research of science & technology   Other   Total
East                                  0.400           0.200      0.200                              0.200   1.000
North                                 0.200           0.380      0.180                              0.240   1.000
West                                  0.260           0.280      0.300                              0.160   1.000
Total                                 0.287           0.287      0.227                              0.200   1.000

14 The conditional frequencies are formally similar to the conditional probabilities (see, e.g., the syllabus of the
Mathematics II course, and refer to the book E. Castagnoli, M. Cigola, L. Peccati (2009), Probability. A Brief
Introduction, EGEA).

Conditional frequencies make it easy to identify possible relationships between the two variables
by dividing the entire population into subpopulations identified by the categories of one of the
two variables. By using absolute or relative frequencies instead, the relationship may be less
explicit, or one could even be led to wrong conclusions.

For example, by looking at the distributions available in the columns of Table 4.4, it is evident
that, compared to the marginal of Geographic location (uniform distribution between areas),
manufacturing companies are located more often in the East, while services companies are
mostly in the North, and Research of science & technology are mostly located in the West. In
particular, among all the companies in Research of science & technology industry, more than
44% are located in the West. This information is not immediately clear in Tables 4.1 and 4.2
because the Research of science & technology industry is a smaller sector than manufacturing and
services (170 companies, against 215 for the other two industry types), and therefore the joint
absolute (75) and relative (0.10) frequencies of Research of science & technology companies
located in the West do not emerge when compared to the data in the other columns. Instead, by
building the conditional frequencies and only considering the 170 companies in the Research of
science & technology sector, that information becomes evident.

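The conditional distributions of Tables 4.4 and 4.5 are obtained from the contingency table by dividing each joint frequency by the appropriate column total Cj or row total Ri; a minimal Python sketch:

```python
# Contingency table of Table 4.2 (rows: East, North, West;
# columns: Manufacturing, Services, Research of science & technology, Other)
f = [[100, 50, 50, 50],
     [50, 95, 45, 60],
     [65, 70, 75, 40]]

R = [sum(row) for row in f]                        # row totals R_i
C = [sum(row[j] for row in f) for j in range(4)]   # column totals C_j

# Location given Industry (columns of Table 4.4): f_ij / C_j
loc_given_ind = [[f[i][j] / C[j] for j in range(4)] for i in range(3)]
# Industry given Location (rows of Table 4.5): f_ij / R_i
ind_given_loc = [[f[i][j] / R[i] for j in range(4)] for i in range(3)]
```

For instance, loc_given_ind[2][2] = 75/170 ≈ 0.441 is the share of Research of science & technology companies located in the West, exactly the figure highlighted above.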
It is important to emphasize that although conditioning by row or by column (which leads to
different conditional distributions) is typically very useful, we should NOT compare the
percentages across the two resulting tables.

From the tables obtained above we may graphically represent the conditional frequency
distributions to highlight the relationship between the two variables, using side-by-side bar charts
or stacked bar charts, as shown below:

Figure 4.1. Side-by-side bar chart of the conditional distributions of Industry given Geographic
location (one group of bars per area: East, North, West; series: Manufacturing, Services, Research
of science & technology, Other).

Figure 4.2. Stacked bar chart of the conditional distributions of Industry given Geographic location
(one bar per area: East, North, West; segments: Manufacturing, Services, Research of science &
technology, Other).

Figure 4.3. Side-by-side bar chart of the conditional distributions of Geographic location given
Industry (one group of bars per industry; series: East, North, West).

Figure 4.4. Stacked bar chart of the conditional distributions of Geographic location given Industry
(one bar per industry; segments: East, North, West).

The different graphs have different purposes: if one is interested in highlighting the distribution of
the industries within the geographic areas, then Figures 4.1 and 4.2 should be used; if, on the
other hand, the goal is to study the distribution across the areas of each of the industries, then
Figures 4.3 and 4.4 should be examined.

One type of construction may be easier to interpret than the other depending on the type of
variables considered and on the objectives of the analysis. For example, if one wants to assess the
effectiveness (X) of a drug, with Y indicating the dose being administered, then the conditional
distributions of X|Y should probably be the ones to use, as X reasonably depends on Y and not
the other way around.

Note that the use of side-by-side and stacked bar charts with absolute frequencies shown on the
vertical axis, as suggested by the textbook 15, often leads to misinterpretation of the relationship
between the two variables, and it is therefore not recommended.

The side-by-side and stacked bar charts constructed from the conditional frequencies allow one to
gain useful information about the presence or absence of a relationship between variables.
Generally, when the conditional frequencies are (roughly) similar among themselves (and similar
to the corresponding marginal distribution) one talks about absence of association (or absence of
a relationship) between the two variables 16.

For example, below we show the results of a study performed on members of a club classified
into three age groups (up to 29 years, 30 to 49 years, 50 years or older). The club members were
asked which meeting time they preferred among three options: dinner, pre-dinner drinks, and
lunch. Tables 4.6 and 4.7 and Figure 4.5 represent the distribution of the responses, and they
reveal that the age of the club member and the preference expressed appear to be independent: in
all age groups there are similar (conditional) percentages of people who prefer dinner, pre-dinner
drinks, and lunch. In particular, dinner is the meeting time preferred by members of all three age
groups, while lunch had a very small percentage of preferences (around 10%).

Table 4.6. Joint absolute frequencies of Age Group and Preference

Age Group \ Preference   Dinner   Pre-Dinner Drinks   Lunch   Total
Up to 29                  31       24                   7      62
From 30 to 49             75       60                  16     151
50 or older              102       83                  25     210
Total                    208      167                  48     423

15 See paragraph 1.3 of Newbold et al. (2013).


16 Paragraph 14.3 of Newbold et al. (2013) discusses the inferential counterpart of this question: testing the absence
of association in a population through the Chi-squared test of statistical independence.

Table 4.7. Conditional frequencies of Preference given Age Group

Age Group \ Preference   Dinner   Pre-Dinner Drinks   Lunch   Total
Up to 29                 50.0%    38.7%               11.3%   100.0%
From 30 to 49            49.7%    39.7%               10.6%   100.0%
50 or older              48.6%    39.5%               11.9%   100.0%
Total                    49.2%    39.5%               11.3%   100.0%

Observe that from Table 4.6 the absence of association is not immediately evident: one needs to
construct the conditional frequency distributions of Table 4.7 to verify the proportionality of the
absolute frequencies of one variable among the categories of the other.

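A quick numerical check of the near-independence seen in Table 4.7 is to compare each conditional distribution with the marginal one, as sketched below:

```python
# Joint absolute frequencies of Table 4.6 (rows: Up to 29, From 30 to 49,
# 50 or older; columns: Dinner, Pre-Dinner Drinks, Lunch)
f = [[31, 24, 7],
     [75, 60, 16],
     [102, 83, 25]]

N = sum(map(sum, f))
marginal = [sum(row[j] for row in f) / N for j in range(3)]  # marginal of Preference

# Largest gap between any conditional frequency (Preference given Age Group)
# and the corresponding marginal frequency
max_gap = max(abs(x / sum(row) - m)
              for row in f for x, m in zip(row, marginal))
```

Here every conditional frequency stays within about one percentage point of the marginal, which is what "(approximate) absence of association" means in practice.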
Figure 4.5. Side-by-side bar chart comparing Age Group and meeting-time preferences (the three
conditional distributions are nearly identical: about 50% Dinner, 39% Pre-Dinner Drinks, and 11%
Lunch in every age group).

In the next paragraph we will show an important example of a potential misinterpretation of the
conditional distributions vs. the marginal distributions, due to lack of homogeneity of the sub-
populations being compared (Simpson's Paradox).

5. Simpson's paradox: gender discrimination in admissions to postgraduate courses 17

A field study on gender discrimination in admissions to postgraduate courses was carried out at
the Graduate Division of the University of California in Berkeley. During the study, 8442 men
and 4321 women applied for a postgraduate course. Approximately 44% of the male applicants
gained admission, against 35% of the female applicants. Working with percentages allows us to
disregard the difference in the number of applications submitted by the two groups: in essence,

17 From the textbook Freedman, Pisani, and Purves (1998), Statistica, McGraw-Hill.

out of 100 applications for men, 44 were accepted, and out of 100 applications for women, only
35 were accepted.

Assuming that the men and women who applied were equally qualified (and there is no reason to
believe otherwise), the difference between the admission percentages of men and women appears
to be strong empirical evidence that men and women received different treatment at the time of
admission: the university seems to have a preference for male applicants.

Each department of the university had handled its applications independently, according to its
field of specialization. By distinguishing between the different departments, the university
thought it could determine which of them might be involved in discrimination. At this point an
apparent paradox appeared: when analyzing the single departments there seemed to be no
discrimination against women; overall, if anything, men were possibly at a disadvantage. What
was happening? Let us look at the data.

Attention focused only on the six departments which received more than a third of the
applications, and whose behavior could be regarded as being typical of the whole university.
Table 5.1 shows the number of applications for men and women and the percentages of admitted
applicants, separately for each type of specialization.
For each type of specialization, the percentage of female admitted applicants is about equal to that
of male applicants. The only exception is Department A, which appears to discriminate against
men: in that department 82% of the women had been admitted against only 62% of the men.

The department that seems to discriminate the most against women is Department E, where 28%
of the men gained admission versus 24% of the women: a difference of only 4 percentage points.
However, when you consider all six branches of specializations together, we see that 44% of the
men gained admission, against 30% of the women: a difference of 14 percentage points!

Table 5.1. Data on admissions to post-graduate courses in the six main departments (branches of
specialization) of the University of California, Berkeley. 18

                 Men                            Women
Department 19    Number of       % admitted    Number of       % admitted
                 applications                  applications
A                825             62%           108             82%
B                560             63%            25             68%
C                325             37%           593             34%
D                417             33%           375             35%
E                191             28%           393             24%
F                373              6%           341              7%

18 Source: The Graduate Division, University of California, Berkeley.


19 At the request of the university, the names of the departments are not reported.

Here is the explanation of the paradox:

• It was easier to get accepted into the first two departments, and more than half of the men
had applied to those departments;
• It was harder to get accepted into the other four departments, and more than 90% of the
women had applied to those departments.

The men had applied to departments that were easier to get into, while the women had chosen the
most difficult ones. There was therefore a bias due to the choice of department, and that bias was
confounded with a possible gender bias.

The lesson is the following: the relationships between the percentages of subgroups (such as
admission rates for men and women in the various departments) can be distorted when the
subgroups are merged. This effect is known as Simpson's Paradox.

Because of this phenomenon, when analyzing grouped data one risks misinterpretation if other
variables relevant to the analysis are not taken into account. However, it is not Statistics that
suggests which other variables should be considered: that is the job of the researcher, who must
have a clear understanding of the phenomena being studied.
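The aggregation effect behind the paradox can be verified numerically. The following sketch (in Python, which is not used in the course itself; the application counts and admission rates are transcribed from Table 5.1) recomputes the overall admission rates as weighted averages of the departmental rates:

```python
# Applications and admission rates by department, transcribed from Table 5.1.
men = {"A": (825, 0.62), "B": (560, 0.63), "C": (325, 0.37),
       "D": (417, 0.33), "E": (191, 0.28), "F": (373, 0.06)}
women = {"A": (108, 0.82), "B": (25, 0.68), "C": (593, 0.34),
         "D": (375, 0.35), "E": (393, 0.24), "F": (341, 0.07)}

def overall_rate(groups):
    """Aggregate admission rate: total admitted applicants over total applicants."""
    total = sum(n for n, _ in groups.values())
    admitted = sum(n * rate for n, rate in groups.values())
    return admitted / total

print(f"Men:   {overall_rate(men):.1%}")    # about 44%, as in the text
print(f"Women: {overall_rate(women):.1%}")  # about 30%, as in the text
```

The per-department rates differ by only a few percentage points, yet the aggregate rates differ by about 14 points, because each aggregate rate weights the departmental rates by that group's application counts.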

6. The shape of the distribution 20

A characteristic of a set of quantitative data is the shape of their distribution. In particular, such
shape may be defined as being symmetric or asymmetric.

To gain a first indication about the shape of a distribution, we should look at the mean and
median. If a distribution is roughly symmetric, then these two measures of location are similar;
otherwise, they differ. If the mean exceeds the median, the data can be described as being skewed
to the right, while if the median exceeds the mean the distribution is called skewed to the left.
Note that this relationship is a necessary condition, but not a sufficient condition to identify the
shape of the distribution: things may change if for example the distribution is bimodal, or if there
are outliers that drive the value of the mean.
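A minimal numerical illustration of the comparison between mean and median (the data are invented for the example): a single large value pulls the mean well above the median, signalling right skewness.

```python
from statistics import mean, median

data = [1, 2, 3, 4, 100]         # one extreme value in the right tail
print(mean(data), median(data))  # mean 22 far exceeds median 3
```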

Another indication of the shape of the distribution comes from examining the histogram, i.e. the
graphical representation of a quantitative variable after it has been recoded into
classes 21. However, the histogram can also lead to wrong conclusions, since the construction of

20 See paragraph 2.1. P. Newbold et al.(2013)


21 See paragraph 1.5. P. Newbold et al.(2013)

the classes is not unique, so that two graphical representations obtained from the same set of data
can suggest different "shapes."

To identify and describe the main features of the data it is appropriate to use measures that
possess the properties of robustness, i.e. measures that are relatively insensitive to extreme values
or to marginal changes in the data. As we have seen, the quartiles have that property. Recall that a
distribution can be summarized by the five summary statistics minimum (min), first quartile (Q1),
median (Me), third quartile (Q3), and maximum (max).

The data distribution is symmetric if (roughly):


1) Median = Mean
2) Me – Q1 = Q3 – Me
3) Q1 – Min = Max – Q3

The distribution of the data is positively asymmetric (skewed to the right) if:
1) Median < Mean
2) Me – Q1 < Q3 – Me
3) Q1 – Min < Max – Q3

The distribution of the data is negatively asymmetric (skewed to the left) if:
1) Median > Mean
2) Me – Q1 > Q3 – Me
3) Q1 – Min > Max – Q3
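The three comparisons can be checked directly from the five-number summary. A sketch in Python (the example data are invented; the quartiles are computed with a standard interpolation method, which may differ slightly from the textbook's rule):

```python
import statistics

def five_number_summary(data):
    """Min, Q1, median, Q3 and Max of a data set."""
    xs = sorted(data)
    q1, me, q3 = statistics.quantiles(xs, n=4)  # quartiles by interpolation
    return xs[0], q1, me, q3, xs[-1]

# Right-skewed example data: the right tail is longer than the left one.
data = [2, 3, 3, 4, 4, 5, 6, 8, 12, 20]
mn, q1, me, q3, mx = five_number_summary(data)
print(me - q1, q3 - me)  # Me - Q1 < Q3 - Me suggests right skewness
print(q1 - mn, mx - q3)  # Q1 - Min < Max - Q3 confirms the longer right tail
```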

A graphical representation of these statistical summaries is provided by the box plot (also called
"box and whiskers plot").

With the aid of Excel we can easily produce a graph where the "box" shows the interquartile
range, i.e. the interval between the first and third quartiles; more precisely, the upper limit of the
box is Q3 and the lower limit is Q1. The symbol inside the box marks the median; the "whiskers"
show the distance between the box and the minimum and maximum values observed in the
data 22. Values within the distribution are then defined to be extreme data (or outliers, or atypical
values) if they are "too far removed" from the box that contains the middle 50% of the
observations. More precisely, one may define extreme observations as those falling more than one
and a half times the interquartile range away from the "box."

A single observation xi can therefore be defined to be an outlier if either one of the following two
conditions applies to it:

22 In some cases (using some statistical software, or also with the Excel macro Stat30001), the whiskers are drawn
from the box to the farthest observations that fall within 1.5 times the interquartile range from the box itself. This
allows any outliers in the data to be highlighted.

• xi < Q1 – 1.5 (Q3 – Q1)
• xi > Q3 + 1.5 (Q3 – Q1)
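The two conditions can be wrapped in a short helper (a sketch in Python; the fences follow the 1.5 times interquartile range rule stated above, and the example quartiles are hypothetical):

```python
def outlier_fences(q1, q3):
    """Lower and upper thresholds beyond which observations count as outliers."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def outliers(data, q1, q3):
    """Observations falling outside the 1.5 * IQR fences."""
    low, high = outlier_fences(q1, q3)
    return [x for x in data if x < low or x > high]

# Hypothetical example: Q1 = 10 and Q3 = 20 give fences at -5 and 35.
print(outlier_fences(10, 20))             # (-5.0, 35.0)
print(outliers([1, 12, 18, 40], 10, 20))  # [40]
```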
As an example, the box plot of the variable Age of the Head of Household (Figure 6.1) indicates
that the variable has a nearly symmetric distribution: note the centrality of the median relative
to the first and third quartiles, and the fact that the lengths of the two "whiskers" are similar. At
the same time, there is no evidence of extreme observations in the data.

The case shown in Figure 6.2 is different. The boxplot summarizes the distribution of the amounts
indicated on a collection of receipts issued by some businesses; the distribution is skewed to the
right, and some of the values above 230000 Liras can be classified as outliers.

Figure 6.1. Example of boxplot without outliers.

[Vertical boxplot of Head of the household's age, with Min, Q1, Median, Q3 and Max marked; no extreme data.]

Figure 6.2. Boxplot with outliers.

[Vertical boxplot of Receipts Amount, with Min, Q1, Median, Q3, Max and extreme data marked.]

Note: The Excel macros Stat30001 and PhStat allow you to create boxplot graphs. The macro Stat30001, which
creates a vertical box plot like those in the figures, also identifies the threshold values for the determination of
outliers and counts them. The Excel macro PhStat can create horizontal boxplots as described in the Excel
appendices; while it does not identify outliers, it allows one to compare several boxplots, and thus to analyze the
distribution of the data within subpopulations.

7. Calculation of the median from grouped data

The textbook shows how to compute the median from raw data 23.
When the data are grouped in an absolute frequency distribution the calculation of the median is
conceptually the same, but it is based on the information available from the frequency
distribution rather than on the individual observations.

7.1 Grouped discrete data

For discrete data summarized in a frequency distribution table, essentially nothing changes
compared to working with raw data, since the sorted raw data can be reconstructed exactly
from the frequency table. Consider for example the variable Family Size in Table 7.1.
In particular, since there are 30000 observations the median is defined as usual, i.e. as the average
of the observations in positions 15000 and 15001. Since both are equal to 3, that number is
the median. Alternatively, one could work with cumulative relative frequencies, without
needing the overall sample size. In that case the median can be defined equivalently as the
smallest value xi such that Fi exceeds 0.5. If Fi = 0.5, then all values in the interval [xi ; xi+1] can
be considered median values, and one uses the average of xi and xi+1. Note that this is
equivalent to the procedure used on the raw data.
In Table 7.1 it is easy to see that the median is indeed equal to 3, since the corresponding
cumulative relative frequency of 3 is equal to 69.78% while the cumulative relative frequency of
2 is only 38.03%.

Table 7.1. Frequency distribution of variable Family Size

Size     fi      pi       Cumulative frequencies Fi
1        2304    0.0768   0.0768
2        9105    0.3035   0.3803
3        9525    0.3175   0.6978
4        7377    0.2459   0.9437
5        1266    0.0422   0.9859
6        345     0.0115   0.9974
7        78      0.0026   1
Total    30000   1

23 See par. 3.1 of P. Newbold, W.L.Carlson, B. Thorne (2013), Statistics for Business and Economics, Pearson -
Prentice Hall, Global edition, 8th ed.
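The rule "smallest value xi such that Fi exceeds 0.5" can be applied mechanically. A sketch in Python, with the frequencies of Table 7.1 transcribed into (value, absolute frequency) pairs:

```python
# (value, absolute frequency) pairs from Table 7.1, already sorted by value.
freq = [(1, 2304), (2, 9105), (3, 9525), (4, 7377),
        (5, 1266), (6, 345), (7, 78)]
n = sum(f for _, f in freq)

def grouped_median(freq, n):
    """Median from a discrete frequency table via cumulative relative frequencies."""
    cum = 0
    for i, (value, f) in enumerate(freq):
        cum += f
        F = cum / n
        if F > 0.5:               # first value whose cumulative frequency exceeds 0.5
            return value
        if F == 0.5:              # boundary case: average with the next value
            return (value + freq[i + 1][0]) / 2

print(grouped_median(freq, n))  # 3, matching the text
```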

7.2 Grouped continuous data

When using continuous data grouped in interval classes, the calculation of the median from the
frequency distribution table will in general not coincide with the median as obtained from the raw
data. This is similar to what we have seen for the mean.
In particular, the median Me is defined as the value for which the cumulative relative frequency is
equal to 0.5, i.e. such that Fr{ X ≤ Me } = 0.5.
To obtain the median one first identifies the median class, i.e. the interval class at the end of
which the cumulative distribution function reaches or exceeds the value 0.5 for the first time.
One then equates the analytical expression of the cumulative distribution function within the
median class (which assumes a uniform distribution within the class) to 0.5, or equivalently
sets the area under the density histogram to the left of Me equal to 0.5. This is quite simple,
and is shown below.
Table 7.2 shows the frequency distribution of Age of Head of Household after recoding into three
classes of equal width.

Table 7.2. Frequency distribution of variable Age of Head of Household

Class pi wi ci Fi
[23 ; 44.33) 0.2625 21.33 0.0123 0.2625
[44.33 ; 65.67) 0.5365 21.34 0.0251 0.7990
[65.67 ; 87] 0.2010 21.33 0.0094 1

The distribution is represented graphically by the density histogram in Figure 7.1. Setting the area
to the left of Me equal to 0.5 (see Figure 7.2) allows for the easy calculation of Me itself.

Figure 7.1. Density histogram of variable Age of Head of Household

[Density histogram of the variable, with class limits 23, 44.33, 65.67 and 87 on the horizontal axis and density on the vertical axis.]

Figure 7.2. Calculation of the median for variable Age of Head of Household

[The same density histogram, with the area to the left of Me set equal to 0.5.]

Indeed, it suffices to solve the following equation in Me:

Fi−1 + (Me − xi) ⋅ ci = 0.5,

where xi is the lower limit of the median class, ci its frequency density, and Fi−1 the cumulative
relative frequency up to the previous class. In this case this yields

0.2625 + (Me − 44.33) ⋅ 0.0251 = 0.5.

Hence the median is easily found as Me = 44.33 + (0.5 − 0.2625)/0.0251 = 53.79 (please check).
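The same interpolation can be reproduced numerically. A sketch in Python, using the class limits, relative frequencies pi and densities ci from Table 7.2:

```python
# Classes from Table 7.2: (lower limit, relative frequency p_i, density c_i).
classes = [(23.0, 0.2625, 0.0123),
           (44.33, 0.5365, 0.0251),
           (65.67, 0.2010, 0.0094)]

def grouped_median(classes):
    """Median by linear interpolation within the median class."""
    F_prev = 0.0
    for lower, p, c in classes:
        if F_prev + p >= 0.5:                  # median class found
            # Solve F_{i-1} + (Me - x_i) * c_i = 0.5 for Me.
            return lower + (0.5 - F_prev) / c
        F_prev += p
    raise ValueError("cumulative frequencies never reach 0.5")

print(round(grouped_median(classes), 2))  # 53.79, matching the text
```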
