
Descriptive and Inferential Statistics

When analysing data, such as the marks achieved by 100 students for a piece
of coursework, it is possible to use both descriptive and inferential statistics in
your analysis of their marks. Typically, in most research conducted on groups
of people, you will use both descriptive and inferential statistics to analyse
your results and draw conclusions. So what are descriptive and inferential
statistics? And what are their differences?

Descriptive Statistics
Descriptive statistics is the term given to the analysis of data that helps
describe, show or summarize data in a meaningful way such that, for
example, patterns might emerge from the data. Descriptive statistics do not,
however, allow us to make conclusions beyond the data we have analysed or
reach conclusions regarding any hypotheses we might have made. They are
simply a way to describe our data.

Descriptive statistics are very important because if we simply presented our raw data it would be hard to visualize what the data were showing, especially if there was a lot of it. Descriptive statistics therefore enable us to present the data in a more meaningful way, which allows simpler interpretation of the
data. For example, if we had the results of 100 pieces of students'
coursework, we may be interested in the overall performance of those
students. We would also be interested in the distribution or spread of the
marks. Descriptive statistics allow us to do this. How to properly describe data
through statistics and graphs is an important topic and is discussed in other Laerd Statistics guides. Typically, there are two general types of statistic that
are used to describe data:

o Measures of central tendency: these are ways of describing the central position of a frequency distribution for a group of data. In this case, the frequency distribution is simply the distribution and pattern of marks scored by the 100 students from the lowest to the highest. We can describe this central position using a number of statistics, including the mode, median, and mean. Measures of central tendency are discussed in detail later in this document.

o Measures of spread: these are ways of summarizing a group of data by describing how spread out the scores are. For example, the mean score of our 100 students may be 65 out of 100. However, not all students will have scored 65 marks. Rather, their scores will be spread out. Some will be lower and others higher. Measures of spread help us to summarize how spread out these scores are. To describe this spread, a number of statistics are available to us, including the range, quartiles, absolute deviation, variance and standard deviation. (Both families of measures are illustrated in the short sketch below.)
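To make these two families of measures concrete, here is a minimal Python sketch, using the standard-library statistics module; the marks are hypothetical values invented for illustration:

import statistics

marks = [52, 67, 71, 58, 65, 65, 74, 49, 88, 61]   # hypothetical marks out of 100

# Measures of central tendency
print("mean:  ", statistics.mean(marks))
print("median:", statistics.median(marks))
print("mode:  ", statistics.mode(marks))            # most frequent mark (65)

# Measures of spread
print("range:   ", max(marks) - min(marks))
print("quartiles:", statistics.quantiles(marks, n=4))   # [Q1, median, Q3]
print("variance: ", statistics.variance(marks))          # sample variance
print("std dev:  ", statistics.stdev(marks))             # sample standard deviation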

When we use descriptive statistics it is useful to summarize our group of data using a
combination of tabulated description (i.e., tables), graphical description (i.e., graphs
and charts) and statistical commentary (i.e., a discussion of the results).

Inferential Statistics
We have seen that descriptive statistics provide information about our immediate
group of data. For example, we could calculate the mean and standard deviation of the
exam marks for the 100 students and this could provide valuable information about
this group of 100 students. Any group of data like this, which includes all the data you
are interested in, is called a population. A population can be small or large, as long as
it includes all the data you are interested in. For example, if you were only interested
in the exam marks of 100 students, the 100 students would represent your population.
Descriptive statistics are applied to populations, and the properties of populations, like
the mean or standard deviation, are called parameters as they represent the whole
population (i.e., everybody you are interested in).

Often, however, you do not have access to the whole population you are interested in
investigating, but only a limited number of data instead. For example, you might be
interested in the exam marks of all students in the UK. It is not feasible to measure all
exam marks of all students in the whole of the UK so you have to measure a
smaller sample of students (e.g., 100 students), which are used to represent the larger
population of all UK students. Properties of samples, such as the mean or standard
deviation, are not called parameters, but statistics. Inferential statistics are techniques
that allow us to use these samples to make generalizations about the populations from
which the samples were drawn. It is, therefore, important that the sample accurately
represents the population. The process of achieving this is called sampling (sampling strategies are discussed in more detail later in this document). Inferential statistics arise out
of the fact that sampling naturally incurs sampling error and thus a sample is not
expected to perfectly represent the population. The methods of inferential statistics are
(1) the estimation of parameter(s) and (2) testing of statistical hypotheses.
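As a simple illustration of sampling error, the hypothetical simulation below (not part of the original text) draws repeated samples of 100 from a simulated population of exam marks: each sample mean only approximates the population mean.

import random
import statistics

random.seed(1)
population = [random.gauss(65, 10) for _ in range(10_000)]  # simulated exam marks
mu = statistics.mean(population)                             # population parameter

for i in range(3):
    sample = random.sample(population, 100)                  # SRS of 100 students
    print(f"sample {i + 1} mean = {statistics.mean(sample):.2f}  (mu = {mu:.2f})")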
The field of statistics is divided into two major divisions: inferential and descriptive statistics. Each of
these segments of statistics is important, with different techniques that accomplish different objectives.
We will consider both of these areas and see what the differences are between descriptive and
inferential statistics.

DESCRIPTIVE STATISTICS

Descriptive statistics is the type of statistics that probably springs to most people’s minds when they
hear the word “statistics.” Here the goal is to describe.

Numerical measures are used to tell about features of a set of data. There are a number of items that
belong in this portion of statistics, such as:

• The average, or measure of the center of a data set, consisting of the mean, median, mode, or midrange.

• The spread of a data set, which can be measured with the range or standard deviation.

• Overall descriptions of data, such as the five number summary.

• Other measurements, such as skewness and kurtosis.

• The exploration of relationships and correlation between paired data.

• The presentation of statistical results in graphical form.

INFERENTIAL STATISTICS

For the area of inferential statistics, we begin by differentiating between two groups. The population is
the entire collection of individuals that we are interested in studying. It is typically impossible or
infeasible to examine each member of the population individually. So we choose a representative subset
of the population, called a sample.

Inferential statistics studies a statistical sample, and from this analysis it is able to say something about the population from which the sample came. There are two major divisions of inferential statistics:

• A confidence interval gives a range of values for an unknown parameter of the population by measuring a statistical sample. This is expressed in terms of an interval and the degree of confidence that the parameter is within the interval.

• Tests of significance or hypothesis testing test a claim about the population by analyzing a statistical sample. By design, there is some uncertainty in this process. This can be expressed in terms of a level of significance.
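A minimal sketch of these two divisions, assuming the third-party scipy library is available; the sample values are hypothetical:

import statistics
from scipy import stats

sample = [63, 71, 58, 66, 69, 74, 60, 65, 72, 68]

# 1. Confidence interval: a 95% interval for the unknown population mean.
n = len(sample)
mean = statistics.mean(sample)
sem = stats.sem(sample)                      # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)        # two-sided 95% critical value
print(f"95% CI: ({mean - t_crit * sem:.2f}, {mean + t_crit * sem:.2f})")

# 2. Test of significance: does the population mean differ from 65?
t_stat, p_value = stats.ttest_1samp(sample, popmean=65)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")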

DIFFERENCE BETWEEN THESE AREAS

As seen above, descriptive statistics is concerned with telling about certain features of a data set.
Although this is helpful in learning things such as the spread and center of the data we are studying,
nothing in the area of descriptive statistics can be used to make any sort of generalization. In descriptive
statistics, measurements such as the mean and standard deviation are stated as exact numbers. Though
we may use descriptive statistics all we would like in examining a statistical sample, this branch of
statistics does not allow us to say anything about the population.

Inferential statistics is different from descriptive statistics in many ways. Even though there are similar
calculations, such as those for the mean and standard deviation, the focus is different for inferential
statistics. Inferential statistics starts with a sample and then generalizes to a population. This information about the population is not stated as a single exact number; instead, we express these parameters as a range of potential values, along with a degree of confidence.
It is important to know the difference between descriptive and inferential statistics. This knowledge is
helpful when we need to apply it to a real world situation involving statistical methods.

Classification of Data

Classification is the process of arranging the collected data into classes and subclasses according to their common characteristics. In other words, classification is the grouping of related facts into classes, e.g. the sorting of letters in a post office.

Types of classification

There are four types of classification. They are

Geographical classification

Chronological classification

Qualitative classification

Quantitative classification

(i) Geographical classification

When data are classified on the basis of location or areas, it is called geographical classification

Example: Classification of production of food grains in different states in India.

States            Production of food grains (in '000 tons)
Tamil Nadu        4500
Karnataka         4200
Andhra Pradesh    3600

(ii) Chronological classification

Chronological classification means classification on the basis of time, like months, years etc.

Example: Profits of a company from 2001 to 2005.

Year    Profits (in '000 Rupees)
2001    72
2002    85
2003    92
2004    96
2005    95

(iii) Qualitative classification

In Qualitative classification, data are classified on the basis of some attributes or quality such as sex,
colour of hair, literacy and religion. In this type of classification, the attribute under study cannot be
measured. It can only be found out whether it is present or absent in the units of study.
(iv) Quantitative classification

Quantitative classification refers to the classification of data according to some characteristics, which
can be measured such as height, weight, income, profits etc.

Example: The students of a school may be classified according to the weight as follows

Weight (in kg)    No. of Students
40-50             50
50-60             200
60-70             300
70-80             100
80-90             30
90-100            20
Total             700

There are two types of quantitative classification of data. They are

Discrete frequency distribution

Continuous frequency distribution

In this type of classification there are two elements: (i) the variable and (ii) the frequency.

Variable

Variable refers to the characteristic that varies in magnitude or quantity. E.g. weight of the students. A
variable may be discrete or continuous.

Discrete variable

A discrete variable can take only certain specific values, typically whole numbers (integers), e.g. the number of children in a family or the number of classrooms in a school.

Continuous variable

A continuous variable can take any numerical value within a specific interval.

Example: the weight of a student in a particular class may take any value between 60 and 80 kg.

Frequency

Frequency refers to the number of times each value of the variable is repeated.

For example, if there are 50 students having a weight of 60 kg, then 50 is the frequency.

Frequency distribution

Frequency distribution refers to data classified on the basis of some variable that can be measured such
as prices, weight, height, wages etc.

Examples of discrete and continuous frequency distributions are sketched after the following technical terms.

The following technical terms are important when a continuous frequency distribution is formed

Class limits: Class limits are the lowest and highest values that can be included in a class. For example, take the class 40-50. The lowest value of the class is 40 and the highest value is 50. In this class there can be no value less than 40 or more than 50; 40 is the lower class limit and 50 is the upper class limit.

Class interval: The difference between the upper and lower limit of a class is known as the class interval of that class. For example, in the class 40-50 the class interval is 10 (i.e. 50 − 40).

Class frequency: The number of observations corresponding to a particular class is known as the frequency of that class.

Example:

Income (Rs) No. of persons

1000 - 2000 50

In the above example, 50 is the class frequency. This means that 50 persons earn an income between Rs.1,000 and Rs.2,000.

Class mid-point: The mid-point of a class is found as follows: class mid-point = (lower class limit + upper class limit) / 2. For example, the mid-point of the class 40-50 is (40 + 50) / 2 = 45.
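To illustrate these terms, here is a minimal Python sketch (with hypothetical data) that builds a discrete frequency distribution with collections.Counter, and a continuous one from class intervals, reporting class limits, frequencies and mid-points:

from collections import Counter

# Discrete variable: number of children per family
children = [0, 1, 2, 2, 1, 3, 0, 2, 1, 2]
print(sorted(Counter(children).items()))   # [(0, 2), (1, 3), (2, 4), (3, 1)]

# Continuous variable: weights grouped into the classes 40-50, 50-60, ...
weights = [44.5, 52.0, 61.3, 47.8, 58.9, 63.2, 55.1, 49.9, 66.7, 71.4]
classes = [(40, 50), (50, 60), (60, 70), (70, 80)]
for lower, upper in classes:
    freq = sum(lower <= w < upper for w in weights)   # class frequency
    midpoint = (lower + upper) / 2                    # class mid-point
    print(f"{lower}-{upper}: frequency = {freq}, mid-point = {midpoint}")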

Tabulation of Data

A table is a systematic arrangement of statistical data in columns and rows. Rows are horizontal
arrangements whereas the columns are vertical ones.

Simple Random Sampling and Other Sampling Methods

Sampling Methods can be classified into one of two categories:

• Probability Sampling: the sample has a known probability of being selected

• Non-probability Sampling: the sample does not have a known probability of being selected, as in convenience or voluntary response surveys

Probability Sampling

In probability sampling it is possible both to determine which sampling units belong to which sample and to know the probability that each sample will be selected. The following sampling methods are examples of probability sampling:

1. Simple Random Sampling (SRS)

2. Stratified Sampling

3. Cluster Sampling

4. Systematic Sampling

5. Multistage Sampling (in which some of the methods above are combined in stages)

Of the five methods listed above, students have the most trouble distinguishing between stratified
sampling and cluster sampling.

Stratified Sampling is possible when it makes sense to partition the population into groups based on a
factor that may influence the variable that is being measured. These groups are then called strata. An
individual group is called a stratum. With stratified sampling one should:

• partition the population into groups (strata)

• obtain a simple random sample from each group (stratum)

• collect data on each sampling unit that was randomly sampled from each group (stratum)

Stratified sampling works best when a heterogeneous population is split into fairly homogeneous groups. Under these conditions, stratification generally produces more precise estimates of the population percentages than estimates that would be found from a simple random sample. Table 3.2 shows some examples of ways to obtain a stratified sample.
Table 3.2. Examples of Stratified Samples

Example 1
  Population: all people in the U.S.
  Groups (strata): the 4 time zones in the U.S. (Eastern, Central, Mountain, Pacific)
  Obtain a simple random sample: 500 people from each of the 4 time zones
  Sample: 4 × 500 = 2,000 selected people

Example 2
  Population: all PSU intercollegiate athletes
  Groups (strata): the 26 PSU intercollegiate teams
  Obtain a simple random sample: 5 athletes from each of the 26 PSU teams
  Sample: 26 × 5 = 130 selected athletes

Example 3
  Population: all elementary students in the local school district
  Groups (strata): the 11 different elementary schools in the local school district
  Obtain a simple random sample: 20 students from each of the 11 elementary schools
  Sample: 11 × 20 = 220 selected students

Cluster Sampling is very different from Stratified Sampling. With cluster sampling one should:

• divide the population into groups (clusters).

• obtain a simple random sample of a certain number of clusters from all possible clusters.

• obtain data on every sampling unit in each of the randomly selected clusters.

It is important to note that, unlike with the strata in stratified sampling, the clusters should be
microcosms, rather than subsections, of the population. Each cluster should be heterogeneous.
Additionally, the statistical analysis used with cluster sampling is not only different, but also more
complicated than that used with stratified sampling.

Table 3.3. Examples of Cluster Samples

Example 1
  Population: all people in the U.S.
  Groups (clusters): the 4 time zones in the U.S. (Eastern, Central, Mountain, Pacific)
  Obtain a simple random sample: 2 time zones from the 4 possible time zones
  Sample: every person in the 2 selected time zones

Example 2
  Population: all PSU intercollegiate athletes
  Groups (clusters): the 26 PSU intercollegiate teams
  Obtain a simple random sample: 8 teams from the 26 possible teams
  Sample: every athlete on the 8 selected teams

Example 3
  Population: all elementary students in a local school district
  Groups (clusters): the 11 different elementary schools in the local school district
  Obtain a simple random sample: 4 elementary schools from the 11 possible elementary schools
  Sample: every student in the 4 selected elementary schools

The three examples found in Tables 3.2 and 3.3 illustrate how both stratified and cluster sampling could be accomplished. However, there are obviously times when one
sampling method is preferred over the other. The following explanations add some clarification about
when to use which method.
• With Example 1: Stratified sampling would be preferred over cluster sampling, particularly if the questions of interest are affected by time zone. For example, the percentage of people watching a live sporting event on television might be highly affected by the time zone they are in. Cluster sampling really works best when there are a reasonable number of clusters relative to the entire population. In this case, selecting 2 clusters from 4 possible clusters really does not provide much advantage over simple random sampling.

• With Example 2: Either stratified sampling or cluster sampling could be used. It would depend on what questions are being asked. For instance, consider the question "Do you agree or disagree that you receive adequate attention from the team of doctors at the Sports Medicine Clinic when injured?" The answer to this question would probably not be team dependent, so cluster sampling would be fine. In contrast, consider the question "Do you agree or disagree that weather affects your performance during an athletic event?" The answer to this question would probably be influenced by whether the sport is played outside or inside. Consequently, stratified sampling would be preferred.

• With Example 3: Cluster sampling would probably be better than stratified sampling if each individual elementary school appropriately represents the entire population, as in a school district where students from throughout the district can attend any school. Stratified sampling could be used if the elementary schools had very different locations and served only their local neighborhood (i.e., one elementary school is located in a rural setting while another elementary school is located in an urban setting). Again, the questions of interest would affect which sampling method should be used.

The most common method of carrying out a poll today is Random Digit Dialing, in which a machine randomly dials phone numbers. Some polls go even further and have a machine conduct the interview
itself rather than just dialing the number! Such "robo call polls" can be very biased because they have
extremely low response rates (most people don't like speaking to a machine) and because federal law
prevents such calls to cell phones. Since the people who have landline phone service tend to be older
than people who have cell phone service only, another potential source of bias is introduced. National
polling organizations that use random digit dialing in conducting interviewer based polls are very careful
to match the number of landline versus cell phones to the population they are trying to survey.

Non-probability Sampling

The following sampling methods that are listed in your text are types of non-probability sampling that
should be avoided:

1. volunteer samples

2. haphazard (convenience) samples

Since such non-probability sampling methods are based on human choice rather than random
selection, statistical theory cannot explain how they might behave and potential sources of bias are
rampant. In your textbook, the two types of non-probability samples listed above are called "sampling
disasters."

Importance of Sampling in Statistical Analysis

By Alex Silbajoris

The validity of a statistical analysis depends on the quality of the sampling used. The two most
important elements are random drawing of the sample, and the size of the sample. A small sample, even
if unbiased, can fail to include a representative mix of the larger group under analysis. A biased sample,
regardless of size, can lead to incorrect conclusions.

A Survey Design Must Draw the Right Sample

When a survey intends to depict trends in some particular group, its sample must come from that group.
Screening questions can sort out qualified and unqualified respondents for some known criterion, such
as whether they are consumers of a certain product or service, or whether they are parents or
household heads. But sometimes the screening must deal with probabilities, such as whether the
respondents are likely to vote or to buy something. In these cases, the sample should be chosen based
on criteria like past voting participation, or previous purchase of a similar product or service.

Convenience Sampling Has Disadvantages

Convenience sampling -- such as door-to-door or "man on the street" interviews -- means drawing respondents who are convenient to the survey taker, perhaps people at an event or particular place. While quick and easy, and seemingly random, it isn't a truly random sample because only those who are eager to respond will be included. People with strong opinions will be over-represented in the sample.
The nature of the location can introduce bias, such as a political event where a disproportionate number
of respondents express some particular opinion.

Systematic Sampling Has Some Advantages

Systematic sampling draws one survey subject out of a given number from some group, such as one out
of every 50 names in a telephone book, or people in a particular place, or people in some defined group.
It isn't as subjective as convenience sampling because it doesn't accept just anyone who wants to
express an opinion. But it isn't completely random because only the people in the book, chosen place or
group can be in the pool. If the survey scope is limited to those groups, then its results can resemble
those of a random sampling method.
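A minimal sketch of systematic sampling as just described, with a hypothetical list of names standing in for the telephone book: choose a random start, then take every k-th name.

import random

random.seed(3)
names = [f"name_{i}" for i in range(1000)]   # e.g. a telephone book

k = 50                                        # one out of every 50 names
start = random.randrange(k)                   # random start within the first interval
systematic_sample = names[start::k]

print(len(systematic_sample))                 # 1000 / 50 = 20 names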

Random Sampling Relies on Probability

The point of random sampling is to avoid bias from factors such as the eagerness of respondents to
express an opinion, a limited base for the sample, and chance availability of respondents. It can require
persistence in attempting to contact selected individuals; otherwise, as with convenience sampling, only
those easily available or eager to express an opinion will be counted. A truly randomized sample of a
population under study offers a glimpse of the population as a whole.

Measures of Central Tendency

Introduction

A measure of central tendency is a single value that attempts to describe a set of data by identifying the
central position within that set of data. As such, measures of central tendency are sometimes called
measures of central location. They are also classed as summary statistics. The mean (often called the
average) is most likely the measure of central tendency that you are most familiar with, but there are
others, such as the median and the mode.

The mean, median and mode are all valid measures of central tendency, but under different conditions,
some measures of central tendency become more appropriate to use than others. In the following
sections, we will look at the mean, mode and median, and learn how to calculate them and under what
conditions they are most appropriate to be used.

Mean (Arithmetic)

The mean (or average) is the most popular and well known measure of central tendency. It can be used
with both discrete and continuous data, although its use is most often with continuous data (see
our Types of Variable guide for data types). The mean is equal to the sum of all the values in the data set
divided by the number of values in the data set. So, if we have n values in a data set and they have values x1, x2, ..., xn, the sample mean, usually denoted by x̄ (pronounced "x bar"), is:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

This formula is usually written in a slightly different manner using the Greek capital letter Σ, pronounced "sigma", which means "sum of...":

$$\bar{x} = \frac{\sum x_i}{n}$$

You may have noticed that the above formula refers to the sample mean. So, why have we called it a
sample mean? This is because, in statistics, samples and populations have very different meanings and
these differences are very important, even if, in the case of the mean, they are calculated in the same
way. To acknowledge that we are calculating the population mean and not the sample mean, we use the Greek lower case letter "mu", denoted as µ:

$$\mu = \frac{\sum x_i}{N}$$

where N is the number of values in the population.

The mean is essentially a model of your data set: a single value that summarizes the whole set. You will notice, however, that the mean is not often one of the actual values that you have observed in your data set.
However, one of its important properties is that it minimises error in the prediction of any one value in
your data set. That is, it is the value that produces the lowest amount of error from all other values in
the data set.

An important property of the mean is that it includes every value in your data set as part of the
calculation. In addition, the mean is the only measure of central tendency where the sum of the
deviations of each value from the mean is always zero.
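This zero-sum property is easy to verify in a short sketch (the values are hypothetical):

import statistics

data = [1, 2, 3, 6, 8]                  # hypothetical values
mean = statistics.mean(data)            # 4.0

deviations = [x - mean for x in data]   # -3, -2, -1, 2, 4
print(sum(deviations))                  # 0.0: deviations from the mean sum to zero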

When not to use the mean

The mean has one main disadvantage: it is particularly susceptible to the influence of outliers. These are
values that are unusual compared to the rest of the data set by being especially small or large in
numerical value. For example, consider the wages of staff at a factory below:

Staff 1 2 3 4 5 6 7 8 9 10

Salary 15k 18k 16k 14k 15k 15k 12k 17k 90k 95k

The mean salary for these ten staff is $30.7k. However, inspecting the raw data suggests that this mean
value might not be the best way to accurately reflect the typical salary of a worker, as most workers
have salaries in the $12k to 18k range. The mean is being skewed by the two large salaries. Therefore, in
this situation, we would like to have a better measure of central tendency. As we will find out later,
taking the median would be a better measure of central tendency in this situation.
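In a short sketch using the salaries from the table above, the effect is easy to see:

import statistics

salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]   # in $k, from the table above

print(statistics.mean(salaries))     # 30.7 -- pulled upwards by the 90k and 95k outliers
print(statistics.median(salaries))   # 15.5 -- much closer to the typical salary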

Another time when we usually prefer the median over the mean (or mode) is when our data is skewed
(i.e., the frequency distribution for our data is skewed). If we consider the normal distribution - as this is
the most frequently assessed in statistics - when the data is perfectly normal, the mean, median and
mode are identical. Moreover, they all represent the most typical value in the data set. However, as the
data becomes skewed the mean loses its ability to provide the best central location for the data because
the skewed data is dragging it away from the typical value. However, the median best retains this
position and is not as strongly influenced by the skewed values. This is explained in more detail in the
skewed distribution section later in this guide.

Median

The median is the middle score for a set of data that has been arranged in order of magnitude. The
median is less affected by outliers and skewed data. In order to calculate the median, suppose we have
the data below:

65 55 89 56 35 14 56 55 87 45 92

We first need to rearrange that data into order of magnitude (smallest first):
14 35 45 55 55 56 56 65 87 89 92

Our median mark is the middle mark - in this case, 56. It is the middle mark because there are 5 scores before it and 5 scores after it. This works fine when you have an odd number of
scores, but what happens when you have an even number of scores? What if you had only 10 scores?
Well, you simply have to take the middle two scores and average them. So, if we look at the
example below:

65 55 89 56 35 14 56 55 87 45

We again rearrange that data into order of magnitude (smallest first):

14 35 45 55 55 56 56 65 87 89

Only now we have to take the 5th and 6th score in our data set and average them to get a median of
55.5.
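The procedure just described can be written directly as a small function; a minimal sketch, checked against the two examples above:

def median(scores):
    ordered = sorted(scores)                 # arrange in order of magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                  # odd count: the single middle score
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average the middle two

print(median([65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]))  # 56
print(median([65, 55, 89, 56, 35, 14, 56, 55, 87, 45]))      # 55.5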

Mode

The mode is the most frequent score in our data set. On a bar chart or histogram it corresponds to the highest bar. You can, therefore, sometimes consider the mode as being the most popular option.

Normally, the mode is used for categorical data where we wish to know which is the most common category. For example, in a data set recording each student's form of transport, the mode might show that the most common form of transport is the bus. However, one of the problems with the mode is that it is not unique, so it leaves us with problems when we have two or more values that share the highest frequency.

We are now stuck as to which mode best describes the central tendency of the data. This is particularly problematic when we have continuous data, because we are unlikely to have any one value that is more frequent than the others. For example, consider measuring the weights of 30 people (to the nearest 0.1 kg). How likely is it that we will find two or more people with exactly the same weight (e.g., 67.4 kg)? The answer is: probably very unlikely. Many people might be close, but with such a small sample (30 people) and a large range of possible weights, you are unlikely to find two people with exactly the same weight, that is, to the nearest 0.1 kg. This is why the mode is very rarely used with continuous data.
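Python's statistics.multimode (available from Python 3.8) makes this non-uniqueness visible by returning every value that ties for the highest frequency; a minimal sketch with hypothetical transport data:

import statistics

transport = ["bus", "car", "bus", "walk", "car", "bike"]
print(statistics.multimode(transport))   # ['bus', 'car'] -- two modes, no clear winner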

Another problem with the mode is that it will not provide us with a very good measure of central tendency when the most common mark is far away from the rest of the data in the data set. For example, if the mode of a data set is 2 but the remaining values are mostly concentrated in the 20 to 30 range, the mode is clearly not representative of the data, and using it to describe the central tendency of this data set would be misleading.

Skewed Distributions and the Mean and Median

We often test whether our data is normally distributed because this is a common assumption underlying many statistical tests.
When you have a normally distributed sample you can legitimately use either the mean or the median as your measure of central tendency. In fact, in any symmetrical distribution the mean, median and mode
are equal. However, in this situation, the mean is widely preferred as the best measure of central
tendency because it is the measure that includes all the values in the data set for its calculation, and any
change in any of the scores will affect the value of the mean. This is not the case with the median or
mode.

However, when our data is skewed (for example, with a right-skewed data set), we find that the mean is being dragged in the direction of the skew. In these situations, the median is
generally considered to be the best representative of the central location of the data. The more skewed
the distribution, the greater the difference between the median and mean, and the greater emphasis
should be placed on using the median as opposed to the mean. A classic example of a right-skewed distribution is income (salary), where higher earners provide a false representation of the typical income if it is expressed as a mean and not a median.

If you expected your data to be normally distributed but tests of normality show that it is non-normal, it is customary to use the median instead of the mean. However, this is more a rule of thumb than a strict
guideline. Sometimes, researchers wish to report the mean of a skewed distribution if the median and
mean are not appreciably different (a subjective assessment), and if it allows easier comparisons to
previous research to be made.

Summary of when to use the mean, median and mode

Please use the following summary table to know what the best measure of central tendency is with
respect to the different types of variable.

Type of Variable Best measure of central tendency

Nominal Mode

Ordinal Median

Interval/Ratio (not skewed) Mean

Interval/Ratio (skewed) Median

Measures of variability: the range, inter-quartile range and standard deviation

There are many ways of describing the variability in some data set. In this guide we discuss the range,
interquartile range and standard deviation.

Introduction

Measures of average such as the median and mean represent the typical value for a dataset. Within the
dataset the actual values usually differ from one another and from the average value itself. The extent
to which the median and mean are good representatives of the values in the original dataset depends
upon the variability or dispersion in the original data. Datasets are said to have high dispersion when
they contain values considerably higher and lower than the mean value.

In figure 1 the numbers of different sized tutorial groups in semester 1 and semester 2 are presented. In both semesters the mean and median tutorial group size is 5 students; however, the groups in semester 2 show more dispersion (or variability in size) than those in semester 1.

Dispersion within a dataset can be measured or described in several ways including the range, inter-
quartile range and standard deviation.
The Range

The range is the most obvious measure of dispersion and is the difference between the lowest and
highest values in a dataset. In figure 1, the size of the largest semester 1 tutorial group is 6 students and
the size of the smallest group is 4 students, resulting in a range of 2 (6-4). In semester 2, the largest
tutorial group size is 7 students and the smallest tutorial group contains 3 students, therefore the range
is 4 (7-3).

• The range is simple to compute and is useful when you wish to evaluate the whole of a dataset.

• The range is useful for showing the spread within a dataset and for comparing the spread between similar datasets.

An example of the use of the range to compare spread within datasets is provided in table 1. The scores
of individual students in the examination and coursework component of a module are shown.

To find the range in marks the highest and lowest values need to be found from the table. The highest
coursework mark was 48 and the lowest was 27 giving a range of 21. In the examination, the highest
mark was 45 and the lowest 12 producing a range of 33. This indicates that there was wider variation in
the students’ performance in the examination than in the coursework for this module.

Since the range is based solely on the two most extreme values within the dataset, if one of these is either exceptionally high or low (sometimes referred to as an outlier) it will result in a range that is not typical of the variability within the dataset. For example, imagine in the above example that one student failed to hand in any coursework and was awarded a mark of zero, but sat the exam and scored 40. The range for the coursework marks would now become 48 (48-0), rather than 21; however, the new range is not typical of the dataset as a whole and is distorted by the outlier in the
coursework marks. In order to reduce the problems caused by outliers in a dataset, the inter-quartile
range is often calculated instead of the range.

The Inter-quartile Range

The inter-quartile range is a measure that indicates the extent to which the central 50% of values within
the dataset are dispersed. It is based upon, and related to, the median.

In the same way that the median divides a dataset into two halves, the dataset can be further divided into quarters by identifying the upper and lower quartiles. The lower quartile is found one quarter of the way
along a dataset when the values have been arranged in order of magnitude; the upper quartile is found
three quarters along the dataset. Therefore, the upper quartile lies half way between the median and
the highest value in the dataset whilst the lower quartile lies halfway between the median and the
lowest value in the dataset. The inter-quartile range is found by subtracting the lower quartile from the
upper quartile.

For example, the examination marks for 20 students following a particular module are arranged in order
of magnitude.
The median lies at the mid-point between the two central values (10th and 11th)

= half-way between 60 and 62 = 61

The lower quartile lies at the mid-point between the 5th and 6th values

= half-way between 52 and 53 = 52.5

The upper quartile lies at the mid-point between the 15th and 16th values

= half-way between 70 and 71 = 70.5

The inter-quartile range for this dataset is therefore 70.5 - 52.5 = 18 whereas the range is: 80 - 43 = 37.

The inter-quartile range provides a clearer picture of the overall dataset by removing/ignoring the
outlying values.

Like the range however, the inter-quartile range is a measure of dispersion that is based upon only two
values from the dataset. Statistically, the standard deviation is a more powerful measure of dispersion
because it takes into account every value in the dataset. The standard deviation is explored in the next
section of this guide.

Calculating the Inter-quartile range using Excel

The method Excel uses to calculate quartiles is not commonly used and tends to produce unusual results
particularly when the dataset contains only a few values. For this reason you may be best to calculate
the inter-quartile range by hand.
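As an alternative to a hand calculation, a short sketch can follow the method used in the worked example above (the quartiles are the medians of the lower and upper halves of the ordered data). Only some of the 20 marks were given above, so the list below fills in the rest with hypothetical values chosen to match the stated quartiles:

import statistics

def inter_quartile_range(values):
    # Median-of-halves convention, shown here for an even-sized dataset.
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_q = statistics.median(ordered[:half])    # lower quartile
    upper_q = statistics.median(ordered[-half:])   # upper quartile
    return upper_q - lower_q

marks = [43, 47, 49, 51, 52, 53, 55, 57, 58, 60,
         62, 63, 65, 68, 70, 71, 74, 76, 78, 80]   # partly hypothetical 20 marks

print(inter_quartile_range(marks))   # 70.5 - 52.5 = 18.0
print(max(marks) - min(marks))       # range: 80 - 43 = 37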

The Standard Deviation

The standard deviation is a measure that summarises the amount by which every value within a dataset
varies from the mean. Effectively it indicates how tightly the values in the dataset are bunched around
the mean value. It is the most robust and widely used measure of dispersion since, unlike the range and inter-quartile range, it takes into account every value in the dataset. When the values in a dataset are
pretty tightly bunched together the standard deviation is small. When the values are spread apart the
standard deviation will be relatively large. The standard deviation is usually presented in conjunction
with the mean and is measured in the same units.

In many datasets the values deviate from the mean value due to chance and such datasets are said to
display a normal distribution. In a dataset with a normal distribution most of the values are clustered
around the mean while relatively few values tend to be extremely high or extremely low. Many natural
phenomena display a normal distribution.

For datasets that have a normal distribution the standard deviation can be used to determine the proportion of values that lie within a particular range of the mean value. For such distributions it is always the case that approximately 68% of values are less than one standard deviation (1SD) away from the mean value, that approximately 95% of values are less than two standard deviations (2SD) away from the mean, and that approximately 99.7% of values are less than three standard deviations (3SD) away from the mean. Figure 3 shows this concept in diagrammatical form.
If the mean of a dataset is 25 and its standard deviation is 1.6, then:

• approximately 68% of the values in the dataset will lie between MEAN-1SD (25-1.6=23.4) and MEAN+1SD (25+1.6=26.6)

• approximately 99.7% of the values will lie between MEAN-3SD (25-4.8=20.2) and MEAN+3SD (25+4.8=29.8).

If the dataset had the same mean of 25 but a larger standard deviation (for example, 2.3) it would
indicate that the values were more dispersed. The frequency distribution for a dispersed dataset would
still show a normal distribution but when plotted on a graph the shape of the curve will be flatter as in
figure 4.
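A minimal simulation sketch (standard library only) makes this rule concrete, using the mean of 25 and standard deviation of 1.6 from the example above:

import random

random.seed(0)
mean, sd = 25, 1.6
values = [random.gauss(mean, sd) for _ in range(100_000)]

for k in (1, 2, 3):
    within = sum(mean - k * sd <= v <= mean + k * sd for v in values)
    print(f"within {k} SD: {within / len(values):.1%}")   # ~68.3%, ~95.4%, ~99.7%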

Population and sample standard deviations

There are two different calculations for the Standard Deviation. Which formula you use depends upon
whether the values in your dataset represent an entire population or whether they form a sample of a
larger population. For example, if all student users of the library were asked how many books they had
borrowed in the past month then the entire population has been studied since all the students have
been asked. In such cases the population standard deviation should be used. Sometimes it is not
possible to find information about an entire population and it might be more realistic to ask a sample of
150 students about their library borrowing and use these results to estimate library borrowing habits for
the entire population of students. In such cases the sample standard deviation should be used.

Formulae for the standard deviation

Whilst it is not necessary to learn the formula for calculating the standard deviation, there may be times
when you wish to include it in a report or dissertation.

The standard deviation of an entire population is known as σ (sigma) and is calculated using:

$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$

where x represents each value in the population, μ is the mean value of the population, Σ is the summation (or total), and N is the number of values in the population.

The standard deviation of a sample is known as s and is calculated using:

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

where x represents each value in the sample, x̄ is the mean value of the sample, Σ is the summation (or total), and n is the number of values in the sample, so the divisor is the number of values minus 1.
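In Python's statistics module these two formulas correspond to pstdev (population: divide by N) and stdev (sample: divide by n − 1); a minimal sketch with hypothetical borrowing counts echoing the library example above:

import statistics

books_borrowed = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical counts for 8 students

print(statistics.pstdev(books_borrowed))    # population standard deviation: 2.0
print(statistics.stdev(books_borrowed))     # sample standard deviation: ~2.138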

Summary

The range, inter-quartile range and standard deviation are all measures that indicate the amount of
variability within a dataset. The range is the simplest measure of variability to calculate but can be
misleading if the dataset contains extreme values. The inter-quartile range reduces this problem by
considering the variability within the middle 50% of the dataset. The standard deviation is the most
robust measure of variability since it takes into account a measure of how every value in the dataset
varies from the mean. However, care must be taken when calculating the standard deviation to consider
whether the entire population or a sample is being examined and to use the appropriate formula.
