
MTH 106

Introductory Statistics

Lecture Notes

T. Kassile
Department of Biometry and Mathematics
Faculty of Science
Sokoine University of Agriculture
Room 17, Administration Block, SMC, Tel: 0232604420 Ext. 2108

Draft
March 2013
Contents

0.1 Course objective
0.2 Course description
0.3 Requirements
0.3.1 Readings
0.3.2 Exercises
0.3.3 Continuous assessment (coursework)
0.3.3.1 Dates and times for the tests
0.4 References
0.5 Computing: SPSS (optional)

1 Chapter One: Descriptive statistics
1.1 Definition of relevant statistical terminologies
1.1.1 Descriptive statistics
1.1.2 Inferential statistics
1.1.3 Population
1.1.4 Sample
1.1.5 Statistics
1.1.6 Biometry
1.1.7 Data
1.1.8 Variables
1.1.9 Parameter and statistics
1.1.10 Operational definition
1.1.11 Validity and reliability
1.2 Data collection
1.2.1 Primary versus secondary data
1.2.1.1 Collection of primary data
1.2.1.2 Collection of secondary data
1.2.2 Editing of Data
1.2.3 The sample survey
1.2.3.1 Methods of sampling
1.2.3.1.1 Random sampling procedures
1.2.3.2 Important issues in survey research
1.2.4 Basic survey designs
1.2.5 Sample size determination
1.2.6 Questionnaire design
1.3 Data analysis/presentation
1.3.1 Introduction
1.3.2 Measures of central tendency (averages)
1.3.3 Measures of spread (dispersion)
1.3.4 Simple Linear Regression analysis
1.3.4.1 Introduction
1.3.4.2 Simple linear regression model
1.3.4.3 Fitting a simple linear regression model: the method of least squares
1.3.5 Correlation analysis

2 Chapter Two: Statistical inference
2.1 Point and interval estimation
2.1.1 Point estimation
2.1.1.1 Some properties of estimators
2.1.2 Interval estimation
2.1.2.1 Case I: Confidence interval estimation of the mean µ (σ unknown)
2.1.2.2 Case II: Confidence interval estimation of the mean µ (σ known)
2.1.2.3 Confidence interval for a difference of population means
2.2 Elementary Probability
2.2.1 Introduction
2.2.2 Some basic terminologies
2.2.2.1 Random experiment
2.2.2.2 Mutually exclusive events
2.2.2.3 Probability function
2.2.2.4 Exhaustive events
2.2.2.5 Equally likely events
2.2.2.6 Addition law for mutually exclusive events
2.2.2.7 Addition law for not mutually exclusive events
2.2.2.8 Conditional probability
2.2.2.9 Independent events
2.2.2.10 Multiplication law for not independent events
2.2.2.11 Bayes’ rule
2.2.3 Probability density function (discrete r.v)
2.2.4 Probability density function (continuous r.v)
2.2.5 Discrete distributions
2.2.5.1 The Binomial distribution
2.2.5.2 The Poisson distribution
2.2.6 Continuous probability distribution
2.2.6.1 The normal distribution

3 Chapter Three: Sampling distributions
3.1 Introduction
3.1.1 Sampling distribution of the mean
3.1.2 The Student’s t distribution
3.1.3 The Chi-square (χ2) distribution
3.1.4 The

4 Chapter Four: Hypothesis testing
4.1 Introduction
4.1.1 Level of significance
4.1.2 Confidence coefficient
4.1.3 The β risk
4.1.4 The power of a test
4.2 Type I and II errors
4.3 One-sided and two-sided tests
4.4 Steps involved in hypothesis testing
4.5 Tests of hypotheses for the mean of a single population
4.6 Testing for the difference of two population means

5 Appendices
5.1 Exercise 1
5.2 Exercise 2
5.3 Exercise 3
5.3.1 Suggested solution for question 4
5.4 Exercise 4
5.5 Exercise 5
5.6 Exercise 6
5.7 Exercise 7
5.8 Exercise 8
5.9 Exercise 9
5.10 Exercise 10
5.11 Exercise 11
5.12 Exercise 12
5.12.1 Suggested solutions
5.13 Exercise 13
5.13.1 Suggested solutions
5.14 Exercise 14
5.14.1 Suggested solutions

Preamble

0.1 Course objective

To introduce the students to some basic concepts in statistics (theory and practice)
which are necessary for handling numerical observations.

0.2 Course description

Descriptive Statistics
Definitions of relevant statistical terminologies; introduction to elementary statistics:
data collection, organization and presentation: frequency distribution, statistical
measures of central tendency and dispersion, measures of symmetry and skewness,
simple linear regression and correlation analysis.
Statistical Inference
Elementary probability theory; introduction to probability distributions: discrete
distributions, e.g., Poisson, binomial; continuous probability distributions, e.g., normal.
Sampling distributions
Sampling distributions, e.g., Student’s t distribution, Chi-square distribution,
F-distribution.
Estimation theory
Point and interval estimation.
Hypothesis testing or test of significance
Null and alternative hypotheses, level of significance, Type I and Type II errors,
one-tailed and two-tailed tests.

0.3 Requirements

0.3.1 Readings

You are required to do the readings (see Note 1) before we discuss them in class. You
will be informed at the end of each lecture which aspects are to be read for the next
lecture.

Note 1: Handouts will be provided. However, to complete your understanding of each of
the different aspects that will be discussed in this course, you are advised to consult
any of the reference books listed in Section 0.4.

0.3.2 Exercises

There will be a number of exercises throughout the semester. You should use these
exercises to assess whether you are making acceptable progress. You are strongly
urged to complete the exercises. In addition, you are encouraged to work together in
teams of 2-4 students to help each other in understanding the course material and
completing the exercise problems. However, if you find you are having trouble working
through the exercises or understanding the material covered in class, you should see
the course instructor as soon as possible. “The earlier the better”. NO credit will
be given for the exercises. Partial solutions to some of the problems in the
exercises may be provided if necessary.

0.3.3 Continuous assessment (coursework)

Two components: 2 assignments and 2 timed tests.

0.3.3.1 Dates and times for the tests

i. Test 1: Date..........................2013 Time (From................ -.................)

ii. Test 2: Date..........................2013 Time (From................ -.................)

Assignments and tests (coursework) will contribute 40% of the total credits allotted to
this course and the final written university exam (UE) will contribute 60%. All
tests will be closed-book (no lecture notes, books, etc.). You are expected to complete
the coursework assessment (tests) during the course of the semester as indicated above,
NO exceptions.

0.4 References

i. Chao, L. L. (1974). STATISTICS: Methods and Analyses. McGraw-Hill, Inc.

ii. Gupta, S. C. and Kapoor, V. K. (1994). Fundamentals of Mathematical Statistics.
Sultan Chand and Sons, New Delhi.

iii. Grimmett, G. and Welsh, D. (1986). PROBABILITY: An Introduction. Oxford
University Press, New York.

iv. Miller, I. and Miller, M. (1999). Mathematical Statistics. Prentice-Hall, Inc.

v. Montgomery, D. (2001). Introduction to Linear Regression Analysis. Wiley
and Sons, Inc.

vi. Zar, J.H. (1984). Biostatistical Analysis. 2nd Edn., Prentice-Hall.

0.5 Computing: SPSS (optional)

Where necessary, SPSS will be used to illustrate how to generate results or carry out
data analysis in a software package.

1 Chapter One: Descriptive statistics

1.1 Definition of relevant statistical terminologies

1.1.1 Descriptive statistics

Descriptive statistics deals with presenting the data we have. Data can be presented
(i) visually, through graphs (e.g., line graphs to display a trend over time, such as
maize production in Tanzania over 20 years) and charts (e.g., pie charts to display
people’s opinions about the effects of climate change on food production and
livelihood in Tanzania), or (ii) numerically, through averages such as the mean,
median, and mode. The fundamental objective of descriptive statistics is to present
the data in a clear, logical, and meaningful way.

Illustration

• Suppose that you have data on the total family income of each applicant (20,000
in total) seeking sponsorship from the Higher Education Students’ Loan Board
(HESLB) for the academic year 2012/2013.

• Data were collected on 12,000 babies born between 2000 and 2010 at a
certain public hospital in the country. The aim of the study was to understand
whether the mother’s smoking status during pregnancy, number of physician visits
during the first trimester, history of hypertension, age of the mother at birth, etc.
are risk factors for low birth weight, defined as a birth weight of less than 2500 grams.

• Suppose that you have data on the GPA of 15,000 first year students enrolled in
4-year degree programmes at Sokoine University of Agriculture (SUA). The
aim of the study is to understand whether first year GPA predicts final year
GPA.

Goal: Describe the HESLB, the hospital, and the university in terms of the total
family income of each student, birth weight of each baby, and GPA score of each
student respectively.

Problem: If we wish to describe HESLB, the hospital, and the university in terms
of total family income, birth weight, and GPA score, respectively, the listing of 20,000
family incomes, 12,000 birth weights, and 15,000 GPA scores would be unwieldy.

Solution
Use descriptive statistics. As described above, descriptive statistics provides us
with graphical and numerical techniques for describing the HESLB, the hospital,
and the university concisely in terms of the total family income of the applicants,
birth weight of the babies, and GPA score of its students.
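
As a minimal sketch of the numerical side of descriptive statistics, the short Python example below condenses a list of birth weights into a few summary values. The eight weights are invented for illustration and are not taken from any hospital data; only the 2500-gram cut-off for low birth weight comes from the illustration above.

```python
import statistics

# Hypothetical birth weights in grams (made-up values for illustration only).
birth_weights = [3200, 2450, 3100, 2900, 3600, 2300, 3050, 2800]

# Numerical summaries: a few numbers replace the full listing.
mean_weight = statistics.mean(birth_weights)
median_weight = statistics.median(birth_weights)

# Low birth weight is defined in the text as less than 2500 grams.
low_count = sum(1 for w in birth_weights if w < 2500)
low_share = low_count / len(birth_weights)

print("Number of babies:", len(birth_weights))
print("Mean birth weight (g):", mean_weight)
print("Median birth weight (g):", median_weight)
print("Share with low birth weight:", low_share)
```

With 12,000 real records the same handful of summary values would still fit on one line, which is the point of describing the data rather than listing them.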

1.1.2 Inferential statistics

Inferential statistics is concerned with data analysis for decision-making. That is, it
employs sample data to make estimates, decisions, predictions, or other generalizations
about a population. More on this later, when we discuss hypothesis testing.

Illustration
Suppose we wish to estimate the proportion of all married women in Morogoro region
who have completed at least A-level secondary school education between 2000 and
2010. Suppose, further, that a reasonably complete list of all married women in
Morogoro is available at the National Bureau of Statistics (NBS). Can we locate and
interview each of the women in the list? This will be costly and time-consuming.
An easier and more efficient approach would be to randomly sample, say, 800
women from the list of all married women and contact each of the selected women
individually.

• Use the proportion of married women in the sample who have at least A-level
secondary school education to estimate the proportion of all married women in
Morogoro with the same attribute or characteristic.
• The sample proportion is expected to be close to the proportion of all married
women in Morogoro with at least A-level secondary education.
• It is possible to tell by how much the sample estimate is expected to differ
from the proportion of all married women in Morogoro with at least A-level
secondary education.
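
The three points above can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure used by NBS: the sampling frame is simulated, its size (50,000) and the true share (20%) are invented, and the margin of error uses the usual normal approximation for a proportion.

```python
import math
import random

random.seed(106)  # fixed seed so the sketch is reproducible

# Hypothetical sampling frame: 1 = has at least A-level education, 0 = does not.
# The frame size and the 20% share are invented for illustration.
frame = [1] * 10_000 + [0] * 40_000

# Simple random sample of 800 women, drawn without replacement.
sample = random.sample(frame, 800)

# The sample proportion estimates the population proportion.
p_hat = sum(sample) / len(sample)

# Approximate standard error and 95% margin of error (normal approximation).
standard_error = math.sqrt(p_hat * (1 - p_hat) / len(sample))
margin = 1.96 * standard_error

print("Sample proportion:", round(p_hat, 3))
print("Approximate 95% margin of error:", round(margin, 3))
```

The margin of error is what allows us to say by how much the sample estimate is expected to differ from the population proportion.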

More examples

• Predicting election results, e.g., the outcome of the 2010 presidential election in
Tanzania as predicted by institutions such as Synovate, REDET, etc.
• Estimating survival rate of under five children in a malaria endemic population
in Tanzania.
• Estimating failure rate of newly developed light bulbs or Power Tiller Tractors
for Kilimo Kwanza initiative in Tanzania.

Characteristics common to all inferential statistics problems


A critical analysis of the above examples reveals the following:

i. Each example involved making an observation or measurement that could
not be predicted with certainty in advance,
ii. Each example involved sampling,
iii. Each example involved the collection of data, one measurement corresponding
to each element of the sample, and
iv. Each example aimed at making inference about a larger set of measurements
called the population.

1.1.3 Population

The totality of all actual or conceived objects of a certain class on which data are
collected; that is, the entire group of objects about which information is gathered.

1.1.4 Sample

That part of the population by means of which one seeks to represent the whole
population (in some situations, a sample may include the whole of the population). In
practice, the intention is to use sample information to make an inference about a
population. For this reason, it is particularly important to define the population
under discussion and to obtain a representative sample from the defined population.
NOTE: To avoid making erroneous conclusions, a sample must be representative of
the population. To obtain a representative sample, we follow rules for drawing
the sample items based on the principle of randomness. This will be discussed later in
the course.
Question
A statistical population is composed of:
(a) Persons or things
(b) Data
(c) Characteristics of persons or things
(d) Measurements

1.1.5 Statistics

Development and application of theories and methods to handle the collection,
analysis, and interpretation of data for drawing useful conclusions.

1.1.6 Biometry

Biometry is a branch of statistics in which statistical techniques are used for
biological investigations; that is, it is the use of statistical techniques to arrive at
a decision about a certain biological problem.

1.1.7 Data

Data (plural; the singular is datum) are a collection of facts, such as values or
measurements. They can be numbers, words, measurements, observations or even just
descriptions of things.
Examples

i. Number of animals with a certain skin condition in a cage or number of bees
in a colony: 20, 43, 10 or 1000, 1230, 3280, etc.;

ii. Marks in MTH 106 tests and assignments: 20, 32, 10, 50, 38, 48;

iii. Marital status of a sample of 35-year-old men and women in Morogoro region:
married, divorced, widowed, widower, separated, living as married, never
married;
iv. Times to abatement of symptoms from 4 samples of patients treated with 2
different drugs;
v. The classification of each in a group of 80 patients as having “high”, “average”,
or “low” systolic blood pressure.

Types of Data
Two types: discrete and continuous data. Discrete data can take only distinct
values, which can be clearly identified and separated. For example, the number of
students texting during an MTH 106 lecture in MLT 8 can only take the values 0, 1, 2,
3, and so on, with nothing in between. Continuous data can take any value within a
range. For example, when you measure the levels of selected heavy metals in a sample of
water, soil or fish, the result could take any value, depending on the instrument of
measurement and how accurately you do the measurement; it could be 2.50, 0.05, 0.55,
etc. Another example is the average amount of electricity and water consumed per
household in Morogoro per month.

1.1.8 Variables

A variable is a characteristic that changes (i.e., shows variability) from unit to unit
or one individual to another individual (e.g., heights, weights, plots, etc). Variables
are often denoted by upper case letters, e.g., X, Y, H and so on. A characteristic that
can assume only one value is called a constant. Technically, data are observations
of variables.
Types of variables
Variables may be either quantitative or qualitative. A quantitative variable is
one for which the resulting observations can be measured, or the observations are
in the form of numerical values. For example, heights, weights, etc. Observations
on quantitative variables may be further classified as continuous or discrete. A
continuous variable is one for which all values in some range are possible. With
continuous variables, we are limited in recording the exact values by the precision
and/or accuracy of the measuring device. Examples include height, weight, etc. By
contrast, a discrete or discontinuous variable is one for which the possible values are
not observed on a continuous scale because of the existence of gaps between possible
values. Often discrete observations are integers because they arise from counting.

Examples of quantitative variable


The number of petals on a flower, the number of households in Mji Mpya or Mazimbu
Campus, the number of insects caught in the sweep of a net, the number of accidents
and deaths, the number of rooms in a house, etc.
Qualitative Variable. A qualitative variable is one whose observations vary in
kind but not in degree. For example, religious affiliation (e.g., Muslim, Christian,
other), sex (male, female), marital status (e.g., married, widowed, widower, divorced,
separated, never married), political affiliation (e.g., “CCM”, “CUF”, “CHADEMA”,
“UDP”, etc.).

Quantitative vs. qualitative data


On the basis of the above discussion, one can distinguish between quantitative data
and qualitative data. The former (quantitative data) are measurements that are
recorded on a naturally occurring numerical scale, whereas the latter (qualitative data)
are observations that cannot be measured on a natural numerical scale; they
can only be classified into one of a group of categories.

Illustration
Classify each of the following variables measured as quantitative or qualitative, and
continuous or discrete.

i. Yield: quantitative, continuous (yield can take on any fraction of bushels/acre;
however, we are limited in observing that fraction by how precisely we can
measure fractions);

ii. Time: quantitative, continuous (limited only by the precision of our
time-recording method);

iii. Number of defectives: quantitative, discrete;

iv. Blood pressure rating: qualitative (categorical, with 3 categories: high,
average, low).

Exercise
1. Chemical and manufacturing plants often discharge toxic-waste materials such as
DDT into nearby rivers and streams. These toxins can adversely affect the plants
and animals inhabiting the river and the river bank. The National Environment
Management Council (NEMC) conducted a study of fish in the River Ngerengere in
Morogoro region and in its four tributary creeks: C1, C2, C3, and C4. A total of 200
fish were captured, and the following variables were measured for each:

i. River/creek where each fish was captured

ii. Species (tilapia, largemouth bass, or smallmouth buffalo fish)

iii. Length (cm)

iv. Weight (g)

v. DDT concentration (ppm)

Classify each of the five variables measured as quantitative or qualitative.


2. The following are examples of quantitative data except:
(a) Height of the student
(b) Sex of the student

(c) Cholesterol level in blood
(d) Number of blood cells/ml. of blood

3 (hypothetical). A survey is to be conducted in which 1,000 individuals are asked
whether the decision reached in 2012 by the Energy and Water Utilities Regulatory
Authority (EWURA) to suspend BP Tanzania Limited from providing services to its
esteemed customers for a three-month period was made fairly. The 1,000 individuals
are selected by random-digit telephone dialing and asked the question over the phone.

(a) What is the relevant population?


(b) What is the variable of interest? Is it quantitative or qualitative?
(c) What is the sample?
(d) What is the inference of interest to the pollster?

4 (hypothetical). The food and beverage section of SHOPRITE Company is considering
marketing a new snack food. To see how consumers react to the product, the
company conducted a taste test using a sample of 500 randomly selected shoppers
at Mlimani City in Dar es Salaam. The shoppers were asked to taste the snack food
and then fill out a short questionnaire that requested the following information:

i. What is your age?

ii. Are you the person who typically does the food shopping for your household?

iii. How many people are in your family?

iv. How would you rate the taste of the snack food on a scale of 1 to 10, where 1
is least tasty?

v. Would you purchase this snack food if it were available on the market?

vi. If you answered yes to part (v), how often would you purchase the product?

Classify the data generated for each question as quantitative or qualitative. Justify
your classifications.

Measurement scales of variables


The four commonly used measurement scales, from the weakest to the strongest level
of measurement, are the nominal, ordinal, interval, and ratio scales. Data
obtained from a categorical variable are said to have been measured on a nominal or
an ordinal scale. A nominal scale defines specific categories by name. These
categories are called levels of the scale.

Examples of nominal scale


Political party affiliation (as listed above); type of motor vehicle insurance
(“Third-party” or “Comprehensive”); ownership of a house (“yes”, “no”). We
cannot assign an order of magnitude to the various levels.

On the other hand, if the observed data are classified into distinct categories in
which ordering is implied, an ordinal level of measurement is attained. Therefore, an
ordinal scale incorporates the feature of a nominal scale and the additional feature
that observations can be ordered or ranked from low to high.

Examples of ordinal scale


Rank of academic members of staff (lowest to highest): Tutorial Assistant, Assistant
Lecturer, Lecturer, Senior Lecturer, Associate Professor, Professor. Note that
although we can rank the academic members of staff on an ordinal scale from low to
high, we cannot assign a distance between the ranks.

Interval versus ratio scale


An interval scale is an ordered scale in which the difference between the
measurements is a meaningful quantity. That is, an interval scale incorporates all
features of an ordinal (and hence nominal) scale and the additional feature that we can
specify distances between levels on a scale.

Example of interval scale


IQ test scores (e.g., 150, 128, 126, 122, etc.) for students across schools can be
ranked from lowest to highest, and exact distances between them can be measured in
score units on the IQ test.

An undesirable feature of the interval scale is that the origin on the scale is
undetermined, that is, we do not know where 0 is located. For example, for the IQ test
score, a zero (0) IQ score does not mean zero intelligence.

On the other hand, if a meaningful zero point can be defined for an interval scale,
the scale becomes a ratio scale. That is, a ratio scale incorporates all the features
of interval (and hence nominal, and ordinal) scales and the additional feature that
the ratios can be formed with levels of the scale.

Examples of ratio scale


Salary (in Tsh.), birth and death rates, height (in centimetres), weight (kilograms),
age (in years), and divorce rates.
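
The way each scale builds on the previous one can be summarised in a short Python sketch. The names `SCALES`, `ADDED_OPERATION`, and `permitted_operations` are hypothetical helpers introduced only for this illustration; the operations listed are the features described in the text above.

```python
# The four scales, ordered from the weakest to the strongest level of measurement.
SCALES = ["nominal", "ordinal", "interval", "ratio"]

# The one feature that each scale adds to the scales below it.
ADDED_OPERATION = {
    "nominal": "classify into named categories",
    "ordinal": "rank from low to high",
    "interval": "measure distances between levels",
    "ratio": "form ratios (meaningful zero point)",
}


def permitted_operations(scale):
    """Return every operation that is meaningful on the given scale."""
    position = SCALES.index(scale)
    return [ADDED_OPERATION[s] for s in SCALES[: position + 1]]


for scale in SCALES:
    print(scale, "->", permitted_operations(scale))
```

For example, `permitted_operations("ordinal")` lists classification and ranking but not distances, which matches the remark that we cannot assign a distance between academic ranks.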

1.1.9 Parameter and statistics

Any value describing a population, e.g., the population mean or population variance, is
called a parameter, while the corresponding value from the sample is called a
statistic.

1.1.10 Operational definition

An operational definition provides a meaning to a concept or variable that can
be communicated to other individuals. In the context of surveys, an operational
definition fixes how the responses to a question are to be understood. For example,
the question “What is your age?” may be interpreted differently by the respondent and
the interviewer: age may be reported to the nearest birthday or as of the last
birthday.

Example
Which of the following is an operational definition of obesity?

i. A condition characterized by excessive body fat

ii. A condition that is a high priority topic for nurse researchers

iii. A condition associated with heightened risk of health problems

iv. A score greater than 30 on the Body Mass Index (BMI)
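
A definition stated as a score on an index can be turned directly into a measurement rule, which is what makes it operational. The sketch below illustrates this for the BMI-based option, using the standard formula BMI = weight (kg) / height (m) squared; the function names and the example weights and heights are hypothetical, and the cut-off of 30 is the one quoted in the example.

```python
def body_mass_index(weight_kg, height_m):
    """BMI: weight in kilograms divided by the square of height in metres."""
    return weight_kg / height_m ** 2


def is_obese(weight_kg, height_m, threshold=30):
    """Operational rule from the example: a BMI score greater than 30."""
    return body_mass_index(weight_kg, height_m) > threshold


# Two made-up individuals, used only to show the rule being applied.
print(is_obese(95, 1.70))
print(is_obese(70, 1.75))
```

Two observers given the same weight and height will reach the same classification, which is not true of a description such as "excessive body fat".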

Exercise
Provide an operational definition for each of the following:
(a) An outstanding student
(b) A hard worker
(c) A nice day
(d) Fast service
(e) Study time
(f) A manager
(g) A boring class
(h) Commuting time to school or work
(i) An interesting book
(j) A leader

1.1.11 Validity and reliability

When deciding on the variable(s) of interest for a study, one needs to consider
validity (do the variables measure what they are intended to measure?) and
reliability (are the measurements obtained from the variables of interest stable?).

Exercise
1. Explain the difference between a categorical and a numerical random variable and
give an example of each.

2. Determine if each of the following random variables is categorical or numerical. If
numerical, determine whether the phenomenon of interest is discrete or continuous.
In addition, provide the level of measurement and an operational definition for each
of the variables.

i. Number of cellular phones per household

ii. Number of international calls made per month

iii. Length (in minutes) of longest international call made per month

iv. Number of local calls made per month

v. Length (in minutes) of longest local call made per month

vi. Ownership of a laptop

vii. Gender

viii. Amount of money spent on telecommunications in June, 2012

ix. Number of textbooks purchased

x. Number of credits registered for in the current semester

1.2 Data collection

What are the steps required for data collection? The data collection procedure can be
divided into three major stages, namely:

i. determination of the method of data collection. The researcher may use primary or
secondary data and may use observation or a questionnaire (to be discussed
later on).

ii. designing the instruments (e.g., questionnaire) of data collection. This entails
formulation of relevant questions and corresponding responses for close-ended
questions (to be discussed shortly).

iii. sampling and field work or execution of the study. Sampling involves selection
(random or non-random, depending on the sampling frame available, the degree of
representation desired and whether inference is required) and determination of the
sample size (the size depends on the availability of resources and the level of
precision/margin of error required). Field work involves administration of the designed
data collection instrument (e.g., questionnaire) through face-to-face interviews, mail,
telephone, web, etc. as discussed below.
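
To illustrate how the required margin of error drives the sample size, the sketch below uses one common textbook formula for estimating a proportion from a large population, n = z² p(1 − p) / e². This formula is an assumption brought in for illustration and is not taken from these notes; the function name `required_sample_size` and the default values (p = 0.5 as the most conservative guess, z = 1.96 for 95% confidence) are likewise hypothetical choices.

```python
import math


def required_sample_size(margin_of_error, p=0.5, z=1.96):
    """Sample size for estimating a proportion within a given margin of error.

    Assumes simple random sampling from a large population and the normal
    approximation; p = 0.5 gives the largest (most conservative) size.
    """
    n = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    return math.ceil(n)  # round up to a whole number of respondents


print(required_sample_size(0.05))
print(required_sample_size(0.03))
```

A smaller margin of error calls for a larger sample, which is why the sample size must be balanced against the resources available.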

1.2.1 Primary versus secondary data

Data are classified according to source as primary data or secondary data.
Primary data is a set of information that is collected for the first time, and thus
happens to be original in character. In contrast, secondary data is a set of
information that has already been collected by someone else or by an institution such
as the National Bureau of Statistics (NBS). It is a set of information that has been
summarized in some form and is available in published sources such as a book, a journal
article, conference proceedings, etc.

1.2.1.1 Collection of primary data
Primary data can be collected through either a census survey or a sample survey.
The former (census survey) involves the collection of data from
the whole population (complete enumeration) whereas the latter (sample survey)
involves the collection of data from part of the population.
Whether through a sample survey or a census survey, we can obtain primary data through
methods such as:

i. direct personal observation and measurement

ii. personal interviews (e.g., face-to-face interviews)

iii. mailed questionnaire

iv. combination of methods.

I: Direct personal observation and measurement. In this method, information
is sought by way of the investigator’s own direct observation, without asking the
respondents. For instance, in a study relating to consumer behaviour, the investigator,
instead of asking which brand of wrist watch the respondent uses, may look at the
watch himself or herself.
Advantages

i. Subjective bias is eliminated, if observation is done accurately.

ii. It is independent of respondents’ willingness to respond and as such is relatively
less demanding of active cooperation from the respondents than other methods
such as face-to-face interviews discussed below.

Disadvantages

i. It is expensive in terms of both time and money.

ii. It is useful and practical only when the sample sizes or populations are relatively small.

iii. The information provided by this method is very limited.

iv. Sometimes unforeseen factors may interfere with the observation exercise. At
times, the fact that some people are rarely accessible to direct observation
creates an obstacle to collecting data effectively with this method.

This method is particularly suitable in studies, which deal with subjects (i.e. respon-
dents) who are not capable of giving verbal reports of their feelings for one reason
or the other. Examples of areas where direct observation has been used are:

i. Some aspects of food consumption surveys.

ii. Price collection exercises, where enumerators can purchase the produce and
record prices.

II: Personal interviews. Under this method information is collected through
face-to-face contact. That is, the interviewer asks questions, the respondent gives
responses, and the interviewer records the responses in a data collection tool or
questionnaire.
Advantages

i. useful in large scale inquiries

ii. high response rate

iii. greater potential for collecting information on difficult items which are likely
to yield ambiguous answers in other methods such as mailed questionnaire.

Disadvantages

i. Different interviewers may give different interpretations to the questions.

ii. In the process of probing, some interviewers may suggest answers to respon-
dents.

iii. Interviewers may read questions wrongly because of the divided attention of
interviewing and recording.

III: Mailed questionnaire. In this method a questionnaire is sent (usually by
post) to the persons concerned with a request to answer the questions and return
the questionnaire.
Advantages

i. It is cheaper.

ii. Sample can be widely spread.

iii. Interviewer bias is eliminated.

iv. It is quick.

Disadvantages

i. Non-response is usually high.

ii. The answers to the questions are taken at their face value as there is no op-
portunity to probe.

iii. If it is an attitude survey, it is difficult to ascertain whether the respondent
answered the questions unaided.

iv. The method is useful only when the questionnaires are fairly simple, and,
therefore, it is not a suitable method for complex surveys.

1.2.1.2 Collection of secondary data
Secondary data can be obtained from the following sources:

i. Publications of statistical offices such as NBS, or of various ministries, e.g.,
Ministry of Agriculture and Food Security;

ii. Banks (e.g., Bank of Tanzania), District Councils, Municipalities, City
Councils, etc.;

iii. Publications of associations, e.g., Tanzania Chamber of Commerce, Industry and
Agriculture (TCCIA), etc.;

iv. Journals;

v. Research organisations such as universities and other institutions.

Question
How do you decide which mode of data collection to employ? Choice may be influ-
enced by things like:

• population of interest

• characteristics of the sample

• types of questions

• question topic

• response rate desired

• available resources (cost and time)

In general, when you have a problem that requires data you can:

• Use published data

• Design an experiment

• Conduct a survey (census survey or sample survey)

1.2.2 Editing of Data

Editing of data is a process of examining the collected data (especially in surveys) to
detect errors and omissions and to correct these when possible. As a matter of fact,
editing involves a careful scrutiny of the completed questionnaires. Editing is done
to ensure that the data are accurate, consistent with other facts gathered, uniformly
entered, as complete as possible, and well arranged to facilitate further
treatment.

1.2.3 The sample survey

Remark: in practice, because of limited resources (time and money), people often
opt for a selection of respondents, i.e. selection of a small proportion of the total
population of interest. If we wish to make inference about the entire population
from which the sample is drawn we must obtain a representative sample. A
representative sample exhibits characteristics typical of those possessed by the target
population.
Question
How do we achieve the representative sample requirement?
Answer
By selecting a random sample. A random sample ensures that every element in the
population of interest has the same chance of being selected to constitute the
sample.

Definition
The process of selecting a sample is called a sampling technique. A survey
conducted on the selected sample is known as a sample survey.

1.2.3.1 Methods of sampling


There are two types of sampling procedures, namely nonrandom sampling and
random sampling. In the former (nonrandom) sampling procedure, the chances
of selecting units to constitute the sample are not the same, whereas in the latter
(random) sampling procedure, every individual in the population has the same
chance of being selected to form the sample. Accordingly, there are two types of
samples generated: nonprobability samples and probability samples, respectively.
Examples of nonprobability samples include the judgemental sample (selecting study
subjects based on reasonable judgement that the selected subjects are more likely
to provide the required information), the convenience sample (selecting study
subjects based on their accessibility), and the quota sample (dividing the population
into mutually exclusive and exhaustive portions, or quotas, and then selecting a
pre-determined number from each portion).
The major objective of many statistical analyses is to make inferences (e.g.,
prediction, making decisions) about specific characteristics of a population based on
information contained in a random sample drawn from the entire population. As
already alluded to above, the condition of randomness is essential to make sure
the sample is representative of the population. Thus, we restrict our attention
to random sampling or probability sampling procedures.

1.2.3.1.1 Random sampling procedures


There are four random sampling procedures commonly used in practice:

i. simple random sampling

ii. stratified random sampling

iii. systematic random sampling

iv. cluster random sampling.

Accordingly, there are four kinds of probability samples:

i. simple random sample

ii. stratified sample

iii. systematic sample and

iv. cluster sample

I: Simple random sampling


Simple random sampling is a method of selecting n units out of the N units in the
population such that every one of the ${}^{N}C_{n}$ distinct samples has an equal
chance of being drawn.

• The units in the finite population are numbered from 1 to N (population
frame); a series of random numbers between 1 and N is then drawn, either
using a table of random numbers or computer programs such as the SURVEYSELECT
procedure (PROC SURVEYSELECT) in SAS.

• Simple random sampling can be with replacement or without replacement.
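
The drawing step described above can be sketched in a few lines of Python. This is only an illustration, not part of the course software: the function name `simple_random_sample`, the frame size N=50 and the sample size n=5 are hypothetical choices, and Python's standard `random` module stands in for a table of random numbers.

```python
import random

def simple_random_sample(N, n, replace=False, seed=None):
    """Draw n unit labels from the population frame 1..N."""
    rng = random.Random(seed)          # seeded generator, so a draw can be repeated
    frame = range(1, N + 1)            # units numbered from 1 to N
    if replace:
        # With replacement: the same unit may be drawn more than once.
        return [rng.choice(frame) for _ in range(n)]
    # Without replacement: every unit appears at most once.
    return rng.sample(frame, n)

without_repl = simple_random_sample(N=50, n=5, seed=1)
with_repl = simple_random_sample(N=50, n=5, replace=True, seed=1)
print(without_repl, with_repl)
```

Sampling without replacement never repeats a unit, whereas sampling with replacement may do so; this difference is worth keeping in mind for the exercise that follows.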

Exercise
Which sampling mechanism (with replacement or without replacement) would you
prefer to use and why?

II: Stratified random sampling


If a population from which a sample is to be drawn does not constitute a homoge-
neous group, stratified sampling technique is generally applied in order to obtain a
representative sample.

• Under stratified sampling the population is divided into several sub-populations
(called strata) that are individually more homogeneous than the total population
and then we select items from each stratum to constitute a sample.
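
A minimal sketch of the idea, under stated assumptions: the strata, their names (`first_year`, `second_year`) and the number of units taken from each are hypothetical, and a simple random sample without replacement is assumed within each stratum. Other selection rules within strata are possible.

```python
import random

def stratified_sample(strata, sizes, seed=None):
    """Draw a simple random sample (without replacement) from each stratum."""
    rng = random.Random(seed)
    sample = {}
    for name, units in strata.items():
        # sizes[name] is the number of units to take from this stratum
        sample[name] = rng.sample(units, sizes[name])
    return sample

# Hypothetical frame: 100 students split into two strata.
strata = {
    "first_year": list(range(1, 61)),     # units 1..60
    "second_year": list(range(61, 101)),  # units 61..100
}
sizes = {"first_year": 6, "second_year": 4}
chosen = stratified_sample(strata, sizes, seed=7)
print(chosen)
```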

Question
How are strata formed and how should items be selected from each stratum?

III: Systematic random sampling


In systematic random sampling, every kth item on a list is selected, where k = N/n,
N = population size, n = sample size and k = selection interval. An element of
randomness is introduced into this kind of sampling by using random numbers to
pick the unit with which to start.

Note: If k is not an integer, the next whole number value is used.
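
The procedure can be sketched as follows. The function name and the figures N=1000 and n=50 are hypothetical, and choosing the random start between 1 and k is an assumption (a common convention) rather than something stated in the notes.

```python
import math
import random

def systematic_sample(N, n, seed=None):
    """Return the selection interval k and the selected unit numbers."""
    k = math.ceil(N / n)           # selection interval; rounded up if N/n is not an integer
    rng = random.Random(seed)
    start = rng.randint(1, k)      # assumed convention: random start in 1..k
    return k, list(range(start, N + 1, k))

k, units = systematic_sample(N=1000, n=50, seed=3)
print(k, units[:5])
```

When N/n has to be rounded up, the list may contain slightly fewer than n units.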

Exercise
Suppose you have decided to use a systematic sampling procedure for a study. The
known population size is 5,000, and the sample size desired is 250. What is the
sampling interval? If the first element selected is 23, what would be the fourth, fifth,
sixth, and ninth elements selected?

IV: Cluster or area sampling

• If the study area/population is large, divide it into a number of smaller
non-overlapping areas and then randomly select a number of these smaller areas
(often called clusters), with the ultimate sample consisting of all (or samples
of) units in these smaller areas or clusters.

• In cluster sampling the total population is divided into a number of relatively
small subdivisions, which are themselves clusters of still smaller units, and then
some of these clusters are randomly selected for inclusion in the overall sample.

• Note: when the clusters are too large, a second set of clusters is taken from
each original cluster. This leads to what is commonly known as two-stage
cluster sampling. If a third set of clusters is taken it is known as three-stage
cluster sampling, etc.
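
As an illustration only, a one-stage cluster sample (all units in the selected clusters are taken) might be sketched as below. The village names and household labels are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select whole clusters and return every unit they contain."""
    rng = random.Random(seed)
    picked = rng.sample(sorted(clusters), n_clusters)   # select cluster names at random
    units = [u for name in picked for u in clusters[name]]
    return picked, units

# Hypothetical population: households grouped by village.
clusters = {
    "village_A": ["A1", "A2", "A3"],
    "village_B": ["B1", "B2"],
    "village_C": ["C1", "C2", "C3", "C4"],
    "village_D": ["D1", "D2"],
}
picked, units = cluster_sample(clusters, n_clusters=2, seed=5)
print(picked, units)
```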

Review question
At this stage you must be able to answer the question: why do we go for sample or
census surveys?
Reasons for sample survey include:

i. Reduced cost: Because the sample involves few individuals;

ii. Can save time: Because the number of individuals covered is small. This is
especially important when results are immediately required;

iii. Greater scope: Highly trained enumerators, supervisors and instruments of data
collection can be used;

iv. Save product: For destructive surveys, collecting data from a sample can save
the product being studied;

v. If it is impossible to access the whole population, the sample is the only option
(for example, when the population has infinitely many members).

Reasons for census survey include:

i. The selected sample may not be a good representative of the population. To
eliminate the chance that the sample is not representative, a census can be
conducted;

ii. A client (the person authorizing and/or underwriting the study) might not have an
appreciation for random sampling and may feel more comfortable with conducting
a census.

1.2.3.2 Important issues in survey research


Potential errors
It is important to note that even if a probability sampling mechanism is employed,
there are potential errors that the investigator needs to be aware of when designing
the survey. There are four types of errors in survey research. These are:

i. Coverage error: results from the exclusion of certain groups of subjects from
the population frame.

ii. Nonresponse error: results from the failure to collect data on all subjects in
the sample.

iii. Sampling error: reflects the heterogeneity or chance differences from sample
to sample based on the probability of subjects being selected in the particular
samples.

iv. Measurement error: refers to inaccuracies in the recorded responses that
occur because of a weakness in question wording, an interviewer’s effect on the
respondent, or the effort made by the respondent.

Ethical issues
It is also important to note that not all survey research is ethical. For instance,
purposive exclusion of particular groups of individuals from the population
frame in order to obtain results that are favourable to the sponsor of the survey is
unethical. Furthermore, designing questions that are likely to guide the respondent in
a particular direction, so as to capture responses that would produce positive results,
is unethical.

Review questions
At this juncture I expect you to be able to answer the following conceptual questions.
Make sure that you have a clear understanding of the concepts and are able to explain
any of them to a fellow student who, for practical reasons, missed any of
the lectures. If you have problems answering any of these questions, re-read
the appropriate section(s) in the course notes or consult the instructor for further
elucidation of the concept(s).

Question 1

i. What is the difference between a sample and a population?

ii. What is the difference between a statistic and a parameter?

iii. What is the difference between an enumerative study and an analytical study?

iv. What is the difference between a categorical and a numerical random variable?

v. What is the difference between discrete and continuous data?

vi. What are the various levels of measurement?

vii. What is an operational definition and why is it so important?

viii. What are the main reasons for obtaining data and what methods can be used
to accomplish this?

ix. What is the difference between probability and nonprobability sampling?

x. Why is the compiling of a complete population frame so important for survey
research?

xi. What is the difference between sampling with versus without replacement?

xii. What distinguishes the four potential sources of error when dealing with sur-
veys designed using probability sampling?

Question 2
For each of the following statements, write True if the statement is true and False
if it is not true.

i. The interval scale of measurement is characterized by equal units of measurement
and an arbitrary zero point.

ii. In principle, the ordinal scale presumes that if "a" is greater than "b" and "b"
is greater than "c", then it is true that c<b<a.

iii. A continuous variable is a variable that can theoretically assume a finite
number of values.

iv. In quantitative data analysis, normally, bar graphs are used to show/present
the frequencies that characterize a quantitative variable.

v. Sampling units refer to the limited members of the population selected.

vi. A random sample is one that is always typical of the population.

vii. The standard deviation is a quantity of fundamental importance in statistics.

viii. For all practical purposes, infinite populations are large populations while finite
populations are small populations.

ix. When you estimate population parameters based on the properties of the sam-
ple you are making a sampling error.

x. The general ethical issue is that the research design should subject respondents
to material disadvantage.

xi. Access to both primary and secondary data depends on related objectives,
research questions except research design.

xii. Stratified random sampling techniques are based on the geographical proximity
or a common characteristic.

xiii. Focus group interviews involve face-to-face, repeated interaction between the
researchers and her/his informants.

Question 3
Given a population of size N=93, use a table of random numbers to draw a random sample
of size n=15 without replacement. List the 15 coded sequences obtained and compute
the sample mean. Repeat the exercise by sampling with replacement. Compare the
results obtained.

1.2.4 Basic survey designs

Cross-sectional surveys
In cross-sectional survey designs the required data are collected at one point in time
from a sample selected to represent a larger population.
Longitudinal Surveys
This involves collecting information over a given period of time. Longitudinal studies
may take one of three forms: a trend study, in which samples of a population are
surveyed at different points in time; a cohort study, in which the same population is
studied each time data are collected, although the samples studied may be different;
and a panel study, in which data are collected at various time points from the same
sample of respondents.

1.2.5 Sample size determination

There are numerous formulas and software packages for estimating the size of a sam-
ple for finite populations. However, the formulas vary depending on the quantity
(mean, proportion, difference between two population means, etc.) to be estimated.
To illustrate this aspect, we consider sample size calculation for estimating a pop-
ulation mean.
Consider the equation

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \qquad (1)$$

Multiplying both sides of equation (1) by $\sigma/\sqrt{n}$ we obtain
$Z\,\sigma/\sqrt{n} = \bar{X} - \mu$. Defining $e = \bar{X} - \mu$ (the sampling error:
the difference between the sample mean, $\bar{X}$, and the population mean, $\mu$)
and solving for n we obtain

$$n = \frac{Z^{2}\sigma^{2}}{e^{2}} \qquad (2)$$

From equation (2) the sample size depends on three quantities:

• The confidence level desired, which determines the value of Z, the critical value
from the normal distribution.

• The sampling error permitted, e

• The population standard deviation, σ

Example
A survey is planned to determine the average annual family medical expenses of
employees of a large company. The management of the company wishes to be 95%
confident that the sample average is correct to within ±TZS 50 of the true average
family medical expenses. A pilot study indicates that the standard deviation can be
estimated as TZS 400.

i. How large a sample is necessary?

ii. If management wants to be correct to within ±TZS 25, what sample size is
necessary?

Solution
(i) Given e=50 (sampling error), σ=400 (population standard deviation obtained
from pilot study), and 95% confidence of estimating the true mean (Z=1.96 critical
value from the normal distribution)

Substituting these quantities in equation (2) we have

$$n = \frac{(1.96)^{2}(400)^{2}}{(50)^{2}} = \frac{614656}{2500} = 245.8624$$

Therefore, n=246 (rounded up to the nearest integer value)

(ii) Given e=25, σ=400, and 95% confidence (Z=1.96)

Again substituting these quantities in equation (2) we have

$$n = \frac{(1.96)^{2}(400)^{2}}{(25)^{2}} = \frac{614656}{625} = 983.4496$$

Therefore, n=984 (rounded up to the nearest integer value)

Note: the general rule in determining sample size is to always round up to the
nearest integer value.
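
The calculation in the worked example can be checked with a short Python sketch. The function name `sample_size_for_mean` is hypothetical; it simply evaluates n = Z²σ²/e² and rounds up, as the note above requires.

```python
import math

def sample_size_for_mean(z, sigma, e):
    """Sample size for estimating a mean: n = (z * sigma / e)^2, rounded up."""
    return math.ceil((z * sigma / e) ** 2)

n_part_i = sample_size_for_mean(z=1.96, sigma=400, e=50)    # part (i) of the example
n_part_ii = sample_size_for_mean(z=1.96, sigma=400, e=25)   # part (ii) of the example
print(n_part_i, n_part_ii)
```

Halving the permitted sampling error roughly quadruples the required sample size, because e enters the formula squared.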

Exercise
1. An advertising agency that serves a major radio station in Tanzania would like
to estimate the average amount of time the station’s audience spends listening to radio
on a daily basis. From past studies the standard deviation is estimated as 45 minutes.

i. What sample size is needed if the agency wants to be 90% confident of being
correct to within ±5 minutes?

ii. If 99% confidence is desired, what sample size is necessary?

2. Suppose that REDET wants to estimate the proportion of voters who will vote
for the CCM candidate in the 2015 presidential election. REDET would like 90%
confidence that its prediction is correct to within ±0.4 of the population proportion.

i. What sample size is needed?

ii. If REDET wants to have 95% confidence, what sample size is needed?

iii. If it wants to have 95% confidence and a sampling error of ±0.03, what sample
size is needed?

Hint: use $n = \frac{Z^{2}p(1-p)}{e^{2}}$, where p is the true proportion of success,
and note that if no prior knowledge or estimate of the true proportion p is
available, use p=0.5.
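
The formula in the hint can be sketched in Python as follows. The function name is hypothetical, and the values used (95% confidence, a sampling error of ±0.05, and a prior estimate p=0.3) are illustrative figures that are not taken from the exercise.

```python
import math

def sample_size_for_proportion(z, e, p=0.5):
    """Sample size for estimating a proportion: n = z^2 p(1-p) / e^2, rounded up."""
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)

n_no_prior = sample_size_for_proportion(z=1.96, e=0.05)           # p defaults to 0.5
n_with_prior = sample_size_for_proportion(z=1.96, e=0.05, p=0.3)  # hypothetical prior estimate
print(n_no_prior, n_with_prior)
```

Using p=0.5 gives the largest value of p(1−p), and therefore the most conservative (largest) sample size.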

1.2.6 Questionnaire design

The questionnaire is an important tool in the data collection process, which involves
the transfer of information from one party (the respondent) to another (the
interviewer). Therefore, in order to collect correct and reliable information, the
questionnaire must be properly designed. The size and format of the questionnaire are
crucial considerations when designing the questionnaire. Generally a good
questionnaire should:

• Enable the collection of accurate information to meet the needs of potential
data users in a timely manner;

• Facilitate the work of data collection, data processing and tabulation;

• Ensure economy in data collection, that is, avoid collection of any non-essential
information;

• Permit comprehensive and meaningful analysis and purposeful utilization of
the data collected.

Formulation of questions
Questionnaires can contain closed-ended (forced choice) and open-ended questions. When
designing a questionnaire, always use simple, clear, precise and unambiguous
language.
General guidelines
1. Try to be concise
Example
Poor formulation: How do you feel about building an airport at Jangwani grounds
which have not been used optimally for a number of years?
Better formulation: An airport should be built at Jangwani grounds
1 = strongly agree
2 = agree
3 = disagree
4 = strongly disagree
2. Use mutually exclusive and exhaustive categories
Poor formulation: What is your marital status?
1=Married
2=Single
Better formulation: What is your marital status?
1=Married
2=Divorced
3=Separated
4=Widowed
5=Never Married
3. Use caution when asking personal questions
Poor formulation: How much do you earn each year? TZS.......................
Better formulation: In which category does your income for last year best fit?
1=TZS 100,000 or below
2=TZS 100,001-TZS 200,000
3=TZS 200,001-TZS 300,000
4=TZS 300,001-TZS 400,000
5=TZS 400,001-TZS 500,000
6=Over TZS 500,000
4. Question Order

• Most familiar to least familiar

• Avoid items that look alike

• Sensitive questions should be well after the start of the survey

• End with easy questions!

5. Limit "skip" patterns
Do you participate in sports?
1 = No (GO TO QUESTION NO.)
2 = Yes (circle all sports that apply)
1=Football
2=Volleyball
3=Basketball
4=Soccer
5=Swimming
6=Other (Please specify.......................................................)
More examples
When an unforeseen event occurs, do you address it by selling some of your assets?
1=Usually
2=Always
3=Sometimes
4=Not at all
Other ordinal scales commonly used:

• Excellent, Very Good, Fair, Poor

• Always, Very Often, Fairly Often, Sometimes, Almost Never, Never

• Completely Satisfied, Very Satisfied, Somewhat Satisfied, Somewhat Dissatisfied,
Very Dissatisfied, Completely Dissatisfied

• Definitely True, True, Don’t Know, False, Definitely False

• None, Very Mild, Mild, Moderate, Severe

Example of a questionnaire with open- and closed-ended questions

Open-ended: How helpful are your fellow students in the event of a health shock?

Closed-ended: My fellow students are helpful in the event of a health shock. Circle one.
1=Definitely agree
2=Agree
3=Disagree
4=Definitely disagree
Closed-ended: In general, how would you describe relations in your workplace between management
and employees?
1=Very good
2=Quite good
3=Neither good nor bad
4=Quite bad
5=Very Bad
Closed-ended: Have you ever attended school?
1=Yes
2=No
Closed-ended: Which of the following books have you read? Circle all that apply.
1=Statistical methods
2=Experimental designs
3=Statistics for managers
4=Clinical trials
5=Introductory statistics

Exercise
Suppose that the National Health Insurance Fund (NHIF) would like to survey 1,500
of its members in Morogoro primarily to determine the percentage of its members
that currently own more than one car.

i. Describe both the population and sample of interest to NHIF.

ii. Describe the type of data that the NHIF primarily wishes to collect.

iii. Develop a first draft of the questionnaire needed by writing a series of five
categorical questions and five numerical questions that you feel would be
appropriate for this survey. Provide an operational definition for each variable.

1.3 Data analysis/presentation

1.3.1 Introduction

Data analysis is a very important but challenging part of any research study. The
challenge is that there is no single statistical analytical technique that is available
for use to analyze every set of collected data. Different techniques do different
things! Therefore, in order for the results of any study/investigation to be useful,
the collected data must be appropriately analyzed taking into consideration the
prime objective of the analysis. That is, asking the question: why is it necessary to
organize the collected data? Alternatively stated: Why do you need to carry out an
analysis of the collected information?
One goal of statistical inference (to be discussed in Chapter Two) is to use sample
information to learn something about the population. Hence, we often carry out an
analysis in order to aid the process of learning something about the characteristic(s)
of the population based on sample information. That is, draw conclusions about the
characteristics of a population. In this chapter we will discuss different techniques
for analysing or presenting data.
Presentation of analysis results (optional)
Once you have finished analysing the data, you have to present the results in a
systematic way. In academic writing (e.g., special projects, master’s dissertation, etc.),
the results of any research have to be presented following an acceptable format. A
report or presentation of the results of an academic research study has to include at
least the following sections:

i. Abstract, in which the researcher/author summarizes the main features of the study
(research problem, methodology used, key findings, conclusion and
recommendation).

ii. Introduction, in which the purpose of the investigation is described. That is,
problem statement, objectives and significance of the study are all included in
this section.

iii. Literature review, in which the state of evidence on the topic under study is
given. It covers both theoretical and empirical literature. Theoretical
literature provides the theoretical underpinning/foundation upon which the problem
under investigation is based. Empirical literature, on the other hand, gives
the investigator an opportunity to understand the current state of evidence on
the topic and thus identify gaps in knowledge (or limitations of previous
empirical studies) that need to be addressed.

iv. Methodology, in which the population studied and the techniques (data
collection and statistical analyses) used are described.

v. Results and Discussion, in which the findings of the investigation are
presented, interpreted and discussed.

vi. Conclusion and Recommendations, in which the essence of the investigation
is summarized and recommendation(s) given.

vii. References, in which the literature cited in the text is given.

viii. Appendices (if any), in which extra texts, tables and diagrams that have
not been included in the main text are presented.

In the results and discussion section of the report, the findings are usually
presented in the form of tables or diagrams whose main features are also described in the
text. Choice of the most appropriate tables and diagrams to summarize the results
is essential, both because they enable the investigator to describe his/her findings
concisely and because they may suggest to him/her some aspects of the data which
he/she has not yet analyzed. Each table or diagram must be designed to show
one or two points, for it is usually better to have a number of simple tables or
figures than a single complex one. That is, tables or diagrams that are too long to
fit on one page are not preferred in a report. Tables and diagrams should
be self-explanatory and comprehensible without reference to the accompanying
text or description. Note further that if a table includes figures such as
percentages, which are derived by calculation from the original observations, they should
be accompanied by sufficient data to enable recalculation of the original numbers.
Percentages, rates, and other derived figures should not be expressed with more
precision, for example to more decimal places, than is compatible with the precision of
the original observations. Each table or figure must be precisely labelled, indicating
the source of the data and the date. Examples of graphical methods of displaying data
include line plots, histograms, bar charts, pie charts, scatter plots, maps, etc. Also
note that in some settings the results and discussion are presented as two
separate sections in the report.
Major goals of data presentation

i. Summarize and display the sample information


ii. Use the information to learn about the underlying population.

For the purpose of illustration we will focus on quantitative data. To be able to
meet the stated goals, three commonly used methods are employed. These are:

i. Classification
ii. Tabulation
iii. Graphic representation.

Classification: The process of grouping together the observed values according to their
common characteristics. A group so determined is called a class. The number of
values falling in a particular group (class) is referred to as the class frequency or
simply frequency.

Before attempting to group the data into classes, one should always start by
creating an array of the data.
Definition: An array is defined as an arrangement of raw numerical data in
ascending or descending order of magnitude.
Objectives:

i. To facilitate comparison, e.g. when groups of people are classified according
to their income, we get an idea of the status of the people with respect to their
income.
ii. To condense the mass of data, i.e. to display the points of similarity and
dissimilarity.
iii. To prepare the basis for further analysis.

Tabulation: The systematic arrangement of data in columns and rows. This sort of
logical arrangement makes the data easy to understand and facilitates comparisons
and further analysis.

Frequency Distributions
A tabular arrangement of data by classes together with the corresponding class fre-
quency is called a frequency distribution or frequency table.

Frequency: the number of times a particular value occurs. For a given class, the
relative frequency is the fraction obtained by dividing the class frequency by the
total frequency.
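
As a small illustration of the definition, the sketch below computes relative frequencies from the class frequencies 1, 2, 7, 7 and 3 (the frequencies of Table 1 later in this section). The variable names are hypothetical.

```python
# Class frequencies taken from Table 1 (heights of 20 male students).
frequencies = [1, 2, 7, 7, 3]

total = sum(frequencies)                              # total frequency
relative = [f / total for f in frequencies]           # class frequency / total frequency
print(total, relative)
```

The relative frequencies always add up to 1 (apart from rounding).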

Each class in a frequency distribution has lower and upper class limits, lower and
upper class boundaries, an interval and a mid-value.

Class Intervals and Class limits


A symbol such as 20 – 30 is called a class interval. The end numbers 20 and
30 are called class limits. The smaller number (20) is the lower class limit, and the
larger number (30) is the upper class limit. A class interval that has either no lower
class limit or no upper class limit is called an open-end class, e.g. the class "90 and
over" is an open class interval.

Class Boundaries
For the class interval 20 – 30, the numbers 19.5 and 30.5 are called the class boundaries
or true class limits: the smaller number (19.5) is the lower class boundary and the
larger number (30.5) is the upper class boundary.
In practice, a class boundary is obtained by adding the upper limit of one class
interval to the lower limit of the next higher class interval and dividing by 2.

The size, or width of a class interval


The size or width, of a class interval is the difference between the lower and upper
class boundaries and it is also referred to as the class width or class size.

The class mark (mid-value)


The class mark is the midpoint of the class interval and is obtained by adding the
lower and upper class limits and dividing by 2. Thus, the class mark (mid-value) of
the interval 20 – 30 is (20 + 30)/2 = 25.
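
The definitions of class boundaries, class width and class mark can be sketched in Python. The function name `class_summary` is hypothetical, and the sketch assumes that the gap between consecutive classes is the same throughout; the class limits used are those of Table 1 below.

```python
def class_summary(limits):
    """Compute boundaries, mark and width for each (lower, upper) class limit pair."""
    # Half of the gap between the upper limit of one class and the
    # lower limit of the next (assumed to be the same for all classes).
    half_gap = (limits[1][0] - limits[0][1]) / 2
    rows = []
    for lower, upper in limits:
        lower_b, upper_b = lower - half_gap, upper + half_gap
        rows.append({
            "boundaries": (lower_b, upper_b),
            "mark": (lower + upper) / 2,      # midpoint of the class limits
            "width": upper_b - lower_b,       # difference between the class boundaries
        })
    return rows

limits = [(32, 35), (36, 39), (40, 43), (44, 47), (48, 51)]
summary = class_summary(limits)
print(summary[0])
```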

Example of a frequency distribution table
Table 1: Heights of 20 Male Students
Classes    Class Boundaries    Mid-Points    Frequency
32 - 35    31.5 - 35.5         33.5          1
36 - 39    35.5 - 39.5         37.5          2
40 - 43    39.5 - 43.5         41.5          7
44 - 47    43.5 - 47.5         45.5          7
48 - 51    47.5 - 51.5         49.5          3

The first class consists of heights from 32 to 35 and is indicated by the range symbol
32-35. Only 1 male student belongs to this class, so the corresponding class frequency
is 1, as indicated in the last column of Table 1.

General rules for forming a frequency distribution table


Strictly speaking there is no specific right method for constructing a frequency table.
The choice of how to partition the range of data values is flexible depending on the
nature of the data. For simplicity, some general guidelines are adopted:

i. Determine the largest and smallest numbers in the raw data and thus find the
range r (the difference between the largest and smallest numbers).

ii. Determine the number of classes (m) of your frequency distribution. There is
no fixed rule for determining the number of classes, but the size of the data n
suggests an appropriate number. Usually m lies between 5 and 15.

iii. Alternatively, use the empirical formula m = 1 + 3.3 log n (logarithm to base 10),
which is rarely used in practice. If the result is a fraction, the next higher whole
number is taken as m.

iv. Determine c or h, the uniform size of the class intervals, by using the
relationship h = r/m.

v. Determine the number of observations falling into each class interval, that is, the
class frequencies. This is best done using a tally or score sheet.
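
The guidelines above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names sturges_classes and frequency_table are hypothetical, the logarithm is taken to base 10, the first class is assumed to start at the smallest observation, and the largest observation is placed in the last class. The data must contain at least two distinct values.

```python
import math

def sturges_classes(n):
    """Empirical rule m = 1 + 3.3 * log10(n), rounded up to a whole number."""
    return math.ceil(1 + 3.3 * math.log10(n))

def frequency_table(data, m=None):
    """Return (lower, upper, frequency) triples for m equal-width classes."""
    smallest, largest = min(data), max(data)
    r = largest - smallest               # step i: the range
    if m is None:
        m = sturges_classes(len(data))   # steps ii-iii: number of classes
    h = r / m                            # step iv: class width
    counts = [0] * m
    for y in data:                       # step v: tally the observations
        # Assumption: the largest value is counted in the last class.
        k = min(int((y - smallest) / h), m - 1)
        counts[k] += 1
    return [(smallest + k * h, smallest + (k + 1) * h, counts[k])
            for k in range(m)]

# Hypothetical data, used only to show the call.
for lower, upper, freq in frequency_table([1, 2, 2, 3, 5, 7, 8, 9, 10, 11], m=5):
    print(lower, upper, freq)
```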

Graphical Presentation of Data


The goal of graphical display of data is to provide a visual impression of the char-
acteristics of the data from a sample. The hope is that the characteristics of the
sample are a likely indication of the characteristics of the population from which the
sample was drawn.
The most commonly used graphical methods include:

i. Histogram

ii. Frequency polygon

iii. Frequency curve

iv. Ogives or cumulative frequency curves

Examples of graphical presentation of data


[Figure: Histogram. Horizontal axis: Source of Information (1: Health Facilitators, 2: Health Workers, 3: Friends and Family, 4: Drama Groups, 5: Radio, 6: Other); vertical axis scaled from 0 to 60.]

[Figure: Line plot. Estimated Probability (0.0 to 1.0) against Age group (<25, 25-32, 33-40, 41-48, 49-56, 57-64, 65+) for Males and Females.]

[Figure: Pie chart.]

Histogram: A histogram displays the distribution of the data visually by representing
the frequency or likelihood of values in the sample by area. It consists of a
set of adjacent rectangles having:

(i) Bases on the horizontal axis, with centres at the class marks and lengths equal
to the class interval sizes.
(ii) Areas proportional to the class frequencies.

Frequency polygon: A graph obtained by connecting the midpoints of the tops
of the rectangles in the histogram. The points are joined by straight lines. Extra
midpoints/class marks are introduced at both ends on the x-axis so as to get the
required frequency polygon. The extra class marks are given zero frequencies.

Table 2. Heights of 100 male students
Height (in)
Class interval Class mark Frequency
60 - 62 61 5
63 - 65 64 18
66 - 68 67 42
69 - 71 70 27
72 - 74 73 8

Example: constructing a histogram


Recap: A histogram represents the information in the frequency table (frequency
distribution) graphically. It gives a pictorial display of how the data values are
distributed.
A picture (histogram) of the distribution of the data values gives an idea of how likely
they were to appear in the sample. Hopefully, if the sample is truly representative
this gives an indication of the way they are likely to occur in the entire population.
Let us now construct a histogram for the data presented in Table 2. See Figure 1
below.
Figure 1. Histogram of heights of 100 male students.

Instead of using frequencies as in Figure 1 to draw a histogram, one can also use
proportions (relative frequencies) as shown in Figure 2.

Note: Relative frequency = frequency / (total number of observations in the sample)
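
As a small illustration, the relative frequencies for Table 2 can be computed with the Python sketch below. The variable names are illustrative, and the last class is assumed to be 72 - 74, as in the later exercise table for the same data.

```python
# Class frequencies of Table 2 (heights of 100 male students).
frequencies = {"60-62": 5, "63-65": 18, "66-68": 42, "69-71": 27, "72-74": 8}

n = sum(frequencies.values())   # total number of observations in the sample

# Relative frequency = frequency / total number of observations.
relative = {cls: f / n for cls, f in frequencies.items()}

for cls, rel in relative.items():
    print(cls, rel)
```

The relative frequencies always add up to 1, which gives a quick check on the table.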

The curve drawn over the histogram is the normal curve of the sample
Figure 2. Histogram of Heights of 100 male students.

Exercise: Construct a histogram and a frequency polygon for the diameter
data (footnote 2) of the cherry trees presented in Table 3, where the diameters are
listed in ascending order of magnitude.
Hint: First construct a frequency table for the diameter data.
Table 3: Data on 31 cherry trees
Diameter (in) Height (ft) Volume (cu ft)
8.3 70 10.3
8.6 65 10.3
8.8 63 10.2
10.5 72 16.4
10.7 81 18.8
10.8 83 19.7
11.0 66 15.6
11.0 75 18.2
11.1 80 22.6
11.2 75 19.9
11.3 79 24.2
11.4 76 21.0
11.4 76 21.4
11.7 69 21.3
12.0 75 19.1
12.9 74 22.2
12.9 85 33.8
13.3 86 27.4
13.7 71 25.7
13.8 64 24.9
14.0 78 34.5
14.2 80 31.7
14.5 74 36.3
16.0 72 38.3
16.3 77 42.6
17.3 81 55.4
17.5 82 55.7
17.9 80 58.3
18.0 80 51.5
18.0 80 51.0
20.6 87 77

Figure 3: Frequency polygon of the frequency distribution of Table 2


Frequency curve: A free hand curve drawn through the vertices of a frequency
polygon.

2 Source: Davidian, M. (1998) Experimental Statistics for Biological Sciences

Figure 4: Frequency curve of the frequency distribution of Table 2

Exercise: Draw a frequency curve of the frequency distribution of Table 3. Consider
only the diameter data.

Cumulative frequency curves (Ogives): The total frequency of all values less
than the upper class boundary of a given class interval is known as the cumulative
frequency up to and including that class interval. For instance, the cumulative
frequency up to and including the class interval 69-71 in Table 2 is 5 + 18 + 42 + 27 = 92.

Cumulative frequencies are of two types: less than and more than cumulative
frequencies. Consequently, there are two types of ogives: less than ogives and more than
ogives. A less than ogive is an increasing curve, sloping upward from left to right,
whereas a more than ogive is a decreasing curve, sloping downward from left to
right.

Definition: A cumulative frequency curve is a free hand graph obtained by plotting
the cumulative frequencies of the classes against the class boundaries. A table of
cumulative frequencies is known as a cumulative frequency table. Table 4 below
is a less than cumulative frequency table, and the corresponding curve is
presented in Figure 5.
Table 4: Less than cumulative frequency table
Height (in) No of Students
Less than 59.5 0
Less than 62.5 5
Less than 65.5 23
Less than 68.5 65
Less than 71.5 92
Less than 74.5 100

Figure 5: Less than cumulative frequency curve (ogive) of Table 4

Table 5: More than cumulative frequency table


Height (in) No of Students
More than 59.5 100
More than 62.5 95
More than 65.5 77
More than 68.5 35
More than 71.5 8
More than 74.5 0

Figure 6: More than cumulative frequency curve (ogive) of Table 5
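
The less than and more than cumulative frequency tables can be reproduced from the class frequencies of Table 2 with a short Python sketch. The variable names are illustrative, and the class boundaries are assumed to be 59.5, 62.5, 65.5, 68.5, 71.5 and 74.5.

```python
from itertools import accumulate

boundaries = [59.5, 62.5, 65.5, 68.5, 71.5, 74.5]   # class boundaries
frequencies = [5, 18, 42, 27, 8]                    # class frequencies
n = sum(frequencies)

# Less than: running total of the frequencies below each boundary.
less_than = [0] + list(accumulate(frequencies))

# More than: whatever is not below the boundary lies above it.
more_than = [n - c for c in less_than]

for b, lt, mt in zip(boundaries, less_than, more_than):
    print(b, lt, mt)
```

Plotting less_than (or more_than) against boundaries gives the points through which the corresponding ogive is drawn.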

Exercise

1. The class boundaries and frequencies of a certain frequency distribution are as
follows:

Income (in $) 34.5-39.5 39.5-44.5 44.5-49.5 49.5-54.5 54.5-59.5
Frequency 17 25 35 23 16

Draw the less than and more than ogives for the above frequency table.

2. Consider the following frequency distribution:

Marks 10-20 20-30 30-40 40-50 50-60 60-70 70-80 80-90
Students 5 12 23 29 24 19 10 2

Construct the less than and more than cumulative frequency tables.

1.3.2 Measures of central tendency (averages)

We have seen that a histogram provides an overall visual impression of the character
of the data. From it we get a sense of the center of the data and how spread
out they are. However, it is desirable to summarise the notions of center and spread
quantitatively.
Objective: Describe a large body of quantitative data by a single value, which the
mind can grasp easily and quickly. Thus, by quantifying center and spread for a
sample we hope to get an idea of these same notions for the population from which
the sample was drawn.
Definition: An average is defined as a single value that is intended to represent the
distribution as a whole.

A typical value should have the values of the distribution clustered about it. That
is, it must have a central tendency.
Notation: The following notations are adopted.

n = size of sample
Y = the variable of interest
Y1, Y2, ..., Yn = observations on the variable for the sample
Σ = symbol for summation

Desirable Qualities of an Average


An ideal average should have the following qualities:
(i) An average should be rigidly defined
(ii) It should not be affected by extreme observations.
(iii) It should be based on all values of the given data
(iv) It should be suitable for further mathematical treatment
(v) It should be easy to understand and calculate
(vi) It should be affected as little as possible by fluctuation of sampling (i.e. it should
possess sampling stability).

Types of Averages
The most commonly used averages are:
(i) Arithmetic mean or simply mean
(ii) Median
(iii) Mode
(iv) Geometric mean
(v) Harmonic Mean

The arithmetic mean: The arithmetic mean, or briefly the mean, of a set of n
numbers Y1, Y2, ..., Yn is denoted by Ȳ (read “Y bar”) and is defined as

\[ \bar{Y} = \frac{Y_1 + Y_2 + \cdots + Y_n}{n} = \text{sample mean} \]

or, in summation notation,

\[ \bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i \qquad (3) \]

That is, the sample mean is the average of the sample values. The notation “Ȳ”
is standard; the “bar” indicates the “averaging” operation being performed. The
population mean is the corresponding quantity for the population. We will use the
Greek symbol µ to denote the population mean. Thus, µ is the mean of all values
for the variable for the population.

Example 1
If 32, 51, 60, 48 and 46 are the marks of five students in MB 101 test, then the
variable “marks” is denoted by Y and the five values are denoted by Y1 , Y2 , Y3 , Y4
and Y5. The arithmetic mean is given by:

\[ \bar{Y} = \frac{32 + 51 + 60 + 48 + 46}{5} = \frac{237}{5} = 47.4 \]

Example 2
Consider the following tuition fees (in million Tsh.) charged by six different
universities in the country to complete a three-year degree programme: 10.3, 4.9, 8.9, 11.7,
6.3, 7.7. Compute the mean.
Solution:

\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{10.3 + 4.9 + 8.9 + 11.7 + 6.3 + 7.7}{6} = \frac{49.8}{6} = 8.30 \text{ million Tsh.} \]

Note: calculation of the arithmetic mean is based on all the observations in the set
of data, thus it is affected by extreme (largest and smallest) values in the data set.
That is, it is not a desirable measure when there are extreme values.
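
The two examples, and the sensitivity of the mean to an extreme value, can be checked with the Python sketch below. The function name arithmetic_mean is illustrative, and the extra fee of 100.0 is a hypothetical value added only to show the effect of an outlier.

```python
def arithmetic_mean(values):
    """Sum of the observations divided by their number."""
    return sum(values) / len(values)

marks = [32, 51, 60, 48, 46]                 # Example 1
fees = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]      # Example 2 (million Tsh.)

mean_marks = arithmetic_mean(marks)
mean_fees = arithmetic_mean(fees)

# Hypothetical extreme value: one very expensive programme.
mean_with_outlier = arithmetic_mean(fees + [100.0])

print(mean_marks, round(mean_fees, 2), round(mean_with_outlier, 2))
```

With the single extreme value included, the mean exceeds every one of the original six fees, so it no longer represents the bulk of the data.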

Exercise
Find the arithmetic mean of the numbers 4.1, 2.9, 4.1, 3.8, 2.9, 3.4

Exercise
Find the arithmetic mean of the numbers 23.4, 15.6, 22.1, 20.0, 26.7, 31.4, 18.9, 22.3.

The expression for Ȳ given in equation (3) above is suitable for individual
measurements (i.e. ungrouped data). If the numbers Y1, Y2, ..., Yn occur f1, f2, ..., fn
times, respectively (i.e. occur with frequencies f1, f2, ..., fn), the arithmetic mean
is:

\[ \bar{Y} = \frac{f_1 Y_1 + f_2 Y_2 + \cdots + f_n Y_n}{f_1 + f_2 + \cdots + f_n} = \frac{\sum f_i Y_i}{\sum f_i} \]

where Σ fi is the total frequency (i.e. the total number of cases).

Example 1
If 5, 8, 6 and 2 occur with frequencies 3, 2, 4, and 1, respectively, the arithmetic
mean is

\[ \bar{Y} = \frac{3 \times 5 + 2 \times 8 + 4 \times 6 + 1 \times 2}{3 + 2 + 4 + 1} = \frac{15 + 16 + 24 + 2}{10} = 5.7 \]

Example 2
Find the arithmetic mean of the numbers 4.1, 2.9, 3.8 and 3.4 which occur with
frequencies 2, 2, 1, and 1, respectively.

Solution
The arithmetic mean is

\[ \bar{X} = \frac{f_1 X_1 + f_2 X_2 + \cdots + f_n X_n}{f_1 + f_2 + \cdots + f_n} = \frac{2 \times 4.1 + 2 \times 2.9 + 1 \times 3.8 + 1 \times 3.4}{2 + 2 + 1 + 1} = \frac{8.2 + 5.8 + 3.8 + 3.4}{6} = \frac{21.2}{6} = 3.533 \]
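
The frequency form of the mean can be sketched in Python as follows. The function name frequency_mean is illustrative and not part of the notes.

```python
def frequency_mean(values, freqs):
    """Arithmetic mean of values that occur with the given frequencies."""
    total = sum(freqs)                                    # sum of f_i
    weighted_sum = sum(f * y for y, f in zip(values, freqs))
    return weighted_sum / total

example_1 = frequency_mean([5, 8, 6, 2], [3, 2, 4, 1])
example_2 = frequency_mean([4.1, 2.9, 3.8, 3.4], [2, 2, 1, 1])

print(round(example_1, 3), round(example_2, 3))
```

The same function gives the weighted arithmetic mean if the frequencies are replaced by weights.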

Exercise 1
Suppose that a final examination score in MTH 106 is weighted 2.5 times as much
as a test and a student has a final examination score of 75 and test scores of 80 and
95. Find the mean score.

Exercise 2
One hundred students were administered a 40-item test concerning knowledge about
causes of mental illness in teenagers. The data obtained are summarized in the
following Table 1. Compute the arithmetic mean.
Table 1: Test scores concerning knowledge about causes of mental illness
Class interval Frequency
6-7 10
8-9 6
10-11 11
12-13 10
14-15 25
16-17 16
18-19 8
20-21 14

Short method of calculating the arithmetic mean


To facilitate calculation of the arithmetic mean, some short methods have been
developed. These methods may involve a change of origin (i.e. adding or subtracting
a constant from each observation) or a change of unit/scale (i.e. multiplying or dividing
each observation by a constant). Here we focus on the method of assumed mean as
described below.
The method of assumed mean/provisional mean
With this method we identify an assumed mean denoted by A; and then we create
a new variable Di , which is the deviation of each observation Yi from the assumed
mean:
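i.e. Di = Yi – A ⇒ Yi = Di + A
Then by definition,

\[ \bar{Y} = \frac{\sum f_i Y_i}{\sum f_i} \]

On replacing Yi by A + Di we obtain

\[ \bar{Y} = \frac{\sum f_i (D_i + A)}{\sum f_i} = \frac{\sum f_i D_i + A \sum f_i}{\sum f_i} = \frac{A \sum f_i}{\sum f_i} + \frac{\sum f_i D_i}{\sum f_i} = A + \frac{\sum f_i D_i}{n}, \quad \text{because } \sum f_i = n. \]

Hence

\[ \bar{Y} = A + \frac{\sum f_i D_i}{n} \quad \text{or} \quad \bar{Y} = A + \bar{D} \]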

This method is applicable for both grouped and ungrouped data.


(i) In case of ungrouped data Yi represents the ith observation and fi will be the
same as 1.
(ii) In case of grouped data (frequency distribution) Yi will be the ith class mark
and fi its corresponding frequency.

Example
Find the arithmetic mean of the following set of observations 19, 21, 32, 16, 17, 15,
22, 34, and 51 by using the short method. Choose A = 34.

Yi            19   21   32   16   17   15   22   34   51   Total
Di = Yi - A  -15  -13   -2  -18  -17  -19  -12    0   17    -79

Here n = 9 and fi = 1 for each value, so

\[ \bar{Y} = A + \frac{\sum f_i D_i}{n} = 34 + \frac{(-79)}{9} = 34 - 8.78 = 25.22 \]

Selection of “A”
To choose A for raw (ungrouped) data, you can form an array of the observations
and pick A as the value that lies approximately at the centre. For grouped data,
select A as the class mark of a class that lies in the region of high frequency.
Specifically, A is usually taken as the class mark of the class with the highest frequency in
a given frequency distribution.
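
A minimal Python sketch of the method of assumed mean is given below. The function name assumed_mean is illustrative. The check at the end shows that the result does not depend on the choice of A, which only affects how convenient the hand calculation is.

```python
def assumed_mean(values, A, freqs=None):
    """Mean computed as A + (sum of f_i * D_i) / n, where D_i = Y_i - A."""
    if freqs is None:
        freqs = [1] * len(values)        # ungrouped data: every f_i is 1
    n = sum(freqs)
    deviations = [y - A for y in values]
    return A + sum(f * d for f, d in zip(freqs, deviations)) / n

data = [19, 21, 32, 16, 17, 15, 22, 34, 51]

mean_a34 = assumed_mean(data, A=34)      # the choice made in the example
mean_a22 = assumed_mean(data, A=22)      # a different, hypothetical choice
direct = sum(data) / len(data)           # ordinary definition of the mean

print(round(mean_a34, 2), round(mean_a22, 2), round(direct, 2))
```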

The weighted arithmetic mean
Sometimes we associate with the numbers Y1, Y2, ..., Yn certain weighting factors
(or weights) w1, w2, ..., wn depending on the significance or importance attached to
the numbers. In this case the arithmetic mean

\[ \bar{Y} = \frac{w_1 Y_1 + w_2 Y_2 + \cdots + w_n Y_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum w_i Y_i}{\sum w_i} \]

is called the weighted arithmetic mean.

Example
If a final examination score in a course is weighted 3 times as much as a quiz and
a student has a final examination grade of 85 and quiz grades of 70 and 90, the mean
grade is

\[ \bar{Y} = \frac{w_t Y_t + w_{q1} Y_{q1} + w_{q2} Y_{q2}}{w_t + w_{q1} + w_{q2}}, \quad \text{with } w_t = 3 \text{ and } w_{q1} = w_{q2} = 1 \]

\[ \bar{Y} = \frac{3 \times 85 + 1 \times 70 + 1 \times 90}{3 + 1 + 1} = \frac{415}{5} = 83. \]

Exercise
A sample of 5 households is selected with equal probability (epsem) from a list of
250 households. One adult is selected at random in each sampled household (HH).
The monthly income (yij ) and the level of education (zij =1, if secondary or higher;
=0 otherwise) of the jth sampled adult in the ith household are recorded. Let Mi
denote the number of adults in the household. Assume that the data obtained from
the single sampled adult for each household in the first-stage sample of households
are as given in the table below where wi denote the overall weight (inverse of overall
probability of selection) of a sampled adult.

Sampled HH Mi wi yij zij wi × yij wi × zij wi × zij × yij


1 3 150 70 1 10,500 150 10,500
2 1 50 30 0 1,500 0 0
3 3 150 90 1 13,500 150 13,500
4 5 250 50 1 12,500 250 12,500
5 4 200 60 0 12,000 0 0
Total 16 800 300 3 50,000 550 36,500

Use the information given in the above table to estimate the following characteristics:

i. Weighted and unweighted average monthly income. Comment on the results.

ii. Weighted and unweighted proportion of people with secondary or higher edu-
cation. Comment on the results.

iii. Weighted and unweighted total number of people with secondary or higher
education. Comment on the results.

iv. Weighted and unweighted mean monthly income of adults with secondary or
higher education. Comment on the results.

Arithmetic mean for a combined (pooled) group


Let Ȳ1, Ȳ2, ..., Ȳk be the arithmetic means of k groups having n1, n2, ..., nk observations
respectively. If we combine the k groups into a single group, then the
arithmetic mean of the combined group is:

\[ \bar{Y}_c = \frac{n_1 \bar{Y}_1 + n_2 \bar{Y}_2 + \cdots + n_k \bar{Y}_k}{n_1 + n_2 + \cdots + n_k} \]

Example
In a factory, 120 workers get an average of $ 30 a day, 160 workers get $ 50 a day, 80
workers get $ 60 a day and 40 workers get $ 80 a day. Find the combined arithmetic
mean of all workers.

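\[ \bar{Y}_c = \frac{n_1 \bar{Y}_1 + n_2 \bar{Y}_2 + n_3 \bar{Y}_3 + n_4 \bar{Y}_4}{n_1 + n_2 + n_3 + n_4} = \frac{120 \times 30 + 160 \times 50 + 80 \times 60 + 40 \times 80}{120 + 160 + 80 + 40} = \frac{19600}{400} = 49 \]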

Properties of the arithmetic mean


(i) The algebraic sum of the deviations of a set of numbers from their arithmetic
mean is zero, i.e. $\sum_{i=1}^{n} (Y_i - \bar{Y}) = 0$ (for ungrouped data) and
$\sum f_i (Y_i - \bar{Y}) = 0$ (for grouped data).

(ii) The sum of the squares of the deviations of a set of numbers Yi from any number
a is a minimum if and only if a = Ȳ, i.e. $\sum (Y_i - \bar{Y})^2$ is less than $\sum (Y_i - A)^2$,
where A is any other value. For a frequency distribution, $\sum f_i (Y_i - \bar{Y})^2$ is less than
$\sum f_i (Y_i - A)^2$.
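
Both properties follow directly from the definition of the mean. The short derivation below is added as a worked illustration for ungrouped data; the symbol a denotes an arbitrary constant, as in property (ii).

```latex
% Property (i): use \sum Y_i = n \bar{Y}.
\sum_{i=1}^{n} (Y_i - \bar{Y}) = \sum_{i=1}^{n} Y_i - n\bar{Y} = n\bar{Y} - n\bar{Y} = 0

% Property (ii): write Y_i - a = (Y_i - \bar{Y}) + (\bar{Y} - a) and expand.
% The cross term vanishes by property (i).
\sum_{i=1}^{n} (Y_i - a)^2
  = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 + 2(\bar{Y} - a)\sum_{i=1}^{n} (Y_i - \bar{Y}) + n(\bar{Y} - a)^2
  = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 + n(\bar{Y} - a)^2
  \ge \sum_{i=1}^{n} (Y_i - \bar{Y})^2
```

Equality holds only when a = Ȳ, which is the statement of property (ii).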

Advantages and disadvantages of the arithmetic mean

Advantages
(i) It is rigidly defined
(ii) It is easy to calculate and understand
(iii) It is based on all the observations
(iv) It is suitable for further mathematical treatment
(v) Compared to the other averages, arithmetic mean is affected least by fluctuation
of sampling.

Disadvantages
(i) The severe drawback of the arithmetic mean is that it is unduly affected by extreme
values of the data.
(ii) Arithmetic mean cannot be used in case of open-end classes.
(iii) Arithmetic mean cannot be obtained if a single observation is missing.

The median (X̃)


The median of a set of numbers arranged in order of magnitude (i.e., in an array) is
either the middle value or the arithmetic mean of the two middle values.

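For a given set of n observations arranged in order of their magnitude, the median
is defined as:

(1) the value of the $\left(\frac{n+1}{2}\right)^{\text{th}}$ observation if n is odd, and

(2) $\frac{1}{2}\left[\left(\frac{n}{2}\right)^{\text{th}} \text{obs.} + \left(\frac{n+2}{2}\right)^{\text{th}} \text{obs.}\right]$ if n is even.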

Example 1 Consider the tuition fee data set and compute the median.

Solution
Ordered array of the data in ascending order of magnitude is:
4.9, 6.3, 7.7, 8.9, 10.3, 11.7

Since n (= 6) is even, by rule 2 the median is the mean of the two middle numbers
(7.7 and 8.9). That is, median = (7.7 + 8.9)/2 = 16.6/2 = 8.30 million Tsh.

Example 2
Consider the data set 24.1, 22.6, 27.0, 19.8, 21.5, 23.7, 22.6 Compute the median.

Solution
Ordered array of the data in ascending order of magnitude is:
19.8, 21.5, 22.6, 22.6, 23.7, 24.1, 27.0

Since n (=7) is odd then, by rule 1, median is the middle number (22.6).
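
The two rules can be written as a short Python function. This is a sketch; the name median_of is illustrative, and Python lists are indexed from 0, so the kth observation of the ordered array is ordered[k - 1].

```python
def median_of(values):
    """Median of ungrouped data using the odd and even rules."""
    ordered = sorted(values)             # arrange in order of magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Rule 1 (n odd): the ((n + 1) / 2)th observation.
        return ordered[mid]
    # Rule 2 (n even): mean of the (n / 2)th and ((n + 2) / 2)th observations.
    return (ordered[mid - 1] + ordered[mid]) / 2

fees = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]              # Example 1
data = [24.1, 22.6, 27.0, 19.8, 21.5, 23.7, 22.6]    # Example 2

print(round(median_of(fees), 2), median_of(data))
```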
Exercise
Find the median of the following values:
(i) 3, 5, 6, 8, 11.
(ii) 3, 5, 6, 8
Exercise
Find the median of the following set of observations:
(i) 45, 38, -40, 27, -29, 40, 57, 56, -33.
(ii) 19, 32, 15, 22, 20, 17, 9, 12.

Exercise Find the median of the numbers 0, -4, -1, 0, -2, -1, 0

Median for grouped data (frequency distributions)


In a frequency distribution, the median is conventionally taken as the value of the
(n/2)th observation, where the frequencies are assumed to be uniformly distributed
within classes. In other words, the median is the value of x that divides the
histogram into two equal parts (areas). On the cumulative frequency curve, it is the
value corresponding to n/2 on the ordinate.
In practice, the median of a frequency distribution is obtained by linear
interpolation within the median class:

\[ \tilde{X} = L_1 + \frac{h}{f_{\text{median}}} \left( \frac{n}{2} - c.f. \right) \]

where:
L1 is the lower class boundary of the median class
h is the size of the median class interval
n is the total number of observations
fmedian is the frequency of the median class
c.f. is the sum of frequencies of all classes lower than the median class.

Example
Consider the frequency distribution in Table 2 below and compute the median.
Solution

\[ \text{Median, } \tilde{X} = L_1 + \frac{h}{f_{\text{median}}} \left( \frac{n}{2} - c.f. \right) \]

Table 2: Frequency distribution table of 100 test scores


Class Class Class Cumulative
interval boundaries mark Frequency frequency
6-7 5.5-7.5 6.5 10 10
8-9 7.5-9.5 8.5 6 16
10-11 9.5-11.5 10.5 11 27
12-13 11.5-13.5 12.5 10 37
14-15 13.5-15.5 14.5 25 62
16-17 15.5-17.5 16.5 16 78
18-19 17.5-19.5 18.5 8 86
20-21 19.5-21.5 20.5 14 100

From Table 2, the n/2 = 100/2 = 50th observation falls in the class interval 14-15,
which is therefore the median class. Hence L1 = 13.5, h = 2, fmedian = 25 and
c.f. = 37. Substituting these values in the formula above we obtain

\[ \text{Median} = 13.5 + \frac{2}{25} (50 - 37) = 13.5 + 1.04 = 14.54 \text{ units} \]
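
The interpolation can be sketched in Python as below. The function name grouped_median is illustrative, and the classes are assumed to be given in ascending order as (lower boundary, upper boundary) pairs.

```python
def grouped_median(boundaries, freqs):
    """Median of a frequency distribution by linear interpolation."""
    n = sum(freqs)
    cf = 0                                   # cumulative frequency below the class
    for (lower, upper), f in zip(boundaries, freqs):
        if cf + f >= n / 2:                  # the median class has been reached
            h = upper - lower                # size of the median class
            return lower + (h / f) * (n / 2 - cf)
        cf += f
    return None                              # only reached for an empty table

# Class boundaries and frequencies of the 100 test scores.
boundaries = [(5.5, 7.5), (7.5, 9.5), (9.5, 11.5), (11.5, 13.5),
              (13.5, 15.5), (15.5, 17.5), (17.5, 19.5), (19.5, 21.5)]
freqs = [10, 6, 11, 10, 25, 16, 8, 14]

median_scores = grouped_median(boundaries, freqs)
print(round(median_scores, 2))
```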

Exercise
Consider the following frequency distribution table and compute the median.

Height (in) Class Bound. Class mark Freq. Cum. Freq.


60 - 62 59.5 - 62.5 61 5 5
63 - 65 62.5 - 65.5 64 18 23
66 - 68 65.5 - 68.5 67 42 65
69 - 71 68.5 - 71.5 70 27 92
72 - 74 71.5 - 74.5 73 8 100

Advantages and disadvantages of median

Advantages
(i) It is rigidly defined
(ii) It is easy to understand and calculate
(iii) It is not affected at all by extreme values. Hence, it is a better average than
arithmetic mean when extreme observations are present
(iv) Can be computed for a distribution with open-end classes.
(v) The values of median can be obtained graphically.

Disadvantages
(i) It is not based on all values
(ii) Not suitable for further mathematical treatment
(iii) Compared to mean, median is more affected by fluctuation of sampling
(iv) In case of ungrouped data rearrangement of the values in the order of magnitude
becomes necessary

The mode (X̂)


The mode of a set of numbers is that value which occurs with the greatest frequency,
that is, it is the most common value. The mode may not exist, and even if it does
exist it may not be unique.

Example 1
Consider the tuition fee data set and find the mode.

Solution
Ordered array of the data in ascending order of magnitude is: 4.9, 6.3, 7.7, 8.9, 10.3,
11.7
Since no tuition fee occurs more often than the others, there is no mode.

Example 2
Consider the data set 24.1, 22.6, 27.0, 19.8, 21.5, 23.7, 22.6 Find the mode.

Solution
Ordered array of the data in ascending order of magnitude is:

19.8, 21.5, 22.6, 22.6, 23.7, 24.1, 27.0

Mode=22.6

Since there is only one mode, the data set is described as unimodal. A data set with two
modes is called bimodal, and one with more than two modes is described as multimodal.

Exercise
1. Find the mode of the numbers 0, -4, -1, 0, -2, -1, 0

2. Find the mode(s) for each of the following set of observations:


(i) 2, 2, 5, 7, 9, 9, 9, 10, 10, 11, 12, 18, 9.
(ii) 3, 5, 8, 10, 12, 15, 16.
(iii) 2, 3, 4, 4, 4, 5, 7, 7, 7, 9.

Grouped data (frequency distribution)


For grouped data, it is usually convenient to take the class mark of the class with the
highest frequency as the mode of the distribution. This class is technically known as the
modal class.
A more refined value can be obtained from a frequency distribution or histogram using
the formula

\[ \text{Mode } (\hat{X}) = L_1 + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times c \]

where:
L1 = lower class boundary of the modal class (i.e. the class with highest frequency).
∆1 = excess of modal frequency over frequency of next lower class.
∆2 = excess of modal frequency over frequency of next higher class.
c = size of the modal class interval (also written h).

Example
Consider the following frequency distribution table and compute the mode.

Frequency distribution table of 100 test scores


Class Class Class Cumulative
interval boundaries mark Frequency frequency
6-7 5.5-7.5 6.5 10 10
8-9 7.5-9.5 8.5 6 16
10-11 9.5-11.5 10.5 11 27
12-13 11.5-13.5 12.5 10 37
14-15 13.5-15.5 14.5 25 62
16-17 15.5-17.5 16.5 16 78
18-19 17.5-19.5 18.5 8 86
20-21 19.5-21.5 20.5 14 100

Solution

\[ \text{Mode, } \hat{X} = L_1 + \frac{\Delta_1}{\Delta_1 + \Delta_2} \, h \]

From the table above, the modal class (class with the highest frequency) is 14-15. Therefore,
L1 = 13.5, ∆1 = 25 − 10 = 15, ∆2 = 25 − 16 = 9 and h = 2. Substituting these values in the
formula we obtain

\[ \text{Mode} = 13.5 + \frac{15}{15 + 9} \times 2 = 13.5 + \frac{15}{24} \times 2 = 13.5 + 1.25 = 14.75 \text{ units} \]
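
A Python sketch of the same calculation follows. The function name grouped_mode is illustrative. Two assumptions are made that the notes do not state: if several classes share the highest frequency, the first one is used, and a missing neighbouring class (at either end of the table) is treated as having frequency 0.

```python
def grouped_mode(boundaries, freqs):
    """Mode of a frequency distribution: L1 + d1 / (d1 + d2) * h."""
    # Index of the modal class (first class with the highest frequency).
    k = max(range(len(freqs)), key=lambda i: freqs[i])
    lower, upper = boundaries[k]
    # Assumption: a neighbour outside the table has frequency 0.
    prev_f = freqs[k - 1] if k > 0 else 0
    next_f = freqs[k + 1] if k < len(freqs) - 1 else 0
    d1 = freqs[k] - prev_f       # excess over the next lower class
    d2 = freqs[k] - next_f       # excess over the next higher class
    return lower + d1 / (d1 + d2) * (upper - lower)

boundaries = [(5.5, 7.5), (7.5, 9.5), (9.5, 11.5), (11.5, 13.5),
              (13.5, 15.5), (15.5, 17.5), (17.5, 19.5), (19.5, 21.5)]
freqs = [10, 6, 11, 10, 25, 16, 8, 14]

mode_scores = grouped_mode(boundaries, freqs)
print(mode_scores)
```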

Graphical location of mode.


For a unimodal frequency distribution, mode can be located graphically from the
histogram. The procedure has the following steps:
(i) Draw a histogram to the given frequency distribution.
(ii) Join the top right corner of the tallest rectangle with the top right corner of the
preceding rectangle by means of a straight line.
(iii) Join the top left corner of the tallest rectangle with the top left corner of the
succeeding rectangle by means of a straight line.
(iv) From the point of intersection of the lines drawn, draw a perpendicular to the
x-axis (class mark). The foot of this perpendicular is the modal value.

Exercise
Determine the modal value of the above table graphically and compare the value
obtained with that obtained in the example above using the formula

\[ \hat{X} = L_1 + \frac{\Delta_1}{\Delta_1 + \Delta_2} \, c \]

Advantages and disadvantages of mode

Advantages
(i) Mode is easy to understand and calculate:

• In some cases it can be located merely by inspection.

• The value of the mode can be obtained graphically from the histogram.

(ii) It can be calculated for frequency distributions with open-end classes.

Disadvantages
(i) Mode is not rigidly defined
(ii) It is not based on all the values of the data
(iii) Mode is not suitable for further mathematical treatment
(iv) As compared with mean, mode is affected to a greater extent by the fluctuations
of sampling.

The empirical relation between the mean, median, and mode


For unimodal frequency curves that are moderately skewed (asymmetrical), we have
the empirical relation:
Mean – Mode = 3 (Mean – Median). For symmetrical curves, the mean, mode, and
median all coincide.

The geometric mean


The geometric mean G of a set of n positive numbers Y1, Y2, ..., Yn is the nth root
of the product of the numbers, i.e.

\[ G = \sqrt[n]{Y_1 \cdot Y_2 \cdot Y_3 \cdots Y_n} \quad \text{or} \quad G = (Y_1 \cdot Y_2 \cdot Y_3 \cdots Y_n)^{1/n} \]

It is used to average certain ratios and rates.

Example
Find the geometric mean of the numbers 2, 4, and 8.

Solution

\[ G = (2 \times 4 \times 8)^{1/3} = \sqrt[3]{64} = 4 \]

Question
The average of the sequence 2, 4, 8 is taken to be 4 and not 4.67, i.e., (2+4+8)/3. Why do you
think this is so?

i. Because 4 < 4.67

ii. Because 4 is the geometric mean

iii. Not true

iv. No reason

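To facilitate computation of the geometric mean, logarithms are used. Starting from

\[ G = (Y_1 \cdot Y_2 \cdots Y_n)^{1/n} \]

and taking logarithms to base 10 on both sides we have

\[ \log G = \log (Y_1 \cdot Y_2 \cdots Y_n)^{1/n} = \frac{1}{n} \log (Y_1 \cdot Y_2 \cdots Y_n) = \frac{1}{n} \left[ \log Y_1 + \log Y_2 + \cdots + \log Y_n \right] = \frac{1}{n} \sum_{i=1}^{n} \log Y_i \]

\[ G = \text{anti-log} \left( \frac{1}{n} \sum_{i=1}^{n} \log Y_i \right) \]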
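
The logarithmic route can be sketched in Python as follows. The function name geometric_mean_log is illustrative; the anti-log to base 10 is computed as a power of 10.

```python
import math

def geometric_mean_log(values):
    """Geometric mean via G = anti-log((1/n) * sum of log10(Y_i))."""
    if any(y <= 0 for y in values):
        raise ValueError("the geometric mean requires positive numbers")
    mean_log = sum(math.log10(y) for y in values) / len(values)
    return 10 ** mean_log                 # anti-log to base 10

g = geometric_mean_log([2, 4, 8])
print(round(g, 6))
```

Working with the sum of logarithms avoids forming a very large product when there are many observations.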
Exercise
Find the geometric mean of 23, 27, 54, 35, 50, and 43.

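In the case of grouped data, i.e. a frequency distribution with k classes, we consider the
class marks Y1, Y2, ..., Yk and their corresponding frequencies f1, f2, ..., fk, with
total frequency n = Σ fi. Then

\[ G = \left( Y_1^{f_1} \cdot Y_2^{f_2} \cdots Y_k^{f_k} \right)^{1/n} \]

On taking logarithms we have

\[ \log G = \frac{1}{n} \log \left( Y_1^{f_1} \cdot Y_2^{f_2} \cdots Y_k^{f_k} \right) = \frac{1}{n} \left[ f_1 \log Y_1 + f_2 \log Y_2 + \cdots + f_k \log Y_k \right] = \frac{1}{n} \sum_{i=1}^{k} f_i \log Y_i \]

or

\[ G = \text{anti-log} \left( \frac{1}{n} \sum_{i=1}^{k} f_i \log Y_i \right). \]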

Exercise
Find the geometric mean of the frequency distribution of Table 2. For what value of
Y will geometric mean be undefined?

Advantages and disadvantages of geometric mean

Advantages
(i) Geometric mean is rigidly defined
(ii) It is based on all the observations
(iii) It is suitable for further mathematical treatment
e.g. If G1 and G2 are the geometric means of two series of n1 and n2 values
respectively, then the geometric mean of the combined series of n1 + n2 observations
is given by:

\[ G = \left( G_1^{n_1} \times G_2^{n_2} \right)^{\frac{1}{n_1 + n_2}} \]

or

\[ \log G = \frac{n_1 \log G_1 + n_2 \log G_2}{n_1 + n_2} \quad \text{or} \quad G = \text{anti-log} \left( \frac{n_1 \log G_1 + n_2 \log G_2}{n_1 + n_2} \right) \]

(iv) It is a proper average for measuring relative change. The geometric mean is a suitable
average for finding the average percentage increase in population, production, and sales
over a period of time.

Disadvantages
(i) Because of its mathematical character, the geometric mean is not easy to understand
or to calculate for a non-mathematical person.
(ii) If any of the observations is zero, the geometric mean becomes zero, and if any of
the observations is negative, the geometric mean may become imaginary regardless of the
magnitude of the other items.

The harmonic mean


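The harmonic mean H of a set of n numbers Y1, Y2, ..., Yn is the reciprocal of the
arithmetic mean of the reciprocals of the numbers, i.e.

\[ H = \frac{1}{\frac{1}{n} \left( \frac{1}{Y_1} + \frac{1}{Y_2} + \cdots + \frac{1}{Y_n} \right)} = \frac{1}{\frac{1}{n} \sum_{i=1}^{n} \frac{1}{Y_i}} \]

or

\[ \frac{1}{H} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{Y_i} \]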

Example
Find the harmonic mean of the numbers 2, 4, and 8.

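Solution

\[ \frac{1}{H} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{Y_i} = \frac{1}{3} \left( \frac{1}{2} + \frac{1}{4} + \frac{1}{8} \right) = \frac{1}{3} \left( \frac{4 + 2 + 1}{8} \right) = \frac{1}{3} \times \frac{7}{8} = \frac{7}{24} \]

\[ H = \frac{24}{7} = 3.43 \]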
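
The calculation can be checked with the Python sketch below. The function name harmonic_mean_of is illustrative; the optional frequencies allow the same function to be used with class marks and class frequencies.

```python
def harmonic_mean_of(values, freqs=None):
    """Harmonic mean: n divided by the sum of f_i / Y_i."""
    if freqs is None:
        freqs = [1] * len(values)        # ungrouped data: every f_i is 1
    if any(y == 0 for y in values):
        raise ValueError("the harmonic mean is undefined if a value is zero")
    n = sum(freqs)
    return n / sum(f / y for y, f in zip(values, freqs))

h = harmonic_mean_of([2, 4, 8])
print(round(h, 2))
```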

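For grouped data, the harmonic mean is given by

\[ H = \frac{1}{\frac{1}{n} \sum \frac{f_i}{Y_i}} \quad \text{or} \quad \frac{1}{H} = \frac{1}{n} \sum \frac{f_i}{Y_i} \]

where the sums run over the classes and n = Σ fi is the total frequency.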

Exercise
Find the harmonic mean of the following frequency distribution

Class: 20-24 25-29 30-34 35-39 40-44 45-49 50-54
Freq: 11 18 32 37 21 47 13

Advantages and disadvantages of harmonic mean

Advantages
(i) It is rigidly defined
(ii) It is based on all the observations
(iii) It is suitable for further mathematical treatment
e.g. If H1 and H2 are the harmonic means of n1 and n2 observations, then the harmonic
mean of the combined group of n1 + n2 observations is given by:

\[ H = \frac{n_1 + n_2}{\frac{n_1}{H_1} + \frac{n_2}{H_2}} \]

(iv) It is not affected very much by fluctuation of sampling


(v) It is particularly useful in averaging special types of rates and ratios where time
factor is involved.

Disadvantages
(i) It is not easy to understand and calculate.
(ii) Its value cannot be obtained if any of the observations is zero.

The relation between the arithmetic, geometric and harmonic means


H ≤ G ≤ Ȳ. The equality signs hold only if all the numbers Y1, Y2, ..., Yn are
identical.

Example
The set 2, 4, and 8 has arithmetic mean 4.67, geometric mean 4, and harmonic mean
3.43.

1.3.3 Measures of spread (dispersion)

Two data sets (or populations) may have the same mean, but may be “spread”
about this mean value very differently.

Definition
Dispersion, or Variation is the degree to which numerical data tend to spread about
an average value.

Characteristics of an ideal measure of dispersion.


Same as an ideal measure of central tendency discussed above.

Types of Measures of Dispersion
-Absolute and Relative.

Absolute Measures of Dispersion


-Measures expressed in terms of the original units of the data.

Limitation
Absolute measures are not suitable for comparing the variability of two or more
distributions that are not in the same units of measurement.

Relative Measures of Dispersion


-Pure numbers independent of the units of measurements.
Various measures of dispersion (or variation) are available, the most common be-
ing the range, mean deviation, quartile deviation or semi-interquartile range and
standard deviation.
Here we focus on three measures namely: the range, mean deviation and the stan-
dard deviation.

Range
Difference between the highest and the lowest values in the set. That is,

\[ \text{Range} = X_{\text{largest}} - X_{\text{smallest}} \qquad (4) \]

Example
Consider the tuition fee data set presented earlier and compute the range.

Solution
Ordered array of the data in ascending order of magnitude is: 4.9, 6.3, 7.7, 8.9, 10.3,
11.7
Range=11.7-4.9=6.8 million Tsh.
Exercise
Find the range of the set of numbers 2, 3, 3, 5, 5, 5, 8, 10, 12

The relative measure of range is defined as:

\[ \text{Coefficient of range} = \frac{\max - \min}{\max + \min} \]

For a frequency distribution, the range is taken as the difference between the upper class
boundary of the highest class and the lower class boundary of the lowest class.

Advantages and disadvantages of range

Advantages
(i) It requires very little calculation. Hence it is considered the easiest measure
of dispersion.
(ii) It is rigidly defined.

Disadvantages
(i) Is not based on all the values of the data
(ii) Range is very much affected by fluctuation of sampling, its value varies very
widely from sample to sample
(iii) It is not suitable for further mathematical treatment.

Mean deviation
The mean deviation (MD), or average deviation, of a set of n numbers is defined by:

\[ \text{MD} = \frac{\sum_{i=1}^{n} |Y_i - \bar{Y}|}{n} \]

where Ȳ is the arithmetic mean of the numbers and |Yi − Ȳ| is the absolute value of
the deviation of Yi from Ȳ. For example, the absolute value of –10 is |–10| = 10,
while that of 3 is |3| = 3.

Example
Find the mean deviation of the set 4, 6, 8, 10, 12, 14 and 16.
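
One way to check a hand calculation for this example is the Python sketch below. The function name mean_deviation is illustrative, and the deviations are taken about the arithmetic mean, as in the definition above.

```python
def mean_deviation(values):
    """Average of the absolute deviations from the arithmetic mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(y - mean) for y in values) / n

data = [4, 6, 8, 10, 12, 14, 16]
md = mean_deviation(data)
print(round(md, 2))
```

Here the mean is 10, the absolute deviations are 6, 4, 2, 0, 2, 4 and 6, and their sum 24 is divided by n = 7.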

For grouped data


If Y1, Y2, ..., Yn occur with frequencies f1, f2, ..., fn respectively, the mean
deviation is

\[ \text{MD} = \frac{\sum f_i |Y_i - \bar{Y}|}{n} \]

where, as defined before, n = Σ fi. Here the Yi's represent
class marks and the fi's the corresponding class frequencies.

Advantages and disadvantages of mean deviation

Advantages
(i) Mean deviation is rigidly defined
(ii) As compared to the standard deviation, it is easy to understand and calculate
(iii) Unlike the range and quartile deviation, the mean deviation is based on all the
observations
(iv) Since the mean deviation is based on deviations about an average, it provides a
better measure for comparing the formation of different distributions.
(v) As compared with standard deviation it is less affected by extreme observations.

Disadvantages
(i) It is not suitable for further mathematical treatment.
(ii) It cannot be computed for distributions with open-end classes.
Note: The mean deviation from the median is the least compared to the mean deviation
from any other value.

The sample variance


The sample variance (denoted as $s^2$) for a set of n measurements is equal to the sum of squared deviations from the mean divided by (n − 1).

In symbols:

\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \]

or

\[ s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n-1} \tag{5} \]

In practice $s^2$ is calculated as:

\[ s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1} \tag{6} \]

Equation (6) is known as the shortcut formula for computing $s^2$.
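
The step from equation (5) to equation (6) is not spelled out in these notes. A short sketch of the algebra, using only the definition of the sample mean, is:

```latex
\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2
     \qquad \text{since } \sum_{i=1}^{n} x_i = n\bar{x} \\
  &= \sum_{i=1}^{n} x_i^2 - n\bar{x}^2
   = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}.
\end{align*}
```

Dividing both sides by (n − 1) gives the shortcut formula.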

Example
Consider the tuition fee data set and compute the sample variance.

Solution
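The six sample tuition fees (in million Tsh) are 10.3, 4.9, 8.9, 11.7, 6.3 and 7.7, with sample mean 8.3.

The sample variance is calculated as defined in equation (5):

\[ s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} = \frac{(10.3-8.3)^2 + (4.9-8.3)^2 + (8.9-8.3)^2 + (11.7-8.3)^2 + (6.3-8.3)^2 + (7.7-8.3)^2}{6-1} \]

\[ s^2 = \frac{31.84}{5} = 6.368 \]

Because the deviations are squared, the variance is expressed in squared units, here (million Tsh)².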
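
As a check on the arithmetic, the following minimal Python sketch computes the variance of the tuition fee data with both equation (5) and the shortcut formula (6). The function names are illustrative only and are not part of the notes.

```python
# Tuition fees in million Tsh, taken from the example above.
fees = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]


def sample_variance(values):
    """Equation (5): sum of squared deviations from the mean over (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Equation (6): the shortcut formula using sums and sums of squares."""
    n = len(values)
    total = sum(values)
    total_sq = sum(x * x for x in values)
    return (total_sq - total ** 2 / n) / (n - 1)


s2 = sample_variance(fees)
s2_short = sample_variance_shortcut(fees)
print(round(s2, 3), round(s2_short, 3))  # 6.368 6.368
```

Both routes give the same value; the shortcut form is mainly a convenience for hand calculation.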

The sample variance for grouped data


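If X1, X2, ..., Xn represent the midpoints of a frequency distribution table with n classes and corresponding class frequencies f1, f2, ..., fn, the sample variance, $s^2$, of a set of n grouped measurements having X̄ as the mean is defined as:

\[ s^2 = \frac{\sum_{i=1}^{n} f_i (X_i - \bar{X})^2}{n-1} \tag{7} \]

The corresponding shortcut formula is

\[ s^2 = \frac{\sum_{i=1}^{n} f_i X_i^2 - \dfrac{\left(\sum_{i=1}^{n} f_i X_i\right)^2}{n}}{n-1} \tag{8} \]

In the divisors of (7) and (8), n is the total number of measurements, that is, the sum of the class frequencies.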
Exercise
Find the sample variance of the frequency distribution of Table 2.

The sample standard deviation


The sample standard deviation (denoted by s) is defined as the positive square root of the sample variance, $s^2$.

That is, $s = \sqrt{s^2}$ or

\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \tag{9} \]

Example
Consider the tuition fee data set and compute the sample standard deviation.

Solution
From above, the sample variance is $s^2 = 6.368$. Therefore, the sample standard deviation is

\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{6.368} = 2.52 \text{ million Tsh.} \]

The sample standard deviation for grouped data


For grouped data, the sample standard deviation is obtained by taking the square root of the corresponding sample variance. That is,

\[ s = \sqrt{\frac{\sum_{i=1}^{n} f_i (X_i - \bar{X})^2}{n-1}} \tag{10} \]

Exercise
Find the sample standard deviation of the frequency distribution of Table 2.

The average absolute deviation or mean deviation


The mean deviation of a set of n measurements from their mean is defined as the average of the absolute values of their deviations. That is,

\[ \mathrm{MD} = \frac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n} \tag{11} \]

Example
Find the mean deviation of the set 68, 67, 66, 63, and 61.

Solution
The arithmetic mean, X̄, of the given set of numbers is 65 (check). Therefore, the mean deviation is

\[ \mathrm{MD} = \frac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n} = \frac{|68-65| + |67-65| + |66-65| + |63-65| + |61-65|}{5} = \frac{3+2+1+2+4}{5} = 2.4 \]
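
The same calculation can be sketched in a few lines of Python. The helper name `mean_deviation` is an illustrative choice, not something defined in these notes.

```python
def mean_deviation(values):
    """Equation (11): average absolute deviation from the arithmetic mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n


md = mean_deviation([68, 67, 66, 63, 61])
print(md)  # 2.4
```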

Exercise
Find the mean deviation of the set 4, 6, 8, 10, 12, 14 and 16.

The mean deviation for grouped data


If X1, X2, ..., Xn represent the midpoints of a frequency distribution table with n classes and corresponding class frequencies f1, f2, ..., fn, the mean deviation, MD, of a set of n grouped measurements having X̄ as the mean is defined as:

\[ \mathrm{MD} = \frac{\sum_{i=1}^{n} f_i |X_i - \bar{X}|}{n} \tag{12} \]
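
To illustrate the grouped formulas (7) and (12), here is a minimal Python sketch. The class marks and frequencies are hypothetical values invented for illustration; they are not Table 2 of these notes. The divisor is taken to be the total number of measurements, i.e. the sum of the frequencies.

```python
# Hypothetical class marks and frequencies (NOT Table 2 from the notes).
class_marks = [5, 15, 25]
frequencies = [2, 5, 3]

total = sum(frequencies)  # total number of measurements
mean = sum(f * x for f, x in zip(frequencies, class_marks)) / total

# Equation (7): grouped sample variance, divisor (total - 1).
grouped_variance = sum(
    f * (x - mean) ** 2 for f, x in zip(frequencies, class_marks)
) / (total - 1)

# Equation (12): grouped mean deviation, divisor total.
grouped_md = sum(
    f * abs(x - mean) for f, x in zip(frequencies, class_marks)
) / total

print(mean, round(grouped_variance, 3), grouped_md)
```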
Exercise
Find the mean deviation of the frequency distribution of Table 2.

The coefficient of variation


The coefficient of variation (denoted as CV) is equal to the standard deviation divided by the arithmetic mean, multiplied by 100%. It is therefore a relative measure of variation, which measures the scatter in the data relative to the mean. That is,

\[ \mathrm{CV} = \frac{s}{\bar{X}} \times 100\% \tag{13} \]

It is particularly useful when comparing the variability of two or more sets of data
that are expressed in different units of measurements.

Example
Consider the tuition fee data set and compute the coefficient of variation (CV).

Solution
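From above, the sample mean is X̄ = 8.30 and the sample standard deviation is s = 2.52 (the square root of the sample variance 6.368). Therefore,

\[ \mathrm{CV} = \frac{s}{\bar{X}} \times 100\% = \frac{2.52}{8.30} \times 100\% = 30.4\% \]

That is, the relative size of the average spread around the mean to the mean is 30.4%.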
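
A minimal Python sketch of the standard deviation and coefficient of variation for the tuition fee data follows. Variable names are illustrative only.

```python
import math

fees = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]  # million Tsh, from the example above

n = len(fees)
mean = sum(fees) / n
s = math.sqrt(sum((x - mean) ** 2 for x in fees) / (n - 1))  # equation (9)
cv = s / mean * 100  # equation (13), expressed in percent

print(round(mean, 2), round(s, 2), round(cv, 1))  # 8.3 2.52 30.4
```

Because the CV is a ratio of two quantities in the same units, it is unit-free, which is what makes it usable across data sets measured in different units.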

Combined standard deviation (Sc)


We have seen in the previous discussions that when we have n groups, each with its group mean Ȳi and group size ni, we can compute the combined (or pooled) mean Ȳc of all groups as follows:

\[ \bar{Y}_c = \frac{n_1 \bar{Y}_1 + n_2 \bar{Y}_2 + \dots + n_n \bar{Y}_n}{n_1 + n_2 + \dots + n_n} = \frac{\sum_{i=1}^{n} n_i \bar{Y}_i}{\sum_{i=1}^{n} n_i} \]

For the same groups, we can compute the combined standard deviation (Sc) if we have information on their individual standard deviations s1, s2, ..., sn, their means Ȳ1, Ȳ2, ..., Ȳn, their group sizes n1, n2, ..., nn and their pooled mean Ȳc. Then

\[ S_c = \sqrt{\frac{\sum_{i} n_i (s_i^2 + d_i^2)}{\sum_{i} n_i}} \quad \text{where } d_i = \bar{Y}_i - \bar{Y}_c \]

Example
Wages paid to workers in two companies A and B are summarized as follows:

                                     Company A   Company B
Number of workers                    500         600
Average wages ($)                    186         175
Variance of distribution of wages    81          100

Compute Ȳc and Sc. Deduce the variance and coefficient of variation (C.V.).
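
One way to organise this calculation is sketched below in Python, applying the combined mean and combined standard deviation formulas exactly as stated above (with the sum of group sizes as the divisor). The variable names are illustrative, and the assignment of the first column to company A and the second to company B is assumed from the order of the table.

```python
import math

sizes = [500, 600]          # number of workers in companies A and B
means = [186.0, 175.0]      # average wages ($)
variances = [81.0, 100.0]   # variances of the wage distributions

total = sum(sizes)
combined_mean = sum(n * m for n, m in zip(sizes, means)) / total

# Combined variance: weighted average of (s_i^2 + d_i^2), d_i = mean_i - combined_mean.
combined_var = sum(
    n * (v + (m - combined_mean) ** 2)
    for n, m, v in zip(sizes, means, variances)
) / total
combined_sd = math.sqrt(combined_var)
cv = combined_sd / combined_mean * 100  # coefficient of variation in percent

print(combined_mean, round(combined_var, 2), round(combined_sd, 2), round(cv, 2))
# 180.0 121.36 11.02 6.12
```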

Calculating descriptive summary measures from a population


Recall: the above various statistics summarize or describe numerical information
from a sample. Measures computed from the population are called parameters.
Examples include the population mean, the population variance and standard devi-
ation, and the population coefficient of variation.

The population mean


The population mean, denoted by µ, is defined as the sum of all the measurements in the population divided by the population size. That is,

\[ \mu = \frac{\sum_{i=1}^{N} X_i}{N} \tag{14} \]

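The population variance and standard deviation
The population variance $\sigma^2$ is given by

\[ \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \tag{15} \]

The population standard deviation, σ, is given by $\sigma = \sqrt{\sigma^2}$ or

\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} \tag{16} \]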

The population coefficient of variation


\[ \mathrm{CV} = \frac{\sigma}{\mu} \times 100\% \tag{17} \]

Exercise
1. Calculate the mean, median, mode, range, variance and standard deviation of the
variable age given in the table below by hand. Compare your results with the SPSS
printout given for the same variable (age). Interpret your results. That is, state
what each quantity means with respect to the study.

OMT data
Age Weight Height
31 75 178
30 69 178
38 73 176
24 78 179
54 73 162
64 75 168
61 65 160
34 60 166
38 64 165
47 63 167
49 85 172
55 91 177
25 69 170
42 63 178
53 93 180
43 65 174
55 60 173
60 61 167
61 49 163
45 56 175
24 79 179
24 80 177
25 64 168
25 95 174
44 82 160
49 77 172
52 90 178
50 80 160
54 76 160
33 77 180

SPSS Printout of summary statistics for variable age

N (Valid) 30
N (Missing) 0
Mean 42.97
Standard Error of Mean 2.356
Median 44.50
Mode 24(a)
Standard Deviation 12.907
Variance 166.585
Range 40
Minimum 24
Maximum 64
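
The printed summary can be cross-checked without SPSS. The sketch below uses Python's standard `statistics` module on the age column of the OMT data; this is offered only as an independent check, not as the procedure used to produce the printout. The standard error of the mean is computed as s/√n.

```python
import statistics

# Age column of the OMT data table above.
ages = [31, 30, 38, 24, 54, 64, 61, 34, 38, 47, 49, 55, 25, 42, 53,
        43, 55, 60, 61, 45, 24, 24, 25, 25, 44, 49, 52, 50, 54, 33]

summary = {
    "n": len(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "modes": statistics.multimode(ages),     # all most-frequent values
    "variance": statistics.variance(ages),   # sample variance, divisor n - 1
    "sd": statistics.stdev(ages),            # sample standard deviation
    "range": max(ages) - min(ages),
}
summary["se_mean"] = summary["sd"] / summary["n"] ** 0.5

for name, value in summary.items():
    print(name, value)
```

The data have two modes, 24 and 25, each occurring three times; the printout lists only the smaller one, which is presumably what its "(a)" marker flags.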

1.3.4 Simple Linear Regression analysis

1.3.4.1 Introduction
Investigating the relationships between two (or more) variables is a problem that
arises in the biological and physical sciences, economics, industrial applications and
biomedical settings.

For instance, a business firm may wish to know how future sales of a given product could be affected by its price, or in a classroom we may need to know how test and assignment scores can be used to predict final examination scores.

The problem of predicting the unknown value of one variable from the known value of another variable is called regression. In this course we restrict our discussion to the simplest case: two variables whose relationship can reasonably be assumed to be a straight line. However, the technique can be extended to multivariate data (multiple regression analysis).

Variables
We name the two variables in our simple regression the dependent or response variable (Y) and the independent variable (X).

Practical problem
In most situations, the values we observe for Y (and sometimes for X) are not exact. In particular, due to biological variation among experimental units and the sampling of them, imprecision and/or inaccuracy of measuring devices, and so on, we may only observe values of Y (and also possibly X) with some error. Thus, based on a sample of (X, Y) pairs, our ability to see the relationship exactly is obscured by this error.

Scatter diagram
The scatter diagram is a graph that gives an indication of the mathematical form
that could represent the relationship between two variables. It is always advisable
to plot the data before analysis, to ensure that the model assumptions seem valid.

1.3.4.2 Simple linear regression model
Straight line model
It is often reasonable to suppose that the relationship between Y and X is a straight
line. We may write this as

Y = β0 + β1 X + ε (18)

for some values of β0 and β1 .

where
β0 is the intercept, the value taken by the line at X = 0;
β1 is the slope, which expresses the rate of change in Y; that is, β1 = change in Y brought about by a change of one unit in X.

Issue
The problem is that we do not know β0 or β1. To get information about their values, the typical experimental set-up is to choose values of Xi, i = 1, ..., n, and observe the resulting responses Y1, ..., Yn so that the data consist of pairs (Xi, Yi), i = 1, ..., n. The data are then used to estimate β0 and β1, that is, to fit the model to the data in order to

• quantify the relationship between Y and X

• use the relationship to predict a new response Y0 we might observe at a given value X0 (perhaps one not included in the experiment).

Model (18) above is referred to as a simple linear regression model.

• the parameters β0 and β1 characterising this relationship are called regression coefficients or regression parameters.

• ε is an experimental error; it represents all unexplained inherent variation due to the experimental unit.

If we think of our data (Xi, Yi), we may thus think of a model for Yi as follows:

Yi = β0 + β1Xi + εi

Objective
For the simple linear regression model, fit the line to the data to serve as our “best” characterisation of the relationship based on the available data. Although we will work with the simple linear regression model, be aware that the methods we discuss extend easily to more complex linear models.

1.3.4.3 Fitting a simple linear regression model: the method of least
squares
Having discussed the important conceptual issues involved in studying the relationship between two variables, let us now describe the practical implementation. We do this first for fitting a simple linear regression model.

For observations (Xi, Yi), i = 1, ..., n, we postulate the simple linear regression model

Yi = β0 + β1Xi + εi, i = 1, ..., n.

Goal
We wish to fit the above model by estimating the intercept and slope parameters β0
and β1 .
Assumptions
For the purpose of making inferences about the true values of intercept and slope, making predictions, and so on, we make the following assumptions. These are often reasonable assumptions.

i. The observations Y1, ..., Yn are independent, i.e., not related in any way. For example, they are derived from different animals, subjects, etc. They might also be measurements on the same subject, but taken far enough apart in time that the value at one time is completely unrelated to that at another time.

ii. The observations Y1, ..., Yn have the same variance, σ². That is, regardless of which Xi we consider, the variation in possible Yi values is the same.

iii. The observations Yi are each normally distributed with mean µi = β0 + β1Xi, i = 1, ..., n, and variance σ².

The method of least squares


There is no single right way to estimate β0 and β1. The most widely accepted method is that of least squares.

For each Yi, note that Yi − (β0 + β1Xi) = εi; that is, the deviation Yi − (β0 + β1Xi) is a measure of the vertical distance of the observation Yi from the line β0 + β1Xi that is due to inherent variation (represented by εi). This deviation may be positive or negative.

The least squares estimation method consists of choosing, for any given set of observations, the values of β0 and β1 that minimize the sum of squares of the deviations εi. This has the same appeal as a sample variance: squaring ignores the signs of the deviations but accounts for their magnitude. That is, we minimize

\[ \sum_{i=1}^{n} \{Y_i - (\beta_0 + \beta_1 X_i)\}^2 \]

Our interest is to find the estimates of β0 and β1 that are the most plausible to have generated the data. Hence, a natural way to think about this is to choose as estimates the values β̂0 and β̂1 that make this measure of overall variation as small as possible (that is, which minimize it).

If β̂0 and β̂1 are the estimates of β0 and β1, then

\[ \sum_{i=1}^{n} \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right)^2 \]

is a minimum.
Estimation of regression parameters
Consider Yi = β0 + β1 Xi + εi

Ŷi = β̂0 + β̂1 Xi

ei = Yi − Ŷi

= Yi − (β̂0 + β̂1 Xi )

= Yi − β̂0 − β̂1 Xi


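\[ \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2 \]

Minimizing $\sum e_i^2$ amounts to finding the partial derivatives of $\sum e_i^2$ with respect to β̂0 and β̂1 and equating them to zero.

The partial derivatives of $\sum e_i^2$ with respect to β̂0 and β̂1 are:

\[ \frac{\partial}{\partial \hat{\beta}_0} \sum e_i^2 = \frac{\partial}{\partial \hat{\beta}_0} \sum \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right)^2 \tag{19} \]

\[ \frac{\partial}{\partial \hat{\beta}_1} \sum e_i^2 = \frac{\partial}{\partial \hat{\beta}_1} \sum \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right)^2 \tag{20} \]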

Simplifying equations (19) and (20) and equating the resulting equations to zero,
one can show that

β̂0 = Ȳ − β̂1 X̄ (21)

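\[ \hat{\beta}_1 = \frac{n \sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n \sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2} \tag{22} \]

It can also be shown (Exercise) that

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \]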

Thus, the fitted straight line is given by

Ŷi = β̂0 + β̂1 Xi

The “hat” on the Yi emphasizes the fact that these values are our “best guesses”.
The Ŷi are often called the predicted values.
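
For readers who want the intermediate step between the partial derivatives and the estimates, a sketch of the standard derivation is given below. Setting each derivative to zero yields the two "normal equations", which are then solved for the estimates.

```latex
\begin{align*}
\frac{\partial}{\partial \hat{\beta}_0} \sum_{i=1}^{n} e_i^2
  &= -2 \sum_{i=1}^{n} \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right) = 0
  &&\Longrightarrow\quad
  \sum_{i=1}^{n} Y_i = n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} X_i, \\
\frac{\partial}{\partial \hat{\beta}_1} \sum_{i=1}^{n} e_i^2
  &= -2 \sum_{i=1}^{n} X_i \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right) = 0
  &&\Longrightarrow\quad
  \sum_{i=1}^{n} X_i Y_i = \hat{\beta}_0 \sum_{i=1}^{n} X_i + \hat{\beta}_1 \sum_{i=1}^{n} X_i^2 .
\end{align*}
```

Dividing the first normal equation by n gives equation (21); substituting that expression for the intercept into the second and solving for the slope gives equation (22).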

Example: Optical density data
The optical density Y of a solution measured at eight concentrations, X, of a chemical
was as follows:
Meter reading, Yi            4    9   18   20   35   41   47   60
Concentration (µg/ml), Xi    1    2    4    5    8   10   12   15

(a) Draw a scatter diagram of Y against X


(b) Fit the simple linear regression line Yi = β0 + β1 Xi + εi
Solution
(a) Here is the plot of the data:

[Figure: scatter diagram of meter reading Y against concentration X]

Note: A regression line always passes through the point (X̄, Ȳ).
(b) Computational Table

Xi Yi Xi Yi Xi2
1 4 4 1
2 9 18 4
4 18 72 16
5 20 100 25
8 35 280 64
10 41 410 100
12 47 564 144
15 60 900 225

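\[ \sum_{i=1}^{8} X_i = 57, \quad \sum_{i=1}^{8} Y_i = 234, \quad \sum_{i=1}^{8} X_i Y_i = 2348, \quad \sum_{i=1}^{8} X_i^2 = 579 \]

\[ \hat{\beta}_1 = \frac{n \sum X_i Y_i - \sum X_i \sum Y_i}{n \sum X_i^2 - \left(\sum X_i\right)^2} = \frac{8 \times 2348 - (57)(234)}{8 \times 579 - (57)^2} = 3.94 \]

\[ \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = \frac{234}{8} - 3.94 \times \frac{57}{8} = 1.193 \]

(The value 1.193 is obtained with the unrounded slope; carrying only 3.94 gives a slightly smaller figure.)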
Therefore the fitted line is
Ŷi = 1.193 + 3.94Xi

The slope of the line is 3.94, that is, for each unit increase in concentration (X),
optical density (Y) increases by the amount 3.94 and when X = 0, Y takes the value
1.193.
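
The fit can be reproduced with a short Python sketch of equations (21) and (22). The function name `fit_line` is an illustrative choice, not something defined in these notes.

```python
x = [1, 2, 4, 5, 8, 10, 12, 15]      # concentration (µg/ml)
y = [4, 9, 18, 20, 35, 41, 47, 60]   # meter reading


def fit_line(xs, ys):
    """Least squares intercept and slope via equations (21) and (22)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    sum_x2 = sum(a * a for a in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope


b0, b1 = fit_line(x, y)
print(round(b0, 3), round(b1, 2))  # 1.193 3.94
```

Note that the intercept is computed from the unrounded slope, which is why it comes out as 1.193.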

Assessing the fitted regression


Goal
Here we wish to assess how precisely we have estimated the intercept β0 and slope β1, and for that matter, the line overall. Specifically we would like to quantify

• The precision of the estimate of the line

• The variability in the estimates β̂0 and β̂1.

Roughly speaking, a good regression line is one that helps to explain or account for a large proportion of the variability in Y.

Consider the identity

Yi − Ȳ = (Ŷi − Ȳ ) + (Yi − Ŷi )

Algebra and the fact that Ŷi = β̂0 + β̂1Xi = Ȳ + β̂1(Xi − X̄) may be used to show that

\[ \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \tag{23} \]

The quantity on the left hand side is the Total Sum of Squares for the set of data.
For any set of data, we may always compute the Total SS as the sum of squared
deviations of the observations from the (overall) mean, and it serves as a measure of
the overall variation in the data.
Therefore, equation (23) represents a partition of our assessment of overall variation in the data, Total SS, into two independent components.

• (Ŷi − Ȳ) is the deviation of the predicted value of the ith observation from the overall mean. Thus, we may think of the sum of these squared deviations as measuring the variation in the observations that may be explained by the regression line β0 + β1Xi.

• (Yi − Ŷi) is the deviation between the predicted value for the ith observation (our “best” guess for its mean) and the observation itself (that we observed). Hence, the sum of squared deviations

\[ \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \]

measures any additional variation of the observations about the regression line; that is, the inherent variation in the data at each Xi value that causes observations not to lie on the line.
Thus, the overall variation in the data, as measured by Total SS, may be broken
down into two components that each characterise parts of the variation:

Regression SS = $\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$, which measures that portion of the variability that may be explained by the regression relationship.

Error SS (also called Residual SS) = $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$, which measures the inherent variability in the observations (e.g., experimental error).

Total variation = explained variation+ unexplained variation

OR

Total SS = Regression SS + Error SS


SST = SSR + SSE
We define

\[ R^2 = \frac{\text{Regression SS}}{\text{Total SS}} \tag{24} \]

R2 is our measure of goodness of fit and is called the coefficient of determination.


Note that we must have 0 ≤ R² ≤ 1, because both components are nonnegative and the numerator can be no larger than the denominator. Thus, an R² value close to 1 is often taken as evidence that the regression model does a good job of describing the variability in the data. On the other hand, an R² close to 0 indicates that the regression explains little of the variation in Y; R² is exactly 0 when SSE = SST.

R² is 1 if SSE = 0, that is, if the regression explains all the variation in Y.

In practice we shall have values of R² that lie between these two extreme values. A value of R² very close to 1 implies a very good fit, whereas a value of R² very close to 0 implies a very poor fit.

Important: R² is computed under the assumption that the simple linear regression model is correct; i.e., it is a good description of the underlying relationship between Y and X. Thus, if the relationship between X and Y really is a straight line, R² assesses how much of the variation in Y is accounted for by that relationship.

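Calculation of R²
To calculate R² by hand, one can show that

\[ R^2 = \frac{\hat{\beta}_1^2 S_{XX}}{S_{YY}} \]

where

\[ S_{XX} = \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n} \]

and

\[ S_{YY} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - \frac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n} \]

Example
Compute the coefficient of determination for the optical density data set.

Exercise
Show that R² = 0.996 or 99.6% and give an interpretation.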
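
A hand calculation of R² can be checked numerically. The following Python sketch computes the three sums of squares for the optical density data, confirms the partition Total SS = Regression SS + Error SS, and evaluates R² both from equation (24) and from the hand-calculation formula. Variable names are illustrative.

```python
x = [1, 2, 4, 5, 8, 10, 12, 15]      # concentration (µg/ml)
y = [4, 9, 18, 20, 35, 41, 47, 60]   # meter reading

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

sxx = sum((a - x_bar) ** 2 for a in x)
syy = sum((b - y_bar) ** 2 for b in y)
sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))

b1 = sxy / sxx              # slope
b0 = y_bar - b1 * x_bar     # intercept
fitted = [b0 + b1 * a for a in x]

sst = syy                                              # Total SS
ssr = sum((f - y_bar) ** 2 for f in fitted)            # Regression SS
sse = sum((b - f) ** 2 for b, f in zip(y, fitted))     # Error SS

r2 = ssr / sst                       # equation (24)
r2_by_hand = b1 ** 2 * sxx / syy     # hand-calculation formula

print(round(r2, 3), round(r2_by_hand, 3))  # 0.996 0.996
```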

1.3.5 Correlation analysis

Idea
Describe the “degree of association” between X and Y. Correlation does not characterise the form of the relationship between the two variables. The correlation coefficient ρXY is a measure of the degree of (linear) association between the two random variables. ρXY satisfies −1 ≤ ρXY ≤ 1, with ρXY = 1 denoting a “perfect” positive association, ρXY = −1 denoting a “perfect” negative association, and ρXY = 0 denoting “no association”. Here we discuss the Karl Pearson correlation coefficient.

In practice, “perfect” association is rarely observed. We are more likely to observe intermediate associations or no association.

Interpretation
It is important to understand what correlation does not measure. Investigators sometimes confuse the value of the correlation coefficient with the slope of an apparent underlying straight-line relationship. Apart from having the same sign, the size of one says nothing about the size of the other.

The correlation coefficient may be virtually equal to 1, implying an almost perfect association, while the slope is very small. Although there is indeed an almost perfect association, the rate of change of Y values with X values may be very slow.

The correlation coefficient may be very small, but the apparent “slope” of the relationship could be very steep. In this situation, although the rate of change of Y values with X values is fast, the association between them is weak.

Estimation
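For a given set of data, ρXY is unknown. We may estimate it from a set of n pairs of observations (Xi, Yi), i = 1, ..., n, by the sample correlation coefficient

\[ r_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \]

Using the notation defined before, together with $S_{XY} = \sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum X_i Y_i - \left(\sum X_i\right)\left(\sum Y_i\right)/n$,

\[ r_{XY} = \frac{S_{XY}}{\sqrt{S_{XX} S_{YY}}} \]

For hand calculation, one should use the preferred (shortcut) forms of SXX, SXY and SYY.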
Example
The following data are measurements on wing length (X) and tail length (Y ) for a
sample of n=12 birds:

Wing length (X, cm) Tail length (Y , cm)


10.4 7.4
10.8 7.6
11.1 7.9
10.2 7.2
10.3 7.4
10.2 7.1
10.7 7.4
10.5 7.2
10.8 7.8
11.2 7.7
10.6 7.8
11.4 8.3

A plot of the data indicates a possible positive association.

Calculate the correlation coefficient for these data.


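\[ \sum_{i=1}^{n} X_i = 128.2, \quad \sum_{i=1}^{n} X_i^2 = 1371.32, \quad S_{XX} = 1.717 \]

\[ \sum_{i=1}^{n} Y_i = 90.8, \quad \sum_{i=1}^{n} Y_i^2 = 688.40, \quad S_{YY} = 1.347 \]

\[ \sum_{i=1}^{n} X_i Y_i = 971.37, \quad S_{XY} = 1.323 \]

Thus, our estimate of ρXY is

\[ r_{XY} = \frac{S_{XY}}{\sqrt{S_{XX} S_{YY}}} = \frac{1.323}{\sqrt{(1.717)(1.347)}} = 0.8704 \]

(The value 0.8704 is obtained from the unrounded sums of squares; the rounded values shown give approximately 0.870.)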

The estimate is fairly close to 1
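
The sums above and the resulting correlation can be verified with the following Python sketch, which uses the shortcut forms of SXX, SYY and SXY. Variable names are illustrative.

```python
import math

wing = [10.4, 10.8, 11.1, 10.2, 10.3, 10.2, 10.7, 10.5, 10.8, 11.2, 10.6, 11.4]
tail = [7.4, 7.6, 7.9, 7.2, 7.4, 7.1, 7.4, 7.2, 7.8, 7.7, 7.8, 8.3]

n = len(wing)
sum_x, sum_y = sum(wing), sum(tail)

# Shortcut forms of the corrected sums of squares and cross-products.
sxx = sum(a * a for a in wing) - sum_x ** 2 / n
syy = sum(b * b for b in tail) - sum_y ** 2 / n
sxy = sum(a * b for a, b in zip(wing, tail)) - sum_x * sum_y / n

r = sxy / math.sqrt(sxx * syy)
print(round(sxx, 3), round(syy, 3), round(sxy, 3), round(r, 4))
# 1.717 1.347 1.323 0.8704
```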

2 Chapter Two: Statistical inference

Inferences in statistics are of two types, namely:

i. estimation: involves the determination, with a possible error due to sampling, of the unknown value of a population characteristic

ii. hypothesis testing: involves the definition of a hypothesis as one set of possible population values and an alternative, a different set. There are many statistical procedures for determining, on the basis of a sample, whether the true population characteristic belongs to the set of values in the hypothesis or the alternative.

2.1 Point and interval estimation

2.1.1 Point estimation

A point estimate consists of a single sample statistic that is used to estimate the true value of a population parameter. Examples include the sample mean (X̄), which is a point estimate of the population mean µ, and the sample variance (s²), which is a point estimate of the population variance σ². For details of how to compute these quantities, revisit our discussion of measures of central tendency (specifically the sample mean) and measures of dispersion (specifically the sample variance). Note that the sample mean (X̄) and sample variance (s²) are not the only estimators of the population mean and population variance respectively. However, they possess most of the desirable qualities of an estimator, as discussed below.

2.1.1.1 Some properties of estimators

i. Unbiasedness: An estimator is said to be unbiased if its expected value is equal to the parameter that it is supposed to estimate. Otherwise, it is said to be biased. Note that

\[ s^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 \]

is a biased estimator of the population variance σ², while

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 \]

is an unbiased estimator.

ii. Minimum variance: What if we can identify two competing estimators that
are both unbiased? On what grounds might we prefer one over the other?

Since the aim is to use sample information to say something about the population from which the sample was drawn, we would like our estimator to be as close to the true population value as possible. That is, we would like it to have small variance. This would mean that the possible values that the estimator could take on (across all possible samples we might have ended up with) exhibit only small variation.

• Given two unbiased estimators, choose the one with the smaller variance.

• Ideally we would like to use an estimator that is unbiased and has the smallest variance among all such candidates. Such an estimator is called a minimum variance unbiased estimator (MVUE).

• It turns out that, for normally distributed data Y, the estimators Ȳ (for µ) and s² (for σ²) have this desired property.

It should be noted that unbiasedness and minimum variance are not the only desir-
able qualities of an estimator. Sufficiency, consistency, efficiency and robust-
ness are also used to judge which estimator to be used in a particular situation.
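
The unbiasedness property stated above can be seen concretely by enumerating every possible sample from a tiny population. The sketch below uses a hypothetical population {1, 2, 3} and all ordered samples of size 2 drawn with replacement (so each sample is equally likely); both choices are assumptions made purely for illustration. Averaging the divisor-(n − 1) variance over all samples recovers σ², while the divisor-n version falls short by the factor (n − 1)/n.

```python
from itertools import product

population = [1, 2, 3]  # hypothetical population, for illustration only
N = len(population)
mu = sum(population) / N
sigma2 = sum((x - mu) ** 2 for x in population) / N  # population variance

n = 2
# All equally likely ordered samples of size n drawn with replacement.
samples = list(product(population, repeat=n))


def variance_with_divisor(sample, divisor):
    """Sum of squared deviations from the sample mean, over the given divisor."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / divisor


mean_unbiased = sum(variance_with_divisor(s, n - 1) for s in samples) / len(samples)
mean_biased = sum(variance_with_divisor(s, n) for s in samples) / len(samples)

print(sigma2, mean_unbiased, mean_biased)
```

The average of the divisor-(n − 1) estimates equals σ² = 2/3, whereas the divisor-n estimates average to 1/3.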

2.1.2 Interval estimation

An estimate is a “likely” value. Because of chance, it is of course too much to expect that the estimators X̄ and s² would be exactly equal to the true unknown population parameters µ and σ², respectively, for any given data set of size n. Although they may not be “exactly” equal to the value they are trying to estimate, because they are “good” estimators in the sense discussed above, they are likely to be close.

• Instead of reporting only a single value of an estimator, we report an interval (based on the estimator) and state that it is likely that the true unknown value of the parameter is in the interval specified.

• “Likely” means probability is involved

Here we discuss the notion of such an interval, known as a confidence interval.

2.1.2.1 Case I: Confidence interval estimation of the mean µ (σ unknown)
If we wish to make probability statements about X̄ without knowledge of σ², we use the statistic

\[ \frac{\bar{X} - \mu}{s_{\bar{X}}} \quad \text{where } s_{\bar{X}} = \frac{s}{\sqrt{n}} \]

The value of µ we wish to estimate is a fixed (but unknown) quantity. Thus, our probability statements intuitively should have something to do with the uncertainty of trying to get an understanding of the fixed value of µ using the variable estimator X̄.

With this in mind, consider the following probability statement. Let $t_{n-1,\alpha/2}$ be the point such that the region to its right under the Student's t density with (n − 1) degrees of freedom has area α/2, so that the remaining region has area (1 − α/2).

By symmetry, the region to the left of $-t_{n-1,\alpha/2}$ also has area α/2, with (1 − α) in the middle:

\[ P\left(-t_{n-1,\alpha/2} \le \frac{\bar{X} - \mu}{s_{\bar{X}}} \le t_{n-1,\alpha/2}\right) = 1 - \alpha \]

If we rewrite the above expression by algebra we obtain

\[ P\left(\bar{X} - t_{n-1,\alpha/2} \times s_{\bar{X}} \le \mu \le \bar{X} + t_{n-1,\alpha/2} \times s_{\bar{X}}\right) = 1 - \alpha \]

Definition
The interval

\[ \left(\bar{X} - t_{n-1,\alpha/2} \times s_{\bar{X}},\; \bar{X} + t_{n-1,\alpha/2} \times s_{\bar{X}}\right) \tag{25} \]

is called a (1-α)100% confidence interval for µ. For example if α=0.05, then (1-α)
=0.95, and the interval would be called a 95% confidence interval. In general,
the value (1-α) is called the confidence coefficient. We will often abbreviate con-
fidence interval by “CI”.

The endpoints of the interval, X̄ − tn−1,α/2 × sX̄ and X̄ + tn−1,α/2 × sX̄ are called
the lower and upper confidence limits.

2.1.2.2 Case II: Confidence interval estimation of the mean µ (σ known)

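To construct a confidence interval formula for estimating the mean of a normal population when σ² is known, we return to the probability

\[ P\left(\bar{X} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha \]

Thus,

\[ \bar{X} - z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}} \tag{26} \]

is a (1 − α)100% confidence interval for the mean of the population.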

Interpretation of CI
In general, a (1 − α)100% confidence interval can be interpreted to mean that if all possible samples of the same size n were taken, (1 − α)100% of the intervals constructed around their sample means would include the true population mean, and only 100α% of them would not.
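
This repeated-sampling interpretation can be illustrated by simulation. The Python sketch below draws many samples from a normal population with known σ, builds a 95% interval from each using equation (26), and counts how often the interval contains the true mean. The population values, the number of trials and the random seed are all hypothetical choices made for illustration.

```python
import random
from statistics import NormalDist

random.seed(1)  # fixed seed so the run is repeatable

mu, sigma, n = 2.0, 0.05, 100    # hypothetical population mean, sd and sample size
z = NormalDist().inv_cdf(0.975)  # z value for a 95% interval (about 1.96)
half_width = z * sigma / n ** 0.5

trials = 2000
covered = 0
for _ in range(trials):
    sample_mean = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    # Does this sample's interval contain the true mean?
    if sample_mean - half_width <= mu <= sample_mean + half_width:
        covered += 1

coverage = covered / trials
print(coverage)  # should be close to 0.95
```

Any single interval either contains µ or does not; the 95% describes the long-run proportion of intervals that do.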

Example
The inspection division of the Tanzania Bureau of Standards Weights and Measures
Department is interested in estimating the actual amount of soft drink that is placed
in 2-litre bottles at the local bottling plant of a large nationally known soft drink
company in Tanzania. The bottling plant has informed the inspection division/team
that the standard deviation for 2-litre bottles is 0.05 litre. A random sample of 100
2-litre bottles obtained from this bottling plant indicates a sample average of 1.99
litres.

i. Set up a 95% confidence interval estimate of the true average amount of soft
drink in each bottle

ii. Does the population of soft drink fill have to be normally distributed here?
Explain

iii. Explain why an observed value of 2.02 litres would not be unusual, even though
it is outside the confidence interval you calculated

iv. Suppose that the sample average changed to 1.97 litres. What would be your answer to (i)?

Table 2.2a shows calculation of required confidence interval. As shown in the ta-
ble, the confidence limits (lower and upper) are 1.9802 and 1.9998 respectively or
1.9802 ≤ µ ≤ 1.9998. This interval includes the average (mean) amount (1.99 litre)
of soft drink that is placed in 2-litre bottles at the local bottling plant of the com-
pany. We are 95% confident that the true population mean of soft drink in 2-litre
bottles is between 1.9802 and 1.9998 litres.

Table 2.2a: Calculation of confidence interval estimate

Given: X̄ = 1.99, Z = 1.96 (for 95% CI), σ = 0.05, and n = 100

\[ \text{CI: } \bar{X} \pm Z \times \frac{\sigma}{\sqrt{n}} = 1.99 \pm 1.96 \times \frac{0.05}{\sqrt{100}} = 1.99 \pm 0.0098 \]

Therefore, 95% CI: 1.9802 ≤ µ ≤ 1.9998, or (1.9802, 1.9998)

As regards part (ii), the answer is no. In the present case σ is known and n = 100 is large, so by the central limit theorem we may assume that X̄ is approximately normally distributed. Therefore, the assumption of normality is not required.

Concerning part (iii), we argue that the observed value of 2.02 litres should not be
considered unusual even though it is outside the confidence interval calculated in
part (i) because the CI represents the estimate of the average of a sample of 100
2-litre bottles, not an individual value. Moreover, the individual value of 2.02 is only
0.6 standard deviation above the sample mean.

When the sample average changed to 1.97 litres the answer to part (i) would be as
shown in the calculation table 2.2b. We see that the interval 1.9602 ≤ µ ≤ 1.9798
also includes the average (mean) amount (1.97 litre) of soft drink that is placed in
2-litre bottles at the local bottling plant of the company. Therefore, we are 95%
confident that the true population mean of soft drink in 2-litre bottles is between
1.9602 and 1.9798 litres.
Table 2.2b: Calculation of confidence interval estimate
Given: X̄ = 1.97, Z = 1.96 (for 95% CI), σ = 0.05, and n = 100

\[ \text{CI: } \bar{X} \pm Z \times \frac{\sigma}{\sqrt{n}} = 1.97 \pm 1.96 \times \frac{0.05}{\sqrt{100}} = 1.97 \pm 0.0098 \]

Therefore, 95% CI: 1.9602 ≤ µ ≤ 1.9798, or (1.9602, 1.9798)
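
Both calculation tables follow the same recipe, which can be sketched as a small Python function based on equation (26). The function name `z_interval` and its parameters are illustrative; the z value is obtained from the standard library's `NormalDist` rather than read from a table, so it is 1.95996... rather than exactly 1.96.

```python
from statistics import NormalDist


def z_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for the mean when sigma is known, equation (26)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma / n ** 0.5
    return sample_mean - half_width, sample_mean + half_width


lo_a, hi_a = z_interval(1.99, 0.05, 100)  # Table 2.2a
lo_b, hi_b = z_interval(1.97, 0.05, 100)  # Table 2.2b
print(round(lo_a, 4), round(hi_a, 4))  # 1.9802 1.9998
print(round(lo_b, 4), round(hi_b, 4))  # 1.9602 1.9798
```

Changing only the sample mean shifts the interval without changing its width, since the width depends on σ, n and the confidence level alone.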

Exercise
1. A market researcher states that she has 95% confidence that the true average
monthly sales of a product will be between TZS 1,700,000 and TZS 2,000,000. Ex-
plain the meaning of this statement.

2. The manager of the National Microfinance Bank (NMB PLC) Morogoro branch wants to estimate the mean waiting time of customers who visit the branch for cash deposit or withdrawal during the end-of-month period. To achieve his goal, the manager selects a random sample of 30 customers, and the results indicate a sample average waiting time of 4,750 seconds and a standard deviation of 1,200 seconds.

i. Set up a 95% confidence interval estimate of the true average amount of waiting
time of customers at this branch

ii. If the individual stayed for 4,000 seconds, would this be considered unusual?
Explain your answer

3. The quality control manager at a light bulb factory situated along Pugu road needs to estimate the average life of a large shipment of light bulbs. The process standard deviation is known to be 100 hours. A random sample of 50 light bulbs indicates a sample average life of 350 hours.

i. Set up a 95% confidence interval estimate of the true average life of light bulbs in this shipment

ii. Does the population of light bulb life have to be normally distributed here?
Explain

iii. Explain why an observed value of 320 hours would not be unusual, even though
it is outside the confidence interval you calculated

iv. Suppose that the process standard deviation changed to 80 hours. What would be your answer to (i)?

4. Suppose that the manager of a paint supply store wants to estimate the actual
amount of paint contained in 1-gallon cans purchased from Coral Paint company. It
is known from the manufacturer’s specifications that the standard deviation of the
amount of paint is equal to 0.02 gallon. A random sample of 50 cans is selected, and
the average amount of paint per 1-gallon can is 0.995 gallon.

i. Set up a 99% confidence interval estimate of the true population average amount of paint included in a 1-gallon can

ii. Based on your results, do you think that the store owner has a right to complain
to the manufacturer? Why?

iii. Does the population amount of paint per can have to be normally distributed
here? Explain

iv. Explain why an observed value of 0.98 gallon for an individual can would not
be unusual, even though it is outside the confidence interval you calculated

v. Suppose that you used a 95% CI estimate. What would be your answers to (i) and (ii)?

5. The customer service department of TANESCO would like to estimate the average
length of time between entry of service request and the connection of the service. A
random sample of 15 houses connected to LUKE meters is selected from the records
available during the past year. The results recorded in number of days are as follows:
114, 78, 96, 137, 78, 117, 126, 86, 99, 114, 72, 104, 73, 86.

i. Set up a 95% confidence interval estimate of the population average waiting


time in the past year

ii. What assumption about the population distribution must be made in (a)?

iii. Suppose the last value was 286 days instead of 86. What would be your answer
to (a)? What effect does this change have on the confidence interval?

2.1.2.3 Confidence interval for a difference of population means (µ1 − µ2)
Rarely in real life is our interest confined to a single population. Rather, we are
usually interested in conducting experiments to compare populations. For example,
suppose we wish to compare the effects of two concentrations of a toxic agent on
weight loss in rats. We select a random sample of rats from the population of interest
and then randomly assign each rat to receive either concentration 1 or concentration
2. The variable of interest is

X = weight loss for a rat.

Until the rats receive the treatments, we may assume them all to have arisen from a
common population for which X has some mean µ and variance σ². Because these
are continuous measurement data, it is reasonable to further assume that $X \sim N(\mu, \sigma^2)$.

Because of the nature of the data, it is further reasonable to think about two random
variables X1 and X2, one corresponding to each population, and to think of them
as being normally distributed:

Population 1: $X_1 \sim N(\mu_1, \sigma_1^2)$

Population 2: $X_2 \sim N(\mu_2, \sigma_2^2)$

In this framework we may now cast our question as follows:

• Is there a difference in response for the 2 treatments? That is, is µ1 different
from µ2?

• More formally, we may look at the difference $(\mu_1 - \mu_2)$:

$(\mu_1 - \mu_2) = 0$: no difference

$(\mu_1 - \mu_2) \neq 0$: real difference

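Facts: It may be shown mathematically that if both of the random variables are
normally distributed and independent of each other, then the following facts are true:

• The random variable $(X_1 - X_2)$ satisfies

$$(X_1 - X_2) \sim N(\mu_1 - \mu_2, \sigma_D^2), \quad \text{where } \sigma_D^2 = \sigma_1^2 + \sigma_2^2$$

• Define $\bar{D} = \bar{X}_1 - \bar{X}_2$. Then

$$\bar{D} \sim N(\mu_1 - \mu_2, \sigma_{\bar{D}}^2), \quad \sigma_{\bar{D}}^2 = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$$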

By analogy to the single-population case, the statistic

$$\frac{\bar{D} - (\mu_1 - \mu_2)}{\sigma_{\bar{D}}}$$

follows a standard normal distribution.

Intuition: Use $\bar{D}$ as an estimator of $(\mu_1 - \mu_2)$ and report an interval assessing the
quality of the sample evidence.

In practical situations σ12 and σ22 will be unknown. The obvious strategy would be
to replace them by estimates.

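We consider the case in which two conditions hold:

(i) $n_1 = n_2 = n$, (ii) $\sigma_1^2 = \sigma_2^2 = \sigma^2$

Under these conditions it can be shown that a “pooled” estimate of the common
$\sigma^2$ is given by

$$s^2 = \frac{(n-1)s_1^2 + (n-1)s_2^2}{2(n-1)} = \frac{s_1^2 + s_2^2}{2}$$

Because the two sample sizes are equal, an obvious estimator for $\sigma_{\bar{D}}^2$ is
$s_{\bar{D}}^2 = 2\left(\frac{s^2}{n}\right)$.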

Just as in the single-population case, we would thus consider the statistic

$$\frac{\bar{D} - (\mu_1 - \mu_2)}{s_{\bar{D}}}, \quad s_{\bar{D}} = \sqrt{\frac{2s^2}{n}}$$

It may be shown that this statistic has a Student’s t distribution with 2(n−1) degrees
of freedom.

Confidence interval for (µ1 − µ2 )

By the same reasoning as in the single-population case, the confidence interval for
(µ1 − µ2 ) is thus

(D̄ − t2(n−1),α/2 sD̄ , D̄ + t2(n−1),α/2 sD̄ )

Example
The following data concern two types of rations, A and B, being fed to pigs. An
experiment was conducted in which 12 randomly selected pigs were fed ration A and
12 were fed ration B with the goal of determining whether there is a difference in
the weight gains (lbs) for pigs fed the two different rations.

A: 31 34 29 26 32 35 38 34 30 29 32 31
B: 26 24 28 29 30 29 32 26 31 29 32 28

Assume the normality assumption is reasonable; find a 95% confidence interval for
the difference in means (µ1 − µ2 )

Table 2.3 shows the calculation of the required confidence interval. As shown in the
table, the confidence limits (lower and upper) are 0.6689 and 5.4978 respectively, or
0.6689 ≤ µ1 − µ2 ≤ 5.4978. This interval is centred on the observed difference in
sample means (3.0833 lbs) between ration A and ration B given to the pigs. We are
95% confident that the true difference in population mean weight gain (lbs) for pigs
fed the two different rations is between 0.6689 and 5.4978 lbs. Because the interval
does not contain 0, the data suggest that the mean weight gains differ. Note that the
calculations in Table 2.3 can be obtained in SPSS as shown in Table A of the Appendix.

Table 2.3: Calculation of confidence interval estimate


Let populations 1 and 2 be that of pigs fed ration A and B respectively.

Here n1 = n2 = n = 12. The usual calculations give $\bar{X}_1 = 31.75$, $\bar{X}_2 = 28.6667$.

Also (n−1) = 11, so that $s^2 = \frac{(n-1)s_1^2 + (n-1)s_2^2}{2(n-1)} = 8.1326$ (Check!)

Thus, we have $\bar{D} = \bar{X}_1 - \bar{X}_2 = 3.0833$ and

$$s_{\bar{D}} = \sqrt{2\left(\frac{s^2}{n}\right)} = \sqrt{\frac{2(8.1326)}{12}} = 1.1642$$

From tables of the t distribution, we have $t_{22,0.025} = 2.074$ (Check!).

For α = 0.05, a 95% confidence interval for $(\mu_1 - \mu_2)$ is

$(\bar{D} - t_{2(n-1),\alpha/2}\, s_{\bar{D}},\ \bar{D} + t_{2(n-1),\alpha/2}\, s_{\bar{D}})$ =

[3.0833 − (2.074 × 1.1642), 3.0833 + (2.074 × 1.1642)] ≈ (0.6689, 5.4978)

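SPSS output (Paired Differences, VA − VB):

          Mean    Std. Dev.  Std. Error  95% CI of the diff.   t      df   Sig. (2-tailed)
                                         Lower      Upper
VA − VB   3.083   3.942      1.138       0.579      5.588      2.710  11   0.020

Note: this output treats the two samples as paired (df = 11), so its limits differ
slightly from those of the pooled calculation above (df = 22).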
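The pooled calculation in Table 2.3 can be checked with a short script. This is only a sketch under the same assumptions as the text (equal sample sizes, common variance); the function name `pooled_ci_equal_n` is hypothetical, and the critical value 2.074 is taken from the t table rather than computed.

```python
from math import sqrt
from statistics import mean, variance

# Weight gains (lbs) from the example above
ration_a = [31, 34, 29, 26, 32, 35, 38, 34, 30, 29, 32, 31]
ration_b = [26, 24, 28, 29, 30, 29, 32, 26, 31, 29, 32, 28]


def pooled_ci_equal_n(x1, x2, t_crit):
    """CI for mu1 - mu2 assuming n1 = n2 = n and a common variance."""
    n = len(x1)
    if len(x2) != n:
        raise ValueError("this sketch assumes equal sample sizes")
    d_bar = mean(x1) - mean(x2)               # estimate of mu1 - mu2
    s2 = (variance(x1) + variance(x2)) / 2    # pooled variance estimate
    s_d = sqrt(2 * s2 / n)                    # standard error of d_bar
    return d_bar - t_crit * s_d, d_bar + t_crit * s_d


T_22_0025 = 2.074  # t table value for 22 df, upper 2.5% point (assumed input)
lower, upper = pooled_ci_equal_n(ration_a, ration_b, T_22_0025)
print(f"95% CI for mu1 - mu2: ({lower:.3f}, {upper:.3f})")
```

The limits agree with those reported in the text up to rounding of the critical value.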

Exercise
1. For the same problem, construct a 90% CI and comment on the results by com-
paring with the 95% C.I. obtained above.

2. Write expressions for confidence limits (lower and upper) for the difference be-
tween two population means (µ1 − µ2 ) in case σ12 = σ22 (but known) and n1 = n2

2.2 Elementary Probability

2.2.1 Introduction

We have already discussed the notion of random samples and the different methods
that are commonly used in drawing them from the population of interest. The word
“random” implies that chance is involved.

• The chance of Kenya to win in the African Cup of Nations is 15%

• I will probably pass MB 101

• The chance of rain today is 35%, etc.

Chances in statistics arise because we deal with random samples. A discussion
of probability provides a formal framework for describing chance associated with
data.

Let us describe the idea of probability in terms of what is probably the simplest
situation where chance is involved: the flip of a coin. We develop the terminology
and properties first from this simple situation, and then extend the ideas behind
them to real situations.

2.2.2 Some basic terminologies

2.2.2.1 Random experiment


A process for which no outcome may be predicted with certainty. For example,
in tossing a “fair” coin, whether the result is “Heads” or “Tails” cannot be known in advance.
Sample space: the set of all possible (mutually exclusive) outcomes of an experiment.
Notation
S denotes sample space. For example, tossing a coin once, S={H, T}
Event: A possible result of an experiment, that is, a subset of S. For example, tossing
a coin twice, S={HH, TH, HT, TT}. Thus, each element in S is a possible result of the
experiment, and a collection of elements such as {HH, TH, HT} (“at least one head”)
is also an event.
Notation
E will be used to denote an event.

2.2.2.2 Mutually exclusive events
Events whose outcomes do not overlap, i.e., events that cannot happen together.

2.2.2.3 Probability function


A probability function P assigns a number between 0 and 1 to an event.

For any event E, $0 \le P(E) \le 1$

$$P(S) = \sum_i P(O_i) = 1$$

where $O_i$ denotes the i-th outcome in S.

Intuitively we assign probability 1 to an event that must occur and a probability 0
to an event that cannot occur.
Probability of an event E occurring (when all outcomes in S are equally likely):

$$P(E) = \frac{\text{\# of outcomes in } S \text{ associated with } E}{\text{Total \# of possible outcomes in } S}$$

2.2.2.4 Exhaustive events


The total # of possible outcomes in any trial. For example, in throwing a die there
are six exhaustive cases.

2.2.2.5 Equally likely events


Events that have the same chance of occurring.

If $P(E)$ is the probability of occurrence of E, then $P(\bar{E}) = 1 - P(E)$ is the probability
of nonoccurrence of E

2.2.2.6 Addition law for mutually exclusive events


Probability that either A or B occurs is

$$P(A \cup B) = P(A) + P(B)$$

In general,

$$P(A_1 \cup A_2 \cup \dots \cup A_n) = P(A_1) + P(A_2) + \dots + P(A_n) = \sum_{i=1}^{n} P(A_i)$$

2.2.2.7 Addition law for not mutually exclusive events


Probability that either occurs is $P(A \cup B) = P(A) + P(B) - P(A \cap B)$. This is the
general rule of addition.
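The general rule of addition can be verified by direct enumeration. The sketch below assumes a single throw of a fair die; the events "even" and "greater than 3" are illustrative choices, not taken from the notes.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one throw of a fair die (assumed example)


def prob(event):
    """Classical probability: favourable outcomes / total outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))


even = {2, 4, 6}
greater_than_3 = {4, 5, 6}  # overlaps with `even`, so not mutually exclusive

lhs = prob(even | greater_than_3)
rhs = prob(even) + prob(greater_than_3) - prob(even & greater_than_3)
print(lhs, rhs)
```

Both sides equal 2/3. Dropping the intersection term would give 1, which overstates the probability because the outcomes 4 and 6 are counted twice.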

Exercise
1. Discuss and criticize the following:
P(A) = 2/3, P(B) = 1/4, P(C) = 1/6, for the probability of three mutually exclusive
events A, B, and C.
2. Let A and B be two events associated with an experiment. Suppose that P(A) =
0.4 and P(A ∪ B) = 0.7. Let P(B) = p. For what choice of p are A and B mutually
exclusive?

2.2.2.8 Conditional probability


The conditional probability of A relative to B, written $P(A \mid B)$ (also written P(A/B)), is:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0.$$

Likewise,

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}, \quad P(A) > 0$$

2.2.2.9 Independent events


An event B is said to be independent of A if the conditional probability of B
given A, that is, $P(B \mid A)$, is equal to the unconditional probability of B. That is,
$P(B \mid A) = P(B)$. Hence, the events A and B are independent if and only if
$P(A \cap B) = P(A) \times P(B)$. This is what is known as the multiplication law for
independent events. This can be extended to any # of events which are independent:

$$P(A_1 \cap A_2 \cap \dots \cap A_n) = P(A_1) P(A_2) \cdots P(A_n) = \prod_{i=1}^{n} P(A_i)$$

Examples: see exercise on probability

2.2.2.10 Multiplication law for not independent events


$P(A \cap B) = P(A) P(B \mid A) = P(B) P(A \mid B)$
This is the general rule of multiplication.
Example: see exercise on probability

2.2.2.11 Bayes’ rule


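Let $E_1, E_2, \dots, E_n$ be mutually exclusive and exhaustive events. For any event A
with P(A) > 0, one has

$$P(E_k \mid A) = \frac{P(A \mid E_k) P(E_k)}{\sum_{i} P(A \mid E_i) P(E_i)}$$

for any k (k = 1, 2, ..., n)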

Example
Two identical urns each contain 10 balls. Urn U contains 6 white and 4 black balls, and
urn V contains 2 white and 8 black balls. Select an urn at random and without
looking at its content, guess whether you have urn U . At this moment, you have
P (U ) = P (V ) = 1/2. Now, you may gather additional information by drawing one
ball at random from the urn and looking at its color. Assume that the ball is white.

What is your best guess now? Hint: use Bayes’ rule.
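One way to check an answer to the urn example is to apply Bayes' rule with exact fractions. The helper `posterior` below is a hypothetical name used only for this sketch; the priors and the proportions of white balls come from the example.

```python
from fractions import Fraction


def posterior(priors, likelihoods, k):
    """Bayes' rule: P(E_k | A) for mutually exclusive, exhaustive events E_i."""
    total = sum(priors[e] * likelihoods[e] for e in priors)  # P(A)
    return priors[k] * likelihoods[k] / total


priors = {"U": Fraction(1, 2), "V": Fraction(1, 2)}       # urn chosen at random
p_white = {"U": Fraction(6, 10), "V": Fraction(2, 10)}    # P(white | urn)

p_u_given_white = posterior(priors, p_white, "U")
print(p_u_given_white)  # 3/4
```

After seeing a white ball, the probability of holding urn U rises from 1/2 to 3/4, so U is the better guess.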

Random variables
Statistical methods for analyzing data have their foundations in probability. Samples
are chosen at random meaning that an element of chance is introduced. Thus,
the observations are best viewed as random, that is, they are subject to chance.

We therefore name the variable of interest a random variable. Data are the obser-
vations on the random variable.
Notation
X=random variable (often abbr. as r.v). X1 , X2 , . . . , Xn are observations on X.
Results
Events of interest may be formulated in terms of random variables. For example,
when tossing a coin twice, let X = # of heads (H) in 2 coin tosses.

Represent events as E1 = {X = 1}, E2 = {X ≥ 1}

The probability of the events may be written in terms of X, e.g., P(E1) = P(X = 1),
P(E2) = P(X ≥ 1)

2.2.3 Probability density function (discrete r.v)

A probability function f(x) of a discrete random variable (also called probability
mass or density function) is a function that assigns a probability to each
value within the range of a discrete random variable X. For example, consider tossing
a coin twice and observing the # of “Heads” in the two tosses. Thus, X is a discrete
random variable; it may take on the values 0, 1 and 2.

Exercise Find the probability function of X

Notation
We denote the probability function of a discrete random variable by f (x) or p(x)

If f(x) represents the probability that X = x, i.e., P[X = x], then the following must
be satisfied:

(i) $0 \le f(x) \le 1$ for all x

(ii) $\sum_i f(x_i) = 1$

Probability distribution function for discrete r.v: The probability distribution
function for a discrete random variable X, denoted by F(x), is defined by

$$F(x) = P[X \le x] = \sum_{t \le x} f(t)$$

Example Refer to the previous example.
X f (x) = P [X=x] F (x) = P [X ≤ x]
0 1/4 1/4
1 2/4=1/2 3/4
2 1/4 1
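The table above can be reproduced by enumerating the sample space of two coin tosses. This is a minimal sketch assuming a fair coin; the variable names `pmf` and `cdf` are illustrative.

```python
from fractions import Fraction
from itertools import product

# All equally likely outcomes of two tosses: HH, HT, TH, TT
outcomes = list(product("HT", repeat=2))

# f(x) = P[X = x], where X is the number of heads
pmf = {}
for outcome in outcomes:
    heads = outcome.count("H")
    pmf[heads] = pmf.get(heads, 0) + Fraction(1, len(outcomes))

# F(x) = P[X <= x], the running sum of f(t) for t <= x
cdf = {}
running = Fraction(0)
for x in sorted(pmf):
    running += pmf[x]
    cdf[x] = running

for x in sorted(pmf):
    print(x, pmf[x], cdf[x])
```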

Note: The term cumulative distribution function is often used instead of probability
distribution function.

2.2.4 Probability density function (continuous r.v)

A probability density function (pdf) for a continuous random variable X is a function
f that possesses the following properties:

(i) $f(x) \ge 0$, (ii) $\int_{-\infty}^{\infty} f(x)\,dx = 1$, (iii) $P[a \le X \le b] = \int_a^b f(x)\,dx$

Example
Given the function

$$f(x) = \begin{cases} c(1-x)^2, & 0 \le x \le 1 \\ 0, & \text{elsewhere} \end{cases}$$

Determine the value of c which makes f(x) a density function.

Find P[X > 1/2]
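A worked sketch of this example, using properties (ii) and (iii) of a pdf, is given below so that the reader can compare it with their own solution:

```latex
% Property (ii): the density must integrate to 1
\int_0^1 c(1-x)^2\,dx = c\left[-\frac{(1-x)^3}{3}\right]_0^1 = \frac{c}{3} = 1
\quad\Longrightarrow\quad c = 3

% Property (iii): probability of the interval (1/2, 1]
P\left[X > \tfrac{1}{2}\right] = \int_{1/2}^{1} 3(1-x)^2\,dx
= \left[-(1-x)^3\right]_{1/2}^{1} = \left(\tfrac{1}{2}\right)^3 = \frac{1}{8}
```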

Definition: The distribution function F for a continuous r.v. X is defined by:

$$F(x) = P[X \le x] = \int_{-\infty}^{x} f(t)\,dt$$

Mathematical expectation (expected value): The mathematical expectation
(expected value) of a r.v. X, denoted by E(X), is defined by:

$E(X) = \sum_x x f(x)$ if X is discrete

$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$ if X is continuous and the integral exists.

Properties of expectations

i. E[a] = a, where a is a constant
ii. E[aX] = aE[X]
iii. E[a1X + a2Y] = a1E[X] + a2E[Y] for any random variables X and Y and real
constants a1 and a2
iv. var(aX) = a² var(X) for a real constant a. Note: var(X) = E[X²] − (E[X])²
v. var(X + a) = var(X)
vi. var(aX + b) = a² var(X)

2.2.5 Discrete distributions

2.2.5.1 The Binomial distribution


A Binomial experiment consists of n (fixed) repeated independent Bernoulli trials,
each with probability of success p. If X represents the total number of successes
in a binomial experiment with n trials and probability of success p, then X is called
a binomial random variable with parameters n and p.

Definition
A random variable X is said to follow a binomial distribution if it assumes only
non-negative values and its probability mass function is given by:

$$p(x) = P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, \dots, n, \quad q = 1 - p$$

and p(x) = 0 elsewhere.

Notation
X∼B(n, p) denotes that the random variable X follows binomial distribution with
parameters n and p

Note:
1. A Bernoulli trial is an experiment with only two possible outcomes, a success (s)
and a failure (f). Its p.m.f is given by: $f(x) = p^x (1-p)^{1-x}$, x = 0, 1 (0 < p < 1).
2. $\binom{n}{x} = {}^{n}C_x = \frac{n!}{x!(n-x)!}$

Physical conditions for binomial distribution


1. The number of trials n is finite
2. The probability of success p is constant for each trial
3. The trials are independent of each other
4. Each trial results in two mutually exclusive outcomes-success and failure

For example, tossing of a coin, throwing of dice, drawing cards from a pack of cards
with replacement etc, lead to binomial probability distribution.
Properties of the binomial distribution
If X has a binomial distribution with parameters n and p then E[X] = np and
var[X] = npq = np(1−p)

Hint: $(a+b)^n = \sum_{x=0}^{n} \binom{n}{x} a^{n-x} b^x$
Examples
1. An investor buys five single residence dwellings as an investment. He assumes

that the probability he will make a profit on each is 0.8. Assuming independence;

(a) What is the probability that he makes a profit on each one?


(b) What is the probability that he makes a loss on each one?

2. A manufacturer of small parts ships his parts in lots of size 20 to his customer.
Assume that each part is or is not defective and that the probability an individual
part is defective is 0.05.

(a) What is the expected number of defective parts per lot?


(b) What is the probability that a particular lot contains no defective parts?
(c) Find P[X ≥ 2], where X is the number of defective parts in a lot
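The two examples above can be checked numerically with the binomial p.m.f. The sketch below uses only the standard library; `binom_pmf` is a hypothetical helper name, and the variable names are illustrative.

```python
from math import comb


def binom_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Example 1: five dwellings, P(profit on each) = 0.8
p_profit_on_all = binom_pmf(5, 5, 0.8)   # profit on every one
p_loss_on_all = binom_pmf(0, 5, 0.8)     # no profit on any one

# Example 2: lots of 20 parts, P(defective) = 0.05
n, p = 20, 0.05
expected_defective = n * p               # E[X] = np
p_no_defective = binom_pmf(0, n, p)
p_at_least_two = 1 - binom_pmf(0, n, p) - binom_pmf(1, n, p)

print(round(p_profit_on_all, 5), round(p_loss_on_all, 5))
print(expected_defective, round(p_no_defective, 4), round(p_at_least_two, 4))
```

For part (c), working with the complement P[X ≥ 2] = 1 − P[X = 0] − P[X = 1] avoids summing nineteen terms.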
Exercise
Suppose airplane engines operate independently in a flight and fail with probability
1/5. Assume that an airplane makes a safe flight if at least half of its engines
run. Determine whether a four-engine plane or a two-engine one has the higher
probability of making a safe flight.

2.2.5.2 The Poisson distribution


Definition
A random variable X is said to follow a Poisson distribution if it assumes only
non-negative values and its probability mass function is given by:

$$p(x, \lambda) = P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}; \quad x = 0, 1, 2, \dots$$

λ is the parameter of the distribution
Notation
X∼P( λ) denotes that the random variable X follows Poisson distribution with pa-
rameter λ.

Instances where Poisson distribution may be employed.


1. Number of suicides reported in a particular city
2. Number of air accidents in some unit of time
3. Number of faulty blades in a packet of 100
4. Number of printing mistakes at each page of the book
5. Number of deaths from a disease such as heart attack or cancer or due to snake
bite

Properties of the Poisson distribution
If a r.v X follows Poisson distribution with parameter λ then E(X) = λ and
var(X) = λ

Examples
1. Let X be the number of typing errors a typist makes per typed page. Suppose the
typist makes an average of 3 errors per page. What is the density function of X? Find
E[X], var(X), P[X ≥ 3]. What is the probability that there is no typing error in 2
typed pages?
2. A car hire firm has two cars, which it hires out day by day. The number of demands
for a car on each day is distributed as a Poisson variate with mean 1.5. Calculate the
proportion of days on which neither car is used.
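The Poisson examples above can be evaluated with a few lines of code. This is a sketch, not a full solution: `poisson_pmf` is a hypothetical helper, and the two-page calculation assumes errors on different pages are independent, so the count over 2 pages is Poisson with mean 2 × 3 = 6.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**x / factorial(x)


# Example 1: typing errors, lambda = 3 per page
p_at_least_3 = 1 - sum(poisson_pmf(x, 3) for x in range(3))
p_no_error_two_pages = poisson_pmf(0, 6)  # assumes independent pages

# Example 2: car hire demand, lambda = 1.5 per day
p_neither_car_used = poisson_pmf(0, 1.5)  # no demand on a given day

print(round(p_at_least_3, 4))
print(round(p_no_error_two_pages, 4))
print(round(p_neither_car_used, 4))
```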

Joint probability distribution


If two discrete random variables X and Y are observed simultaneously, their joint
density function is defined by
f(x, y) = P[X = x, Y = y] for any point (x, y) within the range of X and Y.
Note that a function f(x, y) is a joint probability mass function if and only if it
satisfies

$f(x, y) \ge 0$ for all x, y

$\sum_x \sum_y f(x, y) = 1$

Example
The joint probability distribution of X and Y is given by

$$f(x, y) = \frac{x+y}{30}, \quad x = 0, 1, 2, 3;\ y = 0, 1, 2$$

and f(x, y) = 0 otherwise.
Find (i) P[X ≤ 2, Y = 1], (ii) P[X + Y = 4]
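The joint p.m.f. in this example is small enough to tabulate directly. The sketch below first checks that the probabilities sum to 1 and then adds up the cells belonging to each event; the names `f`, `p_i` and `p_ii` are illustrative.

```python
from fractions import Fraction

xs = range(4)  # x = 0, 1, 2, 3
ys = range(3)  # y = 0, 1, 2


def f(x, y):
    """Joint p.m.f. from the example: (x + y)/30 on the support, else 0."""
    if x in xs and y in ys:
        return Fraction(x + y, 30)
    return Fraction(0)


total = sum(f(x, y) for x in xs for y in ys)                 # must equal 1
p_i = sum(f(x, 1) for x in xs if x <= 2)                     # P[X <= 2, Y = 1]
p_ii = sum(f(x, y) for x in xs for y in ys if x + y == 4)    # P[X + Y = 4]

print(total, p_i, p_ii)
```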

Independence
Two random variables X and Y are said to be independent if and only if fX,Y (x, y) =
fX (x) × fY (y) ∀x,y
Where fX,Y (x, y) is the joint density for X and Y . fX (x) and fY (y) are the marginal
densities for X and Y respectively.

Exercise
For the joint distribution in the example above, the marginal densities are
$f_X(x) = \frac{x+1}{10}$, x = 0, 1, 2, 3, and $f_Y(y) = \frac{2y+3}{15}$, y = 0, 1, 2.

Show whether or not X and Y are independent

2.2.6 Continuous probability distribution

Recap
Continuous random variables
Many of the variables of interest in scientific investigations are continuous; thus,
they may take on any value within an interval. For example, suppose we obtain a sample of n pigs and
weigh them. Thus, the random variable of interest is
X =weight of a pig
and the data are X1 , ...Xn the observed weights for our n pigs. X is a random variable
because the pigs were drawn at random from the population of all pigs. Furthermore,
the pigs do not weigh exactly the same; they exhibit random variation due to
biological and other factors.

Goal
Find a function like f (x) for a discrete r.v. that describes the probability of observing
a pig weighing x units.

This function would thus serve as a model describing the population of pig weights-
how they are distributed and how they vary.
Technical note
For continuous r.v. we do not think about the probability that X is exactly equal
to some value x as we did for a discrete r.v. The reason is that, because of the
limited precision of measuring devices, we can never observe an exact value (and,
mathematically, P(X = x) = 0 for a continuous r.v.). Thus, we instead speak of the
probability that X falls into an interval.

2.2.6.1 The normal distribution


For many types of continuous measurement data (so continuous r.v.s X), there is
a certain function that seems to provide a good description of the probabilities
associated with the measurements. The probability distribution associated with this
probability density function is called the normal (or Gaussian) distribution.

Definition
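A random variable X is said to have a normal distribution with parameters µ and σ² if its pdf is:

$$f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}, \quad -\infty < x < \infty,\ -\infty < \mu < \infty,\ 0 < \sigma < \infty$$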

Notation
We write $X \sim N(\mu, \sigma^2)$ to say that the r.v. X has this (normal) distribution. The
curve of the normal density is bell-shaped and symmetric about x = µ.

Properties
If a r.v X has a normal distribution with parameters µ and σ 2 then, E[X] =µ and
var(X) = σ 2

The standard normal distribution
The probability density function f (x) for a normal distribution has a very compli-
cated form. Thus, it is not possible to evaluate easily probabilities for a normal r.v.
X the way we could for a Poisson or binomial. Luckily, however, these probabilities
are widely available in tables in the special case when µ=0 and σ 2 =1. For example,
see Table 4 on page 7 of statistical tables.

Exercise See problems on probability


We learn how to evaluate normal probabilities when µ and σ 2 are known first; we
will see later how we may use this knowledge to develop statistical methods for es-
timating them when they are not known.

Situation
Suppose X ∼ N (µ, σ 2 ). We wish to calculate probabilities such as P (X ≤ x),
P (X ≥ x), P (x1 ≤ X ≤ x2 ) that is, probabilities of intervals associated with
values of X.
Technical note
When dealing with probabilities for any continuous r.v., we do not make the distinc-
tion between strict inequalities like “<” and “>” and inequalities like “≤” and “≥”
for the reasons discussed above.

Definition
If X is normally distributed with mean µ and variance σ², then $Z = \frac{X-\mu}{\sigma}$ will also
be normally distributed, with mean zero and standard deviation 1. Hence, we call Z
a standard normal r.v., and we write Z ∼ N(0, 1)

Finding probabilities for a standard normal r.v.


 
For the event of interest, $(x_1 \le X \le x_2) \Leftrightarrow \left(\frac{x_1-\mu}{\sigma} \le Z \le \frac{x_2-\mu}{\sigma}\right)$. We will illustrate
using Table 4 of statistical tables. The body of this table has values $P(0 \le Z \le z_0)$

Example

1. P (Z ≥ 1.23)
2. P (Z ≤ 1.23)
3. Find the value of z such that P (Z ≥ z) = 0.05
4. P (Z ≤ −1.23)
5. P(0.23 ≤ Z ≤ 1.45)
6. P (|Z| ≥ 0.89)
Exercise
1. P (|Z| ≤ 1.68)

2. Find the value of z such that:
(i) P (|Z| ≥ z) = 0.05
(ii) P (|Z| ≤ z) = 0.975

Finding probabilities for any normal r.v.

IDEA: Transform probability statements about $X \sim N(\mu, \sigma^2)$ into statements about
Z, and then use the methods above.

Examples
Suppose µ=8 and variance σ 2 =4, X ∼ N (8, 4)
(i) P (X ≥ 9.5), (ii) P (6 ≤ X ≤ 10) EXERCISE: P (|X| ≥ 8.6)
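As a check on table look-ups, parts (i) and (ii) can be computed with Python's standard library (`statistics.NormalDist`, available from Python 3.8). This is a sketch; the variable names are illustrative. It also shows that standardising to Z gives the same answer as working with X directly.

```python
from statistics import NormalDist

X = NormalDist(mu=8, sigma=2)  # variance 4 means sigma = 2
Z = NormalDist()               # standard normal, N(0, 1)

# (i) P(X >= 9.5), directly and via Z = (X - mu) / sigma
p_i = 1 - X.cdf(9.5)
z = (9.5 - 8) / 2              # 0.75
p_i_via_z = 1 - Z.cdf(z)

# (ii) P(6 <= X <= 10) = P(-1 <= Z <= 1)
p_ii = X.cdf(10) - X.cdf(6)

print(round(p_i, 4), round(p_i_via_z, 4), round(p_ii, 4))
```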

3 Chapter Three: Sampling distributions

3.1 Introduction

Since statistics are random variables, their values will vary from sample to sample,
and it is customary to refer to their distributions as sampling distributions.

3.1.1 Sampling distribution of the mean

If X1, X2, ..., Xn constitute a random sample from an infinite population with mean µ and
variance σ², then $E(\bar{X}) = \mu$ and $\mathrm{var}(\bar{X}) = \frac{\sigma^2}{n}$.
As we have discussed, one goal is to use X1 , X2 , . . . , Xn to estimate µ and σ 2 . We
use statistics like X̄ and s2 as estimators for these unknown parameters.

Because $\bar{X}$ and s² are based on observations on a random variable, they themselves
are random variables. Thus, we may think about the populations of all possible
values they may take on (from all possible samples of size n). It is natural to think
of the probability distributions associated with these populations.

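Probability distribution of $\bar{X}$: If $X \sim N(\mu, \sigma^2)$, then the distribution of all
possible values of $\bar{X}$ is also normal:

$$\bar{X} \sim N(\mu_{\bar{X}}, \sigma_{\bar{X}}^2), \quad \mu_{\bar{X}} = \mu, \quad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$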

Idea
Using these facts, transform events about $\bar{X}$ into events about a standard normal r.v.

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

Example
$\bar{X}$ is based on a sample of size n = 25 observations on a random variable
X ∼ N(6, 9). Here µ = 6 and σ = 3. Find $P(\bar{X} \ge 6.9)$
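The example can be worked by standardising the sample mean. The sketch below follows the idea stated above and uses `statistics.NormalDist` in place of Table 4; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 6, 3, 25

se = sigma / sqrt(n)            # standard deviation of X-bar: 3/5 = 0.6
z = (6.9 - mu) / se             # standardised value, approximately 1.5
p = 1 - NormalDist().cdf(z)     # P(X-bar >= 6.9) = P(Z >= 1.5)

print(round(se, 2), round(z, 2), round(p, 4))
```

Note that the same cut-off applied to a single observation X would give z = 0.3, a much larger tail probability, because X varies more than the mean of 25 observations.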

3.1.2 The Student’s t distribution

Recap
Use statistics to estimate µ, the population mean, using the obvious estimator, $\bar{X}$.
We would like to be able to make statements about “how likely” it is that $\bar{X}$ would
take on certain values. We saw above that this involves appealing to the normal
distribution, $\bar{X} \sim N(\mu, \sigma_{\bar{X}}^2)$

Practical problem
σ², and hence $\sigma_{\bar{X}}^2$, is not known. An obvious approach would be to replace $\sigma_{\bar{X}}$ in
our standard normal statistic $\frac{\bar{X}-\mu}{\sigma_{\bar{X}}}$ by the obvious estimator $s_{\bar{X}}$ and consider instead
the statistic $\frac{\bar{X}-\mu}{s_{\bar{X}}}$. The probability distribution associated with the values taken on
by this quantity is called the Student’s t distribution with (n−1) degrees of freedom.
(We discuss the notion of degrees of freedom in a moment.)

Notation
We will write tv to denote a r.v. with the t distribution with v degrees of freedom

Tables of probabilities associated with the values of a r.v. with the t distribution are
readily available. For example Table 5 of Statistical Tables on page 6.

Note: The t distribution is centered at 0 and has the same symmetric, bell
shape as the normal, but its probabilities in the extreme “tails” are larger
than those of the normal distribution. Again, as always, the total area is equal to 1.

Examples of using the table

1. Suppose v=10. Find the value of t such that P (tv ≥ t) = 0.05.


2. Suppose v=10. Find the value of t such that P (|tv | ≥ t) = 0.05.
3. P (t4 ≥ t) = 0.05
4. P (|t6 | ≥ t) = 0.05
5. P (|t9 | ≤ t) = 0.90

Note: In some cases, the values desired won’t be in the table. In this case, we
report a range within which the required probability lies.

Example: Find P (t4 ≥ 3.258)

3.1.3 The Chi-square (χ2 ) distribution

Probability distribution of s²: If $X \sim N(\mu, \sigma^2)$, then it may be shown mathematically
that the values taken on by $\frac{(n-1)s^2}{\sigma^2}$ are well represented by another distribution
different from the normal. This distribution is known as the chi-square distribution
with n−1 degrees of freedom. This is often written as χ² (Greek letter chi). See Table 6 of
Statistical Tables on page 9.
Note: The χ2 distribution is not symmetric. The total area is equal to 1

Examples of using the table

1. Suppose v=13. Find x so that $P(\chi_v^2 \ge x) = 0.050$.
2. Suppose v=24. What is $P(\chi_v^2 \ge 29.3)$?
3. Suppose v=14. Find x so that $P(\chi_v^2 \le x) = 0.25$.
4. Suppose v=18. Find $P(\chi_v^2 \le 8.94)$

3.1.4 The F distribution

Another distribution that plays an important role in connection
with sampling from normal populations is the F distribution.

If U and V are independent random variables having chi-square distributions with
v1 and v2 degrees of freedom, then

$$F = \frac{U/v_1}{V/v_2}$$

is a r.v. having an F distribution with v1 and v2 degrees of freedom.

It is also known that if $s_1^2$ and $s_2^2$ are the variances of independent random samples
of size n1 and n2 from normal distributions with the variances $\sigma_1^2$ and $\sigma_2^2$, then

$$F = \frac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2} = \frac{\sigma_2^2 s_1^2}{\sigma_1^2 s_2^2}$$

is a r.v. having an F distribution with n1−1 and n2−1 degrees of freedom.

Degrees of freedom
For the statistics $\frac{\bar{X}-\mu}{s_{\bar{X}}}$, $\frac{(n-1)s^2}{\sigma^2}$, and $\frac{\sigma_2^2 s_1^2}{\sigma_1^2 s_2^2}$, which follow the t, χ² and F
distributions respectively, the notion of degrees of freedom has arisen. The probability
associated with each of these statistics depends on the sample size n through the
degrees of freedom value (n−1). What is the meaning of this?
Note that all of these statistics depend on s², and recall that

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2$$

Recall also that it is always true that $\sum_{i=1}^{n} \left(X_i - \bar{X}\right) = 0$

Thus, if we know the deviations about $\bar{X}$ of (n−1) of the observations in our sample, we
may always compute the last deviation, because the deviations about $\bar{X}$ of all n of them
must sum to zero. Thus, s² must be thought of as being based on (n−1) “independent”
deviations; the final deviation can be obtained from the other (n−1).
The term degrees of freedom thus has to do with the fact that there are (n−1)
“free” or “independent” quantities upon which the r.v.s above are based.
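The constraint behind the (n−1) degrees of freedom can be seen numerically. The sketch below reuses the ration A weight gains from Chapter 2 purely as illustrative data: once the first n−1 deviations are known, the last one is forced by the requirement that all deviations sum to zero.

```python
from statistics import mean

# Illustrative data: ration A weight gains (lbs) from the earlier example
weights = [31, 34, 29, 26, 32, 35, 38, 34, 30, 29, 32, 31]

x_bar = mean(weights)
deviations = [x - x_bar for x in weights]

# The deviations about the sample mean always sum to zero ...
total_deviation = sum(deviations)

# ... so the last deviation is determined by the other n - 1
recovered_last = -sum(deviations[:-1])

print(total_deviation, recovered_last, deviations[-1])
```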

4 Chapter Four: Hypothesis testing

4.1 Introduction

Often in real life we take observations on a sample with a specific question in mind.
Hypothesis testing is another way of analysing data. It begins with some theory, claim,
or assertion about a particular parameter of a population. For example, the branch
manager of CRDB bank may claim that the average waiting time for customers to
be served is 368 seconds, which the manager believes is a reasonable time. If the
claim holds, no corrective measure/action is needed in the service/teller department
of the bank. An average waiting time below 368 seconds is considered unusual and
calls for reallocating some tellers to other branches within the country. In contrast,
an average waiting time above 368 seconds is considered intolerable for most customers
of the bank and requires the attention of the bank management, including increasing
the number of tellers to meet the recommended waiting time.

This problem can be translated into the language of statistical hypothesis or hy-
pothesis testing and based on collected data, one of the following two conclusions
will be made/reached regarding the manager’s claim:

i. The average waiting time is 368 seconds. No corrective action is needed.

ii. The average waiting time is not 368 seconds; it is either less than 368 seconds
or more than 368 seconds. Corrective measure is needed.

That is, the manager may wish to know whether:

i. µ = 368 seconds

ii. µ ≠ 368 seconds

Terminologies

The statement µ = 368 seconds is called a statistical hypothesis. We call such a
hypothesis the null hypothesis and we write H0: µ = 368 seconds. A null hypothesis is
always one of no difference or status quo. If the null hypothesis is considered false,
something else must be true: an alternative hypothesis, which is the opposite of the
null hypothesis. The statement µ ≠ 368 seconds is called an alternative hypothesis and
we write H1: µ ≠ 368 seconds. A formal procedure for deciding between H0 and H1 is
called a hypothesis test or test of significance.

Note: The null hypothesis (H0 ) is the hypothesis that is always tested. The alter-
native hypothesis (H1 ) is set up as the opposite of the null hypothesis and represents
the conclusion supported if the null hypothesis is rejected.

4.1.1 Level of significance

The probability of committing a Type I error, denoted by α, is referred to as the
level of significance of the statistical test. The investigator controls the Type I
error rate by deciding the risk level α he/she is willing to tolerate in terms of rejecting
the null hypothesis when it is in fact true. The choice of a particular risk level for
making a Type I error depends on the cost of making a Type I error.

4.1.2 Confidence coefficient

The complement (1−α) of the probability of a Type I error is called the confidence
coefficient. It is the probability that the null hypothesis H0 is not rejected when
in fact it is true and should not be rejected.

4.1.3 The β risk

The probability of committing a Type II error, denoted by β, is often referred to as
the consumer’s risk level.

4.1.4 The power of a test

The complement (1−β) of the probability of a Type II error is called the power of a
statistical test. It is the probability of rejecting a null hypothesis H0 when in fact
it is false and should be rejected.

Summary
For a fixed sample size, decreasing α increases β, and vice versa. The solution is to
increase the sample size, which lowers β and so increases the power of the test. That
is, for any value of α, as the sample size n increases, β decreases and the power of
the test (1 − β) increases. However, remember that resources are always limited. In
general, the choice of reasonable values for α and β depends on the costs inherent in
each type of error.

Question
How do we decide which hypothesis between the null and the alternative is correct?

In order to choose whether to reject or accept the null hypothesis, we need to
compute, based on sample information, the value of a test statistic, which will
tell us what action to take. A test statistic is a function of the sample information
that is used as a basis for deciding between H0 and H1. For example, $\frac{\bar{X} - \mu_0}{s_{\bar{X}}}$ is a test
statistic.

In testing hypothesis we partition the possible values of the test statistic into two
subsets: an acceptance region for H0 and a rejection region for H0

Regions of rejection and nonrejection

In the process of deciding either of the two hypotheses, two kinds of errors (Type I
error and Type II error) are possible.

Risks in decision-making using hypothesis testing

                        Actual situation
Statistical decision    H0 True                         H0 False
Do not reject H0        Confidence coefficient (1−α)    Type II error (β)
Reject H0               Type I error (α)                Power (1−β)

4.2 Type I and II errors

Type I error occurs when the null hypothesis (H0 ) is rejected when in fact it is
true and should not be rejected. On the other hand, a Type II error occurs when
the null hypothesis H0 is not rejected when in fact it is false and should be rejected.

4.3 One-sided and two-sided tests

Let us consider a situation in which we want to test the null hypothesis:

H0: θ = θ0 vs. H1: θ ≠ θ0

In this hypothesis we reject H0 when θ̄, the point estimate of θ is much larger or
smaller than θ0 . Such a test is referred to as a two-tailed or two-sided test.

On the other hand, if our interest is to test the hypothesis

H0: θ = θ0 vs. H1: θ < θ0, or H0: θ = θ0 vs. H1: θ > θ0

It would seem reasonable to reject H0 only when the point estimate θ̄ is much smaller
(left tail of the test statistic) or larger (right tail of the test statistic) than θ0
respectively. Such a test is called a one-tailed or one-sided test. Thus, a one-sided
or one-tailed test is any test where the critical region consists of only one tail of
the sampling distribution of the test statistic.

4.4 Steps involved in hypothesis testing

i. Formulate H0 and H1 and specify α, the level of significance

ii. Using the sampling distribution of an appropriate test statistic, determine a
critical region of size α

iii. Determine the value of the test statistic from the sample data

iv. Check whether the value of the test statistic falls into the critical region and,
accordingly, reject H0, accept it, or reserve judgment.

4.5 Tests of hypotheses for the mean of a single population

Procedures

Hypotheses:

General form: H0 : µ = µ0

against

One-sided: H1 : µ > µ0 or H1 : µ < µ0

Two-sided: H1 : µ ≠ µ0

Level of significance: α

Test statistic:

$$t = \frac{\bar{X} - \mu_0}{s_{\bar{X}}}$$

Decision: Reject H0 if

One-sided: $t > t_{n-1,\alpha}$ (for H1 : µ > µ0) or $t < -t_{n-1,\alpha}$ (for H1 : µ < µ0)

Two-sided: $|t| > t_{n-1,\alpha/2}$

Example 1
Suppose that it is known from experience that the standard deviation of the weight
of 8-kg packages of cookies made by a certain bakery is 0.16 kg. To check whether its
production is under control on a given day, that is, to check whether the true average
weight of packages is 8 kg, employees select a random sample of 25 packages and
find that their mean weight is X̄ = 8.091 kg. Since the bakery stands to lose money
when µ > 8 and the customer loses out when µ < 8, test the hypotheses H0 : µ = 8 vs.
H1 : µ ≠ 8 at the 0.01 level of significance.

Solution

Procedure:

Hypotheses:

H0 : µ = 8 vs. H1 : µ ≠ 8

Test statistic:

$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$$

Decision: Reject the null hypothesis if $Z < -Z_{\alpha/2}$ or $Z > Z_{\alpha/2}$.

Substituting X̄ = 8.091, µ0 = 8, σ = 0.16 and n = 25 into the test statistic we have

$$Z = \frac{8.091 - 8}{0.16/\sqrt{25}} = 2.84$$

$Z_{\alpha/2} = Z_{0.005} = 2.575$


Since Z (2.84) > Z0.005 (2.575), we reject the null hypothesis and conclude that
suitable adjustment should be made in the production process.
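The arithmetic in Example 1 can be checked with a few lines of Python. This is a sketch using only the standard library; the variable names are chosen for this example, and the p-value is an addition that is not part of the worked solution above.

```python
import math
from statistics import NormalDist

# Values taken from Example 1.
x_bar, mu0, sigma, n = 8.091, 8.0, 0.16, 25
alpha = 0.01

# Test statistic Z = (X-bar - mu0) / (sigma / sqrt(n)).
z = (x_bar - mu0) / (sigma / math.sqrt(n))

# Two-sided critical value Z_{alpha/2} from the standard normal distribution.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

# Two-sided p-value (not computed in the notes; added for illustration).
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

reject_h0 = abs(z) > z_crit
print(f"Z = {z:.2f}, critical value = {z_crit:.3f}, p-value = {p_value:.4f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

The computed critical value differs from the tabulated 2.575 only in the third decimal place, which does not affect the decision.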

Example 2
It is thought that the body temperature of intertidal crabs exposed to air is less than
the ambient temperature. Body temperatures were obtained from a random sample
of 8 such crabs exposed to an ambient temperature of 25.4 degrees Celsius. Assume
that body temperatures are approximately normally distributed.

25.8 24.6 26.1 24.9 25.1 25.3 24.0 24.5

Solution
Procedure

Hypotheses
Let µ be the mean body temperature for the population of intertidal crabs exposed
to an ambient temperature of 25.4 degrees Celsius. Then we wish to test

H0 : µ = 25.4 deg. C vs. H1 : µ < 25.4 deg. C.

Clearly this is a one-sided test.

Level of significance: Use α=0.05.


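Test statistic:

$$t = \frac{\bar{X} - \mu_0}{s_{\bar{X}}}$$

We have n = 8, thus

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{200.3}{8} = 25.04 \quad \text{(check!)}$$

$$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}\right] = 0.479821428 \quad \text{(check!)}$$

Thus, $s_{\bar{X}} = s/\sqrt{n} = 0.245$ (approx.) (check!)

So that

$$t = \frac{\bar{X} - \mu_0}{s_{\bar{X}}} = \frac{25.04 - 25.4}{0.245} = -1.470$$

$-t_{n-1,\alpha} = -t_{7,0.05} = -1.895$.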

The value of the test statistic, -1.470, is not less than the critical value, -1.895, so
we do not reject H0 at the 0.05 level of significance. There is not enough evidence in
the sample to suggest that the mean body temperature of intertidal crabs exposed to
air at 25.4 degrees Celsius is indeed less than 25.4 degrees Celsius.
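The "(check!)" steps in Example 2 can be verified as follows. This sketch uses only the Python standard library, which has no t distribution, so the critical value 1.895 is simply copied from the t table as quoted in the solution. Note that the code carries full precision throughout, so it gives t of about -1.48 rather than the -1.470 obtained above from the rounded values 25.04 and 0.245; the decision is the same.

```python
import math
import statistics

# Body temperatures (degrees Celsius) of the 8 crabs in Example 2.
temps = [25.8, 24.6, 26.1, 24.9, 25.1, 25.3, 24.0, 24.5]
mu0 = 25.4  # ambient temperature, the value of mu under H0

n = len(temps)
x_bar = statistics.mean(temps)   # sample mean
s2 = statistics.variance(temps)  # sample variance, divisor n - 1
se = math.sqrt(s2 / n)           # standard error of the mean, s / sqrt(n)
t_stat = (x_bar - mu0) / se

# Critical value t_{7, 0.05} copied from the t table used in the notes.
t_crit = 1.895
reject_h0 = t_stat < -t_crit     # left-tailed test, H1: mu < 25.4

print(f"mean = {x_bar:.4f}, s^2 = {s2:.6f}, s.e. = {se:.4f}, t = {t_stat:.3f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```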

Exercise Using the same data test at α=0.1 and give your conclusion as a meaningful
sentence.

Exercise
1. Suppose that 100 cakes made by a certain fast food store lasted on average
14 days with a standard deviation of 2 days. Test the null hypothesis µ = 10 days
against the alternative that µ < 10 days at the 0.05 level of significance. Present
your findings in a written report covering the abstract, introduction, literature
review, methodology, results and discussion, conclusion and recommendation(s) and
references cited.

2. A supplier of ARVs for people living with HIV claims that the mean half-life
for the respective doses is 2650 hours. A random sample of 25 ARV doses
had a mean half-life of 2640 hours with a standard deviation of 10 hours. Test the
supplier's claim at the 1% level of significance. Present your findings in a written
report covering the abstract, introduction, literature review, methodology, results and
discussion, conclusion and recommendation(s) and references cited.

3. The director of admissions of the University of Dar es Salaam would like to
advise parents of incoming students concerning the cost of textbooks during a typical
semester. A sample of 100 students enrolled in the university indicates a sample
average of TZS 315,000 with a sample standard deviation of TZS 4,350.

i. Using the 0.01 level of significance, is there evidence that the population aver-
age is above TZS 300,000?

ii. Find the p-value of the test and interpret its meaning

4. A manufacturer of detergent claims that the mean weight of a particular box of
detergent is 6.25 kg. A random sample of 64 boxes revealed a sample average of
6.238 kg and a sample standard deviation of 0.234 kg.

i. Using the 0.01 level of significance, is there evidence that the average weight
of the boxes is different from 6.25 kg?

ii. Find the p-value of the test and interpret its meaning

iii. What will your answer in (i) be if the standard deviation is 0.05 kg?

iv. What will your answer in (i) be if the sample mean is 6.211 kg?

4.6 Testing for the difference of two population means

As we have discussed, a common situation in real life is one in which we would like
to compare two populations. For example, we may want to decide, based on sample
information, whether two competing treatments are significantly different, or to
compare a new treatment with a standard or control one, or we may want to decide on
the basis of an appropriate sample survey whether the average food expenditure of
families in one city exceeds that of families in another city by at least 2500 Tsh.

Scenario: we assume that we have two independent (totally unrelated) normal
populations with means µ1 and µ2 and known variances $\sigma_1^2$ and $\sigma_2^2$, from which
samples of sizes n1 and n2 are drawn. Suppose further that we want to test the hypothesis:

H0 :µ1 − µ2 = δ, where δ is a given constant, against one of the alternatives

H1 : µ1 − µ2 ≠ δ or µ1 − µ2 > δ or µ1 − µ2 < δ

Test statistic:
As in the case of constructing confidence intervals for µ1 − µ2, intuition suggests that
we base our inference on X̄1 − X̄2. The test statistic is

$$z = \frac{\bar{X}_1 - \bar{X}_2 - \delta}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

For the above alternative hypotheses, the value of the test statistic is compared with
the respective critical regions $|z| > z_{\alpha/2}$, $z > z_{\alpha}$ and $z < -z_{\alpha}$.

Note: When we deal with independent random samples from populations with unknown
variances that may not even be normal, we can still use the above test statistic
with $s_1^2$ substituted for $\sigma_1^2$ and $s_2^2$ for $\sigma_2^2$, as long as both samples are large enough.

Example: An experiment was conducted to determine whether the average nicotine
content of one kind of cigarette exceeds that of another kind by 0.20 milligram. If
n1 = 50 cigarettes of the first kind had an average nicotine content of X̄1 = 2.61
milligrams with a standard deviation of s1 = 0.12 milligram, whereas n2 = 40 cigarettes
of the other kind had an average nicotine content of X̄2 = 2.38 milligrams with a
standard deviation of s2 = 0.14 milligram, test the null hypothesis µ1 − µ2 = 0.20 vs.
µ1 − µ2 ≠ 0.20 at the 0.05 level of significance.

Solution

Procedure

Hypothesis

H0 : µ1 − µ2 = 0.20 vs. H1 : µ1 − µ2 ≠ 0.20 (two-sided test)


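Test statistic:

$$z = \frac{\bar{X}_1 - \bar{X}_2 - \delta}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

Substituting the data given into the formula for the test statistic (with $s_1^2$ and $s_2^2$ in place of $\sigma_1^2$ and $\sigma_2^2$, since both samples are large) we have

$$z = \frac{2.61 - 2.38 - 0.20}{\sqrt{\dfrac{(0.12)^2}{50} + \dfrac{(0.14)^2}{40}}} = 1.08$$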

The critical value is $z_{\alpha/2} = z_{0.025} = 1.96$. Since 1.08 does not exceed 1.96, we do not
reject the null hypothesis. This means that the difference between 2.61 − 2.38 = 0.23
and 0.20 is not significant.
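The two-sample calculation above can be reproduced with the short sketch below. It uses only the Python standard library, and the variable names are chosen for this example. As in the worked solution, the sample standard deviations stand in for the unknown population values because both samples are large.

```python
import math
from statistics import NormalDist

# Summary data from the nicotine example.
n1, mean1, s1 = 50, 2.61, 0.12
n2, mean2, s2 = 40, 2.38, 0.14
delta = 0.20   # hypothesised difference mu1 - mu2 under H0
alpha = 0.05

# Standard error of the difference between the two sample means.
se_diff = math.sqrt(s1**2 / n1 + s2**2 / n2)

# Test statistic z = (mean1 - mean2 - delta) / se_diff.
z = (mean1 - mean2 - delta) / se_diff

# Two-sided critical value z_{alpha/2}.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
reject_h0 = abs(z) > z_crit

print(f"z = {z:.2f}, critical value = {z_crit:.2f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```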

5 Appendices

5.1 Exercise 1

1. Using example(s), briefly describe the following terms:
(i) Statistics
(ii) Population
(iii) Sample
(iv) Census
(v) Random sample
(vi) Representative sample
(vii) Statistical inference
(viii) Quantitative data
(ix) Qualitative data
2. Strictly speaking, statistics involves three key stages, namely: (i) collection of
data, (ii) analysis of data, and (iii) interpretation of results. Briefly describe how
each of these stages is achieved.
3. Consider the following data: 5.3, 4.2, 6.1, 8.4, 7.1, 6.6.

i. Calculate the sample mean, median, sample variance and sample standard
deviation, range, coefficient of variation, and standard error of the mean for
the above data by hand.

ii. Verify that the sum of the deviations (observation -mean) is zero. Do this by
hand.

iii. Suppose that the observation 8.4 was recorded instead as 84 due to a record-
ing error. Calculate the sample mean and median of the data set under this
condition and comment on the effect of this error.

4. The mean and variance of a set of 10 values are known to be 17 and 33 respectively.
Of the 10 values 26 was subsequently found wrong and the correct value was 16. Find
the correct mean and variance of the distribution.
Find the missing information from the following data.

Sub group    I      II     III    Combined
Number       50     n2     90     200
s            6      7      s3     7.745
X̄            113    X̄2     115    116

5. The mean weight of 150 students in a certain class is 60 kg. The mean weight of
the boys is 70 kg and that of the girls is 55 kg. Find the number of boys and girls
in the class.
6. The mean of 100 observations is 50 and standard deviation 10. What will be the
new mean and standard deviation if:
(i) 5 is added to each observation

(ii) Each observation is multiplied by 3.
7. Examine whether the following results of a piece of computation for obtaining
the variance are consistent or not.

$n = 120$, $\sum X_i^2 = 128$, $\sum X_i = -125$
8. The mean annual salary paid to all employees in a company was $ 15000. The
mean annual salaries paid to male and female employees of the company were $
15600 and $ 12600 respectively. Determine the percentages of males and females
employed by the company.

9. What is meant by a frequency distribution? Describe the main steps in the
preparation of a frequency distribution table from raw data. In the following table, the
mean annual death rates per 1000 at ages 20-65 in each of 88 occupational groups
are given in terms of the class marks.
Construct: (i) class limits and (ii) class boundaries.

Class marks 3.95 4.95 5.95 6.95 7.95 8.95 9.95 10.95 11.95 12.95
Frequencies 1 4 5 13 12 19 13 10 6 4

10. (a) What is meant by a frequency distribution? Describe the main steps in the
preparation of a frequency distribution table from raw data.
(b) A frequency distribution has 8 consecutive classes of equal width. The class mark
of the third is 24.5. The upper class limit of the 5th class is 49. The frequencies
of classes from lowest to highest classes are 8, 32, 142, 216, 240, 206, 143, and 13.
Complete the frequency distribution.
11. (a) What is a measure of dispersion? Which among the following is not a
measure of dispersion: standard deviation, first quartile, range, quartile deviation?
(b) The mean annual salary paid to all employees in a company was $ 15000. The
mean annual salaries paid to male and female employees of the company were $
15600 and $ 12600 respectively. Determine the percentages of males and females
employed by the company.
(c) On a final examination in Biometry, the mean grade of a group of 150 students
was 78 and the standard deviation was 8.0. In Mathematics, however, the mean final
grade of the group was 73 and the standard deviation was 7.6. In which course was
there the greater (i) absolute dispersion? (ii) relative dispersion?
12. (a) Give the essential characteristics of a good average.
(b) For a certain frequency distribution, the mean was 40 and mode 10. Find the
median
13. (a) Using example(s) explain the distinction between:
(i) a parameter and statistic
(ii) a discrete variable and continuous variable
(b) In general statistics plays a very significant role in any scientific research. In
your own words what do you understand by the term statistics in such a research.
14. Present the following data into a frequency distribution with classes as 80 - 89,
90 - 99, 100 - 109, etc.

85 130 135 90 118 92 80 142 97 147


98 94 115 109 138 109 111 117 120 91
124 101 104 97 98 126 94 109 109 94
110 82 96 119 92 98 114 104 149 107
123 102 83 117 98 87 87 87 145 91

15. Determine class boundaries, class limits and class marks for first, and last classes
in respect of the following:
(i) Weights of 300 entering freshmen ranged from 98 to 226 kg
(ii) The thickness of 460 washers ranged from 0.421 to 0.563 inches.

5.2 Exercise 2

1. (a) In general statistics plays a very significant role in any scientific research. In
your own words what do you understand by the term statistics in such a research.
(b) Define the following terms: (i) data (ii) population (iii) sample
(c) Classify the following data as quantitative, qualitative, continuous, discrete:

i. animal colour

ii. number of wrong answers per student in a multiple choice test

iii. tire-miles to first puncture

iv. population of students, when an inquiry about expenditure on stationery is
being made

v. humans, of specified age-group, tribe and sex, with their income being specified

2. (a) Enumerate the different methods of collecting data. Which one is the most
suitable for conducting inquiry regarding perception of people on adherence to stan-
dard fishing practices among young men in Ukerewe District in Tanzania? Explain
its merits and demerits.
(b) Explain the merits and limitations of the observation method in collecting data.
Illustrate your answer with a suitable example.
3. (a) The numbers 3.2, 5.8, 7.9 and 4.5 have frequencies x, (x + 2), (x − 3) and (x + 6)
respectively. If their arithmetic mean is 4.876, find the modal number.
(b) What is a measure of location? Which among the following are not measures of
location? Mode, standard deviation, first quartile, range, quartile deviation.
(c) Give the essential characteristics of a good average.
(d) What is a measure of dispersion? Which among the following is not a measure
of dispersion? Standard deviation, first quartile, range, quartile deviation.
(e) Compare as far as possible, the range, mean deviation, and standard deviation
as measure of dispersion.
(f) The mean of five items of an observation is 4 and the variance is 5.2. If three of
the items are 1, 2 and 6, find the other two.
2. The mean annual salary paid to all employees in a company was Tanzania Shillings
(TZS) 150,000. The mean annual salaries paid to male and female employees of the
company were TZS 156,000 and TZS 126,000 respectively. Determine the percentages
of males and females employed by the company.
(b) For a certain frequency distribution, the mean was 40 and mode 10. Find the
median.
3. (a) An analysis of monthly wages paid to the workers in two firms A and B
belonging to the same factory, gave the following results.

                       Firm A    Firm B
Number of employees    986       548
Average wages          52.5      47.5
Variance of wages      100       121

(i) Which firm, A or B pays out larger amount as monthly wages?


(ii) Which firm, A or B has a more consistent wage distribution? Explain.
(iii) Calculate the combined arithmetic mean and combined standard deviation for
the two firms.

(b) On a final examination in MTH 106: Introductory statistics, the mean grade of
a group of 150 students was 78 and the standard deviation was 8.0. In mathematics,
however, the mean final grade of the group was 73 and the standard deviation was
7.6. In which course was there the greater (i) absolute dispersion? (ii) relative
dispersion?
4. The average mark of 100 students was found to be 60. It was later discovered
that a score of 63 had been misread as 36. Find the correct average, corresponding to
the correct score.
5. Find the arithmetic mean for the frequency distribution table of question 1 (b)
in section II.
6. Draw the ogives for the following data and hence find the value of median from
it. Check the value of the median by actual calculation.

Weight (kg) 118-126 127-135 136-144 145-153 154-162 163-171 172-180


Frequency 3 5 9 12 5 4 2

7. Find the value of mode for the following frequency distribution

Class: 0-9 10 - 19 20 - 29 30 - 39 40 - 49 50 - 59 60 - 69 70 - 79
Freq. 328 350 720 664 598 524 378 244

8. Find the standard deviation of the following frequency distribution.

Height (in)    Class mark (Yi)    Frequency (fi)    fi Yi²     fi Yi
60 - 62        61                 5                 18605      305
63 - 65        64                 18                73728      1152
66 - 68        67                 42                188538     2814
69 - 71        70                 27                132300     1890
72 - 74        73                 8                 42632      584

9. The average mark of 100 students was found to be 60. It was later discovered
that a score of 63 had been misread as 36. Find the correct average, corresponding to
the correct score.

5.3 Exercise 3

1. The data in the following table represent the size of an organism at equally spaced
times 0 to 8. Use the first 8 observations to estimate the regression equation of size
of the organism on time.

Time 0 1 2 3 4 5 6 7 8
Size 0.75 1.20 1.75 2.50 3.45 4.70 6.20 8.25 11.5

2. (a) Give one advantage of regression in comparison to correlation.

Air with varying concentrations of CO2 is passed over wheat leaves at a temperature
of 35 °C and the uptake of CO2 by the leaves is measured. Results for 7 leaves, giving
the uptake (Y) at different concentrations (X), are summarised as follows:

CO2 concentration [ppm] (X)     75     100    100    120    130    130    160
CO2 uptake [cm³/dm²/hr] (Y)     0.00   0.65   0.50   1.00   0.95   1.30   1.80

(i) Fit the simple linear regression line Y = α + βX + ε


(ii) Predict the value of CO2 uptake for a single leaf at a concentration X=150ppm.
3. (a) What is a scatter diagram? Explain its significance in regression and correla-
tion analysis.
(b) The data below are the average body weight (Y) and food consumption (X) for
ten hens obtained from Ukombozi Farm. (i) Measure and comment on the strength
of the relationship between food consumption and body weight.
(ii) What proportion of the variation in body weight is explained by the differences
in food consumption?
4. A forestry researcher was interested in the association between percentage of
hardwood in the pulp from which paper is produced (X) and the tensile strength
of the paper (Y, in tenths of a pound per square inch). He obtained samples of
pulp from 7 different batches and corresponding samples of the paper produced from
each of these batches. For each sample of pulp, he determined the percentage of
hardwood, and for the corresponding sample of paper, the tensile strength. His data
and some summary statistics are given below:

Data for question 3 (b):

Food cons., kg (X)     87.1   93.1   89.8   91.4   99.5   92.1   95.5   99.3
Body weight, kg (Y)    4.6    5.1    4.8    4.4    5.9    4.7    5.1    5.2

Data for question 4:

X    3.50   4.10   4.30   5.20   5.80   6.60   7.90
Y    1.90   1.30   2.00   1.70   1.80   2.10   2.50

$$\sum_{i=1}^{n} X_i = 37.40,\quad \sum_{i=1}^{n} X_i^2 = 214.20,\quad \sum_{i=1}^{n} Y_i = 13.30,\quad \sum_{i=1}^{n} Y_i^2 = 26.09,\quad \sum_{i=1}^{n} X_i Y_i = 73.47$$

Perform whatever analysis you think is most appropriate to derive a numerical char-
acterization of the nature of the association between X and Y. You need not construct
confidence intervals, estimate standard errors, or perform any hypothesis tests.

5.3.1 Suggested solution for question 4

The most appropriate analysis to derive a numerical characterization of the nature of
association between X and Y is correlation analysis. Correlation analysis measures
the nature (positive or negative) and strength (magnitude) of association between
the variables under consideration. Therefore, in the present case we estimate the
Pearson correlation coefficient given by:


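$$r_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \times \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$

or

$$r_{XY} = \frac{n \sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{\sqrt{\left[n \sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2\right] \times \left[n \sum_{i=1}^{n} Y_i^2 - \left(\sum_{i=1}^{n} Y_i\right)^2\right]}}$$

In calculating the Pearson correlation coefficient, one may use either the first or the
second equation.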

Summary statistics are given below:

$$\sum_{i=1}^{n} X_i Y_i = 73.47,\quad \sum_{i=1}^{n} X_i = 37.40,\quad \sum_{i=1}^{n} Y_i = 13.30,\quad \sum_{i=1}^{n} X_i^2 = 214.20,\quad \sum_{i=1}^{n} Y_i^2 = 26.09$$

Substituting these summary statistics in the second equation we have:

$$r_{XY} = \frac{7 \times 73.47 - (37.40 \times 13.30)}{\sqrt{[7 \times 214.20 - (37.40)^2] \times [7 \times 26.09 - (13.30)^2]}}$$

$$r_{XY} = \frac{514.29 - 497.42}{\sqrt{[1499.4 - 1398.76] \times [182.63 - 176.89]}} = \frac{16.87}{\sqrt{100.64 \times 5.74}} = \frac{16.87}{\sqrt{577.6736}} = \frac{16.87}{24.03484138} = 0.701897705$$

Therefore, the Pearson correlation coefficient between X and Y (rXY) is 0.70
(approximately), which can be interpreted to mean that there is a strong positive
association between the percentage of hardwood in the pulp from which paper is
produced (X) and the tensile strength of the paper (Y).
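The hand calculation of rXY can also be checked with a short Python sketch. It applies the computational (second) form of the Pearson formula directly to the seven (X, Y) pairs of question 4; the variable names are chosen for this example.

```python
import math

# Percentage of hardwood (X) and tensile strength (Y) for the 7 batches.
x = [3.50, 4.10, 4.30, 5.20, 5.80, 6.60, 7.90]
y = [1.90, 1.30, 2.00, 1.70, 1.80, 2.10, 2.50]
n = len(x)

# Summary statistics used in the hand calculation.
sum_x = sum(x)
sum_y = sum(y)
sum_x2 = sum(xi**2 for xi in x)
sum_y2 = sum(yi**2 for yi in y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Computational form of the Pearson correlation coefficient.
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
r_xy = numerator / denominator

print(f"sum X = {sum_x:.2f}, sum Y = {sum_y:.2f}, sum XY = {sum_xy:.2f}")
print(f"r_XY = {r_xy:.3f}")
```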

To perform the correlation analysis in SPSS

Follow the path Analyze → Correlate → Bivariate..., then move the variables you
want to correlate (X and Y in the present case) to the Variables box. Check the box
next to Pearson, then click the OK button. The resulting output is given below:
Correlations
                              X        Y
X    Pearson Correlation      1        .702
     Sig. (2-tailed)          .        .079
     N                        7        7
Y    Pearson Correlation      .702     1
     Sig. (2-tailed)          .079     .
     N                        7        7

In the table above, 0.702 (rounded to three decimal places) is the Pearson
correlation coefficient, the same as the one obtained by hand calculation or using
R/S-plus above.

5.4 Exercise 4

1. A pair of dice is thrown.


(i) Describe the sample space S.
(ii) Find the probability of getting a total of either 5 or 11.
2. A fair coin is tossed four times. Describe the sample space S and find the
probability that at least one (1) head occurs.
3. A ball is drawn at random from a box containing 6 red balls, 4 white balls, and
5 blue balls. Determine the probability that the ball drawn is:
(a) red, (b) white, (c) blue, (d) not red, and (e) red or white
4. Three horses A, B and C are in a race, A is twice as likely to win as B and B is
twice as likely to win as C. What is the probability that A or B wins?

5. Two fair dice are rolled simultaneously. Let A be the event "the sum shown
is 6" and B the event "the two show the same number". Find (i) P(A|B), (ii)
P(B|A).
6. The probability that a certain girl will go out is 0.60, and the probability that if
she goes out she will spend shs. 30,000 is 0.80. What is the probability that she
will go out and spend shs. 30,000?
7. The probabilities that a student will get passing grades in General Chemistry,
in Introductory Statistics or in both are P (PS100)=0.70, P (MTH106)=0.56. Check
whether events A and B are independent
8. The probability that a man will be alive in 30 years is 2/5 and the probability that
his wife will be alive in 30 years is 1/2. Find the probability that:
(i) Both will be alive
(ii) Only the man will be alive
(iii) Only the wife will be alive
(iv) Neither will be alive in 30 years
9. Two cards are drawn from a well-shuffled ordinary deck of 52 cards. Find the
probability that they are both aces if the first card is: (i) replaced, (ii) not replaced.

5.5 Exercise 5

1. Ten multiple-choice questions are available. The chance of making a correct
choice is 1/3. What is the probability that: (a) 4 answers will be correct, (b) all
answers will be correct?
2. The random variable X follows a Poisson distribution with variance 4. Find
P [X ≤ 3]
3. X ∼ B(n, p). E(X)=2.4 and p=0.4. Find the variance of the distribution
4. The mean number of bacteria per milliliter of a liquid is known to be 4. Assuming
that the number of bacteria follows a Poisson distribution, find the probability that
in 1 ml of liquid there will be (a) less than 2 bacteria, (b) 3 bacteria, (c) no bacteria.
5. A manufacturer of cotter pins knows that 5% of his products are defective. If
he sells cotter pins in boxes of 100 and guarantees that not more than 10 pins will
be defective, what is the approximate probability that a box will fail to meet the
guaranteed quality?
6. Suppose that the number of telephone calls coming into a telephone exchange
between 10 a.m. and 11 a.m. say X1 is a random variable with Poisson distribution
with parameter 2. Similarly the number of calls arriving between 11 a.m. and
12 noon say X2 has a Poisson distribution with parameter 6. If X1 and X2 are
independent, what is the probability that more than 5 calls will come in between 10
a.m. and 12 noon?
7. Let:

T +=the test is positive (indicating that the disease is present)
T −=the test is negative
Z+=the individual has the disease
Z−=the individual does not have the disease
and
P(T+|Z+)=the sensitivity of the test
P(T-|Z+)=the probability of a false negative
P(T-|Z-)=the specificity of the test
P(T+|Z-)=the probability of a false positive
P (Z+) = the prevalence of the disease in the population
(i) Use Bayes’ rule to find the “predictive value” of a positive test P(Z+|T+) for a
test with 98% specificity and 99% sensitivity when the prevalence is 0.5%
(ii) An ELISA test for AIDS has 99.5% specificity and is used on 140 employees of a
medical clinic. If all 140 are free of AIDS, what is the probability that at least one
of the 140 people will nevertheless test positive for the disease?

5.6 Exercise 6

In the following questions, in each case, draw a picture for the probability or normal,
t or χ2 value you are calculating.
1. Let Z denote a standard normal random variable; Z ∼ N (0, 1). Find:
(a) P (Z ≥ 1.44)
(b) P (Z ≥ 0.34)
(c) P (Z ≤ −0.86)
(d) P (Z ≤ 1.22)
(e) P (−2 ≤ Z ≤ 2)
(f) P (−0.08 ≤ Z ≤ 1.87)
(g) P (|Z| ≤ 1.96)
(h) P (|Z| ≥ 1.28)
(i) P (≤ 1.22 ≤ Z ≤ 3.01)
2. Let Z be as in the previous problem. Find z such that:
(a) P (Z ≥ z) = 0.171
(b) P (Z ≤ z) = 0.9913
(c) P (−0.25 ≤ Z ≤ z) = 0.05
(d) P (|Z| ≥ z) = .10

(e) P (|Z| ≤ z) = 0.95
3. Suppose that Y is a random variable with a normal distribution with mean
µ = 20 and variance σ 2 = 16, that is, Y ∼ N (20, 16). Find
(a) P (Y > 25)
(b) P (Y < 17)
(c) P (18<Y < 26)
(d) P (22 ≤ Y ≤ 30)
(e) P (Y < 23.5)
4. Suppose that Y ∼ N (10, 9). Find y such that
(a) P (Y ≥ y) = 0.025
(b) P (Y ≤ y) = 0.02
(c) P (Y < y) = 0.93
(d) P (Y > y) = 0.90
5. Suppose that Y ∼ N (40, 64) and that a sample of size n = 25 is obtained. Let
Ȳ = sample mean. Find.
(a) P (Ȳ > 44.3)
(b) P (Ȳ ≤ 37.5)
(c) P (42.3 ≤ Ȳ ≤ 44.4)
(d) P (Ȳ ≥ 38.9)
(e) P (Ȳ < 41.9)
(f) y such that P (Ȳ ≤ y) = 0.90
6. Let $\chi^2_v$ be a chi-square random variable with v degrees of freedom. In each case,
find x satisfying the statement.
(a) P (χ4 > x) = 0.10
(b) P (χ22 ≤ x) = 0.025
(c) P (χ5 ≥ x) = 0.01
7. With $\chi^2_v$ as in Problem 6, find the following probabilities. Note: the probabilities
may not be given exactly in the Table. In this case, give a range in which the
probability in question must fall. We will see when we study hypothesis testing that
this is often all the information we need.
(a) P (χ18 > 24.18)
(b) P (χ27 ≥ 19.21)
(c) P (χ2 ≥ 12.2)
(d) P (χ14 < 7.11)

8. Let $t_v$ be a random variable with a t distribution with v degrees of freedom. In
each case, find t satisfying the statement.
(a) P (t23 > t) = 0.025
(b) P (t7 ≤ t) = 0.975
(c) P (|t7 | ≥ t) = 0.10
(d) P (|t16 | < t) = 0.60
9. With tv as in Problem 8, find the following probabilities. As in Problem 7, the
probabilities may not be exactly given in the Table, so you must give a range in
which the probability in question must fall.
(a) P (t12 ≥ 2.34)
(b) P (t25 ≥ 1.45)
(c) P (t7 ≤ 1.31)
(d) P (|t19 | > 2.95)

5.7 Exercise 7

1. The following data are the random yields of two varieties of sugarcane obtained
from an experiment conducted at Mtibwa Sugar Company in Turiani-Morogoro.

Variety 1    266   275   304   245   264   270
Variety 2    274   258   231   282   250   290

Assume that sugarcane yields may be thought of as being well represented by a
normal distribution (continuous measurements), $N(\mu, \sigma^2)$:
(a) Find point estimates of the population means µ1 and µ2
(b) Find point estimates of the population variances $\sigma_1^2$ and $\sigma_2^2$
(c) Find 95% confidence interval for:
(i) The population means µ1 and µ2
(ii) The difference in population means µ1 − µ2
(d) Check if the two varieties of sugarcane are different in terms of average yield.
2. The following data are from Box, Hunter and Hunter (1978, Statistics for Exper-
imenters) and represent measurements of dissolved oxygen concentration (mg/L) in
6 test samples:

2.62 2.65 2.79 2.83 2.91 3.57

Assume that dissolved oxygen concentrations may be thought of as being well repre-
sented by a normal distribution (continuous measurements), N (µ, σ 2 ), obtain a 95%
confidence interval for µ, the population mean dissolved oxygen concentration for
such samples.
3. The following data are from Finney (1978, Statistical Method in Biological Assay,
p. 179) and are from an experiment to investigate the influence of different doses of
vitamin A on weight gain over a 3-week period. For 5 rats receiving 2.5 units of
vitamin A, the following weight increases (mg) were observed:

35 49 51 43 27

Assuming that normality is reasonable, that is, that the population of weight
increases for all possible rats receiving 2.5 units of vitamin A may be approximated
by a $N(\mu, \sigma^2)$ probability distribution, obtain a 90% confidence interval for µ.
4. The following data concern two types of rations, A and B, being fed to pigs.
An experiment was conducted in which 12 randomly selected pigs were fed ration A
and 12 were fed ration B with the goal of determining whether there is a difference
in the weight gains (lbs) for pigs fed the two different rations.

A 31 34 29 26 32 35 38 34 30 29 32 31
B 26 24 28 29 30 29 32 26 31 29 32 28

Assume the normality assumption is reasonable; find a 95% confidence interval for
the difference in means (µ1 − µ2 )
5. It is thought that the body temperature of intertidal crabs exposed to air is less
than the ambient temperature. Body temperatures were obtained from a random
sample of 8 such crabs exposed to an ambient temperature of 25.4 degrees Celsius.

25.8 24.6 26.1 24.9 25.1 25.3 24.0 24.5

Assume that body temperatures are approximately normally distributed, test the
hypotheses

H0 : µ = 25.4 deg. C vs. H1 : µ < 25.4 deg. C.

6. For the pig data in question 4, the goal was to determine whether 2 different rations
fed to pigs result in different weight gains. Again, we are interested in whether the
rations are different. Test:

H0 : µ1 − µ2 = 0 vs. H1 : µ1 − µ2 ≠ 0

7. It is thought that the mean clutch size of ducks raised in captivity is smaller
than that of ducks breeding in the wild. Suppose it is reasonable to assume that
variability in clutch size is different for ducks raised in captivity from that for ducks
breeding in the wild. Assume that clutch size is approximately normally distributed,
so that
Population 1: Wild, $N(\mu_1, \sigma_1^2)$
Population 2: Captive, $N(\mu_2, \sigma_2^2)$

The following data were obtained:

Captive 10 11 12 11 10 11 11
Wild 9 8 11 12 10 13 11 10 12

Test H0 : µ1 − µ2 = 0 vs. H1 : µ1 − µ2 > 0


8. For the duck data in question 7 above, test whether or not ducks raised in
captivity have different variability in clutch size from ducks bred in the wild. Use
α = 0.05
9. The observed yield (kg/plot) of two species of a certain plant were recorded as
follows:
Species 1    2.1   2.3   2.4   2.1   2.6   1.9   2.5   1.8
Species 2    1.7   2.6   1.8   2.0   2.1   2.2   1.6   2.3

(a) Compute sample variances (point estimates) for the two species
(b) Test the hypothesis that the two population variances from which the samples
were derived are equal
(c) Give an expression of the test statistic you would use if you were asked to test
for the difference in population means
(d) Give assumptions, if any, you need to be able to use the test statistic you have
written in part (c) above.
10. Write TRUE for the correct statement and FALSE for the wrong statement.
(a) A critical region is the set of all values of the test statistic that lead to
acceptance of the null hypothesis
(b) Type II error is the one committed by maintaining a true null hypothesis when
in fact the alternative one is correct
(c) Confidence limits are two end points within which a sample parameter falls with
specified degrees of freedom
(d) A two sample pooled t-test is applied when the two unknown population variances
are assumed to be different

5.8 Exercise 8

1. The length of the skulls of 10 fossil skeletons of an extinct species of bird has a
mean of 5.68 cm and a standard deviation of 0.29 cm. Assuming that such measure-
ments are normally distributed, find a 95% confidence interval for the mean length
of the skull of this species of bird
2. Twelve randomly selected mature citrus trees of one variety have a mean height of
13.8 feet with a standard deviation of 1.2 feet, and 15 randomly selected mature citrus
trees of another variety have a mean height of 12.9 feet with a standard deviation of
1.5 feet. Assuming that the random samples were selected from normal populations
with equal variances, construct 90% and 95% confidence intervals for the difference
between the true average heights of the two kinds of citrus trees.

3. The following data are the heat producing capacities of coal from two mines (in
millions of calories per ton):

Mine A: 8500 8330 8480 7960 8030


Mine B: 7710 7890 7920 8270 7860

Assuming that the data constitute two independent random samples from normal
populations with equal variances, construct a 99% C.I. for the difference between
the true average heat producing capacities from the two mines.
4. A paint manufacturer wants to determine the average drying time of a new
interior paint. If for 12 test areas of equal size he obtained a mean drying time of
66.3 minutes and a standard deviation of 8.4 minutes, construct a 95% C.I. for the
true mean.
5. An industrial designer wants to determine the average amount of time it takes an
adult to assemble an “easy to assemble” toy. Use the following data (in minutes),
a random sample, to construct a 95% C.I. for the mean of the population sampled:

17 13 18 19 17 21 29 22 16 28 21 15
26 23 24 20 8 17 17 21 32 18 25 22
16 10 20 22 19 14 30 22 12 24 28 11

6. A study has been made to compare the nicotine contents of two brands of ciga-
rette. Ten cigarettes of Brand A had an average nicotine content of 3.1 milligrams
with a standard deviation of 0.5 milligram, while eight cigarettes of Brand B had
an average nicotine content of 2.7 milligrams with a standard deviation of 0.7 mil-
ligram. Assuming that the two sets of data are independent random samples from
normal populations with equal variances, construct a 95% confidence interval for the
difference between the mean nicotine contents of the two brands of cigarettes.
7. A doctor is asked to give an executive a thorough physical check-up to test the
null hypothesis that he will be able to take on additional responsibilities. Explain
under what conditions the doctor would be committing a type I error and under
what conditions he would be committing a type II error.
8. An educational specialist is considering the use of instructional material on audio
cassettes for a special class of third-grade students with reading disabilities. Students
in this class are given a standardised test in May of the school year, and µ1 is the
average score obtained on these tests after many years of experience. Let µ2 be the
average score for students using the audio cassettes, and assume that high scores are
desirable.

(a) What null hypothesis should the education specialist use?


(b) What alternative hypothesis should be used if the specialist does not want to
adopt the new cassettes unless they improve the standardised test score?
(c) What alternative hypothesis should be used if the specialist wants to adopt the
new cassettes unless they worsen the standardised test score?
9. Suppose we want to test the null hypothesis that an antipollution device for cars
is effective.

(a) Explain under what conditions we would commit a type I error and under what
conditions we would commit a type II error
(b) Whether an error is a type I or a type II error depends on how we formulate
the null hypothesis. Rephrase the null hypothesis so that the type I error becomes
a type II error, and vice-versa
10. A biologist wants to test the null hypothesis that the mean wingspan of a certain
kind of insect is 12.3 mm against the alternative that it is not 12.3 mm. If she takes
a random sample and decides to accept the null hypothesis if and only if the mean
of the sample falls between 12.0 mm and 12.6 mm, what decision will she make if
she gets x̄ = 12.9 mm, and will it be in error if:
(a) µ = 12.5 mm; (b) µ = 12.3 mm?

5.9 Exercise 9

For each of the following forty (40) statements, write TRUE for a correct state-
ment and FALSE for a wrong statement on the space provided at the end of each
statement.

i. Research is a systematic search for pertinent information on a specific topic or
subject

ii. Applied or action research is mainly for uncovering new knowledge and theories
that will build upon existing knowledge or chart out new directions through
discovery.

iii. Data are observations of variables.

iv. In qualitative variables, numerical measurements on the phenomenon of interest
are not possible.

v. Descriptive statistics is concerned with the development and application of theory
and methods to the collection (design), analysis, and interpretation of observed
information from planned (or unplanned) experiments.

vi. Statistical inference is an estimate, prediction, or some other generalization
about a population based on information contained in a population.

vii. Inferential statistics utilizes sample data to make estimates, decisions, predic-
tions, or other generalizations about a population.

viii. The premise of statistical inference is that we attempt to control and assess
the uncertainty of inferences we make on the population of interest based on
observation of samples.

ix. When an instrument or research study measures what it claims to measure it
is said to have “reliability”.

x. Validity is the quality of consistency or replicability of a study or instrument.

xi. Longitudinal studies are studies conducted in a single phase mode in order to
provide a snap shot picture of the problem under investigation.

xii. Cross-sectional studies involve repeated measurements at periodic intervals on
the same subject in order to track changes related to particular variables of
interest in the study.

xiii. Nominal scales capture identity only.

xiv. Unstructured questionnaires imply the questions to be asked and responses
permitted are pre-determined.

xv. Sampling means drawing only a part of a population and studying it then
making inferences about the sample.

xvi. Probability sampling is where the probability of inclusion of a sample element
is known.

xvii. Non-probability samples are obtained by methods that are more objective than
subjective, which are based on the researcher’s judgment. These may be con-
venience, judgment, or quota samples.

xviii. Simple random samples are where each element of the sample has an equal
chance of appearing in the population.

xix. In cluster sampling, the population is observed to be heterogeneous and is
therefore divided into homogeneous segments or clusters.

xx. Stratified sampling is where a geographic territory is sub-divided into regions
and then random sampling is performed to select a few regions which are then
studied.

xxi. Disproportionate stratified sampling is often employed when one stratum (or a
few) is underrepresented but is of importance to the researcher.

xxii. One important advantage of area sampling is that it can be carried out even
in the absence of a sampling frame.

xxiii. Systematic sampling is when from a sampling frame, a systematic interval is
taken in selecting sample elements depending on the population size and sample
size desired.

xxiv. A random sample is a sample that exhibits characteristics possessed by the
target population.

xxv. A representative sample is a sample that is selected from the target population
through the use of probability sampling schemes.

xxvi. In a statistical test of hypothesis, the research hypothesis is the hypothesis to
be tested.

xxvii. In a statistical test of hypothesis, the null hypothesis is the hypothesis we wish
to verify.

xxviii. In a statistical test of hypothesis, typically one tests the alternative hypothesis
against the null hypothesis and one of them is rejected.

xxix. Confidence limits are two end points within which a sample parameter falls
with specified degrees of freedom.

xxx. A critical region is the set of all values that lead to acceptance of
the null hypothesis.

xxxi. Type II error is the one committed by maintaining a true null hypothesis when
in fact the alternative one is correct.

xxxii. Type I error is the one committed by maintaining a true null hypothesis when
in fact the alternative one is correct.

xxxiii. An estimator is a quantity describing the population that is used as a guess for
the value of a corresponding population parameter.

xxxiv. A statistic is a quantity that is derived from the sample observations.

xxxv. The standard error (of the mean) is an estimate of the standard deviation of
all possible mean values from samples of size n.

xxxvi. The scope of inference of an experimental design is limited to the population
from which the sample is drawn.

xxxvii. Correlation analysis measures the degree or strength and nature of association
between numerical variables say X and Y.

xxxviii. In regression analysis it is always possible to predict unknown value(s) of one
variable (the outcome) in terms of the known value(s) of another variable (the
dependent).

xxxix. A one-sample t-test is performed when you want to determine if the mean value
of a target variable is different from a hypothesized value.

xl. An independent-samples t-test is performed when you want to determine if the
mean value on a given target variable for one group differs from the mean value
on the target variable for a different group.

5.10 Exercise 10

Data on body weights (kg) and left ventricular ejection fractions (LVEF) for a
group of 28 male patients with acute dilated cardiomyopathy were collected at a
large hospital in the country. The hospital's analyst generated some descriptive
statistics (mean, standard deviation (Std Dev), standard error (Std Error), variance,
coefficient of variation, and range). The analysis was done in SPSS and the results
are given below:

Some descriptive statistics for the acute dilated cardiomyopathy data

Variable   N    Mean       Std Dev    Std Error   Variance      Coeff of Variation   Range

WEIGHT     28   --------   --------   --------    269.8214286   10.9066820           63.1000000

LVEF       28   --------   --------   --------    0.000674471   13.1973897           0.1000000

Complete the table in whatever way you feel is appropriate given the summary
statistics above.
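
One possible way to fill the blanks is sketched below. It relies on three standard relationships: Std Dev is the square root of Variance, Std Error is Std Dev divided by the square root of N, and the coefficient of variation is assumed to be reported as 100 × Std Dev / Mean, so that Mean = 100 × Std Dev / CV. The function name `complete_row` is illustrative only and is not part of SPSS.

```python
import math

def complete_row(n, variance, cv_percent):
    """Recover (mean, std dev, std error) from N, the variance and the CV in percent.

    Assumes CV = 100 * std_dev / mean, which is the usual definition.
    """
    std_dev = math.sqrt(variance)
    std_error = std_dev / math.sqrt(n)
    mean = 100.0 * std_dev / cv_percent
    return mean, std_dev, std_error

# Values taken from the table in the exercise
weight = complete_row(28, 269.8214286, 10.9066820)
lvef = complete_row(28, 0.000674471, 13.1973897)

for name, (mean, sd, se) in (("WEIGHT", weight), ("LVEF", lvef)):
    print(f"{name}: mean={mean:.4f}, std dev={sd:.4f}, std error={se:.4f}")
```

The range cannot be used to recover the mean, so it plays no part in this sketch.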

5.11 Exercise 11

Write the letter of the best statement in II against the item in I in the space
provided.
I
S/N Item Letter
1 Variable
2 Population
3 Qualitative variable
4 Variates
5 Sampling unit
6 Hypothesis testing
7 A parameter
8 Attributes
9 A statistic
10 The number of defective items produced
during a day’s production
11 Randomness
12 Statistical inference
13 An experiment
14 Continuous data
15 Variance
16 Cluster sampling
17 Stratified sampling
18 Eye colour of a group of individuals
19 The arithmetic mean
20 Infant mortality rate

II
Statement
A Is a procedure for reaching a probabilistic conclusive decision about a
claimed value for a population’s parameter based on a sample
B It is the entire group of interest, which we wish to describe or about
which we wish to draw conclusions
C A characteristic or phenomenon, which may take different
values (e.g., weight, gender)
D Does not vary in magnitude in successive observations
E The values of quantitative variables
F The values of qualitative variables
G Means unpredictability
H Is a process whose outcome is not known in advance with certainty
J Is a quantity that is calculated from a sample of data
K Is a person, animal, plant or thing which is actually studied by a researcher
L Refers to extending your knowledge obtained from a random sample
from the entire population to the whole population
M Is an example of discrete data
N Is an unknown value, and therefore it has to be estimated
P Can be used whenever the population can be partitioned into smaller
sub-populations, each of which is homogeneous according to the particular
characteristic of interest
Q Is an example of qualitative data
R Are collected by measuring and are expressed on a continuous scale
S Is the average of the squared deviations of each observation in the set from
the arithmetic mean of all of the observations
T Can be used whenever the population is homogeneous but can be partitioned
U Is not a better representative of the data if some values are very large in
magnitude and others are small
V Is an example of continuous variables
W Is the middle value in an ordered array of observations
X Is the absolute value of the difference between the largest and the smallest
values in the data set
Y Cannot be computed when there are negative values in a set of observations
Z Is the most frequently occurring value in a set of observations

5.12 Exercise 12

1. Choose the most correct answer and circle the letter of the best answer.

i. One of the following statements is not true

(a) The research process is cyclical


(b) The term “research” can be used only in technical sense
(c) The research process starts only with an existing practical problem

(d) Research is based on observable experience or empirical evidence

ii. Which of the following is the most encompassing definition of research?

(a) A search for objective knowledge and data


(b) A scientific and systematic search for information on specific issues
(c) A careful investigation or inquiry especially directed at the search for new
facts in any branch of knowledge
(d) A search for knowledge through objective and systematic methods of find-
ing solution to problems.

iii. Which of the following is an operational definition of obesity?

(a) A condition characterized by excessive body fat


(b) A condition that is a high priority topic for nurse researchers
(c) A condition associated with heightened risk of health problems
(d) A score greater than 30 on the Body Mass Index (BMI)

iv. What is random sampling?

(a) Assignment of a group at random


(b) A method of determining eligibility of a study
(c) A form of non-probability sampling
(d) A form of probability sampling

2. Suppose the following information is obtained from Mr XYZ on his application
for a home mortgage loan at the National Housing Corporation:

i. Place of Residence: Dar es Salaam, Arusha

ii. Type of Residence: Single-family home

iii. Date of Birth: April 4, 1972

iv. Monthly Payments: Tsh. 1,427,000

v. Occupation: Assistant Lecturer/researcher

vi. Employer: Public University

vii. Number of Years at Job: 4

viii. Number of Jobs in Past Ten Years: 1

ix. Annual Family Salary Income: Tsh. 10,000,000

x. Other Income: Tsh. 2,000,000

xi. Marital Status: Married

xii. Number of Children: 2

xiii. Mortgage Requested: Tsh. 120,000,000

xiv. Term of Mortgage: 15 years

xv. Other Loans: Car

xvi. Amount of Other Loans: Tsh. 16,000,000

Classify each of the responses by type of data (continuous numerical, discrete numer-
ical or categorical) and level of measurement (interval, nominal, ratio or ordinal).

Item No.   Type of data                          Level of measurement
           (continuous numerical,                (interval, nominal,
           discrete numerical or categorical)    ratio or ordinal)
i
ii
iii
iv
v
vi
vii
viii
ix
x
xi
xii
xiii
xiv
xv
xvi

5.12.1 Suggested solutions

1. Most correct answer.

i. (c) The research process starts only with an existing practical problem

ii. (c) A careful investigation or inquiry especially directed at the search for new
facts in any branch of knowledge

iii. (d) A score greater than 30 on the Body Mass Index (BMI)

iv. (d) A form of probability sampling

2.
Item No.   Type of data                          Level of measurement
           (continuous numerical,                (interval, nominal,
           discrete numerical or categorical)    ratio or ordinal)
i categorical nominal
ii categorical nominal
iii continuous numerical ratio
iv continuous numerical ratio
v categorical ordinal
vi categorical nominal
vii discrete numerical ratio
viii discrete numerical ratio
ix continuous numerical ratio
x continuous numerical ratio
xi categorical nominal
xii discrete numerical ratio
xiii continuous numerical ratio
xiv continuous numerical ratio
xv categorical nominal
xvi continuous numerical ratio

5.13 Exercise 13

1. A crop scientist was interested in Y = average leaf weight per plot (grams) after
75 days for plots planted with a particular soybean variety. The following data are
values of Y measured on 9 such plots chosen at random.

17.1 16.5 21.8 19.9 18.4 14.3 22.3 19.5 20.0

\[ \sum_{i=1}^{n} Y_i = 169.8, \qquad \sum_{i=1}^{n} Y_i^2 = 3256.5 \]

(a) Calculate a quantity such that observations in this sample are equally likely to
have been observed above or below this value.
(b) Calculate the best value you can that quantifies the “spread” of the observations
in this sample and that has units of grams.

2. An experiment was conducted to determine the extent to which the growth
of a certain fungus could be affected by filling tubes containing the same medium
at the same temperature with inert gases. The data below are the result of one
such experiment, where X=molecular weight of gas, Y =growth measurement in
millimeters.
X 4.0 20.2 28.2 39.9 83.8 131.3
Y 3.85 3.48 3.27 3.08 2.56 2.21

\[ \sum_{i=1}^{n} X_i = 307.4, \quad \sum_{i=1}^{n} X_i^2 = 27073.42, \quad \sum_{i=1}^{n} Y_i = 18.45, \quad \sum_{i=1}^{n} Y_i^2 = 58.5499, \]

\[ \sum_{i=1}^{n} X_i Y_i = 805.503, \quad n = 6. \]

(a) Assume that there are theoretical reasons to expect the relationship between X
and Y to follow a straight line. Fit the simple linear regression line Yi = β0 +β1 Xi +εi
to these data.
(b) Provide an interpretation for the regression parameters in the model in (a) in
terms of the situation at hand.
(c) Compute the coefficient of determination R2 . Based on this value of R2 , comment
on the usefulness of the regression line for explaining the relationship between the
response and the independent variable.
3. The mean and variance of a set of five observations are 4 and 5.2 respectively. If
three of the observations are 1, 2 and 6, find the other two.

5.13.1 Suggested solutions

1. (a) The median. The ordered data are

14.3 16.5 17.1 18.4 19.5 19.9 20.0 21.8 22.3

n = 9 is odd, so the median is defined as the middle value, or 19.5


(b) Calculate the best value you can that quantifies the “spread” of the observations
in this sample and that has units of grams.
The sample standard deviation, s. (The range could also be calculated, but it is not
as reliable a measure.) We have

\[ \sum_{i=1}^{n} Y_i = 169.8, \qquad \sum_{i=1}^{n} Y_i^2 = 3256.5 \]

\[ s = \sqrt{\frac{\sum_{i=1}^{n} Y_i^2}{n} - \left(\frac{\sum_{i=1}^{n} Y_i}{n}\right)^2} \]

Thus,

\[ s = \sqrt{\frac{3256.5}{9} - \left(\frac{169.8}{9}\right)^2} = 2.425 \text{ (approx.)} \]
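
The two answers to question 1 can be checked with a short sketch using Python's standard `statistics` module. The variable names are illustrative. Note that the formula used in these notes divides by n, which corresponds to `statistics.pstdev`; the divisor n − 1 version, `statistics.stdev`, is shown only for comparison and gives a slightly larger value.

```python
import statistics

# Leaf weights (grams) from the nine plots in question 1
leaf_weights = [17.1, 16.5, 21.8, 19.9, 18.4, 14.3, 22.3, 19.5, 20.0]

# (a) The median: half the observations lie below it and half above
median = statistics.median(leaf_weights)

# (b) Standard deviation with divisor n, as in the formula used in these notes
sd_divisor_n = statistics.pstdev(leaf_weights)

# For comparison only: standard deviation with divisor n - 1
sd_divisor_n_minus_1 = statistics.stdev(leaf_weights)

print(median, round(sd_divisor_n, 3), round(sd_divisor_n_minus_1, 3))
```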
2.
X 4.00 20.20 28.20 39.90 83.80 131.30
Y 3.85 3.48 3.27 3.08 2.56 2.21

\[ \sum_{i=1}^{n} X_i = 307.4, \quad \sum_{i=1}^{n} X_i^2 = 27073.42, \quad \sum_{i=1}^{n} Y_i = 18.45, \quad \sum_{i=1}^{n} Y_i^2 = 58.5499, \]

\[ \sum_{i=1}^{n} X_i Y_i = 805.503, \quad n = 6. \]

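(a) The least-squares estimates are

\[ \hat{\beta}_1 = \frac{n\sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2} = \frac{6 \times 805.503 - (307.4)(18.45)}{6 \times 27073.42 - (307.4)^2} = -0.012341 \]

\[ \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = \frac{18.45}{6} - (-0.012341) \times \frac{307.4}{6} = 3.7073 \]

Thus, the fitted regression line is $\hat{Y} = 3.7073 - 0.012341X$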


(b) Interpretation of the regression parameters in the model in (a) in terms of the
situation at hand.
Interpretation: β̂0 = 3.7073 represents the estimated mean fungus growth (in
millimeters) for a gas with molecular weight 0. β̂1 = −0.012341 represents the
estimated change in mean fungus growth for every unit increase in molecular weight.
Since this value is negative, every unit increase in molecular weight is associated
with a decrease of 0.012341 mm in fungus growth.
(c) Coefficient of determination R².

\[ R^2 = \frac{\hat{\beta}_1^2 S_{XX}}{S_{YY}} = \frac{\hat{\beta}_1^2 \left[\sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}\right]}{\sum_{i=1}^{n} Y_i^2 - \frac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n}} = 0.9496 = 94.96\% \]

Interpretation: R² is interpreted on the assumption that a straight line is
appropriate. This is a “high” value of R², so, given that a straight-line
relationship is appropriate, the fitted line does a good job of explaining
the variation in the response values.
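
The regression quantities in question 2 can be reproduced directly from the raw data. The sketch below uses the same sums-of-products formulas as the solution; the variable names are illustrative only.

```python
# Molecular weight of gas (X) and fungus growth in millimeters (Y)
xs = [4.0, 20.2, 28.2, 39.9, 83.8, 131.3]
ys = [3.85, 3.48, 3.27, 3.08, 2.56, 2.21]

n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
sum_xx = sum(x * x for x in xs)
sum_yy = sum(y * y for y in ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))

# Least-squares slope and intercept
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
intercept = sum_y / n - slope * sum_x / n

# Corrected sums of squares and the coefficient of determination
s_xx = sum_xx - sum_x ** 2 / n
s_yy = sum_yy - sum_y ** 2 / n
r_squared = slope ** 2 * s_xx / s_yy

print(f"slope={slope:.6f}, intercept={intercept:.4f}, R^2={r_squared:.4f}")
```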

3. By definition, the mean is

\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \ldots + X_n}{n} \]

so that

\[ 4 = \frac{1 + 2 + 6 + X_4 + X_5}{5} \]

or

\[ 9 + X_4 + X_5 = 20 \]

\[ X_4 + X_5 = 11 \quad \text{or} \quad X_4 = 11 - X_5 \qquad (1) \]

We also know that the variance is

\[ s^2 = \frac{\sum_{i=1}^{n} X_i^2}{n} - \left(\frac{\sum_{i=1}^{n} X_i}{n}\right)^2 \]

Thus,

\[ \frac{1^2 + 2^2 + 6^2 + X_4^2 + X_5^2}{5} - 4^2 = 5.2 \]

\[ \frac{41 + X_4^2 + X_5^2}{5} = 5.2 + 4^2 = 21.2 \]

\[ 41 + X_4^2 + X_5^2 = 21.2 \times 5 = 106 \]

\[ X_4^2 + X_5^2 = 106 - 41 = 65 \qquad (2) \]

Substituting equation (1) in (2) we have,

\[ (11 - X_5)^2 + X_5^2 = 65 \]

Expanding $(11 - X_5)^2$ and simplifying we have,

\[ 2X_5^2 - 22X_5 + 56 = 0 \]

Dividing by 2 throughout we have,

\[ X_5^2 - 11X_5 + 28 = 0 \]

Thus,

\[ X_5 = \frac{11 \pm \sqrt{(-11)^2 - 4 \times 1 \times 28}}{2 \times 1} \]

Simplifying we have,

X5 = 4 or X5 = 7

From equation (1), when X5 = 4, X4 = 7 and when X5 = 7, X4 = 4

Therefore, (X4, X5) = (4, 7) or (7, 4)
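
As a check on question 3, the sketch below solves the same pair of equations numerically and confirms that the completed data set has the stated mean and variance. It assumes the divisor-n variance used in these notes, which is what `statistics.pvariance` computes; the variable names are illustrative.

```python
import math
import statistics

known = [1, 2, 6]
n, mean, variance = 5, 4, 5.2

# Equation (1): the two unknowns must add up to n*mean - sum(known)
total = n * mean - sum(known)
# Equation (2): their squares must add up to n*(variance + mean^2) - sum of known squares
total_sq = n * (variance + mean ** 2) - sum(v * v for v in known)

# If x + y = total and x^2 + y^2 = total_sq, then x*y = (total^2 - total_sq) / 2,
# so x and y are the roots of t^2 - total*t + product = 0
product = (total ** 2 - total_sq) / 2
discriminant = total ** 2 - 4 * product
roots = sorted([(total - math.sqrt(discriminant)) / 2,
                (total + math.sqrt(discriminant)) / 2])

full_data = known + roots
print(roots, statistics.mean(full_data), statistics.pvariance(full_data))
```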

5.14 Exercise 14

1. Red blood cell deficiency may be determined by examining a specimen of the blood
under a microscope. Suppose a certain small fixed volume contains on the average
20 red cells for normal persons. Using Poisson distribution, obtain the probability
that a specimen from a normal person will contain less than 15 red cells.
2. Suppose that weather records show that on the average 5 out of 31 days in
October are rainy days. Assuming a binomial distribution with each day of October
as an independent trial, find the probability that the next October will have at most
three rainy days.
3. For married couples living in a certain suburb, the probability that the husband
will vote in a school board election is 0.21, the probability that the wife will vote in
the election is 0.28, and the probability that they will both vote is 0.15. What is the
probability that at least one of them will vote?
4. Let:
T+ = the test is positive (indicating that the disease is present)
T− = the test is negative
Z+ = the individual has the disease
Z− = the individual does not have the disease
and
P(T+|Z+) = the sensitivity of the test
P(T−|Z+) = the probability of a false negative
P(T−|Z−) = the specificity of the test
P(T+|Z−) = the probability of a false positive
P(Z+) = the prevalence of the disease in the population
(i) Use Bayes’ rule to find the “predictive value” of a positive test, P(Z+|T+), for a
test with 98% specificity and 99% sensitivity when the prevalence is 0.5%.
(ii) An ELISA test for AIDS has 99.5% specificity and is used on 140 employees of a
medical clinic. If all 140 are free of AIDS, what is the probability that at least one
of the 140 people will nevertheless test positive for the disease?

5.14.1 Suggested solutions

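1. Let X represent the number of red cells in a specimen from a normal person.
Thus, X ∼ P(λ), with

\[ P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}, \quad x = 0, 1, \ldots \]

Here λ = 20.

We want

\[ P(X < 15) = \sum_{x=0}^{14} P(X = x) = P(X = 0) + P(X = 1) + \ldots + P(X = 14) \]

\[ = \frac{e^{-20}\, 20^0}{0!} + \frac{e^{-20}\, 20^1}{1!} + \ldots + \frac{e^{-20}\, 20^{14}}{14!} \]

\[ = e^{-20}\left(1 + 20 + \frac{20^2}{2!} + \ldots + \frac{20^{14}}{14!}\right) \]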
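
The sum above is tedious by hand but easy to evaluate numerically. The following is a minimal sketch using only the standard library; the helper name `poisson_cdf` is illustrative.

```python
import math

def poisson_cdf(k, lam):
    """Return P(X <= k) for X ~ Poisson(lam) by summing the probability function."""
    return math.exp(-lam) * sum(lam ** x / math.factorial(x) for x in range(k + 1))

# P(X < 15) = P(X <= 14) with lambda = 20
p_fewer_than_15 = poisson_cdf(14, 20)
print(round(p_fewer_than_15, 4))
```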

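2. Let X be a random variable representing the number of rainy days in the month
of October.
Thus, X ∼ B(n, p), with

\[ P(X = x) = \binom{n}{x} p^x (1 - p)^{n-x}, \quad x = 0, 1, \ldots, 31 \]

Here n = 31, p = 5/31 and q = 1 − p = 26/31. We want

\[ P(X \le 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) \]

\[ = \binom{31}{0}\left(\frac{5}{31}\right)^0\left(\frac{26}{31}\right)^{31} + \ldots + \binom{31}{3}\left(\frac{5}{31}\right)^3\left(\frac{26}{31}\right)^{28} \approx 0.2405 \]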
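
The four binomial terms can be summed numerically as a check. This is a minimal sketch using `math.comb` from the standard library (Python 3.8 or later); the helper name `binomial_cdf` is illustrative.

```python
import math

def binomial_cdf(k, n, p):
    """Return P(X <= k) for X ~ B(n, p) by summing the probability function."""
    return sum(math.comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(k + 1))

# At most three rainy days out of 31, with p = 5/31
p_at_most_3 = binomial_cdf(3, 31, 5 / 31)
print(round(p_at_most_3, 4))
```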
3. Let A and B be the events that the husband will vote in the school board election
and that the wife will vote in the school board election, respectively, and let A ∩ B
be the event that both the husband and the wife will vote.
Thus, P(A) = 0.21, P(B) = 0.28 and P(A ∩ B) = 0.15. We are required to find the
probability that at least one of them will vote, that is, P(A ∪ B).
Using the addition law for events that are not mutually exclusive, we have
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 0.21 + 0.28 − 0.15 = 0.34

4. (i) Given: P(T+|Z+) = 99% = 0.99, P(T−|Z−) = 98% = 0.98 and P(Z+) = 0.5% = 0.005,
so that P(T+|Z−) = 1 − 0.98 = 0.02 and P(Z−) = 1 − 0.005 = 0.995.
Required to find P(Z+|T+), the predictive value of a positive test.
By Bayes’ rule:

\[ P(Z+ \mid T+) = \frac{P(T+ \mid Z+)\, P(Z+)}{P(T+ \mid Z+)\, P(Z+) + P(T+ \mid Z-)\, P(Z-)} = \frac{0.99 \times 0.005}{0.99 \times 0.005 + 0.02 \times 0.995} = 0.1992 \text{ or } 19.92\% \]

(ii) An ELISA test for AIDS has 99.5% specificity and is used on 140 employees of a
medical clinic. If all 140 are free of AIDS, what is the probability that at least one
of the 140 people will nevertheless test positive for the disease?
Given P(T−|Z−) = 99.5% = 0.995. Required to find the probability that at least one
of the 140 disease-free people tests positive.
We know that if E is an event and Ē is its complement, then P(E) + P(Ē) = 1. Here
the complement of “at least one person tests positive” is “all 140 people test
negative”. Assuming the 140 test results are independent,
P(all people test negative) = [P(T−|Z−)]^140 = (0.995)^140. Thus,
P(at least one positive) = 1 − (0.995)^140 ≈ 0.504, or about 50%.
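
Both parts of question 4 can be verified with a few lines of code. The sketch below is illustrative only: the function name `positive_predictive_value` is not from the notes, and part (ii) assumes that the 140 test results are independent.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Return P(Z+ | T+) by Bayes' rule."""
    true_positive = sensitivity * prevalence               # P(T+|Z+) P(Z+)
    false_positive = (1 - specificity) * (1 - prevalence)  # P(T+|Z-) P(Z-)
    return true_positive / (true_positive + false_positive)

# (i) 99% sensitivity, 98% specificity, 0.5% prevalence
ppv = positive_predictive_value(0.99, 0.98, 0.005)

# (ii) At least one false positive among 140 disease-free people,
# assuming independent tests with 99.5% specificity
p_at_least_one_positive = 1 - 0.995 ** 140

print(round(ppv, 4), round(p_at_least_one_positive, 4))
```

Even with a highly accurate test, the low prevalence means most positive results come from the much larger disease-free group, which is why the predictive value is only about 20%.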
