Introductory Statistics
Lecture Notes
T. Kassile
Department of Biometry and Mathematics
Faculty of Science
Sokoine University of Agriculture
Room 17, Administration Block, SMC, Tel: 0232604420 Ext. 2108
Draft
March 2013
Contents
1.2.4 Basic survey designs
1.2.5 Sample size determination
1.2.6 Questionnaire design
1.3 Data analysis/presentation
1.3.1 Introduction
1.3.2 Measures of central tendency (averages)
1.3.3 Measures of spread (dispersion)
1.3.4 Simple Linear Regression analysis
1.3.4.1 Introduction
1.3.4.2 Simple linear regression model
1.3.4.3 Fitting a simple linear regression model: the method of least squares
1.3.5 Correlation analysis
2.2.2.8 Conditional probability
2.2.2.9 Independent events
2.2.2.10 Multiplication law for not independent events
2.2.2.11 Bayes’ rule
2.2.3 Probability density function (discrete r.v)
2.2.4 Probability density function (continuous r.v)
2.2.5 Discrete distributions
2.2.5.1 The Binomial distribution
2.2.5.2 The Poisson distribution
2.2.6 Continuous probability distribution
2.2.6.1 The normal distribution
5 Appendices
5.1 Exercise 1
5.2 Exercise 2
5.3 Exercise 3
5.3.1 Suggested solution for question 4
5.4 Exercise 4
5.5 Exercise 5
5.6 Exercise 6
5.7 Exercise 7
5.8 Exercise 8
5.9 Exercise 9
5.10 Exercise 10
5.11 Exercise 11
5.12 Exercise 12
5.12.1 Suggested solutions
5.13 Exercise 13
5.13.1 Suggested solutions
5.14 Exercise 14
5.14.1 Suggested solutions
Preamble
To introduce the students to some basic concepts in statistics (theory and practice)
which are necessary for handling numerical observations.
Descriptive Statistics
Definitions of relevant statistical terminologies; introduction to elementary statistics:
data collection, organization and presentation: frequency distribution, statistical
measures of central tendency and dispersion, measures of symmetry and skewness,
simple linear regression and correlation analysis.
Statistical Inference
Elementary probability theory; introduction to probability distributions: discrete
distributions, e.g., Poisson, binomial; continuous probability distributions, e.g., normal.
Sampling distributions
Sampling distributions, e.g., Student’s t distribution, Chi-square distribution, F-distribution.
Estimation theory
Point and interval estimation.
Hypothesis testing or test of significance
Null and alternative hypotheses, level of significance, Type I and Type II errors,
one-tailed and two-tailed tests.
0.3 Requirements
0.3.1 Readings
You are required to do the readings before we discuss them in class. At the end of
each lecture you will be informed which aspects are to be read for the next lecture.
Handouts will be provided. However, to complete your understanding of each of the
aspects discussed in this course, you are advised to consult any of the reference books
listed in Section 1.1.
0.3.2 Exercises
There will be a number of exercises throughout the semester. You should use these
exercises to assess whether you are making acceptable progress. You are strongly
urged to complete the exercises. In addition, you are encouraged to work together in
teams of 2-4 students to help each other understand the course material and
complete the exercise problems. However, if you find you are having trouble working
through the exercises or understanding the material covered in class, you should see
the course instructor as soon as possible: “the earlier the better”. NO credit
will be given for the exercises. Partial solutions to some of the problems in the
exercises may be provided if necessary.
Assignments and tests will contribute 40% of the total credits allotted to this course
(coursework) and the final written university exam (UE) will contribute 60%. All
tests will be closed-book (no lecture notes, books, etc.). You are expected to complete
the coursework assessment (tests) during the course of the semester as indicated above,
NO exceptions.
0.4 References
Where necessary, SPSS will be used to illustrate how to generate results or carry out
data analysis in a software package.
1 Chapter One: Descriptive statistics
Descriptive statistics deals with presenting the data we have. Data can be presented:
(i) visually (through graphs, e.g., line graphs to display a trend over time such as maize
production in Tanzania over 20 years, or charts such as pie charts to display, for example,
people’s opinions about the effects of climate change on food production and livelihood
in Tanzania), or (ii) numerically (through averages such as the mean, median, mode, etc.).
The fundamental objective of descriptive statistics is to present the data in a clear,
logical, and meaningful way.
Illustration
• Suppose that you have data on the total family income of each applicant (20,000
in total) seeking sponsorship from the Higher Education Students’ Loan Board
(HESLB) for the academic year 2012/2013.
• Data were collected from 12,000 babies born between 2000 and 2010 at a
certain public hospital in the country. The aim of the study was to understand
whether the mother’s smoking status during pregnancy, number of physician visits
during the first trimester, history of hypertension, age of the mother at birth, etc.
are risk factors for low birth weight, defined as a birth weight of less than 2500 grams.
• Suppose that you have data on the GPA of 15,000 first year students enrolled in
4-year degree programmes at Sokoine University of Agriculture (SUA). The
aim of the study is to understand whether first year GPA predicts final year
GPA.
Goal: Describe the HESLB, the hospital, and the university in terms of the total
family income of each student, birth weight of each baby, and GPA score of each
student respectively.
Problem: If we wish to describe HESLB, the hospital, and the university in terms
of total family income, birth weight, and GPA score, respectively, the listing of 20,000
family incomes, 12,000 birth weights, and 15,000 GPA scores would be unwieldy.
Solution
Use descriptive statistics. As described above, descriptive statistics provides us
with graphical and numerical techniques for describing the HESLB, the hospital,
and the university concisely in terms of the total family income of the applicants,
birth weight of the babies, and GPA score of its students.
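As a minimal sketch of the numerical side of this idea, the Python snippet below reduces a list of 20,000 incomes to a handful of summary figures. The incomes are simulated purely for illustration (they are not HESLB data), and the function name describe is an assumption introduced here.

```python
import random
import statistics

# Hypothetical data: 20,000 simulated family incomes (TZS), NOT real HESLB records.
random.seed(1)
incomes = [random.lognormvariate(13.0, 0.8) for _ in range(20_000)]


def describe(values):
    """Return a few numerical summaries of a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std_dev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


summary = describe(incomes)
for name, value in summary.items():
    print(f"{name:>8}: {value:,.0f}")
```

Six numbers are far easier to read than a listing of 20,000 values, which is the point of the solution described above.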
1.1.2 Inferential statistics
Inferential statistics is concerned with data analysis for decision-making. That is, it
employs sample data to make estimates, decisions, predictions, or other generalizations
about a population. More on this later when we discuss hypothesis testing.
Illustration
Suppose we wish to estimate the proportion of all married women in Morogoro region
who have completed at least A-level secondary school education between 2000 and
2010. Suppose, further, that a reasonably complete list of all married women in
Morogoro is available at the National Bureau of Statistics (NBS). Can we locate and
interview each of the women in the list? This will be costly and time-consuming.
An easier and more efficient approach would be to randomly sample, say, 800
women from the list of all married women and contact each of the selected women
individually.
• Use the proportion of married women in the sample who have at least A-level
secondary school education to estimate the proportion of all married women in
Morogoro with the same attribute or characteristic.
• The sample proportion is expected to be close to the proportion of all married
women in Morogoro with at least A-level secondary education.
• It is possible to tell by how much the sample estimate is expected to differ
from the proportion of all married women in Morogoro with at least A-level
secondary education.
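The three points above can be sketched in Python as follows. This is only an illustration under stated assumptions: the population of 100,000 women and the true proportion of 0.30 are made-up values, and the margin of error uses the usual normal approximation with Z = 1.96.

```python
import math
import random

random.seed(2013)

# Hypothetical sampling frame: 1 = has at least A-level education, 0 = does not.
# The population size and the true proportion (0.30) are made-up values.
population = [1] * 30_000 + [0] * 70_000
true_p = sum(population) / len(population)

# Simple random sample of 800 women, drawn without replacement.
sample = random.sample(population, 800)
p_hat = sum(sample) / len(sample)

# Estimated standard error of the sample proportion and an approximate
# 95% margin of error (normal approximation, Z = 1.96).
std_error = math.sqrt(p_hat * (1 - p_hat) / len(sample))
margin = 1.96 * std_error

print(f"true proportion    : {true_p:.3f}")
print(f"sample proportion  : {p_hat:.3f}")
print(f"95% margin of error: +/- {margin:.3f}")
```

In a real survey the true proportion is unknown; it is included here only so that the sample estimate can be compared with it.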
More examples
1.1.3 Population
The totality of all actual or conceived objects of a certain class on which data are
collected; that is, the entire group of objects about which information is gathered.
1.1.4 Sample
That part of the population by means of which one seeks to represent the whole
population (in some situations, a sample may include the whole of the population). In
practice, the intention is to use sample information to make an inference about a
population. For this reason, it is particularly important to define the population
under discussion and to obtain a representative sample from the defined population.
NOTE: To avoid making erroneous conclusions, a sample must be representative of
the population. To obtain a representative sample, we employ rules for drawing
the sample items: the principle of randomness. This will be discussed later in the course.
Question
A statistical population is composed of:
(a) Persons or things
(b) Data
(c) Characteristics of persons or things
(d) Measurements
1.1.5 Statistics
The development and application of theories and methods to handle the collection,
analysis, and interpretation of data for drawing useful conclusions.
1.1.6 Biometry
Biometry is a branch of statistics in which statistical techniques are used for biological
investigations; that is, it is the use of statistical techniques to arrive at a decision
about a certain biological problem.
1.1.7 Data
Data (plural; the singular is datum) are a collection of facts, such as values or
measurements. They can be numbers, words, measurements, observations or even just
descriptions of things.
Examples
ii. Marks in MTH 106 tests and assignments: 20, 32, 10, 50, 38, 48;
iii. Marital status of a sample of 35-year-old men and women in Morogoro region:
married, divorced, widowed, widower, separated, living as married, never married;
iv. Times to abatement of symptoms from 4 samples of patients treated with 2
different drugs;
v. The classification of each patient in a group of 80 as having “high”, “average”,
or “low” systolic blood pressure.
Types of Data
Two types: Discrete and continuous data. Discrete data can take only distinct
values, which can be clearly identified and separated. For example, the number of
students texting messages during an MTH 106 lecture in MLT 8 can only take values of
0, 1, 2, 3, and so on, with nothing in between. Continuous data can take any value
within a range. For example, when you measure levels of selected heavy metals in a
sample of water, soil or fish, the result could take any value, depending on the
instrument of measurement and how accurately you do the measurement; it can take on
values such as 2.50, 0.05, 0.55, etc. Another example is the average amount of
electricity and water consumed per household in Morogoro per month.
1.1.8 Variables
A variable is a characteristic that changes (i.e., shows variability) from unit to unit
or from one individual to another (e.g., heights, weights, plots, etc.). Variables
are often denoted by upper case letters, e.g., X, Y, H and so on. A characteristic that
can assume only one value is called a constant. Technically, data are observations
of variables.
Types of variables
Variables may be either quantitative or qualitative. A quantitative variable is
one for which the resulting observations can be measured, or the observations are
in the form of numerical values. For example, heights, weights, etc. Observations
on quantitative variables may be further classified as continuous or discrete. A
continuous variable is one for which all values in some range are possible. With
continuous variables we are limited in recording the exact values by the precision
and/or accuracy of the measuring device. Examples include height, weight, etc. By
contrast, a discrete or discontinuous variable is one for which the possible values are
not observed on a continuous scale because of the existence of gaps between possible
values. Often discrete observations are integers because they arise from counting.
separated, never married), political affiliation (e.g., “CCM”, “CUF”, “CHADEMA”,
“UDP”, etc.).
Illustration
Classify each of the following variables measured as quantitative or qualitative, and
continuous or discrete.
ii. Time: quantitative, continuous (limited only by the precision of our
time-recording method);
Exercise
1. Chemical and manufacturing plants often discharge toxic-waste materials such as
DDT into nearby rivers and streams. These toxins can adversely affect the plants
and animals inhabiting the river and the river bank. The National Environment
Management Council (NEMC) conducted a study of fish in the Ngerengere river in
Morogoro region and in its four tributary creeks: C1, C2, C3, and C4. A total of 200
fish were captured, and the following variables were measured for each:
(c) Cholesterol level in blood
(d) Number of blood cells/ml. of blood
ii. Are you the person who typically does the food shopping for your household?
iv. How would you rate the taste of the snack food on a scale of 1 to 10, where 1
is least tasty?
v. Would you purchase this snack food if it were available on the market?
vi. If you answered yes to part (v), how often would you purchase the product?
Classify the data generated for each question as quantitative or qualitative. Justify
your classifications.
observed data are classified into distinct categories in which ordering is implied, an
ordinal level of measurement is attained. Therefore, an ordinal scale incorporates
the feature of a nominal scale and an additional feature that observations can be
ordered or ranked from low to high.
An undesirable feature of the interval scale is that the origin on the scale is
undetermined, that is, we do not know where 0 is located. For example, for the IQ test
score, a zero (0) IQ score does not mean zero intelligence.
On the other hand, if a meaningful zero point can be defined for an interval scale,
the scale becomes a ratio scale. That is, a ratio scale incorporates all the features
of interval (and hence nominal and ordinal) scales and the additional feature that
ratios can be formed from values on the scale.
1.1.10 Operational definition
Example
Which of the following is an operational definition of obesity?
Exercise
Provide an operational definition for each of the following:
(a) An outstanding student
(b) A hard worker
(c) A nice day
(d) Fast service
(e) Study time
(f) A manager
(g) A boring class
(h) Commuting time to school or work
(i) An interesting book
(j) A leader
When deciding on the variable(s) of interest for a study, one needs to consider
validity (do the variables measure what they are intended to measure?) and
reliability (are the measurements obtained from the variables of interest stable?).
Exercise
1. Explain the difference between a categorical and a numerical random variable and
give an example of each.
i. Number of cellular phones per household
iii. Length (in minutes) of longest international call made per month
vii. Gender
What are the steps required for data collection? Data collection procedure can be
divided into three major stages namely:
ii. designing the instruments (e.g., questionnaire) of data collection. This entails
formulation of relevant questions and corresponding responses for closed-ended
questions (to be discussed shortly).
iii. sampling and field work or execution of the study. Sampling involves selection
(random or non-random depending on the sampling frame available, the degree of
representation desired and whether inference is required) and determination of
the sample size (which depends on the availability of resources and the level of
precision/margin of error required). Field work involves administration of the
designed data collection instrument (e.g., questionnaire) through face-to-face
interviews, mail, telephone, web, etc. as discussed below.
Data are classified according to source as primary data or secondary data.
Primary data is a set of information that is collected for the first time, and thus
happens to be original in character. In contrast, secondary data is a set of
information that has already been collected by someone else or by an institution such
as the National Bureau of Statistics (NBS). It is a set of information that has been
summarized in some form and is available in published sources such as a book, a journal
article, conference proceedings, etc.
1.2.1.1 Collection of primary data
Primary data can be collected through either a census survey or sample survey.
As defined before, the former (census survey) involves the collection of data from
the whole population (complete enumeration) whereas the latter (sample survey)
involves the collection of data from part of the population.
Whether through a sample survey or a census survey, we can obtain primary data through
methods such as:
Disadvantages
ii. useful and practical when the sample sizes or populations are relatively small.
iv. Sometimes unforeseen factors may interfere with the observation exercise. At
times, the fact that some people are rarely accessible to direct observation
makes it difficult to collect data effectively with this method.
This method is particularly suitable in studies which deal with subjects (i.e.
respondents) who are not capable of giving verbal reports of their feelings for one
reason or the other. Examples of areas where direct observation has been used are:
ii. Price collection exercises, where enumerators can purchase the produce and
record prices.
II: Personal interviews. Under this method information is collected face-to-face.
That is, the interviewer asks questions, the respondent gives responses, and the
interviewer records the responses in a data collection tool or questionnaire.
Advantages
iii. greater potential for collecting information on difficult items which are likely
to yield ambiguous answers in other methods such as mailed questionnaire.
Disadvantages
ii. In the process of probing, some interviewers may suggest answers to respondents.
iii. Interviewers may read questions wrongly because of the divided attention of
interviewing and recording.
i. It is cheaper.
iv. It is quick.
Disadvantages
ii. The answers to the questions are taken at their face value as there is no
opportunity to probe.
iv. The method is useful only when the questionnaires are fairly simple, and,
therefore, it is not a suitable method for complex surveys.
1.2.1.2 Collection of secondary data
This can be collected from the following sources:
ii. From Banks, e.g., Bank of Tanzania, District Councils, Municipalities, City
Councils, etc.;
Question
How do you decide which mode of data collection to employ? The choice may be
influenced by things like:
• population of interest
• types of questions
• question topic
In general, when you have a problem that requires data you can:
• Design an experiment
1.2.3 The sample survey
Remark: in practice, because of limited resources (time and money), people often
opt for a selection of respondents, i.e. selection of a small proportion of the total
population of interest. If we wish to make inference about the entire population
from which the sample is drawn we must obtain a representative sample. A
representative sample exhibits characteristics typical of those possessed by the target
population.
Question
How do we achieve the representative sample requirement?
Answer
Select a random sample. A random sample ensures that every element in the
population of interest has the same chance of being selected to constitute the sample.
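A minimal Python sketch of random selection is given below. The sampling frame of 500 numbered units and the sample size of 20 are hypothetical values chosen for illustration; the snippet shows the mechanics of drawing with and without replacement using the standard library, and is not a recommendation of either.

```python
import random

random.seed(7)

# Hypothetical sampling frame: units labelled 1..500.
frame = list(range(1, 501))

# Without replacement: a unit can appear in the sample at most once.
without_replacement = random.sample(frame, k=20)

# With replacement: a unit goes back into the frame after each draw,
# so the same unit may be selected more than once.
with_replacement = random.choices(frame, k=20)

print("without replacement:", sorted(without_replacement))
print("with replacement   :", sorted(with_replacement))
```

In both cases every unit in the frame has the same chance of being picked at each draw.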
Definition
The process of selecting a sample is called a sampling technique. A survey so
conducted is known as a sample survey.
ii. stratified random sampling
Exercise
Which sampling mechanism (with replacement or without replacement) would you
prefer to use and why?
Question
How are strata formed and how should items be selected from each stratum?
element of randomness is introduced into this kind of sampling by using random
numbers to pick up the unit with which to start.
Exercise
Suppose you have decided to use a systematic sampling procedure for a study. The
known population size is 5,000, and the sample size desired is 250. What is the
sampling interval? If the first element selected is 23, what would be the fourth, fifth,
sixth, and ninth elements selected?
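The mechanics of systematic sampling can be sketched in Python as below. The numbers used (1,200 units, a sample of 100, a start of 7) are hypothetical and deliberately differ from those in the exercise; the function name systematic_sample is an assumption introduced here.

```python
import random


def systematic_sample(population_size, sample_size, start=None):
    """Return the positions (1-based) selected by systematic sampling.

    The sampling interval is k = population_size // sample_size.
    A random start between 1 and k is used unless one is supplied.
    """
    k = population_size // sample_size
    if start is None:
        start = random.randint(1, k)
    return [start + i * k for i in range(sample_size)]


# Hypothetical numbers: N = 1,200 units, n = 100, random start fixed at 7.
selected = systematic_sample(1200, 100, start=7)
print(selected[:5])  # [7, 19, 31, 43, 55]
```

Here the sampling interval is 1200 / 100 = 12, so after the starting unit every 12th unit is taken.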
• Note: when the clusters are too large, a second set of clusters is taken from
each original cluster. This leads to what is commonly known as two-stage
cluster sampling. If a third set of clusters is taken it is known as three-stage
cluster sampling, etc.
Review question
At this stage you must be able to answer the question: why do we go for sample or
census surveys?
Reasons for sample survey include:
ii. Can save time: Because the number of individuals covered is small. This is
especially important when results are immediately required;
iv. Save product: For destructive surveys, collecting data from a sample can save
the product being studied;
v. If it is impossible to access the whole population, a sample is the only option
(e.g., for populations with infinitely many members).
Reasons for census survey include:
ii. A client (the person authorizing and/or underwriting the study) might not have an
appreciation for random sampling and may feel more comfortable with conducting
a census;
ii. Nonresponse error: results from the failure to collect data on all subjects in
the sample.
Ethical issues
It is also important to note that not all survey research is ethical. For instance,
purposive exclusion of particular groups of individuals from the population
frame in order to obtain results that are favourable to the sponsor of the survey is
unethical. Furthermore, designing questions that are likely to guide the respondent in
a particular direction, so as to capture responses that would lead to positive results,
is unethical.
Review questions
At this juncture I expect you to be able to answer the following conceptual questions.
Make sure that you have a clear understanding of the concepts and are able to explain
any of them to a fellow student who, for practical reasons, missed any of
the lectures. If you find you are having problems answering any of these questions,
re-read the appropriate section(s) in the course notes or consult the instructor for
further elucidation on the concept(s).
Question 1
iii. What is the difference between an enumerative study and an analytical study?
iv. What is the difference between a categorical and a numerical random variable?
viii. What are the main reasons for obtaining data and what methods can be used
to accomplish this?
xi. What is the difference between sampling with versus without replacement?
xii. What distinguishes the four potential sources of error when dealing with
surveys designed using probability sampling?
Question 2
For each of the following statements, write True if the statement is true and False
if it is not true.
ii. In principle, the ordinal scale presumes that if “a” is greater than “b” and “b”
is greater than “c”, then it is true that c<b<a.
iii. A continuous variable is a variable that can theoretically assume a finite
number of values.
iv. In quantitative data analysis, normally, bar graphs are used to show/present
the frequencies that characterize a quantitative variable.
vi. A random sample is one that is always typical of the population.
viii. For all practical purposes, infinite populations are large populations while finite
populations are small populations.
ix. When you estimate population parameters based on the properties of the sam-
ple you are making a sampling error.
x. The general ethical issue is that the research design should subject the respondent
to material disadvantage.
xi. Access to both primary and secondary data depends on related objectives,
research questions except research design.
xii. Stratified random sampling techniques are based on the geographical proximity
or a common characteristic.
xiii. Focus group interviews involve face-to-face, repeated interaction between the
researchers and her/his informants.
Question 3
Given a population of N=93, using a table of random numbers draw a random sample
of size n=15 without replacement. List the 15 coded sequences obtained and compute
the sample mean. Repeat the exercise by sampling with replacement. Compare the
results obtained.
Cross-sectional surveys
In cross-sectional survey designs the required data are collected at one point in time
from a sample selected to represent a larger population.
Longitudinal Surveys
This involves collecting information over a given period of time. Longitudinal studies
may involve a survey of a sample of the population at different points in time (trend),
a study of the same population each time data are collected, although the samples
studied may be different (cohort), or the collection of data at various time points
from the same sample of respondents (panel).
There are numerous formulas and software packages for estimating the size of a
sample for finite populations. However, the formulas vary depending on the quantity
(mean, proportion, difference between two population means, etc.) to be estimated.
To illustrate this aspect, we consider sample size calculation for estimating a
population mean.
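Consider the equation
\[ Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \tag{1} \]
Denoting the sampling error by $e = \bar{X} - \mu$ (the difference between the sample
mean and the population mean, µ) and solving for n we obtain
\[ n = \frac{Z^2 \sigma^2}{e^2} \tag{2} \]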
• The confidence level desired, which determines the value of Z, the critical value
from the normal distribution.
Example
A survey is planned to determine the average annual family medical expenses of
employees of a large company. The management of the company wishes to be 95%
confident that the sample average is correct to within ±TZS 50 of the true average
family medical expenses. A pilot study indicates that the standard deviation can be
estimated as TZS 400.
ii. If management wants to be correct to within ±TZS 25, what sample size is
necessary?
Solution
(i) Given e=50 (sampling error), σ=400 (population standard deviation obtained
from pilot study), and 95% confidence of estimating the true mean (Z=1.96 critical
value from the normal distribution)
Note: the general rule in determining sample size is to always round up to the
nearest integer value.
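For the medical expenses example above, the sample size formula n = Z²σ²/e² can be evaluated directly. The Python sketch below does this for both parts of the example and rounds up as the note requires; the function name sample_size_for_mean is an assumption introduced for illustration.

```python
import math


def sample_size_for_mean(z, sigma, e):
    """Sample size n = Z^2 * sigma^2 / e^2, rounded up to the next integer."""
    return math.ceil((z ** 2) * (sigma ** 2) / (e ** 2))


n_50 = sample_size_for_mean(z=1.96, sigma=400, e=50)  # part (i):  245.86 -> 246
n_25 = sample_size_for_mean(z=1.96, sigma=400, e=25)  # part (ii): 983.45 -> 984
print(n_50, n_25)  # 246 984
```

Halving the sampling error from TZS 50 to TZS 25 multiplies the required sample size by about four, because e enters the formula squared.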
Exercise
1. An advertising agency that serves a major radio station in Tanzania would like
to estimate the average amount of time the station’s audience spends listening to the
radio on a daily basis. From past studies the standard deviation is estimated as 45 minutes.
i. What sample size is needed if the agency wants to be 90% confident of being
correct to within ±5 minutes?
2. Suppose that REDET wants to estimate the proportion of voters who will vote
for the CCM candidate in the 2015 presidential election. REDET would like 90%
confidence that its prediction is correct to within ±0.4 of the population proportion.
ii. If REDET wants to have 95% confidence, what sample size is needed?
iii. If it wants to have 95% confidence and a sampling error of ±0.03, what sample
size is needed?
Hint: use
\[ n = \frac{Z^2 p(1-p)}{e^2} \]
where p is the true proportion of successes; if no prior knowledge or estimate of the
true proportion p is available, use p=0.5.
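The formula in the hint, n = Z²p(1−p)/e², can be sketched in Python as follows. The inputs used here (95% confidence, a sampling error of ±0.05, no prior estimate of p) are hypothetical and are not those of the exercise; the function name sample_size_for_proportion is likewise an assumption.

```python
import math


def sample_size_for_proportion(z, e, p=0.5):
    """n = Z^2 * p * (1 - p) / e^2, rounded up; p defaults to 0.5."""
    return math.ceil((z ** 2) * p * (1 - p) / (e ** 2))


# Hypothetical inputs (not the exercise's): 95% confidence, e = 0.05, p unknown.
n = sample_size_for_proportion(z=1.96, e=0.05)
print(n)  # 384.16 rounds up to 385
```

Using p = 0.5 gives the largest value of p(1−p), so it yields the most conservative (largest) sample size when nothing is known about p.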
The questionnaire is an important tool in the data collection process, which involves
the transfer of information from one party (the respondents) to another (the
interviewer). Therefore, in order to collect correct and reliable information, the
questionnaire must be properly designed. The size and format of the questionnaire are
crucial considerations when designing the questionnaire. Generally a good
questionnaire should:
• Ensure economy in data collection, that is, avoid collection of any non-essential
information.
Formulation of questions
Questionnaires can contain closed-ended (forced-choice) and open-ended questions. When
designing a questionnaire, always use simple, clear, precise and unambiguous language.
General guidelines
1. Try to be concise
Example
Poor formulation: How do you feel about building an airport at Jangwani grounds
which have not been used optimally for a number of years?
Better formulation: An airport should be built at Jangwani grounds
1 = strongly agree
2 = agree
3 = disagree
4 = strongly disagree
2. Use mutually exclusive and exhaustive categories
Poor formulation: What is your marital status?
1=Married
2=Single
Better formulation: What is your marital status?
1=Married
2=Divorced
3=Separated
4=Widowed
5=Never Married
3. Use caution when asking personal questions
Poor formulation: How much do you earn each year? TZS.......................
Better formulation: Into which category did your annual income last year fall?
1=TZS 100,000 or below
2=TZS 100,001-TZS 200,000
3=TZS 200,001-TZS 300,000
4=TZS 300,001-TZS 400,000
5=TZS 400,001-TZS 500,000
6=Over TZS 500,000
4. Question Order
5. Limit “skip” patterns
Do you participate in sports?
1 = No (GO TO QUESTION NO.)
2 = Yes (circle all sports that apply)
1=Football
2=Volleyball
3=Basketball
4=Soccer
5=Swimming
6=Other (Please specify.......................................................)
More examples
When an unforeseen event occurs, do you address it by selling some of your assets?
1=Usually
2=Always
3=Sometimes
4=Not at all
Other ordinal scales commonly used:
Example of questionnaire with open- and closed-ended questions
Open-ended: How helpful are your fellow students in the event of a health shock?
Closed-ended: My fellow students are helpful in the event of a health shock. Circle one.
1=Definitely agree
2=Agree
3=Disagree
4=Definitely disagree
Closed-ended: In general, how would you describe relations in your workplace between management
and employees?
1=Very good
2=Quite good
3=Neither good nor bad
4=Quite bad
5=Very Bad
Closed-ended: Have you ever attended school?
1=Yes
2=No
Closed-ended: Which of the following books have you read? Circle all that apply.
1=Statistical methods
2=Experimental designs
3=Statistics for managers
4=Clinical trials
5=Introductory statistics
Exercise
Suppose that the National Health Insurance Fund (NHIF) would like to survey 1,500
of its members in Morogoro primarily to determine the percentage of its members
that currently own more than one car.
ii. Describe the type of data that the NHIF primarily wishes to collect.
iii. Develop a first draft of the questionnaire needed by writing a series of five
categorical questions and five numerical questions that you feel would be
appropriate for this survey. Provide an operational definition for each variable.
1.3.1 Introduction
Data analysis is a very important but challenging part of any research study. The
challenge is that there is no single statistical analytical technique that is available
for use to analyze every set of collected data. Different techniques do different
things! Therefore, in order for the results of any study/investigation to be useful,
the collected data must be appropriately analyzed taking into consideration the
prime objective of the analysis. That is, asking the question: why is it necessary to
organize the collected data? Alternatively stated: Why do you need to carry out an
analysis of the collected information?
One goal of statistical inference (to be discussed in Chapter Two) is to use sample
information to learn something about the population. Hence, we often carry out an
analysis in order to aid the process of learning something about the characteristic(s)
of the population based on sample information. That is, draw conclusions about the
characteristics of a population. In this chapter we will discuss different techniques
for analysing or presenting data.
Presentation of analysis results (optional)
Once you have finished analysing the data, you have to present the results in a
systematic way. In academic writing (e.g., special projects, master’s dissertations, etc.),
the results of any research have to be presented following an acceptable format. A report
or presentation of the results of an academic research study has to include at least the
following five sections:
ii. Introduction, in which the purpose of the investigation is described. That is,
problem statement, objectives and significance of the study are all included in
this section.
iii. Literature review, in which the state of evidence on the topic under study is
given. It covers both theoretical and empirical literature. Theoretical literature
provides the theoretical underpinning/foundation upon which the problem
under investigation is based. Empirical literature, on the other hand, gives
the investigator an opportunity to understand the current state of evidence on
the topic and thus to identify gaps in knowledge (or limitations of previous
empirical studies) that need to be addressed.
iv. Methodology, in which the population studied and the techniques (data
collection and statistical analyses) used are described.
v. Results and Discussion, in which the findings of the investigation are
presented, interpreted and discussed.
viii. Appendices (if any), in which extra texts, tables and diagrams, which have
not been included in the main text are presented.
In the results and discussion section of the report, the findings are usually
presented in the form of tables or diagrams whose main features are also described in the
text. Choice of the most appropriate tables and diagrams to summarize the results
is essential, both because they enable the investigator to describe his/her findings
concisely and because they may suggest to him/her some aspects of the data which
he/she has not yet analyzed. Each table or diagram must be designed to show
one or two points, for it is usually better to have a number of simple tables or
figures than a single complex one. In particular, tables or diagrams that are too long
to fit on one page are not preferred in a report. Tables and diagrams should
be self-explanatory and comprehensible without reference to the accompanying
text or description. Note further that if a table contains figures such as percentages,
which are derived by calculation from the original observations, they should
be accompanied by sufficient data to enable recalculation of the original numbers.
Percentages, rates, and other derived figures should not be expressed with more
precision, for example to more decimal places, than is compatible with the precision of
the original observations. Each table or figure must be precisely labelled, indicating
the source of the data and the date. Examples of graphical methods of displaying data
include line plots, histograms, bar charts, pie charts, scatter plots, maps, etc. Also
note that in some settings or reports the results and discussion are presented as two
separate sections.
Major goals of data presentation
i. Classification
ii. Tabulation
iii. Graphic representation.
Before attempting to group the data into classes, one should always start by
creating an array of the data.
Definition: An array is defined as an arrangement of raw numerical data in ascend-
ing or descending order of their magnitude.
Objectives:
iii. To prepare the basis for further analysis.
Frequency Distributions
A tabular arrangement of data by classes together with the corresponding class fre-
quency is called a frequency distribution or frequency table.
Frequency: the number of times a particular value occurs. For a given class, the
relative frequency is the fraction obtained by dividing the class frequency by the total
frequency.
A frequency distribution has lower and upper limits, lower and upper class
boundaries, an interval and a mid-value.
Class Boundaries
From the class interval 20-30, the numbers 19.5 and 30.5 are called class boundaries
or true class limits, the smaller number (19.5) is the lower class boundary and the
larger number (30.5) is the upper class boundary.
In practice, the class boundaries are obtained by adding the upper limit of one class
interval to the lower limit of the next higher class interval and dividing by 2.
Example of a frequency distribution table
Table 1: Heights of 20 Male Students
Classes    Class Boundaries    Mid-Points    Frequency
32 - 35    31.5 - 35.5         33.5          1
36 - 39    35.5 - 39.5         37.5          2
40 - 43    39.5 - 43.5         41.5          7
44 - 47    43.5 - 47.5         45.5          7
48 - 51    47.5 - 51.5         49.5          3
The first class consists of heights from 32 to 35 and is indicated by the range symbol
32-35. Only 1 male student belongs to this class, so the corresponding class frequency
is 1, as indicated in the last column of Table 1.
i. Determine the largest and smallest numbers in the raw data and thus find the
range r (the difference between the largest and smallest numbers).
ii. Determine the number of classes (m) of your frequency distribution. There is
no fixed rule for determining the number of classes, but on the basis of the size of
the data n, you can think of an appropriate number of classes for the frequency
distribution.
iii. Usually m lies between 5 and 15. However, there is an empirical formula,
which is rarely used, defined as m = 1 + 3.3 log n (logarithm to base 10). If the
result is a fraction, the next higher whole number is taken as m.
iv. Determine c or h, which is the uniform size of the class interval, by using the
relationship h = r/m.
Determine the number of observations falling into each class interval, that is, the class
frequencies. This is best done using a tally or score sheet.
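The steps above can be sketched in Python. This is only an illustration, not part of the original notes: the function name `frequency_table` and the raw heights are hypothetical (the heights were made up so that five classes reproduce the frequencies of Table 1), and classes are formed directly from the smallest value with width h = r/m rather than from rounded class limits.

```python
import math

def frequency_table(data, m=None):
    """Return (lower, upper, frequency, relative frequency) for m equal-width classes."""
    n = len(data)
    smallest, largest = min(data), max(data)
    r = largest - smallest                      # step i: the range
    if r == 0:
        raise ValueError("all observations are equal; no classes can be formed")
    if m is None:
        # step iii: empirical formula m = 1 + 3.3 log10(n), rounded up
        m = math.ceil(1 + 3.3 * math.log10(n))
    h = r / m                                   # step iv: uniform class width
    counts = [0] * m
    for y in data:
        # tally: index of the class containing y (the largest value joins the last class)
        k = min(int((y - smallest) / h), m - 1)
        counts[k] += 1
    return [(smallest + k * h, smallest + (k + 1) * h, f, f / n)
            for k, f in enumerate(counts)]

# Hypothetical raw heights of 20 students (illustrative values only)
heights = [32, 36, 38, 40, 40, 41, 41, 42, 43, 43,
           44, 44, 45, 45, 46, 46, 47, 48, 49, 51]
table = frequency_table(heights, m=5)
for lower, upper, f, rel in table:
    print(f"{lower:5.1f} - {upper:5.1f}  {f:2d}  {rel:.2f}")
```

Calling `frequency_table(heights)` without `m` uses the empirical formula instead, which gives six classes for n = 20.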
i. Histogram,
iv. Ogives or Cumulative Frequency curves
[Figure: bar chart by source of information. Horizontal axis: Source of Information
(1: Health Facilitators, 2: Health Workers, 3: Friends and Family, 4: Drama Groups,
5: Radio, 6: Other); vertical axis scaled from 0 to 60.]
3. Line plot
[Figure: line plot of Estimated Probability (vertical axis, 0.0 to 1.0) for Males and Females.]
4. Pie chart
(i) Bases on the horizontal axis, with centres at the class marks and lengths equal
to the class interval sizes.
(ii) Areas proportional to the class frequencies.
Table 2. Heights of 100 male students
Height (in)
Class interval    Class mark    Frequency
60 - 62           61            5
63 - 65           64            18
66 - 68           67            42
69 - 71           70            27
72 - 74           73            8
Instead of using frequencies as in Figure 1 to draw a histogram, one can also use
proportions (relative frequencies) as shown in Figure 2.
Note:
$$\text{Relative frequency} = \frac{\text{frequency}}{\text{total number of observations in the sample}}$$
The curve drawn over the histogram is the normal curve of the sample
Figure 2. Histogram of Heights of 100 male students.
Exercise: Construct a histogram and a frequency polygon for the diameter data
from the cherry trees presented in Table 3, where the data are listed in ascending
order of magnitude.
Hint: First construct a frequency table for the diameter data.
Table 3: Data on 31 cherry trees
Diameter (in) Height (ft) Volume (cu ft)
8.3 70 10.3
8.6 65 10.3
8.8 63 10.2
10.5 72 16.4
10.7 81 18.8
10.8 83 19.7
11.0 66 15.6
11.0 75 18.2
11.1 80 22.6
11.2 75 19.9
11.3 79 24.2
11.4 76 21.0
11.4 76 21.4
11.7 69 21.3
12.0 75 19.1
12.9 74 22.2
12.9 85 33.8
13.3 86 27.4
13.7 71 25.7
13.8 64 24.9
14.0 78 34.5
14.2 80 31.7
14.5 74 36.3
16.0 72 38.3
16.3 77 42.6
17.3 81 55.4
17.5 82 55.7
17.9 80 58.3
18.0 80 51.5
18.0 80 51.0
20.6 87 77
Source of the data in Table 3: Davidian, M. (1998) Experimental Statistics for Biological Sciences
Figure 4: Frequency curve of the frequency distribution of Table 2
Cumulative frequency curves (Ogives): The total frequency of all values less
than the upper class boundary of a given class interval is known as the cumulative
frequency up to and including that class interval. For instance, the cumulative
frequency up to and including the class interval 69-71 in Table 2 is 5 + 18 + 42 + 27 = 92.
Cumulative frequencies are of two types: less than and more than cumulative
frequencies. Consequently, there are two types of ogives: less than ogives and more than
ogives. A less than ogive is an increasing curve that slopes upward from left to right,
whereas a more than ogive is a decreasing curve that slopes downward from left to
right.
Table 4 is a less than cumulative frequency table and the corresponding curve is
presented in Figure 5.
Table 4: Less than cumulative frequency table
Height (in) No of Students
Less than 59.5 0
Less than 62.5 5
Less than 65.5 23
Less than 68.5 65
Less than 71.5 92
Less than 74.5 100
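As a rough illustration (not part of the original notes), the less than counts of Table 4 can be reproduced from the Table 2 frequencies by a running total. The variable names are chosen here for illustration, and the more than counts are taken at the lower boundary of each class, which is one common convention.

```python
from itertools import accumulate

# Class boundaries and frequencies taken from Table 2
lower_boundaries = [59.5, 62.5, 65.5, 68.5, 71.5]
upper_boundaries = [62.5, 65.5, 68.5, 71.5, 74.5]
frequencies = [5, 18, 42, 27, 8]

# "Less than" cumulative frequency at each upper class boundary
less_than = list(accumulate(frequencies))
total = less_than[-1]

# "More than" cumulative frequency at each lower class boundary
more_than = [total - below for below in [0] + less_than[:-1]]

for ub, c in zip(upper_boundaries, less_than):
    print(f"Less than {ub}: {c}")
for lb, c in zip(lower_boundaries, more_than):
    print(f"More than {lb}: {c}")
```

Plotting `less_than` against `upper_boundaries` gives the increasing ogive, and plotting `more_than` against `lower_boundaries` gives the decreasing one.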
Figure 6: More than cumulative frequency curve (ogive) of Table 5
Exercise
Draw a less than and more than ogive curves to the above frequency table.
2. Consider the following frequency distribution
We have seen that a histogram provides an overall visual impression of the character
of the data. From it we get a sense of the center of the data and how spread
out they are. However, it is desirable to summarise the notions of center and spread
quantitatively.
Objective: Describe a large body of quantitative data by a single value, which the
mind can grasp easily and quickly. Thus, by quantifying center and spread for a
sample we hope to get an idea of these same notions for the population from which
the sample was drawn.
Definition: An average is defined as a single value that is intended to represent the
distribution as a whole.
A typical value should have the values of the distribution clustered about it. That
is, it must have a central tendency.
Notation: The following notations are adopted.
n=size of sample
Y =the variable of interest
Y1 , Y2 , ..., Yn =observations on the variable for the sample.
Σ = symbol for summation
Types of Averages
The most commonly used averages are:
(i) Arithmetic mean or simply mean
(ii) Median
(iii) Mode
(iv) Geometric mean
(v) Harmonic Mean
The arithmetic mean: The arithmetic mean, or briefly the mean, of a set of n
numbers Y1 , Y2 , ..., Yn is denoted by Ȳ (read “Y bar”) and is defined as
$$\bar{Y} = \frac{Y_1 + Y_2 + \cdots + Y_n}{n} = \text{sample mean}$$
$$\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i \qquad (3)$$
That is, the sample mean is the average of the sample values. The notation “Ȳ”
is standard; the “bar” indicates the “averaging” operation being performed. The
population mean is the corresponding quantity for the population. We will use the
Greek symbol µ to denote the population mean. Thus, µ is the mean of all values
for the variable for the population.
Example 1
If 32, 51, 60, 48 and 46 are the marks of five students in MB 101 test, then the
variable “marks” is denoted by Y and the five values are denoted by Y1 , Y2 , Y3 , Y4
and Y5. The arithmetic mean is given by:
$$\bar{Y} = \frac{32 + 51 + 60 + 48 + 46}{5} = \frac{237}{5} = 47.4$$
Example 2
Consider the following tuition fees (in million Tsh.) charged by six different
universities in the country to complete a three-year degree programme:
10.3, 4.9, 8.9, 11.7, 6.3, 7.7. Compute the mean.
Solution:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{10.3 + 4.9 + 8.9 + 11.7 + 6.3 + 7.7}{6} = \frac{49.8}{6} = 8.30 \text{ million Tsh.}$$
Note: calculation of the arithmetic mean is based on all the observations in the set
of data, thus it is affected by extreme (largest and smallest) values in the data set.
That is, it is not a desirable measure when there are extreme values.
Exercise
Find the arithmetic mean of the numbers 4.1, 2.9, 4.1, 3.8, 2.9, 3.4
Exercise
Find the arithmetic mean of the numbers 23.4, 15.6, 22.1, 20.0, 26.7, 31.4, 18.9, 22.3.
The expression for Ȳ given in equation (3) above is suitable for individual
measurements (i.e. ungrouped data). If the numbers Y1 , Y2 , ..., Yk occur f1 , f2 , ..., fk
times, respectively (i.e. occur with frequencies f1 , f2 , ..., fk ), the arithmetic mean
is:
$$\bar{Y} = \frac{f_1 Y_1 + f_2 Y_2 + \cdots + f_k Y_k}{f_1 + f_2 + \cdots + f_k} = \frac{\sum_{i=1}^{k} f_i Y_i}{\sum_{i=1}^{k} f_i}$$
where k is the number of distinct values and $n = \sum f_i$ is the total frequency
(i.e. the total number of cases).
Example 1
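If 5, 8, 6 and 2 occur with frequencies 3, 2, 4, and 1, respectively, the arithmetic
mean is
$$\bar{Y} = \frac{3 \times 5 + 2 \times 8 + 4 \times 6 + 1 \times 2}{3 + 2 + 4 + 1} = \frac{57}{10} = 5.7$$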
Example 2
Find the arithmetic mean of the numbers 4.1, 2.9, 3.8 and 3.4 which occur with
frequencies 2, 2, 1, and 1, respectively.
Solution
Arithmetic mean is
$$\bar{X} = \frac{f_1 X_1 + f_2 X_2 + \cdots + f_k X_k}{f_1 + f_2 + \cdots + f_k}$$
$$= \frac{2 \times 4.1 + 2 \times 2.9 + 1 \times 3.8 + 1 \times 3.4}{2 + 2 + 1 + 1} = \frac{8.2 + 5.8 + 3.8 + 3.4}{6} = \frac{21.2}{6} = 3.533$$
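The same calculation can be checked with a few lines of Python. This is a minimal sketch, not part of the original notes; the function name `frequency_mean` is hypothetical. The same function also gives a weighted arithmetic mean if weights are passed in place of frequencies, since the two formulas have the same form.

```python
def frequency_mean(values, freqs):
    """Arithmetic mean of values that occur with the given frequencies."""
    total_frequency = sum(freqs)                          # n = sum of f_i
    weighted_sum = sum(f * y for y, f in zip(values, freqs))
    return weighted_sum / total_frequency

# Example 2: 4.1, 2.9, 3.8 and 3.4 with frequencies 2, 2, 1 and 1
example_mean = frequency_mean([4.1, 2.9, 3.8, 3.4], [2, 2, 1, 1])
print(round(example_mean, 3))
```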
Exercise 1
Suppose that a final examination score in MTH 106 is weighted 2.5 times as much
as a test and a student has a final examination score of 75 and test scores of 80 and
95. Find the mean score.
Exercise 2
One hundred students were administered a 40-item test concerning knowledge about
causes of mental illness in teenagers. The data obtained are summarized in the
following Table 1. Compute the arithmetic mean.
Table 1: Test scores concerning knowledge about causes of mental illness
Class interval Frequency
6-7 10
8-9 6
10-11 11
12-13 10
14-15 25
16-17 16
18-19 8
20-21 14
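$$\bar{Y} = \frac{\sum f_i Y_i}{\sum f_i}$$
On replacing Yi by A + Di we obtain
$$\bar{Y} = \frac{\sum f_i (D_i + A)}{\sum f_i} = \frac{\sum f_i D_i + A \sum f_i}{\sum f_i} = \frac{A \sum f_i}{\sum f_i} + \frac{\sum f_i D_i}{\sum f_i} = A + \frac{\sum f_i D_i}{n}, \quad \text{because } \sum f_i = n.$$
Hence
$$\bar{Y} = A + \frac{\sum f_i D_i}{n} \quad \text{or} \quad \bar{Y} = A + \bar{D}$$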
Example
Find the arithmetic mean of the following set of observations 19, 21, 32, 16, 17, 15,
22, 34, and 51 by using the short method. Choose A = 34.
Yi             19   21   32   16   17   15   22   34   51   Total
Di = Yi - A   -15  -13   -2  -18  -17  -19  -12    0   17   -79
Ȳ = A + (ΣDi)/n = 34 + (-79)/9 = 34 - 8.78 = 25.22
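The short method can be verified numerically. The sketch below is illustrative only (the function name `mean_by_assumed_mean` is hypothetical); it also shows that the result does not depend on which value of A is chosen.

```python
def mean_by_assumed_mean(values, a):
    """Short method: mean = A + (sum of deviations from A) / n."""
    deviations = [y - a for y in values]        # D_i = Y_i - A
    return a + sum(deviations) / len(values)

observations = [19, 21, 32, 16, 17, 15, 22, 34, 51]
short_method = mean_by_assumed_mean(observations, 34)   # A = 34, as in the example
direct = sum(observations) / len(observations)          # ordinary definition
print(round(short_method, 2), round(direct, 2))
```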
Selection of “A”
To choose A for raw (ungrouped data), you can form an array of the observations
and pick A as the value, which lies approximately on the centre. For grouped data
select A as the class mark of the class, which lies in the region of high frequency.
Specifically, A is taken as the class-mark of the class with the highest frequency in
a given frequency distribution.
The weighted arithmetic mean
Sometimes we associate with the numbers Y1 , Y2 , ..., Yn certain weighting factors
(or weights) w1 , w2 , ..., wn depending on the significance or importance attached to
the numbers. In this case the arithmetic mean,
$$\bar{Y} = \frac{w_1 Y_1 + w_2 Y_2 + \cdots + w_n Y_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum w_i Y_i}{\sum w_i},$$
is called the weighted arithmetic mean.
Example
If a final examination score in a course is weighted 3 times as much as a quiz and
a student has a final examination grade of 85 and quiz grades 70 and 90, the mean
grade is.
$$\bar{Y} = \frac{3 \times 85 + 1 \times 70 + 1 \times 90}{3 + 1 + 1} = \frac{415}{5} = 83.$$
Exercise
A sample of 5 households is selected with equal probability (epsem) from a list of
250 households. One adult is selected at random in each sampled household (HH).
The monthly income (yij ) and the level of education (zij =1, if secondary or higher;
=0 otherwise) of the jth sampled adult in the ith household are recorded. Let Mi
denote the number of adults in the household. Assume that the data obtained from
the single sampled adult for each household in the first-stage sample of households
are as given in the table below, where wi denotes the overall weight (inverse of overall
probability of selection) of a sampled adult.
Use the information given in the above table to estimate the following characteristics:
ii. Weighted and unweighted proportion of people with secondary or higher edu-
cation. Comment on the results.
iii. Weighted and unweighted total number of people with secondary or higher
education. Comment on the results.
iv. Weighted and unweighted mean monthly income of adults with secondary or
higher education. Comment on the results.
Example
In a factory, 120 workers get an average of $ 30 a day, 160 workers get $ 50 a day, 80
workers get $ 60 a day and 40 workers get $ 80 a day. Find the combined arithmetic
mean of all workers.
$$\bar{Y} = \frac{120 \times 30 + 160 \times 50 + 80 \times 60 + 40 \times 80}{120 + 160 + 80 + 40} = \frac{19600}{400} = 49$$
(ii) The sum of the squares of the deviations of a set of numbers Yi from any number
a is a minimum if and only if a = Ȳ, i.e. $\sum (Y_i - \bar{Y})^2$ is less than $\sum (Y_i - A)^2$,
where A is any other value. For a frequency distribution, $\sum f_i (Y_i - \bar{Y})^2$ is less than
$\sum f_i (Y_i - A)^2$.
Advantages
(i) It is rigidly defined
(ii) It is easy to calculate and understand
(iii) It is based on all the observations
(iv) It is suitable for further mathematical treatment
(v) Compared to the other averages, arithmetic mean is affected least by fluctuation
of sampling.
Disadvantages
(i) The severe drawback of arithmetic mean is that it is unduly affected by extreme
values of the data.
(ii) Arithmetic mean cannot be used in case of open-end classes.
(iii) Arithmetic mean cannot be obtained if a single observation is missing.
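For a given set of n observations arranged in order of their magnitude, the median
is defined as the value of the $\left(\frac{n+1}{2}\right)$th observation if n is odd, and as
$$\frac{1}{2}\left[\left(\frac{n}{2}\right)\text{th observation} + \left(\frac{n+2}{2}\right)\text{th observation}\right]$$
if n is even.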
Example 1 Consider the tuition fee data set and compute the median.
Solution
Ordered array of the data in ascending order of magnitude is:
4.9, 6.3, 7.7, 8.9, 10.3, 11.7
Since n (=6) is even then, by rule 2, the median is the mean of the two middle numbers
(7.7 and 8.9). That is,
$$\text{median} = \frac{7.7 + 8.9}{2} = \frac{16.6}{2} = 8.30 \text{ million Tsh.}$$
Example 2
Consider the data set 24.1, 22.6, 27.0, 19.8, 21.5, 23.7, 22.6 Compute the median.
Solution
Ordered array of the data in ascending order of magnitude is:
19.8, 21.5, 22.6, 22.6, 23.7, 24.1, 27.0
Since n (=7) is odd then, by rule 1, median is the middle number (22.6).
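The two rules can be written as a short Python function. This is a sketch for illustration only; the name `median_of` is hypothetical, and Python lists are indexed from 0, so the kth ordered observation is `ordered[k - 1]`.

```python
def median_of(values):
    """Median by the two rules: middle value if n is odd, mean of the two middle values if even."""
    ordered = sorted(values)                  # form the array in ascending order
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]      # the ((n+1)/2)th observation
    # mean of the (n/2)th and ((n+2)/2)th observations
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

fees_median = median_of([10.3, 4.9, 8.9, 11.7, 6.3, 7.7])           # Example 1 (n even)
second_median = median_of([24.1, 22.6, 27.0, 19.8, 21.5, 23.7, 22.6])  # Example 2 (n odd)
print(round(fees_median, 2), second_median)
```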
Exercise
Find the median of the following values:
(i) 3, 5, 6, 8, 11.
(ii) 3, 5, 6, 8
Exercise
Find the median of the following set of observations:
(i) 45, 38, -40, 27, -29, 40, 57, 56, -33.
(ii) 19, 32, 15, 22, 20, 17, 9, 12.
Exercise Find the median of the numbers 0, -4, -1, 0, -2, -1, 0
$$\tilde{X} = L_1 + \frac{h}{f_{\text{median}}}\left(\frac{n}{2} - c.f.\right)$$
where:
L1 is the lower class boundary of the median class
h is the size of the median class interval
n is the total number of observations
fmed. is the frequency of the median class
c.f . is the sum of frequencies of all classes lower than the median class.
Example
Consider the following frequency distribution Table 1 and compute the median.
Solution
$$\text{Median, } \tilde{X} = L_1 + \frac{h}{f_{\text{median}}}\left(\frac{n}{2} - c.f.\right)$$
From Table 2, the (n/2)th observation (100/2 = 50th) falls in the class interval 14-15.
Therefore, L1 = 13.5, h = 2, fmedian = 25, c.f. = 37. Substituting these values in the
formula above we obtain
$$\text{Median} = 13.5 + \frac{2}{25}(50 - 37) = 14.54 \text{ units}$$
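The grouped-data formula can be sketched in Python as follows. This is an illustration under stated assumptions rather than a definitive implementation: the function name `grouped_median` is hypothetical, all classes are assumed to have the same width h, and the input is the test-score distribution given earlier (classes 6-7 to 20-21), whose lower class boundaries are 5.5, 7.5, ..., 19.5.

```python
def grouped_median(lower_boundaries, freqs, h):
    """Median of grouped data: L1 + (h / f_median) * (n/2 - c.f.)."""
    n = sum(freqs)
    cf = 0                                   # c.f.: total frequency of the classes below
    for lower, f in zip(lower_boundaries, freqs):
        if f > 0 and cf + f >= n / 2:        # first class whose cumulative count reaches n/2
            return lower + (h / f) * (n / 2 - cf)
        cf += f
    raise ValueError("the frequency distribution is empty")

score_boundaries = [5.5, 7.5, 9.5, 11.5, 13.5, 15.5, 17.5, 19.5]
score_freqs = [10, 6, 11, 10, 25, 16, 8, 14]
score_median = grouped_median(score_boundaries, score_freqs, h=2)
print(round(score_median, 2))
```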
Exercise
Consider the following frequency distribution table and compute the median.
Advantages
(i) It is rigidly defined
(ii) It is easy to understand and calculate
(iii) It is not affected at all by extreme values. Hence, it is a better average than
arithmetic mean when extreme observations are present
(iv) Can be computed for a distribution with open-end classes.
(v) The values of median can be obtained graphically.
Disadvantages
(i) It is not based on all values
(ii) Not suitable for further mathematical treatment
(iii) Compared to mean, median is more affected by fluctuation of sampling
(iv) In case of ungrouped data rearrangement of the values in the order of magnitude
becomes necessary
Example 1
Consider the tuition fee data set and find the mode.
Solution
Ordered array of the data in ascending order of magnitude is: 4.9, 6.3, 7.7, 8.9, 10.3,
11.7
Since no tuition fee occurs more often than any other in the set, there is no mode.
Example 2
Consider the data set 24.1, 22.6, 27.0, 19.8, 21.5, 23.7, 22.6 Find the mode.
Solution
Ordered array of the data in ascending order of magnitude is:
19.8, 21.5, 22.6, 22.6, 23.7, 24.1, 27.0
Mode=22.6
Since there is only one mode, the data set is described as unimodal. A data set with
two modes is called bimodal, and one with more than two modes is described as multimodal.
Exercise
1. Find the mode of the numbers 0, -4, -1, 0, -2, -1, 0
$$\text{Mode } (\hat{X}) = L_1 + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times h \ (\text{or } c)$$
where:
L1 = lower class boundary of the modal class (i.e. the class with highest frequency).
∆1 =excess of modal frequency over frequency of next lower class.
∆2 =excess of modal frequency over frequency of next higher class.
c= size of the modal class interval.
Example
Consider the following frequency distribution table and compute the mode.
Solution
$$\text{Mode, } \hat{X} = L_1 + \frac{\Delta_1}{\Delta_1 + \Delta_2}\, h$$
From Table 2, modal class (class with the highest frequency) is 14-15. Therefore,
L1 = 13.5, ∆1 = 15 (25-10), ∆2 = 9 (25-16), h = 2. Substituting these values in the
formula above we obtain
$$\text{Mode} = 13.5 + \frac{15}{15 + 9} \times 2 = 13.5 + \frac{15}{24} \times 2 = 13.5 + 1.25 = 14.75 \text{ units}$$
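The calculation above can be reproduced with a short Python sketch. The function name `grouped_mode` is hypothetical, the classes are assumed to have equal width h, a missing neighbouring class is treated as having frequency 0, and ties for the highest frequency are resolved by taking the first such class; none of these conventions comes from the original notes.

```python
def grouped_mode(lower_boundaries, freqs, h):
    """Mode of grouped data: L1 + d1 / (d1 + d2) * h."""
    i = freqs.index(max(freqs))                              # modal class (first if tied)
    previous_f = freqs[i - 1] if i > 0 else 0
    next_f = freqs[i + 1] if i < len(freqs) - 1 else 0
    d1 = freqs[i] - previous_f                               # excess over next lower class
    d2 = freqs[i] - next_f                                   # excess over next higher class
    if d1 + d2 == 0:
        raise ValueError("modal class is not distinct from its neighbours")
    return lower_boundaries[i] + d1 / (d1 + d2) * h

# Test-score distribution used in the example (classes 6-7 to 20-21)
mode_boundaries = [5.5, 7.5, 9.5, 11.5, 13.5, 15.5, 17.5, 19.5]
mode_freqs = [10, 6, 11, 10, 25, 16, 8, 14]
score_mode = grouped_mode(mode_boundaries, mode_freqs, h=2)
print(score_mode)
```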
Exercise
Determine the modal value of the above table graphically and compare the value
obtained with that obtained in the example above using the formula.
$$\hat{X} = L_1 + \frac{\Delta_1}{\Delta_1 + \Delta_2}\, c$$
Advantages
(i) Mode is easy to understand and calculate
• In some cases it can be located merely by inspection
• The value of mode can be obtained graphically from the histogram
(ii) It can be calculated for frequency distributions with open-end classes.
Disadvantages
(i) Mode is not rigidly defined
(ii) It is not based on all the values of the data
(iii) Mode is not suitable for further mathematical treatment
(iv) As compared with mean, mode is affected to a greater extent by the fluctuations
of sampling.
i.e.
$$G = \sqrt[n]{Y_1 \cdot Y_2 \cdot Y_3 \cdots Y_n} \quad \text{or} \quad G = (Y_1 \cdot Y_2 \cdot Y_3 \cdots Y_n)^{1/n}$$
Example
Find the geometric mean of the numbers 2, 4, and 8.
Solution
$$G = (2 \times 4 \times 8)^{1/3} = \sqrt[3]{64} = 4$$
Question
The average in the sequence 2, 4, 8 is 4 and not 4.67, i.e., (2+4+8)/3. Why do you
think this is so?
i. Because 4<4.6
iv. No reason
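In case of grouped data, i.e. a frequency distribution with k classes, we consider the
class marks Y1 , Y2 , ..., Yk and their corresponding frequencies f1 , f2 , ..., fk , with
total frequency n = f1 + f2 + ... + fk. Then
$$G = \left(Y_1^{f_1} \cdot Y_2^{f_2} \cdots Y_k^{f_k}\right)^{1/n}$$
On taking logarithms we have
$$\log G = \frac{1}{n}\log\left(Y_1^{f_1} \cdot Y_2^{f_2} \cdots Y_k^{f_k}\right) = \frac{1}{n}\left[f_1 \log Y_1 + f_2 \log Y_2 + \cdots + f_k \log Y_k\right] = \frac{1}{n}\sum_{i=1}^{k} f_i \log Y_i$$
or
$$G = \text{antilog}\left(\frac{1}{n}\sum_{i=1}^{k} f_i \log Y_i\right).$$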
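The logarithmic form is also the convenient one for computation. The sketch below is illustrative only: the function name `grouped_geometric_mean` is hypothetical, base-10 logarithms are used so that the antilog is a power of 10, and all class marks are assumed to be positive.

```python
import math

def grouped_geometric_mean(class_marks, freqs):
    """G = antilog((1/n) * sum of f_i * log10(Y_i)), with n the total frequency."""
    if any(y <= 0 for y in class_marks):
        raise ValueError("the logarithm requires positive class marks")
    n = sum(freqs)
    log_g = sum(f * math.log10(y) for y, f in zip(class_marks, freqs)) / n
    return 10 ** log_g                        # antilog to base 10

# With every frequency equal to 1 this reduces to the ungrouped example 2, 4, 8
simple_g = grouped_geometric_mean([2, 4, 8], [1, 1, 1])
# A small made-up check: 2 occurring twice, 4 once and 8 once, so G = 128 ** (1/4)
repeated_g = grouped_geometric_mean([2, 4, 8], [2, 1, 1])
print(round(simple_g, 4), round(repeated_g, 4))
```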
Exercise
Find the geometric mean of the frequency distribution of Table 2. For what value of
Y will geometric mean be undefined?
Advantages
(i) Geometric mean is rigidly defined
(ii) It is based on all the observations
(iii) It is suitable for further mathematical treatment
e.g. If G1 and G2 are geometric means of two series of sizes n1 and n2 values
respectively then the geometric mean of the combined series of n1 + n2 observations
is given by:
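$$G = \left(G_1^{n_1} \times G_2^{n_2}\right)^{\frac{1}{n_1 + n_2}}$$
or
$$\log G = \frac{n_1 \log G_1 + n_2 \log G_2}{n_1 + n_2} \quad \text{or} \quad G = \text{antilog}\left[\frac{n_1 \log G_1 + n_2 \log G_2}{n_1 + n_2}\right]$$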
It is a proper average to measure the relative change. Geometric mean is a suitable
average to find the average percentage increase in population, production, and sales
over a period of time.
Disadvantages
(i) Because of its mathematical character, geometric mean is not easy for a
non-mathematical person to understand and calculate.
(ii) If any of the observations is zero, the geometric mean becomes zero, and if any of
the observations is negative, the geometric mean may become imaginary, regardless of the
magnitude of the other items.
$$\frac{1}{H} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{Y_i}$$
Example
Find the harmonic mean of the numbers 2, 4, and 8.
Solution
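$$\frac{1}{H} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{X_i}$$
$$\frac{1}{H} = \frac{1}{3}\left(\frac{1}{2} + \frac{1}{4} + \frac{1}{8}\right) = \frac{1}{3}\left(\frac{4 + 2 + 1}{8}\right) = \frac{1}{3} \cdot \frac{7}{8} = \frac{7}{24}$$
$$H = \frac{24}{7} = 3.43$$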
Exercise
Find the harmonic mean of the following frequency distribution
Class: 20-24 25-29 30-34 35-39 40-44 45-49 50-54
Freq: 11 18 32 37 21 47 13
Advantages
(i) It is rigidly defined
(ii) It is based on all the observations
(iii) It is suitable for further mathematical treatment
e.g. If H1 and H2 are harmonic means of n1 and n2 observations, then the harmonic
mean of the combined group of n1 + n2 observations is given by:
$$H = \frac{n_1 + n_2}{\dfrac{n_1}{H_1} + \dfrac{n_2}{H_2}}$$
Disadvantages
(i) It is not easy to understand and calculate.
(ii) Its value cannot be obtained if any of the observations is zero.
Example
The set 2, 4, and 8 has arithmetic mean 4.67, geometric mean 4, and harmonic mean
3.43.
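These three values can be checked with Python's standard `statistics` module (the functions `mean`, `geometric_mean` and `harmonic_mean` exist there; `geometric_mean` requires Python 3.8 or later). The snippet is an illustration, not part of the original notes. For positive numbers the three averages always satisfy harmonic mean ≤ geometric mean ≤ arithmetic mean, which the example also shows.

```python
from statistics import mean, geometric_mean, harmonic_mean

values = [2, 4, 8]
am = mean(values)              # (2 + 4 + 8) / 3
gm = geometric_mean(values)    # cube root of 2 * 4 * 8
hm = harmonic_mean(values)     # 3 / (1/2 + 1/4 + 1/8)

print(round(am, 2), round(gm, 2), round(hm, 2))
```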
Two data sets (or populations) may have the same mean, but may be “spread”
about this mean value very differently.
Definition
Dispersion, or Variation is the degree to which numerical data tend to spread about
an average value.
Types of Measures of Dispersion
-Absolute and Relative.
Limitation
Are not suitable for comparing the variability of two or more distributions, which
are not in the same units of measurements.
Range
Difference between the highest and the lowest values in the set. That is,
Range = highest value − lowest value.
Example
Consider the tuition fee data set presented earlier and compute the range.
Solution
Ordered array of the data in ascending order of magnitude is: 4.9, 6.3, 7.7, 8.9, 10.3,
11.7
Range=11.7-4.9=6.8 million Tsh.
Exercise
Find the range of the set of numbers 2, 3, 3, 5, 5, 5, 8, 10, 12
Advantages
(i) It requires very little calculation. Hence it is considered the easiest measure
of dispersion.
(ii) It is rigidly defined.
Disadvantages
(i) Is not based on all the values of the data
(ii) Range is very much affected by fluctuation of sampling, its value varies very
widely from sample to sample
(iii) It is not suitable for further mathematical treatment.
Mean deviation
The mean deviation (MD), or average deviation, of a set of n numbers is defined by:
$$\text{Mean deviation (MD)} = \frac{\sum_{i=1}^{n} |Y_i - \bar{Y}|}{n}$$
where Ȳ is the arithmetic mean of the numbers
and |Yi − Ȳ | is the absolute value of the deviation of Yi from Ȳ . For example, the
absolute value of –10 is |-10|=10, while that of 3 is |3|=3.
Example
Find the mean deviation of the set 4, 6, 8, 10, 12, 14 and 16.
Advantages
(i) Mean deviation is rigidly defined
(ii) As compared to standard deviation, it is easy to understand and calculate
(iii) Unlike range and quartile deviation, mean deviation is based on all the obser-
vations
(iv) Since mean deviation is based on the deviation about an average, it provides a
better measure of the comparison about the formation of different distributions.
(v) As compared with standard deviation it is less affected by extreme observations.
Disadvantages
(i) It is not suitable for further mathematical treatment.
(ii) It cannot be computed for distributions with open-end classes.
Note: The mean deviation about the median is the smallest; that is, it is no larger
than the mean deviation about any other value.
In symbols:
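$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
or, in the equivalent shortcut form,
$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}$$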
Example
Consider the tuition fee data set and compute the sample variance.
Solution
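The six sampled tuition fees are 10.3, 4.9, 8.9, 11.7, 6.3, 7.7, with mean 8.30. Hence
$$s^2 = \frac{(10.3 - 8.3)^2 + (4.9 - 8.3)^2 + (8.9 - 8.3)^2 + (11.7 - 8.3)^2 + (6.3 - 8.3)^2 + (7.7 - 8.3)^2}{6 - 1} = \frac{31.84}{5} = 6.368$$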
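$$s^2 = \frac{\sum_{i=1}^{k} f_i \left(X_i - \bar{X}\right)^2}{n - 1} \qquad (7)$$
where k is the number of classes and $n = \sum f_i$ is the total frequency.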
The corresponding shortcut formula is
$$s^2 = \frac{\sum_{i=1}^{k} f_i X_i^2 - \frac{\left(\sum_{i=1}^{k} f_i X_i\right)^2}{n}}{n - 1} \qquad (8)$$
Exercise
Find the sample variance of the frequency distribution of Table 2.
That is, $s = \sqrt{s^2}$ or
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \qquad (9)$$
Example
Consider the tuition fee data set and compute the sample standard deviation.
Solution
From above, sample variance, s2 = 6.368. Therefore, sample standard deviation,
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} = \sqrt{6.368} = 2.52 \text{ million Tsh.}$$
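The variance and standard deviation of the tuition fee data can be checked as follows. This is a minimal sketch, not part of the original notes; the function name `sample_variance` is hypothetical and uses the divisor n − 1, as in the definition of the sample variance.

```python
import math

def sample_variance(values):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

fees = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]      # tuition fees in million Tsh.
s2 = sample_variance(fees)                    # sample variance
s = math.sqrt(s2)                             # sample standard deviation
print(round(s2, 3), round(s, 2))
```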
Exercise
Find the sample standard deviation of the frequency distribution of Table 2.
Example
Find the mean deviation of the set 68, 67, 66, 63, and 61.
Solution
Arithmetic mean, X̄, of the given set of numbers is 65 (check). Therefore, the mean
deviation,
$$MD = \frac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n} = \frac{|68 - 65| + |67 - 65| + |66 - 65| + |63 - 65| + |61 - 65|}{5} = \frac{3 + 2 + 1 + 2 + 4}{5} = 2.4$$
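A short Python check of this example is given below as an illustration; the function name `mean_deviation` is hypothetical and the deviations are taken about the arithmetic mean, as in the definition used in these notes.

```python
def mean_deviation(values):
    """Mean of the absolute deviations from the arithmetic mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n

md = mean_deviation([68, 67, 66, 63, 61])
print(md)
```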
Exercise
Find the mean deviation of the set 4, 6, 8, 10, 12, 14 and 16.
$$MD = \frac{\sum_{i=1}^{k} f_i \left|X_i - \bar{X}\right|}{n} \qquad (12)$$
Exercise
Find the mean deviation of the frequency distribution of Table 2.
$$CV = \frac{s}{\bar{X}} \times 100\% \qquad (13)$$
It is particularly useful when comparing the variability of two or more sets of data
that are expressed in different units of measurements.
Example
Consider the tuition fee data set and compute the coefficient of variation (CV).
Solution
From above, sample mean (X̄) = 8.30 and sample standard deviation, s = 2.52. Therefore,
$$CV = \frac{s}{\bar{X}} \times 100\% = \frac{2.52}{8.30} \times 100\% = 30.4\%.$$
That is, the relative size of the average spread around the mean to the mean is 30.4%.
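The coefficient of variation can be computed directly from the raw data with the standard `statistics` module (`stdev` uses the n − 1 divisor). The snippet is only an illustration and the variable names are chosen here.

```python
from statistics import mean, stdev

tuition_fees = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]   # million Tsh.

# CV = (s / mean) * 100%
cv_percent = stdev(tuition_fees) / mean(tuition_fees) * 100
print(round(cv_percent, 1))
```

Because the CV is a ratio of two quantities in the same units, it has no units itself, which is what makes it usable for comparing data sets measured in different units.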
$$\bar{Y}_C = \frac{n_1 \bar{Y}_1 + n_2 \bar{Y}_2 + \cdots + n_k \bar{Y}_k}{n_1 + n_2 + \cdots + n_k} = \frac{\sum_{i=1}^{k} n_i \bar{Y}_i}{\sum_{i=1}^{k} n_i}$$
For the same k groups, we can compute the combined standard deviation (Sc ), if we
have information on their individual standard deviations, s1 , s2 , ..., sk , their means
Ȳ1 , Ȳ2 , ..., Ȳk , their group sizes, n1 , n2 , ..., nk , and their pooled mean Ȳc . Then,
$$S_c = \sqrt{\frac{\sum n_i \left(s_i^2 + d_i^2\right)}{\sum n_i}}, \quad \text{where } d_i = \bar{Y}_i - \bar{Y}_c$$
Example
Wages paid to workers in two companies A and B are summarized as follows:
Compute Ȳc and Sc . Deduce the variance and coefficient of variation (C.V.).
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} \qquad (14)$$
The population variance and standard deviation
The population variance σ 2 is given by
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \qquad (15)$$
The population standard deviation, σ is given by
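$$\sigma = \sqrt{\sigma^2}$$
or
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} \qquad (16)$$
$$CV = \frac{\sigma}{\mu} \times 100\% \qquad (17)$$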
Exercise
1. Calculate the mean, median, mode, range, variance and standard deviation of the
variable age given in the table below by hand. Compare your results with the SPSS
printout given for the same variable (age). Interpret your results. That is, state
what each quantity means with respect to the study.
OMT data
Age Weight Height
31 75 178
30 69 178
38 73 176
24 78 179
54 73 162
64 75 168
61 65 160
34 60 166
38 64 165
47 63 167
49 85 172
55 91 177
25 69 170
42 63 178
53 93 180
43 65 174
55 60 173
60 61 167
61 49 163
45 56 175
24 79 179
24 80 177
25 64 168
25 95 174
44 82 160
49 77 172
52 90 178
50 80 160
54 76 160
33 77 180
SPSS Printout of summary statistics for variable age
N (Valid) 30
N (Missing) 0
Mean 42.97
Standard Error of Mean 2.356
Median 44.50
Mode 24(a)
Standard Deviation 12.907
Variance 166.585
Range 40
Minimum 24
Maximum 64
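For readers without SPSS, the same summary can be obtained with Python's standard `statistics` module. This is a sketch for checking hand calculations, not the software used in the notes; the dictionary keys are chosen here for illustration, and `multimode` (Python 3.8 or later) lists every value that attains the highest frequency.

```python
import math
import statistics as st

# Variable "age" from the OMT data table
age = [31, 30, 38, 24, 54, 64, 61, 34, 38, 47, 49, 55, 25, 42, 53,
       43, 55, 60, 61, 45, 24, 24, 25, 25, 44, 49, 52, 50, 54, 33]

summary = {
    "n": len(age),
    "mean": st.mean(age),
    "se_mean": st.stdev(age) / math.sqrt(len(age)),   # standard error of the mean
    "median": st.median(age),
    "modes": st.multimode(age),                       # all most frequent values
    "sd": st.stdev(age),                              # divisor n - 1
    "variance": st.variance(age),                     # divisor n - 1
    "range": max(age) - min(age),
    "minimum": min(age),
    "maximum": max(age),
}
for name, value in summary.items():
    print(name, value)
```

Two ages tie for the highest frequency in this data set, so `multimode` returns both, whereas the printout reports a single mode.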
1.3.4.1 Introduction
Investigating the relationships between two (or more) variables is a problem that
arises in the biological and physical sciences, economics, industrial applications and
biomedical settings.
For instance, a business firm may wish to know how future sales of a given product
could be affected by its price, or in a classroom we may need to know how tests and
assignments scores can be used to predict final examinations scores, etc.
The problem of predicting the unknown value of one variable from the known
value of another variable is called regression. In this course we will restrict our
discussion to the simplest case: two variables whose relationship can reasonably be
assumed to be a straight line. However, the technique can be extended to several
explanatory variables (multiple regression analysis).
Practical problem
In most situations, the values we observe for Y (and sometimes for X) are not exact.
In particular, due to biological variation among experimental units and the sampling
of them, imprecision and/or inaccuracy of measuring devices, and so on, we may only
observe values of Y (and also possibly X) with some error. Thus, based on a sample
of (X, Y) pairs, our ability to see the relationship exactly is obscured by this error.
Scatter diagram
The scatter diagram is a graph that gives an indication of the mathematical form
that could represent the relationship between two variables. It is always advisable
to plot the data before analysis, to ensure that the model assumptions seem valid.
1.3.4.2 Simple linear regression model
Straight line model
It is often reasonable to suppose that the relationship between Y and X is a straight
line. We may write this as
Y = β0 + β1 X + ε (18)
where
β0 is the intercept, the value taken on at X=0
β1 is the slope, expresses the rate of change in Y , that is, β1 =change in Y brought
about by a change of one unit in X.
Issue
The problem is that we do not know β0 or β1. To get information on their values,
the typical experimental set-up is to choose values of Xi, i = 1, ..., n, and observe the
resulting responses Y1, ..., Yn, so that the data consist of pairs (Xi, Yi), i = 1, ..., n.
The data are then used to estimate β0 and β1, that is, to fit the model to the data.
If we think of our data (Xi, Yi), we may thus think of a model for Yi as follows:
Yi = β0 + β1 Xi + εi
Objective
For the simple linear regression model, fit the line to the data to serve as our “best”
characterisation of the relationship based on the available data. Although we will
work with the simple linear regression model, be aware that the methods we discuss
extend easily to more complex linear models.
1.3.4.3 Fitting a simple linear regression model: the method of least
squares
Having discussed the important conceptual issues involved in studying the
relationship between two variables, let us now describe the practical implementation.
We do this first for fitting a simple linear regression model:
Yi = β0 + β1 Xi + εi, i = 1, ..., n.
Goal
We wish to fit the above model by estimating the intercept and slope parameters β0
and β1 .
Assumptions
For the purpose of making inferences about the true values of intercept and slope,
making predictions, and so on, we make the following assumptions. These are often
reasonable assumptions.
i. The observations Y1, ..., Yn are independent, i.e., not related in any way. For
example, they are derived from different animals, subjects, etc. They might
also be measurements on the same subject, but taken far enough apart in time
that the value at one time is completely unrelated to that at another time.
ii. The observations Y1, ..., Yn have the same variance, σ². That is, regardless of
which Xi we consider, the variation in possible Yi values is the same.
The least squares estimation method consists of choosing, for any given set of
observations, the values of β0 and β1 that minimise the sum of squares of the
deviations εi. This has the same appeal as a sample variance: we ignore the signs of
the deviations but account for their magnitude.
That is, we minimise
$$\sum_{i=1}^{n} \{Y_i - (\beta_0 + \beta_1 X_i)\}^2$$
Our interest is to find the estimates of β0 and β1 that are the most plausible to
have generated the data. Hence, a natural way to think about this is to choose as
estimates the values β̂0 and β̂1 that make this measure of overall variation as small
as possible (that is, which minimize it).
If $\hat\beta_0$ and $\hat\beta_1$ are the estimates of $\beta_0$ and $\beta_1$, then
$\sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2$ is a minimum.
Estimation of regression parameters
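Consider $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$. The residuals are
$$e_i = Y_i - \hat{Y}_i = Y_i - (\hat\beta_0 + \hat\beta_1 X_i) = Y_i - \hat\beta_0 - \hat\beta_1 X_i$$
so that
$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2$$
Minimizing $\sum_{i=1}^{n} e_i^2$ amounts to finding the partial derivatives of
$\sum_{i=1}^{n} e_i^2$ with respect to $\hat\beta_0$ and $\hat\beta_1$.
The partial derivatives of $\sum_{i=1}^{n} e_i^2$ with respect to $\hat\beta_0$ and $\hat\beta_1$ are:
$$\frac{\partial}{\partial \hat\beta_0} \sum_{i=1}^{n} e_i^2 = \frac{\partial}{\partial \hat\beta_0} \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2 \qquad (19)$$
$$\frac{\partial}{\partial \hat\beta_1} \sum_{i=1}^{n} e_i^2 = \frac{\partial}{\partial \hat\beta_1} \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2 \qquad (20)$$
Simplifying equations (19) and (20) and equating the resulting equations to zero,
one can show that
$$\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X} \qquad (21)$$
$$\hat\beta_1 = \frac{n \sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n \sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2} \qquad (22)$$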
It can also be shown (Exercise) that
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$$
The “hat” on the fitted values Ŷi = β̂0 + β̂1 Xi emphasizes the fact that these values
are our “best guesses”. The Ŷi are often called the predicted values.
Example: Optical density data
The optical density Y of a solution measured at eight concentrations, X, of a chemical
was as follows:
Meter reading, Yi:            4   9   18   20   35   41   47   60
Concentration (µg/ml), Xi:    1   2    4    5    8   10   12   15
Note: A regression line always passes through the point (X̄, Ȳ).
(b) Computational Table
Xi      Yi      XiYi    Xi²
1       4       4       1
2       9       18      4
4       18      72      16
5       20      100     25
8       35      280     64
10      41      410     100
12      47      564     144
15      60      900     225
Sums over i = 1, ..., 8:
$\sum X_i = 57$, $\sum Y_i = 234$, $\sum X_i Y_i = 2348$, $\sum X_i^2 = 579$
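$$\hat\beta_1 = \frac{n \sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n \sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2} = \frac{8 \times 2348 - (57)(234)}{8 \times 579 - (57)^2} = \frac{5446}{1383} = 3.9378 \approx 3.94$$
$$\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X} = \frac{234}{8} - 3.9378 \times \frac{57}{8} = 1.193$$
Therefore the fitted line is
Ŷi = 1.193 + 3.94 Xi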
The slope of the line is 3.94; that is, for each unit increase in concentration (X),
optical density (Y) increases by 3.94 on average, and when X = 0 the fitted value of
Y is 1.193.
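The hand calculation above can be checked with a short script. The following is only a minimal sketch in Python: the function name `least_squares_fit` and the variable names are illustrative choices, not part of these notes. It applies the formula for β̂1 in equation (22) and β̂0 = Ȳ − β̂1 X̄ to the optical density data.

```python
def least_squares_fit(x, y):
    """Return (intercept, slope) of the least squares line.

    Hypothetical helper: uses the computational formula for the slope
    and intercept = mean(y) - slope * mean(x).
    """
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope


# Optical density data from the example
concentration = [1, 2, 4, 5, 8, 10, 12, 15]
meter_reading = [4, 9, 18, 20, 35, 41, 47, 60]

b0, b1 = least_squares_fit(concentration, meter_reading)
print(f"intercept = {b0:.3f}, slope = {b1:.2f}")
```

Rounded to the same number of decimals, the output should agree with the fitted line Ŷi = 1.193 + 3.94 Xi. Note that the intercept is computed from the unrounded slope.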
Roughly speaking, a good regression line is one that helps to explain or account
for a large proportion of the variability in Y.
Algebra and the fact that Ŷi = β̂0 + β̂1 Xi = Ȳ + β̂1 (Xi − X̄) may be used to show
that
$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \qquad (23)$$
The quantity on the left hand side is the Total Sum of Squares for the set of data.
For any set of data, we may always compute the Total SS as the sum of squared
deviations of the observations from the (overall) mean, and it serves as a measure of
the overall variation in the data.
Therefore, equation (23) represents a partition of our assessment of overall variation
in the data, Total SS, into two independent components.
• (Ŷi − Ȳ) is the deviation of the predicted value of the ith observation from the
overall mean. Thus, we may think of the sum of squares of these deviations as
measuring the variation in the observations that may be explained by the
regression line β̂0 + β̂1 Xi.
• (Yi − Ŷi) is the deviation between the predicted value for the ith observation (our
“best” guess for its mean) and the observation itself (that we observed). Hence, the
sum of squared deviations
$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$
measures any additional variation of the observations about the regression line, that
is, the inherent variation in the data at each Xi value that causes observations not to
lie on the line.
Thus, the overall variation in the data, as measured by Total SS, may be broken
down into two components that each characterise parts of the variation:
Regression SS = $\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$, which measures that portion of the
variability that may be explained by the regression relationship.
Error SS (also called Residual SS) = $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$, which measures the
inherent variability in the observations (e.g., experimental error).
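The proportion of the Total SS accounted for by the regression is the coefficient of
determination,
$$R^2 = \frac{\text{Regression SS}}{\text{Total SS}}$$
OR
$$R^2 = 1 - \frac{\text{Error SS}}{\text{Total SS}}$$
Important: R² is computed under the assumption that the simple linear regression
model is correct, i.e., that it is a good description of the underlying relationship
between Y and X. Thus, it does not test whether the relationship between X and Y
really is a straight line; it measures how much of the variation in Y the line accounts
for if the relationship is a straight line.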
Calculation of R2
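To calculate R² by hand, one can show that
$$R^2 = \frac{\hat\beta_1^2 S_{XX}}{S_{YY}}$$
where
$$S_{XX} = \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}$$
and
$$S_{YY} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - \frac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n}$$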
Example
Compute the coefficient of determination for the optical density data set.
Exercise
Show that R² = 0.996 (99.6%) and give an interpretation.
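As a check on the hand calculation, the formula R² = β̂1² SXX / SYY can be sketched in Python. The function name `coefficient_of_determination` is a hypothetical choice for illustration, and the slope is computed as SXY / SXX, which is the deviation form of β̂1.

```python
def coefficient_of_determination(x, y):
    """R^2 = slope^2 * S_XX / S_YY, using the computational forms of the sums."""
    n = len(x)
    s_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    s_yy = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    slope = s_xy / s_xx
    return slope ** 2 * s_xx / s_yy


# Optical density data from the regression example
concentration = [1, 2, 4, 5, 8, 10, 12, 15]
meter_reading = [4, 9, 18, 20, 35, 41, 47, 60]

r_squared = coefficient_of_determination(concentration, meter_reading)
print(f"R^2 = {r_squared:.3f}")
```

For these data the result should round to 0.996, so that about 99.6% of the variation in the meter readings is accounted for by the fitted line.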
1.3.5 Correlation analysis
Idea
Describe the “degree of association” between X and Y. Unlike regression, correlation
does not characterise the form of the relationship between the two variables. The
correlation coefficient ρXY is a measure of the degree of (linear) association between
the two random variables. ρXY satisfies −1 ≤ ρXY ≤ 1, with ρXY = 1 denoting a
“perfect” positive association, ρXY = −1 denoting a “perfect” negative association,
and ρXY = 0 denoting “no (linear) association”.
Here we discuss the Karl Pearson correlation coefficient.
Interpretation
It is important to understand what correlation does not measure. Investigators
sometimes confuse the value of the correlation coefficient with the slope of an apparent
underlying straight line relationship. Apart from sharing the same sign, the two are
different things: the magnitude of the correlation says nothing about how steep the
line is.
Estimation
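For a given set of data, ρXY is unknown. We may estimate it from a set of n pairs of
observations (Xi, Yi), i = 1, ..., n, by the sample correlation coefficient
$$r_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}} = \frac{S_{XY}}{\sqrt{S_{XX} S_{YY}}}$$
For hand calculation, one should use the computational forms of SXX, SXY and SYY,
where $S_{XY} = \sum_{i=1}^{n} X_i Y_i - \frac{\sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n}$.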
Example
The following summary statistics come from measurements on wing length (X) and tail
length (Y) for a sample of n = 12 birds:
$\sum X_i = 128.2$, $\sum X_i^2 = 1371.32$, $S_{XX} = 1.717$,
$\sum Y_i = 90.8$, $\sum Y_i^2 = 688.40$, $S_{YY} = 1.347$,
$\sum X_i Y_i = 971.37$, $S_{XY} = 1.323$
Thus, our estimate of ρXY is
$$r_{XY} = \frac{S_{XY}}{\sqrt{S_{XX} S_{YY}}} = \frac{1.323}{\sqrt{(1.717)(1.347)}} = 0.8704$$
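The same estimate can be reproduced from the raw sums. The sketch below is illustrative only: the function name `correlation_from_sums` and its argument names are assumptions made for this example. It uses the computational forms of SXX, SYY and SXY without rounding the intermediate values.

```python
import math


def correlation_from_sums(n, sum_x, sum_x2, sum_y, sum_y2, sum_xy):
    """Sample correlation coefficient from the six summary quantities."""
    s_xx = sum_x2 - sum_x ** 2 / n
    s_yy = sum_y2 - sum_y ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    return s_xy / math.sqrt(s_xx * s_yy)


# Summary statistics for the wing length / tail length example
r = correlation_from_sums(12, 128.2, 1371.32, 90.8, 688.40, 971.37)
print(f"r = {r:.4f}")
```

Carrying the unrounded values of SXX, SYY and SXY through the calculation gives a result close to 0.870; rounding them to three decimals first changes the fourth decimal place slightly.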
2 Chapter Two: Statistical inference
A point estimate consists of a single sample statistic that is used to estimate the
true value of a population parameter. Examples include the sample mean (X̄), which
is a point estimate of the population mean, µ, and the sample variance (s²), which
is a point estimate of the population variance, σ². For details of how to carry out the
estimation of these quantities, revisit our discussion of measures of central tendency,
specifically the sample mean, and measures of dispersion, specifically the sample
variance. Note that the sample mean (X̄) and sample variance (s²) are not
the only estimators of the population mean and population variance respectively. The
sample mean and variance possess most of the desirable qualities of an estimator as
discussed below.
$s^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2$ is a biased estimator of the population variance σ², while
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$ is an unbiased estimator.
ii. Minimum variance: What if we can identify two competing estimators that
are both unbiased? On what grounds might we prefer one over the other?
Since the aim is to use sample information to say something about the population
from which the sample was drawn, we would like our estimator to be as close
to the true population value as possible. That is, we would like it to have
small variance. This would mean that the possible values that the estimator could
take on (across all possible samples we might have ended up with) exhibit only small
variation.
• Ideally we would like to use an estimator that is unbiased and has the smallest
variance among all such candidates. Such an estimator is given the name
minimum variance unbiased estimator (MVUE).
• It turns out that, for normally distributed data Y , the estimators Ȳ (for µ) and
s2 (for σ 2 ) have this desired property.
It should be noted that unbiasedness and minimum variance are not the only
desirable qualities of an estimator. Sufficiency, consistency, efficiency and
robustness are also used to judge which estimator to use in a particular situation.
The value of µ we wish to estimate is a fixed (but unknown) quantity. Thus, our
probability statements intuitively should have something to do with the uncertainty
of trying to get an understanding of the fixed value of µ using the variable estimator
X̄.
With this in mind, consider the following probability statement. Let tn−1,α/2 be
the point such that the region to its right under the Student’s t density with (n−1)
degrees of freedom has area α/2, so that the remaining region has area (1−α/2).
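By symmetry, the region to the left of −tn−1,α/2 and the region to the right of
tn−1,α/2 under the Student’s t density each have area α/2, with (1−α) in the middle:
$$P\left(-t_{n-1,\alpha/2} \le \frac{\bar{X} - \mu}{s_{\bar{X}}} \le t_{n-1,\alpha/2}\right) = 1 - \alpha$$
or
$$P\left(\bar{X} - t_{n-1,\alpha/2} \times s_{\bar{X}} \le \mu \le \bar{X} + t_{n-1,\alpha/2} \times s_{\bar{X}}\right) = 1 - \alpha$$
where $s_{\bar{X}} = s/\sqrt{n}$ is the estimated standard error of X̄.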
Definition
The interval
$$\left(\bar{X} - t_{n-1,\alpha/2} \times s_{\bar{X}},\ \bar{X} + t_{n-1,\alpha/2} \times s_{\bar{X}}\right) \qquad (25)$$
is called a (1−α)100% confidence interval for µ. For example, if α = 0.05, then (1−α)
= 0.95, and the interval would be called a 95% confidence interval. In general,
the value (1−α) is called the confidence coefficient. We will often abbreviate
confidence interval by “CI”.
The endpoints of the interval, $\bar{X} - t_{n-1,\alpha/2} \times s_{\bar{X}}$ and $\bar{X} + t_{n-1,\alpha/2} \times s_{\bar{X}}$, are called
the lower and upper confidence limits.
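To construct a confidence interval for the mean of a normal population when σ² is
known, we return to the probability statement
$$P\left(\bar{X} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$$
Thus, the (1−α)100% confidence interval for µ is
$$\bar{X} - z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}} \qquad (26)$$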
Interpretation of CI
In general, a (1−α)100% confidence interval can be interpreted to mean that if all
possible samples of the same size n were taken, (1−α)100% of them would include the
true population mean somewhere within the interval around their sample means,
and only 100α% of them would not.
Example
The inspection division of the Tanzania Bureau of Standards Weights and Measures
Department is interested in estimating the actual amount of soft drink that is placed
in 2-litre bottles at the local bottling plant of a large nationally known soft drink
company in Tanzania. The bottling plant has informed the inspection division/team
that the standard deviation for 2-litre bottles is 0.05 litre. A random sample of 100
2-litre bottles obtained from this bottling plant indicates a sample average of 1.99
litres.
i. Set up a 95% confidence interval estimate of the true average amount of soft
drink in each bottle
ii. Does the population of soft drink fill have to be normally distributed here?
Explain
iii. Explain why an observed value of 2.02 litres would not be unusual, even though
it is outside the confidence interval you calculated
iv. Suppose that the sample average changed to 1.97 litres. What would be your
answer to (i)?
Table 2.2a shows the calculation of the required confidence interval. As shown in the
table, the confidence limits (lower and upper) are 1.9802 and 1.9998 respectively, or
1.9802 ≤ µ ≤ 1.9998. This interval includes the sample mean (1.99 litres) of soft
drink placed in 2-litre bottles at the local bottling plant of the company. We are 95%
confident that the true population mean of soft drink in 2-litre bottles is between
1.9802 and 1.9998 litres.
Table 2.2a: Calculation of confidence interval estimate
Given: Mean = 1.99, Z = 1.96 (for 95% CI), σ = 0.05, and n = 100
CI: $\bar{X} \pm Z \times \frac{\sigma}{\sqrt{n}} = 1.99 \pm 1.96 \times \frac{0.05}{\sqrt{100}} = 1.99 \pm 0.0098$
or
(1.9802, 1.9998)
As regards part (ii), the answer is no. Because σ is known and n = 100 is large, by
the central limit theorem we may assume that X̄ is approximately normally
distributed. Therefore, the assumption of normality of the population is not required.
Concerning part (iii), we argue that the observed value of 2.02 litres should not be
considered unusual even though it is outside the confidence interval calculated in
part (i), because the CI is an estimate of the population mean, based on the average
of a sample of 100 2-litre bottles, and not a range for an individual value. Moreover,
the individual value of 2.02 is only 0.6 standard deviations above the sample mean.
When the sample average changes to 1.97 litres, the answer to part (i) would be as
shown in Table 2.2b. We see that the interval 1.9602 ≤ µ ≤ 1.9798
also includes the sample mean (1.97 litres) of soft drink placed in
2-litre bottles at the local bottling plant of the company. Therefore, we are 95%
confident that the true population mean of soft drink in 2-litre bottles is between
1.9602 and 1.9798 litres.
Table 2.2b: Calculation of confidence interval estimate
Given: Mean = 1.97, Z = 1.96 (for 95% CI), σ = 0.05, and n = 100
CI: $\bar{X} \pm Z \times \frac{\sigma}{\sqrt{n}} = 1.97 \pm 1.96 \times \frac{0.05}{\sqrt{100}} = 1.97 \pm 0.0098$
or
(1.9602, 1.9798)
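The calculations in Tables 2.2a and 2.2b follow the same recipe, which can be sketched as a small Python function. This is only an illustration under stated assumptions: the function name `z_confidence_interval` is hypothetical, and the critical value is taken from the standard library's `statistics.NormalDist` instead of a printed table, so z is about 1.95996 and not exactly 1.96.

```python
import math
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for a mean when the population sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # upper alpha/2 point of N(0, 1)
    margin = z * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Soft drink example: sigma = 0.05 litre, n = 100 bottles
lower_a, upper_a = z_confidence_interval(1.99, 0.05, 100)
lower_b, upper_b = z_confidence_interval(1.97, 0.05, 100)
print(f"({lower_a:.4f}, {upper_a:.4f})")
print(f"({lower_b:.4f}, {upper_b:.4f})")
```

To four decimals the two intervals should match the tables, (1.9802, 1.9998) and (1.9602, 1.9798). Passing a different `confidence` value, for example 0.90, gives the corresponding interval without consulting tables.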
Exercise
1. A market researcher states that she has 95% confidence that the true average
monthly sales of a product will be between TZS 1,700,000 and TZS 2,000,000.
Explain the meaning of this statement.
2. The manager of the National Microfinance Bank (NMB PLC) Morogoro branch
wants to estimate the mean waiting time of customers who visit the branch for cash
deposit or withdrawal during the end-of-month period. To achieve his goal, the manager
selects a random sample of 30 customers, and the results indicate a sample average
waiting time of 4,750 seconds and a standard deviation of 1,200 seconds.
i. Set up a 95% confidence interval estimate of the true average waiting
time of customers at this branch
ii. If an individual customer waited for 4,000 seconds, would this be considered
unusual? Explain your answer
3. The quality control manager at a light bulb factory situated along Pugu road
needs to estimate the average life of a large shipment of light bulbs. The process
standard deviation is known to be 100 hours. A random sample of 50 light bulbs
indicates a sample average life of 350 hours.
i. Set up a 95% confidence interval estimate of the true average life of light
bulbs in this shipment
ii. Does the population of light bulb life have to be normally distributed here?
Explain
iii. Explain why an observed value of 320 hours would not be unusual, even though
it is outside the confidence interval you calculated
iv. Suppose that the process standard deviation changed to 80 hours. What would
be your answer in (i)?
4. Suppose that the manager of a paint supply store wants to estimate the actual
amount of paint contained in 1-gallon cans purchased from Coral Paint company. It
is known from the manufacturer’s specifications that the standard deviation of the
amount of paint is equal to 0.02 gallon. A random sample of 50 cans is selected, and
the average amount of paint per 1-gallon can is 0.995 gallon.
ii. Based on your results, do you think that the store owner has a right to complain
to the manufacturer? Why?
iii. Does the population amount of paint per can have to be normally distributed
here? Explain
iv. Explain why an observed value of 0.98 gallon for an individual can would not
be unusual, even though it is outside the confidence interval you calculated
v. Suppose that you used a 95% CI estimate. What would be your answer to (i)
and (ii)?
5. The customer service department of TANESCO would like to estimate the average
length of time between entry of service request and the connection of the service. A
random sample of 15 houses connected to LUKE meters is selected from the records
available during the past year. The results recorded in number of days are as follows:
114, 78, 96, 137, 78, 117, 126, 86, 99, 114, 72, 104, 73, 86.
ii. What assumption about the population distribution must be made in (i)?
iii. Suppose the last value was 286 days instead of 86. What would be your answer
to (i)? What effect does this change have on the confidence interval?
2.1.2.3 Confidence interval for a difference of population means (µ1 − µ2)
Rarely in real life is our interest confined to a single population. Rather, we are
usually interested in conducting experiments to compare populations. For example,
suppose we wish to compare the effects of two concentrations of a toxic agent on
weight loss in rats. We select a random sample of rats from the population of interest
and then randomly assign each rat to receive either concentration 1 or concentration
2. The variable of interest is X = weight loss of a rat.
Until the rats receive the treatments, we may assume them all to have arisen from a
common population for which X has some mean µ and variance σ². Because these
are continuous measurement data, it is reasonable to further assume that X ∼ N(µ, σ²).
Because of the nature of the data, it is further reasonable to think about two random
variables X1 and X2, one corresponding to each population, and to think of them
as being normally distributed:
Population 1: X1 ∼ N(µ1, σ1²)
Population 2: X2 ∼ N(µ2, σ2²)
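Facts: It may be shown mathematically that if both of the random variables are
normally distributed and independent, then the following facts are true:
The random variable (X1 − X2) satisfies
$$(X_1 - X_2) \sim N(\mu_1 - \mu_2, \sigma_D^2), \quad \text{where } \sigma_D^2 = \sigma_1^2 + \sigma_2^2$$
The same holds for the difference of the sample means, $\bar{D} = \bar{X}_1 - \bar{X}_2$, based on
samples of sizes n1 and n2. That is,
$$\bar{D} \sim N(\mu_1 - \mu_2, \sigma_{\bar{D}}^2), \quad \sigma_{\bar{D}}^2 = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$$
so that
$$\frac{\bar{D} - (\mu_1 - \mu_2)}{\sigma_{\bar{D}}}$$
follows a standard normal distribution.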
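In practical situations σ1² and σ2² will be unknown. The obvious strategy would be
to replace them by estimates.
Two cases:
Under these conditions (a common variance σ1² = σ2² = σ² and equal sample sizes
n1 = n2 = n) it can be shown that a “pooled” estimate of the common σ² is given by
$$s^2 = \frac{(n-1)s_1^2 + (n-1)s_2^2}{2(n-1)}$$
Because the two sample sizes are equal, an obvious estimator for $\sigma_{\bar{D}}^2$ is
$$s_{\bar{D}}^2 = \frac{2 s^2}{n}$$
It may be shown that the statistic
$$\frac{\bar{D} - (\mu_1 - \mu_2)}{s_{\bar{D}}}$$
has a Student’s t distribution with 2(n−1) degrees of freedom.
By the same reasoning as in the single-population case, the confidence interval for
(µ1 − µ2) is thus
$$\bar{D} \pm t_{2(n-1),\alpha/2} \times s_{\bar{D}}$$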
Example
The following data concern two types of rations, A and B, being fed to pigs. An
experiment was conducted in which 12 randomly selected pigs were fed ration A and
12 were fed ration B with the goal of determining whether there is a difference in
the weight gains (lbs) for pigs fed the two different rations.
A: 31 34 29 26 32 35 38 34 30 29 32 31
B: 26 24 28 29 30 29 32 26 31 29 32 28
Assume the normality assumption is reasonable; find a 95% confidence interval for
the difference in means (µ1 − µ2 )
Table 2.3 shows the calculation of the required confidence interval. As shown in the
table, the confidence limits (lower and upper) are 0.6689 and 5.4978 respectively, or
0.6689 ≤ µ1 − µ2 ≤ 5.4978. This interval includes the difference in means (3.0833
lbs) between ration A and ration B given to the pigs. We are 95% confident that
the true difference in population mean weight gain (lbs) for pigs fed the two
different rations is between 0.6689 and 5.4978 lbs. Note that the calculations in
Table 2.3 can be obtained in SPSS as shown in Table A of the Appendix.
Here n1 = n2 = n = 12. The usual calculations give X̄1 = 31.75, X̄2 = 28.6667, so
that D̄ = X̄1 − X̄2 = 3.0833.
Also (n−1) = 11, so that
$$s^2 = \frac{(n-1)s_1^2 + (n-1)s_2^2}{2(n-1)} = 8.1314 \text{ (Check!)}, \qquad s_{\bar{D}} = \sqrt{\frac{2 s^2}{n}} = 1.1641$$
With t22,0.025 = 2.074 from the t table, the limits are 3.0833 ± 2.074 × 1.1641.
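The steps for the ration example can be sketched in Python as follows. This is a minimal illustration and not the SPSS procedure mentioned above: the function name `pooled_t_interval` is hypothetical, the sketch assumes equal sample sizes and a common variance, and the critical value t = 2.074 (22 degrees of freedom, upper 2.5% point) is supplied from a t table because the Python standard library has no t quantile function.

```python
import math
from statistics import mean, variance


def pooled_t_interval(sample1, sample2, t_crit):
    """CI for mu1 - mu2, assuming equal sample sizes and a common variance.

    t_crit is the table value t_{2(n-1), alpha/2}, supplied by the caller.
    """
    n = len(sample1)
    if len(sample2) != n:
        raise ValueError("this sketch assumes equal sample sizes")
    d_bar = mean(sample1) - mean(sample2)
    # statistics.variance uses the (n - 1) divisor, i.e. the unbiased s^2
    s2_pooled = ((n - 1) * variance(sample1) + (n - 1) * variance(sample2)) / (2 * (n - 1))
    s_d_bar = math.sqrt(2 * s2_pooled / n)
    return d_bar - t_crit * s_d_bar, d_bar + t_crit * s_d_bar


ration_a = [31, 34, 29, 26, 32, 35, 38, 34, 30, 29, 32, 31]
ration_b = [26, 24, 28, 29, 30, 29, 32, 26, 31, 29, 32, 28]

lower, upper = pooled_t_interval(ration_a, ration_b, t_crit=2.074)
print(f"({lower:.4f}, {upper:.4f})")
```

Because the script carries unrounded sample variances through the calculation, its pooled variance (about 8.133) and limits can differ from the values 8.1314, 0.6689 and 5.4978 quoted above in the third or fourth decimal place.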
Exercise
1. For the same problem, construct a 90% CI and comment on the results by
comparing with the 95% CI obtained above.
2. Write expressions for confidence limits (lower and upper) for the difference
between two population means (µ1 − µ2) in case σ1² = σ2² (but known) and n1 = n2
2.2.1 Introduction
We have already discussed the notion of random samples and the different methods
that are commonly used in drawing them from the population of interest. The word
“random” implies that chance is involved.
Let us describe the idea of probability in terms of what is probably the simplest
situation where chance is involved: the flip of a coin! We develop the terminology
and properties first from this simple situation, and then extend the ideas behind
them to real situations.
2.2.2.2 Mutually exclusive events
Outcomes do not overlap, i.e., events that cannot happen together.
$$P(S) = \sum_{i} P(O_i) = 1$$
If P(E) is the probability of occurrence of E, then P(Ē) = 1 − P(E) is the probability
of non-occurrence of E.
Exercise
1. Discuss and criticize the following:
P(A) = 2/3, P(B) = 1/4, P(C) = 1/6, for the probabilities of three mutually exclusive
events A, B, and C.
2. Let A and B be two events associated with an experiment. Suppose that P(A) =
0.4 and P(A ∪ B) = 0.7. Let P(B) = p. For what choice of p are A and B mutually
exclusive?
$$= \sum_{i=1}^{n} P(A_i)$$
Examples see exercise on probability
Example
Two identical urns each contain 10 balls. Urn U contains 6 white and 4 black balls, and
urn V contains 2 white and 8 black balls. Select an urn at random and, without
looking at its contents, guess whether you have urn U. At this moment, you have
P(U) = P(V) = 1/2. Now, you may gather additional information by drawing one
ball at random from the urn and looking at its colour. Assume that the ball is white.
What is your best guess now? Hint: use Bayes’ rule.
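The hint can be worked through numerically. The sketch below is an illustration only: the helper name `posterior` and the dictionary layout are assumptions made for this example. It applies Bayes’ rule with exact fractions, taking the prior P(U) = P(V) = 1/2 and the probabilities of drawing a white ball, 6/10 from urn U and 2/10 from urn V.

```python
from fractions import Fraction


def posterior(prior, likelihood):
    """Bayes' rule: P(H | data) = P(H) P(data | H) / sum over all H."""
    joint = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(joint.values())  # P(data), by the law of total probability
    return {h: joint[h] / total for h in joint}


prior = {"U": Fraction(1, 2), "V": Fraction(1, 2)}
p_white_given_urn = {"U": Fraction(6, 10), "V": Fraction(2, 10)}

post = posterior(prior, p_white_given_urn)
print(post["U"], post["V"])  # prints: 3/4 1/4
```

After seeing a white ball, the probability that the urn is U rises from 1/2 to 3/4, so U is the better guess.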
Random variables
Statistical methods for analyzing data have their foundations in probability. Samples
are chosen at random meaning that an element of chance is introduced. Thus,
the observations are best viewed as random, that is, they are subject to chance.
We therefore name the variable of interest a random variable. Data are the
observations on the random variable.
Notation
X=random variable (often abbr. as r.v). X1 , X2 , . . . , Xn are observations on X.
Results
Events of interest may be formulated in terms of random variables. For example,
when tossing a coin twice, let X = number of heads (H) in the 2 coin tosses.
Notation
We denote the probability function of a discrete random variable by f(x) or p(x).
If f(x) represents the probability that X = x, i.e., P[X = x], then the following must
be satisfied:
(i) 0 ≤ f(x) ≤ 1 for all x
(ii) $\sum_{i} f(x_i) = 1$
Example: Refer to the previous example.
X     f(x) = P[X = x]     F(x) = P[X ≤ x]
0     1/4                 1/4
1     2/4 = 1/2           3/4
2     1/4                 1
(i) $f(x) \ge 0$, (ii) $\int_{-\infty}^{\infty} f(x)\,dx = 1$, (iii) $P[a \le X \le b] = \int_{a}^{b} f(x)\,dx$
Example
Given the function
$$f(x) = \begin{cases} c(1 - x)^2, & 0 \le x \le 1 \\ 0, & \text{elsewhere} \end{cases}$$
Properties of expectations
2.2.5 Discrete distributions
Definition
A random variable X is said to follow a binomial distribution if it assumes only
non-negative values and its probability mass function is given by:
$$p(x) = P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad x = 0, 1, \ldots, n, \quad q = 1 - p$$
and p(x) = 0 elsewhere.
Notation
X ∼ B(n, p) denotes that the random variable X follows the binomial distribution with
parameters n and p.
Note:
1. A Bernoulli trial is an experiment with only two possible outcomes, a success (s)
and a failure (f). Its p.m.f. is given by $f(x) = p^x (1 - p)^{1 - x}$, x = 0, 1 (0 < p < 1).
2. $\binom{n}{x} = {}^{n}C_{x} = \frac{n!}{x!(n - x)!}$
For example, tossing a coin, throwing dice, or drawing cards from a pack of cards
with replacement lead to the binomial probability distribution.
Properties of the binomial distribution
If X has a binomial distribution with parameters n and p, then E[X] = np and
var[X] = npq = np(1 − p).
Hint: $(a + b)^n = \sum_{x=0}^{n} \binom{n}{x} a^{n - x} b^x$
Examples
1. An investor buys five single residence dwellings as an investment. He assumes
that the probability he will make a profit on each is 0.8. Assuming independence;
2. A manufacturer of small parts ships his parts in lots of size 20 to his customer.
Assume that each part is or is not defective and that the probability an individual
part is defective is 0.05.
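To make the binomial formulas concrete, the probability mass function can be tabulated in Python. This is a sketch under stated assumptions: the function name `binomial_pmf` is hypothetical, and since the questions attached to the examples are not given here, the quantities computed (the full p.m.f., its mean and variance for n = 5, p = 0.8, and the chance of no defective part in a lot of 20 with p = 0.05) are chosen for illustration.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


# Investor example: n = 5 dwellings, p = 0.8 chance of profit on each
n, p = 5, 0.8
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
mean = sum(x * px for x, px in enumerate(pmf))
var = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))
print(f"E[X] = {mean:.4f}, var[X] = {var:.4f}")

# Manufacturer example: lot of 20 parts, each defective with probability 0.05
p_no_defective = binomial_pmf(0, 20, 0.05)
print(f"P(no defective part in a lot) = {p_no_defective:.4f}")
```

Up to floating-point error, the mean and variance computed from the p.m.f. should agree with np = 4 and npq = 0.8.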
$$p(x, \lambda) = P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}, \quad x = 0, 1, 2, \ldots$$
λ is the parameter of the distribution
Notation
X ∼ P(λ) denotes that the random variable X follows the Poisson distribution with
parameter λ.
Properties of the Poisson distribution
If a r.v. X follows the Poisson distribution with parameter λ, then E(X) = λ and
var(X) = λ
Examples
1. Let X be the number of typing errors a typist makes per typed page. Suppose the
typist makes an average of 3 errors per page. What is the density function of X? Find
E[X], var(X) and P[X ≥ 3]. What is the probability that there is no typing error in 2
typed pages?
2. A car hire firm has two cars, which it hires out day by day. The number of demands
for a car on each day is distributed as a Poisson variate with mean 1.5. Calculate the
proportion of days on which neither car is used.
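The Poisson probabilities asked for in these examples can be sketched in Python. The function name `poisson_pmf` is a hypothetical choice, and one modelling assumption is made that the text does not state: errors on different pages are taken to be independent, so the number of errors on 2 pages is Poisson with mean 2 × 3 = 6.

```python
import math


def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** x / math.factorial(x)


# Example 1: typing errors per page, lambda = 3
p_at_least_3 = 1 - sum(poisson_pmf(x, 3) for x in range(3))  # 1 - P(X <= 2)
# Assumption: independent pages, so errors on 2 pages ~ Poisson(6)
p_no_errors_two_pages = poisson_pmf(0, 2 * 3)

# Example 2: daily demand for cars, lambda = 1.5; neither car used means X = 0
p_no_car_used = poisson_pmf(0, 1.5)

print(f"P(X >= 3) = {p_at_least_3:.4f}")
print(f"P(no errors in 2 pages) = {p_no_errors_two_pages:.4f}")
print(f"P(neither car used) = {p_no_car_used:.4f}")
```

The complement rule is used for P[X ≥ 3] because the upper tail is an infinite sum, whereas P[X ≤ 2] has only three terms.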
Example
The joint probability distribution of X and Y is given by
$$f(x, y) = \frac{x + y}{30} \quad \text{for } x = 0, 1, 2, 3;\ y = 0, 1, 2$$
and f(x, y) = 0 otherwise.
Find (i) P[X ≤ 2, Y = 1], (ii) P[X + Y = 4]
Independence
Two random variables X and Y are said to be independent if and only if
$f_{X,Y}(x, y) = f_X(x) \times f_Y(y)$ for all x, y,
where $f_{X,Y}(x, y)$ is the joint density for X and Y, and $f_X(x)$ and $f_Y(y)$ are the
marginal densities for X and Y respectively.
Exercise
Suppose $f_X(x) = \frac{x + 1}{10}$, x = 0, 1, 2, 3, and $f_Y(y) = \frac{2y + 3}{15}$, y = 0, 1, 2
2.2.6 Continuous probability distribution
Recap
Continuous random variables
Many of the variables of interest in scientific investigations are continuous; that is,
they can take on any value in some interval. For example, suppose we obtain a sample
of n pigs and weigh them. Thus, the random variable of interest is
X = weight of a pig
and the data are X1, ..., Xn, the observed weights for our n pigs. X is a random variable
because the pigs were drawn at random from the population of all pigs. Furthermore,
the pigs do not weigh exactly the same; they exhibit random variation due to
biological and other factors.
Goal
Find a function like f(x) for a discrete r.v. that describes the probability of observing
a pig weighing x units.
This function would thus serve as a model describing the population of pig weights:
how they are distributed and how they vary.
Technical note
For a continuous r.v. we do not think about the probability that X is exactly equal
to some value x as we did for a discrete r.v. The probability of any single exact value
is zero and, in practice, the limited precision of measuring devices means we can never
observe an exact value. Thus, we instead speak of the probability that X falls into
an interval.
Definition
A random variable X is said to have a normal distribution with parameters µ and σ²
if its pdf is:
$$f(x; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}, \quad -\infty < x < \infty,\ -\infty < \mu < \infty,\ 0 < \sigma < \infty$$
Notation
We write X ∼ N(µ, σ²) to say that the r.v. X has this (normal) distribution. The
curve of the normal density is bell-shaped and symmetric about x = µ.
Properties
If a r.v. X has a normal distribution with parameters µ and σ², then E[X] = µ and
var(X) = σ²
The standard normal distribution
The probability density function f(x) for a normal distribution has a very
complicated form. Thus, it is not possible to easily evaluate probabilities for a normal
r.v. X the way we could for a Poisson or binomial. Luckily, however, these probabilities
are widely available in tables in the special case when µ = 0 and σ² = 1. For example,
see Table 4 on page 7 of statistical tables.
Situation
Suppose X ∼ N (µ, σ 2 ). We wish to calculate probabilities such as P (X ≤ x),
P (X ≥ x), P (x1 ≤ X ≤ x2 ) that is, probabilities of intervals associated with
values of X.
Technical note
When dealing with probabilities for any continuous r.v., we do not make the
distinction between strict inequalities like “<” and “>” and inequalities like “≤” and
“≥” for the reasons discussed above.
Definition
If X is normally distributed with mean µ and variance σ², then $Z = \frac{X - \mu}{\sigma}$ will also
be normally distributed, with mean zero and standard deviation 1. Hence, we call Z
a standard normal r.v., and we write Z ∼ N(0, 1)
Example
1. P (Z ≥ 1.23)
2. P (Z ≤ 1.23)
3. Find the value of z such that P (Z ≥ z) = 0.05
4. P (Z ≤ −1.23)
5. P(0.23 ≤ Z ≤ 1.45)
6. P(|Z| ≥ 0.89)
Exercise
1. P(|Z| ≤ 1.68)
2. Find the value of z such that:
(i) P(|Z| ≥ z) = 0.05
(ii) P(|Z| ≤ z) = 0.975
Examples
Suppose µ = 8 and variance σ² = 4, so that X ∼ N(8, 4). Find
(i) P(X ≥ 9.5), (ii) P(6 ≤ X ≤ 10). Exercise: P(|X| ≥ 8.6)
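Table look-ups of this kind can be cross-checked with Python's standard library. The sketch below is illustrative and makes one point explicit: `statistics.NormalDist` takes the standard deviation, so X ∼ N(8, 4) is entered with sigma = 2. The variable names are assumptions made for this example.

```python
from statistics import NormalDist

# X ~ N(8, 4): variance 4 means standard deviation 2
x_dist = NormalDist(mu=8, sigma=2)
p_x_at_least_9_5 = 1 - x_dist.cdf(9.5)               # P(X >= 9.5)
p_x_between_6_10 = x_dist.cdf(10) - x_dist.cdf(6)    # P(6 <= X <= 10)

# Standard normal examples
z_dist = NormalDist()                                # Z ~ N(0, 1)
p_z_at_least_1_23 = 1 - z_dist.cdf(1.23)             # P(Z >= 1.23)
z_upper_5_percent = z_dist.inv_cdf(0.95)             # z with P(Z >= z) = 0.05

print(f"P(X >= 9.5) = {p_x_at_least_9_5:.4f}")
print(f"P(6 <= X <= 10) = {p_x_between_6_10:.4f}")
print(f"P(Z >= 1.23) = {p_z_at_least_1_23:.4f}")
print(f"z with upper-tail area 0.05 = {z_upper_5_percent:.3f}")
```

The same answers come from standardising by hand: for example, P(X ≥ 9.5) = P(Z ≥ (9.5 − 8)/2) = P(Z ≥ 0.75), which is then read from the standard normal table.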
3 Chapter Three: Sampling distributions
3.1 Introduction
Since statistics are random variables, their values will vary from sample to sample,
and it is customary to refer to their distributions as sampling distributions.
Idea
Using these facts, transform events about X̄ into events about a standard normal r.v.
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
Example
X̄ is based on a sample of n = 25 observations on a random variable X ∼
N(6, 9). Here µ = 6 and σ = 3. Find P(X̄ ≥ 6.9)
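A short Python sketch of this example follows. The variable names are illustrative assumptions; the calculation standardises X̄ using the standard error σ/√n and then evaluates the upper tail of the standard normal distribution.

```python
import math
from statistics import NormalDist

mu, sigma, n = 6, 3, 25
standard_error = sigma / math.sqrt(n)        # sigma / sqrt(n) = 3 / 5 = 0.6
z = (6.9 - mu) / standard_error              # standardised value, about 1.5
p_mean_at_least_6_9 = 1 - NormalDist().cdf(z)

print(f"z = {z:.2f}, P(sample mean >= 6.9) = {p_mean_at_least_6_9:.4f}")
```

Note that the divisor is the standard error 0.6 and not σ = 3: the sample mean varies much less from sample to sample than a single observation does.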
Recap
Use statistics to estimate µ, the population mean, using the obvious estimator, X̄.
We would like to be able to make statements about “how likely” it is that X̄ would
take on certain values. We saw above that this involves appealing to the normal
distribution:
$$\bar{X} \sim N(\mu, \sigma_{\bar{X}}^2)$$
Practical problem
σ², and hence $\sigma_{\bar{X}}^2$, is not known. An obvious approach would be to replace $\sigma_{\bar{X}}$ in
our standard normal statistic $\frac{\bar{X} - \mu}{\sigma_{\bar{X}}}$ by the obvious estimator $s_{\bar{X}}$ and consider instead
the statistic
$$\frac{\bar{X} - \mu}{s_{\bar{X}}}$$
The probability distribution associated with the values taken on by this quantity is called
the Student’s t distribution with (n−1) degrees of freedom. (We discuss the
notion of degrees of freedom in a moment.)
Notation
We will write tv to denote a r.v. with the t distribution with v degrees of freedom.
Tables of probabilities associated with the values of a r.v. with the t distribution are
readily available; for example, Table 5 of Statistical Tables on page 6.
Note: The t distribution is centred at 0 and has the same symmetric, bell
shape as the normal, but its probabilities in the extreme “tails” are larger
than those of the normal distribution. Again, as always, the total area is equal to 1.
Note: In some cases, the values desired won’t be in the table. In this case, we
report a range within which the required value lies.
3.1.4 The
It is also known that if s1² and s2² are the variances of independent random samples
of size n1 and n2 from normal distributions with variances σ1² and σ2², then
$$F = \frac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2} = \frac{\sigma_2^2 s_1^2}{\sigma_1^2 s_2^2}$$
is a r.v. having an F distribution with n1 − 1 and n2 − 1 degrees of freedom.
Degrees of freedom
For the statistics
$$\frac{\bar{X} - \mu}{s_{\bar{X}}}, \qquad \frac{(n-1)s^2}{\sigma^2}, \qquad \text{and} \qquad \frac{\sigma_2^2 s_1^2}{\sigma_1^2 s_2^2},$$
which follow the t, χ2 and F distributions respectively, the notion of degrees of freedom has arisen. The probability
associated with each of these statistics depends on the sample size n through the
degrees of freedom value (n-1). What is the meaning of this?
Note that all these statistics depend on s2, and recall that
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2.$$
Recall also that it is always true that
$$\sum_{i=1}^{n} \left(X_i - \bar{X}\right) = 0.$$
Thus, if we know the values of (n-1) of the deviations in our sample, we may
always compute the last one because the deviations about X̄ of all n observations must
sum to zero. Thus, s2 must be thought of as being based on (n-1) “independent”
deviations: the final deviation can be obtained from the other (n-1).
The term degrees of freedom thus has to do with the fact that there are (n-1)
“free” or “independent” quantities upon which the r.v.s above are based.
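The zero-sum constraint on the deviations can be checked numerically. The following is a minimal sketch using a small made-up sample (the five values are hypothetical and chosen only for illustration); it shows that the last deviation is fully determined by the first n-1.

```python
# Hypothetical sample; any numbers would do.
sample = [4.0, 7.0, 5.0, 9.0, 10.0]
n = len(sample)
mean = sum(sample) / n

# Deviations of each observation about the sample mean.
deviations = [x - mean for x in sample]

# The deviations always sum to zero (up to rounding error).
total = sum(deviations)

# Hence the last deviation can be recovered from the other n-1.
last_from_others = -sum(deviations[:-1])

# Sample variance with the (n-1) divisor.
s2 = sum(d ** 2 for d in deviations) / (n - 1)

print(total, deviations[-1], last_from_others, s2)
```

Only n-1 of the deviations are free to vary, which is why the divisor in s2 is (n-1).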
4 Chapter Four: Hypothesis testing
4.1 Introduction
Often in real life we take observations on a sample with a specific question in mind.
Hypothesis testing is another way of analysing data. It begins with some theory, claim,
or assertion about a particular parameter of a population. For example, the branch
manager of CRDB bank may claim that the average waiting time for customers to
be served is 368 seconds, which the manager considers reasonable; at this value no
corrective measure is needed in the teller department of the bank. An average
below 368 seconds is considered unusual and calls for reallocating some tellers to
other branches within the country. In contrast, an average above 368 seconds is
considered intolerable for most customers of the bank and requires the
attention of the bank management, including increasing the number of tellers to meet
the recommended waiting time.
This problem can be translated into the language of statistical hypothesis testing.
Based on the collected data, one of the following two conclusions
will be reached regarding the manager’s claim:
i. The average waiting time is 368 seconds (µ = 368 seconds). No corrective measure is needed.
ii. The average waiting time is not 368 seconds; it is either less than 368 seconds
or more than 368 seconds. Corrective measure is needed.
Note: Statement (i) is the null hypothesis (H0 ) and statement (ii) is the alternative
hypothesis (H1 ). The null hypothesis is the hypothesis that is always tested. The
alternative hypothesis is set up as the opposite of the null hypothesis and represents
the conclusion supported if the null hypothesis is rejected.
4.1.1 Level of significance
The probability of committing a Type I error, denoted α, is called the level of
significance of the test.
The complement (1-α) of the probability of a Type I error is called the confidence
coefficient. It is the probability that the null hypothesis H0 is not rejected when
in fact it is true and should not be rejected.
The complement (1-β) of the probability of a Type II error, β, is called the power of a
statistical test. It is the probability of rejecting a null hypothesis H0 when in fact
it is false and should be rejected.
Summary
For a fixed sample size, as α decreases β increases, and vice versa. The solution is to
increase the sample size: for any given value of α, increasing the sample size n lowers β
and therefore raises the power of the test (1 − β). However, remember that resources are
always limited. In general, the choice of reasonable values for α and β depends on the
costs inherent in each type of error.
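The trade-off between α, β and n can be illustrated numerically for a two-sided Z test of H0 : µ = µ0 with known σ. The sketch below is only an illustration under stated assumptions: the function name `power_two_sided_z`, the true mean of 372 seconds and σ = 15 seconds are hypothetical and are not taken from the bank example; only µ0 = 368 comes from the text.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sided_z(mu0, mu_true, sigma, n, alpha):
    """Power (1 - beta) of a two-sided Z test with known sigma."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Mean of the test statistic when the true mean is mu_true.
    shift = (mu_true - mu0) * sqrt(n) / sigma
    # Probability that the statistic lands in either rejection tail.
    upper_tail = 1 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail


# Hypothetical inputs: mu_true = 372 and sigma = 15 are assumed values.
for n in (10, 25, 100):
    for alpha in (0.05, 0.01):
        power = power_two_sided_z(368, 372, 15, n, alpha)
        print(f"n={n:3d} alpha={alpha:.2f} power={power:.3f} beta={1 - power:.3f}")
```

For each n the smaller α gives the larger β, and for each α the power grows with n, which is the pattern described in the summary.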
Question
How do we decide which hypothesis between the null and the alternative is correct?
In testing a hypothesis, we partition the possible values of the test statistic into two
subsets: an acceptance region for H0 and a rejection region for H0.
Regions of rejection and nonrejection
In the process of deciding either of the two hypotheses, two kinds of errors (Type I
error and Type II error) are possible.
                        Actual situation
Statistical decision    H0 True                         H0 False
Do not reject H0        Confidence coefficient (1-α)    Type II error (β)
Reject H0               Type I error (α)                Power (1-β)
Type I error occurs when the null hypothesis (H0 ) is rejected when in fact it is
true and should not be rejected. On the other hand, a Type II error occurs when
the null hypothesis H0 is not rejected when in fact it is false and should be rejected.
H0 : θ = θ0 vs. H1 : θ ≠ θ0
For these hypotheses we reject H0 when θ̄, the point estimate of θ, is much larger or
much smaller than θ0 . Such a test is referred to as a two-tailed or two-sided test.
H0 : θ = θ0 vs. H1 : θ < θ0 , or H0 : θ = θ0 vs. H1 : θ > θ0
It would seem reasonable to reject H0 only when the point estimate θ̄ is much smaller
(left tail of the test statistic) or much larger (right tail of the test statistic) than θ0 ,
respectively. Such a test is called a one-tailed or one-sided test. Thus, a one-sided
test is any test where the critical region consists of only one tail of the sampling
distribution of the test statistic.
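The difference between the two kinds of critical region can be made concrete by computing the cut-off points. The sketch below assumes, purely for illustration, that the test statistic is standard normal under H0 and that α = 0.05; neither choice comes from the notes.

```python
from statistics import NormalDist

alpha = 0.05  # illustrative level of significance
std_normal = NormalDist()

# Two-sided test: alpha is split between the two tails.
z_two_sided = std_normal.inv_cdf(1 - alpha / 2)   # reject if |z| exceeds this

# One-sided test: all of alpha sits in a single tail.
z_one_sided = std_normal.inv_cdf(1 - alpha)       # reject if z exceeds this
                                                  # (or z is below its negative)

print(round(z_two_sided, 3), round(z_one_sided, 3))
```

The one-sided cut-off is closer to zero because the whole of α is placed in one tail of the sampling distribution.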
iii. Determine the value of the test statistic from the sample data
iv. Check whether the value of the test statistic falls into the critical region and,
accordingly, reject H0 , or accept it or reserve judgment!
Procedures
Hypotheses:
General form: H0 : µ = µ0
against
One-sided: H1 : µ > µ0 or H1 : µ < µ0
Two-sided: H1 : µ ≠ µ0
Level of significance: α
Test statistic:
$$t = \frac{\bar{X} - \mu_0}{s_{\bar{X}}}$$
Decision: Reject H0 if
One-sided: t > tn−1,α (for H1 : µ > µ0 ) or t < −tn−1,α (for H1 : µ < µ0 )
Two-sided: |t| > tn−1,α/2
Example 1
Suppose that it is known from experience that the standard deviation of the weight
of 8-kg packages of cookies made by a certain bakery is 0.16 kg. To check whether its
production is under control on a given day, that is, to check whether the true average
weight of the packages is 8 kg, employees select a random sample of 25 packages and
find that their mean weight is X̄ = 8.091 kg. Since the bakery stands to lose money
when µ > 8 and the customer loses out when µ < 8, test the hypotheses H0 : µ = 8 vs.
H1 : µ ≠ 8 at the 0.01 level of significance.
Solution
Procedure:
Hypotheses: H0 : µ = 8 vs. H1 : µ ≠ 8
Level of significance: α = 0.01
Test statistic:
$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$$
Decision: Reject the null hypothesis if Z < −Zα/2 or Z > Zα/2 , where Zα/2 = Z0.005 = 2.575.
Substituting X̄ = 8.091, µ0 = 8, σ = 0.16 and n = 25 into the test statistic we have
$$Z = \frac{8.091 - 8}{0.16/\sqrt{25}} = 2.84$$
Since Z = 2.84 exceeds 2.575, we reject the null hypothesis and conclude that the true
average weight of the packages differs from 8 kg.
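The arithmetic of Example 1 can be retraced in a few lines. This is a sketch that uses the values from the substitution step (X̄ = 8.091, µ0 = 8, σ = 0.16, n = 25, α = 0.01); the critical value is computed from the standard normal distribution rather than read from a table, so it may differ from the tabulated value in the third decimal place.

```python
from math import sqrt
from statistics import NormalDist

x_bar, mu0, sigma, n, alpha = 8.091, 8.0, 0.16, 25, 0.01

# Test statistic for a mean with known population standard deviation.
z = (x_bar - mu0) / (sigma / sqrt(n))

# Upper alpha/2 point of the standard normal distribution.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

# Two-sided decision rule: reject H0 if |Z| exceeds the critical value.
reject_h0 = abs(z) > z_crit

print(round(z, 2), round(z_crit, 3), reject_h0)
```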
Example 2
It is thought that the body temperature of intertidal crabs exposed to air is less than
the ambient temperature. Body temperatures were obtained from a random sample
of 8 such crabs exposed to an ambient temperature of 25.4 degrees Celsius. Assume
that body temperatures are approximately normally distributed.
Solution
Procedure
Hypotheses
Let µ be the mean body temperature for the population of intertidal crabs exposed
to an ambient temperature of 25.4 degrees Celsius. Then we wish to test
H0 : µ = 25.4 vs. H1 : µ < 25.4
$$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}\right] = 0.479821428 \quad \text{(check!)}$$
so that $s_{\bar{X}} = \sqrt{s^2/n} = 0.245$ and
$$t = \frac{\bar{X} - \mu_0}{s_{\bar{X}}} = -1.470.$$
The critical value is −tn−1,α = −t7,0.05 = −1.895.
The value of the test statistic, -1.470, is not less than the critical value, -1.895, so we
do not reject H0 at level of significance 0.05. There is not enough evidence in the
sample to suggest that the mean body temperature of intertidal crabs exposed to air
at 25.4 degrees Celsius is indeed less than 25.4 degrees Celsius.
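The decision in Example 2 can be retraced from the reported summary values alone. In this sketch the critical value 1.895 is typed in from the t table, the test statistic is the value reported in the example, and the helper name `lower_tail_t_decision` is an assumption made for illustration.

```python
from math import sqrt

n = 8
s2 = 0.479821428   # sample variance reported in the example
t_stat = -1.470    # test statistic reported in the example
t_crit = 1.895     # t_{7, 0.05}, read from the t table

# Standard error of the mean: s / sqrt(n).
se_mean = sqrt(s2 / n)


def lower_tail_t_decision(t, critical_value):
    """Return True if H0 is rejected in a lower-tailed test."""
    return t < -critical_value


print(round(se_mean, 3), lower_tail_t_decision(t_stat, t_crit))
```

The standard error agrees with the 0.245 used in the example, and the decision function returns False, matching the conclusion that H0 is not rejected at the 0.05 level.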
Exercise Using the same data test at α=0.1 and give your conclusion as a meaningful
sentence.
Exercise
1. Suppose that 100 cakes made by a certain fast food store lasted on average
14 days with a standard deviation of 2 days. Test the null hypothesis µ =10 days
against the alternative that µ < 10 days at the 0.05 level of significance. Present
your findings in a written report covering the abstract, introduction, literature re-
view, methodology, results and discussion, conclusion and recommendation(s) and
references cited.
2. A supplier of ARVs for relieving HIV’s victims claims that the mean half-life
for the respective doses is 2650 hours. A random sample of 25 ARVs doses taken
had a mean half-life of 2640 hours with a standard deviation of 10 hours. Test the
supplier’s claim at 1% level of significance. Present your findings in a written report
covering the abstract, introduction, literature review, methodology, results and dis-
cussion, conclusion and recommendation(s) and references cited.
i. Using the 0.01 level of significance, is there evidence that the population aver-
age is above TZS 300,000?
ii. Find the p-value of the test and interpret its meaning
i. Using the 0.01 level of significance, is there evidence that the average weight
of the boxes is different from 6.25 kg?
ii. Find the p-value of the test and interpret its meaning
iii. What will your answer in (a) be if the standard deviation is 0.05 kg?
iv. What will your answer in (a) be if the sample mean is 6.211 kg?
As we have discussed, a common situation in real life is one in which we would like
to compare two populations. For example, we may want to decide, based on sample
information, whether two competing treatments are significantly different, or to compare
a new treatment with a standard or control one; or we may want to decide, on the
basis of an appropriate sample survey, whether the average food expenditure of families
in one city exceeds that of families in another city by at least 2500 Tsh.
Suppose we have two independent random samples of sizes n1 and n2 from populations
having the means µ1 and µ2 and known variances $\sigma_1^2$ and $\sigma_2^2$. Suppose further
that we want to test the hypothesis
H0 : µ1 − µ2 = δ
against
H1 : µ1 − µ2 ≠ δ or µ1 − µ2 > δ or µ1 − µ2 < δ
Test statistic:
As in the case of constructing confidence intervals for µ1 − µ2 , intuition suggests that
we base our inference on X̄1 − X̄2 . The test statistic is
$$z = \frac{\bar{X}_1 - \bar{X}_2 - \delta}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$
For the above alternative hypotheses, the respective critical regions are
|z| > zα/2 , z > zα and z < −zα .
Note: When we deal with independent random samples from populations that may not
even be normal and whose variances are unknown, we can still use the above test statistic
with $s_1^2$ substituted for $\sigma_1^2$ and $s_2^2$ for $\sigma_2^2$, as long as both samples are large enough.
Solution
Procedure
Hypothesis
H0 : µ1 − µ2 = 0.20 vs. H1 : µ1 − µ2 ≠ 0.20
Substituting the data given into the formula for the test statistic we have
$$z = \frac{2.61 - 2.38 - 0.20}{\sqrt{\dfrac{(0.12)^2}{50} + \dfrac{(0.14)^2}{40}}} = 1.08$$
The critical value is zα/2 = z0.025 = 1.96. Since 1.08 does not exceed 1.96, we do not
reject the null hypothesis. This means that the difference between 2.61 − 2.38 = 0.23
and 0.20 is not significant.
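The substitution above can be checked with a short calculation. The sketch assumes that 0.12 and 0.14 are standard deviations that are squared inside the formula (this reading reproduces the reported value 1.08) and that α = 0.05, as implied by the critical value z0.025.

```python
from math import sqrt
from statistics import NormalDist

x1_bar, x2_bar, delta = 2.61, 2.38, 0.20
s1, s2 = 0.12, 0.14   # assumed to be standard deviations
n1, n2 = 50, 40
alpha = 0.05          # implied by the critical value z_{0.025}

# Two-sample Z statistic for H0: mu1 - mu2 = delta.
z = (x1_bar - x2_bar - delta) / sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

# Two-sided critical value and decision.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
reject_h0 = abs(z) > z_crit

print(round(z, 2), round(z_crit, 2), reject_h0)
```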
5 Appendices
5.1 Exercise 1
i. Calculate the sample mean, median, sample variance and sample standard
deviation, range, coefficient of variation, and standard error of the mean for
the above data by hand.
ii. Verify that the sum of the deviations (observation -mean) is zero. Do this by
hand.
iii. Suppose that the observation 8.4 was recorded instead as 84 due to a record-
ing error. Calculate the sample mean and median of the data set under this
condition and comment on the effect of this error.
4. The mean and variance of a set of 10 values are known to be 17 and 33 respectively.
Of the 10 values 26 was subsequently found wrong and the correct value was 16. Find
the correct mean and variance of the distribution.
Find the missing information from the following data.
5. The mean weight of 150 students in a certain class is 60 kg. The mean weight of
the boys is 70 kg and that of the girls is 55 kg. Find the number of boys and girls
in the class.
6. The mean of 100 observations is 50 and standard deviation 10. What will be the
new mean and standard deviation if:
(i) 5 is added to each observation
(ii) Each observation is multiplied by 3.
7. Examine whether the following results of a piece of computation for obtaining
the variance are consistent or not.
n=120, Xi = 128, X̄=-125
8. The mean annual salary paid to all employees in a company was $ 15000. The
mean annual salaries paid to male and female employees of the company were $
15600 and $ 12600 respectively. Determine the percentages of males and females
employed by the company.
Construct:
(i) Class limits and (ii) class boundaries
Class marks 3.95 4.95 5.95 6.95 7.95 8.95 9.95 10.95 11.95 12.95
Frequencies 1 4 5 13 12 19 13 10 6 4
10. (a) What is meant by a frequency distribution? Describe the main steps in the
preparation of a frequency distribution table from raw data.
(b) A frequency distribution has 8 consecutive classes of equal width. The class mark
of the third is 24.5. The upper class limit of the 5th class is 49. The frequencies
of classes from lowest to highest classes are 8, 32, 142, 216, 240, 206, 143, and 13.
Complete the frequency distribution.
11. (a) What is a measure of dispersion? Which among the following is not a
measure of dispersion: standard deviation, first quartile, range, quartile deviation?
(b) The mean annual salary paid to all employees in a company was $ 15000. The
mean annual salaries paid to male and female employees of the company were $
15600 and $ 12600 respectively. Determine the percentages of males and females
employed by the company.
(c) On a final examination in Biometry, the mean grade of a group of 150 students
was 78 and the standard deviation was 8.0. In Mathematics, however, the mean final
grade of the group was 73 and the standard deviation was 7.6. In which course was
there the greater (i) absolute dispersion? (ii) relative dispersion?
12. (a) Give the essential characteristics of a good average.
(b) For a certain frequency distribution, the mean was 40 and mode 10. Find the
median
13. (a) Using example(s) explain the distinction between:
(i) a parameter and statistic
(ii) a discrete variable and continuous variable
(b) In general statistics plays a very significant role in any scientific research. In
your own words what do you understand by the term statistics in such a research.
14. Present the following data into a frequency distribution with classes as 80 - 89,
90 - 99, 100 - 109, etc.
15. Determine class boundaries, class limits and class marks for first, and last classes
in respect of the following:
(i) Weights of 300 entering freshmen ranged from 98 to 226 kgs
(ii) The thickness of 460 washers ranged from 0.421 to 0.563 inches.
5.2 Exercise 2
1. (a) In general statistics plays a very significant role in any scientific research. In
your own words what do you understand by the term statistics in such a research.
(b) Define the following terms: (i) data (ii) population (iii) sample
(c) Classify the following data as quantitative, qualitative, continuous, discrete:
i. animal colour
v. humans, of specified age-group, tribe and sex, with their income being specified
2. (a) Enumerate the different methods of collecting data. Which one is the most
suitable for conducting inquiry regarding perception of people on adherence to stan-
dard fishing practices among young men in Ukerewe District in Tanzania? Explain
its merits and demerits.
(b) Explain the merits and limitations of the observation method in collecting data.
Illustrate your answer with a suitable example.
3. (a) The numbers 3.2, 5.8, 7.9 and 4.5 have frequencies x, (x + 2), (x-3) and (x+6)
respectively. If their arithmetic mean is 4.876, find the modal number.
(b) What is a measure of location? Which among the following are not measures of
location? Mode, standard deviation, first quartile, range, quartile deviation.
(c) Give the essential characteristics of a good average.
(d) What is a measure of dispersion? Which among the following is not a measure
of dispersion? Standard deviation, first quartile, range, quartile deviation.
(e) Compare as far as possible, the range, mean deviation, and standard deviation
as measure of dispersion.
(f) The mean of five items of an observation is 4 and the variance is 5.2. If three of
the items are 1, 2 and 6, find the other two.
2. The mean annual salary paid to all employees in a company was Tanzania Shillings
(TZS) 150,000. The mean annual salaries paid to male and female employees of the
company were TZS 156,000 and TZS 126,000 respectively. Determine the percentages
of males and females employed by the company.
(b) For a certain frequency distribution, the mean was 40 and mode 10. Find the
median.
3. (a) An analysis of monthly wages paid to the workers in two firms A and B
belonging to the same factory, gave the following results.
                       Firm A    Firm B
Number of employees    986       548
Average wages          52.5      47.5
Variance of wages      100       121
(b) On a final examination in MTH 106: Introductory statistics, the mean grade of
a group of 150 students was 78 and the standard deviation was 8.0. In mathematics,
however, the mean final grade of the group was 73 and the standard deviation was
7.6. In which course was there the greater (i) absolute dispersion? (ii) relative
dispersion?
4. The average mark of 100 students was found to be 60. But it was later discovered
that a score of 63 was misread as 36. Find the correct average, corresponding to
the correct score.
5. Find the arithmetic mean for the frequency distribution table of question 1 (b)
in section II.
6. Draw the ogives for the following data and hence find the value of median from
it. Check the value of the median by actual calculation.
Class: 0-9 10 - 19 20 - 29 30 - 39 40 - 49 50 - 59 60 - 69 70 - 79
Freq. 328 350 720 664 598 524 378 244
8. Find the standard deviation of the following frequency distribution.
9. The average mark of 100 students was found to be 60. But it was later discovered
that a score of 63 was misread as 36. Find the correct average, corresponding to
the correct score.
5.3 Exercise 3
1. The data in the following table represent the size of an organism at equally spaced
times 0 to 8. Use the first 8 observations to estimate the regression equation of size
of the organism on time.
Time 0 1 2 3 4 5 6 7 8
Size 0.75 1.20 1.75 2.50 3.45 4.70 6.20 8.25 11.5
Food cons. kg (X) 87.1 93.1 89.8 91.4 99.5 92.1 95.5 99.3
Body weight, kg-(Y) 4.6 5.1 4.8 4.4 5.9 4.7 5.1 5.2
$$\sum_{i=1}^{n} X_i = 37.40, \quad \sum_{i=1}^{n} X_i^2 = 214.20, \quad \sum_{i=1}^{n} Y_i = 13.30, \quad \sum_{i=1}^{n} Y_i^2 = 26.09, \quad \sum_{i=1}^{n} X_i Y_i = 73.47$$
Perform whatever analysis you think is most appropriate to derive a numerical char-
acterization of the nature of the association between X and Y. You need not construct
confidence intervals, estimate standard errors, or perform any hypothesis tests.
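$$r_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \times \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$
or
$$r_{XY} = \frac{n \sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{\sqrt{\left[n \sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2\right] \times \left[n \sum_{i=1}^{n} Y_i^2 - \left(\sum_{i=1}^{n} Y_i\right)^2\right]}}$$
In calculating the Pearson correlation coefficient, one may use either the first or
second equation.
$$r_{XY} = \frac{514.29 - 497.42}{\sqrt{[1499.4 - 1398.76] \times [182.63 - 176.89]}} = 0.702$$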
In the table above, 0.702 (approximated to three decimal places) is the Pearson
correlation coefficient, the same as the one obtained by hand calculation or using
R/S-plus above.
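The hand calculation can be reproduced from the summary sums using the second (computational) formula. In this sketch the sample size n = 7 is not stated explicitly in the exercise; it is inferred from the worked substitution, since 514.29 / 73.47 = 7.

```python
from math import sqrt

# Summary sums quoted in the exercise; n = 7 is inferred, not stated.
n = 7
sum_x, sum_x2 = 37.40, 214.20
sum_y, sum_y2 = 13.30, 26.09
sum_xy = 73.47

# Computational form of the Pearson correlation coefficient.
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r_xy = numerator / denominator

print(round(r_xy, 3))
```

Rounded to three decimal places, the result agrees with the value 0.702 quoted in the text.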
5.4 Exercise 4
5. Two fair dice are rolled simultaneously. Let A be the event “the sum shown
is 6” and B the event “the two show the same number”. Find (i) P (A|B), (ii)
P (B|A).
6. The probability that a certain girl will go out is 0.60, and the probability that if
she goes out she will spend shs. 30,000 is 0.80. What is the probability that she
will go out and spend shs. 30,000?
7. The probabilities that a student will get passing grades in General Chemistry,
in Introductory Statistics or in both are P (PS100)=0.70, P (MTH106)=0.56. Check
whether events A and B are independent
8. The probability that a man will be alive in 30 years is 2/5 and the probability that
his wife will be alive in 30 years is 1/2. Find the probability that:
(i) Both will be alive
(ii) Only the man will be alive
(iii) Only the wife will be alive
(iv) Neither will be alive in 30 years
9. Two cards are drawn from a well-shuffled ordinary deck of 52 cards. Find the
probability that they are both aces if the first card is: (i) replaced, (ii) not replaced.
5.5 Exercise 5
T +=the test is positive (indicating that the disease is present)
T −=the test is negative
Z+=the individual has the disease
Z−=the individual does not have the disease
and
P(T+|Z+)=the sensitivity of the test
P(T-|Z+)=the probability of a false negative
P(T-|Z-)=the specificity of the test
P(T+|Z-)=the probability of a false positive
P (Z+) = the prevalence of the disease in the population
(i) Use Bayes’ rule to find the “predictive value” of a positive test P(Z+|T+) for a
test with 98% specificity and 99% sensitivity when the prevalence is 0.5%
(ii) An ELISA test for AIDS has 99.5% specificity and is used on 140 employees of a
medical clinic. If all 140 are free of AIDS, what is the probability that at least one
of the 140 people will nevertheless test positive for the disease?
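As a reminder of how the quantities defined above combine, the following sketch applies Bayes' rule to obtain P(Z+|T+). The function name `positive_predictive_value` and the input values (90% sensitivity, 95% specificity, 10% prevalence) are hypothetical; they are deliberately different from the figures in the questions.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(Z+ | T+) computed by Bayes' rule."""
    # P(T+ | Z+) * P(Z+): diseased individuals who test positive.
    true_positive = sensitivity * prevalence
    # P(T+ | Z-) * P(Z-): disease-free individuals who test positive.
    false_positive = (1 - specificity) * (1 - prevalence)
    return true_positive / (true_positive + false_positive)


# Hypothetical inputs, not the values from the exercise.
ppv = positive_predictive_value(sensitivity=0.90, specificity=0.95, prevalence=0.10)
print(round(ppv, 3))
```

The denominator is P(T+), obtained by summing over the two ways a positive test can arise.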
5.6 Exercise 6
In the following questions, in each case, draw a picture for the probability or normal,
t or χ2 value you are calculating.
1. Let Z denote a standard normal random variable; Z ∼ N (0, 1). Find:
(a) P (Z ≥ 1.44)
(b) P (Z ≥ 0.34)
(c) P (Z ≤ −0.86)
(d) P (Z ≤ 1.22)
(e) P (−2 ≤ Z ≤ 2)
(f) P (−0.08 ≤ Z ≤ 1.87)
(g) P (|Z| ≤ 1.96)
(h) P (|Z| ≥ 1.28)
(i) P (1.22 ≤ Z ≤ 3.01)
2. Let Z be as in the previous problem. Find z such that:
(a) P (Z ≥ z) = 0.171
(b) P (Z ≤ z) = 0.9913
(c) P (−0.25 ≤ Z ≤ z) = 0.05
(d) P (|Z| ≥ z) = .10
(e) P (|Z| ≤ z) = 0.95
3. Suppose that Y is a random variable with a normal distribution with mean
µ = 20 and variance σ 2 = 16, that is, Y ∼ N (20, 16). Find
(a) P (Y > 25)
(b) P (Y < 17)
(c) P (18<Y < 26)
(d) P (22 ≤ Y ≤ 30)
(e) P (Y < 23.5)
4. Suppose that Y ∼ N (10; 9). Find y such that
(a) P (Y ≥ y) = 0.025
(b) P (Y ≤ y) = 0.02
(c) P (Y < y) = 0.93
(d) P (Y > y) = 0.90
5. Suppose that Y ∼ N (40, 64) and that a sample of size n = 25 is obtained. Let
Ȳ = sample mean. Find:
(a) P (Ȳ > 44.3)
(b) P (Ȳ ≤ 37.5)
(c) P (42.3 ≤ Ȳ ≤ 44.4)
(d) P (Ȳ ≥ 38.9)
(e) P (Ȳ < 41.9)
(f) y such that P (Ȳ ≤ y) = 0.90
6. Let $\chi^2_v$ be a chi-square random variable with v degrees of freedom. In each case,
find x satisfying the statement.
(a) $P(\chi^2_{4} > x) = 0.10$
(b) $P(\chi^2_{22} \le x) = 0.025$
(c) $P(\chi^2_{5} \ge x) = 0.01$
7. With $\chi^2_v$ as in Problem 6, find the following probabilities. Note: the probabilities
may not be given exactly in the Table. In this case, give a range in which the
probability in question must fall. We will see when we study hypothesis testing that
this is often all the information we need.
(a) $P(\chi^2_{18} > 24.18)$
(b) $P(\chi^2_{27} \ge 19.21)$
(c) $P(\chi^2_{2} \ge 12.2)$
(d) $P(\chi^2_{14} < 7.11)$
8. Let tv be a t random variable with v degrees of freedom. In each case, find t satisfying the statement.
(a) P (t23 > t) = 0.025
(b) P (t7 ≤ t) = 0.975
(c) P (|t7 | ≥ t) = 0.10
(d) P (|t16 | < t) = 0.60
9. With tv as in Problem 8, find the following probabilities. As in Problem 7, the
probabilities may not be exactly given in the Table, so you must give a range in
which the probability in question must fall.
(a) P (t12 ≥ 2.34)
(b) P (t25 ≥ 1.45)
(c) P (t7 ≤ 1.31)
(d) P (|t19 | > 2.95)
5.7 Exercise 7
1. The following data are the random yields of two varieties of sugarcane obtained
from an experiment conducted at Mtibwa Sugar Company in Turiani-Morogoro.
Assume that dissolved oxygen concentrations may be thought of as being well repre-
sented by a normal distribution (continuous measurements), N (µ, σ 2 ), obtain a 95%
confidence interval for µ, the population mean dissolved oxygen concentration for
such samples.
3. The following data are from Finney (1978, Statistical Method in Biological Assay,
p. 179) and are from an experiment to investigate the influence of different doses of
vitamin A on weight gain over a 3-week period. For 5 rats receiving 2.5 units of
vitamin A, the following weight increases (mg) were observed:
35 49 51 43 27
Assuming that the normality assumption is reasonable, that is, that the population of weight
increases for all possible rats receiving 2.5 units of vitamin A may be approximated
by a N (µ, σ 2 ) probability distribution, obtain a 90% confidence interval for µ.
4. The following data concern two types of rations, A and B, being fed to pigs.
An experiment was conducted in which 12 randomly selected pigs were fed ration A
and 12 were fed ration B with the goal of determining whether there is a difference
in the weight gains (lbs) for pigs fed the two different rations.
A 31 34 29 26 32 35 38 34 30 29 32 31
B 26 24 28 29 30 29 32 26 31 29 32 28
Assume the normality assumption is reasonable; find a 95% confidence interval for
the difference in means (µ1 − µ2 )
5. It is thought that the body temperature of intertidal crabs exposed to air is less
than the ambient temperature. Body temperatures were obtained from a random
sample of 8 such crabs exposed to an ambient temperature of 25.4 degrees Celsius.
Assume that body temperatures are approximately normally distributed, test the
hypotheses
6. For the pig data in question 4. The goal was to determine if 2 different rations
fed to pigs result in different weight gains. Again, we are interested in whether the
rations are different. Test:
H0 : µ1 − µ2 = 0 vs. H1 : µ1 − µ2 ≠ 0
7. It is thought that the mean clutch size of ducks raised in captivity is smaller
than that of ducks breeding in the wild. Suppose it is reasonable to assume that
variability in clutch size is different for ducks raised in captivity from that for ducks
breeding in the wild. Assume that clutch size is approximately normally distributed,
so that
Population 1: Wild N1 (µ1 , σ12 )
Captive 10 11 12 11 10 11 11
Wild 9 8 11 12 10 13 11 10 12
(a) Compute sample variances (point estimates) for the two species
(b) Test the hypothesis that the two population variances from which the samples
were derived are equal
(c) Give an expression of the test statistic you would use if you were asked to test
for the difference in population means
(d) Give assumptions, if any, you need to be able to use the test statistic you have
written in part (c) above.
10. Write TRUE for the correct statement and FALSE for the wrong statement.
(a) A critical region means all values constituting a region that leads to acceptance of
the null hypothesis
(b) Type II error is the one committed by maintaining a true null hypothesis when
in fact the alternative one is correct
(c) Confidence limits are two end points within which a sample parameter falls with
specified degrees of freedom
(d) A two sample pooled t-test is applied when the two unknown population variances
are assumed to be different
5.8 Exercise 8
1. The length of the skulls of 10 fossil skeletons of an extinct species of bird has a
mean of 5.68 cm and a standard deviation of 0.29 cm. Assuming that such measure-
ments are normally distributed, find a 95% confidence interval for the mean length
of the skull of this species of bird
2. Twelve randomly selected mature citrus trees of one variety have a mean height of
13.8 feet with a standard deviation of 1.2 feet, and 15 randomly selected mature citrus
trees of another variety have a mean height of 12.9 feet with a standard deviation of
1.5 feet. Assuming that the random samples were selected from normal populations
with equal variances, construct 90% and 95% confidence intervals for the difference
between the true average heights of the two kinds of citrus trees.
3. The following data are the heat producing capacities of coal from two mines (in
millions of calories per ton):
Assume that the data constitute two independent random samples from normal
populations with equal variances, construct a 99% C.I. for the difference between
the true average heat producing capacities from the two mines.
4. A paint manufacturer wants to determine the average drying time of a new
interior paint. If for 12 test areas of equal size he obtained a mean drying time of
66.3 minutes and a standard deviation of 8.4 minutes, construct a 95% C.I. for the
true mean.
5. An industrial designer wants to determine the average amount of time it takes an
adult to assemble an “easy to assemble” toy. Use the following data (in minutes),
a random sample, to construct a 95% C.I. for the mean of the population sampled:
17 13 18 19 17 21 29 22 16 28 21 15
26 23 24 20 8 17 17 21 32 18 25 22
16 10 20 22 19 14 30 22 12 24 28 11
6. A study has been made to compare the nicotine contents of two brands of ciga-
rette. Ten cigarettes of Brand A had an average nicotine content of 3.1 milligrams
with a standard deviation of 0.5 milligram, while eight cigarettes of Brand B had
an average nicotine content of 2.7 milligrams with a standard deviation of 0.7 mil-
ligram. Assuming that the two sets of data are independent random samples from
normal populations with equal variances, construct a 95% confidence interval for the
difference between the mean nicotine contents of the two brands of cigarettes.
7. A doctor is asked to give an executive a thorough physical check-up to test the
null hypothesis that he will be able to take on additional responsibilities. Explain
under what conditions the doctor would be committing a type I error and under
what conditions he would be committing a type II error.
8. An educational specialist is considering the use of instructional material on audio
cassettes for a special class of third-grade students with reading disabilities. Students
in this class are given a standardised test in May of the school year, and µ1 is the
average score obtained on these tests after many years of experience. Let µ2 be the
average score for students using the audio cassettes, and assume that high scores are
desirable.
(a) Explain under what conditions we would commit a type I error and under what
conditions we would commit a type II error
(b) Whether an error is a type I or a type II error depends on how we formulate
the null hypothesis. Rephrase the null hypothesis so that the type I error becomes
a type II error, and vice-versa
10. A biologist wants to test the null hypothesis that the mean wingspan of a certain
kind of insect is 12.3 mm against the alternative that it is not 12.3 mm. If she takes
a random sample and decides to accept the null hypothesis if and only if the mean
of the sample falls between 12.0 mm and 12.6 mm, what decision will she make if
she gets x̄ = 12.9 mm, and will it be in error if:
(a) µ = 12.5 mm; (b) µ = 12.3 mm?
5.9 Exercise 9
For each of the following forty (40) statements, write TRUE for a correct state-
ment and FALSE for a wrong statement on the space provided at the end of each
statement.
ii. Applied or action research is mainly for uncovering new knowledge and theories
that will build upon existing knowledge or chart out new directions through
discovery.
vii. Inferential statistics utilizes sample data to make estimates, decisions, predic-
tions, or other generalizations about a population.
viii. The premise of statistical inference is that we attempt to control and assess
the uncertainty of inferences we make on the population of interest based on
observation of samples.
xi. Longitudinal studies are studies conducted in a single phase mode in order to
provide a snap shot picture of the problem under investigation.
xv. Sampling means drawing only a part of a population and studying it then
making inferences about the sample.
xvii. Non-probability samples are obtained by methods that are more objective than
subjective, which are based on the researcher’s judgment. These may be con-
venience, judgment, or quota samples.
xviii. Simple random samples are where each element of the sample has an equal
chance of appearing in the population.
xxi. Disproportionate stratified sampling is often employed when one stratum (or a
few) is underrepresented but is of importance to the researcher.
xxii. One important advantage of area sampling is that it can be carried out even
in the absence of a sampling frame.
xxv. A representative sample is a sample that is selected from the target population
through the use of probability sampling schemes.
xxvii. In a statistical test of hypothesis, the null hypothesis is the hypothesis we wish
to verify.
xxviii. In a statistical test of hypothesis, typically one tests the alternative hypothesis
against the null hypothesis and one of them is rejected.
xxix. Confidence limits are two end points within which a sample parameter falls
with specified degrees of freedom.
xxx. A critical region means all values constituting a region leads to acceptance of
the null hypothesis.
xxxi. Type II error is the one committed by maintaining a true null hypothesis when
in fact the alternative one is correct.
xxxii. Type I error is the one committed by maintaining a true null hypothesis when
in fact the alternative one is correct.
xxxiii. An estimator is a quantity describing the population that is used as a guess for
the value of a corresponding population parameter.
xxxv. The standard error (of the mean) is an estimate of the standard deviation of
all possible mean values from samples of size n.
xxxvii. Correlation analysis measures the degree or strength and nature of association
between numerical variables say X and Y.
xxxix. A one-sample t-test is performed when you want to determine if the mean value
of a target variable is different from a hypothesized value.
5.10 Exercise 10
Data on body weight (kg) and left ventricular ejection fraction (LVEF) for a
group of 28 male patients with acute dilated cardiomyopathy were collected at a
large hospital in the country. The hospital's analyst generated some descriptive
statistics (mean, standard deviation (Std Dev), standard error (Std Error), variance,
coefficient of variation, and range). The analysis was done in SPSS and the results
are given below:
Some descriptive statistics for acute dilated data
Variable    Coeff of Variation    Range
Complete the table in whatever way you feel is appropriate given the summary
statistics above.
5.11 Exercise 11
Write the letter of the best statement in II against the item in I in the space
provided.
I
S/N Item Letter
1 Variable
2 Population
3 Qualitative variable
4 Variates
5 Sampling unit
6 Hypothesis testing
7 A parameter
8 Attributes
9 A statistic
10 The number of defective items produced
during a day’s production
11 Randomness
12 Statistical inference
13 An experiment
14 Continuous data
15 Variance
16 Cluster sampling
17 Stratified sampling
18 Eye colour of a group of individuals
19 The arithmetic mean
20 Infant mortality rate
II
Statement
A Is a procedure for reaching a probabilistic conclusive decision about a
claimed value for a population’s parameter based on a sample
B It is the entire group of interest, which we wish to describe or about
which we wish to draw conclusions
C A characteristic or phenomenon, which may take different
values (e.g., weight, gender)
D Does not vary in magnitude in successive observations
E The values of quantitative variables
F The values of qualitative variables
G Means unpredictability
H Is a process whose outcome is not known in advance with certainty
J Is a quantity that is calculated from a sample of data
K Is a person, animal, plant or thing which is actually studied by a researcher
L Refers to extending your knowledge obtained from a random sample
from the entire population to the whole population
M Is an example of discrete data
N Is an unknown value, and therefore it has to be estimated
P Can be used whenever the population can be partitioned into smaller
sub-populations, each of which is homogeneous according to the particular
characteristic of interest
Q Is an example of qualitative data
R Are collected by measuring and are expressed on a continuous scale
S Is the average of the squared deviations of each observation in the set from
the arithmetic mean of all of the observations
T Can be used whenever the population is homogeneous but can be partitioned
U Is not a better representative of the data if some values are very large in
magnitude and others are small
V Is an example of continuous variables
W Is the middle value in an ordered array of observations
X Is the absolute value of the difference between the largest and the smallest
values in the data set
Y Cannot be computed when there are negative values in a set of observations
Z Is the most frequently occurring value in a set of observations
5.12 Exercise 12
1. Choose the best answer and circle its letter.
(d) Research is based on observable experience or empirical evidence
xiii. Mortgage Requested: Tsh. 120,000,000
Classify each of the responses by type of data (continuous numerical, discrete
numerical or categorical) and level of measurement (interval, nominal, ratio or ordinal).
i. (c) The research process starts only with an existing practical problem
ii. (c) A careful investigation or inquiry especially directed at the search for new
facts in any branch of knowledge
iii. (d) A score greater than 30 on the Body Mass Index (BMI)
2.
Item No.   Type of data                        Level of measurement
           (continuous numerical, discrete     (interval, nominal,
           numerical or categorical)           ratio or ordinal)
i          categorical                         nominal
ii         categorical                         nominal
iii        continuous numerical                ratio
iv         continuous numerical                ratio
v          categorical                         ordinal
vi         categorical                         nominal
vii        discrete numerical                  ratio
viii       discrete numerical                  ratio
ix         continuous numerical                ratio
x          continuous numerical                ratio
xi         categorical                         nominal
xii        discrete numerical                  ratio
xiii       continuous numerical                ratio
xiv        continuous numerical                ratio
xv         categorical                         nominal
xvi        continuous numerical                ratio
5.13 Exercise 13
1. A crop scientist was interested in Y = average leaf weight per plant (grams) after
75 days of plots planted with a particular soybean variety. The following data are
values of Y measured on 9 such randomly chosen plots.
\[ \sum_{i=1}^{n} Y_i = 169.8, \qquad \sum_{i=1}^{n} Y_i^2 = 3256.5 \]
(a) Calculate a quantity such that observations in this sample are equally likely to
have been observed above or below this value.
(b) Calculate the best value you can that quantifies the “spread” of the observations
in this sample and that has units of grams.
2. An experiment was conducted to determine the extent to which the growth
of a certain fungus could be affected by filling tubes containing the same medium
at the same temperature with inert gases. The data below are the result of one
such experiment, where X=molecular weight of gas, Y =growth measurement in
millimeters.
X 4.0 20.2 28.2 39.9 83.8 131.3
Y 3.85 3.48 3.27 3.08 2.56 2.21
\[ \sum_{i=1}^{n} X_i = 307.4, \quad \sum_{i=1}^{n} X_i^2 = 27073.42, \quad \sum_{i=1}^{n} Y_i = 18.45, \quad \sum_{i=1}^{n} Y_i^2 = 58.5499, \]
\[ \sum_{i=1}^{n} X_i Y_i = 805.503, \quad n = 6. \]
(a) Assume that there are theoretical reasons to expect the relationship between X
and Y to follow a straight line. Fit the simple linear regression line
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ to these data.
(b) Provide an interpretation for the regression parameters in the model in (a) in
terms of the situation at hand.
(c) Compute the coefficient of determination $R^2$. Based on this value of $R^2$, comment
on the usefulness of the regression line for explaining the relationship between the
response and the independent variable.
3. The mean and variance of a set of five observations are 4 and 5.2 respectively. If
three of the observations are 1, 2 and 6, find the other two.
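\[ \sum_{i=1}^{n} Y_i = 169.8, \qquad \sum_{i=1}^{n} Y_i^2 = 3256.5 \]
\[ s = \sqrt{\frac{\sum_{i=1}^{n} Y_i^2}{n} - \left(\frac{\sum_{i=1}^{n} Y_i}{n}\right)^2} \]
Thus,
\[ s = \sqrt{\frac{3256.5}{9} - \left(\frac{169.8}{9}\right)^2} = 2.425 \text{ (approx.)} \]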
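The standard deviation above can be checked numerically. The short Python sketch below recomputes s from the two summary sums, assuming the divisor-n form of the variance used in these notes; the function name `sd_from_sums` is illustrative only and does not come from the notes.

```python
import math

def sd_from_sums(n, sum_y, sum_y2):
    """Standard deviation from summary sums, using the divisor-n form:
    s = sqrt(sum(Y^2)/n - (sum(Y)/n)^2)."""
    mean = sum_y / n
    variance = sum_y2 / n - mean ** 2
    return math.sqrt(variance)

# Summary sums for the 9 soybean plots
s = sd_from_sums(n=9, sum_y=169.8, sum_y2=3256.5)
print(round(s, 3))
```

Because s is the square root of a variance measured in squared grams, it is expressed in grams, which is what part (b) of the question asks for.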
2.
X 4.00 20.20 28.20 39.90 83.80 131.30
Y 3.85 3.48 3.27 3.08 2.56 2.21
\[ \sum_{i=1}^{n} X_i = 307.4, \quad \sum_{i=1}^{n} X_i^2 = 27073.42, \quad \sum_{i=1}^{n} Y_i = 18.45, \quad \sum_{i=1}^{n} Y_i^2 = 58.5499, \]
\[ \sum_{i=1}^{n} X_i Y_i = 805.503, \quad n = 6. \]
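(a)
\[ \hat{\beta}_1 = \frac{n\sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2} = \frac{6 \times 805.503 - (307.4)(18.45)}{6 \times 27073.42 - (307.4)^2} = -0.012341 \]
\[ \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = \frac{18.45}{6} - (-0.012341) \times \frac{307.4}{6} = 3.7073 \]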
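\[ R^2 = \frac{\hat{\beta}_1^2 S_{XX}}{S_{YY}} = \frac{\hat{\beta}_1^2 \left[\sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}\right]}{\sum_{i=1}^{n} Y_i^2 - \frac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n}} = 0.9496 = 94.96\% \]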
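The least-squares estimates and the coefficient of determination can be reproduced from the summary sums alone. The following Python sketch is one way to do so; the function name `fit_from_sums` and its argument names are assumptions made for illustration, not notation from the notes.

```python
def fit_from_sums(n, sum_x, sum_x2, sum_y, sum_y2, sum_xy):
    """Least-squares intercept, slope and R^2 from raw summary sums."""
    s_xx = sum_x2 - sum_x ** 2 / n      # corrected sum of squares of X
    s_yy = sum_y2 - sum_y ** 2 / n      # corrected sum of squares of Y
    s_xy = sum_xy - sum_x * sum_y / n   # corrected sum of cross-products
    b1 = s_xy / s_xx                    # slope estimate
    b0 = sum_y / n - b1 * sum_x / n     # intercept estimate: Ybar - b1 * Xbar
    r2 = b1 ** 2 * s_xx / s_yy          # coefficient of determination
    return b0, b1, r2

# Summary sums for the fungus growth data (n = 6)
b0, b1, r2 = fit_from_sums(6, 307.4, 27073.42, 18.45, 58.5499, 805.503)
print(f"b0 = {b0:.4f}, b1 = {b1:.6f}, R^2 = {r2:.4f}")
```

The slope here is computed as S_XY / S_XX, which is algebraically the same as the formula for the slope in (a) after dividing numerator and denominator by n.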
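3. By definition, the mean is
\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \ldots + X_n}{n} \]
so
\[ 4 = \frac{1 + 2 + 6 + X_4 + X_5}{5} \]
or
\[ 9 + X_4 + X_5 = 20, \quad \text{that is,} \quad X_4 = 11 - X_5. \]
We also know that the variance is
\[ s^2 = \frac{\sum_{i=1}^{n} X_i^2}{n} - \left(\frac{\sum_{i=1}^{n} X_i}{n}\right)^2 \]
so
\[ \frac{41 + X_4^2 + X_5^2}{5} = 5.2 + 4^2 \]
\[ \frac{41 + X_4^2 + X_5^2}{5} = 21.2 \]
that is, $X_4^2 + X_5^2 = 65$. Substituting $X_4 = 11 - X_5$ gives $(11 - X_5)^2 + X_5^2 = 65$, so
\[ 2X_5^2 - 22X_5 + 56 = 0 \]
\[ X_5^2 - 11X_5 + 28 = 0 \]
Thus,
\[ X_5 = \frac{11 \pm \sqrt{(-11)^2 - 4 \times 1 \times 28}}{2 \times 1} \]
Simplifying, we have $X_5 = 4$ or $7$, with $X_4 = 7$ or $4$ respectively. The other two observations are therefore 4 and 7.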
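The same reasoning can be sketched in Python: the mean fixes the sum of the two unknown values, the variance fixes the sum of their squares, and the two values are then the roots of a quadratic. The helper `missing_pair` below is a hypothetical name, and the sketch assumes the divisor-n form of the variance used above.

```python
import math

def missing_pair(known, n, mean, variance):
    """Find the two unknown observations, given the mean and the
    divisor-n variance of all n observations."""
    total = n * mean - sum(known)                        # X4 + X5
    total_sq = n * (variance + mean ** 2) - sum(v * v for v in known)  # X4^2 + X5^2
    # X4 and X5 are the roots of t^2 - total*t + product = 0,
    # where product = X4 * X5 = (total^2 - total_sq) / 2.
    product = (total ** 2 - total_sq) / 2
    disc = total ** 2 - 4 * product
    if disc < 0:
        raise ValueError("no real solution for the given mean and variance")
    root = math.sqrt(disc)
    return (total - root) / 2, (total + root) / 2

low, high = missing_pair([1, 2, 6], n=5, mean=4, variance=5.2)
print(low, high)
```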
5.14 Exercise 14
1. Red blood cell deficiency may be determined by examining a specimen of the blood
under a microscope. Suppose a certain small fixed volume contains on the average
20 red cells for normal persons. Using Poisson distribution, obtain the probability
that a specimen from a normal person will contain less than 15 red cells.
2. Suppose that weather records show that on the average 5 out of 31 days in
October are rainy days. Assuming a binomial distribution with each day of October
as an independent trial, find the probability that the next October will have at most
three rainy days.
3. For married couples living in a certain suburb, the probability that the husband
will vote in a school board election is 0.21, the probability that the wife will vote in
the election is 0.28, and the probability that they will both vote is 0.15. What is the
probability that at least one of them will vote?
4. Let:
$T^+$ = the test is positive (indicating that the disease is present)
$T^-$ = the test is negative
$Z^+$ = the individual has the disease
$Z^-$ = the individual does not have the disease
and
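1. Let X represent the number of red blood cells in the specimen from a normal person.
Thus, $X \sim P(\lambda)$ with
\[ P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}, \quad x = 0, 1, \ldots \]
Here $\lambda = 20$.
We want
\[ P(X < 15) = \sum_{x=0}^{14} P(X = x) = P(X = 0) + P(X = 1) + \ldots + P(X = 14) \]
\[ = \frac{e^{-20}\, 20^0}{0!} + \frac{e^{-20}\, 20^1}{1!} + \ldots + \frac{e^{-20}\, 20^{14}}{14!} \]
\[ = e^{-20} \left(1 + 20 + \frac{20^2}{2!} + \ldots + \frac{20^{14}}{14!}\right) \]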
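The sum above is tedious by hand but straightforward to evaluate numerically. The Python sketch below adds up the Poisson probabilities term by term; `poisson_cdf` is an illustrative helper name and is not taken from the notes.

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summing the pmf term by term."""
    return sum(math.exp(-lam) * lam ** x / math.factorial(x)
               for x in range(k + 1))

# P(X < 15) is the same as P(X <= 14) for a count variable
p_less_than_15 = poisson_cdf(14, 20)
print(round(p_less_than_15, 4))
```

The result should be a little over 0.10, so roughly one specimen in ten from a normal person would show fewer than 15 red cells.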
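2. Let X be a random variable representing the number of rainy days in the month
of October.
Thus, $X \sim B(n, p)$ with
\[ P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad x = 0, 1, \ldots, n \]
Here $n = 31$, $p = \frac{5}{31}$, $q = 1 - p = \frac{26}{31}$. We want
\[ P(X \le 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) \]
\[ = \binom{31}{0} \left(\frac{5}{31}\right)^0 \left(\frac{26}{31}\right)^{31} + \ldots + \binom{31}{3} \left(\frac{5}{31}\right)^3 \left(\frac{26}{31}\right)^{28} \]
\[ \approx 0.2405 \]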
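As a check on the arithmetic, the four binomial terms can be summed in Python. This is a minimal sketch using only the standard library; `binom_cdf` is a hypothetical helper name.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p), summing the pmf term by term."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x)
               for x in range(k + 1))

# At most three rainy days out of 31, with p = 5/31
p_at_most_3 = binom_cdf(3, 31, 5 / 31)
print(round(p_at_most_3, 4))
```

The value should come out at roughly 0.24.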
3. Let A and B be the events that the husband will vote in a school election and the
wife will vote in the school election respectively. And $A \cap B$ be the event that both
the husband and the wife will vote in the school election.
Thus, $P(A) = 0.21$, $P(B) = 0.28$ and $P(A \cap B) = 0.15$. Required to find the probability
that at least one of them will vote. That is, $P(A \cup B)$.
Using the addition law for not mutually exclusive events, we have
\[ P(A \cup B) = P(A) + P(B) - P(A \cap B) = 0.21 + 0.28 - 0.15 = 0.34 \]
(ii) An ELISA test for AIDS has 99.5% specificity and is used on 140 employees of a
medical clinic. If all 140 are free of AIDS, what is the probability that at least one
of the 140 people will nevertheless test positive for the disease?
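Given $P(T^- \mid Z^-) = 99.5\% = 0.995$. Required to find: the probability that at least
one of the 140 disease-free employees tests positive.
We know that if E is the event representing success and Ē is the event representing
failure, then $P(E) + P(\bar{E}) = 1$, so for a single disease-free person
$P(T^+ \mid Z^-) + P(T^- \mid Z^-) = 1$. Assuming the 140 test results are independent,
\[ P(\text{at least one positive}) = 1 - P(\text{all 140 test negative}) = 1 - [P(T^- \mid Z^-)]^{140} = 1 - (0.995)^{140} \approx 0.504, \]
or about 50%.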
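The complement calculation can be sketched in a few lines of Python. The sketch assumes the 140 test results are independent of one another, and the function name `prob_at_least_one_positive` is illustrative only.

```python
def prob_at_least_one_positive(specificity, n_people):
    """Probability that at least one of n_people disease-free individuals
    tests positive, assuming independent tests."""
    p_all_negative = specificity ** n_people
    return 1 - p_all_negative

p_false_alarm = prob_at_least_one_positive(0.995, 140)
print(round(p_false_alarm, 3))
```

Even with a highly specific test, the chance of at least one false positive grows quickly with the number of people screened.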