In your statistics course, there are scheduled computer tutorials in which you explore the use of
Matlab in statistics. Matlab is available on both the Linux and the Windows desktops in the
labs in the School of Mathematics and Statistics in the Red Centre (on the ground floor and on the
mezzanine level). Each student has a scheduled class with a tutor in the labs approximately every
second week for the session. Students are welcome to use the labs at any time when they are not busy
for other classes. Check notice boards for the current opening hours. You do need to be able to
use Matlab, so start as early as you can in the session!
Exercises and data sets for the computer tutorials will be made available on your course web page
(Moodle). Solutions to these exercises will usually be available from the course web page a few weeks
thereafter. A general ‘Introduction to Matlab’ document is available through the School of
Mathematics and Statistics web page and from your course web page. In addition, an
‘Introduction to Matlab and Statistics Using Matlab’ document (referred to here as SUM)
is available from the course web page. In all Matlab tutorials make sure you understand the
output you obtain. When you produce a graph, look at it! What does it tell you about the data?
Ask your tutor if you have questions. Explore the possibilities of Matlab: you may find it very useful
in all your courses!
Instructions
After this tutorial you should be able to: Open Matlab; Do the Matlab quiz on Maple TA, guided by the
MATLAB Self-paced tutorial.
1. Opening Matlab
Double-click on the Matlab icon and a Matlab Command Window should pop up. The >> is a
prompt, where you will enter commands.
2. Maple TA Matlab Quiz
Many of the assessments in this course will be done using Maple TA. If you log into Maple TA, you
will see a set of Matlab Quizzes, which are not assessable. They will however assist you in developing
a mastery of core Matlab tasks. We will use the remainder of this class to make a start on completing
them.
a) Go to Moodle and open two links:
(i) The Matlab Self-Paced Tutorial
(ii) Maple TA
We will use these two resources for the remainder of the lesson.
b) In Maple TA, enrol in the course with the title and semester that matches your enrolment, and
complete the Declaration.
c) Now go to the Matlab Self-Paced Tutorial and work through Lesson 0.
d) When finished, check your understanding by doing the Maple TA quiz for Lesson 0.
e) Now repeat for each of the remaining lessons in the Self-Paced Tutorial – testing yourself on
Maple TA after completing each Self-Paced Tutorial lesson.
Browse the general ‘Introduction to Matlab’ document and the ‘Statistics Using Matlab’ document (here referred to
as SUM) to find extra information. Remember that you can also use the help command at any time
to get information about any Matlab command.
Do not just enter Matlab commands! Make sure you understand the output you obtain from the
commands you use. When you produce a graph, look at it! What does it tell you about the data? Ask your tutor if
you have questions. Explore the possibilities of Matlab and go beyond what is actually required!
Solutions to these exercises will be available from the course web page later.
Exercise 1
In many instances, the simplest way to import data into Matlab is to use the Import Wizard. All
you need to do is click Import Data. In this course we only require you to be able to import data using
the Import Wizard and this is the same whether you are running Matlab on Linux or on Windows.
There are various data files which we will use in this course. These are available from the course
web page (Moodle) in a zip file. You should now use the following instructions and use the Import
Wizard (Home → Import Data) to check that you can import a file into Matlab without any problem.
a) Download the zip file and extract each file into your own directory on the computer. These
should all be saved as .txt files.
b) See if you can use the Import Wizard to import inflows.txt into Matlab. This data file
contains two columns of data, one called Year and one called Inflows. Select this data file and
click Import Data. Alternatively, you can right-click on the data file and select Import Data.
You should see that there are the two columns of data, with names at the top, and 25 pairs of
values. The columns are separated by a tab so check that Tab is selected at the top. You can
edit the variable names as Year and Inflow. This will create vectors in Matlab called Year
and Inflow. Note that the names are case sensitive. Finally click the green tick Import
Selection.
c) To list the data you can enter >> Year. (What happens if you forget the capital letter?) Then
enter >> Inflow. To check you have all the data you might check the length of the vectors
by >> length(Year). (You should get 25.) To see the two columns you could enter >> [Year
Inflow]. (What happens if you enter >> [Year;Inflow]?)
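For readers who prefer the command line, the Import Wizard steps above can also be sketched in code. This is only an illustrative alternative and is not required in this course. The file name demo_inflows.txt and its three data rows are made up for the example (they are not the real inflows data), and the readtable function is assumed to be available in your Matlab release.

```matlab
% Write a tiny tab-delimited file with a header row (hypothetical values,
% NOT the real inflows data) so that the example is self-contained.
fname = fullfile(tempdir, 'demo_inflows.txt');
fid = fopen(fname, 'w');
fprintf(fid, 'Year\tInflows\n');
fprintf(fid, '1\t120\n2\t340\n3\t95\n');
fclose(fid);

% Read the file; the first row supplies the variable names.
T = readtable(fname, 'Delimiter', '\t', 'ReadVariableNames', true);

% Pull the columns out as column vectors, renaming Inflows to Inflow.
Year   = T.Year;
Inflow = T.Inflows;

% Quick checks, as in part c): number of values and a side-by-side listing.
n = length(Year)
[Year Inflow]
```

The side-by-side listing works because both vectors are columns of the same length.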
Exercise 2
The Hardap Dam is the largest dam in Namibia, with a surface area of approximately 25 square
kilometres, a capacity of 320 million cubic metres, and a dam wall 862 metres long. Table 1 shows
annual maximum floodpeak inflows to the dam, in cubic metres per second, for years 1962/63 to
1986/87. (Data from Metcalfe, Andrew V. (1997). Statistics in Civil Engineering. London: Arnold ;
New York: Wiley). The data are stored in inflows.txt so if you completed Exercise 1 then it should
already be in your workspace.
a) Construct a frequency histogram of the inflows using the histogram command. Give the graph a
title, and label the x- and y-axes appropriately. (Reference SUM p.7-9 or use the help command).
Explore the histogram command so that you understand how to label appropriately.
b) Use Matlab to find the mean, the variance, the standard deviation and the five-number summary
of the inflows.
c) Construct a density histogram to display the Inflows data. Give the graph a title, and label the
axes appropriately. The histogram function can construct a density histogram using the option
'Normalization', 'pdf'. For example, to plot a density histogram for data x with 5 classes you
would type: histogram(x,5,'Normalization','pdf')
d) How does the number of classes change the histogram? Plot density histograms with 5 and 50
classes and comment on how informative they are.
e) Equal-width classes may not be a good choice if a data set “stretches out” to one side or the other,
as is the case for the inflows data set. Add an edges vector to the histogram function to plot
a density histogram with classes [0, 500), [500, 1500), [1500, 3000) and [3000, 6500).
f) Comment on the shape of the histogram. Without looking at the summary statistics, which one
do you think is larger : the mean or the median? Why?
g) Construct a frequency histogram with the same classes as in e). Compare it to the density histogram
obtained in e). Which one is misleading and why?
h) Create a new data vector with values equal to the logarithms of the inflows. Construct a histogram
of the new data vector with appropriate labels and titles. (Hint: Use the figure command before
constructing this second histogram, to put the figure in a new window.)
i) The coefficient of skewness is a measure of asymmetry of the data. It is defined as
$$ g = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n s^3} $$
where s is the sample standard deviation. It gives an indication of departures from symmetry: a
positive value of g may indicate a long “right tail” in the data with values above the median tending
to be further away from the median than values below the median, while negative skewness may
indicate a long “left tail” with values below the median tending to be more extreme.
Use Matlab to calculate the skewness of the two datasets (see SUM p.6 or use help command).
Compare the shapes of the histograms for the inflows and the log transformed inflows, and comment
on the degree and type of skewness.
j) Write an m.file to carry out some or all of this exercise.
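The histogram options from parts c) and e) and the skewness formula from part i) can be sketched as follows. The data are simulated (a right-skewed sample built from uniform random numbers), not the inflows data, and the edges vector is an arbitrary choice made only for this illustration.

```matlab
rng(1);                          % fix the seed so the sketch is reproducible
x = -100 * log(rand(200, 1));    % simulated right-skewed data (hypothetical)

% Density histogram with 5 equal-width classes.
figure
h1 = histogram(x, 5, 'Normalization', 'pdf');
title('Density histogram, 5 classes'); xlabel('x'); ylabel('Density');

% Density histogram with unequal classes given by an edges vector.
% The last edge is chosen so that every observation falls in some class.
edges = [0 50 150 300 max(301, max(x) + 1)];
figure
h2 = histogram(x, edges, 'Normalization', 'pdf');
title('Density histogram, unequal classes'); xlabel('x'); ylabel('Density');

% In a density histogram the bar areas (height times width) add up to 1.
totalArea = sum(h2.Values .* diff(edges))

% Coefficient of skewness from the formula in part i),
% using the sample standard deviation s (divisor n - 1).
n = length(x);
g = sum((x - mean(x)).^3) / (n * std(x)^3)
```

The skewness is computed directly from the formula rather than with a built-in command, so that the divisor matches the definition given in part i).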
Exercise 3
The table below shows the age at taking office for the 45 US presidents to date.
a) Recover the vectors Name, Age and Country, imported from the presdat.txt data set into Matlab
in the first lab class. Make sure you change the format of Country to Text (string) in the
Import window.
Extract the ages of the US presidents only. Hint: the command strcmp is used to compare
character strings. The command AgeUS=Age(strcmp(Country,'US')) will create a vector AgeUS
that contains the Age values for which Country is 'US'.
b) Construct a titled and labelled plot of president age versus time order. Comment on any temporal
patterns you see (i.e., what is the pattern, if any, over time).
c) Construct a titled and labelled frequency histogram of president ages.
d) Use the histogram command to construct an appropriately labelled frequency histogram showing
numbers of presidents whose ages fall in the ranges 42-44, 45-47, 48-50, ...,69-71. Hint: enter help
histogram for information on this command. You need to set up an edges vector to use in the
histogram function.
e) Compare the two frequency histograms produced in the previous steps. Comment briefly on the
visual impact of the choice of class intervals used in both.
f) This dataset also contains the ages of taking office for all Australian Prime Ministers. Now we will
compare the two countries in terms of age of their leader when taking office.
Calculate the sample mean, the sample variance and the sample standard deviation for ages at
taking office in the two countries and comment.
g) Construct side-by-side horizontal boxplots of age for the US and Australian leaders. Type help
boxplot if you are unsure how to do this.
h) Calculate the five-number summaries and sample interquartile ranges for each country. Identify
where these values appear on the boxplots.
i) The boxplot for the US shows one outlying value. Find the name of this president who was
appreciably older than the others when he took office.
j) Do the boxplots suggest that there is a difference in ages at taking office for leaders in the two
countries?
k) Write an m.file to carry out some or all of this exercise.
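The strcmp hint in part a) relies on logical indexing, which can be sketched on a small made-up data set. The names, ages and the country code 'Aus' below are hypothetical and are not taken from presdat.txt.

```matlab
% Hypothetical mini data set (made-up names, ages and countries).
Name    = {'Leader A'; 'Leader B'; 'Leader C'; 'Leader D'};
Age     = [57; 48; 61; 52];
Country = {'US'; 'Aus'; 'US'; 'Aus'};

% strcmp returns a logical vector: true where Country matches 'US'.
isUS = strcmp(Country, 'US');

% Logical indexing keeps only the rows where isUS is true.
AgeUS  = Age(isUS)
NameUS = Name(isUS);

% The same subset can be used to look up the name that goes with a value.
[oldestAge, idx] = max(AgeUS);
oldestName = NameUS{idx}
```

Keeping the logical vector in its own variable (isUS) makes it easy to apply the same selection to several vectors.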
Exercise 4
In a study designed to test the safety of various types of vehicles, vehicles were crashed into a wall
at 60 km/h with a crash-test dummy strapped in the driver’s seat. In the crash.txt data file, you will
find the 6 variables
• Head: measure of head injuries
• Chest: chest deceleration (in g), which is a measure of chest injury
• Airbag: 1 indicates an airbag, 0 indicates only seatbelts were used
• Doors: number of doors (equals 6 for vans and sport utility vehicles)
• Year: year of vehicle
• Weight: weight of vehicle
Recover the vectors Head, Chest, Airbag, Doors, Year and Weight imported from the crash.txt data
set into Matlab from Moodle.
a) i) Chest is a measure of chest injuries sustained by the crash-test dummy. Larger values indicate
more severe injuries. Use Matlab to find the mean, the variance, the standard deviation and
the five-number summary of the chest injuries.
ii) Determine the proportion of observations that have a value of Chest that is larger than or
equal to 60.
iii) Construct a density histogram for Chest. How many classes would you consider? Comment on
its shape. Experiment with changing the number of classes. For instance, produce a density
histogram with 6 classes, and one with 25 classes. Compare and comment.
b) Suppose we wanted to see if airbags prevented injuries. Since Airbag is a categorical variable and
Chest is a quantitative variable, a good form of analysis would be a side-by-side boxplot. Type
help boxplot if you are unsure how to do this.
i) Compare the distribution of chest deceleration for airbag (Airbag = 1) and non-airbag (Airbag
= 0) cars. Do airbags appear to prevent injury?
c) Now let’s compare the chest injuries of vehicles with different numbers of doors. Create side-by-side
boxplots, as above, using these two variables.
i) Does there appear to be a relationship between the number of doors and chest deceleration?
What about the plot suggests a relationship?
ii) Which type of vehicle (in terms of number of doors) tends to have the least severe injuries?
iii) Why do you think that chest deceleration might be related to number of doors on the vehicle?
d) To test one reasonable theory, make a boxplot of Weight by Doors. What do these boxplots suggest
about the results of the previous boxplot comparing chest deceleration by number of doors?
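Two techniques recur in this exercise: computing a proportion from a logical comparison, and drawing boxplots split by a grouping variable. A minimal sketch on made-up numbers (not the crash data) is given below. It assumes that boxplot is available through the Statistics and Machine Learning Toolbox.

```matlab
% Hypothetical measurements and a 0/1 grouping variable (made-up values).
y     = [45 62 58 71 49 66 53 40 60 55]';
group = [ 1  0  1  0  1  0  1  1  0  1]';

% A logical comparison gives 0/1 values, so its mean is a proportion.
propLarge = mean(y >= 60)

% Summary statistics within each group.
meanWith    = mean(y(group == 1))
meanWithout = mean(y(group == 0))

% Side-by-side boxplots: one box per distinct value of the grouping variable.
figure
boxplot(y, group)
xlabel('group'); ylabel('y'); title('y by group (hypothetical data)')
```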
Instructions
The following are exercises that will be covered during your tutorial class.
Note that not every topic in the course can be covered by the tutorial exercises. You are expected
to take an active part in the learning process by working by yourself on most of the topics. In particular, you
are expected to attempt these tutorial questions beforehand. During the tutorial, the tutor will go
through the answers to some of the questions while directing explanation to areas where students indicate they
have difficulty.
Solutions to these exercises will be available from the course web page at a later date.
Exercise 1
a) Let A and B be two mutually exclusive events in a sample space with P (A) = 0.2 and P (B) = 0.5.
Determine, if possible, P (A ∪ B).
b) Let E and F be two events in a sample space with P (E) = 0.2, P (F ) = 0.3, and P (E ∩ F ) = 0.1.
Determine, if possible, P (E ∪ F ).
c) Let E and F be two independent events in a sample space with P (E) = 0.4 and P (F ) = 0.25.
Determine, if possible, P (E ∪ F ).
d) Let A and B be events in a sample space with complements A^c and B^c. Assume that P (A) = 0.2,
P (B) = 0.4, P (A ∩ B) = 0.1. Determine (if possible) the probability P (A^c ∩ B^c).
e) A quick and inexpensive blood test is introduced to screen for a disease which we will call X. When
people who have the antibodies for X in their blood have the screening test, they have a 95% chance
of getting a positive result (and a 5% chance of a negative result). When people who do not have
the antibodies for X in their blood have the screening test, they have a 2% chance of getting a
positive result (and a 98% chance of a negative result). Assume that 5% of the population have
the antibodies in their blood.
A random person has a screening blood test.
i) What is the probability that the test comes back positive?
ii) What is the probability that they have the disease, given that their test came back positive?
Exercise 2
For any collection of events A1 , A2 , A3 , . . . , Ak , it can be shown that the inequality
P(A1 ∪ A2 ∪ A3 ∪ A4 ∪ . . . ∪ Ak ) ≤ P(A1 ) + P(A2 ) + . . . + P(Ak )
always holds (try to understand why! Hint: use Venn diagrams and the additive law of probability).
This inequality is most useful in cases where the events involved have relatively small probabilities. For
example, suppose a system consists of five subcomponents connected in series and that each component
has a 0.01 probability of failing. Find the upper bound on the probability that the entire system fails.
For any collection of events A1 , A2 , A3 , . . . , Ak , it can be shown that the inequality
P(A1 ∩ A2 ∩ A3 ∩ A4 ∩ . . . ∩ Ak ) ≥ 1 − [P(A1^c) + P(A2^c) + . . . + P(Ak^c)]
always holds (try to understand why! Hint: use De Morgan’s laws and the first part of this exercise).
This inequality is most useful in cases where the events involved have relatively high probabilities. For
example, suppose a system consists of ten subcomponents connected in series and that each component
has a 0.999 probability of functioning without failure. Find the lower bound on the probability that
the entire system functions correctly.
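Both bounds can be checked numerically on a small case. The sketch below uses made-up values (k = 3 events, each with probability 0.05) rather than the numbers in the exercise. Independence is assumed only so that the exact probabilities can be written down; the inequalities themselves hold without it.

```matlab
% Hypothetical setting: k events, each with small probability p.
% Independence is assumed here ONLY so that exact values can be computed.
k = 3;
p = 0.05;

% Upper bound on P(A1 or A2 or ... or Ak): the sum of the probabilities.
upperBound = k * p;

% Exact probability of the union under independence: 1 - P(none occurs).
exactUnion = 1 - (1 - p)^k;

% Second inequality: events B1..Bk, each with high probability q = 1 - p.
q = 1 - p;
lowerBound = 1 - k * (1 - q);    % 1 - [P(B1^c) + ... + P(Bk^c)]
exactIntersection = q^k;         % exact value under independence

fprintf('union:        exact %.6f <= bound %.6f\n', exactUnion, upperBound);
fprintf('intersection: exact %.6f >= bound %.6f\n', exactIntersection, lowerBound);
```

With small p the bound and the exact value are close, which is why the inequalities are most useful in that situation.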
Exercise 3
Among the students doing a given course in engineering, there are four males enrolled in the civil
engineering program, six females enrolled in the civil engineering program, and six males enrolled in the
chemical engineering program. How many females must be enrolled in the chemical engineering program
if gender and engineering program are to be independent when a student is selected at random?
Exercise 4*
An oil exploration company currently has two active projects, one in Asia and the other in Europe.
Let A be the event that the Asian project is successful and B the event that the European project is
successful. Suppose that A and B are independent events with P(A) = 0.4 and P(B) = 0.7.
a) If the Asian project is not successful, what is the probability that the European project is also not
successful?
b) What is the probability that at least one of the two projects will be successful?
c) Given that at least one of the two projects is successful, what is the probability that only the Asian
project is successful?
Exercise 5
Consider selecting three people at random for a committee, where each person independently has 50%
chance of being female. Let X be the number of females on the committee. The probability mass
function of X can be written as follows:
x      0     1     2     3
p(x)   1/8   3/8   3/8   1/8
a) Find F (x) = P(X ≤ x), the cumulative distribution function of X.
b) Find E(X).
c) Find Var(X).
d) X is an example of a special random variable. Which one? What are the values of its parameters?
e) An important committee is selected from a group of students that are half male and half female.
Three people are on the committee, and all three are female. Do you think there is evidence that
the committee was not selected fairly, with respect to gender?
Exercise 6
Let X be a r.v. with cumulative distribution function F (x) and density f (x) = F′(x). Find the
probability density function of
a) the maximum of n independent random variables all with cumulative distribution function F (x).
b) the minimum of n independent random variables all with cumulative distribution function F (x).
Exercise 7
An article in the review Knee Surgery, Sports Traumatology and Arthroscopy in 2005 cites a success
rate of more than 90% for meniscal tears with a rim width of less than 3mm, but only a 67% success
rate for tears of 3–6mm. If you are unlucky enough to suffer from a meniscal tear of less than 3mm
on your left knee and one of width 3–6mm on your right knee, what is the probability mass function
of the number of successful surgeries? Assume the surgeries are independent. Find the mean and
variance of the number of successful surgeries that you would undergo.
Exercise 8*
The article “Error Distribution in Navigation” (J. Institute of Navigation, 1971) suggests that the
distribution of the lateral position error, say X (in nautical miles), which can be either positive or
negative, is well approximated by a density like
$$ f(x) = c \, e^{-0.2|x|} \quad \text{for } -\infty < x < \infty, $$
for a constant c.
a) Find the value of c which makes f a legitimate density function, and sketch the corresponding
density curve.
b) In the long run, what proportion of errors is negative? At most 2? Between −1 and 2?
Exercise 9
The probability density function of the weight X (in kg) of packages delivered by a post office is
$$ f(x) = \frac{70}{69 x^2} \quad \text{for } 1 < x < 70 $$
and 0 elsewhere.
a) Determine the mean and the variance of the weight X.
b) If the shipping cost is $2.50 per kg, what is the average shipping cost of a package? What is the
variance of the shipping cost?
c) In the long-term, what is the proportion of packages whose weight exceeds 50 kg?
Exercise 10*
a) Given the four scatter plots for two random variables X and Y in Figure 1,
Figure 1
i) which of the plots demonstrates a positive relationship ?
ii) which of the plots demonstrates a positive linear relationship ?
iii) which of the plots demonstrates a negative relationship ?
iv) which of the plots demonstrates no relationship ?
v) for which of the plots would you expect positive correlation ? Negative correlation ? No or
little correlation ?
b) For each of the following pairs of variables, indicate whether you would expect a positive correlation,
a negative correlation, or little or no correlation. Explain your choices.
i) Maximum daily temperature and cooling cost
ii) Interest rate and number of loan applications
iii) Distance a student doing MATH2099 lives from UNSW campus and their marks at the Matlab
online quizzes
Instructions
After this tutorial you should be able to: use Matlab for determining probabilities from common probability
distributions.
Browse the general ‘Introduction to Matlab’ document and the ‘Statistics Using Matlab’ document (here referred to
as SUM) to find extra information. Remember that you can also use the help command at any time
to get information about any Matlab command.
Do not just enter Matlab commands! Make sure you understand what you do and the output you obtain.
When you produce a graph, look at it! What does it tell you about the data? Ask your tutor if you have
questions. Explore the possibilities of Matlab and go beyond what is actually required!
Solutions to these exercises will be available from the course web page at a later date.
Matlab has pdf and cdf commands for determining probabilities from many distributions, including
Normal, Exponential, Uniform, Binomial and Poisson. Use the help command for more
details, or see Section 4 of the SUM. Note that Matlab uses the notation “pdf” for both the
probability mass function (pmf) of discrete distributions and the probability density function (pdf) of
continuous distributions.
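A short sketch of how these commands are typically called is given below. The distributions and parameter values are arbitrary illustrations, not taken from the exercises, and the functions are assumed to come from the Statistics and Machine Learning Toolbox.

```matlab
% Discrete: "pdf" commands return the probability mass function P(X = x).
pBinEq  = binopdf(2, 4, 0.5)      % P(X = 2)  for X ~ Binomial(n = 4, p = 0.5)
pBinLe  = binocdf(2, 4, 0.5)      % P(X <= 2) for the same X
pPoisLe = poisscdf(1, 2)          % P(X <= 1) for X ~ Poisson with mean 2

% "At least" probabilities use the complement of the cdf.
pBinGe3 = 1 - binocdf(2, 4, 0.5)  % P(X >= 3) = 1 - P(X <= 2)

% Continuous: cdf commands return P(X <= x); pdf commands return a density.
pNorm = normcdf(1, 0, 1) - normcdf(-1, 0, 1)   % P(-1 <= Z <= 1), Z ~ N(0, 1)
pExp  = expcdf(3, 2)                           % P(X <= 3), Exponential with MEAN 2
```

Note that for a discrete variable P(X ≥ 3) is the complement of P(X ≤ 2), not of P(X ≤ 3).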
Exercise 4
Many families of probability distributions are useful for engineers, including Binomial and Poisson.
a) In each of the following situations state whether it is reasonable to use one of these distributions
for X. If so, say which one and (if possible) determine the values of the parameters:
i) Toss a fair coin 6 times, X is the number of ‘Heads’.
ii) Toss a fair coin until the first time a head appears, X is the count of the number of tosses you
make.
iii) A factory makes carpets. Sometimes there are flaws in the carpet. On average a square metre
of carpet has 3 flaws. X is the number of flaws in a random square metre of carpet.
iv) Most calls made at random by sample surveys don’t succeed in talking to a person. Of calls
in New York City, only 1/12 succeed. A survey calls 500 randomly selected numbers in New
York City, and X is the number that reach a live person.
v) Calls to a telephone exchange come in at an average of 250 an hour, X is the number of calls
in a given hour.
vi) A die (6 faces, numbered 1,2,3,4,5,6) is tossed twice and X is the number of 6s obtained.
vii) A die (6 faces, numbered 1,2,3,4,5,6) is tossed 7 times and X is the largest number obtained
in the 7 tosses.
viii) A die (6 faces, numbered 1,2,3,4,5,6) is tossed 4 times and X is the total of the numbers
obtained on the top face.
b) Select one of the variables in part a) which can be modelled by a Binomial distribution, and
find the long-run proportion of times that X ≤ 1.
c) Select one of the variables in part a) which can be modelled by a Poisson distribution, and find
the long-run proportion of times that X ≤ 1.
d) In October 1994, a flaw in a certain Pentium chip installed in computers was discovered that could
result in a wrong answer when performing division. The manufacturer initially claimed that the
chance of any particular division being incorrect was only 1 in 9 billion, so that it would take
thousands of years before a typical user encountered a mistake. However, statisticians are not
typical users; some modern statistical techniques are so computationally intensive that a billion
divisions over a short period of time is not outside the realm of possibility. Assuming that the 1 in
9 billion figure is correct and that results of divisions are independent of one another, what is the
probability that at least one error occurs in one billion divisions with this chip?
Exercise 5
An individual claims to have extrasensory perception (ESP). As a test, a fair coin is tossed ten
times, and he is asked to predict in advance the outcome. Our individual gets seven out of ten correct.
What is the probability he would have done at least this well if he had no ESP? Would you believe in
his powers?
Exercise 6: the Uniform distribution*
The Uniform distribution is a continuous distribution. Use help to determine how to use unifpdf
and unifcdf. You may also find the command unifinv useful (use help to find out what it is for).
a) For X ∼ U[−1,1] , find
i) P(X < 0)
ii) P(X ≤ 0)
iii) P(−0.9 ≤ X ≤ 0.8)
iv) the value of x such that P(−x ≤ X ≤ x) = 0.9
b) The thickness of photoresist applied to wafers in semiconductor manufacturing at a particular
location on the wafer is uniformly distributed between 0.2050 and 0.2150.
i) Determine the proportion of wafers that exceeds 0.2125 micrometers in photoresist thickness.
ii) What thickness is exceeded by 10% of the wafers?
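The three Uniform commands can be sketched on a made-up interval. The interval [0, 10] and the cut-off values below are chosen purely for illustration and are not those of the exercise.

```matlab
% Hypothetical example: X ~ Uniform on [a, b] = [0, 10].
a = 0;  b = 10;

pBelow   = unifcdf(2.5, a, b)                    % P(X <= 2.5)
pBetween = unifcdf(7, a, b) - unifcdf(4, a, b)   % P(4 <= X <= 7)
pAbove   = 1 - unifcdf(8, a, b)                  % P(X > 8)

% unifinv inverts the cdf: the value x with P(X <= x) = 0.9,
% which is also the value exceeded with probability 0.1.
x90 = unifinv(0.9, a, b)

% The density is constant, 1/(b - a), inside the interval.
dens = unifpdf(5, a, b)
```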
Instructions
The exercises covered during the tutorial class will be chosen from those printed here.
Note that not every topic in the course can be covered by the tutorial exercises. You are expected
to take an active part in the learning process, by working by yourself on most of the topics.
In particular, you are expected to attempt these tutorial questions beforehand. At the tutorial,
the tutor will go through the answers to some of the questions, directing explanation to areas where students
indicate they have difficulty.
Solutions to these exercises will be available from the course web page at a later date.
Exercise 1
Consider X̄, the sample mean of a random sample of size n from a normally distributed variable with
mean µ and variance σ².
a) Find E(X̄).
b) Find Var(X̄).
c) What is the distribution of X̄?
Exercise 2
The Rockwell scale is a hardness scale based on indentation hardness of a material.
a) We have a sample of n = 9 measures of Rockwell hardness of certain metal pins:
48.70 52.05 49.73 51.69 47.17 52.00 51.42 51.96 50.65
[Figure: boxplot of the sample (left) and normal QQ-plot of the sample, Quantiles of Input Sample against Standard Normal Quantiles (right).]
It is usually assumed that the distribution of all such pin hardness measurements is normal. Given
the plots (boxplot and qq-plot) above, do you think that this assumption is reasonable?
b) Experts believe that the Rockwell hardness of that particular metal is µ = 50, and that measures
returned by the tester are normally distributed with standard deviation σ = 1.5. Imagine that we
were given many other random samples of size n = 9 of hardness measures for similar metal pins.
What can be said about the distribution of the sample means from all those samples?
c) The above sample shows an average of x̄ = 49.8. Does this contradict the assumption that the ‘true’
Rockwell hardness of that metal is µ = 50? Compute the probability that the average hardness for
another random sample of size n = 9 is smaller than the observed 49.8, and form an opinion.
d) An experimenter has found a box containing 9 metal pins, but it is not clear if it is the same metal
as the above. The Rockwell hardness of each pin is measured, and the average is 52. Do you
think that it is plausibly the same metal? Compute the probability that the average hardness for
a random sample of n = 9 pins of the initial metal is greater than 52, and form an opinion.
e) Suppose now that the available samples have size n = 200, and not n = 9. What is now the
sampling distribution of the average hardness measure for such random samples? What is the
probability that the average hardness in such a random sample of 200 pins is at most 49.8? Do you
need to assume normality here?
In answering the above questions, you can make use of the following Matlab output:
>> z = [-4 -1.87 -0.4];
>> normcdf(z)
ans =
    0.0000    0.0307    0.3446
Exercise 3*
a) Suppose that X is a random variable which is Exponentially distributed with mean 2.
i) Write down the density function of X. What are µ, σ² and σ for this distribution?
ii) Random samples of size n = 48 are taken from the distribution of X. Determine the mean,
variance, standard deviation and approximate sampling distribution of the sample mean X̄.
b) Suppose that X is a random variable which is Poisson distributed with mean 5.
i) Write down the probability mass function of X. What are µ, σ² and σ for this distribution?
ii) Random samples of size n = 40 are taken from the distribution of X. Determine the mean,
variance, standard deviation and approximate sampling distribution of the sample mean X̄.
c) Suppose that X is a random variable which is Uniform on the interval [2, 5].
i) Write down the density function of X. What are µ, σ² and σ for this distribution?
ii) Random samples of size n = 15 are taken from the distribution of X. Determine the mean,
variance, standard deviation and approximate sampling distribution of the sample mean X̄.
Exercise 4*
The number of flaws X on an electroplated car grill is known to have the following probability mass
function:
x 0 1 2 3
p(x) 0.8 0.1 0.05 0.05
a) Calculate the mean and standard deviation of X.
b) What are the mean and the standard deviation of the sampling distribution of the average number
of flaws per grill in a random sample of 64 grills?
c) For a random sample of 64 grills, calculate (approximately) the probability that the average number
of flaws per grill exceeds 0.5.
Exercise 5*
a) A synthetic fibre used in manufacturing carpet has tensile strength that is normally distributed
with mean 75.5 psi and standard deviation 3.5 psi. Find the probability that a random sample of
n = 6 fibre specimens will have a sample mean tensile strength that exceeds 75.75 psi. How does
this probability change when the sample size is increased from n = 6 to n = 49?
b) The compressive strength of concrete is normally distributed with µ = 2500 psi and σ = 50 psi.
Find the probability that a random sample of n = 5 specimens will have a sample mean strength
that falls in the interval from 2499 to 2510 psi.
c) The amount of time that a customer spends waiting at an airport check-in counter is a random
variable with mean 8.2 minutes and standard deviation 1.5 minutes. Suppose that a random
sample of n = 49 customers is observed. Find the probability that the average waiting time for
these customers is
i) less than 10 minutes.
ii) between 7 and 10 minutes.
iii) less than 8.5 minutes.
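Several of the exercises above rest on the same calculation: when the sample mean is (exactly or approximately) normal with mean µ and standard deviation σ/√n, probabilities follow from normcdf after standardising. A minimal sketch is given below, with made-up values µ = 100, σ = 15 and n = 25 that do not come from any of the exercises.

```matlab
% Hypothetical population parameters and sample size (not from the exercises).
mu = 100;  sigma = 15;  n = 25;

% Standard deviation of the sample mean (the standard error).
se = sigma / sqrt(n);

% P(Xbar < 97): standardise, then use the standard normal cdf.
z = (97 - mu) / se;
pLess = normcdf(z)

% P(Xbar > 106) via the complement.
pMore = 1 - normcdf((106 - mu) / se)

% P(97 < Xbar < 106) as a difference of two cdf values;
% normcdf also accepts the mean and standard deviation directly.
pBetween = normcdf(106, mu, se) - normcdf(97, mu, se)
```

Increasing n shrinks the standard error, so the same cut-off values correspond to more extreme z-values.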
Exercise 6
a) Suppose we have a random sample X1 , X2 , . . . , X2n of size 2n from a population denoted by X, and
E(X) = µ and Var(X) = σ². Let
$$ \bar{X}_1 = \frac{1}{2n} \sum_{i=1}^{2n} X_i \quad \text{and} \quad \bar{X}_2 = \frac{1}{n} \sum_{i=1}^{n} X_i $$
be two estimators of µ.
i) Calculate E(X̄1 ) and E(X̄2 ). Is either estimator biased?
ii) Calculate Var(X̄1 ) and Var(X̄2 ).
iii) Which is the better estimator of µ? Explain.
b) Let X1 , X2 , . . . , Xn denote a random sample from a population having mean µ and variance σ².
Consider the following estimators of µ:
$$ \hat{\mu}_1 = \frac{X_1 + X_2 + \dots + X_7}{7}, \qquad \hat{\mu}_2 = \frac{2X_1 - X_6 + X_4}{2} $$
i) Is either estimator unbiased?
ii) Which estimator is best? In what sense is it best?
Exercise 7
Two different plasma etchers in a semiconductor factory have the same mean etch rate µ. However,
machine 1 is newer than machine 2 and consequently has smaller variability in etch rate. We know
that the variance of etch rate for machine 1 is σ1² and for machine 2 is σ2² = aσ1² (a > 1). Suppose that
we have n1 independent observations on etch rate from machine 1 and n2 independent observations
on etch rate from machine 2.
a) Show that µ̂ = αX̄1 + (1 − α)X̄2 is an unbiased estimator for µ for any value of α between 0 and 1.
b) Find the standard error of the point estimate of µ in part a).
c) What value of α would minimise the standard error of the point estimate of µ?
d) Suppose that a = 4 and n1 = 2n2 . What value of α would you select to minimise the standard
error of the point estimate of µ? How “bad” would it be to arbitrarily choose α = 0.5 in this case?
How “bad” would it be to only use the observations coming from machine 1?
End of the tutorial class for Week 5.
MATH2089
Numerical Methods and Statistics
There are no week 6 statistics classes. Please work through the exercises below in your
own time.
Instructions
After this tutorial you should be able to: use simulations in Matlab to explore the sampling distribution
of the sample mean and to illustrate the Central Limit Theorem, use Matlab to derive confidence intervals
of a given level, and check if the normal assumption is reasonable.
Browse the general ‘Introduction to Matlab’ document and the ‘Statistics Using Matlab’ (here referred to
as SUM) document to find extra information. Remember that you can also use the help command at any time
to get information about any Matlab command.
Do not just enter Matlab commands! Make sure you understand what you do and the output you obtain.
When you produce a graph, look at it! What does it tell you about the data? Ask your tutor if you have
questions. Explore the possibilities of Matlab. Go beyond what is actually required!
Solutions to these exercises will be available from the course web page at a later date.
In Matlab you can simulate random samples from known distributions using a -rnd command (rnd
for ‘random’). For instance, you can obtain a sample column vector of 5 values from the Normal
distribution N (12, 3), with mean 12 and standard deviation 3, by entering the command
normrnd(12,3,5,1). If you just enter normrnd(12,3,5), then you will get a 5 × 5 matrix of random
values from N (12, 3). To obtain samples from the Exponential distribution, use exprnd; from the
Uniform distribution, use unifrnd; from the Binomial distribution, use binornd; and from the
Poisson distribution, use poissrnd. Use help for more information. See also Section 5 of the SUM document.
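The -rnd commands listed above can be tried out with a short sketch such as the following. The seed and all parameter values other than those of normrnd(12,3,5,1) are arbitrary choices made for illustration, and the Statistics Toolbox is assumed to be available.

```matlab
% Sketch: drawing small random samples with the -rnd functions.
% All parameter values below (except 12, 3, 5) are illustrative assumptions.
rng(1);                      % fix the seed so that the run is reproducible

x = normrnd(12, 3, 5, 1);    % 5-by-1 column vector, mean 12, standard deviation 3
M = normrnd(12, 3, 5);       % a single size argument gives a 5-by-5 matrix

e = exprnd(2, 5, 1);         % Exponential distribution with mean 2
u = unifrnd(0, 1, 5, 1);     % Uniform distribution on [0, 1]
b = binornd(10, 0.3, 5, 1);  % Binomial distribution, 10 trials, success probability 0.3
p = poissrnd(4, 5, 1);       % Poisson distribution with mean 4

disp(size(x));               % size of the column vector
disp(size(M));               % size of the matrix
```

Each command takes the distribution parameters first and the requested number of rows and columns last, so the same pattern carries over from one distribution to the next.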
Exercise 1
Consider the applet mentioned in lectures:
http://onlinestatbook.com/stat_sim/sampling_dist/index.html
a) Load this applet and find the sampling distribution of X̄ for sample sizes of 5 and 25, from a few
different distributions (normal, uniform, skewed).
Comparing the distribution of the mean to the distribution of the original variable (“parent
population”) what happened to:
i) The mean
ii) The variance
iii) The distribution
b) Use the “Custom” setting to try to find a distribution for which the sampling distribution of X̄ is
noticeably asymmetrical, even when the sample size is n = 25 observations.
Exercise 2
In this question we consider lots of simulated random samples of size 100 from a Binomial distribu-
tion.
a) Simulate (use the Matlab command binornd) a matrix B with 100 rows and 500 columns con-
taining elements that are independent values drawn from the Binomial distribution with n = 1 and
π = 0.5 (that is, the Bernoulli distribution with parameter π = 0.5). You can think of each column
of B as being a random sample of size 100 from X ∼ Bern(0.5). With 500 columns in B, we have
500 independent random samples of size 100.
b) Use the Matlab mean function to compute the means of each of the 500 columns. You get a row
vector, call it meanB (say), of length 500, which consists of the means of 500 samples of size 100,
that is 500 values of the random variable $\bar{X} = \frac{1}{100}(X_1 + X_2 + \dots + X_{100})$.
Note that, when applied to a matrix, the Matlab function mean computes the mean of each column
and returns a vector.
c) From lectures, what do you expect the mean and variance of the sampling distribution of the
random variable X̄ to be? Calculate the sample mean and variance for meanB and compare with
your expectations.
d) From lectures, what do you expect the sampling distribution of the random variable X̄ to be?
Recall the Central Limit Theorem. Construct a density histogram of the simulated sample means.
Comment on the shape.
e) The histfit function can overlay a histogram with the probability density function of a fitted
distribution. To plot a histogram of meanB with 10 classes, overlaid with the fitted Normal density,
type histfit(meanB,10,'normal'); see help histfit for other distributions. Comment on how well
the Normal distribution fits the data.
f) Write an M-file (a .m script) to carry out some or all of this exercise.
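The workflow of this exercise (simulate a matrix of samples, take column means, compare with theory, overlay a Normal curve) can be sketched on a different population. The example below deliberately uses a Uniform(0, 1) population with samples of size 30 rather than the Bernoulli set-up of the exercise; the names U and meanU and all numerical values are assumptions made for illustration.

```matlab
% Sketch: sampling distribution of the mean for a Uniform(0,1) population.
% Sample size, number of samples and seed are illustrative assumptions.
rng(2);
U = unifrnd(0, 1, 30, 200);      % each column is one random sample of size 30
meanU = mean(U);                 % mean works column by column: 1-by-200 row vector

% Theory for U(0,1): E(Xbar) = 0.5 and Var(Xbar) = (1/12)/30
fprintf('mean of sample means:     %.4f (theory 0.5)\n', mean(meanU));
fprintf('variance of sample means: %.5f (theory %.5f)\n', var(meanU), (1/12)/30);

histfit(meanU, 10, 'normal');    % histogram with 10 classes and fitted Normal curve
xlabel('sample mean');
ylabel('frequency');
```

Saving these lines in a file such as clt_sketch.m (a hypothetical file name) and typing clt_sketch at the prompt runs them as a script.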
Exercise 3
The file kevlar90.txt contains times to failure (in hours) for a sample of size n = 101 of strands of
Kevlar 49/epoxy, a material used in the space shuttle, when tested at a stress level of 90%. Recover
the vector kevlar90, imported from the kevlar90.txt data set into Matlab in the first lab class.
a) i) Construct a density histogram for the kevlar90 data. Comment on the shape. Do you think that
a Normal distribution would fit the data well? Do you think that an Exponential distribution
would fit the data well?
iii) From i), we can assume that the Exponential distribution is the population distribution, and
that the sample is a random sample drawn from it. Recall that the expectation of the
Exponential(µ)-distribution is µ, which can thus be regarded as the population mean. We
can estimate the ‘true’ Exponential distribution Exp(µ) by the Exponential distribution with
parameter x̄: we just replace in the distribution the unknown value of the mean (µ) by its
natural estimate, the sample mean x̄ (→ µ̂ = x̄). This simple way of fitting a distribution is
known as the method of moments. Write down the fitted Exponential density for the Kevlar
failure times. Given that the standard deviation of the Exp(µ) density is also µ, state the
standard deviation of a random variable with the fitted Exponential density. Compare it with the
sample standard deviation of the data set kevlar90.
iv) Use the histfit function to overlay the histogram of the Kevlar data with the fitted Expo-
nential density and briefly comment on how well the exponential density fits the data.
b) Next we examine the sampling distribution of the mean failure time for a sample of 10 strands of
Kevlar.
i) Use the Matlab command exprnd to simulate a matrix T with 10 rows and 1000 columns
containing independent elements that are Exponentially distributed with mean equal to the
sample mean of the Kevlar failure times. We can think of each column of T as being failure
times for a sample of 10 strands of Kevlar similar to those in the dataset. With 1000 columns
in T, we have 1000 random samples of size 10. Each of them could have been the observed
sample in kevlar90.txt! Note that here the population distribution which generated
the samples is Exp(µ̂), with µ̂ found in a)iii).
ii) Use the mean function to compute the means of the 1000 columns. Thus, you get a row vector,
called meanT (say), of length 1000, which consists of the means of 1000 samples of size 10, that
is 1000 values of the random variable $\bar{X} = \frac{1}{10}(X_1 + X_2 + \dots + X_{10})$.
iii) From lectures, what do you expect the mean and variance of the sampling distribution of the
random variable X̄ to be? Calculate the sample mean and variance for meanT and compare
with your expectations.
iv) From lectures, what do you expect the sampling distribution of the random variable X̄ to be?
Recall the Central Limit Theorem. Construct a density histogram of the simulated sample
means. Comment on the shape.
v) Use the histfit function to overlay the expected large sample density on the histogram and
briefly comment on the level of similarity.
c) Repeat part b) but this time generating 1000 random samples of size n = 500.
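The method of moments fit described in part a) can be sketched on simulated data. The vector t below is a hypothetical set of failure times with true mean 50, not the Kevlar data, and the seed, sample size and number of classes are arbitrary assumptions.

```matlab
% Sketch: method-of-moments fit of an Exponential distribution.
% The data are simulated (hypothetical), not taken from kevlar90.txt.
rng(3);
t = exprnd(50, 200, 1);          % hypothetical failure times, true mean 50

muhat = mean(t);                 % method of moments: replace mu by the sample mean
fprintf('fitted mean (and fitted sd): %.2f\n', muhat);
fprintf('sample standard deviation:   %.2f\n', std(t));

% Fitted density f(x) = (1/muhat) * exp(-x/muhat), evaluated on a grid
xg = linspace(0, max(t), 100);
fhat = exppdf(xg, muhat);

histogram(t, 15, 'Normalization', 'pdf');   % density histogram with 15 classes
hold on
plot(xg, fhat, 'LineWidth', 1.5);           % overlay the fitted Exponential density
hold off
xlabel('failure time');
ylabel('density');
```

The command histfit(t, 15, 'exponential') gives a similar picture in a single call, with the histogram on a frequency scale rather than a density scale.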
Exercise 4
The file shearstrength.txt contains 100 observations on shear strength (in lb) of ultrasonic spot
welds made on a certain type of alclad sheet. These data are consistent with those reported in “Comparison
of Properties of Joints Prepared by Ultrasonic Welding and Other Means”, Journal of Aircraft, 1983:
552-556. Recover the vector shearstrength, imported from the shearstrength.txt data set into Matlab
in the first lab class.
a) Construct a density histogram of the shear strengths. Comment on the shape of the distribution.
b) Is it plausible that the sample was selected from a normal distribution? Draw a normal quantile plot
and conclude. Note: in Matlab, a normal quantile plot is produced with the command
qqplot. Use help qqplot for more information.
c) Investigators believe that the population standard deviation of shear strength in this context is
σ = 350 lb. Estimate µ, the true mean shear strength and give a 95% two-sided confidence interval
for it. Interpret your result. Note: in Matlab, this type of ‘z-confidence interval’ (see Slide 19,
Week 8 of the lecture slides) is obtained with the command ztest. This is actually a command used
for hypothesis testing (which we have not yet studied), but is also used for determining confidence
intervals. For instance, to calculate the 95% CI for a population mean from a sample contained in
the vector data, use the command
[h, p, ci] = ztest(data, xbar, sigma); ci
where xbar is the sample mean and sigma is the supposedly known population standard deviation
σ. The default option is for a 95% confidence interval. For any other confidence level 100 × (1 − α)%,
you need to enter α as a fourth parameter alpha. The outputs h and p refer to the underlying hypothesis
test and are not relevant for us at this point; that is why we focus only on the output ci here.
Type help ztest for more information. See also the SUM document, Section 6.1.
d) Now, suppose σ is not known. Give a 95% two-sided confidence interval for µ. Interpret your
result. Compare with the previous confidence interval, and explain why they are different. Note: in
Matlab, this type of ‘t-confidence interval’ (see Slide 19, Week 8 of the lecture slides) is obtained
with the command ttest. Again, this is actually a command used for hypothesis testing but
it is also used for determining confidence intervals. For instance, to calculate the 95% CI for a
population mean from a sample contained in the vector data, use the command
[h, p, ci] = ttest(data, xbar); ci
where xbar is the sample mean. The default option is for a 95% confidence interval. For any other
confidence level 100 × (1 − α)%, you need to enter α as a third parameter alpha. Type help ttest for
more information. See also the SUM document, Section 6.1.
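The use of ztest and ttest for confidence intervals, as described in parts c) and d), can be sketched on simulated data. The sample below is hypothetical (it is not the shear strength data), the value of sigma is an assumption, and the 'Alpha' name-value form is used here as an alternative to the positional alpha argument.

```matlab
% Sketch: confidence intervals from ztest and ttest on a hypothetical sample.
rng(4);
data  = normrnd(100, 15, 40, 1);   % hypothetical sample of size 40
sigma = 15;                        % population sd, assumed known for the z-interval
xbar  = mean(data);

[~, ~, ciz]   = ztest(data, xbar, sigma);                 % 95% z-interval (default)
[~, ~, cit]   = ttest(data, xbar);                        % 95% t-interval (default)
[~, ~, ciz90] = ztest(data, xbar, sigma, 'Alpha', 0.10);  % 90% z-interval

% Cross-check the z-interval against xbar -/+ z_{0.975} * sigma / sqrt(n)
n = numel(data);
zman = xbar + [-1; 1] * norminv(0.975) * sigma / sqrt(n);

disp([ciz(:), zman]);    % the two columns should agree
disp(cit(:));            % t-interval, based on the sample standard deviation
disp(ciz90(:));          % narrower than the 95% z-interval
```

Passing xbar as the hypothesised mean only affects the outputs h and p, which are ignored here; the interval itself does not depend on it.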
Exercise 5
In this exercise, we will use simulations to drive home the proper interpretation of confidence intervals.
a) Simulate (use the Matlab command normrnd) a matrix C with 36 rows and 500 columns containing
elements that are independent values drawn from the Normal distribution with µ = 20 and σ = 5.
You can think of each column of C as being a random sample of size 36 from X ∼ N (20, 5). With
500 columns in C, we have 500 independent random samples of size 36.
b) Calculate 95% z- and t-confidence intervals for µ from each sample. Recall that the z-confidence
interval of level 100 × (1 − α)% is given by
$$\bar{x} \pm z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}}$$
where σ is assumed to be known, and the t-confidence interval of level 100 × (1 − α)% is
$$\bar{x} \pm t_{n-1;\,1-\alpha/2} \frac{s}{\sqrt{n}}$$
where s is the sample standard deviation.
i) Use the Matlab mean function to compute the means of each of the 500 samples. You get a
row vector, call it meanC (say), of length 500, which consists of the 500 sample means.
Note that, when applied to a matrix, the Matlab function mean computes the mean of each
column and returns a vector.
ii) Use the Matlab std function to compute the standard deviations of each of the 500 samples.
You get a row vector, call it sC (say), of length 500, which consists of the 500 sample standard
deviations.
(Like mean, when applied to a matrix, std computes the standard deviation of each column
and returns a vector.)
iii) From the vector meanC, build the vector upz of upper bounds for the z-CI for each sample.
Recall that $z_{0.975} = 1.96$ and that here, n = 36 and σ = 5 (if assumed to be known).
iv) From the vector meanC, build the vector lowz of lower bounds for the z-CI for each sample.
v) From the vectors meanC and sC, build the vector upt of upper bounds for the t-CI for each
sample. Find the value $t_{35;\,0.975}$ in Matlab (hint: use the tinv function).
vi) From the vectors meanC and sC, build the vector lowt of lower bounds for the t-CI for each
sample.
c) You now have 500 z-confidence intervals and 500 t-confidence intervals for µ computed from 500
independent samples. Compute how many of the 500 z-confidence intervals and how many of the
500 t-confidence intervals contain the true population mean µ = 20. Comment.
d) If you were to repeat this exercise over and over, on average, what fraction of the confidence intervals
you calculate would you expect to contain the population mean?
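The two interval formulas of part b) can be evaluated directly, for several samples at once, as in the sketch below. It uses only three hypothetical samples of size 16 from a Normal population with mean 50 and standard deviation 8; these values, the seed and the variable names are assumptions chosen so as not to duplicate the exercise.

```matlab
% Sketch: z- and t-confidence intervals computed from the formulas,
% for three hypothetical samples stored as the columns of S.
rng(5);
mu = 50; sigma = 8; n = 16;          % hypothetical population and sample size
S = normrnd(mu, sigma, n, 3);        % one sample per column

xbar = mean(S);                      % 1-by-3 vector of sample means
s    = std(S);                       % 1-by-3 vector of sample standard deviations

z = norminv(0.975);                  % z_{0.975}, approximately 1.96
t = tinv(0.975, n - 1);              % t_{n-1; 0.975}

lowz = xbar - z * sigma / sqrt(n);   % lower bounds of the z-intervals
upz  = xbar + z * sigma / sqrt(n);   % upper bounds of the z-intervals
lowt = xbar - t * s / sqrt(n);       % lower bounds of the t-intervals
upt  = xbar + t * s / sqrt(n);       % upper bounds of the t-intervals

coversZ = (lowz <= mu) & (mu <= upz);   % does each z-interval contain mu?
coversT = (lowt <= mu) & (mu <= upt);   % does each t-interval contain mu?

disp([lowz; upz]);
disp([lowt; upt]);
disp([coversZ; coversT]);
```

Because coversZ and coversT are logical vectors, sum(coversZ) counts the intervals that contain µ and mean(coversZ) gives the corresponding proportion.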
Exercise 6
For each of 18 preserved cores from oil-wet carbonate reservoirs, the amount of residual gas
saturation after a solvent injection was measured at water flood-out. The data can be found in the
file porevolume.txt, and are from “Relative Permeability Studies of Gas Water Flow Following
Solvent Injection in Carbonate Rocks”, Soc. Petroleum Engineers J., 1976: 23-30. Recover the vector
porevolume, imported from the porevolume.txt data set into Matlab in the first lab class.
a) Is it plausible that the sample was selected from a normal population distribution? Show a density
histogram and a normal quantile plot and conclude.
b) Determine a 98% CI for the true average amount of residual gas saturation.
c) Determine a 95% CI for the true average amount of residual gas saturation. What is the difference
between a 95% CI and a 98% CI?
Statistics Tutorial Class Week 7 : Confidence intervals about means and proportions
Instructions
The exercises that will be covered during the tutorial class will be chosen from those printed here.
Note that not every topic in the course can be covered by the tutorial exercises. You are expected
to take an active part in the learning process, by working by yourself on most of the topics.
In particular, you are expected to attempt these tutorial questions beforehand. During the tutorial,
the tutor will go through the answers to some of the questions, directing explanation to areas where students
indicate they have difficulty.
Solutions to these exercises will be available from the course web page at a later date.
Exercise 1
An article in Journal of Agricultural Science (1997) reported temperatures at Harper Adams
Agricultural College in the UK during June (an important time during the growing season) between
1982 and 1993. The June temperatures recorded were as follows:
15.2 14.2 14.0 12.2 14.4 12.5 14.3 14.2 13.5 11.8 15.2
with summary statistics x̄ = 13.77, s = 1.15 (to two decimal places).
a) Construct a 99% two-sided confidence interval on the mean temperature.
b) What assumptions did you make to construct this interval? What can you do to check that this is
reasonable?
c) Construct a 95% lower-confidence bound on the mean temperature (that is, a one-sided confidence
interval).
d) Suppose that we wanted to be 95% confident that the error in estimating the mean temperature is
less than 0.4°C. What sample size should be used?
Exercise 2
The wall thickness of 25 glass 2-litre bottles was measured by a quality-control engineer. The sample
mean was x̄ = 4.05 millimetres, and the sample standard deviation was s = 0.08 millimetres. Find a
95% lower confidence bound for mean wall thickness. Interpret the interval you have obtained. Assume
the normal distribution for wall thickness.
Exercise 3
While performing a certain task under simulated weightlessness, the pulse rate of 6 astronaut trainees
increased by an average of 26.4 beats per minute with a standard deviation of 14.28 beats per minute.
a) Construct a two-sided 95% confidence interval for the true average increase in the pulse rate of
astronaut trainees performing the given task.
b) What can one assert with 95% confidence about the maximum error if x̄ = 26.4 is used as a point
estimate of the true average increase in the pulse rate?
c) What did you assume in order to do the above calculations? What could be done to check if these
assumptions are reasonable?
Exercise 4*
Consider the following sample of fat content (in percentage) of n = 10 randomly selected hot dog
sausages of a given brand (“Sensory and Mechanical Assessment of the Quality of Frankfurters”,
Journal of Texture Studies, 1990):
25.2 21.3 22.8 17.0 29.8 21.0 25.5 16.0 20.9 19.5
Assume that the fat content follows a normal population.
a) Find a 95% confidence interval for the true mean fat content of the sausages of that brand.
b) Suppose, however, you are going to eat a single hot dog of this type and want a prediction for the
resulting fat content. Find a 95% prediction interval for the fat content of the hot dog sausage you
will eat.
Exercise 5
The article “Repeatability and Reproducibility for Pass/Fail data” (J. of Testing and Eval., 1997,
151-153) reported that in n = 48 trials in a particular laboratory, 16 resulted in ignition of a particular
type of substrate by a lighted cigarette. Find a 95% approximate confidence interval for π, the true
long-run proportion (or probability) of all such trials that would result in ignition.
Browse the general ‘Introduction to Matlab’ document and the ‘Statistics Using Matlab’ (here referred to
as SUM) document to find extra information. Remember that you can also use the help command at any time
to get information about any Matlab command.
Do not just enter Matlab commands! Make sure you understand what you do and the output you obtain.
When you produce a graph, look at it! What does it tell you about the data? Ask your tutor if you have
questions. Explore the possibilities of Matlab. Go beyond what is actually required!
Solutions to these exercises will be available from the course web page at a later date.
Exercise 1*
a) In a hypothesis test of H0 with alternative Ha , the p-value is calculated to be 0.008. What would
be the conclusion of the hypothesis test?
b) In a hypothesis test of H0 with alternative Ha , the p-value is calculated to be 0.08. What would
be the conclusion of the hypothesis test?
c) In a hypothesis test of H0 with alternative Ha , the p-value is calculated to be 0.8. What would be
the conclusion of the hypothesis test?
Exercise 2
Consider again the astronaut exercise (Exercise 3) from last week’s tutorial. The change in heart
rate (in beats per minute) is as follows:
15 32.5 49 26 8 28
a) Is there evidence that astronaut pulse rate has in fact increased under simulated weightlessness?
Use a hypothesis test to answer this question, via Matlab’s ttest function.
b) Is this consistent with your results from last week?
c) Produce a normal quantile plot of the data. How reasonable is the normality assumption for these
data?
Exercise 3
In the book ‘Statistical quality design and control: Contemporary concepts and methods’, DeVor,
Chang, and Sutherland (1992) discuss a cylinder boring process for an engine block. Specifications
require that these bores be 3.5199 ± 0.0004 inches. Management is concerned that the true proportion
of cylinder bores outside the specifications is excessive. Current practice is willing to tolerate up to
10% outside the specifications. Out of a random sample of 165 observations, 36 were outside the
specifications.
a) Conduct the most appropriate hypothesis test using a 0.01 significance level.
b) Construct a 99% confidence interval for the true proportion of bores outside the specification.
c) What assumptions did you need to make to carry out this hypothesis test, and determine the
confidence interval? Can you check these assumptions? If so, how?
Exercise 4
Recall from last week that the article “Repeatability and Reproducibility for Pass/Fail data” (J. of
Testing and Eval., 1997, 151-153) reported that in n = 48 trials in a particular laboratory, 16 resulted
in ignition of a particular type of substrate by a lighted cigarette. The experimenters were hoping for
at least a 50% ignition rate.
a) State the appropriate null and alternative hypotheses, making sure you define all terms.
b) What is the sample proportion?
c) What is the rejection region for the sample proportion, at the 0.05 significance level?
d) Is there evidence that the true ignition rate is less than 50%?
e) Recall that an important measure of the effectiveness of a test procedure is its power, given by
power = 1 − β = P(reject H0 when it is false).
What is the power of this test, if the true ignition rate were actually 40%? Answer this question
by simulation, as follows:
• Use the binornd function to generate 500 simulated values for the number of trials reporting
ignition, out of 48, when the chance of ignition is actually 40%.
• Calculate the sample proportion that ignited for each of these simulated experiments.
• Calculate the proportion of times the sample proportion falls in the rejection region.
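The simulation steps above can be sketched for a different, hypothetical test: H0: π = 0.5 against Ha: π < 0.5 with n = 30 trials, a true success probability of 0.3, and a rejection region obtained from the Normal approximation. All of these values, and the use of the Normal approximation for the critical value, are assumptions made for illustration.

```matlab
% Sketch: estimating the power of a one-sided test on a proportion by simulation.
% n, p0, ptrue, alpha and nsim are hypothetical values.
rng(6);
n = 30; p0 = 0.5; ptrue = 0.3; alpha = 0.05;
nsim = 500;

% Rejection region (Normal approximation): reject H0 if phat < crit
crit = p0 - norminv(1 - alpha) * sqrt(p0 * (1 - p0) / n);

counts = binornd(n, ptrue, nsim, 1);   % simulated numbers of successes out of n
phat   = counts / n;                   % simulated sample proportions
powerEst = mean(phat < crit);          % proportion falling in the rejection region

fprintf('critical value:  %.4f\n', crit);
fprintf('estimated power: %.3f\n', powerEst);
```

The estimate changes slightly from run to run when the seed is changed; increasing nsim reduces this simulation error.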
Exercise 5
In the article ‘Estimating the current mean of a process subject to abrupt changes’ (Technometrics,
37, 311-323), Yashchin (1995) discusses a process for the chemical etching of silicon wafers used in
integrated circuits. This company wishes to detect an increase in the thickness of the silicon oxide
layers because thicker layers require longer etching times. Process specifications state a target value
of 1 micron for the true mean thickness. You may assume that layer thickness is normally distributed.
A recent random sample of ten wafers (in microns) yielded the following values:
0.940, 0.985, 0.986, 0.936, 0.999, 1.089, 0.949, 1.007, 0.947, 0.963
a) Use Matlab’s ttest function to:
i) Conduct a hypothesis test to determine whether the true mean thickness differs from the target
value of 1 micron. Use a significance level of 0.05.
ii) Construct a 95% confidence interval for the true current mean µ. Use this interval to determine
whether the mean thickness has changed from its target value.
b) Discuss the relationship of the 95% confidence interval and the corresponding hypothesis test.
c) Where possible, check the assumptions you made.
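The link between a two-sided t-test and the matching confidence interval can be seen in a short sketch. The data below are simulated and hypothetical (not the wafer measurements), and m0 = 5 is an arbitrary hypothesised mean.

```matlab
% Sketch: one-sample t-test and confidence interval from a single ttest call.
% The sample and the hypothesised mean are hypothetical.
rng(7);
x  = normrnd(5.2, 0.4, 12, 1);   % hypothetical measurements, n = 12
m0 = 5;                          % hypothesised mean under H0

[h, p, ci, stats] = ttest(x, m0);    % two-sided test at the default level 0.05

fprintf('t = %.3f, df = %d, p = %.4f, h = %d\n', stats.tstat, stats.df, p, h);
fprintf('95%% CI for the mean: [%.3f, %.3f]\n', ci(1), ci(2));

% The test rejects H0 (h = 1) exactly when m0 lies outside the 95% interval
inside = (ci(1) <= m0) && (m0 <= ci(2));
fprintf('m0 inside the interval: %d\n', inside);
```

A one-sided alternative can be requested with the 'Tail' option, for example ttest(x, m0, 'Tail', 'right'), in which case ci becomes a one-sided bound.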
Instructions
The exercises that will be gone over during the tutorial class will be chosen from those printed here.
Note that not every topic in the course can be covered by the tutorial exercises. You are expected
to take an active part in the learning process, by working by yourself on most of the topics.
In particular, you are expected to attempt these tutorial questions beforehand. At the tutorial,
the tutor will go through the answers to some of the questions, directing explanation to areas where students
indicate they have difficulty.
Solutions to these exercises will be available from the course web page at a later date.
Exercise 1*
The paper “Selection of a Method to Determine Residual Chlorine in Sewage Effluents” (Water and
Sewage Works, 1971, pp. 360-364) reports the results of an experiment in which two different methods
for determining chlorine content were used on specimens of Cl2 -demand-free water for various doses
and contact times. Observations are in mg/l.
Specimen       1     2     3     4     5     6     7      8
MSI method   0.39  0.84  1.76  3.35  4.69  7.70  10.52  10.92
SIB method   0.36  1.35  2.56  3.92  5.35  8.33  10.70  10.91
The sample mean of the differences (MSI-SIB) between readings for these 8 samples is −0.4137 and
the corresponding sample standard deviation is 0.3210.
a) Construct a 99% confidence interval for the true average of the difference of residual chlorine
readings between the two methods (assume normality for the difference distribution).
b) Test whether the mean difference is zero at the 5% level.
c) What assumptions did you need to make to carry out this hypothesis test and to determine the
confidence interval? Can you check any or all of these assumptions? If so, how?
Exercise 2
Let µ1 and µ2 denote true average tread lives for two competing brands of size P205/65R15 radial
tyres. From samples of the two brands we have the data:
n1 = 45, x̄1 = 42,500, s1 = 2200, n2 = 45, x̄2 = 40,400 and s2 = 1500.
That’s it for classroom exercises. You should try to do the remaining laboratory exercises in your
own time – note they are assessable in quiz 3!
Exercise 4
The file fusiondat.txt, available from Moodle, contains results from an experiment in visual
perception using random dot stereograms. Two images appear to be composed entirely of random
dots, but the images will fuse and a 3D object will appear when they are viewed in a certain way.
An experiment was performed to determine whether knowledge of the form of the embedded image
affected the time required for subjects to fuse the images. One group of subjects received either no
information or just verbal information about the shape of the embedded object. A second group
received both verbal information and visual information (e.g., a drawing of the object).
Import this dataset, which contains two columns of data, into Matlab. The variables in the file are the
group the subject belongs to (1=no information or just verbal information, 2=verbal information and
visual information), and the log of the time taken to fuse the images.
a) Create comparative boxplots to compare the log time taken to fuse the images for the two groups
of subjects. Do they appear to be different?
b) What method would you use to analyse the data? What does it assume? Where possible, check
these assumptions are reasonable.
c) Use Matlab to perform an appropriate test to assess whether log times differ by groups. Carry
out the test using a significance level of α = 0.05. Include the statements of H0 and Ha , rejection
criterion, observed value of the test statistic, p-value, and your conclusion in plain language.
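One possible workflow for comparing two independent groups is sketched below on simulated data. The group labels g and responses y are hypothetical (they are not read from fusiondat.txt), and the pooled-variance two-sample t-test is only one candidate method; whether it is appropriate is exactly what part b) asks you to consider.

```matlab
% Sketch: comparative boxplots and a two-sample t-test on hypothetical data.
rng(8);
g = [ones(20, 1); 2 * ones(25, 1)];                          % hypothetical group labels
y = [normrnd(2.0, 0.8, 20, 1); normrnd(1.6, 0.8, 25, 1)];    % hypothetical responses

boxplot(y, g, 'Labels', {'group 1', 'group 2'});   % one box per group
ylabel('response');

y1 = y(g == 1);                  % responses in group 1
y2 = y(g == 2);                  % responses in group 2

[h, p, ci, stats] = ttest2(y1, y2);   % pooled-variance test of equal means
fprintf('t = %.3f, df = %d, p = %.4f, h = %d\n', stats.tstat, stats.df, p, h);
fprintf('95%% CI for the difference in means: [%.3f, %.3f]\n', ci(1), ci(2));
```

If the two groups seem to have clearly different spreads, ttest2(y1, y2, 'Vartype', 'unequal') drops the equal-variance assumption.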
Exercise 5
The file abs.txt contains data on internal deadband impedances for front and rear sensors from an
antilock braking system (ABS). Each row of the dataset corresponds to the same unit, with the first
measurement being the front sensor impedance and the second the rear sensor impedance (in kohms).
Recover the matrix abs, imported from the abs.txt data set into Matlab in the first lab class.
a) Which formula do you think is appropriate for constructing a confidence interval for the difference
in mean impedance for the front and rear sensors? Why?
b) Use Matlab’s ttest function to construct a 95% confidence interval for the difference in mean
impedance for front and rear sensors. Does location of sensor appear to affect impedance?
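When two measurements are taken on the same unit, ttest can be called with both vectors, which amounts to a one-sample analysis of the differences. The sketch below assumes this paired formulation and uses simulated, hypothetical values for the vectors front and rear rather than the data in abs.txt.

```matlab
% Sketch: paired t-interval for a mean difference, on hypothetical paired data.
rng(9);
front = normrnd(3.8, 0.3, 15, 1);             % hypothetical first measurement per unit
rear  = front + normrnd(0.1, 0.2, 15, 1);     % hypothetical second measurement per unit

[h, p, ci]    = ttest(front, rear);     % paired analysis of front - rear
[h2, p2, ci2] = ttest(front - rear);    % same analysis applied to the differences

disp([ci(:), ci2(:)]);                  % the two intervals coincide
fprintf('p-values: %.4f and %.4f\n', p, p2);
```

With a matrix A holding the two measurements in its columns, the same call would read ttest(A(:,1), A(:,2)); A is a hypothetical name used here to avoid shadowing Matlab's built-in abs function.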
Instructions
The exercises that will be covered during the laboratory class will be chosen from those printed here.
Note that not every topic in the course can be covered by the laboratory exercises. You are
expected to take an active part in the learning process, by working by yourself on most of the topics.
In particular, you are expected to attempt these laboratory questions beforehand. At the tutorial,
the tutor will go through the answers to some of the questions, directing explanation to areas where students
indicate they have difficulty.
Solutions to these exercises will be available from the course web page later.
a) Recover the matrix milk, imported from the milk.txt data set into Matlab in the first lab class.
The first column gives the milk production (in kg/day) for a cow and the second column the
corresponding milk protein production (in kg/day). Define two vectors x and y as the two columns
of the matrix milk.
b) Produce a scatter-plot of protein content (y) versus milk production (x), and comment on the
shape of the relationship between the two variables. Label appropriately. Write an equation for
the regression model that you would like to fit to these data.
c) Using the Matlab command lsline, add the least squares regression line to the plot. Enter help
lsline if you need more information on this command.
d) Use the Matlab command fitlm to fit a linear regression model
Y = β0 + β1 X + ε
to the data. Here, the response Y is the protein content and the predictor X is the milk production.
e) Write down the equation of the fitted regression line.
f) Does the simple regression model specify a useful relationship between production and protein?
Test at level α = 0.05 the significance of the predictor X in the model and draw a conclusion.
g) Estimate the true average protein content for all cows whose production is 30 kg/day. Use a
confidence interval of 99%.
h) Calculate a 99% prediction interval for the protein from a single cow whose production is 30
kg/day.
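The commands named in parts b) to h) fit together as in the following sketch. The vectors x and y are simulated, hypothetical data (not the contents of milk.txt), and the coefficients used to generate them are arbitrary assumptions.

```matlab
% Sketch: simple linear regression with fitlm, lsline and predict,
% on hypothetical simulated data.
rng(10);
x = unifrnd(15, 45, 40, 1);                      % hypothetical predictor values
y = 0.1 + 0.03 * x + normrnd(0, 0.05, 40, 1);    % hypothetical responses

scatter(x, y);
xlabel('x');
ylabel('y');
lsline;                          % add the least squares line to the scatter plot

mdl = fitlm(x, y);               % fits y = b0 + b1*x + error
disp(mdl.Coefficients);          % estimates, standard errors, t statistics, p-values

% 99% confidence interval for the mean response at x = 30
[ymean, ciMean] = predict(mdl, 30, 'Alpha', 0.01);

% 99% prediction interval for a single new observation at x = 30
[ynew, piNew] = predict(mdl, 30, 'Alpha', 0.01, 'Prediction', 'observation');

disp(ciMean);
disp(piNew);
```

The prediction interval is always wider than the confidence interval at the same x, because it also accounts for the variability of an individual observation around the regression line.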
d) Using a 5% significance level, test the hypothesis in b) and state your conclusion.
e) For this ANOVA plot the residuals versus fitted values and a normal quantile plot of residuals and
comment.
f) How would your previous conclusion change if Batch was omitted from the model?
That’s it for laboratory exercises. You should try to do the following tutorial questions in your own
time.
Exercise 4*
Mist (airborne droplets or aerosols) is generated when metal-removing fluids are used in machining
operations to cool and lubricate the tool and work-piece. Mist generation is a concern of the OSHA
(Occupational Safety and Health Administration), which has recently lowered the allowable mist gen-
eration standard for the workplace by a substantial amount. In a 2002 article in the Lubrication
Engineering journal the following data was provided on :
[Figure: plot of extent of mist (vertical axis) against fluid velocity (horizontal axis)]
Estimated Coefficients:
Estimate SE tStat pValue
__________ __________ ______ __________
% prediction interval
>> [ypred,yci] = predict(MyFit,500,'Prediction','observation'); yci
yci =
0.5639 0.8654
Exercise 5
An experiment was carried out to compare the flow rates of 6 different types of nozzles, types A, B,
C, D, E and F. ANOVA calculations yielded an observed F value f0 = 4.2 from 66 observations. State
H0 and Ha for the analysis of variance, and carry out the hypothesis test at level α = 0.05, giving the
range of values for the p-value which can be determined from the tables, and stating your conclusion
in plain language. What assumptions need to be made for an ANOVA to be an appropriate test? How
could you check if they are plausible assumptions?
Exercise 6
An experiment was carried out to compare the yield of 4 different crops. Random samples were
taken of size 12 from each crop. The observed msEr was 8.2 and the observed value of the test statistic
was f0 = 2.8.
a) Draw up and complete the ANOVA table.
b) Carry out the test using a significance level of α = 0.01. Include the statements of H0 and
Ha , rejection criterion, observed value of the test statistic, p-value, and your conclusion in plain
language.
Exercise 7
Having a pet may reduce the owner’s stress level. To examine this effect, researchers recruited 45
women who said they were dog lovers. The subjects were randomly assigned to three groups of fifteen
women. One group did a stressful task alone; one group with a good friend present; and one group
with their dog present. The heart rate during the task is one measure of the effect of stress. Heart
rates (beats per minute) during stress were recorded for these 45 women. Below you see the computer
ANOVA output for these data.
>> stressMod=fitlm(Group,Rate,'CategoricalVars',[1]);
>> anova(stressMod)
ans =
Source       df    SS          MS           F
Treatment    A     17470016    B            E
Error        27    D           874482.48
Total        C     41081016