
5 Describing Populations

In this chapter we describe populations and samples using the language of probability.
5.1 Population
The range of possible values will be called the population.
We define a random variable to be a random number drawn from a population.
We define the distribution of a random variable to be a description of its range and the
probabilities of the values in that range.
5.1.1 Discrete random variables
For each k in the range, $P(X=k) > 0$ and $P(X=k) \le 1$. Furthermore, as X has some value, we have $\sum_k P(X=k) = 1$.
#spike plot
k = 0:4
p=c(1,2,3,2,1)/9
plot(k,p,type="h",xlab="k", ylab="probability",ylim=c(0,max(p))) # argument type="h" for spike plot
points(k,p,pch=16,cex=2) # add the balls to top of spike

#Using sample() to generate random values


k = 0:2
p = c(1,2,1)/4
sample(k,size=1,prob=p)
sample(k,size=1,prob=p)

sample(1:6,size=1) + sample(1:6, size=1)


sample(1:6, size=1)+sample(1:6, size=1)
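As a rough check on the dice example, the sketch below repeats the two-dice roll many times and compares the observed proportion of each sum with its exact probability. The names n.rolls, sums, observed, and exact are illustrative, the number of rolls is an arbitrary choice, and the seed is fixed only so the run is reproducible.

```r
set.seed(1)                      # illustrative seed, for reproducibility only
n.rolls = 10000                  # number of simulated rolls (arbitrary choice)
sums = sample(1:6, size=n.rolls, replace=TRUE) +
       sample(1:6, size=n.rolls, replace=TRUE)
# observed proportion of each possible sum, 2 through 12
observed = as.vector(table(factor(sums, levels=2:12))) / n.rolls
# exact probabilities: 1/36, 2/36, ..., 6/36, ..., 2/36, 1/36
exact = c(1:6, 5:1) / 36
names(observed) = names(exact) = 2:12
round(rbind(observed, exact), 3)
```

The two rows should agree closely, with the largest probability at a sum of 7.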

#The mean and standard deviation of a discrete random variable


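population mean: $\mu = E(X) = \sum_k k\,P(X=k)$

population variance: $\sigma^2 = E((X-\mu)^2) = \sum_k (k-\mu)^2\,P(X=k)$; the population standard deviation is $\sigma = \sqrt{\sigma^2}$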
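For a discrete random variable, the population mean is the probability-weighted average of the values, and the population variance is the probability-weighted average of the squared deviations from that mean. A minimal sketch, reusing the k and p from the spike plot example (the names mu, sigma2, and sigma are illustrative):

```r
k = 0:4
p = c(1,2,3,2,1)/9
mu = sum(k * p)                  # population mean
sigma2 = sum((k - mu)^2 * p)     # population variance
sigma = sqrt(sigma2)             # population standard deviation
c(mean=mu, variance=sigma2, sd=sigma)
```

Because this distribution is symmetric about 2, the mean comes out at the middle value.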

5.1.2 Continuous random variables


The p.d.f. and c.d.f.
The mean and standard deviation of a continuous random variable
Quantiles of a continuous random variable
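The relationships among the p.d.f., the c.d.f., the mean, the standard deviation, and the quantiles can be illustrated numerically. The sketch below uses a hypothetical density, f(x) = 2x on [0, 1], which is not from the text, together with R's integrate() and uniroot() functions. The names f, cdf, total, mu, sigma, and q.med are illustrative.

```r
# hypothetical p.d.f. (not from the text): f(x) = 2x on [0, 1], 0 elsewhere
f = function(x) ifelse(x >= 0 & x <= 1, 2*x, 0)
# c.d.f. at b (for 0 <= b <= 1): area under f to the left of b
cdf = function(b) integrate(f, lower=0, upper=b)$value
total = integrate(f, lower=0, upper=1)$value                   # total area under f
mu = integrate(function(x) x * f(x), lower=0, upper=1)$value   # mean
sigma = sqrt(integrate(function(x) (x - mu)^2 * f(x),
                       lower=0, upper=1)$value)                # standard deviation
# median: the point b where cdf(b) = 1/2, found numerically
q.med = uniroot(function(b) cdf(b) - 1/2, interval=c(0.01, 1), tol=1e-8)$root
c(total=total, mean=mu, sd=sigma, median=q.med)
```

For this density the c.d.f. is b^2 on [0, 1], so the median should be close to sqrt(1/2) and the mean close to 2/3.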
5.1.3 Sampling from a population
A sequence that is both independent and identically distributed is called an i.i.d.
sequence, or a random sample.
Random samples generated by sample()
## toss a coin 10 times. Heads=1, tails=0
sample(0:1,size=10,replace=TRUE)
sample(1:6,size=10,replace=TRUE) ## roll a die 10 times
## sum of dice roll 10 times
sample(1:6, size=10,replace=TRUE) + sample(1: 6,size=10,replace=TRUE)

#a random sample can also be produced by specifying the probabilities using prob=:

sample(0:1,size=10,replace=TRUE,prob=c(1-.62,.62))

5.1.4 Sampling distributions


The distribution of a statistic is known as its sampling distribution.
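The sampling distribution of a statistic can be quite complicated. However, for many
common statistics, properties of the sampling distribution are known and are related to
the population parameters. For example, the sample mean $\bar{X}$ of a random sample of size n has
mean $\mu$ and standard deviation $\sigma/\sqrt{n}$, where $\mu$ and $\sigma$ are the population mean and standard deviation.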
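A standard fact of this kind is that the sample mean of n i.i.d. values has mean equal to the population mean and standard deviation equal to the population standard deviation divided by sqrt(n). The sketch below checks this by simulation for rolls of a fair die. The sample size, the number of replications, the seed, and the names xbars, mu, and sigma are all illustrative choices.

```r
set.seed(2)                          # illustrative seed
n = 10                               # size of each random sample
mu = mean(1:6)                       # population mean of a fair die, 3.5
sigma = sqrt(mean((1:6 - mu)^2))     # population sd of a fair die
# 5000 sample means, each computed from n rolls
xbars = replicate(5000, mean(sample(1:6, size=n, replace=TRUE)))
c(simulated=mean(xbars), theory=mu)              # center of the sampling distribution
c(simulated=sd(xbars), theory=sigma/sqrt(n))     # spread of the sampling distribution
```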

5.2 Families of distribution


5.2.1 The d, p, q, and r functions
R has four types of functions for getting information about a family of distributions.
The d functions return the p.d.f. of the distribution, whereas the p functions return the
c.d.f. of the distribution. The q functions return the quantiles, and the r functions
return random samples from a distribution.
dunif(x=1, min=0, max=3)
punif(q=2, min=0, max=3)
qunif(p=1/2, min=0, max=3)
runif(n=1, min=0, max=3)

The arguments to these functions can be vectors:

ps = seq(0,1,by=.2) # vector
names(ps)=as.character(seq(0,100,by=20)) # give names
qunif(ps, min=0, max=1)
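One way to see how the four kinds of function relate, sketched here for the Uniform(0, 3) family: the q function inverts the p function, the p function is the area under the d function, and values from the r function fall below a point with roughly the probability the p function gives. The names inv, area, x, and prop are illustrative, and integrate() is used only as a numerical check.

```r
# q undoes p: start at 2, take the c.d.f., then take that quantile
inv = qunif(punif(2, min=0, max=3), min=0, max=3)
# p is the area under d to the left of the point
area = integrate(dunif, lower=0, upper=2, min=0, max=3)$value
# the proportion of random values at most 2 should be near punif(2, ...)
set.seed(3)                          # illustrative seed
x = runif(10000, min=0, max=3)
prop = mean(x <= 2)
c(inverse=inv, area=area, cdf=punif(2, min=0, max=3), proportion=prop)
```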

5.2.2 Binomial, normal, and some other named distributions


Bernoulli random variables
n = 10; p = 1/4
sample(0:1, size=n, replace=TRUE,prob=c(1-p,p))
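Since a Bernoulli random variable is a binomial random variable with a single trial, the same kind of sample can also be drawn with rbinom() using size=1. A minimal sketch (the seed and the name x are illustrative):

```r
set.seed(7)                          # illustrative seed
n = 10; p = 1/4
x = rbinom(n, size=1, prob=p)        # n Bernoulli(p) values, each 0 or 1
x
sum(x)                               # number of successes among the n trials
```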

Binomial random variables

Example 5.5: Tossing ten coins Toss a coin ten times. Let X be the number of heads. If
the coin is fair, X has a Binomial(10,1/2) distribution.
The probability that X=5 can be found directly from the distribution with the choose()
function:
choose(10,5)*(1/2)^5 * (1/2)^(10-5)
dbinom(5, size=10, prob=1/2)

The probability that there are six or fewer heads, $P(X \le 6) = \sum_{k \le 6} P(X=k)$, can be given
in either of these two ways:
sum(dbinom(0:6,size=10,prob=1/2))
pbinom(6,size=10,p=1/2)

If we wanted the probability of seven or more heads, we could answer using
$P(X \ge 7) = 1 - P(X \le 6)$, or using the extra argument lower.tail=FALSE. This returns $P(X > k)$
rather than $P(X \le k)$.
sum(dbinom(7:10,size=10,prob=1/2))
1 - pbinom(6,size=10,p=1/2)
pbinom(6,size=10,p=1/2, lower.tail=FALSE) # k=6 not 7!

A spike plot (Figure 5.4) of the distribution can be produced using dbinom():
heights=dbinom(0:10,size=10,prob=1/2)
plot(0:10, heights, type="h",main="Spike plot of X", xlab="k", ylab="p.d.f.")
points(0:10, heights, pch=16,cex=2)

Normal random variables

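A normal random variable with mean $\mu$ and standard deviation $\sigma$ falls below x with the same
probability that a standard normal falls below the z-score $z = (x-\mu)/\sigma$.
We can verify this with the p function:

pnorm(1.5, mean=0,sd=1)
pnorm(4.75, mean=4,sd=1/2) # same z-score as above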

How much area is no more than one standard deviation from the mean? We use pnorm()
to find this:
pnorm(1) - pnorm(-1) # within one standard deviation
1 - 2*pnorm(-2) # within two: subtract area of two tails
diff(pnorm(c(-3,3))) # within three: use diff to subtract

Example 5.8: Testing the rules of thumb We can test the rules of thumb using random
samples from the normal distribution as provided by rnorm().
First we create 1,000 random samples and assign them to res:
mu = 100; sigma = 10
res = rnorm(1000,mean = mu,sd = sigma)
k = 1;sum(res > mu - k*sigma & res < mu + k*sigma)
k = 2;sum(res > mu - k*sigma & res < mu + k*sigma)
k = 3;sum(res > mu - k*sigma & res < mu + k*sigma)
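The counts can be set against the proportions the normal distribution predicts. The sketch below regenerates a sample so that it runs on its own, then compares the observed proportion within k standard deviations to pnorm(k) - pnorm(-k) for k = 1, 2, 3. The seed and the names observed and expected are illustrative.

```r
set.seed(4)                          # illustrative seed
mu = 100; sigma = 10
res = rnorm(1000, mean=mu, sd=sigma)
k = 1:3
# proportion of the sample within k standard deviations of the mean
observed = sapply(k, function(j) mean(abs(res - mu) < j*sigma))
# probability the normal distribution assigns to the same intervals
expected = pnorm(k) - pnorm(-k)
round(rbind(k, observed, expected), 4)
```

With only 1,000 values the observed proportions will wander a little from the expected ones from run to run.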

5.2.3 Popular distributions to describe populations


Uniform distribution
res = runif(50, min=0, max=10)
## fig= setting uses bottom 35% of diagram
par(fig=c(0,1,0,.35))
boxplot(res,horizontal=TRUE, bty="n", xlab="uniform sample")
## fig= setting uses top 75% of figure
par(fig=c(0,1,.25,1), new=TRUE)
hist(res, prob=TRUE, main="", col=gray(.9))
lines(density(res),lty=2)
curve(dunif(x, min=0, max=10), lwd=2, add=TRUE)
rug(res)

Exponential distribution
res = rexp(50, rate=1/5)
## boxplot
par(fig=c(0,1,0,.35))
boxplot(res, horizontal=TRUE, bty="n",xlab="exponential sample")
## histogram
par(fig=c(0,1,.25,1), new=TRUE)
## store values, then find largest y one to set ylim=
tmp.hist = hist(res, plot=FALSE)
tmp.edens = density(res)
tmp.dens = dexp(0, rate=1/5)
y.max = max(tmp.hist$density, tmp.edens$y, tmp.dens)
## make plots
hist(res, ylim=c(0,y.max), prob=TRUE, main="",col=gray(.9))
lines(density(res), lty=2)
curve(dexp(x, rate=1/5), lwd=2, add=TRUE)
rug(res)

Lognormal distribution

The t, F, and chi-squared distributions
qt(c(.025,.975), df=10) # 10 degrees of freedom
qf(c(.025,.975), df1=10, df2=5) # 10 and 5 degrees of freedom
qchisq(c(.025,.975), df=10) # 10 degrees of freedom

5.3 The central limit theorem


5.3.1 Normal parent population
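For a normal parent population with mean $\mu$ and standard deviation $\sigma$, the sample mean $\bar{X}$ of a
random sample of size n is again normal, with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$, so its
spread shrinks as n grows.

That is, with greater and greater probability, the random value of $\bar{X}$ is close to the mean,
$\mu$, of the parent population. This phenomenon of the sample average concentrating on the
mean is known as the law of large numbers.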
For example, if adult male heights are normally distributed with mean 70.2 inches and standard
deviation 2.89 inches, the average height of 25 randomly chosen males is again normal
with mean 70.2 but standard deviation 1/5 as large. The probability that the sample
average is between 70 and 71 is found with
mu=70.2; sigma=2.89; n=25
diff( pnorm(70:71, mu, sigma/sqrt(n)) )
[1] 0.5522
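The same probability can be approximated by simulation: draw many samples of 25 heights, average each one, and count how often the average lands between 70 and 71. This is a sketch only. The number of replications, the seed, and the names xbars, simulated, and exact are illustrative.

```r
set.seed(5)                          # illustrative seed
mu = 70.2; sigma = 2.89; n = 25
# 10000 sample averages, each from n normally distributed heights
xbars = replicate(10000, mean(rnorm(n, mean=mu, sd=sigma)))
simulated = mean(xbars > 70 & xbars < 71)
exact = diff(pnorm(70:71, mu, sigma/sqrt(n)))
c(simulated=simulated, exact=exact)
```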

5.3.2 Nonnormal parent population


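The central limit theorem states that for any parent population with mean $\mu$ and
standard deviation $\sigma$, the sampling distribution of $\bar{X}$ for large n satisfies

$$P\left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le b\right) \approx P(Z \le b),$$

where Z is a standard normal random variable.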

Figure 5.9 illustrates the central limit theorem for data with an Exponential(1)
distribution. This parent population and simulations of the distribution of $\bar{X}$ for n=5, 25,
and 100 are drawn. As n gets bigger, the sampling distribution of $\bar{X}$ becomes more and
more bell shaped.

pnorm(.9, mean=1, sd = 1/sqrt(20))


[1] 0.3274
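A call of this form gives the normal approximation to the probability that the average of 20 values is at most 0.9 when the parent population has mean 1 and standard deviation 1. The text does not say which population is meant, so the sketch below assumes an Exponential(1) parent, which has that mean and standard deviation. It compares the approximation with a simulation and with the exact value, using the fact that a sum of n independent Exponential(1) values has a Gamma(n, 1) distribution. The seed and the names xbars, sim, exact, and clt are illustrative.

```r
set.seed(6)                          # illustrative seed
n = 20
# 10000 sample averages, each from n Exponential(1) values (assumed parent)
xbars = replicate(10000, mean(rexp(n, rate=1)))
sim = mean(xbars <= 0.9)                      # simulated probability
exact = pgamma(0.9*n, shape=n, rate=1)        # exact, via the Gamma distribution of the sum
clt = pnorm(0.9, mean=1, sd=1/sqrt(n))        # central limit theorem approximation
c(simulated=sim, exact=exact, normal.approx=clt)
```

At n=20 the approximation is close but not exact, because the sampling distribution of the average is still somewhat skewed.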
