Basic Statistics
© EduPristine – www.edupristine.com
Agenda
Introduction
Data
Basic Statistics
3. Basic Statistics
I. Probability
3.a. Probability
3.a. Probability- Other topics
Set operations
• Union (A ∪ B)
• Intersection (A ∩ B)
Venn diagrams
• Basic operations on Venn diagrams
Bayes theorem
3.b. Random variables
I. Definition
3.b. Random variables- Definition
A random variable is a function or rule that maps each outcome in a sample space to a real
number.
[Figure: the random variable X maps each outcome w1, w2, w3, ... of the sample space S to a real number x1, x2, x3, ..., so that X(w) = x]
So, if w is an element of the sample space S (i.e. w is one of the possible outcomes of the
experiment concerned) and the number x is associated with this outcome, then X(w) = x .
Convention:
• Denote random variable by capital letter “X”
• Denote the outcome or possible values by small letter “x” i.e. X(w) = x
3.b. Random variables- Definition
Example:
Suppose there are 8 balls in a bag. The random variable X is the weight, in kg, of a ball
selected at random. Balls 1, 2 and 3 weigh 0.1kg, balls 4 and 5 weigh 0.15kg and balls 6, 7
and 8 weigh 0.2kg. Using the notation above, write down this information.
Solution:
X(b1) = 0.10 kg, X(b2) = 0.10 kg, X(b3) = 0.10 kg,
X(b4) = 0.15 kg, X(b5) = 0.15 kg,
X(b6) = 0.20 kg, X(b7) = 0.20 kg, X(b8) = 0.20 kg
3.b. Discrete Random variables
Definition:
A discrete random variable is one whose set of possible values (x) is discrete, that is, finite or countable.
• e.g. Outcome of rolling a die = {1, 2, 3, 4, 5, 6}
• Or # credit cards owned by an individual = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
Probabilities:
Probabilities are defined on events (subsets of the sample space S).
So what is meant by “P(X = x) ”?
• Suppose the sample space consists of eight outcomes {s1, s2, s3, s4, s5, s6, s7, s8}
• Let the outcomes in
– E1 = {s1, s2, s3} be associated with number x1
– E2 = {s4, s5} be associated with number x2
– E3 = {s6, s7, s8} be associated with number x3
• P(X = x1) means P(E1)
• P(X = x2) means P(E2)
• P(X = x3) means P(E3)
3.b. Discrete Random variables
Probability functions
• The function fX (x) = P(X = x) for each x in the range of X is the probability function (PF) of X
• It specifies how the total probability of 1 is divided up amongst the possible values of X
• It thus gives the probability distribution of X.
• Also known as the “probability distribution function” (pdf); for a discrete variable it is commonly called the probability mass function (PMF)
Following are the requirements for a function to qualify as the probability function of a discrete
random variable:
• fX (x) >= 0 for all x within the range of X
• ∑fX (x) = 1
Cumulative distribution functions
• Gives the probability that X assumes a value that does not exceed x.
• Denoted as FX(x) = P(X <= x) where max (FX(x)) = 1
3.b. Discrete Random variables- Probability
Example:
Suppose there are 8 balls in a bag. The random variable X is the weight, in kg, of a ball
selected at random. Balls 1, 2 and 3 weigh 0.1kg, balls 4 and 5 weigh 0.15kg and balls 6, 7
and 8 weigh 0.2kg. Write down the different probability distribution functions.
Solution:
fX(0.10) = P(X=0.10) = probability the ball b1 or b2 or b3 is selected out of 8 balls = 3/8
fX(0.15) = P(X=0.15) = probability the ball b4 or b5 is selected out of 8 balls = 2/8
fX(0.20) = P(X=0.20) = probability the ball b6 or b7 or b8 is selected out of 8 balls = 3/8
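The probability function and the cumulative distribution function for the ball example can be tabulated with a short script. This is a minimal sketch: the names `weights`, `probability_function` and `cumulative_distribution` are illustrative choices, not part of the slides, and every ball is assumed to be equally likely to be selected.

```python
from fractions import Fraction

# Weight (kg) of each of the 8 balls, as given in the example.
# Weights are kept as strings so they can serve as exact dictionary keys.
weights = {
    "b1": "0.10", "b2": "0.10", "b3": "0.10",
    "b4": "0.15", "b5": "0.15",
    "b6": "0.20", "b7": "0.20", "b8": "0.20",
}


def probability_function(outcome_values):
    """Return {x: P(X = x)}, assuming every outcome is equally likely."""
    n = len(outcome_values)
    pf = {}
    for x in outcome_values.values():
        pf[x] = pf.get(x, Fraction(0)) + Fraction(1, n)
    return pf


def cumulative_distribution(pf):
    """Return {x: P(X <= x)} by accumulating the PF in increasing order of x."""
    cdf = {}
    running = Fraction(0)
    for x in sorted(pf, key=float):
        running += pf[x]
        cdf[x] = running
    return cdf


pf = probability_function(weights)
cdf = cumulative_distribution(pf)
for x in sorted(pf, key=float):
    print(f"x = {x} kg   f_X(x) = {pf[x]}   F_X(x) = {cdf[x]}")
```

Exact fractions are used so that the two requirements on a probability function (non-negative values that sum to 1) can be checked without rounding error.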
3.b. Continuous Random variables
Definition:
The set of possible values taken by a continuous random variable falls in an interval (or a collection of
intervals) on the real line:
• e.g. Salary of a set of individuals
• Mathematically examples {x: x > 0} or {x: − ∞ < x < ∞} or {x: 0 < x < 1}
Probability Density Function
First define the range or interval over which the probability is to be determined, say (a, b).
The associated probability is written P(a < X < b) or P(a ≤ X ≤ b); for a continuous random variable the two are equal.
It is the area under the curve of the probability density function (PDF) from a to b.
So probabilities can be evaluated by integrating the PDF fX(x).
This relationship defines the PDF.
Mathematically
• $P(a < X < b) = \int_a^b f_X(x)\,dx$
The conditions for a function to serve as a PDF are
• $f_X(x) \ge 0$ for $-\infty < x < \infty$
• $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$
3.b. Continuous Random variables
3.b. Random variables- Expected values
Definition:
Expected values are numerical summaries of important characteristics of the distributions of
random variables.
The expected value of a random variable X is denoted E[X].
Important Expected values are
• Mean
• Variance and Standard deviation
Mean:
• E[X] is a measure of central location
• For the discrete case, calculated as $E[X] = \sum_i x_i P_i$, or equivalently $E[X] = \sum_x x\, f_X(x)$
• For the continuous case, calculated as $E[X] = \int_{-\infty}^{\infty} x\, f_X(x)\,dx$
• Usually denoted by μ
Variance:
• $\mathrm{Var}[X] = E[(X - E[X])^2]$
• $\mathrm{Var}[X] = E[X^2] - (E[X])^2$
3.b. Random variables- Expected values
Example:
Suppose there are 8 balls in a bag. The random variable X is the weight, in kg, of a ball
selected at random. Balls 1, 2 and 3 weigh 0.1kg, balls 4 and 5 weigh 0.15kg and balls 6, 7
and 8 weigh 0.2kg. Find mean and variance of weight.
Solution:
fX(0.10) = P(X=0.10) = 3/8
fX(0.15) = P(X=0.15) = 2/8
fX(0.20) = P(X=0.20) = 3/8
E[X] = ∑Pi * xi = 3/8 * 0.10 + 2/8 * 0.15 + 3/8 * 0.20 = 1.2/8 = 0.15 kg
$E[X^2] = 3/8 \times 0.10^2 + 2/8 \times 0.15^2 + 3/8 \times 0.20^2 = 0.024375$ kg²
$\mathrm{Var}[X] = E[X^2] - (E[X])^2 = 0.024375 - 0.0225 = 0.001875$ kg²
[Figure: mapping X(bi) = x from the sample space S (individual balls b1 to b8) to the set of real numbers (weights in kg): b1, b2, b3 → x1 = 0.10; b4, b5 → x2 = 0.15; b6, b7, b8 → x3 = 0.20]
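The mean and variance of the ball weights can be checked numerically. The sketch below simply applies E[X] = ∑ x·fX(x) and Var[X] = E[X²] − (E[X])² to the probability function of the example; the function names are illustrative.

```python
# Probability function from the example: weight in kg -> probability.
pf = {0.10: 3 / 8, 0.15: 2 / 8, 0.20: 3 / 8}


def mean(pf):
    """E[X] = sum over x of x * f_X(x)."""
    return sum(x * p for x, p in pf.items())


def variance(pf):
    """Var[X] = E[X^2] - (E[X])^2."""
    second_moment = sum(x * x * p for x, p in pf.items())
    return second_moment - mean(pf) ** 2


print(f"E[X]   = {mean(pf):.2f} kg")
print(f"Var[X] = {variance(pf):.6f} kg^2")
```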
3.c. Discrete Probability distributions
3.c. Discrete PDF- Uniform distribution
3.c. Discrete PDF- Bernoulli distribution
A Bernoulli trial is an experiment which has only two possible outcomes – s (“success”) and f
(“failure”).
“success” and “failure” are mere labels and should not be taken literally. Instead we could have
“yes” and “no” OR “true” and “false”
Sample space S = {s,f} .
Probability measure:
• P({s}) = p, P({f}) = 1 – p, 0 < p < 1
Random variable X defined by X(s) = 1, X(f) = 0.
Distribution: $P(X = x) = p^x (1-p)^{1-x}$, x = 0, 1; 0 < p < 1
Expected values:
• Mean, μ = p
• Variance, $\sigma^2 = p(1 - p)$
Examples:
• Tossing of a coin. “Head” corresponds to “success” and “Tail” corresponds to “failure”.
• Defaulting on a home loan. “Default” corresponds to “success” and “Non-default” corresponds to “failure”.
• Auto insurance policy. “No claim” corresponds to “success” and “Claim” corresponds to “failure”.
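The Bernoulli probability function and its expected values can be verified directly from the definition. This is a small sketch; the value p = 0.3 is a hypothetical choice for illustration and does not come from the slides.

```python
def bernoulli_pf(x, p):
    """P(X = x) = p^x * (1 - p)^(1 - x) for x in {0, 1} and 0 < p < 1."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    return p ** x * (1 - p) ** (1 - x)


p = 0.3  # hypothetical probability of "success"
pf = {x: bernoulli_pf(x, p) for x in (0, 1)}

# Mean and variance computed from the probability function itself,
# to be compared with the closed forms p and p * (1 - p).
mean = sum(x * q for x, q in pf.items())
var = sum((x - mean) ** 2 * q for x, q in pf.items())

print(f"mean = {mean:.2f} (closed form p = {p:.2f})")
print(f"variance = {var:.2f} (closed form p(1 - p) = {p * (1 - p):.2f})")
```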
3.c. Discrete PDF- Bernoulli distribution
[Figure: bar chart of the Bernoulli probability function for s (X = 1) and f (X = 0); y-axis from 0.00 to 0.50]
3.c. Discrete PDF- Binomial distribution
3.c. Case study for binomial distribution
3.c. Discrete PDF- Poisson distribution
The annual mean frequency of “Damage to Physical Assets” events in Agency Services is 5.9 events p.a.
Find the probability of recording 0, 1, 2, 3, 4, ..., 20 losses over the next 12 months.
The bank actually records 10 such events over the next 12 months. Management feels that this is a
1-in-100-years scenario. Verify this hypothesis.
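One way to approach this exercise is sketched below. It rests on two assumptions that the slide does not state: the standard Poisson probability function P(X = k) = e^(−λ) λ^k / k! with λ = 5.9, and a reading of "1 out of 100 years" as the claim that P(X ≥ 10) is about 0.01.

```python
import math


def poisson_pf(k, lam):
    """Standard Poisson probability function: exp(-lam) * lam^k / k!."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


lam = 5.9  # annual mean frequency from the slide

# Probabilities of recording 0, 1, ..., 20 losses over the next 12 months.
probs = [poisson_pf(k, lam) for k in range(21)]
for k, pk in enumerate(probs):
    print(f"P(X = {k:2d}) = {pk:.4%}")

# Assumed interpretation: probability of recording 10 or more events in a year.
tail_10 = 1 - sum(probs[:10])
print(f"P(X >= 10) = {tail_10:.4f}")
print("Consistent with a 1-in-100-years view:", tail_10 <= 0.01)
```

Under these assumptions the tail probability comes out at about 0.077, well above 0.01, so the model would not support the 1-in-100-years view.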
3.c. Discrete PDF- Negative- Binomial distribution
3.c. Continuous Probability distributions
3.c. Continuous PDF- Uniform distribution
Assigns equal probability to all values between its minimum and maximum values.
Random variable X takes a value between two numbers a and b (say).
Probability density function: fX(x) = 1/(b-a), a<x<b
Denoted as X ~ U(a, b)
Expected values:
• Mean, μ = (a + b)/2
• Variance, $\sigma^2 = (b - a)^2/12$
Example: Assigning equal probability of default to a portfolio of credit card holders.
3.c. Continuous PDF- Gamma distribution
The Gamma family of distributions is positively skewed and is described by two parameters, “α” and
“λ” (say).
It is bounded below at zero and can take various shapes depending on the values of the parameters.
Random variable X takes a strictly positive value.
Probability density function: $f_X(x) = \lambda^{\alpha} x^{\alpha-1} e^{-\lambda x} / \Gamma(\alpha)$, x > 0
Denoted as X ~ Gamma(α, λ)
Expected values:
• Mean, μ = α/λ
• Variance, $\sigma^2 = \alpha/\lambda^2$
Special cases:
• Exponential distribution when α = 1: $f_X(x) = \lambda e^{-\lambda x}$, x > 0
• Chi-square distribution with v degrees of freedom when α = v/2 (v any positive integer) and λ = 1/2
Example:
• Used to predict claim amounts in auto insurance.
• Used to predict loss amounts in bank loan defaults.
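The Gamma density, its mean α/λ and its variance α/λ² can be checked by numerical integration. The sketch below uses hypothetical parameters α = 2 and λ = 0.5 (not taken from the slides), a simple midpoint rule, and an upper integration limit of 200 as a stand-in for infinity.

```python
import math


def gamma_pdf(x, alpha, lam):
    """f_X(x) = lam^alpha * x^(alpha - 1) * exp(-lam * x) / Gamma(alpha), x > 0."""
    if x <= 0:
        return 0.0
    return lam ** alpha * x ** (alpha - 1) * math.exp(-lam * x) / math.gamma(alpha)


def integrate(f, a, b, steps=20000):
    """Composite midpoint rule; adequate for a smooth density like this one."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


alpha, lam = 2.0, 0.5  # hypothetical parameters, for illustration only
upper = 200.0          # assumed cut-off; the tail beyond it is negligible here

total = integrate(lambda x: gamma_pdf(x, alpha, lam), 0.0, upper)
mean = integrate(lambda x: x * gamma_pdf(x, alpha, lam), 0.0, upper)
second = integrate(lambda x: x * x * gamma_pdf(x, alpha, lam), 0.0, upper)
var = second - mean ** 2

print(f"total probability = {total:.4f}")
print(f"mean = {mean:.4f} (alpha / lam = {alpha / lam:.4f})")
print(f"variance = {var:.4f} (alpha / lam^2 = {alpha / lam ** 2:.4f})")
```

Setting alpha to 1 in `gamma_pdf` reduces it to the exponential density λ e^(−λx), which is the first special case listed above.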
3.c. Continuous PDF- Gamma distribution
[Figure: Gamma probability curves, including Ga(20, 0.5); x-axis: Random variable (X), y-axis: Probability]
Tabulated probabilities for x = 7 to 20 (column headings are missing in the source):
7 7.54% 4.34% 8.17%
8 6.18% 3.38% 13.98%
9 4.98% 2.63% 17.73%
10 3.96% 2.05% 17.77%
11 3.12% 1.60% 14.71%
12 2.44% 1.24% 10.40%
13 1.90% 0.97% 6.44%
14 1.46% 0.75% 3.56%
15 1.12% 0.59% 1.79%
16 0.86% 0.46% 0.82%
17 0.65% 0.36% 0.35%
18 0.50% 0.28% 0.14%
19 0.37% 0.22% 0.05%
20 0.28% 0.17% 0.02%
3.c. Continuous PDF- Normal distribution
3.c. Continuous PDF- Normal distribution
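[Figure: Normal probability curves; x-axis from -5 to 5]
Tabulated probabilities for x = 3 to 5 (column headings are missing in the source):
3 0.4% 4.3%
4 0.0% 1.1%
5 0.0% 0.2%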
3.c. Continuous PDF- Normal distribution
Problem:
If X ~ N(25,36) , by making use of standard normal probability distribution table, find:
(i) P( X < 28)
(ii) P( X > 30)
(iii) P( X < 20)
Solution:
(i) P(X < 28) = P(Z < (28-25)/sqrt(36)) = P(Z < 3/6) =0.69146
(ii) P(X > 30) = P(Z > 0.833) =1− P(Z < 0.833) =1− 0.79758 = 0.20242
(iii) P(X < 20) = P(Z < −0.833) =1− P(Z < 0.833) =1− 0.79758 = 0.20242
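The three probabilities can be cross-checked without a table. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) and reads N(25, 36) as mean 25 and variance 36, so the standard deviation is 6. Small differences from the slide values in the fourth decimal place come from table rounding.

```python
from statistics import NormalDist

# N(25, 36) is read as mean 25 and variance 36, i.e. standard deviation 6.
X = NormalDist(mu=25, sigma=6)

p_lt_28 = X.cdf(28)      # (i)   P(X < 28)
p_gt_30 = 1 - X.cdf(30)  # (ii)  P(X > 30)
p_lt_20 = X.cdf(20)      # (iii) P(X < 20), equal to (ii) by symmetry about 25

print(f"(i)   P(X < 28) = {p_lt_28:.5f}")
print(f"(ii)  P(X > 30) = {p_gt_30:.5f}")
print(f"(iii) P(X < 20) = {p_lt_20:.5f}")
```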
3.c. Continuous PDF- Lognormal distribution
Example:
• Used to predict claim amounts in auto insurance.
• Used to predict loss amounts in bank loan defaults.
3.c. Continuous PDF- Lognormal distribution
[Figure: probability density of logN(0,1); x-axis from 0 to 20, y-axis from 0.00 to 0.45]
3.d. The Central Limit Theorem
Introduction:
It is perhaps one of the most important results in statistics.
It provides the basis for large-sample inference about a population mean when the population
distribution is unknown.
It also provides the basis for large-sample inference about a population proportion, for example, in
opinion polls and surveys.
Definition:
If X1, X2, …, Xn is a sequence of independent, identically distributed (iid) random variables with finite
mean μ and finite (non-zero) variance $\sigma^2$, then the distribution of (<X> – μ)/(σ/√n) approaches the
standard normal distribution, N(0,1), as n → ∞.
μ is the mean of the population from which X1, X2, …, Xn have been drawn.
<X> is the sample mean, calculated as $\langle X \rangle = \frac{1}{n} \sum_{i=1}^{n} X_i$.
For large n, (<X> – μ)/(σ/√n) and $(\sum X_i - n\mu)/\sqrt{n\sigma^2}$ are approximately N(0, 1) distributed
OR, equivalently and approximately,
• <X> ~ $N(\mu, \sigma^2/n)$
• $\sum X_i \sim N(n\mu, n\sigma^2)$
3.d. The Central Limit Theorem
Example:
It is assumed that the number of claims arriving at an insurance company per working day has
a mean of 40 and a standard deviation of 12. A survey was conducted over 50 working days.
Find the probability that the sample mean number of claims arriving per working day was less
than 35.
Solution:
We have μ = 40, σ = 12, n = 50.
The central limit theorem states that, approximately, <X> ~ $N(40, 12^2/50)$.
We want P(<X> < 35):
P(<X> < 35) = $P(Z < (35 - 40)/\sqrt{12^2/50})$
= P(Z < -2.946) = 1 – P(Z < 2.946)
= 1 – 0.9984 = 0.0016
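The calculation above can be reproduced in a few lines. This is a sketch that relies on the normal approximation given by the central limit theorem and uses `statistics.NormalDist` from the standard library; the variable names are illustrative.

```python
import math
from statistics import NormalDist

mu, sigma, n = 40, 12, 50  # values from the example
threshold = 35

# By the CLT the sample mean is approximately N(mu, sigma^2 / n).
standard_error = sigma / math.sqrt(n)
z = (threshold - mu) / standard_error
p = NormalDist().cdf(z)  # P(Z < z) under the standard normal

print(f"standard error = {standard_error:.4f}")
print(f"z = {z:.3f}")
print(f"P(sample mean < {threshold}) = {p:.4f}")
```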
3.e. Sampling and Statistical Inference
I. Introduction
II. Random samples
III. Sample Mean
IV. Sample variance
V. The t- result
VI. The F- result
3.e. Sampling and Statistical Inference
Introduction:
When a sample is taken from a population the sample information can be used to infer
certain things about the population.
For example, a population quantity could be its mean or variance.
If we were to keep taking samples from the same population and calculating the mean and
variance for each of the samples, we would find that the mean and variance results form
distributions as well.
The distributions of the sample mean and sample variance are called sampling
distributions.
3.e. Sampling and Statistical Inference- Normal distribution
The t result:
Distribution:
• (<X> – μ)/(σ/√n) ~ N(0,1) is used to draw inference about μ when the population variance $\sigma^2$ is known.
• But for a population, $\sigma^2$ is usually not known.
• We combine (<X> – μ)/(σ/√n) ~ N(0,1) and $(n-1)S^2/\sigma^2 \sim \chi^2_{n-1}$ to solve this problem.
• $(\langle X \rangle - \mu)/(S/\sqrt{n}) = N(0,1)/\sqrt{\chi^2_{n-1}/(n-1)} \sim t_{n-1}$
• since $N(0,1)/\sqrt{\chi^2_k/k} \sim t_k$ when the normal and chi-square variables are independent
Example:
State the distribution of (<X>-100)/(S/√5) for a random sample of 5 values taken from a $N(100, \sigma^2)$
population. What is the probability that this quantity will exceed 1.533?
Solution:
Distribution: (<X>-100)/(S/√5) ~ $t_4$
From the t-Distribution table, the probability that this quantity will exceed 1.533 is 10%.
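The 10% figure read from the t-table can be cross-checked by integrating the t density numerically. The slides do not give the density, so the sketch below assumes the standard form f(t) = Γ((ν+1)/2) / (√(νπ) Γ(ν/2)) · (1 + t²/ν)^(−(ν+1)/2) and applies Simpson's rule; the function names are illustrative.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, steps=10000):
    """P(T > t) for t >= 0: 0.5 minus the Simpson integral of the pdf on [0, t]."""
    if t < 0:
        raise ValueError("this sketch only handles t >= 0")
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of intervals
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


p = t_upper_tail(1.533, 4)
print(f"P(t_4 > 1.533) = {p:.4f}")
```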
3.e. Sampling and Statistical Inference- Normal distribution
The F result:
If independent random samples of sizes n1 and n2 respectively are taken from normal populations
with variances $\sigma_1^2$ and $\sigma_2^2$, then
• $(S_1^2/\sigma_1^2) / (S_2^2/\sigma_2^2) \sim F_{n_1-1,\,n_2-1}$
• The F distribution gives us the distribution of the variance ratio for two normal populations.
3.e. The F-test
Example: William Waugh is examining the earnings of two different industries. He suspects that the
earnings of the chemical industry are more divergent than those of the petroleum industry. To confirm this, he
took a sample of 35 chemical manufacturers and a sample of 45 petroleum companies. He measured
the sample standard deviation of earnings across the chemical industry to be $3.50 and that of
the petroleum industry to be $3.00. Determine whether the earnings of the chemical industry have a greater
standard deviation than those of the petroleum industry.
Solution
4. State the decision rule regarding the hypothesis: Reject H0 if F > 1.74
$F = S_1^2/S_2^2 = 3.50^2/3.00^2 = 12.25/9.00 = 1.3611 < 1.74$ (hence there is not sufficient evidence to reject H0)
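The F statistic for this example can be recomputed as follows. This is a sketch: the critical value 1.74 is taken from the decision rule on the slide rather than computed, and the variable names are illustrative.

```python
# Sample standard deviations and sample sizes from the example.
s_chem, n_chem = 3.50, 35
s_petro, n_petro = 3.00, 45

# Under H0 (equal population variances) the ratio of sample variances
# follows an F distribution with (n1 - 1, n2 - 1) degrees of freedom.
f_stat = s_chem ** 2 / s_petro ** 2
df = (n_chem - 1, n_petro - 1)

f_critical = 1.74  # critical value quoted in the slides' decision rule
reject_h0 = f_stat > f_critical

print(f"F = {f_stat:.4f} with degrees of freedom {df}")
print("Reject H0:", reject_h0)
```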
3.f. Point Estimate & Confidence Intervals
Point estimates: These are the single (sample) values used to estimate population parameters
Confidence interval: It is a range of values in which the population parameter is expected to lie
3g. Hypothesis Testing
A statistical hypothesis test is a method of making statistical decisions from and about
experimental data.
• “How well the findings fit the possibility that chance factors alone might be responsible."
3g. Key steps in Hypothesis Testing
Null Hypothesis (H0): The hypothesis that the researcher wants to reject
Test Statistic
Rejection/Critical Region
Conclusion
3g. Launching a niche course for MBA students?
Sam, a brand manager for a leading financial training center, wants to introduce a new niche finance course for MBA
students. He met some industry stalwarts and found that, with the skills acquired by attending such a course, the
students would be able to land a good job.
He meets a random sample of 100 students and discovers the following characteristics of the market:
• Mean household income = $20,000
• Interest level in students = high
• Current knowledge of students for the niche concepts = low
Sam strongly believes the course would be adequately profitable if students have the buying power for the
course. They would be able to afford the course only if the mean household income is greater than $19,000.
Would you advise Sam to introduce the course?
• What should be the hypothesis?
o Hint: What is the point at which the decision changes (19,000 or 20,000)?
o What about the alternate hypothesis?
• What other information do you need to ensure that the right decision is arrived at?
o Hint: confidence intervals/ significance levels?
o Hint: Is there any other factor apart from mean, which is important? How do I move from population
parameters to standard errors?
• What is the risk still remaining, when you take this decision?
o Hint: Type-I/II errors?
o Hint: P-value
3g. Criterion for Decision Making
• To reach a final decision, Sam has to make a general inference (about the population) from the
sample data.
• Criterion: Mean income across all households in the market area under consideration.
– If the mean population household income is greater than $19,000, then Sam should introduce
the course into the new market.
3g. Identifying the Critical Sample Mean Value – Sampling Distribution
[Figure: sampling distribution of the sample mean centered on µ = $19,000, with the critical value (Xc) marked in the right-hand tail]
• Sample mean values greater than $19,000, that is, x̄ values on the right-hand side of the sampling
distribution centered on µ = $19,000, suggest that H0 may be false.
• More importantly, the farther to the right x̄ is, the stronger the evidence against H0.
3g. Computing the Criterion Value
• The standard deviation for the sample of 100 households is $4,000. The standard error of the mean
is given by:
$s_{\bar{x}} = s/\sqrt{n} = 4000/\sqrt{100} = 400$ (dollars)
– Substitute the values of zc, the standard error, and μ (under the assumption that H0 is "just" true)
– Critical value xc:
– $x_c = \mu + z_c s_{\bar{x}} = 19000 + 1.645 \times 400 = 19658$, i.e. $19,658.
– In this case, the observed sample statistic ($20,000) is greater than the critical value ($19,658), so
the null hypothesis is rejected.
Decision Rule
If the sample mean household income is greater than $19,658, reject the null hypothesis and introduce the
new course
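The critical sample mean can be recomputed as follows. This sketch assumes a one-tailed test at α = 0.05 with H0 "just" true at μ = $19,000, and obtains the critical z-value from `statistics.NormalDist` instead of a table; the variable names are illustrative.

```python
import math
from statistics import NormalDist

mu0 = 19_000    # hypothesized mean household income under H0
s = 4_000       # sample standard deviation
n = 100         # sample size
alpha = 0.05    # assumed one-tailed significance level
x_bar = 20_000  # observed sample mean

standard_error = s / math.sqrt(n)
z_c = NormalDist().inv_cdf(1 - alpha)  # about 1.645
x_c = mu0 + z_c * standard_error       # critical sample mean

print(f"standard error = {standard_error:.0f}")
print(f"critical value = {x_c:.0f}")
print("Reject H0:", x_bar > x_c)
```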
3g. Test Statistic
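The value of the test statistic is simply the z-value corresponding to x̄ = $20,000:
$Z = (\bar{x} - \mu)/s_{\bar{x}} = (20000 - 19000)/400 = 2.5$
[Figure: standard normal curve with the right-tail rejection region of area α = 0.05 beyond Zc = 1.645]
• There is a significant difference between the hypothesized population
parameter and the observed sample statistic, since Z = 2.5 exceeds Zc = 1.645.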
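The test statistic and the corresponding one-tailed P-value can be computed as follows. This is a sketch assuming a right-tailed z-test with the standard error of $400 derived earlier in the case study; the variable names are illustrative.

```python
from statistics import NormalDist

x_bar = 20_000        # observed sample mean
mu0 = 19_000          # hypothesized mean under H0
standard_error = 400  # s / sqrt(n) = 4000 / sqrt(100)
alpha = 0.05

z = (x_bar - mu0) / standard_error
z_c = NormalDist().inv_cdf(1 - alpha)  # one-tailed critical value, about 1.645
p_value = 1 - NormalDist().cdf(z)      # right-tail area beyond the observed z

print(f"z = {z:.2f}, critical z = {z_c:.3f}")
print(f"P-value = {p_value:.5f}")
print("Reject H0:", z > z_c)
```

The P-value is the smallest significance level at which H0 would still be rejected, which is why it can be read as the actual significance level of the observed result.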
3g. Errors in Estimation
Please note: You are inferring for a population, based only on a sample
• This is no proof that your decision is correct
• It’s just a hypothesis
– There is still a chance that your inference is wrong
– How do I quantify the prob. of error in inference?
Type I and Type II Errors:
– Type I error occurs if the null hypothesis is rejected when it is true
When the inference is “H0 is True”:
• Actual H0 is True: Correct Decision, Confidence Level = 1-α
• Actual H0 is False: Type-II Error, P(Type-II Error) = β
3g. P - Value – Actual Significance Level
[Figure: P-value]
3g. Some variations in the Z-Test
What if Sam surveyed the market and found that student behavior is estimated to be:
– They would find the training too expensive if their household income is < US$19,000 and
hence would not have the buying power for the course?
– They would perceive the training to be of inferior quality if their household income is >
US$19,000 and hence would not buy the training?
– How would the decision criteria change? What should be the testing strategy?
Hint: From the question wording, infer two-tailed testing
– Appropriately modify the significance value and other parameters
– Use the Z-test
Appropriate change in the decision-making and testing process:
– Students will not attend the course if:
• The household income is > $19,000 and the students perceive the course to be inferior
• The household income is < $19,000
– This becomes a two-tailed test, wherein the student will join the course only when the
household income lies within particular bounds, i.e. the household income should be neither
very high nor very low
Two- Tailed Test
[Figure: two-tailed test with “Reject H0” regions in both tails and a “Do not Reject H0” region between them]
• Conclusion: If the household income lies between $18,216 and $19,784 then the
student will attend the course at 95% confidence
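The two bounds in the conclusion can be reproduced as follows. This sketch assumes the test is centered on $19,000, reuses the standard error of $400 from the one-tailed case, and splits α = 0.05 equally between the two tails; the variable names are illustrative.

```python
from statistics import NormalDist

mu0 = 19_000          # center of the sampling distribution under H0
standard_error = 400  # assumed to carry over from the one-tailed example
alpha = 0.05          # 95% confidence, split across both tails

z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96
lower = mu0 - z * standard_error
upper = mu0 + z * standard_error

print(f"two-tailed critical z = {z:.3f}")
print(f"do-not-reject region: {lower:.0f} to {upper:.0f}")
```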
Thank you!
EduPristine
702, Raaj Chambers, Old Nagardas Road, Andheri (E), Mumbai-400 069. INDIA
www.edupristine.com
Ph. +91 22 3215 6191