Lesson 2
Agenda
After completing this course, you will be able to understand:
Confidence Interval
Statistical Methods
Statistics is a branch of applied (business) mathematics that is used to estimate the present and predict the future.
Descriptive Statistics
Inferential Statistics
Sample
Population
Estimation
Measure of Dispersion
Hypothesis Testing
It is important that the investigator carefully and completely defines the population before
collecting the sample, including a description of the members to be included.
A sample is a group of units selected from a larger group (the population). By studying the sample it
is hoped to draw valid conclusions about the larger group.
A sample is generally selected for study because the population is too large to study in its entirety.
The sample should be representative of the general population. This is often best achieved by
random sampling.
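As a minimal sketch of simple random sampling, the snippet below draws units without replacement from a population. The population of 1,000 numbered units, the sample size of 50 and the seed are all hypothetical values chosen only for illustration.

```python
import random

# Hypothetical population: unit IDs 1..1000 (not taken from the lesson).
population = list(range(1, 1001))

# A fixed seed makes the draw reproducible; any seed would do.
rng = random.Random(42)

# Simple random sample without replacement: every unit is equally likely.
sample = rng.sample(population, k=50)

print(len(sample), "units drawn; first five:", sample[:5])
```

Because each unit has the same chance of selection, the sample is expected to be representative of the population on average.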
Copyright 2014, Simplilearn, All rights reserved.
Sampling techniques
Probability sampling: Simple Random, Systematic, Stratified, Cluster
Non-Probability sampling: Convenience, Judgmental, Quota, Snowball
Descriptive Statistics
Class interval    Number of Students
Below 40          20
40-50             22
50-60             33
60-70             21
70-80             13
>80               5
Total             114
[Figure: histogram of the number of students per class interval]
[Figure: frequency curves marking the positions of the Mean, Median and Mode]
Measure of Dispersion
Variance
Standard Deviation
Variance = 20 / 5 = 4
Standard Deviation = √4 = 2
Valid    N       Min     Max       Mean     Standard Deviation
1000     1000    0.90    99.95     11.72    10.36
1000     475     0.00    173.00    13.27    16.90
1000     386     0.00    77.70     14.21    19.07
1000     678     0.00    109.25    13.78    14.08
1000     296     0.00    111.95    11.58    19.72
On average, customers spend the most on equipment rental, but there is a lot of variation in the
amount spent.
Customers with calling card service spend only slightly less, on average, than equipment rental
customers, and there is much less variation in the values.
The real problem here is that most customers don't have every service, so a lot of 0's are being
counted. One solution to this problem is to treat 0's as missing values so that the analysis for each
service becomes conditional on having that service.
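A minimal sketch of the suggested fix, assuming a hypothetical list of monthly spend values in which 0 means the customer does not have the service (the numbers are illustrative and are not taken from the table above):

```python
from statistics import mean

# Hypothetical spend on one service; 0 = customer does not have the service.
spend = [0.0, 0.0, 25.5, 0.0, 40.0, 0.0, 31.5, 0.0]

# Mean over all customers: the zeros pull the average down.
mean_all = mean(spend)

# Treat zeros as missing: keep only customers who actually have the service.
users_only = [x for x in spend if x != 0]
mean_users = mean(users_only)

print(f"mean over all customers: {mean_all:.2f}")
print(f"mean over service users: {mean_users:.2f} (n = {len(users_only)})")
```

The second figure is conditional on having the service, which is the quantity the discussion above is really after.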
Probability Theory
Probability is a branch of mathematics that deals with the uncertainty of an event happening in
the future.
A probability value always lies in the range 0 to 1.
Probability of an event, P(E) = (No. of favorable occurrences) / (No. of possible occurrences)
[Figure: a coin showing HEAD and TAIL]
Assigning Probabilities
Classical method: based on equally likely outcomes.
E.g.: Rolling a die.
Relative frequency method: based on experimentation or historical data.
E.g.: No. of days (out of 60) on which a given number of cars was used:
No. of days    Probability
3              (3/60) = 0.05
10             (10/60) = 0.17
16             (16/60) = 0.27
15             (15/60) = 0.25
9              (9/60) = 0.15
7              (7/60) = 0.12
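The relative frequency method can be sketched in a few lines. The day counts below are the numerators of the fractions in the table; the variable names are illustrative.

```python
# Number of days on which each level of car usage was observed
# (the numerators from the table: 3/60, 10/60, ...).
days = [3, 10, 16, 15, 9, 7]

total_days = sum(days)

# Relative frequency = observed count / total number of observations.
probabilities = [d / total_days for d in days]

for d, p in zip(days, probabilities):
    print(f"{d:>2} / {total_days} = {p:.2f}")

# The unrounded probabilities always add up to 1.
print("sum of probabilities:", round(sum(probabilities), 10))
```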
Probability Distribution
A probability distribution for a random variable describes how the probabilities are distributed over the values of that random variable.
It is defined by f(x), which gives the probability of each value.
E.g. Suppose we have sales data for AC sales over the last 300 days.
Units sold    No. of days    Probability of units sold, f(x)
0             10             0.03
              55             0.18
              150            0.5
              55             0.18
              25             0.08
                             0.02
Binomial Distribution
Discrete probability distribution
Following conditions should be satisfied
A fixed number of trials
Each trial is independent of the others
The probability of each outcome remains constant from trial to trial.
Examples
Tossing a coin 10 times for occurrences of head
Surveying a population of 100 people to know if they watch television or not
Rolling a die to check for occurrence of a 2
(Assume that the conditions of the binomial distribution apply: the outcomes of Amir's purchases are
independent, and the population of chocolate bars is effectively infinite.)
p = 1/6
q = 5/6
2.
3.
4.
= 0.979
5. Number of purchase days required so that the probability of success is greater than 0.95:
P(X ≥ 1) ≥ 0.95
1 − P(X = 0) ≥ 0.95
P(X = 0) ≤ 0.05
(5/6)^n ≤ 0.05
n ≥ 16.43 (applying the log function)
n = 17 days.
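The last step can be checked numerically. This is a small sketch that assumes q = 5/6 is the probability of failure on a single day, as stated above, and solves q^n ≤ 0.05 for the smallest whole number of days.

```python
import math

q = 5 / 6        # probability of no success on a single day
target = 0.95    # required probability of at least one success

# q**n <= 1 - target  =>  n >= log(1 - target) / log(q)
# (the inequality flips because log(q) is negative).
n_exact = math.log(1 - target) / math.log(q)
n_days = math.ceil(n_exact)

print(f"n >= {n_exact:.2f}, so {n_days} days are needed")
print(f"P(X >= 1) after {n_days} days = {1 - q ** n_days:.4f}")
```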
Normal Distribution
Theoretical model of the whole population
Centered around the mean and symmetrical on both sides
Standard normal distribution: mean 0 and standard deviation 1
Poisson distribution
Discrete probability distribution for events that happen randomly in time
Following conditions need to be satisfied
The event results in a success or failure
The average number of successes, λ, is known
Probability of success is proportional to the region/time
Probability of success in an extremely small region/time is almost zero.
Properties: The mean and variance are equal, and both are denoted by λ.
Examples
Average number of houses sold by a company is 5 per day. What is the probability that exactly 4
houses will be sold tomorrow?
Average number of births in a hospital is 2.1 births per hour. What is the probability that there
will be exactly 6 births in the next two hours?
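Both examples can be worked with the standard Poisson probability mass function, P(X = k) = e^(−λ) λ^k / k!. The sketch below is illustrative; the function name `poisson_pmf` is chosen here and does not come from the lesson.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average count is lam."""
    return exp(-lam) * lam ** k / factorial(k)

# Houses: average 5 per day, exactly 4 tomorrow.
p_houses = poisson_pmf(4, 5)

# Births: average 2.1 per hour, so 2 * 2.1 = 4.2 over two hours; exactly 6.
p_births = poisson_pmf(6, 2.1 * 2)

print(f"P(4 houses sold tomorrow)   = {p_houses:.4f}")
print(f"P(6 births in next 2 hours) = {p_births:.4f}")
```

Note that the average must be rescaled to the time window in question before the formula is applied, which is why the second example uses 4.2 rather than 2.1.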
         Skewness                  Kurtosis
N        Statistic    Std. Error   Statistic    Std. Error
1000     2.966        0.077        14.012       0.155
475      3.465        0.112        26.735       0.224
386      0.756        0.124        0.641        0.248
678      2.150        0.094        7.572        0.187
296      1.359        0.142        3.079        0.282
The "Equipment last month" data is more accurate in nature, and its SD is comparatively lower than that of the
other measures.
Conclusion: Equipment is the segment where the telecom company is earning more profit than from
the others, so it can invest more in it.
Confidence interval
It is a rule for determining an interval that is likely to include a population parameter,
based on the sample information.
Suppose random samples are taken repeatedly from the population and an interval is calculated from
each sample. A certain percentage of those intervals will contain the unknown value of the parameter.
If the intervals are calculated in such a way that 95% of them contain the true value of the
unknown parameter, each such interval is called a 95% confidence interval for that parameter
(for example, the population proportion).
Data Requirements
Confidence level
Statistic
Margin of error
Range of the confidence interval = sample statistic ± margin of error.
The uncertainty associated with the confidence interval is specified by the confidence level.
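A minimal sketch of how the three inputs combine, using a normal-approximation interval for a population proportion. The sample counts (54 successes out of 120) are hypothetical, and the normal approximation is an assumption that is reasonable only for fairly large samples.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(successes, n, level=0.95):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = successes / n                          # sample statistic
    z = NormalDist().inv_cdf(0.5 + level / 2)      # critical value for the level
    margin = z * sqrt(p_hat * (1 - p_hat) / n)     # margin of error
    return p_hat - margin, p_hat + margin

# Hypothetical sample: 54 of 120 respondents answered "yes".
low, high = proportion_confidence_interval(54, 120, level=0.95)
print(f"95% confidence interval: ({low:.3f}, {high:.3f})")
```

Raising the confidence level widens the interval, because a larger critical value is needed to cover the parameter more often.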
Tests of Significance
Tests used in assessing the evidence in favor of or against a given assumption
Begins with a Null Hypothesis, H0
A test either fails to reject the null hypothesis, or rejects it in favor of an Alternate Hypothesis, Ha
Tests of Significance
One sample z-test
Two sample z-test
One sample t-test
Two sample t-test
Paired t-test
Chi Squared test
F test - Analysis of Variance (ANOVA)
F test - Regression
                 Disease
Exposure    Yes     No      Total
Yes         37      13      50
No          17      53      70
Total       54      66      120
Calculate the chi-squared statistic:
χ² = Σ (Observed − Expected)² / Expected = 29.1
Calculate the degrees of freedom:
(Number of rows − 1) × (Number of columns − 1)
df = (2 − 1) × (2 − 1) = 1
Calculate the p-value from the chi-squared table
For a chi-squared value of 29.1 and degrees of freedom = 1, from the table, the p-value is < 0.001
Interpretation: There is less than a 0.001 chance of obtaining such discrepancies between expected and
observed values if there is no association
Conclusion: There is an association between the exposure and disease.
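The calculation can be reproduced from the four cell counts of the 2 × 2 table (37, 13, 17, 53). This is a plain-Python sketch; the closed-form p-value via `erfc` is valid only for 1 degree of freedom, which is the case here.

```python
from math import erfc, sqrt

# Observed counts: rows = exposure (yes, no), columns = disease (yes, no).
observed = [[37, 13],
            [17, 53]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count if exposure and disease were independent.
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)

# For df = 1 only: P(chi-squared > x) = erfc(sqrt(x / 2)).
p_value = erfc(sqrt(chi2 / 2))

print(f"chi-squared = {chi2:.1f}, df = {df}, p-value = {p_value:.2e}")
```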
ANOVA
Analysis of Variance is used to compare more than two groups
It is an extension of the independent t-test
Factor variable: the variable defining the groups
Marks
82
83
38
83
78
59
97
68
55
Basic Idea : Partition the total variation in the data into the variance between groups and variance
within groups.
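The partition can be sketched as follows. The nine marks are reused from the slide, but how they split into three groups is an assumption made here purely for illustration, since the grouping is not recoverable from the text.

```python
from statistics import mean

# ASSUMPTION: three groups of three marks each; the real grouping is unknown.
groups = [[82, 83, 97],
          [83, 78, 68],
          [38, 59, 55]]

all_marks = [x for g in groups for x in g]
grand_mean = mean(all_marks)

# Variation of the group means around the grand mean.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

# Variation of individual marks around their own group mean.
ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

# Total variation; equals ss_between + ss_within.
ss_total = sum((x - grand_mean) ** 2 for x in all_marks)

k, n = len(groups), len(all_marks)
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))

print(f"SS between = {ss_between:.1f}, SS within = {ss_within:.1f}, "
      f"SS total = {ss_total:.1f}, F = {f_stat:.2f}")
```

A large F value indicates that the variation between groups is large relative to the variation within groups.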
Parametric
Non Parametric
Sign Test
ANOVA
Kruskal-Wallis
Correlation
Measure of association between variables
Positive and negative correlation, ranging between +1 and -1
Positive correlation example:
Earning and expenditure
Negative correlation example
Speed and time
Correlation coefficient
r : correlation coefficient
+1 : Perfectly positive
-1 : Perfectly negative
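A short sketch of the Pearson correlation coefficient r, applied to the two examples above. All data values are hypothetical; the travel times are derived from a fixed distance of 120 units so that time falls as speed rises.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical earning vs. expenditure: both rise together.
earning = [20, 30, 40, 50, 60]
expenditure = [15, 22, 30, 36, 45]

# Hypothetical speed vs. travel time over a fixed distance of 120.
speed = [20, 40, 60, 80, 100]
time_taken = [120 / s for s in speed]

r_positive = pearson_r(earning, expenditure)
r_negative = pearson_r(speed, time_taken)

print(f"earning vs expenditure: r = {r_positive:.3f}")
print(f"speed vs time:          r = {r_negative:.3f}")
```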
Summary
Here is a quick recap of what we have learned in this lesson:
Probability distribution
Quiz
QUIZ 1
a. Mean
b. Median
c. Mode
d. Standard Deviation
Answer: d
Explanation: Standard Deviation is used to measure dispersion and not to measure central tendency.
QUIZ 2
Calculate the mean, median and mode of the following data and choose the right option:
Answer: a.
Explanation: The mean is the average of all the values, the median is the middle value and the mode is the most
commonly occurring value.
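The three measures can be computed with Python's standard `statistics` module. The data set below is hypothetical, since the quiz data is not shown here.

```python
from statistics import mean, median, mode

# Hypothetical data set, chosen only for illustration.
data = [2, 3, 3, 5, 7, 10]

data_mean = mean(data)      # sum of values / number of values = 30 / 6
data_median = median(data)  # middle of the sorted values: (3 + 5) / 2
data_mode = mode(data)      # most frequently occurring value

print("mean:", data_mean)      # 5
print("median:", data_median)  # 4.0
print("mode:", data_mode)      # 3
```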
QUIZ 3
Calculate the variance of the following data and choose the right option:
5,10,12,4,8,9,16
a. 15.41
b. 14.41
c. 9.14
d. 12.41
Answer: b.
Explanation: Variance is the average of squared deviations about the mean, given by σ² = Σ (x − μ)² / N
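The answer can be verified directly from the definition. This sketch uses the population form of the variance (dividing by N), which is the form that reproduces 14.41 for this data.

```python
from statistics import pvariance

data = [5, 10, 12, 4, 8, 9, 16]

# Mean of the data.
mu = sum(data) / len(data)

# Average of the squared deviations about the mean (population variance).
variance = sum((x - mu) ** 2 for x in data) / len(data)

print(f"mean = {mu:.2f}, variance = {variance:.2f}")

# Cross-check against the standard library's population variance.
print(f"statistics.pvariance = {pvariance(data):.2f}")
```

Dividing by N − 1 instead (the sample variance) would give a larger value that matches none of the options.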
QUIZ 4
From the research question below, choose the alternative hypothesis from the
following options.
a. = 0
b. > 0
c. < 0
d.
Answer: b.
Explanation: The question forms a one-sided hypothesis, checking if the average
temperature has increased, that is, if > 0
QUIZ 5
Choose the commonly used value for significance level from the values given below
a. 0.1
b. 0.5
c. 1.0
d. 0.05
Answer: d.
Explanation: The commonly used values for the significance level are 0.01 and 0.05.
QUIZ 6
Choose the right answer: Non-parametric tests can also be referred to as?
a. Distribution free
b. Deviation free
c. Dispersion free
d. Decision free
Answer: a.
Explanation: Non-parametric tests are distribution free.
QUIZ 7
a. Estimation
b. Hypothesis testing
c. Dispersion
d. Data mining
Answer: c.
Explanation: Descriptive statistics deals with the measure of dispersion.
QUIZ 8
Answer: d.
Explanation: Normal distribution. The rest of the options are satisfied by the binomial distribution.
QUIZ 9
a. 0 and 1
b. -1 and 1
c. Negative and positive
d. Only positive
Answer: a.
Explanation: The probability of an event always lies between 0 and 1, i.e. between certain failure
and certain success of that event.
QUIZ
10
a. Sample
b. Measure of central tendency
c. Measures of dispersion
d. Hypothesis testing
Answer: d.
Explanation: Hypothesis testing is not a part of descriptive statistics, it is a part of inferential
statistics.
QUIZ
11
a. Median
b. Mode
c. Mean
d. Standard deviation
Answer: c.
Explanation: The mean is used to calculate the central value, or average, of a given set of
numbers.
QUIZ
12
a. Median
b. Mode
c. Mean
d. Standard deviation
Answer: b.
Explanation: The mode is the value that occurs with the highest frequency in a
given set of numbers.
QUIZ
13
a. Median
b. Mode
c. Mean
d. Standard deviation
Answer: d.
Explanation: Standard deviation is used to measure the dispersion in a given set of numbers.
QUIZ 14
Answer: b.
Explanation: If the p-value is less than 0.05, i.e., p < 0.05, we reject the null hypothesis.
QUIZ
15
a. Skewness
b. Outlier
c. Kurtosis
d. Variance
Answer: c.
Explanation: Kurtosis is mainly used to measure the peakedness of the distribution of a
particular data set.
QUIZ
16
a. Skewness
b. Outlier
c. Kurtosis
d. Variance
Answer: a.
Explanation: Skewness is the measure of deviation from symmetry; a distribution may be left
skewed or right skewed.
Thank You