20161003
Basic Statistics
Revision 00; 03rd October 2016
Authors: Tushar Behl
For further clarifications, write to sanket.deshpande@jsw.in © HR & BE, JSW Vijayanagar
For internal use only. Any unauthorised reproduction of
the contents of this training module is not permitted
Flow of the module
1 Introduction to Statistics
2 Introduction to Data
3 Descriptive Statistics
4 Inferential Statistics
Definition of Statistics
Collection
• Condensation aims at making a huge mass of data understandable by summarising it in only a few observations. Dealing with a large amount of data is cumbersome; statistical measures help us to understand the complexity of huge data.
• Comparison of one entity with another is easy to do with the help of statistics, through tools like graphs, diagrams, coefficient of correlation etc.
• In business, predicting future trends and forecasting future problems and opportunities is always the concern of management. Statistical tools like time series analysis, regression tools etc. enable us to do the same.
• Statistics helps to estimate characteristics of the population by drawing inferences from samples. Tools like estimation theory etc. enable us to do the same.
• Business tends to optimize its productivity, profits and cost. Statistical tools like DOE, linear programming, simulations etc. help us to generate and evaluate strategies to optimize business interests.
Descriptive & Inferential Statistics
Levels of measurement
• Nominal scale
• Ordinal scale
• Interval scale
• Ratio scale
[Diagram labels: Inferential, Correlations, Statistics]
Tests of significance
• Parametric tests of significance: Z-tests (n > 30), t-tests (n < 30)
• Non-parametric tests of significance: Chi-square goodness of fit, Chi-square independence, Mann-Whitney U, Wilcoxon Matched Pairs (signed rank test)
Measures of Association
• Parametric measures of association: Pearson Product Moment Correlation
• Non-parametric measures of association: The Spearman Rank Order Correlation (Rs), Phi coefficient, Kendall’s Q
Prediction
• Parametric prediction: Simple Linear Regression, Multiple Linear Regression
• Non-parametric prediction: Tau, Gamma, Lambda
Multiple Comparisons
• Parametric multiple comparisons: Analysis of Variance (ANOVA), Two Way Analysis of Variance (ANOVA)
• Non-parametric multiple comparisons: Kruskal-Wallis Test, McNemar Test, Friedman Test
Types of Data
Data
3 Descriptive Statistics: Central Tendency, Dispersion, Shape
Measure of Central Tendency
Statistics
• Descriptive: Measure of central tendency, Measure of dispersion, Measure of Position, Measure of Shape
• Inferential: Hypothesis testing, Estimation of parameters
1. We have a set of data of tap to tap time (min.) of BOF in Steel Melting Shop
46.8 47.83 48.81 46.71 50.38 50.52 49.61 56.78 61.95 52.2
1. We have a set of data of employees and their salaries as shown in the table below.

Employee   Salary (₹)
1          15k
2          18k
3          16k
4          14k
5          15k
6          15k
7          12k
8          17k
9          90k
10         95k

2. Mean salary will be
$$\frac{15k + 18k + 16k + 14k + 15k + 15k + 12k + 17k + 90k + 95k}{10} = 30.7k$$

3. However, inspecting the raw data suggests that this mean value might not be the best way to reflect the typical salary of a worker, as most workers have salaries in the ₹12k to ₹18k range. The mean is being skewed by the two large salaries. Hence, in this particular case the median is a better measure of central tendency.
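The comparison in step 3 can be checked numerically. The following is a minimal Python sketch using the standard `statistics` module; the list `salaries_k` holds the ten salaries from the mean calculation (in thousands of ₹), and the variable names are illustrative only.

```python
from statistics import mean, median

# Salaries in thousands of rupees, in the order used in the mean calculation.
salaries_k = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

mean_salary = mean(salaries_k)      # pulled upward by the two large salaries
median_salary = median(salaries_k)  # middle of the rank-ordered salaries

print(f"Mean salary:   {mean_salary}k")
print(f"Median salary: {median_salary}k")
```

The median stays inside the ₹12k to ₹18k band where most salaries lie, while the mean does not.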
Median
Case 1: Odd no. of points
1. We have a data set:
65 55 89 56 35 14 56 55 87 45 92
2. We first rearrange this in order of magnitude (here, smallest first):
14 35 45 55 55 56 56 65 87 89 92
3. Since there is an odd no. of data points in this data set, our median will be the middle value, i.e. 56.
Case 2: Even no. of points
3. Now we take the 5th and 6th data points in our ordered data set and take their mean to get a median of 55.5.
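The two cases can be sketched in Python as follows. The helper `median_by_hand` is a hypothetical name, `odd_data` is the ordered 11-point list from Case 1, and `even_data` is an assumption made only for illustration: the same list with its largest value dropped, which happens to reproduce the 55.5 quoted for the even case.

```python
def median_by_hand(values):
    """Return the median by sorting and picking the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of points: the single middle value.
        return ordered[mid]
    # Even number of points: mean of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_data = [14, 35, 45, 55, 55, 56, 56, 65, 87, 89, 92]   # 11 points
even_data = [14, 35, 45, 55, 55, 56, 56, 65, 87, 89]      # assumed 10-point set

print(median_by_hand(odd_data))
print(median_by_hand(even_data))
```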
Mode
[Figure: frequency curves showing the positions of mean, median and mode for three distributions]
• Normal (symmetric) distribution: Mean = Median = Mode
• Left skewed (asymmetric) distribution: Mean < Median < Mode
• Right skewed (asymmetric) distribution: Mean > Median > Mode
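As a rough illustration of the right-skewed case, the sketch below reuses the salary data from the mean example (values in thousands of ₹) and compares the three measures with Python's `statistics` module. The ordering is a rule of thumb for single-peaked data rather than a guarantee, so treat this as one example and not a proof.

```python
from statistics import mean, median, mode

# Right-skewed sample: most values are small, two are very large.
right_skewed = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

mean_value = mean(right_skewed)
median_value = median(right_skewed)
mode_value = mode(right_skewed)   # most frequent value

print(mean_value, median_value, mode_value)
print("Mean > Median > Mode:", mean_value > median_value > mode_value)
```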
Measure of Dispersion
Measures of dispersion: Range, Variance, Standard Deviation
Dispersion in statistics is a way of describing how spread out a set of data is.
When a data set has a large number of values.
Introduction: Measure of Dispersion
[Figure: examples of high and low dispersion]
Range
Definition
The range is defined as the difference between the largest score ($X_L$) in the set of data and the smallest score ($X_S$) in the set of data. It is the simplest measure of dispersion. Range is a measure of between-point dispersion.

Calculation
If in a data set,
Largest numerical value = $X_L$
Smallest numerical value = $X_S$, then
$$\text{Range} = X_L - X_S$$
e.g. for data set (4,8,1,6,6,2,9,3,6,9), Range = 9 - 1 = 8

When to use
• When you have ordinal data.
• It is most useful in representing the dispersion of small data sets.
• It is used when we want to study within-sample variation.

Note/Limitations
• It is fairly insensitive.
• It depends on only two scores in the set of data, $X_L$ and $X_S$. Hence, sampling may affect it adversely.
• Two very different sets of data can have the same range: (1,1,1,1,9) vs (1,3,5,7,9)
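The calculation and the last limitation can be illustrated with a short Python sketch. The function name `value_range` is hypothetical; the data sets are the ones quoted in the text.

```python
def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

example = [4, 8, 1, 6, 6, 2, 9, 3, 6, 9]
clustered = [1, 1, 1, 1, 9]   # four identical values and one extreme
spread = [1, 3, 5, 7, 9]      # evenly spread values

print(value_range(example))
print(value_range(clustered), value_range(spread))
```

Both of the last two sets give the same range even though their spread is clearly different, which is why the range alone says little about how the values in between are distributed.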
Standard Deviation & Variance
Definition
Deviation is the difference of each data point from the mean.
Variance is the expectation of the squared deviation of a random variable from its mean.
Standard deviation is the most common measure of how data are spread out from the mean. It is the square root of variance.

Calculation
$$\text{Variance } (\sigma^2) = \frac{\sum (x_i - \mu)^2}{N}$$
$$\text{Standard Deviation } (\sigma) = \sqrt{\text{Variance}} = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$$
where $x_i$ is the ith term in the population,
$\mu$ is the mean of the population,
$N$ is the number of data points in the population.
* $(n - 1)$ will replace $N$ and $\bar{x}$ will replace $\mu$ when the above are calculated for a sample, where $n$ is the sample size *
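The difference between the population formulas (divide by N) and the sample formulas (divide by n - 1) can be seen in a minimal Python sketch. It uses the standard `statistics` module, where `pvariance`/`pstdev` are the population versions and `variance`/`stdev` are the sample versions. The data set is borrowed from the range example purely for illustration.

```python
from statistics import pstdev, pvariance, stdev, variance

data = [4, 8, 1, 6, 6, 2, 9, 3, 6, 9]

# Treating the data as the whole population: divide by N.
population_variance = pvariance(data)
population_sd = pstdev(data)

# Treating the data as a sample: divide by (n - 1).
sample_variance = variance(data)
sample_sd = stdev(data)

print(f"Population: variance={population_variance:.4f}, sd={population_sd:.4f}")
print(f"Sample:     variance={sample_variance:.4f}, sd={sample_sd:.4f}")
```

The sample values come out slightly larger because the same sum of squared deviations is divided by a smaller number.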
Measures of position are the techniques used to divide a set of data into equal groups.
Percentile
Definition
A percentile (or a centile) is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. Elements in a data set are rank ordered from the smallest to the largest.
The values that divide a rank-ordered set of elements into 100 equal parts are called percentiles.
E.g. the 80th percentile is the value (or score) below which 80% of the observations may be found.
An observation at the 50th percentile would correspond to the median value in the set.
Similarly, deciles are the values that divide a rank-ordered set of elements into 10 equal parts.

Calculation
Step 1: Compute the rank (R) corresponding to the desired percentile P:
$$R = \frac{P}{100} \times (N + 1)$$
where N is the number of data points.
Step 2: If R is an integer, the Pth percentile is the number with rank R.
If R is non-integer, then:
Define IR & FR as the integer & fraction portions of R respectively.
Find the scores with rank IR and with rank IR + 1.
Now interpolate using the formula:
$$FR \times (\text{difference between the scores at ranks } IR \text{ and } IR + 1) + \min(\text{scores at ranks } IR \text{ and } IR + 1)$$
Refer next slide for example.

Where to use?
Percentiles are calculated as a means of dividing a distribution of values into 2 or more groups.
They are used to determine where to draw the line between observed values within the distribution.
Example : Calculation of Percentile
We have a data set whose scores are rank ordered from smallest to largest. We are interested in finding the 25th percentile.

Score   Rank
3       1
5       2
7       3
8       4
9       5
11      6
13      7
15      8

Step 1: Using the equation
$$R = \frac{P}{100} \times (N + 1)$$
with P = 25 and N = 8, we have R = 2.25.

Step 2: Since R is non-integer, we have IR = 2 and FR = 0.25.
Now the score corresponding to rank IR (= 2) is 5 and that corresponding to rank IR + 1 (= 3) is 7.
Interpolating using the equation
$$FR \times (\text{difference between the scores at ranks } IR \text{ and } IR + 1) + \min(\text{scores at ranks } IR \text{ and } IR + 1)$$
we have 0.25 × (2) + 5 = 5.5
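The two steps of this example can be sketched in Python as below. The function name `percentile` is hypothetical, and returning the smallest or largest score when the rank falls outside 1..N is an assumption of this sketch, since the slides do not say what to do in that case.

```python
def percentile(scores, p):
    """Pth percentile using the rank R = P/100 * (N + 1) with interpolation."""
    ordered = sorted(scores)
    n = len(ordered)
    rank = p / 100 * (n + 1)
    ir = int(rank)       # integer portion of R
    fr = rank - ir       # fraction portion of R

    # Assumption: clamp to the extremes when the rank is outside 1..N.
    if ir < 1:
        return ordered[0]
    if ir >= n:
        return ordered[-1]

    low = ordered[ir - 1]   # score with rank IR (ranks start at 1)
    high = ordered[ir]      # score with rank IR + 1
    # Ordered data, so 'low' is the minimum of the two scores.
    return low + fr * (high - low)

scores = [3, 5, 7, 8, 9, 11, 13, 15]
print(percentile(scores, 25))
print(percentile(scores, 50))
```

Other software may use a different percentile convention and so give slightly different values for the same data.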
[Flowchart: finding the quartiles]
2. Use the median to divide the ordered data set into two halves.
3. Check how many data points there are in the data set.
   • No. of data points is even: go to step 4.
   • No. of data points is odd: 3.1 Include the median in both halves, then go to step 4.
4. Q1 is the median of the lower half of the data & Q3 is the median of the upper half of the data.
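The flowchart can be sketched in Python as follows. The function name `quartiles_by_halves` is hypothetical, and the two data sets are borrowed from the percentile and median examples only for illustration.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) by splitting the ordered data at the median."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 0:
        # Even count: the two halves do not share any point.
        lower, upper = ordered[:mid], ordered[mid:]
    else:
        # Odd count: the median is included in both halves.
        lower, upper = ordered[:mid + 1], ordered[mid:]
    return median(lower), median(ordered), median(upper)

even_set = [3, 5, 7, 8, 9, 11, 13, 15]
odd_set = [14, 35, 45, 55, 55, 56, 56, 65, 87, 89, 92]

print(quartiles_by_halves(even_set))
print(quartiles_by_halves(odd_set))
```

Note that this method can give a Q1 that differs from the 25th percentile obtained with the rank-and-interpolate rule; several quartile conventions are in common use.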
Box Plots
• A box plot is a representation of statistical data in which a rectangle is drawn from the lower quartile (Q1) to the upper quartile (Q3), with a line inside to indicate the median (Q2).
• The lowest and highest quarters of the data are shown as whiskers extending from either side of the rectangle.
• We can also determine the direction of variation, that is, which way the data sways, hence giving a good overview of the data’s distribution.
[Figure: box plot labelled, from top to bottom: Upper whisker, Upper quartile (Q3), Middle quartile / median (Q2), Lower quartile (Q1), Lower whisker]
Z-Score
Application
• The Z-score helps us to understand the distribution of data.
• It helps us to determine the exact location of a data point relative to the mean on a standard scale (or score).
• It helps us to compare two or more distributions.
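A minimal Python sketch of the standard z-score calculation, z = (x - mean) / standard deviation, is given below. The function name `z_scores` is hypothetical, the use of the population standard deviation (`pstdev`) is an assumption of this sketch, and the data are the tap to tap times (min.) quoted earlier in the module.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Express each value as a number of standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)   # assumption: population standard deviation
    return [(x - mu) / sigma for x in values]

tap_to_tap = [46.8, 47.83, 48.81, 46.71, 50.38, 50.52, 49.61, 56.78, 61.95, 52.2]

for time, z in zip(tap_to_tap, z_scores(tap_to_tap)):
    print(f"{time:6.2f} min -> z = {z:+.2f}")
```

A positive z-score lies above the mean and a negative one below it; because z-scores have no units, values from different distributions can be compared on the same scale.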
Interpretation of Z-score
Measures of shape: Skewness, Kurtosis
Skewness
Skewness speaks of the amount and direction of skew, that is, the deviation of data from horizontal symmetry. Skewness has no units; it is a pure number, like a z-score.
Mathematically:
$$\text{Skewness} = \frac{n}{(n - 1)(n - 2)} \sum_{i=1}^{n} \frac{(x_i - \bar{x})^3}{s^3}$$
where $n$ is the sample size, $\bar{x}$ is the sample mean and $s$ is the sample standard deviation.
Skewness can be easily calculated using the SKEW() function in MS-Excel or by using descriptive statistics under Analysis Toolpak.
Interpretation of coefficient of skewness is given below.
Caution: The sample skewness doesn’t necessarily apply to the whole population.
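For readers without Excel, the adjusted sample skewness, n / ((n - 1)(n - 2)) × Σ((x - mean) / s)³, can be sketched in Python as below. This is the formula documented for Excel's SKEW(); the function name `sample_skewness` is hypothetical, and the two data sets (a symmetric list and the salary data from the mean example) are used only for illustration.

```python
from statistics import mean, stdev

def sample_skewness(values):
    """Adjusted sample skewness (the formula documented for Excel's SKEW)."""
    n = len(values)
    if n < 3:
        raise ValueError("at least three data points are required")
    x_bar = mean(values)
    s = stdev(values)   # sample standard deviation (n - 1 in the denominator)
    cubed = sum(((x - x_bar) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed

symmetric = [1, 2, 3, 4, 5]
salaries_k = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

print(sample_skewness(symmetric))    # approximately zero for symmetric data
print(sample_skewness(salaries_k))   # positive for right-skewed data
```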
Skewness: Graphs
[Figure: three frequency curves marking the mean, median and mode]
• Normal distribution
• Negative or Left skewed distribution
• Positive or Right skewed distribution
Kurtosis
Kurtosis tells us how tall and sharp the central peak is, relative to a standard normal distribution curve.
It can be easily calculated using the KURT() function in MS Excel.
Note: This equation takes into account the sample size and subtracts 3 from the kurtosis; hence, with this, the kurtosis of a normal distribution is 0.
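The sketch below implements the sample excess kurtosis formula documented for Excel's KURT(), which is assumed here to be the equation the note refers to: n(n + 1) / ((n - 1)(n - 2)(n - 3)) × Σ((x - mean) / s)⁴ minus 3(n - 1)² / ((n - 2)(n - 3)). The function name `sample_excess_kurtosis` is hypothetical and the data are illustrative.

```python
from statistics import mean, stdev

def sample_excess_kurtosis(values):
    """Sample excess kurtosis (the formula documented for Excel's KURT)."""
    n = len(values)
    if n < 4:
        raise ValueError("at least four data points are required")
    x_bar = mean(values)
    s = stdev(values)   # sample standard deviation
    fourth = sum(((x - x_bar) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    # The correction term plays the role of "subtracting 3" for a finite sample.
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction

flat = [1, 2, 3, 4, 5]   # evenly spread values, no pronounced peak
print(sample_excess_kurtosis(flat))
```

A negative result indicates a flatter peak than the normal curve, and a positive result a taller, sharper one.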
Kurtosis: Graphs & Interpretation