
Series C TQM Training Module; Doc. No. C-01.00.

20161003
Basic Statistics
Revision 00; 03rd October 2016
Authors: Tushar Behl
For further clarifications, write to sanket.deshpande@jsw.in
© HR & BE, JSW Vijayanagar
For internal use only. Any unauthorised reproduction of the contents of this training module is not permitted.
Flow of the module

1 Introduction to Statistics

2 Introduction to Data

3 Descriptive Statistics

4 Inferential Statistics
Definition of Statistics

Statistics is the science of learning from data, and of measuring, controlling, and communicating uncertainty from data collected in a systematic manner for a pre-determined purpose and placed in relation to each other.

It is the methodology which scientists and mathematicians have developed for interpreting and drawing conclusions from collected data.

Statistics is the science of dealing with uncertain phenomena and events, providing crucial guidance in determining what information is reliable and which predictions can be trusted.

In short: Collection, Organization, Analysis & Interpretation of numerical data.
Functions of statistics

1. Condensation
• Condensation aims at understanding a huge mass of data by providing only a few observations. Dealing with a large amount of data is cumbersome; statistical measures help us understand the complexity of huge data.

2. Comparison
• Comparison of one entity with another is easy to do with the help of statistics, through tools like graphs, diagrams, coefficient of correlation etc.

3. Forecasting
• In business, predicting future trends and forecasting future problems and opportunities is always a concern of management.
• Statistical tools like time series analysis, regression tools etc. enable us to do the same.

4. Estimation
• Statistics helps to estimate characteristics of the population by drawing inferences from samples; tools like estimation theory etc. enable us to do the same.

5. Optimization
• Business tends to optimize its productivity, profits and cost.
• Statistical tools like DOE, linear programming, simulations etc. help us to generate and evaluate strategies to optimize business interests.
Descriptive & Inferential Statistics

Descriptive Statistics (describing data)
• Organize
• Summarize
• Simplify
• Presentation of data

Inferential Statistics (making predictions)
• Generalize from samples to populations
• Hypothesis testing
• Relationships among variables


Branches of statistics

Descriptive Statistics
Meaning:
• Involves summarizing, tabulating, organizing and graphing data in order to summarize a data set for the entire population of subjects.
• No attempt is made to infer or predict the characteristics of subjects that have not been measured or observed.
• Tells us what the data shows and helps to present quantitative descriptions in a sensible way.
What does it entail (measuring of):
• Central Tendency: describing the whole set of data with a single value that represents the middle or centre of the distribution. Mean, Median, Mode
• Dispersion: describing how spread out a set of data is, particularly when a data set has a large number of values. Range, Std. Deviation, Variance
• Position: techniques used to divide a set of data into equal groups. Percentile, Quartile, Z-score
• Shape: shows the shape and distribution of data. Skewness, Kurtosis

Inferential Statistics
Meaning:
• Quantitatively describing the main features of a population based on a sample under study.
• Helps draw inferences about a population based on the results of a sample under study which represents the whole population.
• Is valuable when it is not convenient or possible to examine each member of an entire population.
What does it entail:
• Hypothesis testing
• Estimation of parameters
Inferential Statistics

Degrees of freedom
• One-sample t-test or paired test
• Independent t-test
• Chi-square test
• ANOVA

Levels of measurement (Correlations)
• Nominal scale
• Ordinal scale
• Interval scale
• Ratio scale

Tests of significance
• Parametric tests of significance: Z-tests (sample size > 30), t-tests (sample size < 30)
• Non-parametric tests of significance: Chi-square goodness of fit, Chi-square independence, Mann-Whitney U, Wilcoxon Matched Pairs (signed rank test)

Measures of Association
• Parametric Measures of Association: Pearson Product Moment Correlation
• Non-parametric Measures of Association: The Spearman Rank Order Correlation (Rs), Phi coefficient, Kendall’s Q

Prediction
• Parametric Prediction: Simple Linear Regression, Multiple Linear Regression
• Non-parametric Prediction: Tau, Gamma, Lambda

Multiple Comparisons
• Parametric Multiple Comparisons: Analysis of Variance (ANOVA), Two Way Analysis of Variance (ANOVA)
• Non-parametric Multiple Comparisons: Kruskal-Wallis Test, McNemar Test, Friedman Test
Types of Data

Data is either Qualitative (categorization, classification of data) or Quantitative (numerical data).

Qualitative data
• Nominal: Nominal or unordered data is data which has no implicit or natural order or rank between the categories. E.g. colour of candies, gender, ethnic origin etc.
• Ordinal: Ordinal data has items assigned to categories that do have some kind of implicit or natural order. E.g. height type (short, medium or tall), social class (rich, middle class or poor).
• Binary: Binary data place things in one of two mutually exclusive categories: right/wrong, true/false, or accept/reject. E.g. good candy and bad candy.

Quantitative data
• Discrete: Discrete data is a count that can't be made more precise; it involves integers and cannot take decimal values. E.g. if I tally the number of candies, it would be discrete data.
• Variable (continuous): Continuous data can be divided and reduced to finer levels and can take any value. E.g. if I use a weighing scale to measure the weight of candies, it would be continuous data.

Another classification of Quantitative data
• Interval: E.g. temperature range 15-25 deg C
• Ratio: E.g. Body Mass Index (BMI), pulse rate
Flow of the module

1 Introduction to Statistics

2 Introduction to Data

3 Descriptive Statistics: Central Tendency, Dispersion, Position, Shape

4 Inferential Statistics
Measure of Central Tendency

Statistics
• Descriptive: Measure of central tendency, Measure of dispersion, Measure of Position, Measure of Shape
• Inferential: Hypothesis testing, Estimation of parameters

Measures of central tendency: Mean, Median, Mode

The measures of central tendency describe the whole set of data with a single value that represents the middle or centre of the distribution.
Mean

Definition
For a data set, the terms arithmetic mean and average are used synonymously to refer to a central value of a discrete set of numbers: specifically, the sum of the values divided by the number of values.

Notation
The arithmetic mean of a sample is denoted by $\bar{x}$ and the population mean by $\mu$.

Mathematically
For a data set $(x_1, x_2, x_3, \dots, x_n)$,
$$\text{Mean} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$

When is mean used?
1. Mean is the best measure of central tendency for interval and ratio scales of measurement.
2. Mean is often used when data is symmetric and unimodal, that is, when the data is not skewed.
3. It is used when we want to study variation between multiple samples.

Note
1. It produces the lowest amount of error from all other values in the data set.
2. The larger the sample size, the closer the sample mean is to the population mean.
3. It includes every value in the data set as part of the calculation.

Caution
1. Mean is susceptible to the influence of outliers.
2. Mean is usually not a good measure of central tendency for a skewed distribution.
Example 1 : Calculation of Mean

1. We have a set of data of tap to tap time (min.) of BOF in Steel Melting Shop

46.8 47.83 48.81 46.71 50.38 50.52 49.61 56.78 61.95 52.2

2. Mean of the above data will be:

$$\text{Mean} = \frac{46.8 + 47.83 + 48.81 + 46.71 + 50.38 + 50.52 + 49.61 + 56.78 + 61.95 + 52.2}{10} = 51.159$$

Mean tap to tap time = 51.159 minutes
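
The calculation in Example 1 can be reproduced with a short Python sketch. The function name `arithmetic_mean` and the variable names are illustrative choices and do not come from the module.

```python
# Tap-to-tap times (minutes) of the BOF, copied from Example 1.
tap_to_tap = [46.8, 47.83, 48.81, 46.71, 50.38, 50.52, 49.61, 56.78, 61.95, 52.2]


def arithmetic_mean(values):
    """Return the sum of the values divided by the number of values."""
    return sum(values) / len(values)


mean_time = arithmetic_mean(tap_to_tap)
print(f"Mean tap to tap time = {mean_time:.3f} minutes")
```

The same function works for any non-empty list of interval or ratio scale measurements.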


Example 2: Calculation of Mean

1. We have a set of data of employees and their salaries as shown in the table below.

Employee   Salary (₹)
1          15k
2          18k
3          16k
4          14k
5          15k
6          15k
7          12k
8          17k
9          90k
10         95k

2. Mean salary will be

$$\frac{15k + 18k + 16k + 14k + 15k + 15k + 12k + 17k + 90k + 95k}{10} = 30.7k$$

Mean Salary = ₹ 30.7k

3. However, inspecting the raw data suggests that this mean value might not be the best way to accurately reflect the typical salary of a worker, as most workers have salaries in the ₹ 12k to ₹ 18k range. The mean is being skewed by the two large salaries. Hence, in this particular case the median is a better measure of central tendency.
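
As a rough illustration of the point made in Example 2, the sketch below compares the mean and the median of the same salary data using Python's standard `statistics` module. The variable names are illustrative only.

```python
import statistics

# Salaries in thousands of rupees, copied from the table in Example 2.
salaries_k = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

mean_salary = statistics.mean(salaries_k)      # pulled upward by 90k and 95k
median_salary = statistics.median(salaries_k)  # middle of the ordered data

print(f"Mean salary   = {mean_salary}k")
print(f"Median salary = {median_salary}k")
```

The median falls inside the ₹ 12k to ₹ 18k range where most salaries lie, while the mean does not.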
Median

Definition
A median is the value separating the higher half of a data sample (or a population) from the lower half. It is the middle value when the data is in order of magnitude.

Notation
Median is denoted by M or $\tilde{x}$.

Mathematically
1. Arrange data in ascending or descending order.
2. For an odd no. of data points, Median = middle value.
3. For an even no. of data points, Median = mean of the middle two values.

Note
It is of central importance, as it is the most resistant statistic. The median is the 2nd quartile, 5th decile, and 50th percentile.

When to use?
1. It is preferred when data is ordinal; however, it can be calculated for interval/ratio data as well.
2. The median can be used as a measure of location for skewed distributions.
3. It can be used when end-values are not known.
4. It can also be used when one requires reduced importance to be attached to outliers.
5. A median can be defined on ordered one-dimensional data.
Calculation of Median: Example

Case 1: Odd no. of points
1. Suppose we have a data set

65 55 89 56 35 14 56 55 87 45 92

2. We first rearrange this in order of magnitude (here, smallest first)

14 35 45 55 55 56 56 65 87 89 92

3. Since there is an odd no. of data points in this data set, our median will be the middle value, i.e. 56.

Case 2: Even no. of points
1. Now suppose we have an even no. of data points

65 55 89 56 35 14 56 55 87 45

2. Rearranging them in order of magnitude, we have

14 35 45 55 55 56 56 65 87 89

3. Now we take the 5th and 6th data points in our data set (55 and 56) and take their mean to get a median of 55.5.
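
The two cases can be sketched in Python as follows. The function name `median` is illustrative, and the first value of the odd-sized data set is taken as 65, which is what the ordered list in Case 1 implies.

```python
def median(values):
    """Middle value of the ordered data; mean of the middle two if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_case = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]   # 11 points
even_case = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45]      # 10 points

print(median(odd_case))    # 56
print(median(even_case))   # 55.5
```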
Mode

Definition
The mode is the value that appears most frequently in a set of data.

When to use
1. Mode is the preferred measure of central tendency when data is nominal, categorical in nature.
2. Mode is very rarely used with continuous data.

Note
1. It is the value that is most likely to be sampled.
2. The highest bar in a bar chart or histogram represents the mode.

Note
There may be more than one mode for a data set.
- Only one mode: Unimodal
- Two modes: Bimodal
- Three modes: Trimodal
- More than three modes: Multimodal

Limitation
A limitation of the mode is that it does not provide a very good measure of central tendency when the most common mark is far away from the rest of the data in the data set.
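
A minimal sketch of finding the mode(s) of a data set and naming its modality is given below. The helper names `modes` and `modality` and both sample data sets are hypothetical and are used only for illustration.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


def modality(values):
    """Name the data set by its number of modes."""
    labels = {1: "unimodal", 2: "bimodal", 3: "trimodal"}
    return labels.get(len(modes(values)), "multimodal")


grades = ["B", "A", "C", "B", "B", "A"]   # hypothetical nominal data
scores = [55, 56, 55, 56, 14, 35]         # hypothetical data with two modes

print(modes(grades), modality(grades))
print(modes(scores), modality(scores))
```

Because it only counts occurrences, the approach works for nominal (categorical) data, where mean and median are not defined.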
Central tendencies: Relationship

Mathematically,
$$2\,\text{Mean} = 3\,\text{Median} - \text{Mode}$$

[Figure: three frequency curves (x vs. Frequency) marking the positions of Mean, Median and Mode]
• Normal (symmetric) distribution: Mean = Median = Mode
• Left skewed (asymmetric) distribution: Mean < Median < Mode
• Right skewed (asymmetric) distribution: Mean > Median > Mode
Measure of Dispersion

Measures of dispersion: Range, Variance, Standard Deviation

Dispersion in statistics is a way of describing how spread out a set of data is, particularly when a data set has a large number of values.
Introduction: Measure of Dispersion

Measures of dispersion are descriptive statistics that describe how similar a set of data are to each other. The three frequently used measures are:
1. Range
2. Variance
3. Standard Deviation

Which of the following distributions has more dispersion?
[Figure: Distribution A vs Distribution B, each plotted for points 1 to 10 on a vertical scale of 0 to 125]

Closeness in numerical value of data    Measure of dispersion
Low                                     High
High                                    Low
Range

Definition
The range is defined as the difference between the largest score (XL) in the set of data and the smallest score (XS) in the set of data. It is the simplest measure of dispersion. Range is a measure of between-point dispersion.

Calculation
If in a data set,
Largest numerical value = XL
Smallest numerical value = XS, then
$$\text{Range} = X_L - X_S$$
e.g. for data set (4,8,1,6,6,2,9,3,6,9)
Range = 9 - 1 = 8

When to use
• When you have ordinal data.
• It is most useful in representing the dispersion of small data sets.
• It is used when we want to study within-sample variation.

Note/Limitations
• It is fairly insensitive.
• It depends on only two scores in the set of data, XL and XS. Hence, sampling may affect it adversely.
• Two very different sets of data can have the same range: (1,1,1,1,9) vs (1,3,5,7,9)
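
The range calculation and its main limitation can be sketched in Python as below. The function name `data_range` is illustrative. The population standard deviation from the standard `statistics` module is included only to show that the two data sets with equal range are spread differently.

```python
import statistics


def data_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


example = [4, 8, 1, 6, 6, 2, 9, 3, 6, 9]
set_a = [1, 1, 1, 1, 9]
set_b = [1, 3, 5, 7, 9]

print(data_range(example))                       # 8
print(data_range(set_a), data_range(set_b))      # same range for both sets
print(statistics.pstdev(set_a), statistics.pstdev(set_b))  # different spread
```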
Standard Deviation & Variance

Definition
Deviation is the difference of each data point from the mean.
Variance is the expectation of the squared deviation of a random variable from its mean.
Standard deviation is the most common measure of how data are spread out from the mean. It is the square root of variance.

Calculation
$$\text{Variance } (\sigma^2) = \frac{\sum (x_i - \mu)^2}{N}$$
$$\text{Standard Deviation } (\sigma) = \sqrt{\text{Variance}} = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$$
where $x_i$ is the ith term in the population, $\mu$ is the mean of the population and $N$ is the number of data points in the population.
* When the above are calculated for a sample, $(n-1)$ replaces $N$ and $\bar{x}$ replaces $\mu$, where $n$ is the sample size. *

Why square the difference in the numerator of the formulae
The total sum of the deviations in the numerator equals zero, so to avoid this problem statisticians square the deviations prior to averaging them.

Note
• If you add a constant to all data points, the variance does not change; however, if you multiply every data point by a constant, the variance is multiplied by the square of that constant.
• They measure only the degree, not the direction, of the variation.

Notation      Variance   Standard Deviation
Sample        s²         s
Population    σ²         σ
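
The population and sample formulas can be sketched in Python as follows. The function names are illustrative, and the data set is borrowed from the Range slide purely as an example. The sketch also checks the two properties of variance under adding a constant and multiplying by a constant.

```python
import math


def population_variance(values):
    """Mean of squared deviations from the population mean (divide by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Squared deviations from the sample mean, divided by (n - 1)."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


data = [4, 8, 1, 6, 6, 2, 9, 3, 6, 9]
mean = sum(data) / len(data)

# The plain deviations cancel out, which is why they are squared.
sum_of_deviations = sum(x - mean for x in data)

sigma = math.sqrt(population_variance(data))   # population standard deviation
s = math.sqrt(sample_variance(data))           # sample standard deviation

shifted = [x + 10 for x in data]   # adding a constant: variance unchanged
scaled = [3 * x for x in data]     # multiplying by 3: variance multiplied by 9

print(population_variance(data), sample_variance(data))
print(population_variance(shifted), population_variance(scaled))
```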
Measure of Position

Measures of position: Percentiles, Quartiles, Z-score

Measures of position are the techniques used to divide a set of data into equal groups.
Percentile

Definition
A percentile (or a centile) is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. Elements in a data set are rank ordered from the smallest to the largest.
The values that divide a rank-ordered set of elements into 100 equal parts are called percentiles.
E.g. the 80th percentile is the value (or score) below which 80% of the observations may be found.
An observation at the 50th percentile would correspond to the median value in the set.
Similarly, deciles are the values that divide a rank-ordered set of elements into 10 equal parts.

Calculation
Step 1: Compute the rank (R) corresponding to the desired percentile P:
$$R = \frac{P}{100} \times (N + 1)$$
where N is the number of data points.
Step 2: If R is an integer, the Pth percentile is the number with rank R.
If R is a non-integer, then define IR and FR as the integer and fraction portions of R respectively. Find the scores with rank IR and with rank IR + 1. Now interpolate using the formula:
$$FR \times (\text{difference between the scores at ranks } IR \text{ and } IR + 1) + \min(\text{score at rank } IR,\ \text{score at rank } IR + 1)$$
Refer to the next slide for an example.

Where to use?
Percentiles are calculated as a means of dividing a distribution of values into 2 or more groups.
They are used to determine where to draw the line between observed values within the distribution.
Example: Calculation of Percentile

We have a data set, which we have rank ordered from smallest to largest. We are interested in finding the 25th percentile.

Score   Rank
3       1
5       2
7       3
8       4
9       5
11      6
13      7
15      8

Step 1: Using the equation $R = \frac{P}{100} \times (N + 1)$ with P = 25 and N = 8, we have R = 2.25.

Step 2: Since R is a non-integer, we have IR = 2 and FR = 0.25.
Now the score corresponding to IR (= 2) is 5 and that of IR + 1 (= 3) is 7.
Interpolating using the equation
FR × (difference between the scores at ranks IR and IR + 1) + MIN(score at rank IR, score at rank IR + 1),
we have 0.25 × (2) + 5 = 5.5

Therefore, the 25th percentile is 5.5
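
The two-step rank method can be sketched in Python as below. The function name `percentile` is illustrative. Clamping the rank to the first or last score when R falls outside 1..N is an assumption added here, since the slides do not cover that case.

```python
import math


def percentile(values, p):
    """Pth percentile using the rank R = P/100 * (N + 1) with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    rank = p / 100 * (n + 1)
    ir = math.floor(rank)   # integer portion of R
    fr = rank - ir          # fraction portion of R
    # Assumption (not in the slides): clamp ranks that fall outside 1..N.
    if ir < 1:
        return ordered[0]
    if ir >= n:
        return ordered[-1]
    lower = ordered[ir - 1]   # score with rank IR (ranks start at 1)
    upper = ordered[ir]       # score with rank IR + 1
    return lower + fr * (upper - lower)


scores = [3, 5, 7, 8, 9, 11, 13, 15]
print(percentile(scores, 25))   # 5.5, as in the worked example
print(percentile(scores, 50))   # 8.5, the median of the data set
```

Other percentile conventions exist, so results from spreadsheet or library functions may differ slightly from this method.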


Quartile

In descriptive statistics, the quartiles of a ranked set of data values are the three points that divide the data set into four equal groups, each group comprising a quarter of the data.

25% | Q1 | 25% | Q2 | 25% | Q3 | 25%

• First Quartile (Q1): Middle number between the smallest number and the median of the data set.
• Second Quartile (Q2): Is the median of the data.
• Third Quartile (Q3): Is the middle value between the median and the highest value of the data set.

Quartiles help us measure how data is distributed in the two arms on either side of the median.
Inter Quartile Range (IQR) measures the middle fifty percent of the data. Mathematically,
IQR = Q3 - Q1
Flow Chart: Calculation Methodology for Quartiles

1. Arrange the data in order of magnitude.
2. Use the median to divide the ordered data set into two halves.
3. Check how many data points there are in the data set.
   3.1 If the no. of data points is even: split the data set exactly in half.
   3.1 If the no. of data points is odd: include the median in both halves.
4. Q1 is the median of the lower half of the data and Q3 is the median of the upper half of the data.
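
The flow chart can be sketched in Python as follows. The function name `quartiles` is illustrative, and the two data sets are borrowed from the median example (with 65 as the first value of the 11-point set, as its ordered list implies).

```python
import statistics


def quartiles(values):
    """Return (Q1, Q2, Q3) by splitting the ordered data at the median."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 0:
        # Even count: split the data set exactly in half.
        lower, upper = ordered[:half], ordered[half:]
    else:
        # Odd count: include the median in both halves.
        lower, upper = ordered[:half + 1], ordered[half:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))


odd_data = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
even_data = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45]

q1, q2, q3 = quartiles(odd_data)
print(q1, q2, q3, "IQR =", q3 - q1)

q1_even, q2_even, q3_even = quartiles(even_data)
print(q1_even, q2_even, q3_even, "IQR =", q3_even - q1_even)
```

Several quartile conventions are in use, so spreadsheet functions may return slightly different values than this median-of-halves method.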
Box Plots

• A box plot is a representation of statistical data in which a rectangle is drawn from the lower quartile (Q1) to the upper quartile (Q3), with a line inside to indicate the median (Q2).
• The lower and upper whiskers are shown as lines extending from either side of the rectangle.
• We can also determine the direction of variation, that is, which way the data sways, hence giving a good overview of the data’s distribution.

[Figure: box plot labelled, from top to bottom: Upper whisker, Upper quartile (Q3), Middle quartile / median (Q2), Lower quartile (Q1), Lower whisker]
Z-Score

Definition
A z-score (or standard score) indicates how many standard deviations an element is from the mean.

Calculation to transform ‘x’ to ‘Z’:
$$Z = \frac{x - \mu}{\sigma}$$

Application
• The z-score helps us understand the distribution of data.
• It helps us determine the exact location of a data point relative to the mean on a standard scale (or score).
• It helps us compare two or more distributions.
Interpretation of Z-score

Z-score          Interpretation
Equal to 0       Element = Mean
Less than 0      Element < Mean
Greater than 0   Element > Mean
Equal to -1      Element is 1 std dev. less than mean
Equal to -2      Element is 2 std dev. less than mean
Equal to 1       Element is 1 std dev. greater than mean
Equal to 2       Element is 2 std dev. greater than mean

[Figure: bell curve over z = -3 to 3 showing 68% of data within ±1, 95% of data within ±2 and 99.7% of data within ±3 standard deviations of the mean]
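
A minimal sketch of the z-score transformation is shown below, using the standard `statistics` module. The function name `z_scores` and the data set are hypothetical; the data is treated as a whole population, so the population standard deviation is used.

```python
import statistics


def z_scores(values):
    """Transform each x to Z = (x - mu) / sigma, treating values as a population."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]


data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data: mean 5, std dev 2

for x, z in zip(data, z_scores(data)):
    if z == 0:
        position = "equal to the mean"
    elif z < 0:
        position = "below the mean"
    else:
        position = "above the mean"
    print(f"x = {x}, Z = {z:+.2f} ({position})")
```

Because z-scores are unit-free, values from two different distributions can be compared once both are transformed this way.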
Measure of Shape

Measures of shape: Skewness, Kurtosis
Skewness

• Skewness speaks of the amount and direction of skew, that is, the deviation of data from horizontal symmetry. Skewness has no units; it is a pure number, like a z-score.
• Skewness can be easily calculated using the SKEW() function in MS-Excel or by using descriptive statistics under Analysis ToolPak.
• Interpretation of the coefficient of skewness is given below.

Mathematically:
$$\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum \left( \frac{x_i - \bar{x}}{s} \right)^3$$
where $x_i$ = ith value of x, $\bar{x}$ is the average, n is the sample size and s is the standard deviation.

Coefficient of Skewness    Interpretation
Less than -1               highly skewed
-1 to -0.5                 moderately skewed
-0.5 to 0                  approx. symmetric
0 to +0.5                  approx. symmetric
+0.5 to +1                 moderately skewed
Greater than +1            highly skewed

Caution: The sample skewness doesn’t necessarily apply to the whole population.
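
The sample skewness coefficient can be sketched in Python as below, assuming the adjusted formula n / ((n - 1)(n - 2)) × Σ((xi - x̄) / s)³ with s as the sample standard deviation, which is the form documented for Excel's SKEW() function. The function names and the interpretation helper are illustrative; the salary data is reused from Example 2.

```python
import statistics


def sample_skewness(values):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s) ** 3)."""
    n = len(values)
    if n < 3:
        raise ValueError("at least 3 data points are required")
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)   # sample standard deviation (n - 1)
    total = sum(((x - x_bar) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * total


def describe_skew(coefficient):
    """Map a coefficient onto the interpretation scale (-1, -0.5, +0.5, +1)."""
    size = abs(coefficient)
    if size <= 0.5:
        return "approx. symmetric"
    direction = "right" if coefficient > 0 else "left"
    level = "moderately" if size <= 1 else "highly"
    return f"{level} skewed ({direction})"


symmetric = [1, 2, 3, 4, 5]
salaries_k = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

print(sample_skewness(symmetric), describe_skew(sample_skewness(symmetric)))
print(sample_skewness(salaries_k), describe_skew(sample_skewness(salaries_k)))
```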
Skewness: Graphs

[Figure: three frequency curves (x vs. Frequency) marking the positions of Mean, Median and Mode: Normal distribution; Negative or Left skewed distribution; Positive or Right skewed distribution]
Kurtosis
• Kurtosis tells us how tall and sharp the central peak is, relative to a standard normal distribution curve.
• It can be easily calculated using the KURT() function in MS Excel.

Note: The equation used by KURT() takes into account the sample size and subtracts 3 from the kurtosis; hence, with this equation, the kurtosis of a normal distribution is 0.
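
The equation itself did not survive on the slide, so the sketch below assumes the sample excess kurtosis formula documented for Excel's KURT() function: n(n + 1) / ((n - 1)(n - 2)(n - 3)) × Σ((xi - x̄) / s)⁴ - 3(n - 1)² / ((n - 2)(n - 3)). The function name and the test data are illustrative.

```python
import statistics


def excess_kurtosis(values):
    """Sample excess kurtosis (assumed KURT()-style formula; normal data gives about 0)."""
    n = len(values)
    if n < 4:
        raise ValueError("at least 4 data points are required")
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)   # sample standard deviation (n - 1)
    fourth_powers = sum(((x - x_bar) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth_powers - correction


flat_data = [1, 2, 3, 4, 5]   # evenly spread values, flatter than a normal curve
print(excess_kurtosis(flat_data))   # negative, i.e. flatter than normal
```

On this scale a positive result indicates a sharper peak than the normal distribution and a negative result a flatter one.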
Kurtosis: Graphs & Interpretation

Kurtosis is typically measured with respect to the normal distribution. A distribution that is peaked in the same way as any normal distribution, not just the standard normal distribution, is said to be mesokurtic.

Platykurtic distributions have a peak lower than a mesokurtic distribution.
• These are characterized by a certain flatness to the peak, and have slender tails.

A leptokurtic distribution has kurtosis greater than a mesokurtic distribution.
• These have peaks that are thin and tall.
• The tails, to both the right and the left, are thick and heavy.

[Figure: three curves compared]
• Leptokurtic (thin): Kurtosis > 3, i.e. positive after subtracting 3
• Mesokurtic: Normal distribution, Kurtosis = 3, i.e. 0 after subtracting 3
• Platykurtic (flat): Kurtosis < 3, i.e. negative after subtracting 3
Back up
Definitions
• Parameter: Parameters are descriptive measures of an entire population used as the inputs for a probability distribution function to generate distribution curves. Parameters are usually signified by Greek letters to distinguish them from sample statistics. Parameters are fixed constants, that is, they do not vary like variables. However, their values are usually unknown because it is infeasible to measure an entire population.
• Error: It usually refers to how much functions, formulas, and statistics fail to fully explain or model a true or theoretical value. In other words, it is the difference between an actual and a predicted value. While some degree of error or uncertainty can exist in statistical analyses, identifying and quantifying it can at least help us explain its presence. Types of error include residual error, standard error of the fits (SE of fits), family error rate, and Type I and Type II error.
• Proportion: A proportion is a relative portion of a whole, as opposed to a count or frequency. Proportions let you compare groups of unequal size. For example, suppose our company manufactures steel. If you were interested in the proportion of CaO (by weight) in the mixture, you would not be concerned with how many tons of CaO your plant uses in a day, but with the portion of CaO relative to the additives. Assuming that precisely a quarter of the additives is CaO, the proportion of CaO could be expressed as a percentage (25%), a decimal (0.25), or a fraction (1/4).
