
Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved

DISCLAIMER

This book is designed to provide information on


Business Analytics and Data Science only. This book
does not contain all information available on the
subject. This book has not been created to be specific
to any individual’s or organizations’ situation or
needs. Every effort has been made to make this book
as accurate as possible. However, there may be
typographical and or content errors. Therefore, this
book should serve only as a general guide and not as
the ultimate source of subject information. This book
contains information that might be dated and is
intended only to educate and entertain. The
management shall have no liability or responsibility to
any person or entity regarding any loss or damage
incurred, or alleged to have incurred, directly or
indirectly, by the information contained in this book.
You hereby agree to be bound by this disclaimer or
you may return this book within a week of receipt of
this book.

Copyright© 2017-2018, DexLab Solutions Corp

All Rights Reserved.

No part of this book may be reproduced or distributed in any form or by any electronic or
mechanical means including information storage and retrieval systems, without permission in
writing from the management.
Contents
1. Introduction to Analytics

2. Concept of Probability

3. Sampling Theory

4. Parametric Tests

5. Association between Variables

6. Concept of ANOVA

7. Factor Analysis

8. Cluster Analysis

9. Linear Regression

10. Logistic Regression

11. Time Series Analysis



CHAPTER 1
Introduction to Analytics
1. Introduction to Analytics
Every decade or so, the business world invents another term for how it
extracts managerial and decision-making value from computerized data. In the
1970s the favoured term was “decision support systems,” accurately reflecting the
importance of a decision-centred approach to data analysis. In the early 80s,
“executive information systems” was the preferred nomenclature, which addressed
the use of these systems by senior managers.
Later in that decade, emphasis shifted to the more technical-sounding “online
analytical processing,” or OLAP. The 90s saw the rise of “business intelligence” as a
descriptor. In the middle of the first decade of the 2000s, “analytics” began to come into
favour, at least for the more statistical and mathematical forms of data analysis.
Each of these terms has its virtues and its ambiguities. No supreme being
has provided us with a clear, concise definition of what anything should be called, so
we mortals will continue to wrestle with appropriate terminology. It appears,
however, that another shift is taking place in the label for how we take advantage of
data to make better decisions and manage organizations. The new label is “business
analytics”.
In one sense, “business analytics” is simply the combination of business
intelligence and analytics. Such a combination reflects the increased importance of
quantitative analysis of data for understanding, prediction, and optimization.
Business intelligence is a vague term that primarily denoted reporting-related
activity—certainly useful, but perhaps somewhat commoditized. “Analytics” in some
organizations could be a somewhat academic activity that lacked clear business
objectives, but that has changed as companies increasingly compete on their
analytical capabilities. After a quick review of the shortcomings of these two terms
by themselves, I’ll provide a definition of the merged term “business analytics.”


2. Definition of Business Analytics


A new term, of course, will not in itself solve any problems with previous
terms or the activities they encompass. However, if a new term denotes a new set of
emphases and meanings, it could influence people and organizations to adopt more
effective behaviour. In that sense, then, “business analytics” can be defined as the
broad use of data and quantitative analysis for decision-making within
organizations. It encompasses query and reporting, but aspires to greater levels of
mathematical sophistication. It includes analytics, of course, but involves
harnessing them to meet defined business objectives. Business analytics empowers
people in the organization to make better decisions, improve processes and achieve
desired outcomes. It brings together the best of data management, analytic
methods, and the presentation of results—all in a closed-loop cycle for continuous
learning and improvement.
Despite the name, business analytics is not restricted to private-sector, profit-
seeking businesses. The meaning of “business” here is that of “an immediate task or
objective,” with analytics being a means to achieve that objective. Governmental
and non-profit organizations can use business analytics to advance their objectives
as well, and in fact many do just that.
As the Wikipedia definition (as of March 11, 2010) of “business analytics”
suggests, “In contrast with Business intelligence, business analytics focuses on
developing new insights and understanding of business performance whereas
business intelligence traditionally focuses on using a consistent set of metrics to
both measure past performance and guide business planning.”
This suggests a greater focus on statistically and mathematically-derived
insights in business analytics. If business intelligence typically stopped at
performance reporting, business analytics encompasses both the reporting of
performance and the attempt to understand and predict it.
We live in a world in which many amazing feats of data manipulation and
algorithmic transformation are possible. The name for these activities might as well
reflect their power and potential. “Business analytics” seems the term with the best
fit, at least for the moment.
It was a textbook problem in the volatile world of retail. In the summer of
2009, a major US music distributor experienced a sudden spike in demand for the
CDs of one of the artists in its back catalog. How could it ramp up production to
meet immediate needs without creating excess inventory in the future?


This company had an edge, however – a longstanding outsourcing


arrangement in place that, over a period of several years, had turned its supply
chain into a finely tuned, industrialized engine. Moreover, the engine came
equipped with a predictive analytics capability that could do more than just tell the
company that manufacturing could not meet demand.
The analytics could also identify how to solve the problem, leveraging
insights across multiple domains beyond the supply chain, including finance and
CRM.
In short order, the music distributor had pinpointed the source of greatest
demand. The company was able to make informed and timely decisions about where to
boost production and about the most cost-effective and profitable locations in which to
hold inventory.

Net Result:
Instead of the costly alternative of increasing production everywhere in the
world—which would have resulted in excess inventory and expensive shipping to
redistribute products later—the company had an accurate, just-in-time, precisely
targeted delivery approach that met customers’ needs for a dramatically lower cost.

Example: How do grocery cashiers know to hand you coupons you might
actually use?
Each Tuesday, you head to the grocery store and fill up your cart. The
cashier scans your items, then hands you a coupon for 50 cents off your favourite
brand of whole-grain cereal, which you didn't get today but were planning to buy
next week.
With hundreds of thousands of grocery items on the shelves, how do stores
know what you're most likely to buy? Computers using predictive analytics are
able to crunch terabytes and terabytes of a consumer's historical purchases to
figure out that your favourite whole-grain cereal was the one item missing from
your shopping basket that week. Further, the computer matches your past cereal
history to ongoing promotions in the store, and bingo - you receive a coupon for
the item you are most likely to buy.


Example: Why were the Oakland A's so successful in the early 2000s,
despite a low payroll?
During the early 2000s, the New York Yankees were the most acclaimed
team in Major League Baseball. But on the other side of the continent, the
Oakland A's were racking up success after success, with much less fanfare - and
much less money.
While the Yankees paid its star players tens of millions, the A's managed to
be successful with a low payroll. How did they do it? When signing players, they
didn't just look at basic productivity values such as RBIs, home runs, and earned-
run averages. Instead, they analysed hundreds of detailed statistics from every
player and every game, attempting to predict future performance and production.
Some statistics were even obtained from videos of games using video recognition
techniques. This allowed the team to sign great players who may have been
lesser-known, but who were equally productive on the field. The A's started a
trend, and predictive analytics began to penetrate the world of sports with a
splash, with copycats using similar techniques. Perhaps predictive analytics will
someday help bring Major League salaries into line.

Descriptive analytics mines data to provide business insights.


Example: How does Netflix frequently recommend just the right movie?
Netflix has tens of millions of users, each with their own movie preferences.
Let's say a user watched two movies this past weekend, and they were both
dramas. Across the Netflix universe, many other people watched similar dramas
to the ones he or she chose. Then, next Saturday, one of these other users chooses
a third movie, which might also be a drama. Based on this information about user
preferences, Netflix will predict that the first user would likely want to watch a
drama that's similar to the third movie chosen by others with similar tastes.
Netflix uses descriptive analytics to find correlations among different
movies that subscribers rent. Movies have many attributes, including genre,
rating, length, subject matter, and so on. With so many users and so many
attributes across Netflix's spectrum, obtaining a recommendation for a single
individual within seconds is a daunting task. But analytics helps confine the
universe of movies' attributes to a small number, while still capturing most of the
relationships needed to build a set of preference data. Descriptive analytics helps
companies like Netflix make sense of the millions of choices its users make every
day.


Example: Why is it better to charge an electric vehicle overnight and not


during the day?
We know that electricity prices are the highest during times of peak energy
demand. But when are those times? Intuition might tell us that peak demand
often happens in the late afternoon, when everyone comes home from their day
and turns on lights, air conditioners, washing machines, televisions, and
computers. It might also tell us the lowest demand occurs while we sleep at night.
Descriptive analytics examines historical electricity usage data to confirm our
suspicions. This type of data analysis also helps electric companies set prices,
which sometimes means that electricity rates during the low-demand night time
hours might even be negative! What a great time to charge your electric vehicle -
you're not only reducing pollution, but you even get paid for using electricity to
charge your car. As electric vehicles gain traction in the marketplace, it will be
interesting to see if they create a second peak during the night. Data analytics
will help us find the answer.

Example: Why do airline prices change every hour?


Basic economics teaches us that higher demand drives higher prices.
Following this logic, if a customer knows when the least desirable flying days and
times are, he or she would easily know when to book the cheapest airline tickets
for their next vacation. However, airlines have a leg up on consumers when it
comes to this information; they use prescriptive analytics to sift through millions
and millions of flight itineraries instantaneously. They then use this data to set
an optimal price at any given time based on supply and demand, thus maximizing
their profits.
Prescriptive analytics helps airlines squeeze every possible dollar out of
passengers' wallets, ensuring that higher prices are charged during higher
periods of demand. Airlines even take calculated gambles by deliberately
withholding cheap fares during low-demand times in anticipation of future,
higher-paying passengers. Analytics is key in helping industries like airlines
ensure that their pricing structures are working optimally to contribute to
bottom-line results.


Example: Why does Facebook often find your acquaintance as potential


friends?
Imagine that every Facebook user is represented by a dot. Now imagine
drawing a straight line between every Facebook user and each of his or her
friends. With over 750 million Facebook users, we would have a pretty chaotic
map of intersecting lines.
This is where prescriptive analytics comes in - to create order out of the
chaos and help Facebook recommend the right friends for each user. It works like
this: if you and your friend John Doe have many friends in common, then John
and your lines have common endpoints. Therefore, if John has a friend who is not
on your list of Facebook friends, it is very likely that you know that person.
Prescriptive analytics facilitates the scanning of billions of such lines to determine
possible missing friendships. So thanks to analytics, we're all able to find our
school buddies or long-lost childhood friends.

3. Types of Analytics
Descriptive: analytics that use data aggregation and data mining techniques to provide
insight into the past and answer: “What has happened?”

Descriptive analysis, or descriptive statistics, does exactly what the name implies: it
“describes”, or summarizes, raw data and turns it into something that is interpretable by
humans. These are analytics that describe the past. The past refers to any point of time at
which an event has occurred, whether it is one minute ago or one year ago. Descriptive
analytics are useful because they allow us to learn from past behaviours and understand
how they might influence future outcomes.

Use Descriptive statistics when you need to understand at an aggregate level


what is going on in your company, and when you want to summarize and describe
different aspects of your business.

Predictive: analytics that use statistical models and forecasting techniques to understand
the future and answer: “What could happen?”
Predictive analytics has its roots in the ability to “Predict” what might
happen. These analytics are about understanding the future. Predictive analytics
provides companies with actionable insights based on data. Predictive analytics
provide estimates about the likelihood of a future outcome. It is important to
remember that no statistical algorithm can “predict” the future with 100%


certainty. Companies use these statistics to forecast what might happen in the
future. This is because the foundation of predictive analytics is based on
probabilities.

Use Predictive analysis any time you need to know something about the
future, or fill in the information that you do not have.

Prescriptive: analytics that use optimization and simulation algorithms to advise on
possible outcomes and answer: “What should we do?”

The relatively new field of prescriptive analytics allows users to “prescribe” a number of
different possible actions and guides them towards a solution. In a nutshell, these
analytics are all about providing advice. Prescriptive analytics attempt to quantify the
effect of future decisions in order to advise on possible outcomes before the decisions are
actually made. At their best, prescriptive analytics predict not only what will happen, but
also why it will happen, providing recommendations regarding actions that will take
advantage of the predictions.

Use prescriptive statistics anytime you need to provide users with advice on
what action to take.

Figure 1: Different Stages of Analytics


4. Confirmatory vs Exploratory Data Analysis

Confirmatory Analysis (Inferential Statistics – Deductive Approach)
● Heavy reliance on probability models
● Must accept untestable assumptions
● Look for definite answers to specific questions
● Emphasis on numerical calculations
● Hypotheses determined at outset
● Hypothesis tests and formal confidence interval estimation

Exploratory Analysis (Descriptive Statistics – Inductive Approach)
● Look for flexible ways to examine data without preconceptions
● Attempt to evaluate validity of assumptions
● Heavy reliance on graphical displays
● Let data suggest questions
● Focus on indications and approximate error magnitudes

Figure 2: Deductive and Inductive approach


5. Scales of Measurement
Nominal
Let’s start with the easiest one to understand. Nominal scales are used for
labelling variables, without any quantitative value. “Nominal” scales could simply
be called “labels.” Here are some examples, below. Notice that all of these scales are
mutually exclusive (no overlap) and none of them have any numerical significance.
A good way to remember all of this is that “nominal” sounds a lot like “name” and
nominal scales are kind of like “names” or labels.

Ex: Nominal Scale Of Measurement

Figure 3: Nominal Scale of Measurement

Ordinal
With ordinal scales, it is the order of the values that is important and
significant, but the differences between each value are not really known.
Take a look at the example below. In each case, we know that a #4 is better
than a #3 or #2, but we don’t know–and cannot quantify–how much better it is. For
example, is the difference between “OK” and “Unhappy” the same as the difference
between “Very Happy” and “Happy?” We can’t say.
Ordinal scales are typically measures of non-numeric concepts like
satisfaction, happiness, discomfort, etc.
“Ordinal” is easy to remember because it sounds like “order”, and that’s the key to
remember with “ordinal scales”–it is the order that matters, but that’s all you really
get from these.

Advanced note: The best way to determine central tendency on a set of ordinal
data is to use the mode or median; the mean cannot be defined from an ordinal set.


Figure 4: Ordinal Scale of Measurement

Interval
Interval scales are numeric scales in which we know not only the order, but
also the exact differences between the values. The classic example of an interval
scale is Celsius temperature because the difference between each value is the same.
For example, the difference between 60 and 50 degrees is a measurable 10 degrees,
as is the difference between 80 and 70 degrees. Time is another good example of an
interval scale in which the increments are known, consistent, and measurable.
Interval scales are nice because the realm of statistical analysis on these data
sets opens up. For example, central tendency can be measured by mode, median, or
mean; standard deviation can also be calculated.
Like the others, you can remember the key points of an “interval scale” pretty
easily. “Interval” itself means “space in between,” which is the important thing to
remember–interval scales not only tell us about order, but also about the value
between each item.
Here’s the problem with interval scales: they don’t have a “true zero.” For example, there
is no such thing as “no temperature.” Without a true zero, it is impossible to compute
ratios. With interval data, we can add and subtract, but cannot multiply or divide.
Confused? Ok, consider this: 10 degrees + 10 degrees = 20 degrees. No problem there.
20 degrees is not twice as hot as 10 degrees, however, because there is no such thing as
“no temperature” when it comes to the Celsius scale. I hope that makes sense. Bottom
line, interval scales are great, but we cannot calculate ratios, which brings us to our last
measurement scale…

Figure 5: Interval Scale of Measurement


Ratio
Ratio scales are the ultimate nirvana when it comes to measurement scales because they
tell us about the order, they tell us the exact value between units, AND they also have an
absolute zero–which allows for a wide range of both descriptive and inferential statistics
to be applied. At the risk of repeating myself, everything above about interval data applies
to ratio scales + ratio scales have a clear definition of zero. Good examples of ratio
variables include height and weight.
Ratio scales provide a wealth of possibilities when it comes to statistical analysis. These
variables can be meaningfully added, subtracted, multiplied, divided (ratios). Central
tendency can be measured by mode, median, or mean; measures of dispersion, such as
standard deviation and coefficient of variation, can also be calculated from ratio scales.

Figure 6: Ratio Scale of Measurement

SUMMARY
In summary, nominal variables are used to “name,” or label a series of
values. Ordinal scales provide good information about the order of choices, such as
in a customer satisfaction survey. Interval scales give us the order of values + the
ability to quantify the difference between each one. Finally, Ratio scales give us the
ultimate–order, interval values, plus the ability to calculate ratios since a “true
zero” can be defined.

Table 1: Different Scales of Measurement


6. Attribute
Attributes are qualitative characters that cannot be numerically expressed.
Individuals possessing an attribute can be grouped into several disjoint classes.
Attributes may be of two types: ordinal and nominal.
Qualitative data are nonnumeric.
{Poor, Fair, Good, Better, Best}, colors (ignoring any physical causes), and
types of material {straw, sticks, bricks} are examples of qualitative data.
Qualitative data are often termed categorical data. Some books use the
terms individual and variable to reference the objects and characteristics
described by a set of data. They also stress the importance of exact definitions of
these variables, including what units they are recorded in. The reason the data
were collected is also important.

7. Variable
The term variable means a character of an item or an individual that can be
expressed in numeric terms. It is also called a quantitative character, and such
characters can be measured or counted.
Quantitative data are numeric.
Quantitative data are further classified as either discrete or continuous.

Discrete data are numeric data that have a finite number of possible values.

● A classic example of discrete data is a finite subset of the counting numbers,


{1,2,3,4,5} perhaps corresponding to {Strongly Disagree... Strongly Agree}.

● Another classic is the spin or electric charge of a single electron. Quantum


Mechanics, the field of physics which deals with the very small, is much
concerned with discrete values.

● When data represent counts, they are discrete. An example might be how
many students were absent on a given day. Counts are usually considered
exact and integer. Consider, however, if three tardies make an absence, then
aren't two tardies equal to 0.67 absences?


Continuous data represents infinite values (real numbers) in a given interval, so


that they can represent any intermediate value, in theory at least, in their range
of variation. For instance; size, weight, blood pressure, temperature. Have infinite
possibilities: 1.4, 1.41, 1.414, 1.4142, 1.41421...

The real numbers are continuous with no gaps or interruptions. Physically


measureable quantities of length, volume, time, mass, etc. are generally considered
continuous. At the physical level (microscopically), especially for mass, this may not
be true, but for normal life situations it is a valid assumption.

Graphical Representation of Qualitative Data

● Textual
● Tabular
● Bar chart (simple bar chart, multiple bar chart)
● Pie diagram

Figure 7: Simple Bar Diagram
Figure 8: Multiple Bar Diagram

Graphical Representation of Quantitative Data

Discrete Variable
● Column Diagram
● Frequency Polygon

Continuous Variable
● Frequency Polygon
● Histogram

Figure 9: Pie Chart
Figure 10: Histogram
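These chart types are easy to reproduce programmatically. Below is a minimal Python
sketch (assuming matplotlib is installed; the category counts and measurements are
made-up illustrative values, not data from this chapter):

# Illustrative only: the counts and measurements below are invented for the sketch.
import matplotlib.pyplot as plt

materials = ["Straw", "Sticks", "Bricks"]          # a qualitative (nominal) variable
counts = [12, 7, 21]                               # frequency of each category

heights_cm = [152, 155, 158, 160, 161, 163, 165, 165,
              168, 170, 171, 174, 178, 180, 185]   # a continuous variable

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.bar(materials, counts)                         # simple bar chart for categories
ax1.set_title("Simple bar chart")
ax2.hist(heights_cm, bins=5)                       # histogram for continuous data
ax2.set_title("Histogram")
plt.show()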


8. Measures of Central Tendency


A measure of central tendency is a single value that attempts to describe a
set of data by identifying the central position within that set of data. As such,
measures of central tendency are sometimes called measures of central location.
They are also classed as summary statistics. The mean (often called the average) is
most likely the measure of central tendency that you are most familiar with, but
there are others, such as the median and the mode.
The mean, median and mode are all valid measures of central tendency, but
under different conditions, some measures of central tendency become more
appropriate to use than others. In the following sections, we will look at the mean,
mode and median, and learn how to calculate them and under what conditions they
are most appropriate to be used.

Mean (Arithmetic)
The mean (or average) is the most popular and well known measure of
central tendency. It can be used with both discrete and continuous data, although
its use is most often with continuous data (see our Types of Variable guide for data
types). The mean is equal to the sum of all the values in the data set divided by the
number of values in the data set.
So, if we have n values in a data set and they have values x1, x2, ..., xn, the
sample mean, usually denoted by x̄ (pronounced "x bar"), is:

x̄ = (x1 + x2 + ... + xn) / n

This formula is usually written in a slightly different manner using the
Greek capital letter Σ, pronounced "sigma", which means "sum of...":

x̄ = (Σ xi) / n

If we are working with a continuous data set and the values are given in
intervals, we calculate the mid-point of the interval then multiply it with the
frequency for calculating arithmetic mean.


Items        0-10   10-20   20-30   30-40
Frequency       2       5       1       3

Items    Mid-point (m)    Frequency (f)    f×m
0-10             5                2         10
10-20           15                5         75
20-30           25                1         25
30-40           35                3        105
Total                         N = 11   Σfm = 215

Based on the above-mentioned formula, the Arithmetic Mean is:

x̄ = Σfm / N = 215 / 11 = 19.54

The Arithmetic Mean of the given distribution is 19.54.
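The same grouped-data calculation can be checked with a few lines of Python, using only
the class intervals and frequencies from the table above (a sketch, not part of the
original text):

# Grouped arithmetic mean: mean = (sum of f*m) / N, where m is the class mid-point.
intervals = [(0, 10), (10, 20), (20, 30), (30, 40)]
freq = [2, 5, 1, 3]

midpoints = [(lo + hi) / 2 for lo, hi in intervals]      # 5, 15, 25, 35
total_fm = sum(f * m for f, m in zip(freq, midpoints))   # 215
n = sum(freq)                                            # 11

print(total_fm / n)   # 19.5454..., reported as 19.54 in the worked example above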

Mean (Geometric)
The arithmetic mean is relevant any time several quantities add together to
produce a total. The arithmetic mean answers the question, "if all the quantities
had the same value, what would that value have to be in order to achieve the same
total?"
In the same way, the geometric mean is relevant any time several quantities
multiply together to produce a product. The geometric mean answers the question,
"if all the quantities had the same value, what would that value have to be in order
to achieve the same product?"
Let us calculate geometric mean on a discrete data set:
Given xi = 4, 9
Here, n = 2


The geometric mean formula for n numbers is

GM = (x1 × x2 × ... × xn)^(1/n), i.e. the n-th root of the product of the values.

Substituting xi = 4, 9 and n = 2:

GM = (4 × 9)^(1/2) = √36 = 6

Hence the geometric mean of the two numbers (4, 9) is 6.
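In Python (3.8 or later), the same value can be obtained either directly from the
definition or with the standard-library helper statistics.geometric_mean (a small sketch):

# Geometric mean: the n-th root of the product of the values.
import math
from statistics import geometric_mean   # available from Python 3.8

xs = [4, 9]
manual = math.prod(xs) ** (1 / len(xs))  # (4 * 9) ** (1/2)
print(manual, geometric_mean(xs))        # both are 6.0 (up to floating point)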

Mean (Harmonic)
Harmonic mean is another measure of central tendency, and like the arithmetic mean and
geometric mean it rests on a mathematical footing. Like the arithmetic mean and geometric
mean, the harmonic mean is also useful for quantitative data. The harmonic mean is defined
in the following terms:

Harmonic mean is quotient of “number of the given values”


and “sum of the reciprocals of the given values”.

Example: To find the Harmonic Mean of 1,2,3,4,5.


The total number of values. N = 5
Now find Harmonic Mean using the above formula.
N/(1/a1+1/a2+1/a3+1/a4+.......+1/aN)
= 5/(1/1+1/2+1/3+1/4+1/5)
= 5/(1+0.5+0.33+0.25+0.2)
= 5/2.28 So,
Harmonic Mean = 2.19. This example will guide you to calculate the harmonic
mean manually.
Simple Harmonic Mean = n / Σ(1/xi), the sum running over i = 1, ..., n.


In case of data with frequency,

Harmonic Mean = (Σ fi) / (Σ fi/xi), both sums running over i = 1, ..., n.

When not to use the mean?


The mean has one main disadvantage: it is particularly susceptible to the
influence of outliers. These are values that are unusual compared to the rest of the
data set by being especially small or large in numerical value. For example,
consider the wages of staff at a factory below:

Staff 1 2 3 4 5 6 7 8 9 10

Salary 15k 18k 16k 14k 15k 15k 12k 17k 90k 95k

The mean salary for these ten staff is $30.7k. However, inspecting the raw
data suggests that this mean value might not be the best way to accurately reflect
the typical salary of a worker, as most workers have salaries in the $12k to 18k
range. The mean is being skewed by the two large salaries. Therefore, in this
situation, we would like to have a better measure of central tendency. As we will
find out later, taking the median would be a better measure of central tendency in
this situation.
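The effect is easy to verify with the salary figures from the table above (a small Python
sketch, salaries in thousands of dollars):

from statistics import mean, median

salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]   # the table above, in $k

print(mean(salaries))     # 30.7  -> pulled upward by the two large salaries
print(median(salaries))   # 15.5  -> much closer to a "typical" salary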

Median
The median is the middle score for a set of data that has been arranged in
order of magnitude. The median is less affected by outliers and skewed data. In
order to calculate the median, suppose we have the data below:

65 55 89 56 35 14 56 55 87 45 92

We first need to rearrange that data into order of magnitude (smallest first):

14 35 45 55 55 56 56 65 87 89 92


Our median mark is the middle mark - in this case, 56.
It is the middle mark because there are 5 scores before it and 5 scores after it. This
works fine when you have an odd number of scores, but what happens when you
have an even number of scores? What if you had only 10 scores? Well, you simply
have to take the middle two scores and average the result. So, if we look at the
example below:

65 55 89 56 35 14 56 55 87 45

We again rearrange that data into order of magnitude (smallest first):

14 35 45 55 55 56 56 65 87 89

Only now we have to take the 5th and 6th score in our data set and average them to
get a median of 55.5.

When we work with continuous data, the method of calculation changes.


Let’s discuss an example with continuous data:
The following distribution represents the number of minutes spent per week
by a group of teenagers in going to the movies. Find the median number of minutes
spent per week by the teenagers in going to the movies.

Number of minutes per week Number of teenagers

0-99 26

100-199 32

200-299 65

300-399 75

400-499 60

500-599 42

Let us convert the class intervals given, to class boundaries and construct the less
than type cumulative frequency distribution.


Number of Minutes per Week    Class Boundaries    Number of Teenagers (Frequency)    Cumulative Frequency

0-99 0-99.5 26 26

100-199 99.5-199.5 32 58

200-299 199.5-299.5 65 123

300-399 299.5-399.5 75 198

400-499 399.5-499.5 60 258

500-599 499.5-599.5 42 300

Here, N/2 = 300/2 = 150.

The cumulative frequency just greater than or equal to 150 is 198, i.e. Fk = 198, where
Fk is the cumulative frequency corresponding to the class boundary xk = 399.5. Therefore
the median class is the class whose upper class boundary is xk = 399.5. In other words,
299.5-399.5 is the median class, i.e. the class containing the median value.

Using the formula for the median of grouped data we have

Median = xl + ((N/2 − Fl) / fm) × c

where xl = 299.5 (lower class boundary of the median class), N = 300 (total frequency),
Fl = 123 (less than type cumulative frequency corresponding to xl = 299.5), fm = 75
(frequency of the median class), and c = 100 (class width of the median class).

Median = 299.5 + ((150 − 123) / 75) × 100 = 299.5 + (27/75) × 100 = 299.5 + 36 = 335.5


So, the median number of minutes spent per week by this group of 300 teenagers in going
to the movies is 335.5, i.e. there are 150 teenagers for whom the number of minutes spent
per week in going to the movies is less than 335.5 and there are another 150 teenagers
for whom the number of minutes spent per week in going to the movies is greater than 335.5.

Figure 11: Showing median of the bell shaped curve
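The grouped-median formula used above is straightforward to turn into a small function.
The sketch below assumes the class boundaries and frequencies are supplied exactly as in
the table:

# Median for grouped data: median = xl + ((N/2 - Fl) / fm) * c
# xl = lower boundary of the median class, Fl = cumulative frequency below it,
# fm = frequency of the median class, c = class width.
def grouped_median(boundaries, freqs):
    n = sum(freqs)
    cum = 0
    for (lo, hi), f in zip(boundaries, freqs):
        if cum + f >= n / 2:                            # first class whose cumulative
            return lo + ((n / 2 - cum) / f) * (hi - lo) # frequency reaches N/2
        cum += f

boundaries = [(0, 99.5), (99.5, 199.5), (199.5, 299.5),
              (299.5, 399.5), (399.5, 499.5), (499.5, 599.5)]
freqs = [26, 32, 65, 75, 60, 42]

print(grouped_median(boundaries, freqs))                # 335.5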

Mode
The mode is the most frequent score in our data set. It corresponds to the highest bar in
a bar chart or histogram. You can, therefore, sometimes consider the mode as being the
most popular option. Normally, the mode is used for categorical data where we wish to know
which is the most common category, as illustrated in the example below.
Considering an example-
The number of points scored in a series of football
games is listed below. Which score occurred most
often?
7, 13, 18, 24, 9, 3, 18
Solution: Ordering the scores from least to
greatest, we get: 3, 7, 9, 13, 18, 18, 24
Answer: The score which occurs most often is 18.
Figure 12: Highlighting the MODE
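For ungrouped data like this, Python's statistics module gives the mode directly
(a quick sketch; multimode lists every value tied for the highest frequency):

from statistics import mode, multimode

scores = [7, 13, 18, 24, 9, 3, 18]
print(mode(scores))        # 18
print(multimode(scores))   # [18]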
This was for ungrouped or discrete data; when we look at grouped data, the calculations
are modified.



9. Measures of Dispersion
The values of a variable are generally not equal. In some cases the values are
very close to one another; again, in some cases they are markedly different from one
another. In order to get a proper idea about the overall nature of a given set of
values, it is necessary to know, besides average, the extent to which the given
values differ among themselves or equivalently how they are scattered about the
average. This feature of frequency distribution which represents the variability of
the given values or reflects how scattered the values are, is called its dispersion.

Absolute Measures of Dispersion

The Range
The Range is the difference between the lowest and highest values.


Example: In {4, 6, 9, 3, 7} the lowest value is 3, and the highest is 9.

So the range is 9 − 3 = 6.

Range = Highest Observation − Lowest Observation

For another data set where the Lowest Observation = 96 and the Highest Observation = 129,
the Range = 129 − 96 = 33.

It is that simple!
But perhaps too simple ...

The Range Can Be Misleading:


The range can sometimes be misleading when there are extremely high or low
values.

Example: In {8, 11, 5, 9, 7, 6, 3616}:


● the lowest value is 5,
● and the highest is 3616,
So the range is 3616-5 = 3611.
The single value of 3616 makes the range large, but most values are around 10.

Standard Deviation and Variance


Deviation just means how far from the normal

Standard Deviation
The Standard Deviation is a measure of how spread out numbers are.
Its symbol is σ (the Greek letter sigma)
The formula is easy: it is the square root of the Variance. So now you ask,
"What is the Variance?"


Variance
The Variance is defined as:

The average of the squared differences from the Mean

To calculate the variance follow these steps:


● Work out the Mean (the simple average of the numbers)
● Then for each number: subtract the Mean and square the result (the squared
difference).
● Then work out the average of those squared differences.
Example:
You and your friends have just measured the heights of your dogs (in millimeters):

The heights (at the shoulders) are: 600mm, 470mm, 170mm, 430mm and 300mm.
Find out the Mean, the Variance, and the Standard Deviation.


Your first step is to find the Mean:

Answer:

Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394

so the mean (average) height is 394 mm. Let's plot this on the chart:

Now we calculate each dog's difference from the Mean: 206, 76, −224, 36 and −94.

To calculate the Variance, take each difference, square it, and then average the result:

Variance: σ² = (206² + 76² + (−224)² + 36² + (−94)²) / 5 = 108,520 / 5 = 21,704

So, the Variance is 21,704.

And the Standard Deviation is just the square root of Variance, so:
Standard Deviation: σ = √21,704 = 147.32... = 147 (to the nearest mm)


And the good thing about the Standard Deviation is that it is useful. Now we can
show which heights are within one Standard Deviation (147mm) of the Mean:

So, using the Standard Deviation we have a "standard" way of knowing what is
normal, and what is extra large or extra small.
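Reproducing the dog-height example in Python takes only a few lines (a sketch that uses
the population variance, i.e. dividing by N, exactly as in the worked example above):

heights = [600, 470, 170, 430, 300]                               # heights in mm

mean = sum(heights) / len(heights)                                # 394.0
variance = sum((h - mean) ** 2 for h in heights) / len(heights)   # 21704.0
std_dev = variance ** 0.5                                         # 147.32...

print(mean, variance, round(std_dev))                             # 394.0 21704.0 147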

Conceptual Difference between Standard Deviation and


Variance
The variance of a data set measures the mathematical dispersion of the data
relative to the mean. However, though this value is theoretically correct, it is
difficult to apply in a real-world sense because the values used to calculate it were
squared. The standard deviation, as the square root of the variance, gives a value
that is in the same units as the original values, which makes it much easier to work
with and easier to interpret in conjunction with the concept of the normal curve.
Mean Deviation
The mean of the distances of each value from their mean.
Yes, we use "mean" twice: Find the mean ... use it to work out distances ...
then find the mean of those!
Mean Deviation = Σ|x − μ| / N
Three steps:
● Find the mean of all values
● Find the distance of each value from that mean (subtract the mean from
each value, ignore minus signs)
● Then find the mean of those distances
Like this


Example: the Mean Deviation of 3, 6, 6, 7, 8, 11, 15, 16

Step 1: Find the mean:


Mean = (3 + 6 + 6 + 7 + 8 + 11 + 15 + 16) / 8 = 72 / 8 = 9

Step 2: Find the distance of each value from that mean:

Value Distances from 9

3 6

6 3

6 3

7 2

8 1

11 2

15 6

16 7

Which looks like this:


Step 3: Find the mean of those distances:


Mean Deviation = (6 + 3 + 3 + 2 + 1 + 2 + 6 + 7) / 8 = 30 / 8 = 3.75

So, the mean = 9, and the mean deviation = 3.75


It tells us how far, on average, all values are from the middle.
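The same three steps translate directly into a short Python sketch:

values = [3, 6, 6, 7, 8, 11, 15, 16]

mean = sum(values) / len(values)                               # step 1: 9.0
mean_dev = sum(abs(v - mean) for v in values) / len(values)    # steps 2 and 3

print(mean, mean_dev)                                          # 9.0 3.75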

10. Quartiles
Quartiles are the values that divide a list of numbers into quarters.
● First put the list of numbers in order
● Then cut the list into four equal parts
● The Quartiles are at the "cuts"
Like this
Example: 24, 25, 26, 27, 30, 32, 40, 44, 50, 52, 55, 57
Cut the list into quarters:

And the result is:


● Quartile 1 (Q1) = 26.5
● Quartile 2 (Q2), which is also the Median = 36
● Quartile 3 (Q3) = 51
Sometimes a "cut" is between two numbers ... the Quartile is the average of the two
numbers.
Example: 1, 3, 3, 4, 5, 6, 6, 7, 8, 8
The numbers are already in order
Cut the list into quarters:
In this case Quartile 2 is half way between 5 and 6:
Q2 = (5+6)/2 = 5.5


And the result is:


● Quartile 1 (Q1) = 3
● Quartile 2 (Q2) = 5.5
● Quartile 3 (Q3) = 7
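The "cut the list into quarters" rule above is the median-of-each-half rule, which is easy
to code by hand. Note that library routines such as statistics.quantiles use slightly
different interpolation rules and may return slightly different numbers; the sketch below
follows the rule used in this text:

from statistics import median

def quartiles(data):
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]             # values below the median position
    upper = xs[half + n % 2:]     # values above it (skip the middle one when n is odd)
    return median(lower), median(xs), median(upper)

print(quartiles([24, 25, 26, 27, 30, 32, 40, 44, 50, 52, 55, 57]))
# (26.5, 36.0, 51.0)
print(quartiles([1, 3, 3, 4, 5, 6, 6, 7, 8, 8]))
# (3, 5.5, 7)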

11. Interquartile Range


The "Interquartile Range" is from Q1 to Q3:

To calculate it just subtract Quartile 1 from Quartile 3, like this:


Example: for a data set with Q1 = 64 and Q3 = 77,

The Interquartile Range is: Q3 − Q1 = 77 − 64 = 13

12. Skewness and Kurtosis


A fundamental task in many statistical analyses is to characterize the
location and variability of a data set. A further characterization of the data
includes skewness and kurtosis.
Skewness is a measure of symmetry, or more precisely, the lack of symmetry.
A distribution, or data set, is symmetric if it looks the same to the left and right of
the center point.


Kurtosis is a measure of whether the data are heavy-tailed or light-tailed


relative to a normal distribution. That is, data sets with high kurtosis tend to have
heavy tails, or outliers. Data sets with low kurtosis tend to have light tails, or lack
of outliers. A uniform distribution would be the extreme case.
The histogram is an effective graphical technique for showing both the
skewness and kurtosis of data set.
Example for Skewness:
Here are grouped data for heights of 100 randomly selected male students,
adapted from Spiegel and Stephens (1999, 68).

Figure 13: Descriptive Graph

A histogram shows that the data are skewed left, not symmetric.
But how highly skewed are they, compared to other data sets? To answer
this question, you have to compute the skewness.
Begin with the sample size and sample mean. (The sample size was given,
but it never hurts to check.)
n = 5+18+42+27+8 = 100
x̄ = (61×5 + 64×18 + 67×42 + 70×27 + 73×8) ÷ 100
x̄ = (305 + 1152 + 2814 + 1890 + 584) ÷ 100
x̄ = 6745 ÷ 100 = 67.45
Now, with the mean in
hand, you can compute the
skewness. (Of course in real life
you’d probably use Excel or a
statistics package, but it’s good
to know where the numbers come
from.)


Finally, the skewness is

g1 = m3 / m2^(3/2) = −2.6933 / 8.5275^(3/2) = −0.1082

But wait, there’s more! That would be the skewness if you had data for the
whole population. But obviously there are more than 100 male students in the
world, or even in almost any school, so what you have here is a sample, not the
population. You must compute the sample skewness:

G1 = g1 × √(n(n−1)) / (n−2) = [√(100×99) / 98] × [−2.6933 / 8.5275^(3/2)] = −0.1098
If skewness is positive, the data are positively skewed or skewed right, meaning
that the right tail of the distribution is longer than the left. If skewness is negative,
the data are negatively skewed or skewed left, meaning that the left tail is longer.
If skewness = 0, the data are perfectly symmetrical. But a skewness of exactly
zero is quite unlikely for real-world data, so how can you interpret the
skewness number? Bulmer (1979) — a classic — suggests this rule of thumb:
● If skewness is less than −1 or greater than +1, the distribution is highly
skewed.
● If skewness is between −1 and −½ or between +½ and +1, the distribution is
moderately skewed.
● If skewness is between −½ and +½, the distribution is approximately
symmetric.
With a skewness of −0.1098, the sample data for student heights are approximately
symmetric.

Figure 14: Skewness


Example for Kurtosis: Let’s continue with the example of the college men’s
heights, and compute the kurtosis of the data set. n = 100, x̄ = 67.45 inches, and the
variance m2 = 8.5275 in² were computed earlier.

Finally, the kurtosis is


a4 = m4 / m2² = 199.3760/8.5275² = 2.7418
and the excess kurtosis is
g2 = 2.7418−3 = −0.2582
But this is a sample, not the population, so you have to compute the sample excess
kurtosis:
G2 = [(n−1) / ((n−2)(n−3))] × [(n+1)g2 + 6] = [99 / (98×97)] × [101×(−0.2582) + 6] = −0.2091
This sample is slightly platykurtic: its peak is just a bit shallower than the peak
of a normal distribution.

Rule of thumb
● Kurtosis < 3, Platykurtic

● Kurtosis = 3, Mesokurtic

● Kurtosis > 3, Leptokurtic

Figure 15: Kurtosis
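If SciPy is available, both sample statistics computed above can be reproduced in a couple
of lines; passing bias=False asks for the sample-adjusted versions G1 and G2 rather than
g1 and g2 (a sketch, assuming a standard scipy installation):

from scipy.stats import skew, kurtosis

# Expand the grouped height data: 5 students at 61 inches, 18 at 64, and so on (n = 100).
heights = [61] * 5 + [64] * 18 + [67] * 42 + [70] * 27 + [73] * 8

print(round(skew(heights, bias=False), 4))                    # about -0.1098
print(round(kurtosis(heights, fisher=True, bias=False), 4))   # about -0.2091 (excess kurtosis)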


13. What are outliers in the data?


In statistics, an outlier is an observation point that is distant from other
observations. An outlier may be due to variability in the measurement or it may
indicate experimental error; the latter are sometimes excluded from the data set.

14. Boxplot
Boxplots are used to better understand how values are spaced out in different
sets of data. When reviewing a boxplot, an outlier is defined as a data point that is
located outside the fences ("whiskers") of the boxplot, e.g. more than 1.5 times the
interquartile range above the upper quartile or below the lower quartile.
This simplest possible box plot displays the
full range of variation (from min to max), the likely
range of variation (the IQR), and a typical value
(the median). Not uncommonly real datasets will
display surprisingly high maximums or surprisingly
low minimums called outliers. John Tukey has provided a precise definition: points
1.5×IQR or more above the third quartile, or 1.5×IQR or more below the first quartile,
are outliers (points 3×IQR or more beyond the quartiles are sometimes called extreme, or
"far out", outliers). The values Q1 − 1.5×IQR and Q3 + 1.5×IQR are the "fences" that mark
off the "reasonable" values from the outlier values. Outliers lie outside the fences.

Figure 16: Boxplot

Example: Find the outliers, if any, for the following data set:
10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7,
14.7, 14.7, 14.9, 15.1, 15.9, 16.4
To find out if there are any outliers, I first have to find the IQR. There are
fifteen data points, so the median will be at position (15 + 1) ÷ 2 = 8.
Then Q2 = 14.6. There are seven data points on either side of the median, so
Q1 is the fourth value in the list and Q3 is the twelfth: Q1 = 14.4 and Q3 = 14.9.
Then IQR = 14.9 – 14.4 = 0.5.
Outliers will be any points below Q1 – 1.5×IQR = 14.4 – 0.75 = 13.65 or
above Q3 + 1.5×IQR = 14.9 + 0.75 = 15.65.
Then the outliers are at 10.2, 15.9, and 16.4.
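The same check can be scripted in Python, with the quartiles taken from the working above:

data = [10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6,
        14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4]

q1, q3 = 14.4, 14.9                  # fourth and twelfth values, as found above
iqr = q3 - q1                        # 0.5
lower_fence = q1 - 1.5 * iqr         # 13.65
upper_fence = q3 + 1.5 * iqr         # 15.65

print([x for x in data if x < lower_fence or x > upper_fence])
# [10.2, 15.9, 16.4]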



CHAPTER 2
Concept of Probability
1. Concept of Probability
Many events can't be predicted with total certainty. The best we can say is
how likely they are to happen, using the idea of probability.

Probability: How likely something is to happen.

Tossing a Coin
When a coin is tossed, there are
two possible outcomes:
 heads (H) or
 tails (T)
We say that the probability of the
coin landing H is ½.
And the probability of the coin
landing T is ½.

Throwing Dice
When a single die is thrown, there are
six possible outcomes: 1, 2, 3, 4, 5, 6. The
probability of any one of them is 1/6.


PROBABILITY
In general

Probability of an event happening = (Number of ways it can happen) / (Total number of outcomes)

Example: the chances of rolling a "4" with a die


Number of ways it can happen: 1 (there is only 1 face with a "4" on it)
Total number of outcomes: 6 (there are 6 faces altogether)
So, the probability = ⅙

Example: There are 5 marbles in a bag: 4 are blue, and 1 is red. What is
the probability that a blue marble gets picked?
Number of ways it can happen: 4 (there are 4 blues)
Total number of outcomes: 5 (there are 5 marbles in total)
So the probability = ⅘=0.8

Probability Line
We can show probability on a Probability Line:

Figure 17: Probability Scale

Probability is Just a Guide


Probability does not tell us exactly what will happen, it is just a guide


Example: Toss a coin 100 times, how many Heads will come up?
Probability says that heads have a ½ chance, so we can expect 50 Heads.
But when we actually try it we might get 48 heads, or 55 heads ... or anything
really, but in most cases it will be a number near 50.
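You can try this experiment yourself with a short simulation (a Python sketch using the
standard random module; each run will give a slightly different count near 50):

import random

tosses = [random.choice("HT") for _ in range(100)]   # 100 fair coin tosses
print(tosses.count("H"))                             # usually close to 50, rarely exactly 50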

2. Words
Some words have special meaning in Probability:

Experiment or Trial: An action where the result is uncertain

Tossing a coin, throwing dice, seeing what pizza people choose are all
examples of experiments.

Sample Space: All the possible outcomes of an experiment

Example: Choosing a card from a deck


There are 52 cards in a deck (not including Jokers)
So the Sample Space is all 52 possible cards: {Ace of Hearts, 2 of Hearts, etc... }
The Sample Space is made up of Sample Points:

Sample Point: Just one of the possible outcomes

Example: Deck of Cards


● the 5 of Clubs is a sample point
● the King of Hearts is a sample point
"King" is not a sample point. As there are 4 Kings that is 4 different sample points.

Event: A single result of an experiment


Example Events:
● Getting a Tail when tossing a coin is an event
● Rolling a "5" is an event.
An event can include one or more possible outcomes:
● Choosing a "King" from a deck of cards (any of the 4 Kings) is an event
● Rolling an "even number" (2, 4 or 6) is also an event

The Sample Space is all possible outcomes.


A Sample Point is just one possible outcome.
And an Event can be one or more of the
possible outcomes.

Figure 18: Sample Space Hey, let's use those words, so you get used to
them:
Example: Alex wants to see
how many times a "double"
comes up when throwing 2
dice.
Each time Alex throws 2 dice is
an Experiment.
It is an Experiment because the
result is uncertain.
The Event Alex is looking for is a
"double", where both dice have the
same number. It is made up of
these 6 Sample Points:
{1,1} {2,2} {3,3} {4,4} {5,5} and {6,6}
The Sample Space is all possible
outcomes (36 Sample Points):
{1,1} {1,2} {1,3} {1,4} ... {6,3} {6,4}
{6,5} {6,6}
These are Alex's Results:

Table 2: Terms of Probability


Experiment    Is it a Double?
{3,4}         No
{5,1}         No
{2,2}         Yes
{6,3}         No
...           ...

After 100 Experiments, Alex has 19 "double" Events.
Is that close to what you would expect?
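Since each of the 6 doubles is one of 36 equally likely outcomes, we would expect about
100 × 6/36 ≈ 16.7 doubles in 100 throws, so Alex's 19 is quite close. A quick simulation
(Python sketch) lets you repeat the experiment:

import random

doubles = sum(random.randint(1, 6) == random.randint(1, 6) for _ in range(100))
print(doubles)   # the count of doubles in 100 throws; typically somewhere around 17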

3. Types of Events
Mutually Exclusive: can't happen at the same time.
Examples:
● Turning left and turning right are Mutually Exclusive (you can't do both at
the same time)
● Tossing a coin: Heads and Tails are Mutually Exclusive
● Cards: Kings and Aces are Mutually Exclusive
What is not Mutually Exclusive:
● Turning left and scratching your head can happen at the same time
● Kings and Hearts, because we can have a King of Hearts!
Like here:

Aces and Kings are Mutually Exclusive (can't be both).
Hearts and Kings are not Mutually Exclusive (can be both).


Probability
Let's look at the probabilities of Mutually Exclusive events. But first, a definition:

Probability of an event happening = (Number of ways it can happen) / (Total number of outcomes)

Example: there are 4 Kings in a deck of 52 cards. What is the probability


of picking a King?
Number of ways it can happen: 4 (there are 4 Kings)
Total number of outcomes: 52 (there are 52 cards in total)
So, the probability = 4/52 =1/13

Mutually Exclusive:
When two events (call them "A" and "B") are Mutually Exclusive it
is impossible for them to happen together:
P(A and B) = 0
"The probability of A and B together equals 0 (impossible)"
But the probability of A or B is the sum of the individual probabilities:
P(A or B) = P(A) + P(B)
"The probability of A or B equals the probability of A plus the probability of B"

Example: A Deck of Cards


In a Deck of 52 Cards:
● the probability of a King is 1/13, so P(King)=1/13
● the probability of an Ace is also 1/13, so P(Ace)=1/13
When we combine those two Events:
● The probability of a card being a King and an Ace is 0 (Impossible)
● The probability of a card being a King or an Ace is (1/13) + (1/13) = 2/13
Which is written like this:
P(King and Ace) = 0
P(King or Ace) = (1/13) + (1/13) = 2/13


Special Notation
Instead of "and" you will often see the symbol ∩ (which is the "Intersection"
symbol used in Venn Diagrams).
Instead of "or" you will often see the symbol ∪ (the "Union" symbol)

Not Mutually Exclusive


Now let's see what happens when events are not Mutually Exclusive.
Example: Hearts and Kings
Hearts and Kings together is only the
King of Hearts:

But Hearts or Kings is:


● all the Hearts (13 of them)
● all the Kings (4 of them)
But that counts the King of Hearts twice!
So we correct our answer, by subtracting the extra "and" part:

16 Cards = 13 Hearts + 4 Kings − the 1 extra King of Hearts


Count them to make sure this works!
As a formula this is:
P(A or B) = P(A) + P(B) − P(A and B)
"The probability of A or B equals the probability of A plus the probability of B
minus the probability of A and B"


Here is the same formula, but using ∪ and ∩:


P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
A Final Example
16 people study French, 21 study Spanish and there are 30 altogether. Work out the
probabilities!
This is definitely a case of not Mutually Exclusive (you can study French AND
Spanish).
Let's say b is how many study both languages:
● people studying French Only must be 16-b
● people studying Spanish Only must be 21-b
And we get:

And we know there are 30 people, so:


(16−b) + b + (21−b) = 30
37 − b = 30
b=7
And we can put in the correct numbers:
So we know all this now:
● P(French) = 16/30
● P(Spanish) = 21/30
● P(French Only) = 9/30
● P(Spanish Only) = 14/30
● P(French or Spanish) = 30/30 = 1
● P(French and Spanish) = 7/30
Lastly, let's check with our formula:
P(A or B) = P(A) + P(B) − P(A and B)
Put the values in:
30/30 = 16/30 + 21/30 − 7/30
Yes, it works!
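The same check can be written as a tiny Python sketch using the counts from this example:

total, french, spanish, both = 30, 16, 21, 7

p_union = french / total + spanish / total - both / total   # P(A) + P(B) - P(A and B)
print(round(p_union, 10))                                   # 1.0 -- everyone studies at least one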


SUMMARY
Mutually Exclusive
● A and B together is impossible: P(A and B) = 0
● A or B is the sum of A and B: P(A or B) = P(A) + P(B)

Not Mutually Exclusive


● A or B is the sum of A and B minus A and B:
P(A or B) = P(A) + P(B) − P(A and B)

Exhaustive Events
When a sample space is divided into some mutually exclusive events such that their union
forms the sample space itself, then such events are called exhaustive events.
OR,
When two or more events from the sample space collectively make up the entire sample
space, they are known as collectively exhaustive events.
OR,
When at least one of the events must compulsorily occur from the list of events, then
they are also known as exhaustive events.

Solved examples of exhaustive events


Let us explain it with a very simple example. After looking at these examples, the reader
will also get a very clear-cut idea about mutually exclusive events.

Example 1: A sample space is given


Sample space = S = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

To understand what exhaustive events are, consider:


Let an event x = {1, 2, 3}
Event y = {4, 5, 6}
Event z = {7, 8, 9, 10}


Solution:
Events x, y and z are mutually exclusive events because
x ∩ y ∩ z = ∅ (no two of them share any outcome).
Now check whether the events are exhaustive events or not. For this, take the union of all the events:
x ∪ y ∪ z = {1, 2, 3} ∪ {4, 5, 6} ∪ {7, 8, 9, 10} = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} = S
Events x, y and z are exhaustive events, because together they form the complete
sample space.

Example 2: A coin is tossed. Tell whether the events are exhaustive


events or not?
Where,
Event X = If Head will appear
Event Y = If tail will appear
Solution:
Both events together are exhaustive events, because one of them must occur during the
conduct of the experiment.

Equally Likely
If all the outcomes of a sample space have the same chance of occurrence, then they are
known as equally likely outcomes. Outcomes are not necessarily equally likely, but in
many experiments we shall assume that they are.

Equally Likely Outcomes Cases


Equally likely assumptions are applied in the following cases;
1. Playing a Card: In a deck, there are 52 ordinary playing cards. All 52
cards are of the same size, and therefore each card is assumed to be an
equally likely outcome, with probability 1/52.
2. Tossing a Coin: When tossing a single coin, there are two possible
outcomes, i.e., head and tail. It is always assumed that both are equally likely
and each has a chance/probability of occurrence of 1/2; if not, the criterion
should be mentioned. In the case of tossing more than one coin, it is assumed that
on all the coins, head and tail are equally likely.
3. Throwing a Die: There are six (6) possible outcomes when rolling a single
die. In this case all six outcomes are assumed to be equally likely,
and each has a probability of occurrence of 1/6.
4. Drawing Balls from a Bag: This is the last case in which the probability of
occurrence is assumed to be equally likely. For example, a ball is selected
randomly from a bag having balls of different colors. In this case it is
assumed that each ball in the bag has an equal chance of being drawn.

Not Equally Likely Outcomes


When a sample space consists of outcomes that don’t have an equal chance of
occurrence, then the resultant outcomes are said to be not equally likely outcomes.

Approaches of Probability – the classical definition:


Let the sample space (denoted by S) be the set of all possible distinct
outcomes of an experiment. Provided all points in S are equally likely, the
probability of some event A is

P(A) = (number of ways the event can occur) / (number of outcomes in S)

For example, when a die is rolled the probability of getting a
2 is 1/6, because one of the six faces is a 2.
The relative frequency definition: The probability of an event is the
proportion (or fraction) of times the event occurs in a very long
(theoretically infinite) series of repetitions of an experiment or process.
For example, this definition could be used to argue that the probability of
getting a 2 from a rolled die is 1/6.

Figure 19: Relative Frequency


Axiomatic Probability Theory


Axiomatic probability theory, although it is often frightening to beginners, is
the most general approach to probability, and has been employed in tackling some
of the more difficult problems in probability. We start with a set of axioms, which
serve to define a probability space. Although these axioms may not be
immediately intuitive, be assured that the development is guided by the more
familiar classical probability theory.
Let S be the sample space of a random experiment. The probability P is a real
valued function whose domain is the power set of S and range is the interval [0,1]
satisfying the following axioms:
i. For any event E, P (E) ≥ 0
ii. P (S) = 1
iii. If E and F are mutually exclusive events, then P(E ∪ F) = P(E) + P(F).

4. Random Variables
A Random Variable is a set of possible values from a random experiment.

Example: Tossing a coin- we could get Heads or Tails.


Let's give them the values Heads=0 and Tails=1 and we have a Random Variable
"X":
In short:
X = {0, 1}

Note: We could have chosen Heads=100


and Tails=150 if we wanted! It is our
choice.

So:
● We have an experiment (such as tossing a coin)
● We give values to each event
● The set of values is a Random Variable

Not Like an Algebra Variable


In Algebra, a variable, like x, is an unknown value:


Example: x + 2 = 6
In this case we can find that x=4
But a Random Variable is different ...
A Random Variable has a whole set of values ...
... and it could take on any of those values, randomly.

Example: X = {0, 1, 2, 3}
X could be 0, 1, 2 or 3, randomly.
And they might each have a different probability.

Sample Space: A Random Variable's set of


values is the Sample Space.

Example: Throw a die once


Random Variable X = "The score shown on
the top face".
X could be 1, 2, 3, 4, 5 or 6
So the Sample Space is {1, 2, 3, 4, 5, 6}

Probability
We can show the probability of any one value using this style:

P(X = value) = probability of that value


Example (continued): Throw a die once
X = {1, 2, 3, 4, 5, 6}
In this case they are all equally likely, so the probability of any one is 1/6
● P(X = 1) = 1/6
● P(X = 2) = 1/6
● P(X = 3) = 1/6
● P(X = 4) = 1/6
● P(X = 5) = 1/6
● P(X = 6) = 1/6
Note that the sum of the probabilities = 1, as it should be.


Example: Toss three coins.


X = "The number of Heads" is the Random Variable.
In this case, there could be 0 Heads (if all the coins land Tails up), 1 Head, 2
Heads or 3 Heads. So the Sample Space = {0, 1, 2, 3}
But this time the outcomes are NOT all equally likely.
The three coins can land in eight possible ways:
X = “number of Heads”

HHH 3

HHT 2

HTH 2

HTT 1

THH 2

THT 1

TTH 1


TTT 0

Looking at the table we see just 1 case of Three Heads, but 3 cases of Two Heads,
3 cases of One Head, and 1 case of Zero Heads. So:
● P(X = 3) = 1/8
● P(X = 2) = 3/8
● P(X = 1) = 3/8
● P(X = 0) = 1/8
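The same counting can be reproduced with a short illustrative Python sketch (the use of itertools here is simply one convenient way to list the eight outcomes):

    from itertools import product
    from collections import Counter

    # Enumerate the 8 equally likely outcomes of tossing three coins.
    outcomes = ["".join(toss) for toss in product("HT", repeat=3)]

    # X = number of Heads in each outcome.
    counts = Counter(outcome.count("H") for outcome in outcomes)

    for x in sorted(counts):
        print(f"P(X = {x}) = {counts[x]}/{len(outcomes)}")
    # P(X = 0) = 1/8, P(X = 1) = 3/8, P(X = 2) = 3/8, P(X = 3) = 1/8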

Example: Two dice are tossed.


The Random Variable is X = "The sum of the scores on the two dice".

Let's make a table of all possible values:

                 1st Die
            1    2    3    4    5    6

       1    2    3    4    5    6    7
       2    3    4    5    6    7    8
2nd    3    4    5    6    7    8    9
Die    4    5    6    7    8    9   10
       5    6    7    8    9   10   11
       6    7    8    9   10   11   12

There are 6 × 6 = 36 of them,


and the Sample Space = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}


Let's count how often each value occurs, and work out the probabilities:
● 2 occurs just once, so P(X = 2) = 1/36
● 3 occurs twice, so P(X = 3) = 2/36 = 1/18
● 4 occurs three times, so P(X = 4) = 3/36 = 1/12
● 5 occurs four times, so P(X = 5) = 4/36 = 1/9
● 6 occurs five times, so P(X = 6) = 5/36
● 7 occurs six times, so P(X = 7) = 6/36 = 1/6
● 8 occurs five times, so P(X = 8) = 5/36
● 9 occurs four times, so P(X = 9) = 4/36 = 1/9
● 10 occurs three times, so P(X = 10) = 3/36 = 1/12
● 11 occurs twice, so P(X = 11) = 2/36 = 1/18
● 12 occurs just once, so P(X = 12) = 1/36
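These 36 outcomes and their probabilities can also be tallied with a brief Python sketch (shown only as an illustration; the Fraction type is used so the probabilities print as exact fractions):

    from itertools import product
    from collections import Counter
    from fractions import Fraction

    # All 36 equally likely (die1, die2) outcomes and the sum of each pair.
    sums = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))

    for total in sorted(sums):
        print(f"P(X = {total}) = {Fraction(sums[total], 36)}")
    # e.g. P(X = 7) = 1/6 and P(X = 2) = P(X = 12) = 1/36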

SUMMARY
 A Random Variable is a set of possible values from a random experiment.
 The set of possible values is called the Sample Space.
 A Random Variable is given a capital letter, such as X or Z.
 Random Variables can be discrete or continuous.
Random Variables can be either
● Discrete Data can only take certain values (such as 1,2,3,4,5)
● Continuous Data can take any value within a range (such as a person's
height)

Discrete and Continuous Random Variables


A variable is a quantity whose value changes.
A discrete variable is a variable whose value is obtained by counting.

Examples:
 number of students present
 number of red marbles in a jar
 number of heads when flipping three coins
 students’ grade level


A continuous variable is a variable whose value is obtained by measuring.

Examples:
 height of students in class
 weight of students in class
 time it takes to get to school
 distance traveled between classes

A random variable is a variable whose value is a numerical outcome of a random


phenomenon.
 A random variable is denoted with a capital letter
 The probability distribution of a random variable X tells what the possible
values of X are and how probabilities are assigned to those values
 A random variable can be discrete or continuous
A discrete random variable X has a countable number of possible values.

Example: Let X represent the sum of two dice.


Then the probability distribution of X is as follows:
X      2     3     4     5     6     7     8     9     10    11    12

P(X)  1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

To graph the probability distribution of a discrete random variable, construct


a probability histogram.


A continuous random variable X takes all values in a given interval of numbers.


 The probability distribution of a continuous random variable is shown by
a density curve.
 The probability that X is between an interval of numbers is the area under
the density curve between the interval endpoints
 The probability that a continuous random variable X is exactly equal to a
number is zero

Means and Variances of Random Variables


The mean of a discrete random variable, X, is its weighted average. Each
value of X is weighted by its probability.
To find the mean of X, multiply each value of X by its probability, then add
all the products.

The mean of a random variable X is called the expected value of X.
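As an illustration of this weighted-average calculation, the sketch below (reusing the two-dice distribution worked out earlier; purely illustrative) computes the expected value of X = the sum of two dice:

    from itertools import product
    from collections import Counter

    # Probability distribution of X = sum of two dice.
    sums = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))
    pmf = {x: n / 36 for x, n in sums.items()}

    # Expected value: weight each value of X by its probability, then add.
    expected_value = sum(x * p for x, p in pmf.items())
    print(expected_value)   # 7.0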

5. Probability Mass Function


A discrete random variable is a random variable that can take on any value
from a discrete set of values. The set of possible values could be finite, such as in the
case of rolling a six-sided die, where the values lie in the set {1,2,3,4,5,6}. However,
the set of possible values could also be countably infinite, such as the set of
integers {0,1,−1,2,−2,3,−3,…}. The requirement for a discrete random variable is
that we can enumerate all the values in the set of its possible values, as we will
need to sum over all these possibilities.
For a discrete random variable X, we form its probability distribution
function by assigning a probability that X is equal to each of its possible values. For
example, for a six-sided die, we would assign a probability of 1/6 to each of the six
options. In the context of discrete random variables, we can refer to the probability
distribution function as a probability mass function. The probability mass
function P(x) for a random variable X is defined so that for any number x, the value
of P(x) is the probability that the random variable X equals the given number x, i.e.,


P(x)=Pr(X=x).
Often, we denote the random variable of the probability mass function with a
subscript, so may write
PX(x)=Pr(X=x).
For a function P(x) to be a valid probability mass function, P(x) must be non-
negative for each possible value x. Moreover, the random variable must take on
some value in the set of possible values with probability one, so we require
that the values of P(x) sum to one. In equations, the requirements are
P(x) ≥ 0 for all x, and ∑x P(x) = 1,
where the sum is implicitly over all possible values of X.

Example:
Experiment: Toss a fair coin 3 times
Sample Space: S = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}
Random variable X is the number of heads.
Thus X : S → R looks like this
X(HHH) = 3
X(HHT) = X(HTH) = X(THH)=2
X(HTT) = X(THT) = X(TTH)=1
X(TTT) = 0
Thus, Range(X) = {0,1,2,3} and
P(X = 0) =1/8, P(X = 1) =3/8 , P(X = 2) = 3/8, P(X = 3) =1/8
Hence the probability mass function is given by
P(0) =1/8
P(1) =3/8
P(2) =3/8
P(3) =1/8

6. What is a Probability Distribution?


A probability distribution is a table or an equation that links each outcome of
a statistical experiment with its probability of occurrence.


Probability Distributions
An example will make clear the relationship between random variables and
probability distributions. Suppose you flip a coin two times. This simple statistical
experiment can have four possible outcomes: HH, HT, TH, and TT. Now, let the
variable X represent the number of Heads that result from this experiment. The
variable X can take on the values
0, 1, or 2. In this example, X is a random variable, because its value is determined
by the outcome of a statistical experiment.
A probability distribution is a table or an equation that links each
outcome of a statistical experiment with its probability of occurrence. Consider the
coin flip experiment described above. The table below, which associates each
outcome with its probability, is an example of a probability distribution.

Number of heads    Probability

0                  0.25
1                  0.50
2                  0.25

The above table represents the probability distribution of the random variable X.

7. Binomial Probability Distribution


To understand binomial distributions and binomial probability, it helps to
understand binomial experiments and some associated notation; so we cover those
topics first.
Binomial Experiment
A binomial experiment is a statistical experiment that has the following
properties:
 The experiment consists of n repeated trials.
 Each trial can result in just two possible outcomes. We call one of these
outcomes a success and the other, a failure.
 The probability of success, denoted by P, is the same on every trial.
 The trials are independent; that is, the outcome on one trial does not affect
the outcome on other trials.


Consider the following statistical experiment. You flip a coin 2 times and
count the number of times the coin lands on heads. This is a binomial experiment
because:
 The experiment consists of repeated trials. We flip a coin 2 times.
 Each trial can result in just two possible outcomes - heads or tails.
 The probability of success is constant - 0.5 on every trial.
 The trials are independent; that is, getting heads on one trial does not affect
whether we get heads on other trials.

Notation
The following notation is helpful, when we talk about binomial probability.
 x: The number of successes that result from the binomial experiment.
 n: The number of trials in the binomial experiment.
 P: The probability of success on an individual trial.
 Q: The probability of failure on an individual trial. (This is equal to 1 - P.)
 n!: The factorial of n (also known as n factorial).
 b(x; n, P): Binomial probability - the probability that an n-trial binomial
experiment results in exactly x successes, when the probability of success on
an individual trial is P.
 nCr: The number of combinations of n things, taken r at a time.

Binomial Distribution
A binomial random variable is the number of successes x in n repeated
trials of a binomial experiment. The probability distribution of a binomial random
variable is called a binomial distribution.
Suppose we flip a coin two times and count the number of heads (successes).
The binomial random variable is the number of heads, which can take on values of
0, 1, or 2. The binomial distribution is presented below.


Number of heads    Probability

0                  0.25
1                  0.50
2                  0.25

The binomial distribution has the following properties:


 The mean of the distribution (μx) is equal to n * P .
 The variance (σ²x) is n * P * (1 - P).
 The standard deviation (σx) is sqrt[ n * P * ( 1 - P ) ].

Binomial Formula and Binomial Probability


The binomial probability refers to the probability that a binomial
experiment results in exactly x successes. For example, in the above
table, we see that the binomial probability of getting exactly one head in
two coin flips is 0.50.
Given x, n, and P, we can compute the binomial probability based on the
binomial formula:
Graph 1: Binomial Distribution Plotting

Binomial Formula. Suppose a binomial experiment consists of n trials and


results in x successes. If the probability of success on an individual trial is P,
then the binomial probability is:
b(x; n, P) = nCx * P^x * (1 - P)^(n - x)
or,
b(x; n, P) = { n! / [ x! (n - x)! ] } * P^x * (1 - P)^(n - x)
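The formula translates directly into code. The following illustrative Python sketch (the function name binomial_pmf is our own choice) reproduces the two-coin table above and the mean and variance given below:

    from math import comb

    def binomial_pmf(x, n, p):
        """b(x; n, P): probability of exactly x successes in n trials."""
        return comb(n, x) * p**x * (1 - p)**(n - x)

    # Reproduce the coin-flip table: n = 2 trials, P = 0.5.
    for x in range(3):
        print(x, binomial_pmf(x, 2, 0.5))   # 0.25, 0.5, 0.25

    # Mean and variance match n * P and n * P * (1 - P).
    n, p = 2, 0.5
    print(n * p, n * p * (1 - p))           # 1.0, 0.5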

What is a Binomial Distribution? Real Life Examples


Many instances of binomial distributions can be found in real life. For example,


● If a new drug is introduced to cure a disease, it either cures the disease (it’s
successful) or it doesn’t cure the disease (it’s a failure).
● If you purchase a lottery ticket, you’re either going to win money, or you
aren’t.
Basically, anything you can think of that can only be a success or a failure can be
represented by a binomial distribution.

8. Poisson Distribution
A Poisson distribution is the probability distribution that results from a
Poisson experiment.
Attributes of a Poisson Experiment
A Poisson Experiment is a statistical experiment that has the following
properties:
▪ The experiment results in outcomes that can be classified as successes or
failures.
▪ The average number of successes (μ) that occurs in a specified region is
known.
▪ The probability that a success will occur is proportional to the size of the
region.
▪ The probability that a success will occur in an extremely small region is
virtually zero.
Note that the specified region could take many forms. For instance, it could be a
length, an area, a volume, a period of time, etc.

Notation
The following notation is helpful, when we talk about the Poisson distribution.
▪ e: A constant equal to approximately 2.71828. (Actually, e is the base of the
natural logarithm system.)
▪ μ: The mean number of successes that occur in a specified region.
▪ x: The actual number of successes that occur in a specified region.
▪ P(x; μ): The Poisson probability that exactly x successes occur in a Poisson
experiment, when the mean number of successes is μ.


Poisson Distribution
A Poisson random variable is the number of successes that result from a
Poisson experiment. The probability distribution of a Poisson random variable is
called a Poisson distribution.
Given the mean number of successes (μ) that occur in a specified region, we
can compute the Poisson probability based on the following formula:

Poisson Formula. Suppose we conduct a Poisson experiment, in which the


average number of successes within a given region is μ. Then, the Poisson
probability is:
P(x; μ) = (e^-μ) (μ^x) / x!
where x is the actual number of successes that result from the experiment, and e
is approximately equal to 2.71828.
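As an illustration, the Poisson formula can be evaluated with a few lines of Python (the function name and the example mean of 3 customers per hour are purely hypothetical):

    from math import exp, factorial

    def poisson_pmf(x, mu):
        """P(x; mu): probability of exactly x successes when the mean is mu."""
        return exp(-mu) * mu**x / factorial(x)

    # Example: suppose on average 3 customers arrive per hour;
    # probability that exactly 5 arrive in a given hour.
    print(poisson_pmf(5, 3))                           # ≈ 0.1008

    # The probabilities over all x sum to 1 (approximated here over x = 0..50).
    print(sum(poisson_pmf(x, 3) for x in range(51)))   # ≈ 1.0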

The Poisson distribution has the following properties:


 The mean of the distribution is equal to μ .
 The variance is also equal to μ .

Examples of Poisson Distribution


● The number of customers arriving per hour at a supermarket.
● The number of telephone calls received at a particular switchboard
per hour of the day.
● The number of deaths per day due to a specific disease.
● The number of defective items a packing manufacturer may produce in a batch.
● The number of printing mistakes per page of a book.

Continuous Random Variable and Probability Density Function


A random variable X is said to be continuous if it can take all possible values
between two certain limits. In case of continuous Random variable we introduce a
function f(x) in the range from a to b such that it satisfies two conditions:
i. f(x) ≥ 0
ii. ∫ f(x) dx = 1, where the integral is taken over a ≤ x ≤ b


Any function satisfying both the conditions (i) and (ii) may be accepted as density
function.
A continuous random variable is a random variable that can take on any
value from a continuum, such as the set of all real numbers or an interval. We
cannot form a sum over such a set of numbers. (There are too many, since such a
continuum is uncountable.) Instead, we replace the sum used for discrete random
variables with an integral over the set of possible values.
For a continuous random variable X, we cannot form its probability
distribution function by assigning a probability that X is exactly equal to each
value. The probability distribution function we must use in this case is called
a probability density function, which essentially assigns the probability that X is
near each value. In probability theory, a probability density function (PDF),
or density of a continuous random variable, is a function that describes the
relative likelihood for this random variable to take on a given value. The probability
of the random variable falling within a particular range of values is given by
the integral of this variable’s density over that range—that is, it is given by the
area under the density function but above the horizontal axis and between the
lowest and greatest values of the range. The probability density function is
nonnegative everywhere, and its integral over the entire space is equal to one.

9. Normal Distribution
The normal distribution is the most widely known and used of all
distributions. Because the normal distribution approximates many natural
phenomena so well, it has developed into a standard of reference for many
probability problems.

Graph 2: Normal Distribution Plotting


I. Characteristics of the Normal distribution


 Symmetric, bell shaped
 Continuous for all values of X between -∞ and ∞ so that each conceivable
interval of real numbers has a probability other than zero.
 -∞ ≤ X ≤ ∞
 Two parameters, µ and σ. Note that the normal distribution is actually a
family of distributions, since µ and σ determine the shape of the distribution.
 The rule for the normal density function is

f(x; µ, σ) = (1 / √(2πσ²)) e^(-(x - µ)² / (2σ²))
 The notation N(µ, σ²) means normally distributed with mean µ and variance
σ². If we say X ∼ N(µ, σ²) we mean that X is distributed N(µ, σ²).
 About 2/3 of all cases fall within one standard deviation of the mean, that is
P(µ - σ ≤ X ≤ µ + σ) = .6826
 About 95% of cases lie within 2 standard deviations of the mean, that is
P(µ - 2σ ≤ X ≤ µ + 2σ) = .9544

II. Why is the normal distribution useful?


 Many things actually are normally distributed, or very close to it. For
example, height and intelligence are approximately normally distributed;
measurement errors also often have a normal distribution
 The normal distribution is easy to work with mathematically. In many
practical cases, the methods developed using normal theory work quite well
even when the distribution is not normal.
 There is a very strong connection between the size of a sample N and the
extent to which a sampling distribution approaches the normal form. Many
sampling distributions based on large N can be approximated by the normal
distribution even though the population distribution itself is definitely not
normal.


The Standardized Normal Distribution

Graph 3: Standardized Normal Distribution Plotting


f(z) = (1 / √(2π)) e^(-z² / 2)
General Procedure
As you might suspect from the formula for the normal density function, it
would be difficult and tedious to do the calculus every time we had a new set of
parameters for µ and σ. So instead, we usually work with the standardized normal
distribution, where µ = 0 and σ = 1, i.e. N(0,1). That is, rather than directly solve a
problem involving a normally distributed variable X with mean µ and standard
deviation σ, an indirect approach is used.
1. We first convert the problem into an equivalent one dealing with a normal
variable measured in standardized deviation units, called a standardized
normal variable. To do this, if X ∼ N(µ, σ²), then Z = (X - µ)/σ ∼ N(0, 1).
2. A table of standardized normal values can then be used to obtain an answer
in terms of the converted problem.
3. If necessary, we can then convert back to the original units of measurement.
To do this, simply note that, if we take the formula for Z, multiply both sides
by σ, and then add µ to both sides, we get X = Zσ + µ
4. The interpretation of Z values is straightforward. Since σ = 1, if Z = 2, the
corresponding X value is exactly 2 standard deviations above the mean. If Z =
-1, the corresponding X value is one standard deviation below the mean. If Z
= 0, X = the mean, i.e. µ.
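A minimal sketch of this procedure, using hypothetical figures (µ = 100, σ = 12, and a score of 96) and assuming the SciPy library is available in place of a printed Z table:

    from scipy.stats import norm   # assumes SciPy is installed

    mu, sigma, x = 100, 12, 96     # hypothetical normal population and value

    # Step 1: convert to a standardized normal variable.
    z = (x - mu) / sigma           # ≈ -0.33

    # Step 2: look up the standardized normal probability P(X <= 96) = P(Z <= z).
    print(norm.cdf(z))             # ≈ 0.369

    # Step 3: converting back, the X value two standard deviations above the mean.
    print(2 * sigma + mu)          # 124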


Normally Distributed
1. The values of mean, median and mode should be close to each other.
2. The values of skewness and kurtosis should be close to zero.
3. The standard deviation should be low
4. Median should lie exactly in between the upper and lower quartile

10. Q-Q Plot


The quantile-quantile or q-q plot is an exploratory graphical device used to
check the validity of a distributional assumption for a data set. In general, the
basic idea is to compute the theoretically expected value for each data point based
on the distribution in question. If the data indeed follow the assumed distribution,
then the points on the q-q plot will fall approximately on a straight line. The q-q
plot provides a visual comparison of the sample quantiles to the corresponding
theoretical quantiles.

Graph 4: Q-Q Plot

11. Normal Probability Plot


The normal probability plot is a graphical technique for assessing whether or
not a data set is approximately normally distributed.
The data are plotted against a theoretical normal distribution in such a way
that the points should form an approximate straight line. Departures from this
straight line indicate departures from normality.
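In practice such a plot is rarely drawn by hand. A minimal sketch, assuming the NumPy, SciPy and Matplotlib libraries are available and using simulated data purely for illustration, might look like this:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    # Simulated sample for illustration; replace with your own data.
    rng = np.random.default_rng(0)
    sample = rng.normal(loc=50, scale=5, size=200)

    # probplot computes the theoretical normal quantiles for each data point and
    # plots them against the ordered sample values; an approximately straight
    # line suggests the data are close to normally distributed.
    stats.probplot(sample, dist="norm", plot=plt)
    plt.title("Normal probability (Q-Q) plot")
    plt.show()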



CHAPTER 3

Sampling Theory
1. Sampling Theory
In statistics, quality assurance, & survey methodology, sampling is
concerned with the selection of a subset of individuals from within a statistical
population to estimate characteristics of the whole population.
Each observation measures one or more properties (such as weight, location,
color) of observable bodies distinguished as independent objects or individuals.

ADVANTAGES OF THE SAMPLING TECHNIQUE


1. Much cheaper.

2. Saves time.

3. More reliable.

4. Very suitable for carrying out different surveys.

5. Scientific in nature.

Population
A typical statistical investigation is interested in studying the various
characteristics relating to items or individuals belonging to a group. The group of
individuals under study is known as the population or the universe. A population
containing a finite number of objects is called a finite population; a population
containing an infinite or a very large number of objects is called an infinite population.


Sample
A finite representative subset of a population is called a sample. It is selected
from a population with the objective of investigating the characteristics of that population.
There are two types of sampling: probabilistic and non-probabilistic.

Probability Sampling
It is a procedure of drawing a sample from a population. It enables us to draw a
conclusion about the characteristics of the population after studying only those
objects included in the sample. The theory that provides guidelines for choosing a
sample from a population is called the sampling theory. The theory aims at obtaining
the optimum results in respect of the characteristics of the population, within the
available resources at our disposal in terms of time, manpower and money. Secondly,
the theory of sampling aims at providing us with the best possible
estimator of the population parameters through proper construction of the sample.

PROBABILITY SAMPLING METHODS ARE OF THE


FOLLOWING TYPES

Simple Random Sampling


Simple random sampling (SRS) is a technique in which the sample is so
drawn that each and every unit in the population has an equal and independent
chance of being included in the sample. In this procedure of sampling, each sample
point has the same probability of being included in the sample. Such a sample, where
the observations are drawn in a purely random manner, is known as a random sample.


 Very useful in the cases involving a homogeneous population.


 Is further of two types – simple random sampling with
replacement(SRSWR) and the other one is the simple random sampling
without replacement(SRSWOR)
 If the units of a sample are drawn one by one from the population in such a
way that after every drawing the selected unit is returned to the population
then this is called as the simple random sampling with replacement
(SRSWR)

The ‘n’ units of the sample are drawn from the population in such
a way that at each drawing, each of the ‘N’ members of the population
gets the same probability 1/N of being selected. Hence this method is
called the simple random sampling with replacement. Clearly, the
same unit of population may occur more than once in a sample. Thus,
there are N^n possible samples, regard being had to the order in which the n sample
units occur, and each such sample has the probability 1/N^n.

 And if the units of a sample are drawn one by one from the population in
such a way that after every drawing the selected unit is not returned to the
population then this is called as the simple random sampling without
replacement (SRSWOR)

The ‘n’ members of the sample are drawn one by one, but the
members once drawn are not returned back to the population; at each
stage every remaining member of the population is given the same probability of
being included in the sample. This method is called SRSWOR.
Therefore, under SRSWOR, at any rth draw there remain
(N-r+1) units and each unit has the probability of 1/(N-r+1) of being
drawn.
It is noted that if one takes n individuals at a time
from the population, giving equal probability to each observation,
then the total number of possible samples is NCn, i.e., the combinations of n
members out of N members of the population form the total
number of possible samples in simple random sampling without
replacement.
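The difference between the two schemes is easy to see in code. The sketch below (an illustration only, using Python's standard random module on a toy population of N = 10 units) draws a sample of size 4 under each scheme:

    import random

    population = list(range(1, 11))   # a toy population of N = 10 labelled units
    random.seed(42)                   # fixed seed so the example is repeatable

    # SRSWR: every draw returns the unit to the population, so repeats can occur.
    srswr = random.choices(population, k=4)

    # SRSWOR: a unit once drawn is not returned, so all sampled units are distinct.
    srswor = random.sample(population, k=4)

    print("With replacement:   ", srswr)
    print("Without replacement:", srswor)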


Random sampling can be performed with the help of certain


specific methods and some of these methods are

A. Lottery Method
 A lottery is drawn by writing a number or the names of various units and
then putting them in a container.
 They are completely mixed and then certain numbers are picked up from the
container.
 Those picked are taken up for the sampling.

B. Stratified Random Sampling


This type of probability sampling method is one of the most commonly used
methods and involves the division of the whole population into a number of strata.
These strata are mutually exclusive and also collectively exhaustive in nature.
From each of these strata, a simple random sample is drawn; by this, the number of
units drawn from each stratum becomes proportional to the respective
stratum size. When the population is heterogeneous in nature, this type of sampling
plays a very critical role. The stratification in this method is performed in such an
elegant way that the variance between the strata is high and the variability within each
stratum is very small.

Figure 20: Segmentation of Stratified Random Sampling


C. Systematic Random Sampling


This method involves the formation of a sample in a very systematic manner:
it involves arranging the units in the population in a serial manner. A major
point to be kept in mind here is that the population should be finite in nature and
also should be defined very clearly.
After this, from the first K units, one unit is selected at random; this unit and
every Kth unit onwards from the serially listed population form a systematic
sample. This method is very simple and convenient to use, as it saves a lot of time.

Figure 21: Systematic Random Sampling
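A minimal sketch of systematic selection, assuming a serially numbered population of N = 100 units and a required sample of n = 10 (so K = N/n = 10):

    import random

    population = list(range(1, 101))   # units already arranged in serial order
    k = 10                             # sampling interval K = N / n

    random.seed(1)
    start = random.randint(0, k - 1)   # choose one unit at random from the first K

    # The chosen unit and every K-th unit after it form the systematic sample.
    systematic_sample = population[start::k]
    print(systematic_sample)           # 10 units, spaced K apart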

D. Non Probabilistic Sampling


Most researchers are bound by time, money and workforce, and because of
these limitations, it is almost impossible to randomly sample the entire population
and it is often necessary to employ another sampling technique, the non-probability
sampling technique.
In contrast with probability sampling, non-probability sample is not a
product of a randomized selection processes. Subjects in a non-probability sample
are usually selected on the basis of their accessibility or by the purposive personal
judgment of the researcher.


TYPES OF NONPROBABILITY SAMPLING

Convenience Sampling
Convenience sampling is probably the most common of all sampling
techniques. With convenience
sampling, the samples are selected
because they are accessible to the
researcher. Subjects are chosen
simply because they are easy to
recruit. This technique is considered
easiest, cheapest and least time
consuming.
Figure 22: Convenience Sampling

Judgmental Sampling
Judgmental sampling is more
commonly known as purposive
sampling. In this type of sampling,
subjects are chosen to be part of the
sample with a specific purpose in
mind. With judgmental sampling, the
researcher believes that some subjects
are more fit for the research compared
to other individuals. This is the reason
why they are purposively chosen as
subjects.

Figure 23: Judgemental Sampling

2. What is sampling bias?


Bias is a systematic error that can prejudice your evaluation findings in some
way. So, sampling bias is consistent error that arises due to the sample selection.
For example, a survey of high school students to measure teenage use of illegal
drugs will be a biased sample because it does not include home schooled students or
dropouts. A sample is also biased if certain members are underrepresented or
overrepresented relative to others in the population. For example, distributing a
questionnaire at the end of a 3-4 day conference is likely to include more people who
are committed to the conference so their views would be overrepresented.


Interviews with people who walk by a certain location are going to over-represent


healthy individuals or those who live near the location. Selecting a sample using a
telephone book will underrepresent people who cannot afford a telephone, do not
have a telephone, or do not list their telephone numbers.
Sampling bias can occur any time your sample is not a random sample. If it is
not random, some individuals are more likely than others to be chosen. Always
think very carefully about which individuals are being favoured and how they differ.

Population Parameter and Sample Statistic


A population parameter is an unknown numerical feature of the
population. The primary interest of any survey lies in knowing the values of the
different measures of the population distribution of a variable of interest. The
measures of the population distribution involve its mean, standard deviation etc.,
which are calculated on the basis of the population values of the variable. In other
words, a parameter is a functional form of all the population units.
An estimator is a measure computed on the basis of sample values. It is a
functional form of all the sample observations, providing a representative value of the
collected sample.
A statistic is a characteristic of a sample. Inferential statistics enables you
to make an educated guess about a population parameter based on a statistic
computed from a sample randomly drawn from that population. A statistic is an
estimator with a sampling distribution attached to it.

Properties of Estimators
To choose between estimating principles, we look into the properties satisfied
by them. These properties are classified into two groups- small sample properties
and large sample properties.
There is no hard and fast rule to distinguish between small and large samples; as a
working definition, a small sample has at most 30 observations while a large sample
has more than 30 observations.


SMALL SAMPLE PROPERTIES


● Unbiasedness
An estimator θ̂ is said to be an unbiased estimator of θ if its mean or
expected value is equal to the value of the true population parameter θ, i.e., E(θ̂) = θ.
The unbiased property reflects on the accuracy of the estimator. In everyday terms,
this means that if repeated samples of a given size are drawn, and θ̂ computed for
each sample, the average of such θ̂ values would be equal to θ. However, if
E(θ̂) ≠ θ, then θ̂ is said to be biased and the extent of bias for θ is measured by
E(θ̂) - θ.

● Efficiency
θ̂ is an efficient estimator if the following two conditions are satisfied together:
I. θ̂ is unbiased, and
II. Var(θ̂) <= Var(θ*) for any other unbiased estimator θ* of θ.

An efficient estimator is also called a “minimum variance unbiased estimator”
(MVUE) or “best unbiased estimator”.

● Linearity
An estimator is said to have the property of linearity if it is possible to
express it as a linear combination of sample observations. Linearity is associated
with linear (i.e., additive) calculation rather than multiplicative or non-linear
calculation.

● Mean-Squared error (MSE)


Sometimes a difficult choice problem arises while comparing two estimators.
Suppose we have two different estimators of which one has lower bias, but higher
variance, compared with the other.
If bias(θ̂) > bias(θ*) but Var(θ̂) < Var(θ*), then we check the mean squared error
MSE(θ̂) = E[(θ̂ - θ)²] = Var(θ̂) + [bias(θ̂)]²

We accept the estimator which has lower MSE.


LARGE SAMPLE PROPERTIES


These properties relate to the distribution of an estimator when the sample
size is large, i.e., approaches infinity.

● Asymptotic Unbiasedness
θ̂ is an asymptotically unbiased estimator of θ if:
lim n→∞ E(θ̂) = θ

This means that the estimator θ̂, which is otherwise biased, becomes
unbiased as the sample size approaches infinity.
If an estimator is unbiased, it is also asymptotically unbiased, but the reverse
is not necessarily true.

● Consistency
Whether or not an estimator is consistent is understood by looking at the
behaviour of its bias and variance as the sample size approaches infinity.
If the increase in sample size reduces the bias (if there were one) and the variance of
the estimate, and this continues until both bias and variance become zero as n → ∞,
then the estimator is said to be consistent.
So, θ̂ is a consistent estimator if
lim n→∞ E[θ̂ - θ] = 0
and lim n→∞ Var(θ̂) = 0.

NOTE: Unbiasedness of an estimator suggests that the expected


value of an estimator is equal to the value of the population parameter
for a given sample size. However this property does not speak about
the behaviour of the estimation over huge sample sizes. Apart from an
estimator being unbiased, it is also necessary that the estimator
behaves identically over huge sample sizes.

It has already been seen that the sample mean and sample median are consistent
estimators for µ. It is to be remembered that the A.M. or sample mean is affected by the
existence of extreme values, whereas the sample median will be free from any such
effects of outliers because the sample median gives the middlemost value of the


distribution and the outliers would lie at the extreme ends. Despite this fact, it is
seen that the sample mean is preferred as an estimator for the population mean over the
sample median. This is because the sample mean as an estimator is seen to contain all
the information about the population parameter.
This property is known as sufficiency. An estimator based on sample
observations is a sufficient estimator for a parameter if it contains all the information
in the sample regarding the population parameter. In this sense, sufficiency is the
most important property for the choice of an estimator.

Table 3: Showing the comparison between sample Statistics and Population Parameters

Estimation in Statistics
In statistics, estimation refers to the process by which one makes inferences
about a population, based on information obtained from a sample.

Point Estimate vs. Interval Estimate


Statisticians use sample statistics to estimate population parameters. For example,
sample means are used to estimate population means; sample proportions, to
estimate population proportions.
An estimate of a population parameter may be expressed in two ways:

▪ Point Estimate
A point estimate of a population parameter is a single value of a statistic. For
example, the sample mean x̄ is a point estimate of the population mean μ. Similarly,
the sample proportion p is a point estimate of the population proportion P.


▪ Interval Estimate
An interval estimate is defined by two numbers, between which a population
parameter is said to lie. For example, a < μ < b is an interval estimate of the
population mean μ. It indicates that the population mean is greater than a but less
than b.

Figure 24: Showing the Point Estimate and Interval Estimate

Major Limitations of Sampling


The advantages of sampling can be ensured in the presence of the scientific
manner of selecting samples, appropriate sample design and sample size. In other
words the prospect of sampling is highly conditioned by the presence of those
factors. In the absence of these, the effectiveness of this technique is impaired
seriously.
● If a sample survey is not properly planned, the result obtained will not be
reliable and quite often it might be misleading.
● An efficient sampling procedure requires the services of qualified personnel,
sophisticated equipment and statistical techniques; in the absence of these,
the results of the survey might be misleading.
● Sampling procedures cannot be used if we want to obtain information about
each and every unit of the population.

Sampling Distribution
Say a sample of size ‘n’ is drawn from a finite population of size ‘N’. Then the total
number of samples is NCn = k. For each of these ‘k’ samples, we can compute some
statistic t(x1,x2,x3,.......,xn). The set of the values of the statistic so obtained, one for


each sample, constitutes the sampling distribution of the statistic. Eg: t1,t2,t3….tk
determines the sampling distribution of the statistic t. In other words , statistic t
may be regarded as a random variable which can take the values t1,t2,t3….tk.
Sampling distributions are mostly continuous in nature. The most common
types of sampling distribution are:
● Gamma distribution
● Exponential distribution
● Chi square distribution
● t distribution
● F distribution

3. Testing of Hypothesis
The entire process of statistical inference is mainly inductive in nature, i.e., it
is based on deciding the characteristics of the population on the basis of a sample
study. Such a decision always involves an element of risk, that is, the risk of taking
wrong decisions. It is here that the modern theory of probability plays a vital role,
and the statistical technique that helps us in arriving at the criterion for such
decisions is known as the testing of hypothesis.
A hypothesis is a statistical statement or a conjecture about the value of a
parameter. The basic hypothesis being tested is called the null hypothesis (H0). It is
sometimes regarded as representing the current state of knowledge and belief about the
value being tested. In a test the null hypothesis is contrasted with an alternative
hypothesis (H1).
There are two types of statistical hypothesis.
● When a hypothesis completely specifies the distribution, it is called a simple hypothesis.
● When all the parameters of the distribution are not specified, the hypothesis is
known as a composite hypothesis.

Test of a statistical hypothesis


It is a two-way action decision after observing a random sample from the
given population, the two actions being acceptance or rejection of the hypothesis
under consideration. Therefore, a test is a rule which divides the entire sample
space into 2 subsets:
● A region in which the data is consistent with (H0).
● A region in which the data is inconsistent with (H0).


The actual decision is based on the values of a suitable function of the data,
the test statistic. The set of all possible values of the test statistic which are consistent
with (H0) is the acceptance region, and all those values of the test statistic which are
inconsistent with (H0) form the critical region. One important condition which
must be kept in mind for the efficient working of a test statistic is that its distribution
must be well specified.
The truth or fallacy of a statistical hypothesis is based on the information
contained in a sample. The rejection or the acceptance of the hypothesis is
contingent on the consistency or the inconsistency of (H0) with the sample
observations. Therefore, it should be clearly borne in mind that acceptance of a
statistical hypothesis is due to the insufficient evidence provided by the sample to
reject it, and it does not necessarily mean that it is true.

Errors associated with the testing of hypothesis


A researcher arrives at a decision to accept or reject a null hypothesis (H0)
after inspecting a sample from the given population. As such, an element of risk of
erroneous decisions is always involved. The four possible mutually disjoint and exhaustive
decisions are:
1. Reject (H0) when actually it is false
2. Accept (H0) when it is true
3. Reject (H0) when it is true
4. Accept (H0) when it is false
Decisions in 1 and 2 are correct. While the decisions in 3 and 4 are wrong
decisions. Two errors are most likely to occur in the test of a hypothesis, namely
type1 error and type 2 error.
For example, suppose we wanted to determine whether a coin was fair and
balanced. A null hypothesis might be that half the flips would result in Heads and
half, in Tails. The alternative hypothesis might be that the number of Heads and
Tails would be very different. Symbolically, these hypotheses would be expressed as
H0: p = 0.5
Ha: p ≠ 0.5
Suppose we flipped the coin 50 times, resulting in 40 Heads and 10 Tails.
Given this result, we would be inclined to reject the null hypothesis. That is, we
would conclude that the coin was probably not fair and balanced.


Type I ERROR and Type II ERROR


In statistical hypothesis testing,

Type I error is the incorrect rejection of a true null hypothesis (a "false


positive"), type I error occurs when the null hypothesis is rejected even when it it
true. The maximum size of the type I error is known as the level of significance.
This is defined as the probability of committing the type I error. We can write it as
P{rejecting (H0)/(H0) is true} = . Commonly used level of significance is 5% and 1%. If a
5% level of significance is used , then it implies that in 5 samples out of 100, we are
likely to reject a correct (H0). Level of significance is fixed in advance before collecting
the sample information.

Type II error is the failure to reject a false null hypothesis (a "false


negative"), it occurs when the null hypothesis is accepted even when it is false.
Probability of committing type II error . P{accepting (H0)/(H0)is false} = . The type II
error can be used to determine the power of a test P( ). P( )= 1-. The smaller is the
size of the type II error, greater is the power of the set.
More simply stated, a type I error is detecting an effect that is not present,
while a type II error is failing to detect an effect that is present.

Table of Error Types


Tabularised relations between truth/falseness of the null hypothesis and outcomes
of the test

                                    Null hypothesis (H0) is
                                Valid/True              Invalid/False

Judgement of     Reject         Type I Error            Correct Inference
Null Hypothesis                 (False Positive)        (True Positive)
(H0)
                 Fail to Reject Correct Inference       Type II Error
                 (accept)       (True Negative)         (False Negative)

Table 4: Table of Error Types


Example:
Hypothesis: "The evidence produced before the court proves that this man is
guilty."
Null Hypothesis (H0): "This man is innocent."
A type I error occurs when convicting an innocent person. A type II error
occurs when letting a guilty person go free.
A positive correct outcome occurs when convicting a guilty person. A negative
correct outcome occurs when letting an innocent person go free.

                       Null hypothesis (H0)       Null hypothesis (H0)
                       is valid: Innocent         is invalid: Guilty

Reject H0              Type I error               Correct outcome
"I think he is         False positive             True positive
guilty!"               Convicted!                 Convicted!

Don't reject H0        Correct outcome            Type II error
"I think he is         True negative              False negative
innocent!"             Freed!                     Freed!

Table 5: Showing possible outcomes of a statistical hypothesis

4. Significance Level
A Type I error occurs when the researcher rejects a null hypothesis when it
is true. The probability of committing a Type I error is called the significance level,
and is often denoted by α.
Researchers are concerned about the level of significance in their studies
and research; this means they attempt to find the chance that their statistical
test will go against their data and hypothesis - even if the hypothesis was actually
true.
The significance level α is the probability that the test statistic will fall in the
critical region when the null hypothesis is actually true.
Scientists usually turn these results around indicating that there is only a
5% chance that the results were due to statistical errors and chance. They will
refute the null hypothesis and support the alternative. Falsifiability ensures that
the hypothesis is never completely accepted, only that the null is rejected.


Popular levels of significance are 10% (0.1), 5% (0.05), 1% (0.01), 0.5% (0.005),
and 0.1% (0.001). If a test of significance gives a p-value lower than the significance
level α, the null hypothesis is rejected.

Table 6: Table value of tests at different levels of significance

5. P-Value
The p-value is the exact probability of committing a type I error. It is used as an
alternative to rejection points to provide the smallest level of significance at which
the null hypothesis would be rejected.
The level of marginal significance within a statistical hypothesis test,
representing the probability of the occurrence of a given event. The p-value is used
as an alternative to rejection points to provide the smallest level of significance at
which the null hypothesis would be rejected. The smaller the p-value, the stronger
the evidence is in favor of the alternative hypothesis.
P-values are calculated using p-value tables, or spreadsheet /
statistical software.

Figure 25: Showing The P-Value


For Example
This concept may sound confusing and impractical, but consider a simple
example - suppose you work for a company that produces running shoes:
You need to plan production for the number of pairs of shoes your company
should make in each size for men and for women. You don't want to base your
production plans on the anecdotal evidence that men usually have bigger feet than
women, you need hard data to base your plans on. Therefore, you should look at a
statistical study that shows the correlation between gender and foot size.
If the report's p-value was only 2%, this would be a statistically significant
result. You could reasonably use the study's data to prepare your company's
production plans, because the 2% p-value indicates there is only a 2% chance that
the connection between foot size and gender was the result of chance/error. On the
other hand, if the p-value was 20%, it would not be reasonable to use the study as a
basis for your production plans, since there would be a 20% chance that the
relationship presented in the study could be due to random chance alone.

Statistical Significance
To determine if an observed outcome is statistically significant, we compare
the values of alpha and the p-value. There are two possibilities that emerge:
● The p-value is less than or equal to alpha (p <= α). In this case we reject the
null hypothesis. When this happens we say that the result is statistically
significant. In other words, we are reasonably sure that there is something
besides chance alone that gave us the observed sample.
● The p-value is greater than alpha (p > α). In this case we fail to reject the null
hypothesis. When this happens we say that the result is not statistically
significant. In other words, we are reasonably sure that our observed data
can be explained by chance alone.
The implication of the above is that the smaller the value of α is, the more
difficult it is to claim that a result is statistically significant. On the other hand, the
larger the value of alpha is, the easier it is to claim that a result is statistically
significant. Coupled with this, however, is the higher probability that what we
observed can be attributed to chance.


Steps of Testing a Hypothesis


● If we want to test the significance of the difference between a statistic and a
parameter or between two sample statistics, then we set up the null
hypothesis (H0) that the difference is not significant. This means that the
difference is just due to sampling fluctuations.
Eg: if we want to test if a particular drug is effective then we shall set up the
hypothesis that the drug is not effective. For testing if there is any difference
between the average sales of two pizza companies, we set up the hypothesis
that there is no significant difference between their sales.
● If we want to test any statement about the population, we want to set up the
null hypothesis that it is true.
Eg: if we want to find out if the population mean µ has a specified value µ0, then
H0: µ = µ0.
● Set up an alternative hypothesis (H1). Any hypothesis which is
complementary to the null hypothesis is called the alternative hypothesis.
This enables us to decide whether we have a one tailed or two tailed test.
● Choose the appropriate level of significance depending on the reliability of
the estimates and permissible risks. This is to be decided before the sample is
drawn, i.e., α is to be fixed in advance.
● Compute the appropriate test statistic. Some of the commonly used
distributions in obtaining the test statistic are the normal, chi-square, t and F
distributions.

6. Confidence Interval
Statisticians use a confidence interval to express the degree of uncertainty
associated with a sample statistic. A confidence interval is an interval
estimate combined with a probability statement.
For example, suppose a statistician conducted a survey and computed an
interval estimate, based on survey data. The statistician might use a confidence
level to describe uncertainty associated with the interval estimate. He/she might
describe the interval estimate as a "95% confidence interval". This means that if we
used the same sampling method to select different samples and computed an
interval estimate for each sample, we would expect the true population parameter
to fall within the interval estimates 95% of the time. In the language of hypothesis
testing, the 100(1 − α)% confidence interval established above is known as the region of
acceptance (of the null hypothesis) and the region(s) outside the confidence interval
is (are) called the region(s) of rejection (of H0) or the critical region(s). As noted


previously, the confidence limits, the endpoints of the confidence interval, are also
called critical values.
A term used in inferential statistics that measures the probability that a
population parameter will fall between two set values. The confidence interval can
take any number of probabilities, with the most common being 95% or 99%.
In other words, a confidence interval is the probability that a value will fall
between an upper and lower bound of a probability distribution. For example, given
a 99% confidence interval, stock XYZ's return will fall between -6.7% and +8.3%
over the next year. In layman's terms, we are 99% confident that the returns of
holding XYZ stock over the next year will fall between -6.7% and +8.3%.

Figure 26: Showing Acceptance Region and Critical Region
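As a concrete illustration, a 95% confidence interval for a population mean can be computed from sample data in a few lines. The sketch below is a minimal Python example (the sample values are invented, and the numpy and scipy libraries are assumed to be available).

import numpy as np
from scipy import stats

# Invented sample of observations, used only to illustrate the mechanics
sample = np.array([98, 102, 95, 101, 99, 103, 97, 100, 104, 96])

mean = sample.mean()
sem = stats.sem(sample)          # standard error of the mean
n = len(sample)

# 95% confidence interval based on the t distribution (population sd unknown)
lower, upper = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)
print("95%% CI for the mean: (%.2f, %.2f)" % (lower, upper))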



CHAPTER 4
Parametric Tests
1. Parametric Tests
In the literal meaning of the terms, a parametric statistical test is one that
makes assumptions about the parameters (defining properties) of the population
distribution(s) from which one's data are drawn, while a non-parametric test is
one that makes no such assumptions.

Parametric Assumptions
● Interval or ratio scale of measurement (approximately interval)
● Random sampling from a defined population
● Samples are independent/dependent (varies by statistic)
● Characteristic is normally distributed in the population
● Population variances are equal (if two or more groups/variables in
the design)

2. Z Test
A Z-test is any statistical test for which the distribution of the test statistic
under the null hypothesis can be approximated by a normal distribution.
Suppose that in a particular geographic region, the mean and standard
deviation of scores on a reading test are 100 points, and 12 points, respectively. Our
interest is in the scores of 55 students in a particular school who received a mean
score of 96. We can ask whether this mean score is significantly lower than the
regional mean — that is, are the students in this school comparable to a simple
random sample of 55 students from the region as a whole, or are their scores
surprisingly low?


Assumptions
● The parent population from which the sample is drawn should be normal
● The sample observations are independent, i.e., the given sample is random
● The population standard deviation σ is known.

Example: Blood glucose levels for obese patients have a mean of 100 with a
standard deviation of 15. A researcher thinks that a diet high in raw cornstarch
will have a positive or negative effect on blood glucose levels. A sample of 30
patients who have tried the raw cornstarch diet have a mean glucose level of 140.
Test the hypothesis that the raw cornstarch had an effect.

Step 1: State the null hypothesis: H0: μ = 100

Step 2: State the alternate hypothesis: H1: μ ≠ 100
Step 3: State your alpha level. We’ll use 0.05 for this example. As this is a two-
tailed test, split the alpha into two.
0.05/2 = 0.025
Step 4: Find the z-score associated with your alpha level. You’re looking for the
area in one tail only. The z-score for 0.975 (1 − 0.025 = 0.975) is 1.96. As this is
a two-tailed test, you would also be considering the left tail (z = −1.96).

Step 5: Find the test statistic using this formula: Z = (x̄ − μ0) / (σ / √n)
z = (140 − 100)/(15/√30) = 14.60.
Step 6: If the test statistic from Step 5 is less than −1.96 or greater than 1.96 (the
critical values from Step 4), reject the null hypothesis. In this case, it is greater, so you can reject the null.
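The six steps above can also be checked programmatically. The sketch below is a minimal Python version (assuming scipy is available); it reproduces the z statistic of Step 5 and adds the corresponding two-tailed p-value.

import math
from scipy.stats import norm

mu0, sigma = 100, 15      # hypothesised mean and known population standard deviation
xbar, n = 140, 30         # observed sample mean and sample size

z = (xbar - mu0) / (sigma / math.sqrt(n))
p_two_tailed = 2 * (1 - norm.cdf(abs(z)))

print("z = %.2f, two-tailed p = %.4f" % (z, p_two_tailed))
# z is about 14.60, far beyond the critical values of -1.96 and +1.96, so H0 is rejected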

3. T Test
A t-test is any statistical hypothesis test in which the test statistic follows a
Student's t distribution if the null hypothesis is supported. Among the most
frequently used t-tests are:
● A one-sample location test of whether the mean of a normally distributed
population has a value specified in a null hypothesis.


● A two sample location test of the null hypothesis that the means of two
normally distributed populations are equal.
● A test of the null hypothesis that the difference between two responses
measured on the same statistical unit has a mean value of zero.
● A test of whether the slope of a regression line differs significantly from zero.

PROCEDURE
Set up the hypothesis:
A. Null Hypothesis: assumes that there are no significant differences between
the population mean and the sample mean.
B. Alternative Hypothesis: assumes that there is a significant difference
between the population mean and the sample mean.

i. Calculate the standard deviation for the sample by using this formula:

S = √[ Σ(X − X̄)² / (n − 1) ]

Where,
S = standard deviation
X = an individual observation, X̄ = sample mean
n = number of observations in the sample

ii. Calculate the value of the one sample t-test, by using this formula:

t = (X̄ − μ) / (S / √n)

Where,
t = one sample t-test value
μ = population mean


iii. Calculate the degree of freedom by using this formula:


V=n-1
Where,
V = degree of freedom
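These formulas translate directly into code. The sketch below is a minimal Python illustration (the sample values and the hypothesised population mean of 50 are invented, and scipy is assumed to be available); it computes t both from the formula and with scipy's one-sample t-test.

import numpy as np
from scipy import stats

sample = np.array([52, 48, 51, 55, 47, 53, 50, 49, 54, 46], dtype=float)  # invented data
mu = 50                                                                   # hypothesised population mean

s = sample.std(ddof=1)                                    # sample standard deviation (n - 1 in the denominator)
t_by_formula = (sample.mean() - mu) / (s / np.sqrt(len(sample)))

t_scipy, p_value = stats.ttest_1samp(sample, popmean=mu)  # same t statistic, plus the p-value
print(t_by_formula, t_scipy, p_value)                     # degrees of freedom: V = n - 1 = 9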

4. Hypothesis Testing
In hypothesis testing, statistical decisions are made to decide whether or not
the population mean and the sample mean are different. In hypothesis testing, we
will compare the calculated value with the table value. If the calculated value is
greater than the table value, then we will reject the null hypothesis, and accept the
alternative hypothesis.

Assumptions:
1. Dependent variables should be normally distributed.
2. Samples drawn from the population should be random.
3. Cases of the samples should be independent.
4. We should know the population mean.

Conditions for one sample t test:


Most t statistics have the form t = Z/s.
● Z follows a standard normal distribution under the null hypothesis, or the
parent population from which the sample is drawn should be normal
● The sample observations are independent, i.e., the given sample is random
● The population standard deviation σ is unknown

Two Independent Samples T Test


The independent two-sample t-test is used to test whether
population means are significantly different from each other, using the
means from randomly drawn samples.
Any statistical test that uses two samples drawn independently of each other
and using t-distribution, can be called a 'two-sample t-test'.


Assumptions
Along with the independent single sample t-test, this test is one of the most
widely used tests. However, this test can be used only if the background assumptions are
satisfied.

● The populations from which the samples have been drawn should be normal;
appropriate statistical methods exist for testing this assumption. One needs
to note that the normality assumption has to be tested individually and
separately for the two samples. It has however been shown that minor
departures from normality do not affect this test, which is indeed an
advantage.

● The variances of the populations should be equal, i.e. σX² = σY² = σ²,
where σ² is unknown. This assumption can be tested by the F-test.

● An F-test (Snedecor and Cochran, 1983) is used to test if the variances of


two populations are equal. This test can be a two-tailed test or a one-tailed
test. The two-tailed version tests against the alternative that the variances
are not equal. The one-tailed version only tests in one direction, that is the
variance from the first population is either greater than or less than (but not
both) the second population variance. The choice is determined by the
problem. For example, if we are testing a new process, we may only be
interested in knowing if the new process is less variable than the old process.

● Samples have to be randomly drawn independent of each other. There is


however no requirement that the two samples should be of equal size - often
times they would be unequal though the odd case of equal size cannot be
ruled out.


Conceptual Examples

Question: Does the presence of a certain kind of mycorrhizal fungi enhance the growth of a certain kind of plant?
Strategy: Begin with a "subject pool" of seeds of the type of plant in question. Randomly sort them into two groups, A and B. Plant and grow them under conditions that are identical in every respect except one: namely, that the seeds of group A (the experimental group) are grown in a soil that contains the fungus, while those of group B (the control group) are grown in a soil that does not contain the fungus. After some specified period of time, harvest the plants of both groups and take the relevant measure of their respective degrees of growth. If the presence of the fungus does enhance growth, the average measure should prove greater for group A than for group B.

Question: Do two types of music, type-I and type-II, have different effects upon the ability of college students to perform a series of mental tasks requiring concentration?
Strategy: Begin with a subject pool of college students, relatively homogeneous with respect to age, record of academic achievement, and other variables potentially relevant to the performance of such a task. Randomly sort the subjects into two groups, A and B. Have the members of each group perform the series of mental tasks under conditions that are identical in every respect except one: namely, that group A has music of type-I playing in the background, while group B has music of type-II. (Note that the distinction between experimental and control group does not apply in this example.) Conclude by measuring how well the subjects perform on the series of tasks under their respective conditions. Any difference between the effects of the two types of music should show up as a difference between the mean levels of performance for group A and group B.

Question: Do two strains of mice, A and B, differ with respect to their ability to learn to avoid an aversive stimulus?
Strategy: With this type of situation you are in effect starting out with two subject pools, one for strain A and one for strain B. Draw a random sample of size Na from pool A and another of size Nb from pool B. Run the members of each group through a standard aversive-conditioning procedure, measuring for each one how well and quickly the avoidance behavior is acquired. Any difference between the avoidance-learning abilities of the two strains should manifest itself as a difference between their respective group means.


Hypothesis Testing

Choose the Appropriate Hypotheses.


There are two assumptions for the following test of comparing two
independent means:
1. The two samples are independent and
2. Each sample is randomly sampled from a population that is approximately
normally distributed.
Below are the possible null and alternative hypothesis pairs:

Research Question: Are the means of group 1 and group 2 different?
Null Hypothesis, H0: μ1 − μ2 = 0
Alternative Hypothesis, Ha: μ1 − μ2 ≠ 0
Type of Hypothesis Test: Two-tailed, non-directional

Research Question: Is the mean of group 1 greater than the mean of group 2?
Null Hypothesis, H0: μ1 − μ2 = 0
Alternative Hypothesis, Ha: μ1 − μ2 > 0
Type of Hypothesis Test: Right-tailed, directional

Research Question: Is the mean of group 1 less than the mean of group 2?
Null Hypothesis, H0: μ1 − μ2 = 0
Alternative Hypothesis, Ha: μ1 − μ2 < 0
Type of Hypothesis Test: Left-tailed, directional

Calculate an Appropriate Test Statistic


This will be a t test statistic. The calculations for these test statistics can get
quite involved. Below you are presented with the formulas that are used; however,
in real life these calculations are performed using statistical software (e.g., Minitab
Express).
Recall that test statistics are typically a fraction with the numerator being
the difference observed in the sample and the denominator being the standard
error.
The standard error of the difference between two means is different
depending on whether or not the standard deviations of the two groups are similar.


Pooled Standard Error Method


(Similar Standard Deviations)
If the two standard deviations are similar (neither is more than twice the
other), then the pooled standard error is used:

SE = sp √(1/n1 + 1/n2),  where  sp = √[ ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2) ]

and the test statistic t = (x̄1 − x̄2) / SE has n1 + n2 − 2 degrees of freedom.


Unpooled Standard Error Method


(Differing Standard Deviations)
If the two standard deviations are not similar (one is more than twice the
other), then the unpooled standard error is used:

SE = √( s1²/n1 + s2²/n2 )

The degrees of freedom are found using a complicated approximation formula (the
Welch-Satterthwaite approximation):

df ≈ (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]

NOTE: If one is performing hand calculations using the UNPOOLED method, the
choice of degrees of freedom can be made by choosing the smaller of n1 − 1 and n2 − 1.


Determine a P-Value associated with the test statistic.


The t test statistic found in Step 2 is used to determine the p value.

Decide between the null and alternative hypotheses.


If p ≤ α reject the null hypothesis. If p > α fail to reject the null hypothesis.
Let’s solve a sum!
Problem: Do males and females differ in their test scores for exam 2? The mean
test score for females is 27.1 (s=2.57, n=19), and the mean test score for males is
26.7 (s=3.63, n=20)
Step 1: State the hypotheses
H0: µ1 - µ2 = 0 (µ1 = µ2)
H1: µ1 - µ2 ≠ 0 (µ1 ≠ µ2)
This is a two-tailed test (no direction is predicted)
Step 2: Set the criterion
• α = 0.05 (two-tailed)
• df = n1 + n2 − 2 = 19 + 20 − 2 = 37
• Critical value for the t-test: approximately ±2.026
Step 3: Collect sample data, calculate x̄ and s
We know the mean test score for females is 27.1 (s=2.57, n=19), and the mean
test score for males is 26.7 (s=3.63, n=20)
Step 4: Compute the t-statistic

Calculate the estimated standard error of the difference:
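The rest of the arithmetic can be checked with software. The sketch below is one possible way to do it in Python (assuming scipy is available); it feeds the summary statistics from the problem into a pooled (equal-variance) two-sample t-test.

from scipy.stats import ttest_ind_from_stats

# Females: mean 27.1, s = 2.57, n = 19; Males: mean 26.7, s = 3.63, n = 20
t_stat, p_value = ttest_ind_from_stats(mean1=27.1, std1=2.57, nobs1=19,
                                       mean2=26.7, std2=3.63, nobs2=20,
                                       equal_var=True)   # pooled standard error

print("t = %.3f, two-tailed p = %.3f" % (t_stat, p_value))
# With df = n1 + n2 - 2 = 37 the t statistic is far smaller than the critical value,
# so the null hypothesis of equal mean scores is not rejected at the 0.05 level.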


5. Paired Sample T-Test


Paired sample t-test is a statistical technique that is used to compare two
population means in the case of two samples that are correlated. Paired sample t-
test is used in ‘before-after’ studies, or when the samples are the matched pairs, or
when it is a case-control study. For example, if we give training to a company
employee and we want to know whether or not the training had any impact on the
efficiency of the employee, we could use the paired sample t-test. We collect data
from the employee on a seven-point rating scale, before the training and after the
training.
or not training has improved the efficiency of the employee. In medicine, by using
the paired sample t-test, we can figure out whether or not a particular medicine will
cure the illness.
Steps:
1. Set up hypothesis: We set up two hypotheses. The first is the null
hypothesis, which assumes that the mean of two paired samples are equal. The
second hypothesis will be an alternative hypothesis, which assumes that the means
of two paired samples are not equal.
2. Select the level of significance: After making the hypothesis, we choose
the level of significance. In most of the cases, significance level is 5%, (in medicine,
the significance level is set at 1%).
3. Calculate the parameter: To calculate the parameter we will use the
following formula:


t = d̄ / √(s²/n)

Where d̄ (d bar) is the mean difference between the two samples, s² is the sample
variance of the differences, n is the sample size and t is a paired sample t-test with
n − 1 degrees of freedom. An alternate formula for the paired sample t-test is:

t = Σd / √[ (n Σd² − (Σd)²) / (n − 1) ]

Testing of hypothesis or decision making


After calculating the parameter, we will compare the calculated value with
the table value. If the calculated value is greater than the table value, then we will
reject the null hypothesis for the paired sample t-test. If the calculated value is less
than the table value, then we will accept the null hypothesis and say that there is
no significant mean difference between the two paired samples.

Assumptions
1. Only the matched pairs can be used to perform the test.
2. Normal distributions are assumed.
3. The variances of the two samples are equal.
4. Cases must be independent of each other.

Example:
Trace metals in drinking water affect the flavour and an unusually high
concentration can pose a health hazard. Ten pairs of data were taken measuring
zinc concentration in bottom water and surface water.
Does the data suggest that the true average concentration in the bottom water
exceeds that of surface water?


Zinc concentration in bottom water    Zinc concentration in surface water

0.43 0.415

0.266 0.238

0.567 0.39

0.531 0.41

0.707 0.605

0.716 0.609

0.651 0.632

0.589 0.523

0.469 0.411

0.723 0.612

A check of the paired differences suggests that they may come from a normal distribution, so a paired t-test is appropriate.

Step 1. Set up the hypotheses:


H0: μd = 0
Ha: μd > 0

Where 'd' is defined as the difference bottom − surface.

Step 2. Write down the significance level: α = 0.05.


Step 3. What is the critical value and the rejection region?

α = 0.05, df = 9
t0.05 = 1.833

Rejection region: t ≥ 1.833

Step 4. Compute the value of the test statistic:

t* = d̄ / (sd / √n) = 0.0804 / (0.0523 / √10) = 4.86

Step 5. Check whether the test statistic falls in the rejection region and determine
whether to reject H0.

t* = 4.86 > 1.833, so we reject H0.

Step 6. State the conclusion in words.


At α = 0.05, we conclude that, on average, the bottom zinc concentration is
higher than the surface zinc concentration.
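The same conclusion can be reproduced programmatically. The sketch below (assuming scipy is available) runs a paired t-test on the ten zinc-concentration pairs from the table; because the alternative is one-sided and the t statistic is positive, the one-sided p-value is half the two-sided value returned by the function.

from scipy.stats import ttest_rel

bottom  = [0.430, 0.266, 0.567, 0.531, 0.707, 0.716, 0.651, 0.589, 0.469, 0.723]
surface = [0.415, 0.238, 0.390, 0.410, 0.605, 0.609, 0.632, 0.523, 0.411, 0.612]

t_stat, p_two_sided = ttest_rel(bottom, surface)   # paired (dependent) samples t-test
p_one_sided = p_two_sided / 2                      # Ha: mean difference > 0, t is positive

print("t = %.2f, one-sided p = %.4f" % (t_stat, p_one_sided))
# t is approximately 4.86 with 9 degrees of freedom, so H0 is rejected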



CHAPTER 5
Association Between Variables
1. Association between Variables
Very frequently social scientists want to determine the strength of the
association of two or more variables. For example, one might want to know if
greater population size is associated with higher crime rates or whether there are
any differences between numbers employed by sex and race. For categorical data
such as sex, race, occupation, and place of birth, tables, called contingency tables,
that show the counts of persons who simultaneously fall within the various
categories of two or more variables are created. The Bureau of the Census reports
many tables in this form such as sex by age by race or sex by occupation by region.
For continuous data such as population, age, income, and housing the strength of
the association can be measured through correlation statistics.

Chi-Square Test for Independence:


This lesson explains how to conduct a chi-square test for independence.
The test is applied when you have two categorical variables from a single
population. It is used to determine whether there is a significant association
between the two variables.
For example, in an election survey, voters might be classified by gender (male
or female) and voting preference (Democrat, Republican, or Independent). We could
use a chi-square test for independence to determine whether gender is related to
voting preference.


When to Use Chi-Square Test for Independence


The test procedure described in this lesson is appropriate when the following
conditions are met:
▪ The sampling method is simple random sampling.
▪ The variables under study are each categorical.
▪ If sample data are displayed in a contingency table, the expected frequency
count for each cell of the table is at least 5.
This approach consists of four steps: (1) state the hypotheses, (2) formulate an
analysis plan, (3) analyze sample data, and (4) interpret results.

State the Hypotheses:


Suppose that Variable A has r levels, and Variable B has c levels. The null
hypothesis states that knowing the level of Variable A does not help you predict the
level of Variable B. That is, the variables are independent.

H0: Variable A and Variable B are independent.


Ha: Variable A and Variable B are not independent.

The alternative hypothesis is that knowing the level of Variable A can help
you predict the level of Variable B.

Note: Support for the alternative hypothesis suggests that the variables are
related; but the relationship is not necessarily causal, in the sense that one
variable "causes" the other.

Formulate an Analysis Plan


The analysis plan describes how to use sample data to accept or reject the
null hypothesis. The plan should specify the following elements.
▪ Significance level. Often, researchers choose significance levels equal to 0.01,
0.05, or 0.10; but any value between 0 and 1 can be used.
▪ Test method. Use the chi-square test for independence to determine whether
there is a significant relationship between two categorical variables.


Analyze Sample Data


Using sample data, find the degrees of freedom, expected frequencies, test
statistic, and the P-value associated with the test statistic. The approach described
in this section is illustrated in the sample problem at the end of this lesson.
▪ Degrees of freedom. The degrees of freedom (DF) is equal to:
DF = (r - 1) * (c - 1)
where r is the number of levels for one categorical variable, and c is the
number of levels for the other categorical variable.
▪ Expected frequencies. The expected frequency counts are computed
separately for each level of one categorical variable at each level of the other
categorical variable. Compute r * c expected frequencies, according to the
following formula.
Er,c = (nr * nc) / n
where Er,c is the expected frequency count for level r of Variable A and
level c of Variable B, nr is the total number of sample observations at level r of
Variable A, nc is the total number of sample observations at level c of
Variable B, and n is the total sample size.

▪ Test statistic. The test statistic is a chi-square random variable (χ²)
defined by the following equation.

χ² = Σ [ (Or,c − Er,c)² / Er,c ]


where Or,c is the observed frequency count at level r of Variable A and
level c of Variable B, and Er,c is the expected frequency count at level r of
Variable A and level c of Variable B.
▪ P-value. The P-value is the probability of observing a sample statistic as
extreme as the test statistic. Since the test statistic is a chi-square, use
the Chi-Square Distribution Calculator to assess the probability associated
with the test statistic. Use the degrees of freedom computed above.

Interpret Results
If the sample findings are unlikely, given the null hypothesis, the researcher
rejects the null hypothesis. Typically, this involves comparing the P-value to
the significance level, and rejecting the null hypothesis when the P-value is less
than the significance level.


TEST YOUR UNDERSTANDING

Problem
A public opinion poll surveyed a simple random sample of 1000 voters.
Respondents were classified by gender (male or female) and by voting preference
(Republican, Democrat, or Independent). Results are shown in the contingency
table below.

                    Voting Preferences
               Republican   Democrat   Independent   Row total
Male               200          150          50          400
Female             250          300          50          600
Column total       450          450         100         1000

Is there a gender gap? Do the men's voting preferences differ significantly


from the women's preferences? Use a 0.05 level of significance.
Solution
The solution to this problem takes four steps: (1) state the hypotheses, (2)
formulate an analysis plan, (3) analyze sample data, and (4) interpret results. We
work through those steps below:
▪ State the hypotheses. The first step is to state the null hypothesis and an
alternative hypothesis.
H0: Gender and voting preferences are independent.
Ha: Gender and voting preferences are not independent.
▪ Formulate an analysis plan. For this analysis, the significance level is
0.05. Using sample data, we will conduct a chi-square test for independence.
▪ Analyze sample data. Applying the chi-square test for independence to
sample data, we compute the degrees of freedom, the expected frequency
counts, and the chi-square test statistic. Based on the chi-square statistic and
the degrees of freedom, we determine the P-value.


DF = (r - 1) * (c - 1) = (2 - 1) * (3 - 1) = 2

Er,c = (nr * nc) / n


E1,1 = (400 * 450) / 1000 = 180000/1000 = 180
E1,2 = (400 * 450) / 1000 = 180000/1000 = 180
E1,3 = (400 * 100) / 1000 = 40000/1000 = 40
E2,1 = (600 * 450) / 1000 = 270000/1000 = 270
E2,2 = (600 * 450) / 1000 = 270000/1000 = 270
E2,3 = (600 * 100) / 1000 = 60000/1000 = 60

χ² = Σ [ (Or,c − Er,c)² / Er,c ]

χ² = (200 − 180)²/180 + (150 − 180)²/180 + (50 − 40)²/40
     + (250 − 270)²/270 + (300 − 270)²/270 + (50 − 60)²/60
χ² = 400/180 + 900/180 + 100/40 + 400/270 + 900/270 + 100/60
χ² = 2.22 + 5.00 + 2.50 + 1.48 + 3.33 + 1.67 = 16.2
where DF is the degrees of freedom, r is the number of levels of gender, c is
the number of levels of the voting preference, nr is the number of observations
from level r of gender, nc is the number of observations from level c of voting
preference, n is the number of observations in the sample, Er,c is the expected
frequency count when gender is level r and voting preference is level c, and
Or,c is the observed frequency count when gender is level r and voting preference
is level c.
The P-value is the probability that a chi-square statistic having 2 degrees of
freedom is more extreme than 16.2.
We use the Chi-Square Distribution Calculator to find P(χ² > 16.2) = 0.0003.
▪ Interpret results. Since the P-value (0.0003) is less than the significance
level (0.05), we reject the null hypothesis. Thus, we conclude that
there is a relationship between gender and voting preference.
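For readers who prefer software to hand calculation, the sketch below (assuming scipy is available) runs the same chi-square test of independence on the contingency table; it returns the statistic, the p-value, the degrees of freedom and the table of expected frequencies.

import numpy as np
from scipy.stats import chi2_contingency

#                  Republican  Democrat  Independent
observed = np.array([[200, 150, 50],     # Male
                     [250, 300, 50]])    # Female

chi2, p_value, dof, expected = chi2_contingency(observed)
print("chi-square = %.1f, df = %d, p = %.4f" % (chi2, dof, p_value))
print(expected)    # matches the expected counts computed by hand above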

2. Scatterplots
A scatterplot is used to graphically represent the relationship between two
variables. Explore the relationship between scatterplots and correlations, the
different types of correlations, how to interpret scatterplots, and more.


SCATTERPLOT
Imagine that you are interested in studying patterns in individuals with
children under the age of 10. You collect data from 25 individuals who have at least
one child. After you've collected your data, you enter it into a table.

You try to draw conclusions about the data from the table; however, you find
yourself overwhelmed. You decide an easier way to analyze the data is by
comparing the variables two at a time. In order to see how the variables relate to
each other, you create scatterplots.
So what is a scatterplot? A scatterplot is a graph that is used to plot the
data points for two variables. Each scatterplot has a horizontal axis (x-axis) and a
vertical axis (y-axis). One variable is plotted on each axis. Scatterplots are made up
of marks; each mark represents one study participant's measures on the variables
that are on the x-axis and y-axis of the scatterplot.


Most scatterplots contain a line of best fit, which is a straight line drawn
through the center of the data points that best represents the trend of the data.
Scatterplots provide a visual representation of the correlation, or relationship
between the two variables.
One of the most commonly used formulas in stats is Pearson’s correlation
coefficient formula. In fact, if you’re taking a basic stats class, this is the one you’ll
probably use:
r = Σ(xi − x̄)(yi − ȳ) / √[ Σ(xi − x̄)² · Σ(yi − ȳ)² ]   (all sums running over i = 1, ..., n)

3. Types of Correlation
All correlations have two properties: strength and direction. The strength of
a correlation is determined by its numerical value. The direction of the correlation
is determined by whether the correlation is positive or negative.
● Positive correlation: Both variables move in the same direction. In other
words, as one variable increases, the other variable also increases. As one
variable decreases, the other variable also decreases.
o E.g., years of education and yearly salary are positively correlated.
● Negative correlation: The variables move in opposite directions. As one
variable increases, the other variable decreases. As one variable decreases,
the other variable increases.
o E.g., hours spent sleeping and hours spent awake are negatively
correlated.

Figure 27: Showing Types of Correlation


All positive correlations have scatterplots that move in the same


direction as the positive correlation in the image above. All negative
correlations have scatterplots that move in the same direction as the
negative correlation in the image above.

No Correlations
What does it mean to say that two variables have no correlation? It means that there is no apparent relationship between the two variables.
For example, there is no correlation between shoe size and salary. This means that high scores on shoe size are just as likely to occur with high scores on salary as they are with low scores on salary.
If your line of best fit is horizontal or vertical like the scatterplots on the top row, or if you are unable to draw a line of best fit because there is no pattern in the data points, then there is little or no correlation.

Strength
The strength of a correlation indicates how strong the relationship is between
the two variables. The strength is determined by the numerical value of the
correlation. A correlation of 1, whether it is +1 or -1, is a perfect correlation. In
perfect correlations, the data points lie directly on the line of fit. The further the
data are from the line of fit, the weaker the correlation. A correlation of 0 indicates
that there is no correlation. The following should be considered when determining
the strength of a correlation:
The closer a positive correlation lies to +1, the stronger it is.
E.g., a correlation of +.87 is stronger than a correlation of +.42.
The closer a negative correlation is to -1, the stronger it is.
E.g., a correlation of -.84 is stronger than a correlation of -.31.


When comparing a positive correlation to a negative correlation, only look at the
numerical value. Do not consider whether the correlation is positive or
negative. The correlation with the highest numerical value is the strongest.
E.g., a correlation of -.80 is stronger than a correlation of +.55.
If the numerical values of two correlations are the same, then they have the same
strength no matter whether the correlation is positive or negative.
E.g., a correlation of -.80 has the same strength as a correlation of +.80.

Interpretations of Scatterplots
So what can we learn from scatterplots? Let's create scatterplots using some
of the variables in our table. Let's first compare age to Internet use. Now let's put
this on a scatterplot. Age is plotted on the y-axis of the scatterplot and Internet
usage is plotted on the x-axis.
We see that there is a negative correlation between age and Internet usage. That means that as age increases, the amount of time spent on the Internet declines, and vice versa. The direction of the scatterplot is a negative correlation! In the upper right corner of the scatterplot, we see r = -.87. Since r signifies the correlation, this means that our correlation is -.87.

Figure 28: Scatterplot between Age and Internet usage

Partial Correlation
A correlation between two variables in which the effects of other variables are
held constant is known as partial correlation.
The partial correlation for 1 and 2 with controlling variable 3 is given by:
r12.3 = (r12 − r13 r23) / [ √(1 − r13²) √(1 − r23²) ]


Partial correlation is the relationship between two variables while controlling


for a third variable. The purpose is to find the unique variance between two
variables while eliminating the variance from a third variable.
Simple correlation does not prove to be an all-encompassing technique
especially under the above circumstances. In order to get a correct picture of the
relationship between two variables, we should first eliminate the influence of other
variables.

Examples:
1. Study of partial correlation between price and demand would involve studying
the relationship between price and demand excluding the effect of money supply,
exports, etc.
2. We might find that the ordinary correlation between blood pressure and blood
cholesterol is a high, strong positive correlation. We could potentially find
a very small partial correlation between these two variables, after we have taken
into account the age of the subject. If this were the case, this might suggest that
both variables are related to age, and the observed correlation is only due to their
common relationship to age.
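The partial correlation formula given above translates into a few lines of code. The sketch below is a minimal Python illustration; the three zero-order correlations used are hypothetical.

import math

def partial_corr(r12, r13, r23):
    """Partial correlation of variables 1 and 2, controlling for variable 3."""
    return (r12 - r13 * r23) / (math.sqrt(1 - r13 ** 2) * math.sqrt(1 - r23 ** 2))

# Hypothetical zero-order correlations
print(partial_corr(r12=0.70, r13=0.60, r23=0.65))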



CHAPTER 6
Concept of ANOVA (Analysis of Variance)
1. Introduction
Analysis of Variance (ANOVA) is a hypothesis-testing technique used to test
the equality of two or more population (or treatment) means by examining the
variances of samples that are taken. ANOVA allows one to determine whether the
differences between the samples are simply due to random error (sampling errors)
or whether there are systematic treatment effects that cause the mean in one
group to differ from the mean in another. Briefly, ANOVA is used for testing
overall significance of regression.

2. One-Way ANOVA
A One-Way ANOVA (Analysis of Variance) is a statistical technique
by which we can test if three or more means are equal. It tests if the value
of a single variable differs significantly among three or more levels of a
factor.
We can say we have a framework for one-way ANOVA when we have a single
factor with three or more levels and multiple observations at each level.
In this kind of layout, we can calculate the mean of the observations within
each level of our factor.
The concepts of factor, levels and multiple observations at each level can be
best understood by an example.


Assumptions
For the validity of the results, some assumptions have been checked to hold
before the technique is applied. These are:

Assumptions of ANOVA:
(i) All populations involved follow a normal distribution.
(ii) All populations have the same variance (or standard deviation).
(iii) The samples are randomly selected and independent of one another.

Advantages
One of the principal advantages of this technique is that the number of
observations need not be the same in each group.
Additionally, layout of the design and statistical analysis is simple.

Factor and Levels - An Example


Let us suppose that the Human Resources Department of a company desires
to know if occupational stress varies according to age.
The variable of interest is therefore occupational stress as measured by a
scale. The factor being studied is age. There is just one factor (age) and hence a
situation appropriate for one-way ANOVA.
Further suppose that the employees have been classified into three groups
(levels):
● less than 40
● 40 to 55
● above 55
These three groups are the levels of factor age - there are three levels here.
With this design, we shall have multiple observations in the form of scores on
Occupational Stress from a number of employees belonging to the three levels of
factor age. We are interested to know whether all the levels i.e. age groups have
equal stress on the average.
Non-significance of the test statistic (F-statistic) associated with this
technique would imply that age has no effect on stress experienced by employees in


their respective occupations. On the other hand, significance would imply that
stress affects different age groups differently.

Hypothesis Testing
Formally, the null hypothesis to be tested is of the form:

H0: All the age groups have equal stress on the average or μ1 = μ2 = μ3 ,
where μ1, μ2, μ3 are mean stress scores for the three age groups.

The alternative hypothesis is:

H1: The mean stress of at least one age group is significantly different.

NOTE: One-way Anova and T-Test


The one-way ANOVA is an extension of the independent two-sample t-test.
In the above example, if we considered only two age groups, say below 40 and
above 40, then the independent samples t-test would have been enough although
application of ANOVA would have also produced the same result.
In the example considered above, there were three age groups and hence it was
necessary to use one-way ANOVA.
Often the interest is on acceptance or rejection of the null hypothesis. If it is
rejected, this technique will not identify the level which is significantly different.
One has to perform t-tests for this purpose.
This implies that if there exists a difference between the means, we would have to carry out 3C2 independent t-tests in order to locate the level which is significantly different. It would be kC2 t-tests in the general one-way ANOVA design with k levels.

Table 7: One Way ANOVA Table
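A one-way ANOVA of the kind described here can be run with scipy's f_oneway function; the occupational-stress scores below are invented purely to show the mechanics for the three age groups.

from scipy.stats import f_oneway

# Invented stress scores for the three levels of the factor 'age'
under_40 = [55, 60, 58, 62, 59]
from_40_to_55 = [64, 66, 63, 68, 65]
above_55 = [61, 59, 63, 60, 62]

f_stat, p_value = f_oneway(under_40, from_40_to_55, above_55)
print("F = %.2f, p = %.4f" % (f_stat, p_value))
# A p-value below the chosen significance level would lead us to reject H0
# that all age groups have equal mean stress.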


3. Two-Way ANOVA
A Two-Way ANOVA is useful when we desire to compare the effect of
multiple levels of two factors and we have multiple observations at each
level.
One-Way ANOVA compares three or more levels of one factor. But some
experiments involve two factors each with multiple levels in which case it is
appropriate to use Two-Way ANOVA.

Assumptions
The assumptions in both versions remain the same:
● normality
● independence and
● equality of variance.

Advantages
● An important advantage of this design is that it is more efficient than its one-way
counterpart. There are two assignable sources of variation - age and gender
in our example - and this helps to reduce error variation thereby making this
design more efficient.
● Unlike One-Way ANOVA, it enables us to test the effect of two factors at the
same time.
● One can also test for independence of the factors provided there is more
than one observation in each cell. The only restriction is that the number of
observations in each cell has to be equal (there is no such restriction in case
of one-way ANOVA).

Factors and Levels - An Example


A Two-Way ANOVA is a design with two factors.
Let us suppose that the Human Resources Department of a company desires
to know if occupational stress varies according to age and gender.
The variable of interest is therefore occupational stress as measured by a scale.
There are two factors being studied - age and gender.


Further suppose that the employees have been classified into three groups
or levels:
● age less than 40,
● 40 to 55
● above 55
In addition, employees have been classified by gender (levels):
● male
● female
In this design, factor age has three levels and gender two. In all, there are 3 x
2 = 6 groups or cells. With this layout, we obtain scores on occupational stress from
employee(s) belonging to the six cells.

Testing for Interaction


There are two versions of the Two-Way ANOVA.
The basic version has one observation in each cell - one occupational stress
score from one employee in each of the six cells.
The second version has more than one observation per cell but the number of
observations in each cell must be equal. The advantage of the second version is it
also helps us to test if there is any interaction between the two factors.
For instance, in the example above, we may be interested to know if there is
any interaction between age and gender.
This helps us to know if age and gender are independent of each other - they
are independent if the effect of age on stress remains the same irrespective of
whether we take gender into consideration.

Hypothesis Testing
In the basic version there are two null hypotheses to be tested.
● H01: All the age groups have equal stress on the average
● H02: Both the gender groups have equal stress on the average.
In the second version, a third hypothesis is also tested:
● H03: The two factors are independent or that interaction effect is not present.
The computational aspect involves computing F-statistic for each hypothesis.
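One possible way to compute these F-statistics in Python is via statsmodels; the sketch below builds a small, entirely hypothetical balanced layout (two observations per age-gender cell) and fits the second version of the design, interaction term included.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical balanced data: 2 stress scores in each of the 3 x 2 = 6 cells
data = pd.DataFrame({
    "age":    ["<40", "<40", "40-55", "40-55", ">55", ">55"] * 2,
    "gender": ["male"] * 6 + ["female"] * 6,
    "stress": [55, 58, 64, 66, 61, 63, 57, 59, 67, 69, 60, 62],
})

model = ols("stress ~ C(age) * C(gender)", data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)   # F-statistics for age, gender and the interaction
print(anova_table)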


Table 8: Two-Way ANOVA Table

TSS, RSS and ESS


(Total Sum of Squares, Residual Sum of Squares and
Explained Sum of Squares)
Consider the diagram below. Yi is the actual observed value of the dependent
variable, and y-hat is the value of the dependent variable according to the regression
line, as predicted by our regression model. What we want to get a feel for is the
variability of actual y around the regression line, i.e., the volatility of ϵ. This is given
by the distance yi minus y-hat, represented in the figure below as RSS. The figure
below also shows TSS and ESS; spend a few minutes looking at what TSS, RSS
and ESS represent.

Figure 29: Showing Total SS, Residual SS and Explained SS
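These three quantities can be computed directly once a line has been fitted. The sketch below (with invented data, assuming numpy is available) fits a least-squares line and checks the identity TSS = ESS + RSS.

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)        # invented data
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 7.8, 9.2])

slope, intercept = np.polyfit(x, y, deg=1)                  # fitted regression line
y_hat = slope * x + intercept

tss = ((y - y.mean()) ** 2).sum()       # total sum of squares
ess = ((y_hat - y.mean()) ** 2).sum()   # explained sum of squares
rss = ((y - y_hat) ** 2).sum()          # residual sum of squares

print(tss, ess + rss)                   # the two agree up to rounding error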



CHAPTER 7
Factor Analysis
In analytics, we always have a motive of explaining the causation of an event.
To describe the causal relationship, we use regression analysis. Linear regression is
an important tool for predictive analytics. There is a set of assumptions that
we need to check before applying linear regression to the data. Multicollinearity
is one of the most important conditions that statisticians check for before going
ahead with the analysis.
Multicollinearity is a state of very high intercorrelations or inter-
associations among the independent variables. It is therefore a type of disturbance
in the data, and if present in the data the statistical inferences made about the data
may not be reliable.

There are certain reasons why multicollinearity occurs


● It is caused by an inaccurate use of dummy variables.
● It is caused by the inclusion of a variable which is computed from other
variables in the data set.
● Multicollinearity can also result from the repetition of the same kind of
variable.
● Generally, it occurs when the variables are highly correlated with each other.

Multicollinearity can result in several problems.


These problems are as follows:
● Due to multicollinearity, the partial regression coefficients may not be
estimated precisely. The standard errors are likely to be high.


● Multicollinearity results in a change in the signs as well as in the magnitudes


of the partial regression coefficients from one sample to another sample.
● Multicollinearity makes it tedious to assess the relative importance of the
independent variables in explaining the variation in the dependent
variable.
In the presence of high multicollinearity, the confidence intervals of the
coefficients tend to become very wide and the t-statistics tend to be very small. It
becomes difficult to reject the null hypothesis of any study when multicollinearity is
present in the data under study.

There are certain signals which help the researcher to


detect the degree of multicollinearity.
One such signal is if the individual outcome of a statistic is not significant
but the overall outcome of the statistic is significant. In this instance, the
researcher might get a mix of significant and insignificant results that show the
presence of multicollinearity. Suppose the researcher, after dividing the sample into
two parts, finds that the coefficients of the sample differ drastically. This indicates
the presence of multicollinearity. This means that the coefficients are unstable due
to the presence of multicollinearity. Suppose the researcher observes drastic change
in the model by simply adding or dropping some variable. This also indicates that
multicollinearity is present in the data.
Multicollinearity can also be detected with the help of tolerance and its
reciprocal, called variance inflation factor (VIF). If the value of tolerance is less than
0.2 or 0.1 and, simultaneously, the value of VIF is 10 or above, then the
multicollinearity is problematic.
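In Python, VIF (and hence tolerance, its reciprocal) can be obtained with statsmodels; the sketch below constructs hypothetical predictors in which one variable is nearly a linear combination of the other two, so its VIF comes out very large.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors; x3 is almost a linear combination of x1 and x2
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + x2 + rng.normal(scale=0.01, size=100)

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
print(vif)   # ignore the constant's row; VIF values of 10 or above flag problematic multicollinearity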

Remedial Measures
● Increasing sample size
● Transformation of variables
● Dropping variables
These measures are often difficult to apply to real-life data. Most of the time,
increasing the sample size becomes a costly affair, and in predictive analysis
dropping variables is not a suitable way of handling data. So, we go for FACTOR
ANALYSIS.


Curse of Dimensionality
For understanding factor analysis, it is extremely important to understand what the
"curse of dimensionality" is. The fact that the number of samples needed per variable
increases exponentially with the number of variables, in order to maintain a given
level of accuracy, is called the "Curse of Dimensionality." Grouping is a fundamental
idea here: homogeneous items will have similar characteristics. If there are too many
variables without homogeneous characteristics, then grouping becomes difficult.
When we have a data set with heterogeneous variables we call it an n-dimensional
data set, where all points become very distinct, so we cannot group them. When we
say n-dimensional, that means there are n variables. When n is very large, the
variance measures do not work properly; variance becomes an ineffective and
inefficient parameter.

Sparse Data Set or Sparse Matrix


A sparse data set is a dataset where most of the observations are blank. For
example, if you look for some words related to biology in a collection of 10000 books,
only the books related to biology will contain them; in the remaining books those
words will not appear. So if you consider the words as individual variables, and the
documents as rows, you can build a data set consisting of 0 and 1.
● 1 indicates the word is present in the document.
● 0 indicates that word is not present in that document.
In this dataset, most of the observations will be 0. This is an example of
sparse data set.
Factor and cluster analysis is very useful in data sets of this type. The
machine learning technique used in this analysis is called Latent Semantic Analysis
(LSA). This is similar to factor analysis. (Google uses this tool in search engine
along with many others).
The next topic that we should concentrate on is the understanding of the hidden
factor. Psychologists use a set of questionnaires to understand how the mind works,
before treatment. They have an idea of the illness, for which they use a set of simple
questions that lead to the inference they feel is true.

Let’s take a very simple example: Suppose a girl likes a guy. She will
never go to him and ask him directly, so she uses a set of indirect questions to learn
his feelings for her.


Singular Value Decomposition


A factor is always hidden. We believe that factor is influencing some
variables which we can measure directly. So by studying these explicit variables, we
can get an idea about the underlying factor. This is what we call FACTOR
ANALYSIS in Mathematics or statistics. One of the tools used for such calculations
is SVD (Singular Value Decomposition).
Factors are not explicitly mentioned in the questionnaire. Respondents may feel that
three questions are essentially the same, so those three questions may be clustered or clubbed together.

Example: Bonus and direct salary are similar from an employee's perspective.
While applying SVD nothing is added or subtracted; only the reference point
changes. The information or data does not change. A new coordinate system is
created by changing the reference point.

Graphically:

Figure 30: Regression Line

Principal component analysis under SVD is a way in which the total information is
split into different directions. The information is not distributed evenly: along some
dimensions there is more information (variation), and along others there is less.


Figure 31: Segmentation of a regression line through PCA

The above figure shows the rotation technique used in PCA and the dropping of
low-variation axes in Factor Analysis.

The basic difference between PCA and Factor Analysis


PCA retains the same information that is given in the original data: if the
dimension was 50 before applying SVD, it stays 50 after application.
Factor Analysis removes components whose information is less relevant; it may
happen that only 30 dimensions are retained in a data set of 50 dimensions or
variables.

1. Principal Component Analysis


Principal component analysis (PCA) is a statistical procedure that uses
an orthogonal transformation to convert a set of observations of possibly
correlated variables into a set of values of linearly uncorrelated variables called
principal components. The number of principal components is less than or equal
to the number of original variables. This transformation is defined in such a way
that the first principal component has the largest possible variance (that is,
accounts for as much of the variability in the data as possible), and each succeeding
component in turn has the highest variance possible under the constraint that it is
orthogonal to (i.e., uncorrelated with) the preceding components.
Principal Component Analysis (PCA) is a dimension-reduction tool that can
be used to reduce a large set of variables to a small set that still contains most of
the information in the large set.
• Principal component analysis (PCA) is a mathematical procedure that
transforms a number of (possibly) correlated variables into a (smaller)
number of uncorrelated variables called principal components.


• The first principal component accounts for as much of the variability in the
data as possible, and each succeeding component accounts for as much of the
remaining variability as possible.

2. Eigen Values and Eigen Vector


An eigenvalue is a number, telling you how much variance there is in the
data in that direction. The eigenvector with the highest eigenvalue is therefore the
principal component.
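To see eigenvalues and eigenvectors in this role, one can extract them from a correlation matrix with numpy, which mirrors what PCA does internally; the data below are invented, with one variable deliberately made to track another.

import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 4))                          # invented 4-variable data set
data[:, 3] = data[:, 0] + 0.1 * rng.normal(size=200)      # make two variables strongly correlated

corr = np.corrcoef(data, rowvar=False)                    # 4 x 4 correlation matrix
eigenvalues, eigenvectors = np.linalg.eigh(corr)          # eigh is suitable for symmetric matrices

order = np.argsort(eigenvalues)[::-1]                     # sort from largest to smallest
print(eigenvalues[order])                                 # variance carried by each direction
print(eigenvectors[:, order[0]])                          # direction of maximum variance (first principal component)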

3. Pre-Requisites of Factor Analysis


Correlation Matrix Check:
● Is it a combination of high and low correlations?
● Correlation can have two outcomes: causation and spurious correlation.
Factor Analysis is mainly applied to avoid spurious correlation, which occurs
due to time regression or a time factor.
Bartlett’s Test of Sphericity
It is a test statistic used to examine the hypothesis that the variables are
uncorrelated in the population; in other words, that the population correlation matrix is
an identity matrix.
Kaiser-Meyer-Olkin (KMO) Measure of Sampling Adequacy
The KMO measure of sampling adequacy is an index to examine the
appropriateness of factor analysis. High values (0.5 to 1) indicate factor analysis is
appropriate. Values below 0.5 imply that factor analysis is not appropriate.
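Bartlett's test can be coded directly from its definition (the KMO index is also available in third-party packages such as factor_analyzer, but it is left out here to keep the example self-contained). The sketch below uses invented data; a very small p-value speaks against the identity-matrix hypothesis and hence in favour of factor analysis.

import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(data):
    """Bartlett's test that the population correlation matrix is an identity matrix."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    statistic = -(n - 1 - (2 * p + 5) / 6.0) * np.log(np.linalg.det(corr))
    df = p * (p - 1) / 2.0
    return statistic, chi2.sf(statistic, df)

rng = np.random.default_rng(2)
data = rng.normal(size=(150, 5))                       # invented data
data[:, 1] = data[:, 0] + 0.2 * rng.normal(size=150)   # induce some correlation

print(bartlett_sphericity(data))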

Exploratory Factor Analysis


EFA is based on the common factor model. Within the common factor model,
measured variables are expressed as a function of common factors, unique factors, and
errors of measurement. Common factors influence two or more measured variables,


while each unique factor influences only one measured variable and does not
explain correlations among measured variables.

Flowchart 1: Exploratory Factor Analysis

4. Factor Loadings
To obtain a principal component, each of the weights of an eigenvector is
multiplied by the square root of the principal component's associated eigenvalue.
These newly generated weights are called factor loadings and represent the
correlation of each item with the given principal component.

5. Deciding The Number of Factors

● A Priori Criterion: the number of factors to extract is pre-decided.
● Eigen Value Criterion:
   o Mineigen Criterion: we decide the floor of the eigenvalue. If the floor is 0.6 and there are 3 eigenvalues above that mark, then we are looking for 3 factors.
● Proportional and Cumulative Variance: we consider how much information is explained by an individual factor and, on aggregate, by the selected factors.
● Scree Plot: this is basically a graphical presentation of the proportional variance.

Figure 32: Screeplot


6. What are factor loadings?


Factor loadings represent how much a factor explains a variable in factor
analysis.
For example, a credit card company creates a survey to assess customer
satisfaction. The survey is designed to answer questions in three categories:
timeliness of service, accuracy of the service, and courteousness of phone operators. For each survey question, examine the highest (positive or negative) loadings to determine which factor affects that question the most. In the following table, questions 1-3 load on factor 1, questions 4-5 load on factor 2, and questions 6-8 load on factor 3.

Figure 33: Factor Loading Matrix

Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the factor strongly affects the variable. Loadings close to zero indicate that the factor has a weak effect on the variable.

7. Problem of Factor
Loadings
Initially, the weights are distributed
across all the variables. So it is not possible
to understand the underlying factor of one
or more variables. To remove this problem,
we apply rotation to the axes.


Rotate the Factors


When more than one factor is retained, unrotated factors cannot be interpreted in most cases. Rotation does not affect the mathematical fit of the solution!
Two types of rotations:
a) Orthogonal rotation: the factors are uncorrelated (= orthogonal)
b) Oblique rotation: the factors may (or may not) be correlated

Figure 34: Factor Rotations

Orthogonal Rotations
VARIMAX (which simplifies factors) dilutes the problem of ties. When there is a tie
between two factor correlation values, we can rotate the axes so that the variable
becomes more strongly correlated with one of the factors. This approach maximizes
the variance of the loadings. As it is an orthogonal system, even after Varimax the
components remain perpendicular to each other.

Oblique Rotations
PROMAX: The problem with oblique rotation is that it makes the factors correlated. Varimax rotation is used in principal component analysis so that the axes are rotated to a position in which the sum of the variances of the loadings is the maximum possible.

Figure 35: Oblique Rotations



CHAPTER 8
Cluster Analysis
1. Cluster Analysis
Cluster analysis is a multivariate method which aims to classify a sample of
subjects (or objects) on the basis of a set of measured variables into a number of
different groups such that similar subjects are placed in the same group. An
example where this might be used is in the field of psychiatry, where the
characterisation of patients on the basis of clusters of symptoms can be useful in the
identification of an appropriate form of therapy. In marketing, it may be useful to
identify distinct groups of potential customers so that, for example, advertising can
be appropriately targeted.

WARNING ABOUT CLUSTER ANALYSIS


Cluster analysis has no mechanism for differentiating between relevant and
irrelevant variables. Therefore, the choice of variables included in a cluster analysis
must be underpinned by conceptual considerations. This is very important because
the clusters formed can be very dependent on the variables included.

Approaches to Cluster Analysis


There are a number of different methods that can be used to carry out a
cluster analysis; these methods can be classified as follows:

Hierarchical methods
Agglomerative methods, in which subjects start in their own separate
cluster. The two ’closest’ (most similar) clusters are then combined and this is done
repeatedly until all subjects are in one cluster. At the end, the optimum number of
clusters is then chosen out of all cluster solutions.


Divisive methods, in which all subjects start in the same cluster and the
above strategy is applied in reverse until every subject is in a separate cluster.
Agglomerative methods are used more often than divisive methods, so this chapter will concentrate on the former rather than the latter.

Non-hierarchical methods
(often known as k-means clustering methods)

2. Types of data and measures of distance


The data used in cluster analysis can be interval, ordinal or categorical.
However, having a mixture of different types of variable will make the analysis
more complicated. This is because in cluster analysis you need to have some way of
measuring the distance between observations and the type of measure used will
depend on what type of data you have. A number of different measures have been
proposed to measure ’distance’ for binary and categorical data.
For interval data the most common distance measure used is the Euclidean
distance.

3. Similarity
Euclidean distance
In general, if you have p variables X1, X2, . . . , Xp measured on a sample of n
subjects, the observed data for subject i can be denoted by xi1, xi2, . . . , xip and the
observed data for subject j by xj1, xj2, . . . , xjp. The Euclidean distance between
these two subjects is given by

d_E(x_i, x_j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ip} - x_{jp})^2} = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}

Cosine similarity
It is a measure of similarity between two non-zero vectors of an inner product
space that measures the cosine of the angle between them. The cosine of 0° is 1, and
it is less than 1 for any other angle. It is thus a judgment of orientation and not
magnitude: two vectors with the same orientation have a cosine similarity of 1, two
vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a
similarity of -1, independent of their magnitude. Cosine similarity is particularly


used in positive space, where the outcome is neatly bounded in [0,1]. The name
derives from the term "direction cosine": in this case, note that unit vectors are
maximally "similar" if they're parallel and maximally "dissimilar" if they're
orthogonal (perpendicular). It should not escape the alert reader's attention that
this is analogous to cosine, which is unity (maximum value) when the segments
subtend a zero angle and zero (uncorrelated) when the segments are perpendicular.
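As a quick numerical illustration (the two vectors below are made up), both measures can be computed directly with NumPy:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.0, 5.0])

# Euclidean distance: square root of the summed squared differences
d_euclid = np.sqrt(np.sum((x - y) ** 2))

# Cosine similarity: cosine of the angle between the two vectors
cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(d_euclid, cos_sim)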

4. Hierarchical Agglomerative Methods


Within this approach to cluster analysis there are a number of different
methods used to determine which clusters should be joined at each stage. The main
methods are summarised below.
Nearest Neighbour Method (Single Linkage Method) – in this method
the distance between two clusters is defined to be the distance between the two
closest members, or neighbours. This method is relatively simple but is often
criticised because it doesn’t take account of cluster structure and can result in a
problem called chaining whereby clusters end up being long and straggly. However,
it is better than the other methods when the natural clusters are not spherical or
elliptical in shape.
Furthest Neighbour Method (Complete Linkage Method) – in this case
the distance between two clusters is defined to be the maximum distance between
members — i.e. the distance between the two subjects that are furthest apart. This
method tends to produce compact clusters of similar size but, as for the nearest
neighbour method, does not take account of cluster structure. It is also quite
sensitive to outliers.
Average (between groups) Linkage Method – the distance between two
clusters is calculated as the average distance between all pairs of subjects in the
two clusters. This is considered to be a fairly robust method.
Centroid Method – here the centroid (mean value for each variable) of each cluster is calculated and the distance between centroids is used. Clusters whose centroids are closest together are merged. This method is also fairly robust.
Ward’s Method – in this method all possible pairs of clusters are combined
and the sum of the squared distances within each cluster is calculated. This is then
summed over all clusters. The combination that gives the lowest sum of squares is
chosen. This method tends to produce clusters of approximately equal size, which is
not always desirable. It is also quite sensitive to outliers. Despite this, it is one of
the most popular methods, along with the average linkage method. It is generally a


good idea to try two or three of the above methods. If the methods agree reasonably
well then the results will be that much more believable.

Selecting the optimum number of clusters


As stated above, once the cluster analysis has been carried out it is then necessary to select the 'best' cluster solution. There are a number of ways in which this can be done, some rather informal and subjective, and some more formal. The more formal methods will not be discussed here. Below, one of the informal methods is briefly described.

Figure 36: Dendrogram

When carrying out a hierarchical cluster analysis, the process can be represented on a diagram known as a dendrogram. This diagram illustrates which clusters have been joined at each stage of the analysis and the distance between clusters at the time of joining. If there is a large jump in the distance between clusters from one stage to another then this suggests that at one stage clusters that
are relatively close together were joined whereas, at the following stage, the clusters
that were joined were relatively far apart. This implies that the optimum number of
clusters may be the number present just before that large jump in distance. This is
easier to understand by actually looking at a dendrogram.
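The sketch below (simulated data, with Ward linkage chosen only as an example) shows how such a dendrogram can be produced with SciPy; the other linkage methods described above can be substituted via the method argument.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)),      # two simulated groups
               rng.normal(5, 1, (20, 2))])

Z = linkage(X, method="ward")                  # try "single", "complete", "average"
dendrogram(Z)
plt.title("Dendrogram: look for the largest jump in merge distance")
plt.show()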

5. Non-hierarchical or k-means clustering methods


In these methods the desired number of clusters is specified in advance and
the ’best’ solution is chosen. The steps in such a method are as follows:
1. Choose initial cluster centres (essentially this is a set of observations that are
far apart — each subject forms a cluster of one and its centre is the value of
the variables for that subject).
2. Assign each subject to its ’nearest’ cluster, defined in terms of the distance to
the centroid.
3. Find the centroids of the clusters that have been formed
4. Re-calculate the distance from each subject to each centroid and move
observations that are not in the cluster that they are closest to.
5. Continue until the centroids remain relatively stable.
Non-hierarchical cluster analysis tends to be used when large data sets are
involved. It is sometimes preferred because it allows subjects to move from one
cluster to another (this is not possible in hierarchical cluster analysis where a
subject, once assigned, cannot move to a different cluster). Two disadvantages of
non-hierarchical cluster analysis are:
 it is often difficult to know how many clusters you are likely to have and
therefore the analysis may have to be repeated several times and
 it can be very sensitive to the choice of initial cluster centres. Again, it may
be worth trying different ones to see what impact this has.
One possible strategy to adopt is to use a hierarchical approach initially to
determine how many clusters there are in the data and then to use the cluster
centres obtained from this as initial cluster centres in the non-hierarchical method.
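A minimal k-means sketch with scikit-learn is shown below (simulated data); in line with the strategy just described, the choice of two clusters would normally come from a preliminary hierarchical analysis.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(5, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

print(km.cluster_centers_)     # final centroids
print(labels[:10])             # cluster membership of the first 10 subjects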



CHAPTER 9: Linear Regression
1. Linear Regression
In a cause and effect relationship, the independent variable is the cause,
and the dependent variable is the effect. Least squares linear regression is a
method for predicting the value of a dependent variable Y, based on the value of an
independent variable X.
Yi = (β0 + β1Xi) + εi
Here Yi is the outcome that we want to predict and Xi is the i-th score on the predictor variable. The intercept β0 and the slope β1 are the parameters in the model and are known as regression coefficients.

Figure 37: Population Regression Line
There is a residual term εi which represents the difference between the score
predicted by the line and the i-th score in reality of the dependent variable. This
term is the proof of the fact that our model will not fit perfectly the data collected.
With regression we strive to find the line that best describes the data.


Ingredients:
 Y is a continuous response variable (dependent variable).
 X is an explanatory or predictor variable (independent variable).
 Y is the variable we’re mainly interested in understanding, and we want to
make use of x for explanatory reasons.

Assumptions of Simple Linear Regression


● A unilinear relationship between an independent and a dependent variable can be represented by a linear regression.
● The independent variable must be non-stochastic in nature, i.e. the variable does not have any distribution associated with it.
● The model must be linear in parameters not necessarily in variables.
● The independent variable should not be correlated with the error term.
● The error terms must be independent of each other, i.e. occurrence of one
error term should not influence the occurrence of other error terms.

Correlation vs Regression

Correlation: Correlation examines the relationship between two variables using a standardized unit. However, most applications use raw units as an input.
Regression: Regression examines the relationship between one dependent variable and one or more independent variables. Calculations may use either raw unit values or standardized units as input.

Correlation: The calculation is symmetrical, meaning that the order of comparison does NOT change the result.
Regression: The calculation is NOT symmetrical, so one variable is assigned the dependent role (the values being predicted) and one or more the independent role (the values hypothesized to impact the dependent variable).

Correlation: Correlation coefficients indicate the strength of a relationship.
Regression: Regression shows the effect of a one unit change in an independent variable on the dependent variable.

Correlation: Correlation removes the effect of different measurement scales. Therefore, comparison between different models is possible since the rho coefficient is in standardized units.
Regression: Linear regression using raw unit measurement scales can be used to predict outcomes. For example, if a model shows that spending more money on advertising will increase sales, then one can say that for every added $ in advertising our sales will increase by β.


2. The Least Squares Regression Line


Linear regression finds the straight line, called the least squares
regression line or LSRL, that best represents observations in a bivariate data set.
Suppose Y is a dependent variable, and X is an independent variable. The
population regression line is:
Y = Β0 + Β1X
where Β0 is a constant, Β1 is the regression coefficient, X is the value of the
independent variable, and Y is the value of the dependent variable.
Given a random sample of observations, the population regression line is estimated
by:
y = b0 + b1x + e
where b0 is a constant, b1 is the regression coefficient, x is the value of the independent variable, ŷ = b0 + b1x is the predicted value of the dependent variable, and e is the error (residual) term.

3. The Method of Least Square


Ordinary Least Square technique is a technique of estimating the regression
parameters such that the sum squares of error are minimised. But why are we
minimising the sum squares of errors and not the sum of the errors? The reason is
that the sum of errors may add up to zero but the actual error might be very high.
By squaring the errors, we can get rid of the signs.


For example, consider Yi = a + b.Xi, where a* denotes the estimated intercept for 'a' and b* the estimated slope coefficient for 'b'. The estimated regression equation will be: Yi* = a* + b*.Xi.
It is important to validate that a* represents the intercept in the population. If 'a' differs from 'a*', there must be sufficient evidence to show that the difference is just an observable difference and not a significant one, i.e. that the difference observed is merely a result of sampling fluctuation.
In other words, the estimated intercept (a*) is the value of Y when X = 0. So, intuitively, what does the intercept capture? It captures those factors, apart from X, which can influence the variable Y; it captures the average behaviour of Y which is not captured by X. In the hypothesis testing framework we test:
H0: a = a*
v/s
H1: a <> a*
and a* is taken to represent the population intercept if H0 is accepted at the specified level of significance. If the average of Y is Y-bar and the average of X is X-bar, then a* = (Y-bar) - b*(X-bar). Similarly, we can also state this for b*.
The estimated slope (b*): when the total variation in X is Var(X), the total variation in Y with respect to X is Cov(X, Y); when the total variation in X is scaled to 1, the variation in Y with respect to X is Cov(X, Y)/Var(X) = b*. Therefore, the estimated parameters are:
a* = (Y-bar) - b*(X-bar)
b* = Cov(X, Y)/Var(X)

Using the OLS estimates we can define the Best-fit Regression Line:
Best fit regression line is the regression equation which best represents the data at
hand. So for the best fit regression line the goodness of fit measures (R2, Adjusted
R2 and F-statistic) will be the highest among any other models constructed and the
error terms will be white noise. So the equation of the best fit regression line will
be:
Yi* = a* + b* Xi = (Y-bar)-b*(X-bar) + b* Xi = (Y-bar) + b*(Xi-(X-bar))
This is the best fit Regression Line for a single variable regression model.
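The estimators can be checked numerically; the short sketch below (simulated data, not from the text) computes b* = Cov(X, Y)/Var(X) and a* = Y-bar - b*·X-bar directly.

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(10, 2, 100)
y = 3.0 + 1.5 * x + rng.normal(0, 1, 100)    # true intercept 3, slope 1.5

b_star = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
a_star = y.mean() - b_star * x.mean()

print(a_star, b_star)                         # should be close to 3 and 1.5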


4. Goodness of Fit of the Model


R2 : The coefficient of determination (denoted by R2) is a key output of
regression analysis. It is interpreted as the proportion of the variance in the
dependent variable that is predictable from the independent variable.
 The coefficient of determination ranges from 0 to 1.
 An R2 of 0 means that the dependent variable cannot be predicted from the
independent variable.
 An R2 of 1 means the dependent variable can be predicted without error from
the independent variable.
 An R2 between 0 and 1 indicates the extent to which the dependent variable
is predictable. An R2 of 0.10 means that 10 percent of the variance in Y is
predictable from X; an R2 of 0.20 means that 20 percent is predictable; and so
on.
The formula for computing the coefficient of determination for a linear regression
model with one independent variable is given below.

Coefficient of Determination
The coefficient of determination (R2) for a linear regression model with one
independent variable is:
R^2 = \left( \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N \, \sigma_x \, \sigma_y} \right)^2

where N is the number of observations used to fit the model, Σ is the summation symbol, x_i is the x value for observation i, \bar{x} is the mean x value, y_i is the y value for observation i, \bar{y} is the mean y value, \sigma_x is the standard deviation of x, and \sigma_y is the standard deviation of y.


If you know the linear correlation (r) between two variables, then the
coefficient of determination (R2) is easily computed using the following formula:
R2 = r2.

Adjusted R2
The adjusted R-squared is a modified version of R-squared that has
been adjusted for the number of predictors in the model. The adjusted R-
squared increases only if the new term improves the model more than would be
expected by chance. It decreases when a predictor improves the model by less than
expected by chance.

Example:
A fund's performance is modelled with 5 predictors on a sample of 50 observations, and the model has a sample R-squared of 0.5. Find the adjusted R-squared.
Given,
Sample size n = 50
Number of predictors k = 5
Sample R-square = 0.5
To Find,
Adjusted R-square value


Solution:
Substitute the values in the formula R²adjusted = 1 - (1 - R²)(n - 1) / (n - k - 1):

R²adjusted = 1 - (1 - 0.5)(50 - 1) / (50 - 5 - 1)

= 1 - 0.5 × (49 / 44)

= 1 - 0.5568

= 0.4432
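The same calculation as a tiny Python helper (a sketch of the formula above, using the example's values):

def adjusted_r2(r2, n, k):
    # Adjusted R-squared = 1 - (1 - R^2)(n - 1) / (n - k - 1)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(adjusted_r2(0.5, 50, 5), 4))   # about 0.4432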

Figure 38: Graphically explaining the Goodness of Fit

Multiple Linear Regression Model


A statistical technique that uses several explanatory variables to predict the
outcome of a response variable. The goal of multiple linear regression (MLR) is to
model the relationship between the explanatory and response variables.
The model for MLR, given n observations, is:
yi = B0 + B1xi1 + B2xi2 + ... + Bpxip + Ei
where i = 1,2, ..., n
MLR is often used to determine how specific factors, such as the price of a commodity, interest rates, and particular industries or sectors, influence the price
movement of an asset. For example, the current price of oil, lending rates, and the
price movement of oil futures, can all have an effect on the price of an oil company's


stock price. MLR could be used to model the impact that each of these variables has
on stock's price.

Multiple Linear Regression: Assumptions


 Linearity in parameters
 Random Sampling from the population
 No perfect Collinearity
 Zero Conditional Mean
 Homoskedasticity
Firstly, linear regression needs the relationship between the independent
and dependent variables to be linear. It is also important to check for outliers
since linear regression is sensitive to outlier effects. The linearity assumption can
best be tested with scatter plots, the following two examples depict two cases, where
no and little linearity is present.
Figure 39: Scatter Plot

Secondly, the linear regression analysis requires all variables to be multivariate normal. This assumption can best be checked with a histogram and a fitted normal curve or a Q-Q plot. Normality can be checked with a goodness of fit test, e.g. the Kolmogorov-Smirnov test. When the data is not normally distributed, a non-linear transformation (e.g. a log-transformation) might fix this issue; however, it can introduce effects of multicollinearity.

Figure 40: Frequency Polygon and Q-Q Plot

Thirdly, linear regression assumes that there is little or no multicollinearity in the data.

Multicollinearity occurs when the independent variables are not independent from each other. A second important independence assumption is that the error terms have to be independent from the independent variables.


Multicollinearity might be tested with 4 central criteria


Correlation matrix – when computing the matrix of Pearson's
Bivariate Correlation among all independent variables the correlation coefficients
need to be smaller than 1.
Tolerance – the tolerance measures the influence of one independent
variable on all other independent variables; the tolerance is calculated with an
initial linear regression analysis. Tolerance is defined as T = 1 – R² for these first
step regression analysis. With T < 0.1 there might be multicollinearity in the data
and with T < 0.01 there certainly is.
Variance Inflation Factor (VIF) – the variance inflation factor of
the linear regression is defined as VIF = 1/T. Similarly with VIF > 10 there is an
indication for multicollinearity to be present; with VIF > 100 there is certainly
multicollinearity in the sample.
Condition Index – the condition index is calculated using a factor
analysis on the independent variables. Values of 10-30 indicate a mediocre
multicollinearity in the linear regression variables, values > 30 indicate strong
multicollinearity.
If multicollinearity is found in the data, centering the data (that is, deducting the mean score) might help to solve the problem. Another alternative to tackle the problem is conducting a factor analysis and rotating the factors to ensure independence of the factors in the linear regression analysis.
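A short sketch of the VIF check with statsmodels is shown below; the three predictors are simulated, with x2 deliberately made collinear with x1, and an intercept column is added because variance_inflation_factor expects the full design matrix.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.5, size=200)   # deliberately correlated with x1
x3 = rng.normal(size=200)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for i in range(1, X.shape[1]):                    # skip the constant column
    print(f"VIF for x{i}: {variance_inflation_factor(X, i):.2f}")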
Fourthly, linear regression analysis requires that there is little or no autocorrelation in the data. Autocorrelation occurs when the residuals are not independent from each other, in other words when the value of y(x+1) is not independent from the value of y(x). This, for instance, typically occurs in stock prices, where the price is not independent from the previous price.
While a scatterplot allows you to check for autocorrelations, you can test the
linear regression model for autocorrelation with the Durbin-Watson test. Durbin-
Watson d tests the null hypothesis that the residuals are not linearly
autocorrelated. While d can assume values between 0 and 4, values around 2
indicate no autocorrelation. As a rule of thumb, values of 1.5 < d < 2.5 show that there is no autocorrelation in the data; however, the Durbin-Watson test only analyses linear autocorrelation and only between direct neighbors, which are first order effects.

Figure 41: Autocorrelation Scale
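A minimal sketch of the Durbin-Watson check on the residuals of a fitted regression, using statsmodels (simulated data; values near 2 suggest no autocorrelation):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(5)
x = rng.normal(size=100)
y = 2.0 + 0.5 * x + rng.normal(size=100)

model = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(model.resid))          # close to 2 means little autocorrelation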
The last assumption the linear regression analysis makes is homoscedasticity. A scatter plot of the residuals is a good way to check whether homoscedasticity (that is, equal variance of the error terms along the regression) holds. If the data is heteroscedastic, the scatter plots look like the following examples:

Figure 42: Showing Homoskedasticity Figure 43: Showing Heteroskedasticity

The Goldfeld-Quandt Test can test for heteroskedasticity.


The test splits the data into high and low values to see if the two samples are significantly different. If heteroskedasticity is present, a non-linear correction might fix the problem.

Figure 44: Heteroskedastic Plotting



CHAPTER 10: Logistic Regression
1. Logistic Regression
The binary logistic model is used to predict a binary response based on one or
more predictor variables (features). That is, it is used in estimating the parameters
of a qualitative response model. The probabilities describing the possible outcomes
of a single trial are modeled, as a function of the explanatory (predictor) variables,
using a logistic function. Frequently (and hereafter in this article) "logistic
regression" is used to refer specifically to the problem in which the dependent
variable is binary—that is, the number of available categories is two—while
problems with more than two categories are referred to as multinomial logistic
regression, or, if the multiple categories are ordered, as ordinal logistic regression.
Logistic regression measures the relationship between the categorical
dependent variable and one or more independent variables, which are usually (but
not necessarily) continuous, by estimating probabilities. Thus, it treats the same set
of problems as does probit regression using similar techniques; the first assumes
a logistic function and the second a standard normal distribution function.

The purpose of logistic regression


The crucial limitation of linear regression is that it cannot deal with DV’s
that are dichotomous and categorical. Many interesting variables in the business
world are dichotomous: for example, consumers make a decision to buy or not buy, a
product may pass or fail quality control, there are good or poor credit risks, an
employee may be promoted or not. A range of regression techniques have been
developed for analysing data with categorical dependent variables, including logistic
regression and discriminant analysis (DA).
Logistic regression predicts the probability of an outcome that can only have
two values (i.e. a dichotomy).


The prediction is based on the use of one or several predictors (numerical and
categorical).
A linear regression is not appropriate for predicting the value of a binary
variable for two reasons:
● A linear regression will predict values outside the acceptable range (e.g.
predicting probabilities outside the range 0 to 1)
● Since the dichotomous experiments can only have one of two possible values for each experiment, the residuals will not be normally distributed about the predicted line.
On the other hand, a logistic regression produces a logistic curve, which is limited to values between 0 and 1. Logistic regression is similar to a linear regression, but the curve is constructed using the natural logarithm of the “odds” of the target variable, rather than the probability. Moreover, the predictors do not have to be normally distributed or have equal variance in each group.

Figure 45: Gompertz Curve

In the logistic regression the constant (b0) moves the curve left and right and
the slope (b1) defines the steepness of the curve. By simple transformation, the
logistic regression equation can be written in terms of an odds ratio.

Finally, taking the natural log of both sides, we can write the equation in
terms of log-odds (logit) which is a linear function of the predictors. The coefficient
(b1) is the amount the logit (log-odds) changes with a one unit change in x.
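As a minimal sketch (simulated data, not the text's example), a binary logistic regression can be fitted with statsmodels; the estimated parameters are b0 and b1 on the log-odds scale, and exponentiating b1 gives the corresponding odds ratio.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.normal(size=500)
p = 1 / (1 + np.exp(-(-0.5 + 1.2 * x)))        # true b0 = -0.5, b1 = 1.2
y = rng.binomial(1, p)

logit_model = sm.Logit(y, sm.add_constant(x)).fit(disp=False)
print(logit_model.params)                      # b0, b1 on the log-odds (logit) scale
print(np.exp(logit_model.params[1]))           # odds ratio per one-unit change in x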


As mentioned before, logistic regression can handle any number of numerical and/or categorical variables.

Why can’t we use OLS Technique?


Since the dependent variable is dichotomous we cannot predict a numerical value for it, so the usual least squares criterion of minimizing the error around a line of best fit is inappropriate. Instead, logistic regression employs binomial probability theory
in which there are only two values to predict: that probability (p) is 1 rather than 0,
i.e. the event/person belongs to one group rather than the other. Logistic regression
forms a best fitting equation or function using the maximum likelihood method,
which maximizes the probability of classifying the observed data into the
appropriate category given the regression coefficients.
Like ordinary regression, logistic regression provides a coefficient ‘b’, which
measures each IV’s partial contribution to variations in the DV. The goal is to
correctly predict the category of outcome for individual cases using the most
parsimonious model. To accomplish this goal, a model (i.e. an equation) is created
that includes all predictor variables that are useful in predicting the response
variable.

There are two main uses of logistic regression:


The first is the prediction of group membership. Since logistic regression
calculates the probability of success over the probability of failure, the results of the
analysis are in the form of an odds ratio.


Logistic regression also provides knowledge of the relationships and strengths among the variables (e.g. marrying the boss’s daughter puts you at a higher probability for job promotion than undertaking five hours unpaid overtime each week).

2. Odds Ratio

Odds: defined as (# of times something happens) / (# of times it does NOT happen).
Example – getting heads in 1 flip of a coin: odds = 1/1 = 1 (or 1:1).
Example – getting a 1 in a single roll of a die: odds = 1/5 = 0.2 (or 1:5).

Probability: defined as (# of times something happens) / (# of times it could happen).
Example – getting heads in 1 flip of a coin: probability = 1/2 = 0.5 (or 50%).
Example – getting a 1 in a single roll of a die: probability = 1/6 = 0.16 (or 16%).

Let's begin with probability. Probabilities range between 0 and 1. Let's say
that the probability of success is .8, thus
p = .8
Then the probability of failure is
q = 1 - p = .2
Odds are determined from probabilities and range between 0 and infinity. Odds are
defined as the ratio of the probability of success and the probability of failure. The
odds of success are
odds(success) = p/(1-p) or p/q = .8/.2 = 4,
that is, the odds of success are 4 to 1. The odds of failure would be
odds(failure) = q/p = .2/.8 = .25.
This looks a little strange but it is really saying that the odds of failure are 1
to 4. The odds of success and the odds of failure are just reciprocals of one another,
i.e., 1/4 = .25 and 1/.25 = 4. Next, we will add another variable to the equation so
that we can compute an odds ratio.


Example:
This example is adapted from Pedhazur (1997). Suppose that seven out of
10 males are admitted to an engineering school while three of 10 females are
admitted. The probabilities for admitting a male are,
p = 7/10 = .7 q = 1 - .7 = .3
If you are male, the probability of being admitted is 0.7 and the probability of not
being admitted is 0.3.
Here are the same probabilities for females,
p = 3/10 = .3 q = 1 - .3 = .7
If you are female it is just the opposite, the probability of being admitted is 0.3
and the probability of not being admitted is 0.7.
Now we can use the probabilities to compute the odds of admission for both males
and females,
odds(male) = .7/.3 = 2.33333
odds(female) = .3/.7 = .42857
Next, we compute the odds ratio for admission,
OR = 2.3333/.42857 = 5.44
Thus, for a male, the odds of being admitted are 5.44 times larger than the odds
for a female being admitted.
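The admissions example can be reproduced in a few lines of Python:

p_male, p_female = 7 / 10, 3 / 10

odds_male = p_male / (1 - p_male)          # 2.3333
odds_female = p_female / (1 - p_female)    # 0.42857

odds_ratio = odds_male / odds_female
print(round(odds_ratio, 2))                # about 5.44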

3. Model Fit and the Likelihood Function


Just as in linear regression, we are trying to find a best fitting line of sorts
but, because the values of Y can only range between 0 and 1, we cannot use the
least squares approach. The Maximum Likelihood (ML) method is used instead to find the function that will maximize our ability to predict the probability of Y based on what we know about X. In other words, ML finds the best values for the model's coefficients.
Likelihood just means probability. It always means probability under a
specified hypothesis.
In logistic regression, two hypotheses are of interest: the null hypothesis,
which is when all the coefficients in the regression equation take the value zero, and
the alternate hypothesis that the model with predictors currently under
consideration is accurate and differs significantly from the null of zero, i.e. gives
significantly better than the chance or random prediction level of the null
hypothesis.


We then work out the likelihood of observing the data we actually did observe
under each of these hypotheses. The result is usually a very small number, and to
make it easier to handle, the natural logarithm is used, producing a log likelihood
(LL). Probabilities are always less than one, so LL’s are always negative. Log
likelihood is the basis for tests of a logistic model.
The likelihood ratio test is based on the –2LL ratio. It is a test of the significance of the difference between the likelihood ratio (–2LL) for the researcher’s model with predictors (called the model chi-square) and the likelihood ratio for the baseline model with only a constant in it.

4. The Likelihood Ratio Test


This tests the difference between –2LL for the full model with predictors and –2LL for the null model with only a constant in it. Significance at the .05 level or lower means the researcher’s model with the predictors is significantly different from the one with the constant only (all ‘b’ coefficients being zero). It measures the improvement in fit that the explanatory variables make compared to the null model. Chi-square is used to assess the significance of this ratio.
When probability fails to reach the 5% significance level, we retain the null
hypothesis that knowing the independent variables (predictors) has no increased
effects (i.e. make no difference) in predicting the dependent.
Instead of using the deviance (–2LL) to judge the overall fit of a model,
however, another statistic is usually used that compares the fit of the model with
and without the predictor(s). This is similar to the change in R2 when another
variable has been added to the equation.
But here, we expect the deviance to decrease, because the degree of error in
prediction decreases as we add another variable. To do this, we compare the
deviance with just the intercept (–2LL null referring to –2LL of the constant-only
model) to the deviance when the new predictor or predictors have been added (–
2LLk referring to –2LL of the model that has k number of predictors). The
difference between these two deviance values is often referred to as G for goodness
of fit.


Testing of Individual Estimated Parameters


1. Wald Statistic
Alternatively, when assessing the contribution of individual predictors in a
given model, one may examine the significance of the Wald statistic. The Wald
statistic, analogous to the t-test in linear regression, is used to assess the
significance of coefficients. The Wald statistic is the ratio of the square of the
regression coefficient to the square of the standard error of the coefficient and is
asymptotically distributed as a chi-square distribution.

W_j = \frac{B_j^2}{SE_{B_j}^2}

2. Hosmer–Lemeshow Test
The Hosmer–Lemeshow test uses a test statistic that asymptotically follows a chi-square distribution to assess whether or not the observed event rates match the expected event rates in subgroups of the model population.
The Hosmer–Lemeshow test statistic is given by:

H = \sum_{g=1}^{G} \frac{(O_g - E_g)^2}{N_g \, \pi_g (1 - \pi_g)}

Here O_g, E_g, N_g, and \pi_g denote the observed events, expected events, observations, and predicted risk for the g-th risk decile group, and G is the number of groups. The test statistic asymptotically follows a \chi^2 distribution with G - 2 degrees of freedom. The number of risk groups may be adjusted depending on how many fitted risks are determined by the model. This helps to avoid singular decile groups.

Statistics Related to Log-likelihood


AIC (Akaike Information Criterion) = -2log L + 2(k + s),
k is the total number of response level minus 1 and s is the number of explanatory
variables.
SC (Schwarz Criterion) = -2log L + (k + s) log(Σj fj),
fj is the frequency of the j-th observation.


R2 for Logistic Regression


In logistic regression, there is no true R2 value as there is in OLS regression.
However, because deviance can be thought of as a measure of how poorly the model
fits (i.e., lack of fit between observed and predicted values), an analogy can be made
to sum of squares residual in ordinary least squares. The proportion of unaccounted
for variance that is reduced by adding variables to the model is the same as the
proportion of variance accounted for, or R2.

R^2_{logistic} = \frac{-2LL_{null} - (-2LL_k)}{-2LL_{null}}

R^2_{OLS} = \frac{SS_{total} - SS_{residual}}{SS_{total}} = \frac{SS_{regression}}{SS_{total}}
Where the null model is the logistic model with just the constant and the k
model contains all the predictors in the model.
In SPSS, there are two modified versions of this basic idea, one developed by
Cox & Snell and the other developed by Nagelkerke. The Cox and Snell R-square is
computed as follows:

Cox & Snell Pseudo-R2

R^2_{CS} = 1 - \left( \frac{L_{null}}{L_k} \right)^{2/n}

where L_null and L_k are the likelihoods of the constant-only model and of the model with k predictors, and n is the sample size.

Because this R-squared value cannot reach 1.0, Nagelkerke modified it. The
correction increases the Cox and Snell version to make 1.0 a possible value for R-
squared.

Nagelkerke Pseudo-R2

R^2_{N} = \frac{1 - \left( L_{null} / L_k \right)^{2/n}}{1 - L_{null}^{\,2/n}} = \frac{R^2_{CS}}{1 - L_{null}^{\,2/n}}
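A short sketch (simulated data) of these measures computed from the log-likelihoods of a statsmodels Logit fit, using the standard likelihood-based definitions:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.normal(size=300)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.3 + 0.8 * x))))

fit = sm.Logit(y, sm.add_constant(x)).fit(disp=False)
ll_k, ll_null, n = fit.llf, fit.llnull, len(y)

mcfadden   = 1 - ll_k / ll_null                           # matches R^2_logistic above
cox_snell  = 1 - np.exp((2 / n) * (ll_null - ll_k))       # 1 - (L_null / L_k)^(2/n)
nagelkerke = cox_snell / (1 - np.exp((2 / n) * ll_null))  # rescaled so 1.0 is reachable

print(mcfadden, cox_snell, nagelkerke)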

Concordant, Discordant and Tied Pairs


Let us consider the following table.


Observation 1 2 3 4

Outcome Success Failure Failure Failure

P(Y=Success) 0.67 0.33 0.67 0.90

In this table, we are working with unique observations. The model was developed for Y = Success, so it should show a high probability for the observations where the real outcome was Success and a low probability for the observations where the real outcome was Failure.

Consider the observations 1 and 2


Here the real outcomes are Success and Failure respectively, and the predicted probability of Success for observation 1 (0.67) is greater than that for observation 2 (0.33). Such a pair of observations is called a Concordant Pair. This is in contrast to the observations 1 and 4: here the predicted probability of Success is higher for the Failure observation (0.90) than for the Success observation (0.67), even though the data was modelled for P(Y = Success). Such a pair is called a Discordant Pair. Now consider the pair 1 and 3. The probability values are equal here, although we have opposite outcomes. This type of pair is called a Tied Pair. For a good model, we would expect the number of concordant pairs to be fairly high.

Related Measures
Let nc and nd be the number of concordant and discordant pairs, and let t be the total number of pairs (each event observation paired with each non-event observation) in a dataset of N observations. Then (t - nc - nd) is the number of tied pairs.
c = (nc + 0.5(t - nc - nd)) / t
Somers' D = (nc - nd) / t
Goodman-Kruskal Gamma = (nc - nd) / (nc + nd)
Kendall's Tau-a = (nc - nd) / (0.5 N (N - 1))
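A brute-force sketch that counts the pairs for the small table above and evaluates two of the related measures:

outcomes = [1, 0, 0, 0]               # observation 1 is Success, 2-4 are Failure
probs    = [0.67, 0.33, 0.67, 0.90]   # model's P(Y = Success)

nc = nd = ties = 0
for i in range(len(outcomes)):
    for j in range(len(outcomes)):
        if outcomes[i] == 1 and outcomes[j] == 0:   # pair one event with one non-event
            if probs[i] > probs[j]:
                nc += 1                             # concordant
            elif probs[i] < probs[j]:
                nd += 1                             # discordant
            else:
                ties += 1                           # tied

t = nc + nd + ties                                  # total number of pairs
print(nc, nd, ties)                                 # 1 concordant, 1 discordant, 1 tied
print("c =", (nc + 0.5 * ties) / t)                 # 0.5
print("Somers' D =", (nc - nd) / t)                 # 0.0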


In the ideal case, all the Yes events would have very high probabilities and the No events very low probabilities, as shown in the left chart. But the reality is usually more like the right chart: we have some Yes events with very low probability and some No events with very high probability.

Figure 46: Showing the concentration of values

The following table can help with the cut-off probability

Will the customer be a non-defaulter of the loan?

Event: Yes is the event.

Non Event: The opposite of Event. In the previous example, No is the Non-Event.

Correct Event: For a probability level, the prediction is an event and the observed outcome is also an event.

Correct Non Event: For a probability level, the prediction is a non-event and the observed outcome is also a non-event.

Incorrect Event: For a probability level, the prediction is an event but the observed outcome is a non-event.

Incorrect Non Event: For a probability level, the prediction is a non-event but the observed outcome is an event.

Correct: Percentage of correct predictions out of total predictions.

Sensitivity: Measures the ability to predict an event correctly, calculated as (Correctly predicted as events / Total number of observed events) * 100.

Specificity: Measures the ability to predict a non-event correctly, calculated as (Correctly predicted as non-events / Total number of observed non-events) * 100.

False Positive: (Incorrectly predicted as event / Total predictions as Event) * 100.

False Negative: (Incorrectly predicted as non-event / Total predictions as Non-Event) * 100.

Receiver Operating Characteristic Curves (ROC)


In statistics, a receiver operating characteristic (ROC), or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate against the false positive rate at various threshold settings. (The true-positive rate is also known as sensitivity in biomedical informatics, or recall in machine learning. The false-positive rate is also known as the fall-out and can be calculated as 1 - specificity.)

Figure 47: ROC Curve
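A minimal sketch of producing a ROC curve with scikit-learn (simulated data; the classifier and the grid of thresholds are handled by roc_curve):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(8)
x = rng.normal(size=(500, 1))
y = rng.binomial(1, 1 / (1 + np.exp(-(0.2 + 1.5 * x[:, 0]))))

probs = LogisticRegression().fit(x, y).predict_proba(x)[:, 1]
fpr, tpr, _ = roc_curve(y, probs)

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")        # chance (random prediction) line
plt.xlabel("False positive rate (1 - specificity)")
plt.ylabel("True positive rate (sensitivity)")
plt.title(f"ROC curve, AUC = {roc_auc_score(y, probs):.2f}")
plt.show()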



CHAPTER 11: Time Series Analysis
1. Definition of Time Series: An ordered sequence of values of a variable
at equally spaced time intervals.

Applications: The usage of time series models is twofold:


● Obtain an understanding of the underlying forces and structure that
produced the observed data
● Fit a model and proceed to forecasting, monitoring or even feedback and feed
forward control.

Time Series Analysis is used for many applications such as:


● Economic Forecasting
● Sales Forecasting
● Budgetary Analysis
● Stock Market Analysis
● Yield Projections
● Process and Quality Control
● Inventory Studies
● Workload Projections
● Utility Studies
● Census Analysis

Figure 48: Trend Line

2. Components of a Time Series


There are four components to a time series: the trend, the cyclical variation,
the seasonal variation, and the irregular variation.


Secular Trend
The trend is the long term pattern of a time series. A trend can be positive or
negative depending on whether the time series exhibits an increasing long term
pattern or a decreasing long term pattern. If a time series does not show an
increasing or decreasing pattern then the series is stationary in the mean. For
example, population increases over a period of time, price increases over a period of
years, production of goods of the country increases over a period of years. These are
the examples of upward trend. The sales of a commodity may decrease over a period
of time because of better products coming to the market. This is an example of
declining trend or downward trend.

Cyclical Variation
The second component of a time series is cyclical variation. A typical business
cycle consists of a period of prosperity followed by periods of recession, depression,
and then recovery with no fixed duration of the cycle. There are sizable fluctuations
unfolding over more than one year in time above and below the secular trend. In a
recession, for example, employment, production and many other business and
economic series are below the long-term trend lines. Conversely, in periods of
prosperity they are above their long-term trend lines.

Seasonal Variation
The third component of a time series is the seasonal component. Many sales,
production, and other series fluctuate with the seasons. The unit of time reported is
either quarterly or monthly.

Patterns of change in a time series within a year


These patterns tend to repeat themselves each year. Almost all businesses
tend to have recurring seasonal patterns. Men’s and boys’ clothing, for example,
have extremely high sales just prior to Christmas and relatively low sales just after
Christmas and during the summer. Toy sales is another example with an extreme
seasonal pattern. More than half of the business for the year is usually done in the
months of November and December. Many businesses try to even out the seasonal
effects by engaging in an offsetting seasonal business. At ski resorts throughout the
country, you will often find golf courses nearby. The owners of the lodges try to rent
to skiers in the winter and golfers in the summer. This is an effective method of


spreading their fixed costs over the entire year rather than a few months. Consider, for example, the quarterly sales, in millions of dollars, of Hercher Sporting Goods, Inc.
They are a sporting goods company that specializes in selling baseball and
softball equipment to high schools, colleges, and youth leagues. They also have
several retail outlets in some of the larger shopping malls. There is a distinct
seasonal pattern to their business. Most of their sales are in the first and second
quarters of the year, when schools and organizations are purchasing equipment for
the upcoming season. During the early summer, they keep busy by selling
replacement equipment. They do some business during the holidays (fourth
quarter). The late summer (third quarter) is their slow season.

Irregular Variation or Random Component


Many analysts prefer to subdivide the irregular variation into episodic and
residual variations. Episodic fluctuations are unpredictable, but they can be
identified. The initial impact on the economy of a major strike or a war can be
identified, but a strike or war cannot be predicted. After the episodic fluctuations
have been removed, the remaining variation is called the residual variation. The
residual fluctuations, often called chance fluctuations, are unpredictable, and they
cannot be identified. Of course, neither episodic nor residual variation can be
projected into the future.
Since economic cycles are very hard to predict, most time series pattern are
described in terms of trend and seasonality. The irregular or the random events can
be smoothed out by using Simple, Weighted, or Exponential Moving Averages.

Formula: the simple moving average of N periods is SMA_t = (Y_t + Y_{t-1} + ... + Y_{t-N+1}) / N; a weighted moving average replaces the equal weights 1/N with weights that sum to one.


3. Exponential Moving Average of N Periods


For the Exponential Moving Average, a small α indicates that we are giving less emphasis to recent periods and more to previous periods; as a result, we get a slower-moving average.
The EMA for a series Y may be calculated recursively:
S1 = Y1
St = α·Yt + (1 - α)·St-1, for t > 1

Where:
● The coefficient α represents the degree of weighting decrease, a constant
smoothing factor between 0 and 1. A higher α discounts older observations
faster.
● Yt is the value at a time period t.
● St is the value of the EMA at any time period t.
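The recursion above can be reproduced in a few lines; the sketch below (illustrative series and alpha) also shows the equivalent pandas call, where adjust=False selects exactly this recursive form.

import pandas as pd

y = pd.Series([10.0, 12.0, 11.0, 13.0, 14.0, 13.5])
alpha = 0.3

# Manual recursion: S1 = Y1, St = alpha*Yt + (1 - alpha)*S(t-1)
ema = [y.iloc[0]]
for value in y.iloc[1:]:
    ema.append(alpha * value + (1 - alpha) * ema[-1])

print([round(v, 3) for v in ema])
print(y.ewm(alpha=alpha, adjust=False).mean().round(3).tolist())   # same values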

Figure 49: Exponential Smoothing

4. How do we make predictions?


The overall idea is that we extract a trend part, adjust the trend for seasonal
component, and make forecast. Now there can be two variations:
Y = T + C + S + e, Or Y = T * C * S * e
Where T = Trend Component
C = Cyclical Component
S = Seasonal Component


and e is the random part


These two variations are respectively known as Additive and Multiplicative Models.
Various Trends

Figure 50: Constant Trend

Figure 51: Linear Trend

Figure 52: Quadratic Trend


You can use the following forecasting methods. For each of these methods,
you can specify linear, quadratic, or no trend.
The stepwise autoregressive method is used by default. This method
combines time trend regression with an autoregressive model and uses a stepwise
method to select the lags to use for the autoregressive process.
The exponential smoothing method produces a time trend forecast, but in
fitting the trend, the parameters are allowed to change gradually over time, and
earlier observations are given exponentially declining weights. Single, double, and
triple exponential smoothing are supported, depending on whether no trend, linear
trend, or quadratic trend is specified. Holt two-parameter linear exponential
smoothing is supported as a special case of the Holt-Winters method without
seasons.
The Winters method (also called Holt-Winters) combines a time trend with
multiplicative seasonal factors to account for regular seasonal fluctuations in a
series. Like the exponential smoothing method, the Winters method allows the
parameters to change gradually over time, with earlier observations given
exponentially declining weights. You can also specify the additive version of the
Winters method, which uses additive instead of multiplicative seasonal factors.
When seasonal factors are omitted, the Winters method reduces to the Holt two-
parameter version of double exponential smoothing.

5. Stochastic Processes
A random or stochastic process is a collection of random variables ordered in
time. An example of the continuous stochastic process is an electrocardiogram and
an example of the discrete stochastic process is GDP.
The dynamic phenomena that we observe in a time series can be grouped into
two classes:
The first are those that take stable values in time around a constant level,
without showing a long term increasing or decreasing trend. For example, yearly
rainfall in a region, average yearly temperatures or the proportion of births
corresponding to males. These processes are called stationary.
A second class of processes are the non-stationary processes, which are
those that can show trend, seasonality and other evolutionary effects over time. For
example, the yearly income of a country, company sales or energy demand is series
that evolve over time with more or less stable trends.


If the time series is non-stationary, then each set of time set data will have
its own characteristics. So we cannot generalize the behaviour of one set to other
sets.

Figure 53: Non- Stationary Process: Variance is Changing

Figure 54: Non-Stationary Process: Mean is Changing

6. Random Walk Model


In each time period, going from left to right, the value of the variable takes
an independent random step up or down, a so-called random walk. If up and down
movements are equally likely at each intersection, then every possible left-to-right
path through the grid is equally likely a priori. A commonly-used analogy is that of
a drunkard who staggers randomly to the left or right as he tries to go forward: the
path he traces will be a random walk.


A simple random walk model


A random walk is defined as a process where the current value of a variable
is composed of the past value plus an error term defined as a white noise (a normal
variable with zero mean and variance one). Algebraically a random walk is
represented as follows:
Xt = Xt-1 + et
The implication of a process of this type is that the best prediction of X for the next period is the current value; in other words, the process does not allow us to predict the change. That is, the change in X is absolutely random.

Figure 55: Random Walk

A random walk model with drift


A drift acts like a trend, and the process has the following form:
Xt = Xt-1 + et +a
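Both forms are easy to simulate; the short sketch below (arbitrary drift of 0.1 and 300 simulated steps) draws one path of each.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(9)
e = rng.normal(size=300)              # white noise error terms

pure_walk = np.cumsum(e)              # Xt = Xt-1 + et
drift_walk = np.cumsum(e + 0.1)       # Xt = Xt-1 + et + a, with a = 0.1

plt.plot(pure_walk, label="random walk")
plt.plot(drift_walk, label="random walk with drift")
plt.legend()
plt.show()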
The distinction between stationary and non-stationary stochastic processes
(or time series) has a crucial bearing on whether the trend (the slow long-run
evolution of the time series under consideration) observed in the constructed time
series is deterministic or stochastic. Broadly speaking, if the trend in a time series
is completely predictable and not variable, we call it a deterministic trend, whereas
if it is not predictable, we call it a stochastic trend. To make the definition more
formal, consider the following model:
Xt = β1 + β2 t + β3 Xt-1 + ut
Case 1: Pure Random Walk
If β1=0, β2=0 and β3=1, we get, Xt= X(t-1)+ ut which is nothing but a RWM without
drift and is therefore non-stationary. Again, ΔXt= (Xt - X(t-1) )= ut which is
stationary. Hence, a RWM without drift is a difference stationary process.
Case 2: Random Walk with Drift
If β1≠0, β2=0 and β3=1, we get, Xt= β1+X(t-1)+ ut which is a RWM with drift and is
therefore non-stationary. Again, ΔXt= (Xt - X(t-1) )= β1+ ut which means that Xt
will exhibit a positive (β1>0)or negative (β1<0) trend. Such a trend is called a
Stochastic Trend. Again this is a difference stationary process as ΔXt is stationary.


Case 3: Deterministic Trend


If β1≠0, β2≠0 and β3=0, we get, Xt=β1+ β2 t+ ut which is called a Trend Stationary
Process. Although the mean of the process β1+ β2 t is not constant, its variance is.
Once the values of β1 and β2 are known, the mean can be forecasted perfectly.
Therefore, if we subtract the mean from Xt, the resulting series will be stationary,
hence the name trend stationary. This procedure of removing the (deterministic)
trend is called de-trending.
Case 4: Random Walk with Drift and Deterministic Trend
If β1≠0, β2≠0 and β3=1, we get, Xt=β1+ β2 t+Xt-1+ ut we have a random walk with
drift and a deterministic trend. ΔXt= (Xt - X(t-1) )=β1+ β2 t+ ut, which implies that
Xt is non–stationary.
Case 5: Deterministic Trend with Stationary Component
If β1≠0, β2≠0 and β3<1, we get, Xt=β1+ β2 t+β3 Xt-1+ ut which is stationary around
a deterministic trend.

How To test Stationarity


If the chart shows an upward trend, it suggests that the mean of the data is changing. This may suggest that the data is non-stationary. Such an intuitive feel is the starting point of more formal tests of stationarity. Other methods of checking stationarity are the Autocorrelation Function (or Correlogram) and the Unit Root Test.

Autocorrelation Function or Correlogram


Autocorrelation refers to the correlation of a time series with its own past and future values. The first-order autocorrelation coefficient is the simple correlation coefficient of the first N - 1 observations y1, y2, y3, ..., y(N-1) and the next N - 1 observations y2, y3, y4, ..., yN. Similarly, we can define higher order autocorrelation coefficients. So for different orders or lags, we will get different autocorrelation coefficients. As a result, we can define the autocorrelation coefficients as a function of the lag. This function is known as the Autocorrelation Function.

Figure 56: Correlogram


The graphical presentation of ACF is known as Correlogram. A rule of thumb is to
compute ACF up to one-third to one-quarter the length of the time series. The
statistical significance of any autocorrelation coefficient can be judged by its
standard error. Bartlett has shown that if a time series is purely random, that is, it
exhibits white noise, the sample autocorrelation coefficients follow a normal distribution with mean = 0 and variance = 1 / (sample size).

The Unit Root Test


As we learnt from the unit root stochastic process
Xt= ρX(t-1)+ut, where -1≤ρ≤1
Also we learnt that in the case of unit root, ρ=1.
For theoretical reasons, we convert the equation as follows:
Xt - X(t-1)=(ρ-1) X(t-1)+ ut
Or, ΔXt=δX(t-1)+ ut
Where δ=(ρ-1) and Δ is the first difference operator.

Testing
Null hypothesis is that δ=0, that is, ρ=1 which means non-stationarity in time-
series data
But the t value of the estimated coefficient of Xt−1 does not follow the t distribution
even in large samples.

7. Dickey–Fuller Test
The Dickey–Fuller Test tests whether a unit root is present in
an autoregressive model. It is named after the statisticians David Dickey
and Wayne Fuller, who developed the test in 1979. Dickey and Fuller have shown
that, under the null hypothesis, the estimated t value of the coefficient of Xt−1
follows the τ (Tau) Statistic. This test is known as Dickey – Fuller (DF) Test. In
conducting DF test, we assumed that the error terms ut are uncorrelated. But in
case the ut are correlated, Dickey and Fuller have developed a test, known as the
Augmented Dickey–Fuller (ADF) test.
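A minimal sketch of the (augmented) Dickey-Fuller test with statsmodels on a simulated random walk; a large p-value means the unit-root null cannot be rejected, i.e. the series is treated as non-stationary.

import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(10)
x = np.cumsum(rng.normal(size=500))         # random walk: contains a unit root

stat, pvalue, usedlag, nobs, crit, icbest = adfuller(x)
print(f"ADF statistic = {stat:.3f}, p-value = {pvalue:.3f}")
print(crit)                                  # 1%, 5% and 10% critical values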


8. Modelling Time Series:


If a time series is stationary we can model it in variety of ways:

Autoregressive Process (AR):


A stochastic process used in statistical calculations in which future values are
estimated based on a weighted sum of past values. An autoregressive process
operates under the premise that past values have an effect on current values. A
process considered AR(1) is the first order process, meaning that the current value
is based on the immediately preceding value. An AR(2) process has the current
value based on the previous two values.
The notation AR(p) indicates an autoregressive model of order p. The AR(p)
model is defined as
X_t = c + \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t

where \varphi_1, ..., \varphi_p are the parameters of the model, c is a constant, and \varepsilon_t is white noise.

Moving Average (MA) Process


In time series analysis, the moving-average (MA) model is a common
approach for modelling univariate time series. The notation MA(q) refers to the
moving average model of order q:

X_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}


where μ is the mean of the series, the θ1, ..., θq are the parameters of the model and
the εt, εt−1,..., εt−q are white noise error terms. The value of q is called the order of
the MA model.
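
Analogously, the short sketch below simulates an MA(1) process with an assumed θ1 = 0.7; its sample ACF should be clearly non-zero only at lag 1, which is the pattern used later to identify q.

```python
# Illustrative sketch: simulate an MA(1) process X_t = eps_t + 0.7*eps_(t-1)
# and inspect its autocorrelations, which should cut off after lag 1.
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.stattools import acf

np.random.seed(0)
ma1 = ArmaProcess(ar=np.array([1.0]), ma=np.array([1.0, 0.7]))   # assumed theta1 = 0.7
x = ma1.generate_sample(nsample=500)

print(np.round(acf(x, nlags=5), 3))        # only the lag-1 autocorrelation is clearly non-zero
```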

Autoregressive Integrated Moving Average (ARIMA) Process


In statistics and econometrics, and in particular in time series analysis,
an autoregressive integrated moving average (ARIMA) model is a
generalization of an autoregressive moving average (ARMA) model. These models
are fitted to time series data either to better understand the data or to predict
future points in the series (forecasting). They are applied in some cases where data show evidence of non-stationarity, where an initial differencing step (corresponding to the "integrated" part of the model) can be applied to reduce the non-stationarity.
So, given a time series, we first difference it d times and then apply an ARMA(p, q) model to the result; we then say that the original time series is ARIMA(p, d, q). Thus, an ARIMA(2, 1, 2) time series has to be differenced once (d = 1) before it becomes stationary, and the (first-differenced) stationary time series can be modelled as an ARMA(2, 2) process, that is, it has two AR and two MA terms. Of course, if d = 0 (i.e., the series is stationary to begin with), ARIMA(p, d = 0, q) = ARMA(p, q). Note that an ARIMA(p, 0, 0) process is a purely AR(p) stationary process and an ARIMA(0, 0, q) process is a purely MA(q) stationary process. Given the values of p, d, and q, one can tell what process is being modelled.
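
As a sketch of how such a model can be fitted in Python, the example below applies statsmodels' ARIMA class to a series y using the ARIMA(2, 1, 2) orders discussed above; the orders are taken from the text's example, not from a real identification exercise.

```python
# Hedged sketch: fit the ARIMA(2, 1, 2) model discussed above to a series `y`.
from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(y, order=(2, 1, 2))          # (p, d, q): two AR terms, one difference, two MA terms
result = model.fit()
print(result.summary())                    # parameter estimates, AIC/BIC, diagnostics
```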

9. The Box–Jenkins Methodology


The objective of B–J [Box–Jenkins] is to identify and estimate a statistical
model which can be interpreted as having generated the sample data. If this
estimated model is then to be used for forecasting we must assume that the features
of this model are constant through time, and particularly over future time periods.
Thus we must have either a stationary time series or a time series that becomes stationary after one or more rounds of differencing.
The method consists of four stages:
Identification: Find out the appropriate values of p, d, and q
Estimation: Having identified the appropriate p and q values, the next stage is to
estimate the parameters of the autoregressive and moving average terms included
in the model
Diagnostic Checking: Having chosen a particular ARIMA model, and having
estimated its parameters, we next see whether the chosen model fits the data
reasonably well.
Forecasting

Identification Stage
In this first stage, the researcher visually examines the time plot of the series, the autocorrelation function and the partial autocorrelation function. Plotting the time path of the {Yt} sequence provides useful information concerning outliers, missing values and structural breaks in the data. Non-stationary data have a pronounced trend, or they appear to meander without a constant mean or variance.


Missing values and outliers can be corrected at this point. There are many formal checks to identify whether the series in hand is stationary or not. Let us analyse some of these checks. The most widely used checks are:
● Dickey-Fuller tests
● Augmented Dickey-Fuller tests
● Monte Carlo experiments
● Tests for trends and structural breaks

These checks are discussed as follows:


● Dickey-Fuller tests:
Let us consider the model Yt = ρY(t-1) + εt, where εt is white noise. In the case of a random walk, ρ = 1, and OLS estimation of this equation produces an estimate of ρ that is biased toward zero. To test stationarity we use the Dickey-Fuller test statistics, defined as:
K(1) = T(ρ̂ − 1);  t(1) = (ρ̂ − 1) / S.E.(ρ̂)
These statistics do not have the standard normal t- and F-distributions; their critical values are tabulated in the Dickey-Fuller tables. The relevant time series is non-stationary under H0, and therefore the standard t-test is not applicable in this situation. One has to apply the Dickey-Fuller (DF) test in this context, where the hypotheses are
H0: ρ = 1;
H1: |ρ| < 1, i.e., the series is stationary with non-autocorrelated errors.
● Augmented Dickey-Fuller tests:
The augmented Dickey-Fuller test tests for a unit root in a time series sample. It is widely used in statistical research and econometrics, that is, in the application of mathematics, statistics, and computer science to economic data. The primary difference between the two tests is that the ADF accommodates a larger and more complicated set of time series models. The augmented Dickey-Fuller statistic used in the ADF test is a negative number, and the more negative it is, the stronger the rejection of the hypothesis that there is a unit root, at some chosen level of confidence. Conversely, if the ADF test statistic is positive, one can immediately decide not to reject the null hypothesis of a unit root. In one example, with three lags, a value of −3.17 constituted rejection at the p-value of 0.10.
For testing stationarity under
H0: ρ = 1;
H1: |ρ| < 1, i.e., the series is stationary but the errors are autocorrelated,
one has to apply the augmented Dickey-Fuller (ADF) test.
● Tests for structural breaks:
In performing unit root tests, special care must be taken to check whether a structural change has occurred. When there are structural breaks, the various Dickey-Fuller test statistics are biased towards non-rejection of a unit root. A structural break can be shown graphically:

Figure 57: Structural Break

The large simulated break is useful for illustrating the problem of using a Dickey-Fuller test. The straight line shown in the figure highlights the fact that the series appears to have a deterministic trend; in fact, the straight line is the best-fitting OLS equation:
Yt = α0 + α1t + εt

However, if we estimate the equation:
Yt = α0 + α1Y(t-1) + εt
the estimated value of α1 is necessarily biased towards unity. The upward bias arises because the estimated value of α1 captures the property that "low" values of Yt (i.e., those fluctuating around zero) are followed by other "low" values, and "high" values are followed by other "high" values. It can be observed that as α1 approaches unity, the time series approaches a random walk with drift. The solution of the random-walk-with-drift model involves a deterministic trend, i.e.,
Yt = Y0 + α0t + Σεi


So the misspecified equation will tend to mimic the trend line, biasing α1 towards unity. The bias in α1 means that the Dickey-Fuller test is biased towards accepting the null hypothesis of a unit root even though the series is stationary within each of the short intervals. We therefore need a formal procedure to test for unit roots in the presence of a structural change at time period t = τ. Consider the null hypothesis of a one-time jump in the level of a unit root process against the alternative of a one-time change in the intercept of a trend-stationary process. Formally, the hypotheses are:
H0: Yt = α0 + Y(t-1) + μ1Dp + εt

H1: Yt = α0 + α1t + μ2DL + εt

where Dp represents a pulse dummy variable such that Dp = 1 if t = τ + 1 and zero otherwise, and DL represents a level dummy variable such that DL = 1 if t > τ and zero otherwise. The alternative hypothesis posits that the {Yt} sequence is stationary around the broken trend line: up to t = τ, {Yt} is stationary around α0 + α1t, and beginning at τ + 1, Yt is stationary around α0 + α1t + μ2. As illustrated by the broken line, there is a one-time increase in the intercept of the trend if μ2 > 0. The econometric problem is to determine whether an observed series is better described by the null or by the alternative specification.
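
To illustrate how the two specifications could be set up in practice, the rough sketch below constructs the pulse and level dummies for an assumed break date τ and estimates both regressions by ordinary least squares; note that this is only a teaching sketch, since a proper Perron-style test requires non-standard critical values.

```python
# Illustrative sketch: pulse dummy Dp and level dummy DL for an assumed break
# date tau, with both specifications estimated by OLS. Inference on the unit
# root itself still requires non-standard (Perron) critical values.
import numpy as np
import statsmodels.api as sm

yv = np.asarray(y, dtype=float)            # `y` is an assumed series of length T
T = len(yv)
tau = 120                                  # hypothetical break period

t = np.arange(T)
Dp = (t == tau + 1).astype(float)          # pulse dummy: 1 only in the period after the break
DL = (t > tau).astype(float)               # level dummy: 1 for every period after the break

# Null: unit root with a one-time jump in level -> regress Y_t on Y_(t-1) and Dp
X0 = sm.add_constant(np.column_stack([yv[:-1], Dp[1:]]))
null_fit = sm.OLS(yv[1:], X0).fit()

# Alternative: trend-stationary series with an intercept shift -> regress Y_t on t and DL
X1 = sm.add_constant(np.column_stack([t, DL]))
alt_fit = sm.OLS(yv, X1).fit()

print(null_fit.params)
print(alt_fit.params)
```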

Estimation: (Model Selection)


One natural question to ask of any estimated model is: how well does it fit the data? Adding additional lags for p and q will necessarily reduce the sum of squares of the estimated residuals. However, adding such lags entails the estimation of additional coefficients and an associated loss of degrees of freedom. Moreover, the inclusion of extraneous coefficients will reduce the forecasting performance of the fitted model. There exist various model selection criteria that trade off a reduction in the sum of squared residuals against a more parsimonious model. The two most commonly used model selection criteria are:
● Akaike Information Criterion (AIC)
● Schwarz Bayesian Criterion (SBC)
Parsimony is a fundamental idea in the Box-Jenkins approach. Incorporating a greater number of variables will necessarily improve the fit of the model, at the cost of reducing the degrees of freedom. A parsimonious model fits the data well without incorporating any needless coefficients. The aim of time-series modelling is to approximate the true data-generating process.
● Akaike Information Criterion - this criterion is defined as


AIC = T ln(sum of squared residuals) + 2n

● Schwarz Bayesian Criterion - this criterion is defined as
SBC = T ln(sum of squared residuals) + n ln(T)
where
n = number of parameters estimated (p + q + a possible constant term)
T = number of usable observations.
Ideally, AIC and SBC will be as small as possible, and while comparing models we must keep the number of observations fixed for both. So model A is a better fit than model B if the AIC (or SBC) for model A is less than the AIC (or SBC) for model B, given the same number of observations. If an added regressor has no explanatory power, then adding it to the model will cause both AIC and SBC to rise: increasing the number of regressors increases n without reducing the sum of squared residuals.

NOTE: Researchers should be aware that if AIC and SBC select the
same model, they can be confident of their results. However, it is a
matter of caution if these measures select two different models.
The SBC has superior large-sample properties. Let (p*, q*) be the true order of the data-generating process, and suppose we use AIC and SBC to estimate all ARMA models of order (p, q) where p ≥ p* and q ≥ q*. Both AIC and SBC will select models of orders greater than or equal to (p*, q*) as the sample size approaches infinity, but the SBC is asymptotically consistent, whereas the AIC is biased towards selecting an over-parameterized model. In small samples, however, the AIC can work better than the SBC.
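
As a sketch of how a criterion-based comparison might look in practice, the example below fits a small grid of candidate ARMA(p, q) models to an assumed stationary series y and reports their AIC and BIC (the SBC) values; the grid of orders is illustrative.

```python
# Hedged sketch: compare candidate ARMA(p, q) models on AIC and BIC (the SBC).
# `y` is an assumed stationary series; the (p, q) grid is purely illustrative.
import itertools
from statsmodels.tsa.arima.model import ARIMA

results = []
for p, q in itertools.product(range(3), range(3)):      # p, q in {0, 1, 2}
    fit = ARIMA(y, order=(p, 0, q)).fit()                # d = 0: the series is assumed stationary
    results.append((p, q, fit.aic, fit.bic))

# Smaller AIC / BIC is better; the number of usable observations must be the
# same for every candidate for the comparison to be fair.
for p, q, aic, bic in sorted(results, key=lambda r: r[3]):
    print(f"ARMA({p},{q})  AIC={aic:9.2f}  BIC={bic:9.2f}")
```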

Diagnostic Checking and Goodness-Of-Fit


The usual goodness-of-fit measures, such as R2 and the average of the residual sum of squares, cannot be applied here, since these measures improve mechanically as more parameters are included in the model. Time series models require parsimony, and hence the AIC or SBC are suggested as goodness-of-fit measures. It must also be kept in mind that the estimates should converge rapidly; if the estimates fail to converge, the estimated parameters are unstable, and adding an additional observation or two can greatly alter them. It is therefore important to execute a diagnostic check of the parameter estimates. The following things are to be checked in the third stage of the Box-Jenkins approach:


● To check for outliers and for periods in which the model does not fit the data well
The standard practice here is to plot the residuals to look for outliers and for periods in which the model does not fit the data well. If all plausible ARMA models show evidence of a poor fit over a reasonably long portion of the sample, it is wise to consider alternative methods such as:
1. Intervention analysis
2. Transfer function analysis
3. Multivariate estimation techniques

● To check for serial correlation of the residuals
Any evidence of serial correlation implies a systematic movement in the {Yt} sequence that is not accounted for by the ARMA process, so any tentative model yielding non-random residuals should be eliminated from consideration. To check for correlation in the residuals, we construct the ACF and PACF of the residuals of the estimated model. We can use the following two techniques to test whether the residual autocorrelations are significant:
1. The sampling variance of the autocorrelation function
2. The Box-Pierce Q-statistic
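
A minimal sketch of such residual diagnostics in Python is given below; it applies the residual correlogram and the Ljung-Box refinement of the Box-Pierce Q-statistic to the residuals of a previously fitted model, here the hypothetical result object from the ARIMA sketch above.

```python
# Hedged sketch: check the residuals of a fitted model (`result`) for remaining
# serial correlation using the residual correlogram and the Ljung-Box Q-test.
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.stats.diagnostic import acorr_ljungbox

resid = result.resid

plot_acf(resid, lags=20)                     # spikes should stay inside the confidence band
plt.show()

print(acorr_ljungbox(resid, lags=[10, 20]))  # large p-values: no evidence of autocorrelation
```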

Forecasting
The most important use of an ARMA model is to forecast future values of the {Yt} sequence, assuming that the true data-generating process and the current and past realisations of {εt} and {Yt} are known. The forecast of the AR(1) model takes the form:
Y(t+1) = a0 + a1Yt + ε(t+1)

Given the coefficients a0 and a1, we can forecast Y(t+1) conditional on the information available at time t as:
EtY(t+1) = a0 + a1Yt
where EtY(t+j) denotes the conditional expectation of Y(t+j) given the information available at time t, that is,
EtY(t+j) = E(Y(t+j) | Yt, Y(t-1), Y(t-2), …, εt, ε(t-1), …)
Generalising the expression to a j-step-ahead forecast:
EtY(t+j) = a0(1 + a1 + a1^2 + a1^3 + … + a1^(j-1)) + a1^j Yt


The above equation is called the forecast function. If the series converges, i.e., |a1| < 1, then EtY(t+j) tends to a0/(1 − a1) as j grows; for any stationary ARMA model, the conditional-expectation forecast of Y(t+j) converges to the unconditional mean.
Forecasts will never be perfectly accurate, so every forecast carries an error. Let et(j) = Y(t+j) − EtY(t+j) denote the j-step-ahead forecast error. Its properties are:
● Et(et(j)) = 0
● Var(et(j)) = σ²(1 + a1^2 + a1^4 + … + a1^(2(j-1))). The variance of the forecast error is increasing in j, so we can be more confident in short-run forecasts than in long-run forecasts. As j → ∞, the forecast error variance converges to σ²/(1 − a1^2).
If the {εt} sequence is normally distributed, we can place confidence intervals around the forecasts. The one-step-ahead forecast of Y(t+1) is a0 + a1Yt and the one-step-ahead forecast error variance is σ². As such, the 95% confidence interval for the one-step-ahead forecast can be constructed as:
a0 + a1Yt ± 1.96σ
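
The sketch below, again assuming the hypothetical fitted result object from the earlier ARIMA example, obtains multi-step forecasts and approximate 95% confidence intervals; the intervals widen with the horizon precisely because the forecast error variance grows with j.

```python
# Hedged sketch: multi-step forecasts with 95% confidence intervals from a
# previously fitted model (`result`). The intervals widen as the horizon grows.
forecast = result.get_forecast(steps=12)   # 12-step horizon is illustrative
print(forecast.predicted_mean)             # point forecasts E_t(Y_(t+j))
print(forecast.conf_int(alpha=0.05))       # lower / upper 95% bounds
```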
To generalise the analysis to a higher-order model, say an ARMA(2, 1), we assume that:
● All coefficients are known
● All variables subscripted t, t-1, t-2, … are known at period t
● Etε(t+j) = 0 for j > 0
The conditional expectation of Y(t+1) is then: EtY(t+1) = a0 + a1Yt + a2Y(t-1) + β1εt
The one-step-ahead forecast error is the difference between Y(t+1) and EtY(t+1), i.e., between the actual and the forecast value. When the coefficients have to be estimated rather than known, constructing confidence intervals for the forecasts becomes considerably more difficult.

Forecast Evaluation
A common error is to think that the model with the best in-sample fit is the one that will forecast best. Consider the example of an ARMA(2, 1) process whose coefficients are known. The one-step-ahead forecast error is:
et(1) = Y(t+1) − a0 − a1Yt − a2Y(t-1) − β1εt = ε(t+1)
Since the forecast error is simply the white-noise innovation ε(t+1), no other ARMA model can provide superior forecasting performance when the true data-generating process is known.

Rule of Thumb: an ARIMA model should never be trusted if it is estimated with fewer than 50 observations.


There are some tests that need to be carried out to compare the forecast quality of the models being evaluated. They are:


Mean-square Prediction Error Test (MSPE)


The folded F-test is used to identify whether two MSPEs are identical. The test statistic equals unity if the forecast errors from the two models are identical, and it takes a very large value if the forecast errors from the first model are substantially larger than those from the second. Under the null hypothesis of equal forecasting performance, the statistic has a standard F distribution if it satisfies three assumptions:
1. The forecast errors have zero mean and are normally distributed
2. The forecast errors are serially uncorrelated
3. The forecast errors are contemporaneously uncorrelated with each other.
A major problem in practice is that even if the {εt} sequence is normally distributed, this does not imply that the forecast errors are normally distributed with a mean of zero. Similarly, the forecast errors may be serially correlated, which is especially true when multi-step forecasts are used. The forecast errors from the two models can also be correlated with each other, in which case the ratio of MSPEs does not have an F distribution. Two major tests are suggested to overcome the problem of contemporaneously correlated forecast errors:
1. Granger-Newbold test
2. Diebold-Mariano test
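
To illustrate the idea behind the Diebold-Mariano test, the sketch below computes the DM statistic by hand for two hypothetical forecast-error series under squared-error loss, using a simple Newey-West style long-run variance; it is an illustrative sketch rather than a full implementation.

```python
# Illustrative sketch of a Diebold-Mariano test under squared-error loss.
# `e1` and `e2` are assumed 1-D numpy arrays of forecast errors from two models.
import numpy as np
from scipy import stats

def diebold_mariano(e1, e2, h=1):
    """DM statistic for equal predictive accuracy (squared-error loss), h-step forecasts."""
    d = e1 ** 2 - e2 ** 2                  # loss differential
    n = len(d)
    d_bar = d.mean()

    # Newey-West style long-run variance of d using h - 1 autocovariances.
    lrv = np.var(d, ddof=0)
    for k in range(1, h):
        cov = np.cov(d[k:], d[:-k], ddof=0)[0, 1]
        lrv += 2.0 * cov

    dm = d_bar / np.sqrt(lrv / n)
    p_value = 2.0 * (1.0 - stats.norm.cdf(abs(dm)))   # asymptotically N(0, 1) under H0
    return dm, p_value

# Hypothetical usage:
# dm_stat, p = diebold_mariano(errors_model_a, errors_model_b, h=2)
```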
