
Data science camp

Introduction to the core stats class


A.k.a. Why are you here?
Day 1

Please work in pairs today (with your neighbor)


Each pair, please do the following:
1. download the SPSS file from blackboard and save it on your
laptop
2. do NOT yet open it in SPSS
3. please do START SPSS (just starting it, nothing else for now)

Fall 2018, Peter Ebbes

Making better decisions

“Most [[people]] are poor quantitative thinkers. This widespread innumeracy is the father of zillions of bad decisions… Numbers convey information, quantitative information. Decisions are based on information. When people are innumerate—when they do not know how to make good use of available quantitative information—they make uninformed decisions.”

What the Numbers Say: A Field Guide to Mastering Our Numerical World

1
Statistics is sexy?

“I keep saying that the sexy job in the next 10 years will be statisticians”

Hal Varian, chief economist @Google

The Skills Companies Need Most in 2018

2
Making better decisions!

Historical data → probability based on model → decision

[Figure: Demand time series, historical demand (units) vs. period 0-100]

Data science camp ↔ Math camp examples:
• Quiz master problem: probability to win a car – decision to change door
• Defective computer chips: P(defective) – quality control
• Stockouts: normal probability distribution – optimal capacity?

Getting to know your instructor

• Name: Peter Ebbes


o Nationality: Dutch
• Education:
o MA in marketing and statistics (econometrics)
o PhD in economics
• Current position:
o Associate prof. of marketing
o Prior at OSU, PSU, and Michigan
• Prior work/consulting experience: e.g. Research
International, Intel, USATODAY, Philips, LinkedIn,
Globys, Amplero, Soundcloud
6

3
Current/previous research (snapshot)
• Academic/consulting projects:
o LinkedIn – investigating job seeker status
o Facebook – sampling social networks
o Statistical approaches to improve data quality
• In my research, I typically target the “quantitative” (=geek-
type) academic journals; e.g.
o Product line optimization; lead article with discussions; best paper
award (international journal of research in marketing 2011)
o Customer (value) analysis in heterogeneous markets
(Psychometrika 2012, Management Science 2015)
o Firm performance and the role of marketing (Journal of Marketing
2015, HBR 2015)
o Social effects in CRM campaigns (Journal of Marketing Research
2017)

Course objectives

Provide an introduction to the core statistics class

Basic variable types

How to summarize data numerically and graphically

DIY – brief intro to SPSS

4
Course organization

• Blackboard

• Course syllabus

• SPSS – download it, then get license for free in


office S111

Class material on Blackboard

• Before class:
– preparation guide (readings, case, class
discussion questions and dataset)

• After class:
– pdf of the lecture notes (WYSIWYG)

• After one or a few classes:


– ‘How to’ SPSS guide

10

5
What you can expect from us

• Classes will start and end on time

• Timely evaluations with constructive feedback

• Emails will be returned promptly

• Easy access office hours by appointment

11

What I expect from you

• Be on time – the doors are in the front


• Once class is in session, do not leave (only if absolutely
necessary) – the doors are in the front
• If for some reason you must be late for class or leave
early, let me know in advance
• Submit assignments on time
• No multi-tasking in class (phones, laptops, etc.)
• Ask questions if something is not clear
• Practice, practice and practice

12

6
Mathcamp quiz: (partial) results

• Idea: what if you have to make a decision, and


you have no data? Maybe you have no time to
collect it, or you feel you do not need it.

• You therefore need to rely on your judgment.

• How good are you at evaluating


quantities judgmentally?

13

Today’s lecture
Moral of quiz: be aware of your ability to evaluate numbers
judgmentally! You may want to get some data for decision making…

Part 1: data and probability models

Part 2: Intro SPSS

Part 3: Descriptive statistics

Part 4: Descriptive statistics – categorical variable

Part 5: Descriptive statistics – quantitative variable

14

7
Example: airline demand over time

[Figure: Demand time series, historical demand (in 10s of units) vs. period 0-100]

What factors may affect demand?

15

Logic of probability models

• More often than not the quantities we are interested in will not
be predictable but will exhibit an inherent variation

• It would generally be impossible to measure all the variables


that determine the phenomenon of interest in any setting

• Idea: a realistic model must take into account the possibility of


randomness

• Construct probability models so that they represent the actual


data generating process that lies behind the data. They help
us make better decisions.

16

8
Learning from past observations

[Figure: Demand time series, historical demand (in 10s of units) vs. period 0-100]

What is the probability that demand is less than 400 passengers tomorrow?

17

Plot the data: empirical distribution


Using a number of past periods, construct a frequency histogram;
these frequencies represent the probability distribution.

[Figure: Demand histogram, frequency of occurrence (left axis) and cumulative frequency (right axis) vs. demand (in 10s of units); μ = 28.1, σ = 10.9]

18

9
Approximate the data with a probability function
Use a well-defined probability function whose shape resembles
the frequency histogram to probabilistically represent demand.

[Figure: the same demand histogram (μ = 28.1, σ = 10.9), with a smooth probability function overlaid]

19

Steps in using probability functions

• What type of data are you dealing with: continuous


(quantitative) or discrete (categorical)?

• Based on the observed data (e.g. histogram),


choose an appropriate probability distribution

• Estimate the parameters of the chosen probability


distribution

• Make sure it fits!

• What we need: data, knowledge of statistics, and a


smart computer program
20
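[[ Not in the original slides: for those who prefer code over point-and-click, a minimal Python sketch of these steps. The demand array below is placeholder data for illustration; with real data you would load your own observations. ]]

```python
# Sketch only: fit a probability distribution to observed demand data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
demand = rng.normal(28.1, 10.9, size=100)   # placeholder demand data

# Step 1: the data are quantitative (continuous), so a continuous
#         distribution such as the normal is a candidate.
# Steps 2-3: estimate the parameters of the chosen distribution.
mu_hat, sigma_hat = stats.norm.fit(demand)
print(f"estimated mean = {mu_hat:.1f}, estimated std. dev. = {sigma_hat:.1f}")

# Step 4: make sure it fits, e.g. with a goodness-of-fit test.
ks_stat, p_value = stats.kstest(demand, "norm", args=(mu_hat, sigma_hat))
print(f"KS statistic = {ks_stat:.3f}, p-value = {p_value:.3f}")
```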

10
Organizing data

Data are often organized into a data table

ROWS: “cases” / “records” / “observations”; COLUMNS: variables

21

Linking data with a relational database

Big data!

22

11
Variable types: measurement levels

Categorical | Quantitative
No natural numerical meaning | Natural numerical meaning
May appear in a data table as a number | Already a number
Arithmetic makes no sense | Some arithmetic makes sense
 | Has an appropriate unit
customer ID, gender, ZIP code, brand, hair color, class grades (A,B,C), … | sales, price, stock returns, interest rate, …
Note: income in categories | Note: income in exact amount
In SPSS: nominal or ordinal | In SPSS: scale

Special one: attitude rating scales


23

Today’s lecture

Part 1: data and probability models

Part 2: Intro SPSS

Part 3: Descriptive statistics

Part 4: Descriptive statistics – categorical variable

Part 5: Descriptive statistics – quantitative variable

24

12
Mini case: American Express

American Express managers felt that usage was slowing, particularly in the retail category, typically the largest category.

Before making drastic changes to marketing spending, data were bought to investigate consumer usage of credit cards over a two-year period, relative to the competition.

25

Mini case: American Express

Basic business questions (not an exhaustive list):

– What is American Express’ market share?
– Are more men or women using credit cards?
– In which category do customers spend the most?
– Over the course of the two years, are there any trends in spending? In market share?
– …

26

13
Working with data

You need data analysis software

– SPSS: Windows-based / point-and-click, with the option to write your own code

– But there are many others: SAS, Stata, EViews, Minitab, Excel, …

– If you know how to write your own code, you have other options such as Matlab, R, Python, C++, …

“IBM bets on mergers and algorithms for growth” FT (2016)


27

How do you get your data in SPSS?


1. If paper and pencil (hard-copy files), manual data
entry in SPSS.
– Directly into SPSS

– Check preparation guide (two YouTube videos)

2. If electronic, it depends:
– SPSS can import Excel files, text files etc.

– Most online survey services (e.g. Qualtrics) allow you to


export to SPSS directly

– Etc.

28

14
SPSS: two ‘views’

• Data View and Variable View


– To switch in between the two views, click on the
tabs at the bottom of the screen

• Data view: allows you to manually enter or


modify your data

• Variable view: allows you to describe your


“variables” more fully than variable name allows

29

SPSS data view

COLUMNS are variables

ROWS are cases

30

15
SPSS variable view

COLUMNS are
“properties” of your
variables (slides 35—37)

In ‘Variable View’, the ROWS are “variables”

31

Your very first SPSS exercise!

Open in SPSS the data file for the American


Express case

session1&2_credit_card_web.sav
data_science_day_1_credit_card_web.sav

Examine data view and variable view

32

16
33

34

17
‘Variable View’ fields
• Name: short name for the variable (no spaces; no
special characters)
– Try to give it a clever, brief name referring to what the variables
are about, without using special characters or being lengthy:
• Good variable name: “ItemsGrocery” or “items_grocery”
• Bad variable name: “# of grocery items sold?”
– Good practice: always keep a record ID variable (eg ‘obs’)

• Type: type of data for that variable


– If you want to change this, click in the box and then click on
small box with three dots; most relevant are
• Numeric: numbers
• String: words (e.g. text data, open-ended survey questions)

35

‘Variable View’ fields, cont.


• Width: maximum number of characters that can be
entered for that variable
– For numeric; e.g. 8 (default) is often fine (--> 99999999)

– For string; allow for enough to enter all text

• Decimals: number of decimal places for that variable

• Label: longer description of the variable


– Be more specific and elaborate here, special characters okay

[[ For instance, if you work with survey data, you may want to
include wording from the actual question in the survey ]]

– When you create a SPSS table or chart, this “label” will be the
title of the chart or table

36

18
‘Variable View’ fields, cont.
• Values: enter the words associated with your number
codes for a categorical variable
– Only needed when your variable has categories, and the
categories are described by words

– When you analyze the data, these words will appear in the
tables / charts rather than the numeric codes

• Measure: this is the measurement level (nature of the


numbers) of the variable [[ see slide 23 ]]
1. Nominal: numbers are category labels

2. Ordinal: numbers are category labels that reflect order

3. “Scale”: numbers are numeric, quantitative

• Other fields (eg missing, align): don’t worry about them

37

Today’s lecture

Part 1: data and probability models

Part 2: Intro SPSS

Part 3: Descriptive statistics

Part 4: Descriptive statistics – categorical variable

Part 5: Descriptive statistics – quantitative variable

38

19
Descriptive statistics: describing data

• The start of a (statistical) project: before you do


anything fancy-pancy

• Descriptive statistics – make data usable

– The raw data on slide 33 (data view) is NOT usable


for decision making… at all!

– Uses numerical and graphical methods to summarize


the information in a dataset in a convenient form

39

Descriptive statistics in SPSS


(Also a long list under graphs)

A (very) long list!

It is extremely important to use the correct statistical


technique to obtain meaningful insights for decisions
40

20
How to choose the “correct” statistical
descriptive technique (1)
Generally there is not “one” correct way. Your choice
depends on:

– The purpose of the analyses


• What do we need to know for decision making?

• How will the results be used?

• Who will use the information?

– Statistical mechanics (next slide)

41

How to choose the “correct” statistical


descriptive technique (2)
Statistical mechanics – your choice depends on:
– The scale level of your variables: nominal / ordinal
(categorical) or quantitative; see also slide 23
– How many variables you analyze jointly
• Just one variable at a time (univariate statistical
analyses) – fairly easy; first thing you have to do

• Two variables at a time (bivariate statistical analyses) –


little harder

• More than two variables at a time (multivariate


statistical analyses) – pretty tough!

42

21
Today’s lecture

Part 1: data and probability models

Part 2: Intro SPSS

Part 3: Descriptive statistics

Part 4: Descriptive statistics – categorical variable

Part 5: Descriptive statistics – quantitative variable

43

Methods to examine categorical data

• Frequency tables

• Pie charts

• Bar/column charts

• Contingency tables (cross-tabs)

• Clustered and segmented bar charts

44

22
Frequency table variable ‘card’

In SPSS: Analyze – Descriptive Statistics – Frequencies

Who can interpret this table?

45
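[[ Not in the original slides: if you work in Python rather than SPSS, a frequency table like this can be produced with pandas. A minimal sketch; the file name is the case data set from Blackboard, and reading .sav files requires the optional pyreadstat package. ]]

```python
# Sketch: frequency table for the categorical variable 'card' with pandas.
import pandas as pd

df = pd.read_spss("data_science_day_1_credit_card_web.sav")  # needs pyreadstat

counts = df["card"].value_counts()                       # frequencies
percent = 100 * df["card"].value_counts(normalize=True)  # percentages

freq_table = pd.DataFrame({"Frequency": counts, "Percent": percent})
freq_table["Cumulative Percent"] = freq_table["Percent"].cumsum()
print(freq_table)
```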

Another example: date primary card issued


(~tenure of the customer)

• Research question: the American Express manager


wondered whether perhaps other brands are
attracting new users

• We could indirectly examine this through the


variable ‘card_date_yr_rec’ ‘card_date_yr’

• This variable captures the tenure of the customer


(i.e. when did (s)he sign up for the card)

• This is a categorical variable – frequency table!

46

23
Another example: date primary card issued
(~tenure of the customer)
In SPSS: Analyze – Descriptive Statistics – Frequencies

What does this table tell us?


47

Another example: date primary card issued


(~tenure of the customer)
• Previous table not useful given research question!

• We need to break it out by brand; i.e. generate a frequency


table of the variable ‘card_date_yr_rec’ for each brand

• Simple approach: create frequency table for only a subset of


the observations, first for American Express users, then for
visa users, etc.

• Let’s start with American Express users

• Easy with SPSS! Very useful option!
– ‘Data – Select cases’
– See also the Appendix to these lecture notes and the ‘How to’ guide

48
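[[ Not in the original slides: a minimal pandas sketch of the same ‘select cases’ idea. It assumes the value label for American Express users in the 'card' column is literally "American Express"; check the value labels in your own file. ]]

```python
# Sketch: the pandas analogue of SPSS 'Data - Select cases'.
import pandas as pd

df = pd.read_spss("data_science_day_1_credit_card_web.sav")

# Keep only the American Express users (label assumed; verify in your file).
amex = df[df["card"] == "American Express"]

# Frequency table of card-issue year for this subset only.
print(amex["card_date_yr"].value_counts(normalize=True).sort_index())
```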

24
Another example: date primary card issued
(~tenure of the customer)

(only those who have American Express as primary brand)

• Users with an American Express card at time of data


collection, when did they get it?

• Cumulative percent: add up the successive ‘Valid Percent’ values

49

In-class exercise 1

Repeat the same analysis for Discover users


[[ in the interest of time, we will do this only for Discover users ]]

Compare the two frequency tables

What do you conclude?

50

25
In-class exercise 1 (discussion)

[Tables: left, only those who have American Express as primary brand; right, Discover users]

Users with an American Express vs. Discover card at time of data collection: when did they get it?
51

Graphical representation of frequency tables

It is often easier to look at bar or pie charts than at a frequency table

Useful techniques to graphically display frequency tables: bar chart or pie chart

In SPSS: Graphs – Legacy dialogs – Bar (or Pie)

Let’s make a chart for the variable ‘card’

52

26
Graphical representation of frequency tables

(compare with frequency table on slide 45)


Of these two graphs, which graph is easier to extract
information from?
53

Caveats about bar and pie charts

These figures are only appropriate if observations fall


into only one of the categories

– Mutually exclusive (disjoint) events [[ math camp ]]

– Pie and bar charts should add to 100%

– These visual representations focus on a single


categorical variable; can be generalized to analyze
combinations of categorical variables

54

27
Revisit the brand-tenure analysis (slide 48)

We looked at two variables “card” and “card_date_yr”


one-at-a-time (univariate statistics)

55

Examining relationships among


two categorical variables

• Contingency tables (cross-tabs) let us examine


patterns among multiple categorical variables

• Cross-tabs can be seen as a “merger” of two


frequency tables

• Popular technique in applied research


– Easy to create and easy to understand
– Typically strong link with managerial insight and/or action

56

28
Cross tab: credit card and date issued
In SPSS: Analyze – Descriptive Statistics – Crosstabs

Cross tabs generate…. lots of numbers! Interpretation?


@Home: check how this cross table relates to frequency
tables on slide 51
57

Facilitating interpretation in cross tabs: percentages

Column percentages – this cross tab can show us what the ‘card issue shares’ were in each time period (conditional probabilities)

Conclusions?
58
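[[ Not in the original slides: a minimal pandas sketch of the same cross tab with column percentages, under the same assumptions about the data file as before. ]]

```python
# Sketch: cross tab of 'card' by 'card_date_yr' with column percentages,
# analogous to SPSS Analyze - Descriptive Statistics - Crosstabs.
import pandas as pd

df = pd.read_spss("data_science_day_1_credit_card_web.sav")

counts = pd.crosstab(df["card"], df["card_date_yr"])          # counts
col_pct = 100 * pd.crosstab(df["card"], df["card_date_yr"],
                            normalize="columns")              # column %
print(counts)
print(col_pct.round(1))
```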

29
In class exercise 2

• Research question: what credit card brands are


used by men and women?
– HINT – there are two variables underlying this
research question: ‘gender’ and ‘card’
– Both variables are categorical. We can create a cross
tab to investigate this research question.

• Research question: how does the tenure of men


compare to the tenure of women with respect to
when they got their primary card?
– HINT: what are the variables in this question?

59

Graphical representation of cross tables

• As before, it is often easier to look at charts than at tables!

• Useful techniques to graphically display cross tabs: clustered bar chart or segmented bar chart

• In SPSS: this is a bit of a challenge!

• @home exercise: study the ‘How to guide’ (will be


on Blackboard) and try to reproduce the graphs on
the next two slides

60

30
Graphically displaying cross tabs

Clustered bar chart (compare to table on slide 58)


61

Graphically displaying cross tabs

Segmented bar
chart

If the conditional distribution is the same across different


categories, the two variables are said to be independent
62

31
Another useful option in SPSS:
recode a categorical variable
• One data transformation that is used quite often is to
recode a categorical (nominal or ordinal) variable
– To collapse the categories of a categorical variable into fewer
categories (e.g. some categories are thinly populated, or
for presentation’s sake)

– But also to make a quantitative variable categorical (e.g.
we want to compare satisfaction levels of those who make
more than $100K vs. those who make less than $100K,
where income was measured in exact dollars)

• This can be easily done in SPSS:


– Transform – Recode into Different Variables
63

Example: recode the variable ‘card_date_yr’ into a new


one that has only two categories (‘before 2001’ and
‘2001 and after’)
1. Transform – Recode into Different Variables
2. Select the variable ‘card_date_yr’ into the box ‘Numeric
Variable -> Output Variable’
3. Type a new variable name in the box ‘Name’ under ‘Output
variable’ (e.g. ‘card_date_yr_rec’) [[ Note: stick to naming
conventions discussed before]], then click ‘Change’
4. Click ‘Old and New Values’
• Tell SPSS how the ‘Old Value’ maps onto the ‘New Value’
• Hint: write this on a piece of paper first
• Fill in those numbers

5. Click ‘OK’
64
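[[ Not in the original slides: the same recode as a minimal pandas sketch. It assumes 'card_date_yr' is stored as a number (a year); if it is a labeled string in your file, map the labels instead. ]]

```python
# Sketch: recode 'card_date_yr' into 'before 2001' vs '2001 and after',
# the analogue of SPSS Transform - Recode into Different Variables.
import pandas as pd

df = pd.read_spss("data_science_day_1_credit_card_web.sav")

df["card_date_yr_rec"] = pd.cut(df["card_date_yr"],
                                bins=[-float("inf"), 2000, float("inf")],
                                labels=["before 2001", "2001 and after"])

# Always inspect a few observations to check that the recoding worked.
print(df[["card_date_yr", "card_date_yr_rec"]].head(10))
```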

32
Recode the variable ‘card_date_yr’ (cont.)
• SPSS added your new variable to the back of your data file
(Data View); at the bottom of the variable list (Variable
View)
• You thought you were done, but you are not!
– Update all the fields under ‘Variable View’ for this new
variable
• Enter a description under ‘Label’ (e.g. ‘Date primary card issued
before 2001 or 2001 or after’)

• Complete ‘Values’ (e.g.: 1=before 2001, 2 = 2001 or after)

• Update the ‘Measure’ (here nominal or ordinal are fine)

– Visually inspect for a few observations that the recoding


worked out well (in Data View) – this is important!

65

Recap of techniques for categorical data

• Frequency tables, bar chart, pie chart


– Useful for looking at the distribution of a single
categorical variable

• Contingency tables, clustered and segmented bar


charts
– Useful for examining potential relationships among
two categorical variables

• Present your results in a way that is consistent with


what you need to know

66

33
Today’s lecture

Part 1: data and probability models

Part 2: Intro SPSS

Part 3: Descriptive statistics

Part 4: Descriptive statistics – categorical variable

Part 5: Descriptive statistics – quantitative variable

67

Analyzing quantitative variables

• Numerical methods
– Central tendency measures
– Dispersion measures
– Correlation (core stats class session 5)
• Visual methods
– Histograms
– Box plots
– Time series plots
– Scatterplots (core stats class session 5)

68

34
APPENDIX

Contents ---

Practice measurement scales (slide 23)

69

Practice measurement scales (slide 23)


What is the scale level / measurement level of the following
variables (answers on next slide)
1. The number of MBA students in a given year at HEC Paris
2. The age of an MBA participant, measured as 1=younger than 25yrs,
2=in between 25-30 yrs, 3=in between 31 and 35 yrs, 4 = in between
36-40 yrs, 5 = over 40 yrs
3. Whether an MBA student lives on campus or off campus
4. The overall satisfaction of the students for mathcamp measured on a
five point satisfaction scale (1=very unsatisfied, 2=unsatisfied,
3=neither satisfied, nor unsatisfied, 4=satisfied, 5=very satisfied)
5. How many times a student went to the gym in the past week, measured
as the exact number (0,1,2,3,… etc. times)
6. How many times a student went to the gym in the past week, measured
as ‘not at all’, ‘1-2’ times, ‘3-4’ times, ‘more than 4 times’
7. The GMAT score of the student applicant

70

35
Practice measurement scales (slide 23)
1. Quantitative variable; numbers of this scale have natural meaning;
multiplications in the context of this construct (number of students)
make sense
2. Ordinal scale (categorical variable); the numbers 1,2,3,4,5 reflect
order with respect to the underlying construct (age)
3. Nominal scale (categorical variable); the numbers (e.g.) 1=off
campus; 2=on campus are just labels
4. Quantitative variable; this can be debated. This is an attitude rating
scale. It seems reasonable to compute an average satisfaction,
hence, some arithmetic for this variable makes sense
5. Quantitative variable; same comment as example 1
6. Ordinal scale (categorical variable); same comment as example 2
7. Quantitative variable. It seems reasonable to compute an average
GMAT score, hence, some arithmetic for this variable makes
sense
71

36
Data science camp
Introduction to the core stats class
Day 2

Fall 2018, Peter Ebbes

Today’s lecture

Part 1: Descriptive statistics – quantitative variable

Part 2: Learning from samples

Part 3: Sample distribution

Part 4: Sampling distribution

Part 5: Confidence intervals

Part 6: Wrap up

Part 7: SPSS “lab” (@home practice)


2

1
Analyzing quantitative variables

• Numerical methods
– Central tendency measures
– Dispersion measures
– Correlation (core stats class session 5)
• Visual methods
– Histograms
– Box plots
– Time series plots
– Scatterplots (core stats class session 5)

Mini case credit card usage

Research question:

How much did credit card users spend on grocery


items in the two years of observation?

Or, how much did credit card users spend on retail


items in the two years of observation?

2
Numerical methods – central tendencies

Mean (~average), Median, Mode


Use (e.g.) SPSS (see ‘how to guide’)

Interpretation?

Numerical methods – dispersion

Standard deviation, variance, range, inter-quartile range


(IQR)

Range = max - min

IQR = [ perc(25th), perc(75th) ] = [ 78.24, 495.89 ]

Interpretation? I.e. this range contains the middle 50% of the data
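[[ Not in the original slides: a minimal pandas sketch of these numerical summaries, using the 'spent_retail' column that appears later in the deck (substitute the spending variable you need). ]]

```python
# Sketch: central tendency and dispersion for one quantitative variable.
import pandas as pd

df = pd.read_spss("data_science_day_1_credit_card_web.sav")
x = df["spent_retail"].dropna()

print("mean    :", x.mean())
print("median  :", x.median())
print("mode    :", x.mode().iloc[0])
print("std dev :", x.std())
print("range   :", x.max() - x.min())

q25, q75 = x.quantile([0.25, 0.75])
print("IQR     : [", q25, ",", q75, "]   (middle 50% of the data)")
```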

3
Graphical representation: histogram

Mode = 0.00

Median = 88.42

Mean = 129.28
25th percentile = 0.00

75th percentile = 214.30

Dealing with outliers

Outliers are observations that stand apart from the


majority of observations
– Can heavily influence our analyses and conclusions
– Examine carefully: might be errors
– Watch the tails: heavy-tailed distributions have a
higher likelihood of extreme events (e.g. catastrophic
loss) and need to be modeled in that case
– Should be noted in any conclusions drawn from the
data (e.g. run analysis twice and compare
results/conclusions)

4
Graphical representation: boxplot

A box plot displays the median, IQR, and potential outliers


simultaneously
9

Compare amounts spent groceries and retail

Side-by-side box plots


(two “variables” – amount spent and category)
10

5
Temporal Data

Time series plots can be used to see temporal


patterns in the data (“two” variables: time and rating)

11

Stock Performance

Temporal data can be seen as having “two” variables


(here, stock price and time)
Quiz: who knows what stock this is?

12

6
Mini case: amount spent on groceries

Use multiple lines / graphs to enhance comparisons (next slide)


13

Mini case: amount spent on categories by brand

Green – retail
Brown – travel
Blue – groceries

Insights for
American
Express?

14

7
Recap of techniques for quantitative data

Examining individual variables, one-at-a-time,

– Descriptive statistics (central tendencies and dispersion)

– Histograms

– Box plots (for a single variable)

These are examples of univariate descriptive statistics for


quantitative variables. In the core statistics/business
analytics class we will learn bi/multivariate statistical
techniques for these variable types

(Contrast with recap for categorical variables on day 1 slide 66)

15

Today’s lecture

Part 1: Descriptive statistics – quantitative variable

Part 2: Learning from samples

Part 3: Sample distribution

Part 4: Sampling distribution

Part 5: Confidence intervals

Part 6: Wrap up

Part 7: SPSS “lab” (@home practice)


16

8
Learning from data for decision making

• Data is often a sample – but we want to know


something about the population

• How can we learn about the population using


statistics?

• How can we support our decisions using statistics?

17

Relevant Contexts of where data is a sample

• Manufacturing
– Quality control

• Marketing
– Online ad copy testing (A/B)

• Finance
– Comparing returns from different investment portfolios

• The AE credit card case study (data science camp


day 1)
– Spending in various categories
18

9
Two forms of applied statistics

1. Descriptive statistics

2. Inferential statistics

19

Two forms of applied statistics

1. Descriptive statistics: make data usable


• describing results from a data source

• important first step in data analysis

• this is what we did up till now

2. Inferential statistics: generalize results to a population


• managerial decisions: are for a population not a sample

• use a sample to infer about the population (‘The almighty


truth’) → the sample suggests something about it

• we will never be sure of the ‘exact truth’, but can make a


good educated guess about it using inferential statistics

20

10
The two ‘key’ concepts in statistics

Population (notation: μ, σ):
-- total of all elements that share some common characteristic(s)
-- goal of stats: learn about it (‘Truth’)

Sample (notation: X̄, S, p):
-- a subset of a population
-- goal of stats: use the sample to learn about the population
21

Opinion polls

9/21/2011 USA Today

22

11
The idea behind sampling

[Diagram] From a population of (BIG) size N, every sample of size n is equally likely: n1, n2, …, n(N choose n); only one of these (n_obs) is the observed sample.

For each hypothetical sample of size n you compute a statistic T: T1, T2, …, T(N choose n). A histogram of all these hypothetical T’s would determine the ‘margin of error’ (previous slide).

23

Two sampling questions to class

• If we want to get a sample, what generally would


we need before we can sample?

• What sampling strategies can you think of?

24

12
Two sampling questions to class

• If we want to get a sample, what generally would


we need before we can sample?
– A definition of a population
– A sampling frame (a list of all members of the
population)
– Sample size

• What sampling strategies can you think of?

25

Sampling strategies – from ideal to worst


[[ in a nutshell ]]

• Simple random sampling – all members of the population


have an equal chance of being chosen
• Cluster sampling – when it is easy to measure groups or
clusters, randomly sample clusters and include everyone
in the cluster (e.g. households, city blocks)
• Stratified sampling – attempt to make sure that the
sample contains different types of people in proportion to
their numbers in the population; select a simple random
sample from each stratum
• Convenience sampling – sampling in whatever way is
easy for you to do (ahem)

Note: most basic stats assumes simple random sampling


26

13
Today’s lecture

Part 1: Descriptive statistics – quantitative variable

Part 2: Learning from samples

Part 3: Sample distribution

Part 4: Sampling distribution

Part 5: Confidence intervals

Part 6: Wrap up

Part 7: SPSS “lab” (@home practice)


27

Mini-case: insurance claims

28

14
Mini-case: insurance claims

WSJ 09/2012

The ABI in a report said that its members detected


139,000 bogus or exaggerated insurance claims
last year, up 5% from 2010. The amount of money
saved by insurers from detecting these bogus
claims was around 983 million pounds ($1.58
billion), up 7% from the previous year.

29

Mini-case: insurance claims

WSJ 09/2012

Most of that was from dishonest car drivers, who


filed for GBP541 million worth of bogus claims.
Another big source of fraud is home insurance,
with dishonest homeowners filing for GBP106
million in fake claims last year.

30

15
Mini-case: insurance claims
• Fairly rich dataset from a large insurance company

• The company wants to better understand its fraud
claims in home insurance, and what it can do to better
detect and prevent them

• Some questions to class:

– What is the population in this case?

– How big is the sample?

– What are the key variables?

– What analyses would you perform for claim amount?

31

Key stats for claim amount

What does this tell us?


32

16
Key stats for claim amount

Sample distribution for cost of claim amount


33

Key stats for claim amount

Sample distribution for cost of claim amount


34

17
Two fundamental distributions in statistics
• Population distribution:
– frequency distribution (histogram) of the population elements for a
certain variable (e.g. claim amount); generally a smooth line
– It is unknown but you want to know about it
– The mean of the population distribution is μ (how could you
compute it?)
– The standard deviation of the population distribution is σ
• Sample distribution:
– frequency distribution (histogram) of the sample elements for e.g.
claim amount
– it is known once we have our sample
– the mean of the sample distribution is the sample mean X̄
– the standard deviation of the sample distribution is S
– usage: to infer (=learn) about the population distribution

35

Today’s lecture

Part 1: Descriptive statistics – quantitative variable

Part 2: Learning from samples

Part 3: Sample distribution

Part 4: Sampling distribution

Part 5: Confidence intervals

Part 6: Wrap up

Part 7: SPSS “lab” (@home practice)


36

18
Statistical inference: learning about the
population mean

• Many statistical inference problems start with learning


about the population mean µ

• Our best “guess” for the population mean is the sample


mean X̄

• Mini-case:
– Sample mean is 73.01 (in 1000$) [[ slide 32 ]]
– Can we conclude µ = 73.01?

• No! So, then how “close” is X̄ = 73.01 to the unknown µ?

37

Sampling error

• Different samples typically have different means

• Means in samples are generally different from the


(unknown) mean of the population

• The statistician wonders:


– How much variation is there in my sample mean?
– How far is the sample mean from the true population
mean?
– Can we “estimate” the sampling
error/uncertainty?

38

19
Understanding the variability in sample means
Consider the following hypothetical ‘mini’ population (‘the truth’): μ=72.5
Claim ID 1 2 3 4 5 6 7 8 9 10
Claim (1000$) 73 71 69 71 75 78 70 72 72 74

Theoretical exercise: let’s draw several (random) samples of size n=2

– sample 1: ID(1, 2) -- (73, 71) → sample mean X̄ = 72 → pretty close to μ!
– sample 2: ID(3, 4) -- (69, 71) → sample mean X̄ = 70
– sample 3: ID(5, 6) -- (75, 78) → sample mean X̄ = 76.5 → not so close to μ!
– sample 4: ID(7, 8) -- (70, 72) → sample mean X̄ = 71
– sample 5: ID(9,10) -- (72, 74) → sample mean X̄ = 73
– sample 6: ID(1, 3) -- (73, 69) → sample mean X̄ = 71
– etcetera, until we have all possible combinations of two (45 in total)

These 45 means will form the sampling distribution. Of what form will
this distribution be (think about making a histogram of these 45 sample means)?

39
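[[ Not in the original slides: a minimal Python sketch that enumerates all 45 samples of size n=2 from the ‘mini’ population above and computes their means, so you can see the sampling distribution (and check that its mean equals μ = 72.5). ]]

```python
# Sketch: build the sampling distribution for the 10-claim mini population.
from itertools import combinations
import numpy as np

claims = [73, 71, 69, 71, 75, 78, 70, 72, 72, 74]   # values from the slide

sample_means = [np.mean(pair) for pair in combinations(claims, 2)]
print(len(sample_means))                 # 45 possible samples of size 2
print(np.mean(sample_means))             # equals the population mean 72.5
print(np.round(sorted(sample_means), 1)) # histogram these to see the shape
```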

Sampling distribution and the central limit theorem (CLT)


Definition of sampling distribution for a mean:
– The sampling distribution is the distribution (histogram) of the sample
means (X̄).
– It is a theoretical distribution: (a) take many samples from a
population, (b) compute X̄ for each sample, and then (c) construct a
frequency distribution (histogram) from all computed sample means.
– When the sample size used to compute the sample means is large, the
sampling distribution is a normal distribution (bell shaped), with
• Mean: μ (hence, it has the same mean as the population distribution)
• Standard deviation: σ/√n
• For practical purposes statisticians use S/√n for the standard deviation
of the sampling distribution
• This result holds for virtually any population distribution (magic!)

This result is the Central Limit Theorem (CLT)


40

20
[[ Challenging ]] In-class exercise
Discuss how the sampling distribution gives insight into how
well your sample mean X̄ estimates the population mean μ.
Hint:
– Consider the following two situations:
1. Consider a sample of 100 cases from company A. The sample
mean for claims is 72 and the sample standard deviation S = 20.
What does the sampling distribution look like?
2. In a sample of 100 from company B the sample mean is 68. The
standard deviation S = 5. What does the sampling distribution
look like?
– Which estimate (72 or 68) gives you the most confidence to infer
about the population mean?

41

Three (!) fundamental distributions in statistics


• Population distribution:
– frequency distribution (histogram) of the population elements for a certain variable (e.g.
height or income); generally a smooth line
– It is unknown but you want to know about it
– The mean of the population distribution is μ (how can you compute it?)
– The standard deviation of the population distribution is σ
• Sample distribution:
– frequency distribution (histogram) of the sample elements for e.g. height
– it is known once we have our sample
– the mean of the sample distribution is the sample mean (symbol?)
– the standard deviation of the sample distribution is S
– usage: to infer about the population distribution
• Sampling distribution:
– theoretical frequency distribution of possible sample means
– it is bell-shaped (a normal distribution)
– it has the same mean as the population distribution μ (it is unknown!)
– the standard deviation is S/√n (a.k.a. “standard error of the mean”)
– usage: to get an idea of how well our sample mean estimates the population mean μ

Completes slide 35
42

21
Today’s lecture

Part 1: Descriptive statistics – quantitative variable

Part 2: Learning from samples

Part 3: Sample distribution

Part 4: Sampling distribution

Part 5: Confidence intervals

Part 6: Wrap up

Part 7: SPSS “lab” (@home practice)


43

Uncertainty in statistics

Statisticians compute statistics, of


course, but what really makes
someone a statistician, and not a
hack, is that (s)he thinks of the
statistic as an estimate of some
unknown quantity (a parameter)

Systems with people (customers,


employees, traders) have much
more uncertainty than (say) an
automated production line

44

22
Big picture
Constructing a confidence interval (CI) for an
unknown parameter

CI = (best guess) ± (“Magic #”) × SE(best guess)

‘Margin of sampling error’

Best guess: comes from the data


“Magic #”: depends on how much certainty you
want

45

Computing a CI for the mean


• You use the following formula:

X̄ ± Z_confidence × S/√n

where Z_confidence × S/√n is the ‘margin of sampling error’

• Two steps to compute a confidence interval

– Step 1: obtain the values for X̄, S, and n

– Step 2: obtain the value for Z_confidence

• For a 95% confidence interval: Z_confidence corresponds to the central 95% area under the standard normal curve
46

23
Confidence interval: Z_confidence for 95% area

[Figure: standard normal curve; the area between 0 and ±Z_confidence is 0.475 on each side, and 0.025 in each tail beyond -Z_confidence and +Z_confidence]

-- Important to remember --
Standard normal curve has:
(1) mean = 0
(2) std. dev. = 1

The corresponding Z-value is 1.96 (use probability calculator; e.g.


PQRS)
47

Use probability calculators on a computer or


the web

PQRS 2: see Blackboard (or, http://www.pyqrs.eu/home/#history)

Web: http://homepage.divms.uiowa.edu/~mbognar/applets/normal.html

HEC: http://rstudio-test.hec.fr/probcalc/

[[ FYI: check out appendix for some extra info and exercise on
probability calculations for a normal distribution ]]
48

24
Using PQRS

For 95% confidence interval:100 – 95 = 5% in tails


(hence, 2.5% in left tail and 2.5% in right tail)

[Screenshot: PQRS with Z = 1.96. Probability that Z < 1.96 = area under the normal curve left of Z = 1.96; probability that Z > 1.96 = area under the normal curve right of Z = 1.96]

49

Using PQRS

For 90% confidence interval:100 – 90 = 10% in tails


(hence, 5% in left tail and 5% in right tail)

[Screenshot: PQRS with Z = 1.65. Probability that Z < 1.65 = area under the normal curve left of Z = 1.65; probability that Z > 1.65 = area under the normal curve right of Z = 1.65]

50

25
Mini-case: insurance claims
Confidence interval for μ for the variable insurance claims

( X̄ − z_c × S/√n , X̄ + z_c × S/√n )

X̄ = sample mean = 73.01
𝑆 = sample standard deviation = 144.40
𝑛 = sample size = 4415
𝑍_c = 1.96 for a 95% confidence interval

95% CI is (68.75, 77.27) Interpretation?

51

Mini-case: insurance claims

Interpretation:

“We are 95% sure that the mean claim amount of


all claims at this company is contained in the
interval (68.75, 77.27)”

[[ here, all claims filed at this company constitute “the


population”. We would like to know the population
mean but do not observe a census of all claims filed. ]]
52

26
In-class exercise

Q1 Use a probability calculator to find the Z-value you


need for a 90% and 99% confidence interval

Q2 Compute the 90% confidence interval for claim


amount

53

In class exercise solutions

Q1 𝑍 = 1.645 for a 90% confidence interval and 𝑍 = 2.575 for a 99% confidence interval

Q2 Upper bound: 73.01 + 1.645 × 144.40/√4415 = 73.01 + 1.645 × 2.173 = 73.01 + 3.575 = 76.59

Lower bound: 73.01 − 1.645 × 144.40/√4415 = 73.01 − 3.575 = 69.44

90% confidence interval: (69.44, 76.59)

FYI: you can have SPSS compute confidence intervals for


you, see ‘How to guide’ of today’s class!

54

27
Confidence interval – couple of remarks

Mini case insurance claims

– Average claim amount (in sample of size n=4415) is


73.01; standard deviation is 144.40 (both in 1000$)

– CI: (68.75, 77.27)

– Meaning: “We are 95% sure that the true (unknown)


population mean is contained in the interval (68.75,
77.27)”

A confidence interval tells us how large the random


error is and is more informative than a point estimate

55

Confidence interval – remark 1

We can also compute a CI for proportions if it


concerns a dichotomous variable (y/n, M/F, fraud/no
fraud etc.)
– Population proportion: π (its value is unknown)
– Sample proportion: p (its value is known)

Mini-case example: what is the proportion of fraudulent


claims?
– Variable ‘fraudulent’ is nominal scaled
– Hence, analyze with a frequency distribution (next
slide)

56

28
Confidence interval – remark 1 (cont.)

So, the proportion of fraudulent claims is p = 0.105. How close is it to the true, unknown population proportion π?

( p − z_c × √(p(1 − p)/n) , p + z_c × √(p(1 − p)/n) )

The 95% CI for π is (0.096, 0.114). Interpretation?


57
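[[ Not in the original slides: the same calculation for the proportion, as a minimal Python sketch using the numbers on this slide. ]]

```python
# Sketch: 95% confidence interval for the proportion of fraudulent claims.
import math

p, n, z = 0.105, 4415, 1.96
se = math.sqrt(p * (1 - p) / n)           # standard error of the proportion
print(round(p - z * se, 3), round(p + z * se, 3))   # (0.096, 0.114)
```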

Confidence interval interpretation

“We are 95% confident that the proportion of claims


that are fraudulent out of all claims filed at this
company is in between 0.096 and 0.114.”

[[ again, as before, “the population” is the set of all claims


filed at this company. We would like to know the proportion
in the population. ]]

58

29
Confidence interval – remark 2

The standard normal distribution that we need to get


the ‘𝑍’ value to compute confidence intervals is an
approximation.

– For means, we rather use the 𝑡-distribution. However,


when our sample size is large enough (larger than
about 50), the practical differences are minimal

– For proportions, we rather use the binomial


distribution [[ don’t worry about its details ]]. But, again, if
the sample size is large enough (𝑛 ≥ 50 and 𝑛𝑝 ≥ 10
and 𝑛(1 − 𝑝) ≥ 10), the differences are small and the
standard normal distribution is fine

59

Confidence interval – remark 3


• The confidence interval procedure is rather “robust”
– In plain English: the calculations are fairly insensitive to data
artifacts or assumptions
– The central limit theorem holds for any population distribution
(more or less) – magic!
• In case of strong deviations from data that follows a
normal distribution, large(r) sample sizes are important
• Other good practice: transform your data
– The variable cost of claim has a rather long-tailed distribution
(e.g. slides 33 or 34); sample mean X̄ and sample standard
deviation 𝑆 are sensitive to outliers
– What would be a good transformation? Try @home (part 7)!

60

30
Today’s lecture

Part 1: Descriptive statistics – quantitative variable

Part 2: Learning from samples

Part 3: Sample distribution

Part 4: Sampling distribution

Part 5: Confidence intervals

Part 6: Wrap up

Part 7: SPSS “lab” (@home practice)


61

Summary of data science camp main topics


• Basic descriptive and inferential statistics
• Start of a data analytics project: get to know your
data, usually univariate (one-variable-at-a-time),
graph your data (this is descriptive statistics)
• Then, focus on learning and decision making – go
beyond sample / statistical model
– You want conclusions and decisions for an entire
population
– You need to understand sampling variability
– This is inferential statistics [[ will develop in core stats class ]]

62

31
Univariate, descriptive, statistics
• A very important first step [[ and sometimes the only step ]] of
statistical work:
– Get a feel for data and become friends!
– Examine data for accuracy
– Helps decide follow up research/analyses

• What techniques to use? Depends on the measurement /


scale level of your variable(s) [[ day 1, part 3 ]]
– Categorical variable: Numerical (frequency table with
proportions, counts) and / or Graphical (bar or pie chart)
– Quantitative variable: Numerical (central tendency, dispersion, 5
number summary) and / or Graphical (boxplot, histogram)

• Always make sure you know the basic numbers and graphs
for the key variables when you use data for decision making

63

Bivariate, descriptive, statistics


We did a bit of bivariate statistics: analyzing TWO variables
jointly. Again the scale levels of the two variables helps us to
decide what statistical technique could be useful
This is largely underdeveloped at the moment, but we will
continue work on this in the core statistics and business
analytics class
When we have TWO categorical variables:
Numerical: ??
Graphical: ??
[[ e.g. day 1 slides 56—58 ]]
When we have TWO variables, and one variable is time: ??

64

32
Inferential statistics
• We often need to learn something about the “state of the
world” using a sample
• However, the “decision space” lies beyond the data
• Inferential statistics helps out:
– Realization: if I get a different sample, my statistics will
change!
– Important to know for decision making: by how much?
– Fortunately, we only need one sample (and knowledge of
the central limit theorem) to get an idea of this
– How? Compute a confidence interval! (eg day 2 part 5)

• We will further develop this in the core statistics and


business analytics class
65

We learned a bit of SPSS


• How data is stored in SPSS (data and variable view)

• Basic descriptive statistics (numerical and graphical)

• Useful SPSS options

– Select cases (day 1 slide 48)

– Recode a categorical variable (day 1 slides 63—65)

– Compute a new variable (appendix; @home exercise day


2 part 7)

• Review the SPSS ‘How to guides’ of day 1 and 2

• We will practice SPSS a lot in the months to come!


66

33
Final remarks data science camp
• Keep an eye out for the quiz!

– You need to fill it out to get a pass grade for the camp. Your
grade does not matter!

[[ Of course, take the quiz seriously. It will signal where you stand. This material is relevant for the core stats class! ]]

• Stats class starts next week Tuesday! We’ll dive right in!

– Keep an eye on your email/blackboard

• Don’t hesitate to contact us if you have questions or


concerns. We are here to help!

67

Today’s lecture

Part 1: Descriptive statistics – quantitative variable

Part 2: Learning from samples

Part 3: Sample distribution

Part 4: Sampling distribution

Part 5: Confidence intervals

Part 6: Wrap up

Part 7: SPSS “lab” (@home practice)


68

34
@home SPSS practice
SPSS work for the credit card case (day 1)
[[ answers will be given in the how to guide ]]

Q1 Of the five categories (groceries, retail, entertainment,


travel, other), which category has, on average, the highest
monthly spending? And which category has the lowest?

Q2 What is, on average, the number of items customers


bought in a given month in the retail category? And, what is
the modal number?

Q3 What can be learned from the histogram for the monthly


number of items purchased in the retail category?

69

@home SPSS practice


SPSS work for the insurance case (day 2)
[[ answers will be given in the how to guide ]]

1. Become friends with the data! Examine the variables


‘claim_type’, ‘coverage’, and ‘edcat’. What did you learn
from these analyses?

2. Compute the log of claim amount. What does the


distribution look like? Is this the population, sample, or
sampling distribution?

[[ HINT: review the appendix of today’s lecture notes to


learn how to compute a new variable in SPSS ]]
[[ Continued on next slide ]]
70

35
@home SPSS practice
SPSS work for the insurance case (day 2)
[[ answers will be given in the how to guide ]]

3. What is the proportion of properties that were rendered


uninhabitable? How certain are you about your estimate?

4. Explore whether there is a difference in fraudulent


claims between properties that were rendered
uninhabitable and those that were still habitable.

[[ HINT: this question is an example of bivariate,


descriptive, statistical analysis ]]

71

APPENDIX

Contents ---
How to generate a simple random sample in Excel or SPSS
(slide 73)

Another useful option in SPSS: compute a new variable (slides


74 – 77)

Continuous probability distributions (slides 78—80)

The normal probability distribution (slides 81—84)

Normal probability calculations exercise with answers (slides


85—91)

72

36
Simple Random Sample in Excel and SPSS
• To select a simple random sample in Excel:
– Create column in Excel (labeled “random” or similar)
– Use Excel’s =RAND() function to generate a random
number for each observation
– “Freeze” the random number
• Edit-Copy-Paste Special-Values
– Sort by the random number
– Take the first n rows (n=desired random sample size)
• In SPSS, this can be done through the option
“Random sample of cases” in the menu ‘Data—
Select Cases…’
73
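[[ Not in the original slides: a minimal Python sketch of the same idea (a simple random sample of n rows), using a small made-up data frame for illustration. ]]

```python
# Sketch: draw a simple random sample of n rows with pandas.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"obs": range(1, 1001),
                   "claim": rng.normal(73, 144, size=1000)})  # made-up data

sample = df.sample(n=100, random_state=1)   # n = desired sample size
print(len(sample), "rows sampled")
```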

Another useful option in SPSS:


Compute Variable
• Often, we need to perform a transformation on our data.
This can be done in SPSS with the compute command.
This command allows us to carry out various functions
on columns of data in our data file
• Common transformations:
– multiplying all values in a column by a constant (e.g. to
express the variable Price in euros instead of dollars)

– taking the logarithm of a variable (e.g. to handle outliers)

– adding up two or more columns

– Etc.

74

37
Compute a new variable in SPSS

• Research question (credit card case, day 1): the manager


would like to know the distribution of monthly $ spent per
item for the retail category
• We now need to compute a new variable: divide
‘spent_retail’ by ‘items_retail’
– Transform – Compute Variable

– In the ‘Target Variable’ box enter the name for the new variable (e.g.
‘spent_per_item_retail’)

– In the ‘Numeric Expression’ box, type in the formula: ‘spent_retail /


items_retail’ (you can select the variables out of the variable list on
the left hand side)

– Click ‘OK’

75
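[[ Not in the original slides: the same computation as a minimal pandas sketch; 'spent_retail' and 'items_retail' are the column names used in the SPSS steps above. ]]

```python
# Sketch: compute monthly $ spent per item in the retail category,
# the analogue of SPSS Transform - Compute Variable.
import pandas as pd

df = pd.read_spss("data_science_day_1_credit_card_web.sav")

df["spent_per_item_retail"] = df["spent_retail"] / df["items_retail"]

# Division by zero yields inf/NaN here; SPSS flags similar cases as errors.
print(df["spent_per_item_retail"].describe())
```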

Compute a new variable in SPSS


• SPSS created the new variable ‘spent_per_item_retail’
– Complete the variable definition in ‘Variable View’ (!)
– Inspect for a few cases that the computation worked in Data View (!)
– NOTE the error messages! Do they make sense? Can we ignore
them or did something go wrong?

• See also how to guide of today’s class for a way to avoid


the error messages in this particular example

• You can now analyze this new variable. As this is a


quantitative variable, one could compute the mean and
standard deviation, or visualize the distribution in the
sample through a histogram (next slide)

76

38
Histogram of monthly $ spent per item (retail)

A histogram for the newly computed quantitative variable plots


also the mean and standard deviation
77

Probability distributions for continuous random variables
A quantitative variable (e.g. day 1 part 1) is sometimes
modeled by means of a continuous random variable, say 𝑋

The probability distribution of 𝑋 is described by a density


curve 𝑓(𝑋)

The probability of any event is the area under the density


curve and above the values of 𝑋 that make up the event
(see next two slides graphically)

In a sample, this density curve may be approximated by a


histogram

78

39
Probability distribution continuous variable

𝑓(𝑋)

Interpretation of the height of the density function: how


closely are the values of the random variable packed at
places on the x-axis.
79

Probability distribution continuous variable

𝑓(𝑋)

Interpretation of the area under the density function:


represents the probability of event 𝐴. The total area under
any density curve is 1.
80

40
Continuous probability model: Normal distribution

• The normal distribution is among the most popular


continuous probability models

• Symmetric, unimodal, bell-shaped curve

• Characterized by two parameters: the mean 𝜇 and


the standard deviation 𝜎

• Importance in statistics and business


– Linear regression (marketing mix, forecasting,
financial beta)
– Confidence intervals (sampling)

81

Normal distribution formalities

For the record, if:

𝑋 ~ 𝑁(𝜇, 𝜎)   (notation means 𝑋 is distributed as Normal with mean 𝜇 and st.dev. 𝜎)

then the density curve is given by the function

𝑓(𝑋) = (1 / (𝜎√(2𝜋))) · e^( −(𝑋 − 𝜇)² / (2𝜎²) )
Further, to determine the probability a point lies between
“𝑎” and “𝑏”, you take the integral of this function from 𝑎 to 𝑏.

And, since it is a density curve, the integral over the range


(here, from –infinity to +infinity) equals 1

82

41
Properties of the Normal distribution

The 68-95-99.7 rule for the normal distribution

Interpretation? See math camp!


83

Use probability calculators on a computer or


the web
[Screenshot: PQRS window showing the distribution, its parameters, the value 𝑋 = 0, and the probabilities that X < 0 and X > 0]

Step 1: Select the distribution


Step 2: Specify its parameters (𝜇 and 𝜎 for the normal distribution)
Step 3: to find a probability for a given 𝑋, supply a value for 𝑋; to find a
𝑋 for a given probability, supply the probability

84
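[[ Not in the original slides: the same probability calculations can be done in Python with SciPy instead of PQRS; a minimal sketch covering the exercises on the next slides. ]]

```python
# Sketch: normal probability calculations with scipy.stats.norm.
from scipy.stats import norm

# Standard normal (mean 0, std. dev. 1):
print(norm.cdf(2) - norm.cdf(-2))          # P(-2 < Z < 2)
print(norm.cdf(1) - norm.cdf(-2))          # P(-2 < Z < 1)

# Flight demand ~ N(500, 100):
print(norm.cdf(600, loc=500, scale=100))   # P(X < 600)
print(norm.cdf(600, 500, 100) - norm.cdf(450, 500, 100))   # P(450 < X <= 600)

# Inverse problem (exercise 5): demand level not exceeded with 90% probability.
print(norm.ppf(0.90, loc=500, scale=100))
```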

42
Normal probability calculations exercise 1&2

Q1 – use a probability calculator for a standard normal


distribution to calculate the probability of observing a Z
value in between -2 and 2

Q2 – use a probability calculator for a standard normal


distribution to calculate the probability of observing a Z
value in between -2 and 1

85

Normal probability calculations exercise 3


The data analytics team at AirFrance (AF) examined
historical data of daily demand for a particular flight route and
found that it has a distribution that is roughly normal with a
mean of 500 and a standard deviation of 100.
Suppose AF allocates enough flights to accommodate 600
passengers. Based on the 68-95-99.7 rule, how likely is it that
the airline offers enough flights to meet demand?

P(Demand<600) = ?
86

43
Normal probability calculations exercise 4&5

As previous exercise, assume that flight demand is


nearly normal with N(500,100).
4. Suppose that we currently allocate enough planes to
meet demand for 450 passengers. Adding another
plane would give us capacity for 600 passengers. How
likely is it that demand will fall between 450 and 600
passengers?
5. Suppose we are only willing to run a 10% chance that
we don’t offer enough flights to meet passenger
demand. How many passengers must we plan to
accommodate?

87

Normal probability calculations solutions 1—3


Note: a standard normal distribution has mean 0 and variance 1.

Q1 – may use the 68-95-99.7 rule; or 𝑃(−2 < 𝑍 < 2) = 0.954 ≈ 0.95 with a probability calculator

Q2 – need to use a probability calculator; 𝑃(−2 < 𝑍 < 1) = 𝑃(𝑍 < 1) − 𝑃(𝑍 < −2) = 0.8413 − 0.0228 = 0.8185 ≈ 0.82

Q3 – In this exercise we work with a normal distribution with mean 500 and standard deviation 100. We can compute the probability that is asked for with a probability calculator. The answer is 𝑃(𝑋 < 600) = 0.84.

For an alternative solution using the 68-95-99 rule, see the next
slide.

88

44
Normal probability calculations solutions 1—3

Q3 continued – In this exercise we could also use the 68-95-99.7 rule (suppose we did not have a probability calculator).

Based on that rule, we know that 68% of the observations is


within 1 standard deviation of the mean, in other words, 68% of
the observations is in the range [400,600].

Because of symmetry of the normal curve, we also know that


34% is within [500,600].

Again, because of symmetry we know that 50% of the


observations is in the range [-infinity,500].

Taking these last two results together, we can conclude that


34%+50%=84% of the observations is in the range [-infinity,
600].

89

Normal probability calculations solution Q4

We need the area under the curve indicated above. This can be computed using PQRS in two steps:
– Compute the probability 𝑃(𝑋 < 600) = 0.8413
– Compute the probability 𝑃(𝑋 < 450) = 0.3085
– Then 𝑃(450 < 𝑋 ≤ 600) = 0.8413 − 0.3085 = 0.5328

90

45
Normal probability calculations solution Q5
Sometimes we need to find the observed value
corresponding to a given proportion. Here, we are given a
maximum probability (10%) that we don’t offer enough
flights to meet demand. How many passengers must we
plan to accommodate to not exceed this risk?

[Screenshot: probability calculator – 1. fill in the given probability; 2. click; 3. read off your observed value]
91

46
Statistics and business analytics

Supporting decisions with data


Applications of hypothesis testing

Session 1

Fall 2018, Peter Ebbes



Poll of n=732 data scientists (63% industry, 11% academia, 26% other)
Source: Kdnuggets
2

1
Course organization
• Blackboard

• Course syllabus

• Quizzes

• SPSS labs

• Quiz data science?

Grading policy
• Grading scale:

– SPSS computer lab: 30%

– Midterm (individual): 30%

– Final (team w. peer evaluation): 40%

– Total: 100%

• MBA policy: every grade is a “relative grade” rather


than an “absolute grade”

• The average final letter grade (based on total average)


will be a 3.6 (GPA) or lower

2
Last topic data science camp

• Learning from a sample about a population

• Key point?
– different samples lead to different statistics (e.g.
mean, proportion, standard deviation etc.).
– question: ‘how different is different’?

• Solution?
– Sampling distribution → confidence intervals
for means

Today’s lecture

Part 1: Hypothesis: basics

Part 2: 6 steps of hypothesis testing

Part 3: Some remarks hypothesis testing

3
Today’s lecture: supporting decisions with data
Mini-case insurance claims: the accounting department
reports that the average claim last year was $63500.

Management will consider a policy premium change if the


average cost of claims increases by 10% or more from last
year.

Here, $63500 + 10% = $70000 (rounded).

We found in our data (data science camp part 3) that, this


year, the average claim size in our sample is $73000
(rounded).

Would you order a reevaluation of the policy premiums?

Supporting decisions with data


Example: criminal trial – presumption of innocence (“innocent until
proven guilty”)
– The null hypothesis is that you are innocent
– The alternative hypothesis is that you are not innocent (guilty)
– Evidence is necessary to reject the null hypothesis (deference to the null
hypothesis)
                                      Truth
                                      NOT GUILTY              GUILTY

Our           Reject the null:
Conclusion    GUILTY                  False Positive          Correct
              (GO TO JAIL)            (Type I error)

              Not reject the null:
              NOT GUILTY              Correct                 False Negative
              (NO JAIL)                                       (Type II error)
8

4
Steps in Hypothesis Testing

Problem Definition

Step 1 Clearly State the


Null and Alternative
Step 2 Hypotheses
Step 3

How much certainty Conduct the appropriate What data have you
do you want? test collected?

Is the gap between No


the expected and
observed Do not reject null
sufficiently big?
Step 4&5
Managerial conclusion
Reject null hypothesis

Step 6

Testing a hypothesis about a population mean: 6 steps

1. Formulate the null and alternative hypotheses


2. Choose the significance level
3. Compute the test-statistic
4. Prepare a statistical decision (P-value)
5. Make a statistical decision: reject or not reject the null
hypothesis
6. Make a managerial decision/interpretation: interpret
the statistical decision in ‘plain’ English

Note: you should not do 1&2 in retrospect

10

5
Today’s lecture

Part 1: Hypothesis: basics

Part 2: 6 steps of hypothesis testing

Part 3: Some remarks hypothesis testing

11

Step 1: Formulate the statistical hypothesis for a mean


You need to formulate TWO hypotheses:

The null hypothesis H0


– it specifies that the population mean μ is equal to a single value (the value
that you want to test)
– it is usually the ‘status quo’, the ‘current norm’, or a situation in the past

The alternative hypothesis H1 or HA


– is also a statement about μ; it states that the population mean is different
(unequal, larger, or smaller) from the value specified under H0
– typically, H1 represents our decision alternative: rejecting H0 usually
implies action. We want strong evidence (from data) for that

Important to remember: the hypotheses refer to a specified value of the


population parameter (e.g., μ), not a sample statistic (e.g., X̄) or
sample value (e.g. 73 ($1000)).

12

6
Step 1: one-sided vs. two-sided hypothesis tests
• One-sided tests are focused on departures from H0 in a
single direction
– In the claims mini-case, we want to know if the new claims
are 10% higher on average than the old claims to warrant
new policy premiums

• Two-sided tests focus on any departure from H0 (greater


than OR less than)
– Mini-case: are the average insurance claim cost higher or
lower than 70 (in $1000s)?

• Good practice: always use two-sided tests, except in


those cases which are clearly and blatantly one-sided
from their description

13

Step 2: choose the significance level


The significance level is denoted by α. It indicates how
certain we are in our decision. You choose this number
(typically α = 0.10, 0.05, or 0.01), and use this in step 5.

                                      Truth
                                      NOT GUILTY                      GUILTY

Our           Reject the null:
Conclusion    GUILTY                  False Positive                  Correct
              (GO TO JAIL)            (Type I error), probability α

              Not reject the null:
              NOT GUILTY              Correct                         False Negative
              (NO JAIL)                                               (Type II error)

14

7
Step 2: choose the significance level
The significance level is denoted by α. It indicates how
certain we are in our decision. You choose this number
(typically α = 0.10, 0.05, or 0.01), and use this in step 5.

                                         Truth
                                         H0: μ = 70                      HA: μ > 70

Our           Conclude μ > 70
Conclusion    (new claims at least       False Positive                  Correct
              10% higher)                (Type I error), probability α

              Conclude μ = 70            Correct                         False Negative
                                                                         (Type II error)

15

Step 3: compute a test statistic

• A test-statistic measures how ‘close’ the sample has come


to the null hypothesis
• Remember: goal of hypothesis testing is to use a sample to
prove or disprove an idea about the population
• A well-thought-out test statistic (statisticians figure this out)
follows a well-known distribution such as the normal, t-, or
chi-square distribution
• For testing about a mean, you use the following test-
statistic:
Z_test = (X̄ − μ0) / (S / √n) = (73.01 − 70) / (144.40 / √4415) = 3.01 / 2.17 = 1.39
• Interpretation?
(numbers are from insurance claims mini
case; e.g. data science camp day 2 part 3)
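A minimal sketch of this test statistic (and the P-value of step 4) in
Python, using the summary numbers above (scipy assumed available; the
course itself uses PQRS/SPSS):

    # One-sample Z-test for the mean claim amount (H0: mu = 70, in $1000s)
    from math import sqrt
    from scipy.stats import norm

    x_bar, mu0, s, n = 73.01, 70, 144.40, 4415
    z_test = (x_bar - mu0) / (s / sqrt(n))       # about 1.39
    p_one_sided = 1 - norm.cdf(z_test)           # about 0.08 (for H1: mu > 70)
    p_two_sided = 2 * p_one_sided                # about 0.16 (for H1: mu != 70)
    print(round(z_test, 2), round(p_one_sided, 4), round(p_two_sided, 4))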
16

8
Step 4: prepare a statistical decision

• So, we computed that the ‘distance’ between sample and


hypothesis is 𝑍𝑡𝑒𝑠𝑡 = 1.39, which represents the gap
between what we expect (H0) and observe (data)
– If the sample is close to the null hypothesis, we are willing to
believe the null
– If the sample is far from the null hypothesis, we are not
willing to believe the null
• What’s needed: a precise definition of “close” and “far”
• That definition comes from the following statistical result:
If the null hypothesis is true, then the frequency distribution of all
possible 𝑍𝑡𝑒𝑠𝑡 values (imagine you get many many samples) is a
standard normal distribution

17

Step 4: prepare a statistical decision


-- Important to remember --
[Figure: the distribution of 𝑍𝑡𝑒𝑠𝑡, assuming H0 is true, is the standard
normal curve (mean = 0, std. dev. = 1); the two green-shaded tail areas
beyond -2 and +2 each have area 0.025.]
All possible 𝑍𝑡𝑒𝑠𝑡 values that you could get (from many many
hypothetical samples) if H0 is true
Consider two situations:
1. If the null hypothesis is true, you are likely to get 𝑍𝑡𝑒𝑠𝑡 values that are close to 0,
say in the white area, and you are unlikely to get 𝑍𝑡𝑒𝑠𝑡 values that are far away
from zero, say in the green areas
2. Therefore, it would be unlikely to get a 𝑍𝑡𝑒𝑠𝑡 value in the green area from your
sample, if indeed the null hypothesis is true.
18

9
Step 4: prepare a statistical decision
• A precise definition of “far” and “close”: where does
my 𝑍𝑡𝑒𝑠𝑡 value fall under the standard normal curve?
Is it in the tails (green area) or not?

• Statisticians compute the P-value (“probability-


value”) to decide this very precisely.

P-value definition: the probability (assuming that the null


hypothesis is true) of observing a 𝑍𝑡𝑒𝑠𝑡 value that is at least as
contradictory to the null hypothesis as the one computed

• Compute the P-value with standard normal tables or


(e.g.) PQRS
19

Step 4: prepare a statistical decision

[PQRS screenshot: enter your 𝑍𝑡𝑒𝑠𝑡 value (1.39); the calculator returns
Pr(Z ≤ 1.39) = 0.9177 and Pr(Z > 1.39) = 0.0823, if H0 is true.]

Sloppy: here, the P-value is the probability away from H0


in the direction of H1
(remember: H0 is at 0 under the curve)
– For a ONE-sided test: P-value = 0.0823
– For a TWO-sided test: P-value = 2 × 0.0823 = 0.1645
20

10
Step 5: make a statistical decision
• You have to make the final decision: reject or not reject the
null hypothesis
– What you do: compare the P-value to the chosen significance
level (in step 2) of the test
• The significance level of the test (denoted by α) is the
relative cut-off for deciding if the observed difference is
sufficiently big to reject the null
– You choose this value α: typically 1%, 5%, or 10%
• Statistical decision rule:
– If the P-value is LESS than α: REJECT the null hypothesis
– If the P-value is LARGER than α: DO NOT REJECT the null
hypothesis
[[ Statistical warning: you never accept a null hypothesis! ]]
21

Step 6: make a business conclusion


• The P-value for the test (one-sided) on slide 12 is 0.0823.
• Suppose we had chosen in step 2 a significance level of
(say) 5% = 0.05
• The null hypothesis is rejected or not rejected?

• Business conclusion (for one-side test): our sample with


average claim size of $73000 does not provide enough
evidence that the average claim amount [[ in the population ]] is
larger than $70000 (𝑍𝑡𝑒𝑠𝑡 = 1.39, P-value = 0.08).
• Hence, based on this data, we would not recommend
computing new policy prices
This has to be in your conclusion!
22

11
Step 6: make a business conclusion
• What if we had chosen 𝛼 = 0.10 in step 2 instead (i.e. we
would tolerate a larger type I error in our decision)?
• The null hypothesis is rejected for 𝛼 = 0.10.
• Conclusion: our sample with average claim size of $73000
provides evidence that the average claim amount [[ in the
population ]] is larger than $70000 and (therefore) increased by
more than 10% (𝑍𝑡𝑒𝑠𝑡 = 1.39, P-value = 0.08). We would
recommend a re-evaluation of the policy prices.

• @home: suppose management wanted to know whether the


average claim amount this time period has increased or
decreased from previous time period (i.e. $63500, see slide
7). What would your conclusion be?
23

In sum: supporting decisions with data


• A decision typically applies to a population, but we often
base it on ONE sample
• Crucial: understand the sampling variability
• When a decision involves a comparison to a standard or
status quo, we often can do an hypothesis test
– Does our sample mean of $73000 in claims warrant a new policy
premium?
– Ignorant way of thinking: yes, because it is more than 10% higher
(exceeds $70000)
– Smart way of thinking: the $73000 comes from a sample; different
samples lead to different results. Is the population mean really
different from (here: larger than) $70000?
• Conduct an hypothesis test
– Make sure to ‘test’ for sampling variability
– What is the P-value?
24

12
Today’s lecture

Part 1: Hypothesis: basics

Part 2: 6 steps of hypothesis testing

Part 3: Some remarks hypothesis testing

25

Hypothesis test for proportions

• The previous example was for quantitative (ratio) scaled


data (hence, we can compute a mean – lecture notes
data science camp)

• For categorical data, we can do something similar if it


concerns a dichotomous variable (y/n, M/F, fraud/no
fraud etc.) and the sample size is large enough (n≥50
and np≥10 and n(1-p)≥10)

• You would apply the same 6 steps as before (slide 10),


with two changes: step 1 and step 3 (next slides)

26

13
Hypothesis test for proportions
• Mini-case credit fraud: last year, 8.5% (=0.085) of
the claims were fraudulent. Management wants to
know whether this year there were more or less
fraudulent claims.

• From our analysis (data science camp part 5), we


found in our sample that 10.5% (=0.105) of the
claims is fraudulent.

• Step 1: what are the null and alternative


hypotheses?
– H0: π = 0.085
– HA: π ≠ 0.085
27

Hypothesis test for proportions

Step 2: same as before; choose a significance
level (typical: α = 5% (= 0.05))

Step 3: compute a test-statistic. For proportions,
use the following formula:

z_test = (p − π) / √( π(1 − π) / n )

       = (0.105 − 0.085) / √( 0.085 × (1 − 0.085) / 4415 )

       = 0.02 / 0.0042 = 4.76
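A minimal Python sketch of this proportion test (scipy assumed available;
not one of the course tools):

    # One-sample Z-test for a proportion (H0: pi = 0.085)
    from math import sqrt
    from scipy.stats import norm

    p_hat, pi0, n = 0.105, 0.085, 4415
    z_test = (p_hat - pi0) / sqrt(pi0 * (1 - pi0) / n)   # about 4.76
    p_two_sided = 2 * (1 - norm.cdf(abs(z_test)))        # essentially 0.00
    print(round(z_test, 2), round(p_two_sided, 6))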

28

14
Hypothesis test for proportions
Step 4: compute the P-value
– Tells us on a common (probability) scale whether the
sample is “close” or “far” from the null hypothesis
– Curve under a standard normal distribution
– E.g. use PQRS
– Did you find: P-value = 2*0.00 = 0.00?
Step 5: reject or not reject the null hypothesis?
– Compare the P-value to the significance level α in step 2
– If the P-value is LESS than α: REJECT the null hypothesis
– If the P-value is LARGER than α: DO NOT REJECT the
null hypothesis
– Statistical conclusion?

29

Hypothesis test for proportions

Step 6: business conclusion


– Tell us in plain English what the outcome of the
hypothesis test is
– E.g. There is quite strong evidence (𝑍𝑡𝑒𝑠𝑡 = 4.76, P-
value = 0.00) that the fraudulent rates this year
(10.5%) are different from last year (8.5%).
– Alternative summary: the fraudulent rate this year
(10.5%) is [[statistically]] significantly different from last
year’s fraud rate (8.5%) (𝑍𝑡𝑒𝑠𝑡 = 4.76, P-value =
0.00).
– Recommendation: there is need to investigate further
why the fraud rates appear to be higher (by about 2%)

30

15
Hypothesis testing – remark 1

Hypothesis testing can also be done directly with


SPSS; we will discuss in the lab session.
– [[ But it would be good to complete these steps by
hand calculations for your own learning ]]

Most tests for means obtained with a computer


package will show the results for a t-test. We have
discussed here the Z-test.
– For large enough samples (𝑛 > 50), the results are
(nearly) identical; for smaller samples t-test is better
– Curve your test-statistic (step 3) under a 𝑡-distribution
with 𝑛 − 1 degrees of freedom (e.g. PQRS)

31

Hypothesis testing – remark 2


• Similar to computing a confidence interval for a mean,
the procedure to test an hypothesis about a mean could
be sensitive to ‘outliers’.
– However, this procedure is rather robust to long tailed data when
sample size is large (as is the case in our example)

– Use the t-distribution to be on the “safe side” (e.g. previous slide)

• Alternatively, the test procedure may be repeated on the


log scale, i.e. first take the logarithm of the variable, then
redo all calculations and compare the results
– Challenging exercise: carry out this test for claim amount

– [[ hint: in SPSS, first transform your data on a log-scale. Then


write down the null hypothesis in log-form, and carry out the 6
steps of hypothesis testing ]]
32

16
Statistical vs. practical significance – remark 3
• Statistical significance based on the rule “P-value < ”
(step 5) is at best a rule-of-thumb and at worst bad practice

• The threshold 𝛼 clues us in to what the type I error might be.


But for practical purposes there is no difference between a
P-value of 0.051 and 0.049.

• You should always report the actual P-value. The decision


maker can then interpret the result in terms of practical
significance rather than statistical significance.

• While low P-values support evidence against the null


hypothesis, it is often a good idea in practice to use them in
conjunction with other statistical inferential approaches,
such as confidence intervals
33

Your statistically significant other..

34

17
Another illustration of type I and II errors
(slide 8)

Next class meeting: additional topics on statistical inference


35

18
Statistics and business analytics

Additional topics on inferential statistics


Beyond the basics: sample sizes and other
sampling challenges

Session 2

Fall 2018, Peter Ebbes

Course announcements

• Quiz data science camp


– Take a look at the feedback statements!
– After a few quizzes opportunity to look into the Qs
you missed

1
Today’s lecture

Part 1: Sample size calculations

Part 2: Sampling challenges

Part 3: Investigating distributions for


categorical variables (chi-square test)

Columbus Ohio bike share


• Project goal: bike share in Columbus Ohio?
• Project kick off late Fall 2011
• Collaboration with Ohio State University,
ConsiderBiking.org and the mayor’s office of Columbus
• Employ a survey:
– Purchase intent scale (5 pnt (interval) scale)
– Industry rule-of-thumb: 80% of “definitely buy” and 30% of
“probably buy” actually end up buying (…)

2
Columbus Ohio bike share
• Project goal: bike share in Columbus Ohio?
• Project kick off late Fall 2011
• Collaboration with Ohio State University,
ConsiderBiking.org and the mayor’s office of Columbus
• Employ a survey:
– Purchase intent scale (5 pnt (interval) scale)
– Industry rule-of-thumb: 80% of “definitely buy” and 30% of
“probably buy” actually end up buying (…)

• Mayor asks: how large should your sample be for us to


make a reliable decision?
• Class discussion: what would you say?

Sample size determination for a single outcome

• When data is expensive (in terms of time or


money), then you need to plan ahead to get just
the data you will need to make a decision.

• This turns out to be pretty easy!

• You need to work ‘backwards’ from the


confidence interval formula (assume 𝑛 > 50)

3
Confidence intervals: recap
Confidence interval size is a function of three things:
X̄ ± Z_confidence × S / √n
‘Margin of error’
– the data
Specifically, the standard deviation
– the confidence level
As the confidence level increases (all else equal), the length
of the confidence interval increases.
– the sample size(s)
To control confidence interval length – choose the sample
size appropriately.
7

Sample size determination for a single outcome


X̄ ± Z_confidence × S / √n
• Step 1: what is the desired confidence level?
– “The probability that the unknown population mean will be
in your interval”
– Most clients will give blank stares – typically we use 95%
– With 95% confidence, Zconfidence = 1.96
• Step 2: what is the smallest difference (above and
below) that has practical importance to you?
– Most clients can give you an answer here
– Call this difference ±𝐵
– Mayor: “quarter point above and below the average” i.e.
𝐵 = 0.25
8

4
Sample size determination for a single outcome

• Step 3: working backwards, the sample size you


need to detect a change of ±𝐵 at the 95%
confidence level is
n = ( Z_confidence × S / B )²

• For our example, 𝐵 = 0.25, 𝑍 = 1.96


• What is 𝑆?
– This is the tricky part: we need to guess what 𝑆 might be
when we haven’t seen any data yet.

How do you know the standard deviation before


you’ve collected data?
• Use historical data
– Sometimes a similar study has been done
before/somewhere else
• Run a pre-test
– Take a small preliminary sample (of size 20, say) just to
get an estimate of the standard deviation
• Guess at what the largest feasible variation could be
– Not recommended, but may be the only feasible option
– Rule-of-thumb: “estimate” the standard deviation as
follows
s ≈ (max possible value − min possible value) / 4

10

5
Sample size determination for a single outcome
• From historical data: a similar study ran on the OSU
campus, we had found an S of 1.55, so
n = ( (1.96 × 1.55) / 0.25 )² = 12.15² = 147.67 ≈ 150

• So, a sample of size 150 would get us a good idea of the


true (unknown) population mean.

• Note that this is for “one” population (single outcome). If


an important research question is to contrast e.g. the
students to the working professionals, we have “two”
populations, so we would try to get 150 out of each (i.e.
a total sample size of n=300)
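A minimal Python sketch of this sample size calculation (plain Python; the
numbers are the ones used above):

    # Sample size to estimate a mean within +/- B at 95% confidence
    from math import ceil

    z_conf, s_guess, b = 1.96, 1.55, 0.25     # s_guess from the earlier OSU study
    n = ceil((z_conf * s_guess / b) ** 2)     # 148; rounded up to about 150 above
    print(n)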

11

Sample size calculation for a proportion


(single outcome)
• Previous sample size calculation was for
quantitative data
• Same principle for categorical data (proportions)
• Critical question bike share:

“Would you recommend bike share to a friend?”


Yes or No

• If we take this question, how large should our


sample be if we allow for an error margin of at most
5% (up and down)?
12

6
Sample size calculation for a proportion
(single outcome)
• Step 1: what is the desired confidence level?
– Take 95% which gives you a 𝑍𝑐𝑜𝑛𝑓 = 1.96
• Step 2: what is the smallest difference (above and
below) that has practical importance to you?
– Here we took 5% up and down, so 𝐵 = 0.05
• Step 3: working backwards from the confidence
interval formula for proportions [[ see lecture notes data
science camp day 2, part 5 ]], use the following formula

n = Z_conf² × p_guess × (1 − p_guess) / B²
13

Sample size calculation for a proportion


(single outcome)
• Step 4: guess pguess
– Question: what number for pguess gives you
maximum n?
– Use that number to calculate a conservative
sample size
– Did you find…

 
n = (1.96² × 0.5 × 0.5) / 0.05² = 0.9604 / 0.0025 = 384.16 ≈ 385
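Again a minimal Python sketch (plain Python), using the conservative guess
p_guess = 0.5, which maximizes p(1 − p) and hence n:

    # Conservative sample size to estimate a proportion within +/- 5% at 95% confidence
    from math import ceil

    z_conf, p_guess, b = 1.96, 0.5, 0.05
    n = ceil(z_conf**2 * p_guess * (1 - p_guess) / b**2)
    print(n)    # 385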

14

7
Today’s lecture

Part 1: Sample size calculations

Part 2: Sampling challenges

Part 3: Investigating distributions for


categorical variables (chi-square test)

15

Caution!!

Any formula for inference is correct only


under specific circumstances . . .

• Data must be from a simple random sample (SRS)


– Or can be plausibly thought of as an SRS

– There is no method for inference from haphazardly


collected data

Fancy formulas cannot rescue badly produced data

• Hence, always carefully and critically consider the


steps of data collection
16

8
Data challenges

• Ideally: we want to get a SRS. Sometimes that is


easy, e.g.
– A financial firm has a large clientele with personalized
investment portfolios. Randomly select 150 and
investigate their performance in detail
– A semiconductor manufacturer randomly selects 1%
of products for quality control

• Much harder in other cases, particularly when


people are involved.
– Challenges?

17

Today’s lecture

Part 1: Sample size calculations

Part 2: Sampling challenges

Part 3: Investigating distributions for


categorical variables (chi-square test)

18

9
Check your sampling as part of basic
statistical work
• Bike share on the Ohio State University campus
– A completely separate part of the project involved
sampling students on the OSU campus
– This data was analyzed separately from the
(previous) downtown study
– The sampling was done by students in my class who
sampled ‘on campus’ – not ideal (convenience
sample)!
• When the sample is in, it is good practice to check
some basic demographic variables and compare
those with population demographics (to the extent
these are known, of course)
19

Demographics OSU survey


• Variable gender (nominal) – analyze with…

• From the administration office, we were told across


all (50K+) students, the gender is split 40-60.
• Would you say this sample is representative based
on gender?

20

10
Testing a hypothesis about population proportionS:
6 steps
1. Formulate the null and alternative hypotheses
2. Choose the significance level
3. Compute the test-statistic
4. Prepare a statistical decision (P-value)
5. Make a statistical decision: reject or not reject the null
hypothesis
6. Make a managerial decision/interpretation: interpret
the statistical decision in ‘plain’ English

21

Step 1: the null and alternative hypotheses


• You have to formulate the null and alternative
hypotheses
• For categorical (nominal and ordinal) data: these are
stated in terms of population proportions!
• For the ‘gender’ variable (in words):
– H0: the proportion of men and women in the population
of OSU students is 0.40 and 0.60, respectively
– Ha: the proportions of [[…]] are NOT 40/60
• In statistical language (population parameters):
– H0:

– Ha:

22

11
Step 2: choose the significance level
Significance level is denoted by α. It indicates how certain
we are in our decision. You choose this number (typically
α = 0.10, 0.05, or 0.01), and use this in step 5.

For many managerial problems, α = 0.05 is chosen.

23

Step 3: compute a test statistic


• A test-statistic measures how close the sample has come to
the null hypothesis
• A well-thought-out test statistic (statisticians figure this out)
follows a well-known distribution such as the normal, t-, or
chi-square distribution
• For testing about proportions from a frequency table, you
use the Chi-square test-statistic:
χ² ("Chi-square") = (O1 − E1)²/E1 + (O2 − E2)²/E2 + ...

where Oi = observed count for cell i, and
      Ei = expected count for cell i when H0 is true
• Interpretation?

24

12
Step 3: compute a test statistic
Four steps to compute a chi-square statistic
1. Write down the formula and the symbols
   χ² = (O1 − E1)²/E1 + (O2 − E2)²/E2 + ...

   where Oi = observed count for cell i, and
         Ei = expected count for cell i when H0 is true

2. Obtain the values for Oi from your frequency table


O1 = 176 O2 = 220
3. Obtain the values for Ei from your null hypothesis
Warning: this is the tricky part!!
E1 = 0.4 * 396 = 158.4
E2 = 0.6 * 396 = 237.6
4. Fill in the formula above
25

Step 3: compute a test statistic


Filling in the formula – did you find:

χ² = (O1 − E1)²/E1 + (O2 − E2)²/E2
   = (176 − 158.4)²/158.4 + (220 − 237.6)²/237.6
   = 17.6²/158.4 + 17.6²/237.6
   = 1.96 + 1.30 = 3.26
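A minimal Python sketch of this goodness-of-fit test (scipy assumed
available; not one of the course tools):

    # Chi-square goodness-of-fit test for the 40/60 gender split (n = 396)
    from scipy.stats import chisquare

    observed = [176, 220]                       # counts in the OSU sample
    expected = [0.4 * 396, 0.6 * 396]           # 158.4 and 237.6 under H0
    chi2, p_value = chisquare(observed, f_exp=expected)
    print(round(chi2, 2), round(p_value, 2))    # about 3.26 and 0.07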

26

13
Step 4: prepare a statistical decision
• When the χ² value you computed is:
– small (‘close to zero’) → your data is close to the null
hypothesis
– large → your data is ‘far away’ from the null hypothesis

• The question you need to answer in step 4:


“When is the χ² value ‘too large’ so that I do not believe
anymore that H0 is true?”

• Statistical result: if the null hypothesis is true, then the


frequency distribution of all possible χ² values (imagine you
get many many samples) is a chi-square distribution

27

Step 4: prepare a statistical decision


• Just as the normal distribution (for the Z-test concerning
population means) is characterized by a mean (=0) and
a standard deviation (=1), the chi-square distribution is
characterized by the ‘degrees of freedom’ (d.f.)

• For this hypothesis test, d.f. = #categories – 1 = 2 – 1 = 1

• Use PQRS to curve your computed chi-square value to


see whether it is in the tails or not – i.e. compute the P-
value

P-value definition: the probability (assuming that the null hypothesis is


true) of observing a χ² test value that is at least as contradictory to
the null hypothesis as the one actually computed

28

14
Step 4: prepare a statistical decision

Enter your 2
value in this
box

Pr(2 ≤ 3.26) = Pr(2 > 3.26) =


0.93 0.07
if H0 is true if H0 is true
For the 2 test, the P-value is the most right one: P-value = 0.07.
– Should we double it?

Web: http://homepage.divms.uiowa.edu/~mbognar/applets/chisq.html
HEC: http://rstudio-test.hec.fr/probcalc/
29

Step 4: prepare a statistical decision


CAUTION
Using the chi-square distribution to curve the chi-
square test value that you computed requires
that the expected counts are large enough

To compute a valid P-value:

At most 20% of the Ei ’s computed in step 3


can be less than 5

30

15
Step 5: make a statistical decision
• You have to make the final decision: reject or not reject the
null hypothesis
– What you do: compare the P-value to the chosen significance
level (in step 2) of the test
• The significance level of the test (denoted by α) is the
relative cut-off for deciding if the observed difference is
sufficiently big
– You choose this value α: typically 1%, 5%, or 10%
• Statistical decision rule:
– If the P-value is LESS than α: REJECT the null hypothesis
– If the P-value is LARGER than α: DO NOT REJECT the null
hypothesis
• Statistical warning: you never accept a null hypothesis!

31

Step 6: make a business conclusion

• The P-value for the test on slide 29 is 0.07.


• We chose in step 2 a significance level of 5% = 0.05
• Hence, the null hypothesis is… not rejected!

• Business conclusion: our sample (table slide 20) does


not provide evidence that it deviates significantly from the
population with respect to gender (χ² = 3.26, P-value = 0.07).
Hence, we can argue that it is representative with
respect to the variable gender.

32

16
Try it yourself!
A second key demographic variable in the bike share study
is class standing. This is an ordinal variable. Hence, we
could analyze it with a frequency table. From the OSU
administration office, we know that the relative numbers of
students who are Freshmen, Sophomores, Juniors, and
Seniors are equal in the population. Would you say our
sample is representative?

You should find: not representative. Does it matter?

33

What to do if my sample is not


representative?
• Clearly, something went wrong in data collection
• Always carefully check your data sources
• Some solutions:
– Discard the data and get new (better!) data
– Obtain more data (e.g. more freshmen)
– Analyze at a disaggregate (segment) level
– Weight your observations (tricky, but doable; see
appendix)
• Be open and frank about it in your research!
Fancy stats cannot rescue badly produced data
34

17
What’s up next: SPSS lab 1 & quiz 1
• SPSS lab 1 –
– Time and location: same as regular class meetings
– It will be ‘hands-on’; my TA (Alican) and I will be there to help out
– Prepare the lab (e.g. with your team)
• Review lecture notes of sessions 1&2, data science camp
• ‘How to in SPSS’ (pdfs on Blackboard)
• Case: American Express (Data Science camp)
– You will be asked to hand in a brief assignment after the
lab (counts towards final grade); may be done in pairs
• Quiz 1 –
– Same idea as for data science camp
– Open: 1 day before SPSS lab
– Close: 2 days after SPSS lab

35

Appendix

Illustration of how weighting sample data can help


in some cases when the sample is not
representative (slide 34)

36

18
What to do if my sample is not representative?
Example: a sample of 1000 US voters, included 500
African Americans (AA) and 500 non-African
Americans (NAA).

• Poll result: 60% said they would vote Democratic


• Poll conclusion: democrats will win election

• Election result: 45% voted Democratic


• Election conclusion: democrats lost the election to the
republicans

• Hence, poll (= sample) was horribly wrong. Why?

37

What to do if my sample is not representative?


Example: a sample of 1000 US voters, included 500
African Americans (AA) and 500 non-African
Americans (NAA).
Analysis poll – taking a closer look at the polling data:
– AAs were over-represented in the sample and at the same
time strongly democratic (about 80% democrat)
– NAAs were under-represented in the sample and not favoring
democratic (about 40% democrat)
– Taking the simple average in the sample, we get
0.5 × 80 + 0.5 × 40 = 60% (the polling outcome)
– If we know (e.g. from the US census) that 18% of the
population are AAs and 82% are NAAs, what would be a
better overall average?
38

19
What to do if my sample is not representative?
Example: a sample of 1000 US voters, included 500
African Americans (AA) and 500 non-African
Americans (NAA).
– If we know (e.g. US census) that 18% of the population are
AAs and 82% are NAAs, what would be a better overall
average?
– A better average would be the weighted average:
0.18 × 80 + 0.82 × 40 = 47.2%
– We “down-weighted” the AAs data in computing the new
sample average
– This weighted average is much closer to the truth in the
population: 45% (the truth is the outcome of the election,
assuming everybody voted, or that those who voted are
representative of the whole population)
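A minimal Python sketch of this re-weighting (plain Python; the numbers are
those from the example):

    # Re-weight the group results by the known population shares
    share_dem = {"AA": 0.80, "NAA": 0.40}        # % voting Democratic per group (sample)
    pop_share = {"AA": 0.18, "NAA": 0.82}        # group shares in the population (census)

    naive = 0.5 * share_dem["AA"] + 0.5 * share_dem["NAA"]            # 0.60 (the poll)
    weighted = sum(pop_share[g] * share_dem[g] for g in share_dem)    # about 0.472
    print(naive, round(weighted, 3))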
39

20
Statistics and business analytics

Bivariate statistics (comparing means)

Session 3

Fall 2018, Peter Ebbes

Today’s lecture

Part 1: Bivariate statistics

Part 2: Comparing two means (Z/t-test)

Part 3: Comparing multiple means (ANOVA)

1
Bivariate statistical analysis
• For decision making often quite important and is a
stepping stone to multivariate statistics (e.g.
regressions)
• Examples:
– Marketing – did ad A or B generate more
clickthroughs?
– Supply chain – does the temperature affect the sales
of cola?
– Human resources – did men and women had an
equal chance of being promoted in the past year?
– Banking – are homeowners that are single more likely
to default than married homeowners?

What are the variables and scale levels?


3

Recap of techniques for potential relationships


among two variables [[ data science camp ]]
This is called bivariate statistical analyses – given the goal
of the analyses:

[[ Complements the table for univariate statistics – SPSS lab 1 ]]

Today and next class: we will further work on completing this table
4

2
Case insurance claims
• Same case (data) as previous sessions 1&2
• Sample of insurance claims from a large insurer
• Today’s class: can we use demographic information
to help price insurance policies?

o Idea: given our data, investigate whether claim amounts


are different (or not) across demographic groups

o Statistics: gets a bit more complicated,


because we now relate ‘claim_amount’ to
each of the demographic variables (e.g.
‘gender’, ‘edcat’, ‘retire’ etc.)

Today’s lecture

Part 1: Bivariate statistics

Part 2: Comparing two means (Z/t-test)

Part 3: Comparing multiple means (ANOVA)

3
Comparing two means: means plot

(mean claim amount 73.01 across all policies;
data science camp day 2 slide 32)

What can we learn from this graph?


7

Are claim amounts from retirees, on average, higher


or lower than claim amounts from non-retirees?
Here we analyzed two variables jointly
– Variable ‘retire’ (categorical)

– Variable ‘claim_amount’ (quantitative)

When one variable is categorical, and the other is quantitative,


we could compare the means of the quantitative variable for
each of the categories of the categorical variable
– Ignorant manager: average claim amounts from claims filed by non-
retirees are higher than average claim amounts from claims filed by
retirees!

– Smart statistician: could the observed difference in average claim


amount be just random sampling error, or could there be a real
difference in average claim amount in the population?

4
Side-by-side box plots helps us see the
within-group variation

Large within-group variation; Small within-group variation


the observed differences (but same centers); the
among the centers may be just observed differences among
sampling variation the centers are more likely to
be significant
9

Testing a hypothesis about a population mean: 6 steps

1. Formulate the null and alternative hypotheses


2. Choose the significance level
3. Compute the test-statistic
4. Prepare a statistical decision (P-value)
5. Make a statistical decision: reject or not reject the null
hypothesis
6. Make a managerial decision/interpretation: interpret
the statistical decision in ‘plain’ English

10

5
Step 1: Formulate the statistical hypotheses
Step 1: You formulate TWO hypotheses:
• The null hypothesis H0
– For bivariate statistics: stated in terms of no difference or no
relation
– Formal: the two variables are independent
– Example: there is no difference in average claim amounts of
retirees and non-retirees
• The alternative hypothesis H1 or HA
– For bivariate statistics: it states that there is a difference or a
relation

• Question: how are the null and alternative hypotheses written


in statistical terms for the example on slide 7?

• Important to remember: the null hypothesis always refers to


population parameters

11

Step 2: choose the significance level

Significance level is denoted by . It indicates how certain


we are in our decision. You choose this number (typically
=0.10, 0.05, or 0.01), and use this in step 5.

For many managerial problems, =0.05 is chosen.

12

6
Step 3: compute a test statistic
• A test-statistic measures how close the sample has come to
the null hypothesis
• A well-thought-out test statistic (statisticians figure this out)
follows a well-known distribution such as the normal, t-, or
chi-square distribution
• For testing the difference between two population means,
use the following formula:

Z_difference = (X̄1 − X̄2) / s_X̄     where   s_X̄ = √( S1²/n1 + S2²/n2 )

• See next slide for example…

13

Step 3: compute a test statistic

Compute Zdifference using the following SPSS table

Z_difference = (X̄1 − X̄2) / s_X̄     where   s_X̄ = √( S1²/n1 + S2²/n2 )

Interpretation?
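A minimal Python sketch of the same two-group comparison (scipy assumed
available; the group means, standard deviations and group sizes below are
placeholders, not the numbers from the SPSS table):

    # Two-sample test for a difference in means (unequal variances, Welch)
    from scipy.stats import ttest_ind_from_stats

    m1, s1, n1 = 55.0, 120.0, 1200     # placeholder: retirees
    m2, s2, n2 = 80.0, 150.0, 3215     # placeholder: non-retirees
    t_stat, p_value = ttest_ind_from_stats(m1, s1, n1, m2, s2, n2, equal_var=False)
    print(round(t_stat, 2), round(p_value, 4))   # for large n this is close to the Z-test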
14

7
Step 4: prepare a statistical decision

• The “distance” between the null hypothesis and the sample


is Zdifference = -11.64
– Is this far or close?
– Answer: quantify with the P-value

• Similar to the Z-test for one mean (session 1), we use the
standard normal distribution to curve the computed Zdifference
value
– E.g. use PQRS or other probability calculators

15

Step 4: prepare a statistical decision

• What is the P-value here?


• Interpretation?

16

8
Step 5: make a statistical decision
• You have to make the final decision: reject or not reject
the null hypothesis
– What you do: compare the P-value to the chosen
significance level (in step 2) of the test

• Statistical decision rule:


– If the P-value is LESS than : REJECT the null hypothesis
– If the P-value is LARGER than : DO NOT REJECT the
null hypothesis

• The P-value = 0.00 which is LESS than  = 0.05:


You REJECT the null hypothesis

17

Step 6: make a business conclusion

• Our results show that the difference in average claim amount


of claims from retirees and non-retirees is significant (Z =
-11.64, P-value ≈ 0.00) [[ A significant difference means: the
data supports that there is a difference in the population. ]]
• The sample suggests that retirees have, on average, lower
claim amounts than non-retirees. A decision to consider more
favorable premiums for retirees could therefore be supported
by data.
18

9
@Home practice

Often, home insurance policies are obtained for properties that are


not the primary residence. Is there a difference in average
claim amounts for those that claimed for a primary
residence versus those that did not claim for a primary
residence?

You should find: there is not enough evidence in the data to


argue that there is a difference [[ in the population ]]

19

Comparing two means: remark 1


(some other remarks in the appendix)

Means plot
Are claim amounts from customers with different education levels,
on average, the same, or not? How should we proceed to address
this question?
20

10
Today’s lecture

Part 1: Bivariate statistics

Part 2: Comparing two means (Z/t-test)

Part 3: Comparing multiple means (ANOVA)

21

ANalysis Of VAriance

• ANOVAs can be used to answer the question “Do all


groups have the same population mean”?
– One variable is quantitative
– One variable is categorical (more than 2 categories)

• We cannot answer this question just from the means


plot (slide 20) alone; we need to consider the
difference among the sample means along with the
differences within each group

22

11
Side-by-side box plots helps us see
the within-group variation

Large within-group variation; Small within-group variation


the observed differences (but same centers); the
among the centers may be just observed differences among
sampling variation the centers are more likely to
be significant
23

Testing a hypothesis about a population mean: 6 steps

1. Formulate the null and alternative hypotheses


2. Choose the significance level
3. Compute the test-statistic
4. Prepare a statistical decision (P-value)
5. Make a statistical decision: reject or not reject the null
hypothesis
6. Make a managerial decision/interpretation: interpret
the statistical decision in ‘plain’ English

24

12
Using ANOVA to test equality of population means

• Does education relate to claim amount? (slide 20)


• Step 1: formulate the null hypothesis
H0:
• Step 2: choose significance level (=0.05)
• Step 3: compute test-statistic
– For ANOVA, this is the F-statistic
– Very hard to compute by hand – use SPSS

25

Using ANOVA to test equality of population means


• Step 3 (cont.): F-statistic = 1.78
• Step 4: prepare a statistical decision
– Get the P-value – curve the test statistic under the
appropriate distribution
– For an F-statistic, use the F-distribution
– Fortunately, SPSS does it for us (‘Sig.’ in previous table)
– P-value = 0.13
• Step 5: make a statistical decision
– The null hypothesis is REJECTED / NOT REJECTED
• Step 6: what is the business conclusion
– Even though the mean plot (sample) on slide 20 suggests
that education informs us about the average claim amounts,
there is not enough evidence to support this in the population
(F-statistic = 1.78, P-value = 0.13)
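A minimal Python sketch of a one-way ANOVA (scipy assumed available; the
lists of claim amounts below are placeholders, since the education-group
data sits in the SPSS file):

    # One-way ANOVA: do the groups share the same mean claim amount?
    from scipy.stats import f_oneway

    group_a = [52.0, 61.5, 70.2, 48.9, 66.3]   # placeholder group 1
    group_b = [75.1, 58.4, 90.0, 62.2, 71.7]   # placeholder group 2
    group_c = [80.5, 95.2, 60.1, 77.8, 69.4]   # placeholder group 3
    f_stat, p_value = f_oneway(group_a, group_b, group_c)
    print(round(f_stat, 2), round(p_value, 3))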
26

13
Using ANOVAs
CAUTION!
More so than any of the techniques we have learned so far,
ANOVA requires us to be more careful about examining
underlying data assumptions
1. Sample should be a random sample (or at least arguably
so)
2. Data should be approximately normally distributed within
each group
3. The variances in the different groups should be
approximately equal

Therefore, ANOVAs should always be accompanied by


graphical and numerical summaries (next two slides)

27

Using ANOVAs Check assumption 2


CAUTION! (previous slide)

Do you feel comfortable that the ANOVA assumptions are correct?


28

14
Using ANOVAs Check assumption 3
CAUTION! (slide 27)

• The variances are fairly similar for the five groups (above
table), however, it is hard to argue that within each group the
data is approximately normal distributed (previous slide)
• Therefore, we should resist the temptation of interpreting and
using the previous analyses (slides 25&26).
• Instead, we could consider re-doing the analysis with the
logarithmic transformation! (why?)
29

Re-do the ANOVA with the log transform

Try at home! You should be able to produce the


following table (and graphs on next slides):

Conclusion?
Are the assumptions after the log transform valid?

30

15
Check ANOVAs assumptions

After log transform, data has an (approximate) normal distribution


and equal variance within groups (compare to slide 28)
31

Sample means log claim amount

The observed difference in means is statistically significant (slide 30)

32

16
ANOVA: note on interpretation
• Rejecting the null hypothesis of equal means (e.g. slide
30), does not mean that all of the means are different!

– We have at least one inequality (alternative hypothesis)

• This can be tested through the multiple comparisons


procedure [[ this step is not necessary though; depends
on research question ]]

– Can only be used after rejecting the ANOVA H0 (slide 30)!

– Which pairs of population means differ significantly?

– How to do?

33

ANOVA: multiple comparisons procedure


(snapshot of table)

The P-values for the tests H0: µi = µj are listed in the column ‘Sig.’
(Sloppy: you do the test on slides 10—18 here 5 × 4 = 20 times)
Practice managerial interpretation
34

17
Today’s class in sum
• Statistical inference for bivariate statistical analysis
(analyzing two variables jointly)
• Particularly today: one quantitative variable and one
categorical variable
– Compare two means (t-test)
– Compare more than two means (ANOVA + multiple comparisons)
– Graphically: means plot; side-by-side box plots
• Application: insurance claims
– Are insurance claims from certain demographic groups, on
average, higher, lower, or about the same?
– Claims of retirees are, on average, lower than claims from non-
retirees; claims from clients with the least education are, on
average, lower than from clients with the most education
– These analyses provide a starting point for building pricing
models for segments
35

Appendix

Some additional remarks regarding the t-test for


two means

36

18
Comparing two means: remark A1

• Similar as before, we computed a Z-test (slides


13, 14) but this test is similar to a t-test that
computer packages (e.g. SPSS) typically give you

• For sample sizes that are at least 50 or more,


these two give similar results

• For smaller samples, t-test is better (and verify


that the variable has an approximate normal
distribution; e.g. side-by-side box plots)

37

Comparing two means: remark A2

• In the previous example, we compared means for


two independent groups of data
– The observations in the one group (‘retirees’) were
sampled independently from the observations in the
other group (‘non-retirees’)
– In other words, the two means are computed from a
different and unrelated set of observations

• Some statistical questions ask us to compare two


means that are each computed over the same set of
observations. We need to use a different t-test
formula (i.e. not the one on slide 13, 14)

38

19
Comparing two means: remark A2
• For instance, suppose for this sample of 4415 individuals, we
had also measured the claim amounts from two years ago.
• How would this variable show up in SPSS? E.g. consider the
following hypothetical example:

39

Comparing two means: remark A2


• For instance, suppose for this sample of 4415 individuals, we
had also measured the claim amounts from two years ago.
• How would this variable show up in SPSS?

• Research question: did average claim amounts change from


two years ago or not?

– This year: mean = 73.01, std.dev. = 144.40, n=4415

– [[ suppose ]] Two years ago, i.e. the average and standard


deviation of the column ‘claim_amount_2yrs’ (previous slide) is:

mean = 65.30, std. dev. = 167.25, n=4415

– We cannot use formula on slide 13, because the two means are
computed over the same set of observations

40

20
Comparing two means: remark A2
• Idea: create a new variable (see column ‘diff’ on slide 39)

diff = claim_thisyear – claim_twoyearsago

• Then: compute sample mean and standard deviation for


‘diff’ and apply t-test of session 1 to this new variable

• That is, test H0: µdiff = 0 (or any other number) with

    t_diff = ( X̄_diff − 0 ) / ( S_diff / √n )

• Curve under a standard normal distribution to get P-
value
[[ Discussion book pp392—402 (11th); pp455—459 (12th); pp447—451 (13th);
in SPSS paired samples t-test ]]
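A minimal Python sketch of this paired comparison (scipy assumed available;
the two short lists are placeholders for the columns sketched on slide 39):

    # Paired samples t-test: did average claim amounts change from two years ago?
    from scipy.stats import ttest_rel

    claims_this_year = [73.0, 55.2, 80.1, 64.7, 90.3]       # placeholder values
    claims_two_years_ago = [65.3, 60.0, 70.4, 58.2, 85.1]   # placeholder values
    t_stat, p_value = ttest_rel(claims_this_year, claims_two_years_ago)
    print(round(t_stat, 2), round(p_value, 3))
    # equivalent to a one-sample t-test on the difference column 'diff'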
41

21
Statistics and business analytics

Bivariate statistics (cross-tabs)

Session 4
“Lots of numbers”
lecture!

Fall 2018, Peter Ebbes

Course announcements
• Quiz 1 results
• Quiz “walk in” office hours (data science camp quiz +
quiz 1)
– Review the feedback given in your result summary!

– When: Tue Oct 16th


• 1700-1730hrs ES3

• 1730-1800hrs ES2

• 1800-1830hrs ES1

– Where: building W1 3rd floor (316 and 317)

• Quiz 2: will open one day before SPSS lab 2 on Mon Oct
15 and close two days after the lab on Thu Oct 18
2

1
Today’s lecture

Part 1: Bivariate statistics

Part 2: Inference for a cross tab

Part 3: Graphics do’s and don’ts

Part 4: Towards business analytics

Previous class

• We started analyzing whether demographic segments


have a tendency to claim more/less on their policy

– This is bivariate statistics (claim_amount combined with


several demographic variables)

• Bivariate statistical analyses are very popular for


decision making, and a stepping stone for further
(advanced) stats

• Usually not the start of a stats project – first become


friends with your data (descriptive, univariate statistics!)

2
Bivariate stats techniques
The purpose of the analysis and the scale level of the
variables help us decide what statistical technique to use

[[ If one variable is “time” we do time series analysis ]]


5

ANOVA: test equality of more than two means


• Bivariate statistics – one categorical variable and one
quantitative variable (see previous slide; session 3)
– Compare the means of the quantitative variable across the
categories of the categorical variable
– Example: average claim amount for customers with pre college,
college, post college education
• ANOVAs are easy to run in any computer package, but you
need to check the underlying assumptions (session 3 slide
27)
– Before running an ANOVA, first examine side-by-side box plots
– If data looks approximately normally distributed (with equal
variances), go for it
– Otherwise, try log transform first, and re-compute the box plots to
see whether things look acceptable after transformation
6

3
Hypothetical ANOVA example 1

Data approximately normally distributed with equal variances.


OK for ANOVA, no need to take the log.
7

Hypothetical ANOVA example 2

In this 2nd example, data is NOT normally distributed. We should NOT


run ANOVA on original variable claim amount, but should log it.
8

4
Hypothetical ANOVA example 2 (cont.)

After taking the log, data is approximately normally distributed


with equal variances. OK to run ANOVA on LN(claim amount).
9

Today’s lecture

Part 1: Bivariate statistics

Part 2: Inference for a cross tab

Part 3: Graphics do’s and don’ts

Part 4: Towards business analytics

10

5
Insurance claims case

What can we say about claims that are fraudulent


versus claims that are not? Are fraudulent claims more
likely to be of a certain type?

– Are men more or less likely to file a fraudulent claim


than women?

– Are claims that render the home uninhabitable more


or less likely to be fraudulent than claims that did not
leave the home uninhabitable?

– How about town size? Are fraudulent claims more


likely to happen in smaller or larger towns?

Variable(s)? Scale levels?


11

Cross tab fraud versus type

Table useful?

12

6
Cross tab fraud versus type

What can we conclude?

13

Testing a hypothesis about a cross tab: 6 steps

1. Formulate the null and alternative hypotheses


2. Choose the significance level
3. Compute the test-statistic
4. Prepare a statistical decision (P-value)
5. Make a statistical decision: reject or not reject the null
hypothesis
6. Make a managerial decision/interpretation: interpret
the statistical decision in ‘plain’ English

14

7
Step 1: Formulate the statistical hypotheses
Step 1: You formulate TWO hypotheses:
• The null hypothesis H0
– For bivariate statistics: stated in terms of no difference or no
relation
– Formal: the variables ‘fraudulent’ and ‘claim_type’ are
independent
– Here: there is no difference in likelihood (~probability) for a claim
to be fraudulent across the different claim types
• The alternative hypothesis H1 or HA
– For bivariate statistics: it states that there is a difference or there
is a relation between the variables

• Note: there is no convenient way to write H0 in statistical


notation for cross-tabs – so we don’t do that
• Important to remember: the null hypothesis always refers to
the population

15

Step 2: choose the significance level

Significance level is denoted by α. It indicates how certain


we are in our decision. You choose this number (typically
α = 0.10, 0.05, or 0.01), and use this in step 5.

For many managerial problems, α = 0.05 is chosen.

16

8
Step 3: compute a test statistic

• A test-statistic measures how close the sample has come to


the null hypothesis
• A well-thought-out test statistic (statisticians figure this out)
follows a well-known distribution such as the normal, t-, or
chi-square distribution
• For testing in a cross tab, you use the chi-square test-
statistic:

χ² = (O1 − E1)²/E1 + (O2 − E2)²/E2 + ...

where Oi = observed count for cell i, and
      Ei = expected count for cell i when H0 is true
• Interpretation?

17

Step 3: compute a test statistic


Three steps to compute a chi-square statistic:
I. Obtain the values for Oi from your cross table (slide 12 or 13)

                 Fraudulent: No     Fraudulent: Yes     Total
    Wind         O1 = 963           O2 = 91             1054
    Water        O3 = 577           O4 = 50              627
    Fire         O5 = 919           O6 = 120            1039
    Contam       O7 = 378           O8 = 26              404
    Theft        O9 = 1115          O10 = 176           1291
    Total             3952               463            4415


 

18

9
Step 3: computing Ei (previous slide)
II. Obtain the Ei’s – warning: this is tricky
    Use formula: Ei = (row total × column total) / (total sample size)

                 Fraudulent: No                    Fraudulent: Yes                  Total
    Wind         E1 = 1054×3952/4415 = 943.5       E2 = 1054×463/4415 = 110.5       1054
    Water        E3 = 627×3952/4415 = 561.2        E4 = 627×463/4415 = 65.8          627
    Fire         E5 = 1039×3952/4415 = 930.0       E6 = 1039×463/4415 = 109.0       1039
    Contam       E7 = 404×3952/4415 = 361.6        E8 = 404×463/4415 = 42.4          404
    Theft        E9 = 1291×3952/4415 = 1155.6      E10 = 1291×463/4415 = 135.4      1291
    Total             3952                              463                         4415


 

19

Step 3: compute a test statistic


Step III: Fill out the formula on slide 17

χ² = (963 − 943.5)²/943.5 + (91 − 110.5)²/110.5 + (577 − 561.2)²/561.2
   + (50 − 65.8)²/65.8 + (919 − 930)²/930 + (120 − 109)²/109
   + (378 − 361.6)²/361.6 + (26 − 42.4)²/42.4
   + (1115 − 1155.6)²/1155.6 + (176 − 135.4)²/135.4
   = 0.403 + 3.441 + 0.445 + 3.794 + 0.130 + 1.110
   + 0.744 + 6.343 + 1.426 + 12.174 = 30.01
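A minimal Python sketch of the same cross-tab test (scipy assumed available;
the counts are the observed counts O1–O10 above):

    # Chi-square test of independence: claim type vs. fraudulent (yes/no)
    from scipy.stats import chi2_contingency

    observed = [[963, 91],     # Wind
                [577, 50],     # Water
                [919, 120],    # Fire
                [378, 26],     # Contamination
                [1115, 176]]   # Theft
    chi2, p_value, dof, expected = chi2_contingency(observed)
    print(round(chi2, 2), round(p_value, 6), dof)   # about 30.01, essentially 0.00, df = 4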

20

10
Step 4: prepare a statistical decision

• When the χ² value you computed is:


– small (‘close to zero’) → your data is close to the null
hypothesis
– large → your data is ‘far away’ from the null hypothesis

• The question you need to answer in step 4:


“When is the χ² value ‘too large’ so that I do not believe
anymore that H0 is true?”

• Statistical result: if the null hypothesis is true, then the


frequency distribution of all possible χ² values (imagine you
get many many samples) is a chi-square distribution

21

Step 4: prepare a statistical decision


Curve the found value for chi-square (30.01) to get a precise
definition of how far / close the sample is from the null
– Compute the P-value from a chi-square distribution with degrees of
freedom = (# rows – 1) × (# columns – 1)
– Here: d.f. = (5-1) × (2-1) = 4×1=4
Use e.g. PQRS or online applets, e.g.
Web: http://homepage.divms.uiowa.edu/~mbognar/applets/chisq.html
Experimental: http://rstudio-test.hec.fr/probcalc/

22

11
Step 4: prepare a statistical decision
CAUTION

Using the chi-square distribution to curve your chi-


square test value requires that the expected
counts are large enough

To compute a valid P-value:

At most 20% of the Ei ’s computed in step 3


can be less than 5

23

Step 5: make a statistical decision

• You have to make the final decision: reject or not


reject the null hypothesis
– What you do: compare the P-value to the chosen
significance level (in step 2) of the test
• Statistical decision rule:
– If the P-value is LESS than α: REJECT the null hypothesis
– If the P-value is LARGER than α: DO NOT REJECT the null
hypothesis

• P-value = 0.00 and will therefore be LESS than α =


0.05. Hence, REJECT the null hypothesis

24

12
Step 6: make a business conclusion

If the null hypothesis is rejected in step 5, then you


formulate the business conclusion by interpreting the
pattern of the relationship from the cross table (use
percentages) (table from slide 13).

25

Step 6: make a business conclusion

Our analysis suggests that (in the population) there is a


significant difference in likelihood for the claim types to be
fraudulent (2= 30.01, P-value = 0.00), and claims of some
types are more likely to be fraudulent than others.

Specifically, our sample suggests that claims from theft


and fire are about twice as likely (13.6% and 11.5%) to be
fraudulent than claims from contamination (6.4%). About
8% of the wind and water claims are fraudulent.

We recommend that fraud inspectors spend relatively


more time investigating claims from theft and fire…
(etc. depending on case context)

26

13
Try it yourself!
How about size of the town? Are fraudulent claims more
likely to happen in smaller or larger cities? You should
develop an hypothesis test to investigate this research
question; use α = 0.10.

You should find that the null hypothesis is not rejected (but
borderline). Size of town does not help us assign fraud
inspectors, for instance.
27

Cross tabs + chi-square test in sum

Very useful and powerful technique to analyze the


relationship between two categorical variables

– A cross tab “describes” their relationship (descriptive


statistics)

– A chi-square test “generalizes” the finding to the


population (inferential statistics)

– We could “graph” their relationship using clustered or


segmented bar charts (data science camp day 1)

28

14
Learning about fraud

• A series of cross-tab analyses help in understanding fraud

– Claims from theft and fire are more likely to be fraudulent


than other claim types (slide 26)

– But claims filed in smaller or larger towns/cities are


fraudulent about equally likely (@home exercise)

– [[ We should complete the analyses: how about marital


status? Education? Retire? Residence type? Etc. – it’s like
painting a picture ]]

• These insights can be used “as-is”, or may be seen as


stepping stone to a more complex model of fraud
prediction: should we investigate this claim or not?

29

Today’s lecture

Part 1: Bivariate statistics

Part 2: Inference for a cross tab

Part 3: Graphics do’s and don’ts

Part 4: Towards business analytics

30

15
Making bad graphs

Question to class

What makes a good graph?

31

Bad graph 1

Do you have something to show? No? Let’s fill it up with chartjunk!


32

16
Bad graph 2

Compressed y-axis distorts differences – we can show that


expiration of the Bush tax cuts are *really* bad!
33

34

17
Bad graph 3

Let’s display data inaccurately! We can abuse the visual


metaphor, change scales etc.
35

200 yrs ago people (Playfair, 1786) already knew how to do this..
36

18
Bad graph 4

Urgency to misinform: abuse the (time) scale!


37

For time series plots, use regular time scale


38

19
Bad graph 5

Let’s show off: clutter it up and confuse people!

39

Pie charts should add up to 100%


Label them clearly, with a minimum of ink, and no distractions

40

20
Graphics in sum

• Good graphics: display data accurately and clearly


– Examine them carefully to know what they have to say

– Then let them say it with a minimum of ink

• Good practices:
– Data-ink ratio should grow with the amount of data
displayed

– No “chart junk”

– Choose scale carefully

– Label clearly and fully

41

Today’s lecture

Part 1: Bivariate statistics

Part 2: Inference for a cross tab

Part 3: Graphics do’s and don’ts

Part 4: Towards business analytics

42

21
In sum – first 5 sessions
Using statistics for decisions
• Always start with descriptive statistics
– Get the key ‘statistics’ for your variables
– Ask/compute graphics for the key variables
– Convince yourself that the data is of good quality (e.g.
sample selection, sample size, measurement, outliers
etc.)

• An important step to be a statistician: realize you are


working with a sample but your conclusions /
recommendations / decisions are for the population
(inferential statistics; CLT)!
43

Univariate statistics
• Given a purpose of analysis...
• Categorical variable
Descriptive

– Numerical: frequency table (proportions, counts)


– Graphical: bar, pie chart
– Inferential: confidence interval for a proportion (dichotomous
variable), Z-test for a proportion (dichotomous variable), chi-
square test for proportionS
• Quantitative variable
Descriptive

– Numerical: central tendency, dispersion, 5 number summary


– Graphical: boxplot, histogram
– Inferential: confidence interval for a mean and Z-test for a mean
• And, sample size calculations for key variables

44

22
Bivariate statistics

The purpose of the analysis and the scale level of the


variables help us decide what statistical technique to use

[[ If one variable is “time” we do time series analysis ]]


45

Parts TWO and THREE of this class focus more


on ‘business analytics’

• Multivariate statistics – bringing together multiple


variables in one approach / probability model

– Usually comes after uni/bivariate statistical analysis

– Very useful for decision making and predicting

• Techniques: regression analysis, logit regression


models, factor analysis, cluster analysis

46

23
Next class meeting: SPSS lab 2

• Covers sessions 3&4


• Prepare case:
– Continue working with insurance fraud dataset (you
already know data and case context)
– Will be posted on Blackboard
• Pay attention to discussion questions given in case;
these hint towards the Q&A form
• Go through ‘How to guides’ for sessions 3&4

47

Appendix

Same cross tab as slide 13 with column percentage


instead of row percentage
48

24
Statistics and business analytics

Using correlations and linear regressions


to inform decisions

Session 5

Fall 2018, Peter Ebbes

Course announcements

Quiz 2

1
Today’s lecture

Part 1: Bivariate stats for La Quinta (case)

Part 2: Correlations

Part 3: Simple linear regressions mechanics

Part 4: Interpreting SPSS regression output

Brief recap of PART ONE of this class

• Intro to statistical and business analytics


– Descriptive stats vs Inferential stats
– Univariate stats vs Bivariate stats

• Three important things for applied work


1. Purpose of the analysis
2. Use scale level (categorical vs quantitative) to
determine statistical method
3. Keep in mind the implication of the sampling
distribution

2
Bivariate stats techniques
The purpose of the analysis as well as the scale level of the
variables help us decide what statistical technique to use

Case: La Quinta

La Quinta operates and provides franchise services to


more than 800 hotels with over 80000 rooms in the
U.S., Canada and Mexico under La Quinta Inns and
La Quinta Inns & Suites brands.

Based in Dallas, Texas with 9,000 employees


nationwide

[[ with an early morning stats class! ]]


6

3
Case: La Quinta

Important challenge in hotel business: expanding


locations [[ e.g. WSJ 10/31/2012 Marriott plans Asia expansion ]]

Where should La Quinta locate a new hotel?

[Diagram: profit margin model – Margin is driven by five factors: Competition, Market awareness, Demand generators, Community, Physical]

Case: La Quinta
• Sample of 100 hotels
• We got a subset of the variables that measure the
factors in the profit margin model
– # of rooms within 3 mile radius (competition)

– Distance to nearest competitor (market awareness)

– Offices, higher education (demand generators)

– Median household income (community)

– Distance to downtown (physical)

– Profit margin

• You got the data: where do you start?


8

4
Univariate statistics: variable ‘Margin’

Profit margin (in percentages)

Sample: mean = 45.74, S = 7.75, min = 27.30, max = 62.80


Perform same analyses for other 6 quantitative variables
9

Univariate statistics: variable ‘Margin’


Profit margin (in percentages)

Suppose we randomly draw another hotel from La Quinta’s database.


What would be your prediction for its profit margin based on this graph
and in absence of a model?
10

5
Today’s lecture

Part 1: Bivariate stats for La Quinta (case)

Part 2: Correlations

Part 3: Simple linear regressions mechanics

Part 4: Interpreting SPSS regression output

11

Correlation
• When you want to describe the relation between TWO
quantitative variables, you may compute the correlation
coefficient

• Correlation coefficient defined:


– Correlation coefficient summarizes the strength of linear association
between two quantitative variables.
– Sample correlation: r (is computed from sample)
– Population correlation: ρ (unknown – Greek letter ‘rho’)
– Correlation coefficients are always between -1 and +1.

• Correlation helps answer questions like:


– If X increases does Y tend to increase or decrease?
– If X is greater in value does Y tend to be greater or smaller in value?

What graphical technique accompanies correlation analysis?


12

6
Correlation between profit margin and
competition is negative

Graphically: scatter plot

r = -0.47

13

Correlation between profit margin and demand


generators is positive

r = 0.50

14

7
Correlation between profit margin and physical
factor is negligible

r = -0.09

15

The sample correlation coefficients r can be


computed with SPSS

The numbers in red are the ones


we saw on previous slides
Interpretation? (table is symmetric)
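If you want to reproduce such a correlation table outside SPSS, a minimal Python/pandas sketch is below (the file name and variable names are assumptions, not the actual SPSS file):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("laquinta.csv")   # hypothetical export of the SPSS data file

# Pairwise Pearson correlations, like the SPSS table (symmetric, each r between -1 and +1)
print(df[["Margin", "Number", "Nearest", "OfficeSpace"]].corr().round(2))

# The graphical companion of a correlation analysis: a scatter plot
df.plot.scatter(x="OfficeSpace", y="Margin")
plt.show()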

16

8
Are correlation coefficients used a lot in
applied business analytics?!
Yes!! A lot… But here are two warnings:

1. The correlation coefficient measures the strength of a


linear relationship between two variables

2. Be careful interpreting correlations: a strong relationship


between two variables does not mean causation

– … the number of Churches in a city correlates with the number of


crimes
– … students in a psychology class who had long hair got higher
scores on the midterm than those who had short hair

‘spurious correlations’ or ‘nonsense correlations’


17

Freakonomics: Everything Is Correlated


(04/04/2011)

www.correlated.org

Goal: uncover one surprising correlation every day

www.Tylervigen.com

“Spurious Correlations was a project I put together as a fun


way to look at correlations and to think about data.”

18

9
Freakonomics: Everything Is Correlated (04/04/2011) www.Tylervigen.com
[Chart: people who drowned after falling out of a fishing boat (# deaths) vs. marriage rate in Kentucky (per 1000); r = 0.95]

19

Freakonomics: Everything Is Correlated (04/04/2011) www.Tylervigen.com
[Chart: worldwide non-commercial space launches (#) vs. sociology doctorates awarded in US (#); r = 0.79]

20

10
Freakonomics: Everything Is Correlated (04/04/2011) www.Tylervigen.com
[Chart: price of apples ($ per pound) vs. number of labor political action committees in US; r = 0.89]

Please remember: correlation is not causation!


21

Today’s lecture

Part 1: Bivariate stats for La Quinta (case)

Part 2: Correlations

Part 3: Simple linear regressions mechanics

Part 4: Interpreting SPSS regression output

22

11
Regression analysis

“Workhorse of applied statistics”


Objective
– Quantify the relationship between a criterion
(dependent) variable and one or more predictor
(independent) variables
Uses
– Understanding how a predictor variable influences the
dependent variable

– Predicting/Forecasting the dependent variable based on


specified values of the predictor variables

23

Common applications of regression in business


• Predicting demand
– Impact of economic conditions

– Marketing mix models

– Price elasticity

• Risk assessments
– Insurance polices

– Financial risk (beta)

– Examining abnormal financial returns

• Other applications (economics, political science,


meteorology etc.)
24

12
Correlation vs. regression analysis

• Correlation analysis/scatter plots: do before performing


regressions

• Correlation analysis tells us the strength of a linear


relationship between two quantitative variables

• Regression analysis allows us to build a mathematical


equation describing these linear relationships

• These equations can be used to

– Predict the value of the dependent variable

– Explain the effect of one or more independent variables on


the dependent variable

25

How do you quantify the relation between


dependent and independent variables?

One way: through a straight line

Straight line formula:

Y = a + b*X

a = intercept (cuts Y-axis)


b = slope of line

(Demand generator)
26

13
The Regression equation – true model (in population)

Y = β0 + β1*X1 + ε

– Y: dependent variable (“Profit margin”)
– X1: independent variable (“Office space volume”); a.k.a. regressor, explanatory variable, predictor
– β0: constant (intercept)
– β1: coefficient of the independent variable (slope)
– ε: error term

How do we obtain values for β0 and β1?


27

Sum of squared errors is measure of


predictive accuracy

45.7 (slide 8)

28

14

Sum of squared errors is measure of


predictive accuracy

“Best” line

45.7 (slide 8)

32

16
Sum of squared errors is measure of
predictive accuracy
ei (green error bar) is
the difference between
the predicted value
“Best” line (red line) and the
actual data

45.7 (slide 8)

The objective of Ordinary Least


Squares (OLS) is to minimize ∑ ei²
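As an illustration of what the software does here, a minimal least-squares fit in Python (the file name and variable names are assumptions):

import numpy as np
import pandas as pd

df = pd.read_csv("laquinta.csv")              # hypothetical export of the SPSS data file
x, y = df["OfficeSpace"].values, df["Margin"].values

# np.polyfit returns the slope and intercept that minimize the sum of squared errors
b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)
print(b0, b1, (residuals ** 2).sum())         # intercept, slope, SSE of the "best" line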

33

Does X help us explain/predict Y?


(Predicted value)

“Best” line (red):

Difference between the


best line and the actual
data:

(Sum of Squared Errors;


a.k.a. unexplained
variation/variance)

34

17
Does X help us explain/predict Y?

“Best” line (red):

The difference between


the overall mean and
the actual data

Sum of Squares Total;


total variation in Y

35

Does X help us explain/predict Y?

“Best” line (red):

The difference between


the overall mean and
the actual data

The difference between
the best line and the
actual data:

Question:
Can SSE>SST?

36

18
Does X help us explain/predict Y?

• SST is the total variation in Y


• SSE is the unexplained variation in Y
• SSE is not larger than SST by design of OLS [[ remember:
the fitted line (red) is found by minimizing SSE ]]
• How much of the variance of Y is explained by X?

"Variation in Y explained"
"Total variation in Y"

– SSR is the Sum of Squares due to Regression ~ explained


variation in Y (by the model)
– Maximum value of R2?
– Minimum value of R2?

37

Today’s lecture

Part 1: Bivariate stats for La Quinta (case)

Part 2: Correlations

Part 3: Simple linear regressions mechanics

Part 4: Interpreting SPSS regression output

38

19
SPSS puts a line where sum of squared errors
is smallest

Table 1

Table 2

Table 3
“Best” line

39

SPSS output for simple linear regression


Table 1

How much variance in Y (profit margin) is explained?


– R-square = 0.251 – Interpretation?
– Adjusted R-square: similar, but takes into account the
number of independent variables
– [[ Std. error of the estimate: standard deviation of the error (slide 26) ]]

40

20
SPSS output for simple linear regression
Table 2

Did X (Volume of office space) tell us anything?


– FYI: SSE = 4453.569, SST = 5949.458, and SSR =
SST – SSE = 1495.889
– More important: omnibus test of model relevance
[[ generalize sample result to the population ]]
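As a quick check, the R-square reported in Table 1 follows directly from these sums of squares:

R² = SSR / SST = 1495.889 / 5949.458 ≈ 0.251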

41

SPSS output for simple linear regression


Table 2 (cont.)

Did X (Volume of office space) tell us anything?


– The P-value in the column ‘Sig.’ was generated under the null hypothesis
that “the model has no explanatory power (in the population)”
– This is equivalent to stating that all regression coefficients β are zero
– Here H0 is rejected (P-value = 0.00 < 0.05)
– Conclusion: the model explains some variation in the dependent
variable (in the population)
42

21
SPSS output for simple linear regression
Table 3

What is the best regression line?


– The estimated coefficients for β0 and β1 are, respectively, b0 =
34.188 and b1 = 0.023
– Generalize results to population through a hypothesis test that X
does not explain Y (no relation)
– Here, null hypothesis is rejected (P-value = 0.00 < 0.05) and
‘Volume’ has a significant (positive) effect on ‘Margin’
– We will get back to ‘Standardized Coefficients’ later
43

Using the simple linear regression model


The estimated model is:
Profit margin = 34.188 + 0.023 * Volume (in 1000 sqft)
This model:
– Quantifies the linear relation between profit margin and office space
volume in the hotel area, and is more informative than the correlation
coefficient (r=0.50 on slide 13)
– Explains how much “extra” profit margin (0.023) can be expected from
a 1 unit (in 1000sqft) increase in office space volume in the area
– Can be used to predict a profit margin for different values of office
space volume in the area
• Area A has 800 (thousand) sq feet in office space: expected profit margin
for a hotel is: 34.188+0.023*800=52.6
• Area B has 400 (thousand) sq feet in office space: expected profit margin
for a hotel is: 34.188+0.023*400=43.4
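These plug-in predictions are easy to script; a minimal Python sketch using the estimated coefficients above:

def predicted_margin(volume_k_sqft):
    # estimated model: Profit margin = 34.188 + 0.023 * Volume (in 1000 sqft)
    return 34.188 + 0.023 * volume_k_sqft

print(predicted_margin(800))   # area A: about 52.6
print(predicted_margin(400))   # area B: about 43.4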

44

22
Try it yourself @Home 1

Interpret the above regression output


45

Try it yourself @Home 2

Interpret the above regression output


46

23
Today’s class in sum

Bivariate statistics for two quantitative variables


– Correlation analysis
– Simple linear regression model

Next class meeting: extend the simple linear


regression model to situations where we have
multiple (quantitative) variables
– Multivariate statistics
– Multiple linear regression

47

24
Statistics and business analytics

Using linear regressions to inform decisions

Session 6

Fall 2018, Peter Ebbes

Course announcements
SPSS lab 3 (of 5), on Thursday Oct 25th, practices
basic regressions (sessions 5 and 6)

– Prepare: ‘How to guides in SPSS’ sessions 5 and 6 and


SPSS lab case guidelines.

– Quiz 3: opens up one day before SPSS lab (aka


tomorrow); closes two days after SPSS lab on Saturday
Oct 27th at 23.59hrs.

1
Today’s lecture

Part 1: Multiple linear regressions

Part 2: Regression diagnostic tasks

Part 3: Using for decisions – predictions

Previous class meeting


• Analyzing two quantitative variables jointly
– Correlations

– Simple linear regression model

• Examples:
– Sale force management: what is the relationship between
sales productivity (e.g. sales) and years of experience for
sales people?

– Marketing: how sensitive are customers to price (price


sensitivity)

– Finance: how sensitive are the stocks of company A to


general market movements? (CAPM, beta)

2
Case: La Quinta – previous class meeting
Important challenge in hotel business: expanding locations
Where should La Quinta locate a new hotel?

[Diagram: profit margin model – Margin is driven by five factors: Competition, Market awareness, Demand generators, Community, Physical]

[Diagram: factors mapped to variables in the data – # rooms within 3 miles (Competition), office space in area (Demand generators), distance to town center (Physical)]

We did three simple linear regressions. What if we want to


assess the impact of ALL factors on margin jointly?
5

Multiple linear regression model


Simple linear regression was used to analyze how one
quantitative variable (the dependent variable Y) is related
to one other quantitative variable (the independent
variable X).

Multiple linear regression allows for any number of


independent variables.
– But still only one dependent variable
– And still all variables are quantitative (but this can be
extended; next class meetings)
– This is an example of multivariate statistical analysis

We expect to develop models that fit the data better


than would a simple linear regression model.
6

3
Multiple linear regression model

We now assume that we have (say) K independent variables
potentially related to one dependent variable:

Y = β0 + β1*X1 + β2*X2 + … + βK*XK + ε

Interpretation is similar to the simple linear regression model
(session 5, slide 27)
– Y is the dependent variable
– X1,…,XK are the independent variables
– β0 is the constant (intercept)
– β1,…,βK are the slopes
– ε is the error term
7

Case: La Quinta
[Diagram: profit margin model – Margin is driven by five factors: Competition, Market awareness, Demand generators, Community, Physical]

• This theoretic model is operationalized by measuring each


of these factors by one or more variables
– Margin is measured by the variable ‘profit margin’
– Each factor is measured by one or more variables (see also prep
guide session 5)
– To keep it simple, in this case we work with a set of 6 variables
• Then, we fit a multiple linear regression model to estimate
the relation between the factors and profitability
8

4
La Quinta: regression results from SPSS

SPSS will give you three tables that need to be interpreted


(similar tables as in session 5 slide 39)

La Quinta’s regression model


A first glance
• Although we haven’t checked significance of these variables
yet, at first pass:

Margin = 38.14 – 0.008*X1 + 1.65*X2 + 0.02*X3 + 0.21*X4 + 0.41*X5 – 0.23*X6

• It suggests that increases in


(2) the number of miles to closest competition, (3) office space,
(4) student enrollment and (5) household income will positively
impact the profit margin.
• Likewise, increases in
(1) the total number of lodging rooms within a short distance and
(6) the distance from downtown will negatively impact profit
margin.
10

5
SPSS output for linear regression
Table 1

How much variance in Y (profit margin) is explained?


– R-square = 0.53 – Interpretation?
– Adjusted R-square: similar, but takes into account the number
of independent variables
– [[ Note: the R-square went up from 0.25 to 0.53 by including additional
independent variables; quite an improvement! (session 5 slide 40) ]]

11

SPSS output for linear regression


Table 2

Did the independent variables tell us anything about profit margin?


– The P-value in the column ‘Sig.’ was generated under the null hypothesis that “the
model has no explanatory power (in the population)”
– This is equivalent to stating that all regression coefficients β are zero
– Here H0 is rejected (P-value = 0.00 < 0.05)
– Conclusion: the model explains some variation in the dependent variable (in
the population)
12

6
SPSS output for simple linear regression
Table 3

It is important to inspect the (here) six t-tests for each independent


variable. Each tests that (in the population) there is no relation.
13

Interpreting the Coefficients*

• Intercept (b0 = 38.14)


– Average profit margin when all independent variables = zero.
– What’s it mean?

• # of motel and hotel rooms (b1 = –0.008)


– For each additional room within three miles of the La Quinta inn, the
profit margin will decrease (on average) by 0.008.
– (I.e. for each additional 1000 rooms the margin decreases by 8)

• Distance to nearest competitor (b2 = 1.65)


– For each additional mile between the nearest competitor and a La Quinta
inn, the profit margin increases (on average) by 1.65.

*in each case we assume all other variables are held constant!
14

7
Interpreting the Coefficients*

• Office space (b3 = 0.020)


– For each additional thousand square feet of office space, the margin
will increase (on average) 0.020.
• E.g. an extra 100,000 square feet of office space will increase margin
(on average) 2.0.

• Median household income (b5 = 0.41)


– For each additional thousand dollar increase in median household
income, the average profit margin increases 0.41

*in each case we assume all other variables are held constant!
15

Which of the factors is/are most important in


determining location?

Inspect the standardized coefficients or t-values


16

8
Which of the factors is/are most important in
determining location?
• The independent variables X1, X2,…,X6 are all measured
on different scales complicating assessing their relative
importance
– X1 (number of rooms) is a count (1,2,…)
– X2 (distance to competitor) is in miles
– X3 (volume of office space) is in 1000s sq ft
– Etc.
• The estimated regression coefficients incorporate these
scale differences and it is therefore tricky to compare
them relatively
• Two possible solutions:
– Use the standardized coefficients
– Use the t-values
17

Which of the factors is/are most important in


determining location?
• Standardized coefficients
– First standardize your dependent and independent variables
– (re)Run the regression on the standardized variables
– Now all estimated regression coefficients are on the same scale
– E.g. an increase of 1 standard deviation in X1 leads to a
decrease of 0.440 sY in Y (sY denotes the sample standard
deviation of Y). [[ -0.440 comes from table slide 16 ]]
• t-value comparison
– The magnitude of the t-values may be a better approach to
assess the relative importance of independent variables
• Important to remember: in general, do not use the estimated
coefficients (b1, b2,…,b6) for relative importance!
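A minimal sketch of the “standardize, then re-run” route in Python/statsmodels (the file name and variable names are assumptions):

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("laquinta.csv")                     # hypothetical export of the SPSS file
cols = ["Margin", "Number", "Nearest", "OfficeSpace",
        "Enrollment", "Income", "Distance"]
z = (df[cols] - df[cols].mean()) / df[cols].std()    # z-scores: mean 0, sd 1

X = sm.add_constant(z.drop(columns="Margin"))
fit = sm.OLS(z["Margin"], X).fit()
print(fit.params)   # the slopes are now the standardized (beta) coefficients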
18

9
Which of the factors is/are most important in
determining location?

The variables X1, X3, and X5 are most important in explaining


profitability
19

Today’s lecture

Part 1: Multiple linear regressions

Part 2: Regression diagnostic tasks

Part 3: Using for decisions – predictions

20

10
Check the underlying assumptions of linear
regressions

Before you use the regression model for decision


making, you need to convince yourself that what
you are (about) to work with is okay to use

21

Garbage In = Garbage Out

22

11
Important aspects to check in regression
analysis
Before drawing conclusions about a population based on
a regression analysis done on a (one!) sample, first
check (at the minimum) the following five aspects:
1. Variable types: use quantitative variables [[ fun fact: we will
extend this next class meetings to categorical variables ]]

2. Assess goodness of fit: investigate R2 and all the P-values


(significance of independent variables)

3. Is there a linear relationship between Y and the Xs?

4. Residual analysis: your errors (see slide 7) should ‘behave’


(ideally, normally distributed, and no outliers)

5. Multicollinearity: your predictor (independent) variables should


not correlate ‘too highly’

23

Aspect 3: Is there a linear relationship between


Y and the Xs?

Inspect scatter plots (e.g. session 5) of the dependent


variable versus the independent variables

– Be aware of abnormal patterns, in particular non-linear


patterns

– Be aware of outliers

24

12
Is there a linear relationship between Y and the Xs?
Inspect scatter plots of the dependent variable versus the
independent variables

R2 = 50%

(note: no relation,
i.e. R² ≈ 0%, is not
per se abnormal)

25

Is there a linear relationship between Y and the Xs?


Inspect scatter plots of the dependent variable versus the
independent variables

R2 = 50%

26

13
Is there a linear relationship between Y and the Xs?
Inspect scatter plots of the dependent variable versus the
independent variables

R2 = 50%

27

Aspect 4: inspect the residuals (errors) of your model

• The errors should (ideally) have a normal distribution


(at least, approximately so)
– Inspect through a histogram (e.g. with a normal curve as
reference in it) or boxplot

• Good practice:
(1) Look for “normality” [[ next slide ]]

(2) Be aware of large residuals (=errors) [[ next slide + 1 ]]

• There are no ‘hard-and-fast’ rules for “normality”; most


applications will not have a perfectly normal distribution
for the errors. That’s okay. But strong deviations should
ring alarm bells.

28

14
Inspect the residuals (errors) of your model

La Quinta: errors behave pretty well


29

Inspect the residuals (errors) of your model

Check if any of your residuals (model errors) are extremely


large
– Rule of thumb: three or more standard deviations from the
“average” residual is considered large

The 10 largest
residuals are all
within +/- 3
(‘Std. Residual’)
which is good
(and rare)!

30

15
Aspect 5: Check for multicollinearity
• With more X variables, you run the risk of multicollinearity
– Two (or more) independent variables are highly correlated
with each other
• This poses a threat to your regression model:
– Untrustworthy estimates for your β’s (“wrong” signs)
– Low t-values (“very few predictors are significant”)
– Limits the size of R2
– Hard to assess importance of predictors
• Importance of multicollinearity problem is less severe if
your goal is prediction, however, it is more important if
your goal is explanation
• Neither detection nor solutions are obvious:
1. Compute correlation matrix among independent variables
2. Run collinearity diagnostics in SPSS
31

Aspect 5: Check for multicollinearity


• Compute the bivariate correlation coefficients between
each pair of X1, X2,…,X6 – see session 5 slide 16
– Rule of thumb: be aware of correlations over +/- 0.80
– La Quinta case: of the 15 bivariate correlation coefficients,
the largest was 0.15
• Compute standard collinearity diagnostics
– VIF (Variance Inflation Factors): (sloppy) how much error
(variance) inflation is there in the estimated regression
coefficients?
– [[ Tolerance: (1/VIF) ]]
– Rule of thumb: (a) no VIF should be larger than 10; (b)
average VIF should be close to 1; and (c) VIFs in between
4 and 10 should be examined
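A minimal sketch of the VIF computation in Python/statsmodels (the file name and variable names are assumptions):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("laquinta.csv")                     # hypothetical export of the SPSS file
X = sm.add_constant(df[["Number", "Nearest", "OfficeSpace",
                        "Enrollment", "Income", "Distance"]])

# One VIF per independent variable (skip column 0, the constant)
for i, name in enumerate(X.columns[1:], start=1):
    print(name, round(variance_inflation_factor(X.values, i), 2))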

32

16
Check for multicollinearity

Collinearity statistics appended to the coefficient table (slide 13)


La Quinta: no problems with multicollinearity

33

What to do if multicollinearity IS present?

Basically multicollinearity means there is not enough


information in the data to estimate the separate slopes
, ,…,

Possible solutions:
– Increase sample size (duh!)
– get rid of some of the independent variables (duh!)
– leave as is (duh!), but do report in analysis
– [[ savvy: Factor analysis – replace two or more collinear
variables with a synthetic variable that summarizes them;
session 9 ]]

Bottom line: no easy solutions to multicollinearity

34

17
Today’s lecture

Part 1: Multiple linear regressions

Part 2: Regression diagnostic tasks

Part 3: Using for decisions – predictions

35

Using the regression model for predictions

Predict the operating margin if a La Quinta Inn is built at a


location where…
 There are 3815 rooms within 3 miles of the site.
 The closest other hotel or motel is .9 miles away.
 The amount of office space is 476000 square feet.
 There is one college and one university nearby with a total
enrollment of 24500 students.
 Census data indicates the median household income in the
area (rounded to the nearest thousand) is $35000.
 The distance to the downtown center is 11.2 miles.

How much profit margin can we expect for this location?

36

18
Using the regression model for predictions
Plug in (hypothetical) X values in your estimated
equation:
X1 = 3815; X2 = 0.9; X3 = 476; X4 = 24.5; X5 = 35; X6 = 11.2

Margin = 38.14 – 0.008*X1 + 1.65*X2 + 0.02*X3 + 0.21*X4
+ 0.41*X5 – 0.23*X6
= 38.14 – 0.008*3815 + 1.65*0.9 + 0.02*476
+ 0.21*24.5 + 0.41*35 – 0.23*11.2
≈ 37.09

Hence, we expect a profit margin of 37.09% for the


particular location on previous slide
– Good prediction?
37

Using the regression model for predictions

Two caveats for predictions


1. The prediction (37.09%) is a point prediction. Chances
are the actual profit margin will be different (this is based
on a sample and the model is an approximation). So, we
are better off giving an interval as prediction.
o Compute with SPSS. For the La Quinta location, the 95%
prediction interval is [25.4%, 48.8%]
2. Extrapolation: never use the model outside of the range
on which it was estimated
o E.g. how about expected profit margin for a location like
before (slide 36) but instead with a large college university
that has 50000 students?
o The maximum observed in the sample is 26500!
38

19
In sum: choosing and using a regression model
• Choosing a regression model:
– Reasonable model fit (i.e. check the underlying assumptions
– five ‘tasks’ on slide 23)

– Relatively high values of R2

– Signs of the (significant) regression coefficients need to


make sense (theory)

• Using a regression model to inform decisions:


– How do independent variables influence a dependent
variable? Which independent variables are (most) important?

– Predictions for (hypothetical) values of the independent


variables

39

Is there more to learn about regressions?!

• Absolutely!

• Next class meetings we will look at situations


where…
– The linearity assumption is violated
– We have categorical variables

40

20
Statistics and business analytics

Using linear regressions to inform decisions


(part II)

Session 7

Fall 2018, Peter Ebbes

Course announcements

Music preference survey: please fill out!


[[ will email you link shortly ]]

1
Previous class meetings
Linear regression models
– Quantify the relation between one dependent variable
and one or more independent variables
– Very important class of models in applied statistical
work
– Useful for explaining and predicting
– Several diagnostic tasks need to be performed before
regressions can be used for decisions [[ G.I.G.O. –
slide 23 session 6 ]]

o One of them: all variables are quantitative

o Another one of them: linearity

Today’s class meeting


But: aren't those two assumptions on the previous
slide huge limitations?
• Answer: yes they are […] limitations!
• In many applications we observe categorical variables
and these may be related to a dependent variable
– Today: how can we include categorical variables as
independent variables in a regression model

• In other applications we often suspect non-linear


relationships between an independent variable and a
dependent variable
– Last part of today’s lecture (brief)

2
Today’s lecture

Part 1: Categorical regressors

Part 2: Interactions with a categorical regressor

Part 3: Nonlinear relations (briefly)

Mini case: gender discrimination in salary at


large US bank
• Common application in HR and business law
• Data (n=208) is from a real case: a US bank was facing
a gender discrimination suit in mid 90s
• Charge: female employees receive substantially smaller
salaries than its male employees
• Variables included (among others):
– Salary (in $1000s)
– Gender
– Education level (categorical)
– Job level (categorical)
– Experience (yrs)
6

3
Using statistics in this law suit case

When a statistician gets the data… (s)he first becomes


friends with it!
– A run down analysis of all the variables in the data file:
univariate statistical analysis!
– Very important and useful:
o Often already yields decision making insights

o Allows for data sanity checks

o Helps guide subsequent analyses

o Helps understand the limitations of


the data given a statistical technique

Become friends with ‘Salary’ and ‘Gender’

A logical next step: combine gender and salary


(bivariate stats). How?
8

4
Evidence for gender discrimination in salary?

Test that the means for men and women are the same
[[ see session 3 part 2 ]]

Difference in sample means: 8.3 (in $1000s)


Z-statistic = 4.14, P-value = 0.000
Conclusion? Is there proof for gender discrimination in
salary?

Evidence for gender discrimination in salary?


The t-test shows that females earn significantly less than men.
But perhaps there is a reason for this:
They might have been hired more recently (less experience)

They may work at lower job grades

They may have lower education levels, etc.

A better approach is to explain salary (quantitative) differences


among men and women (categorical), while controlling for other
factors such as education (categorical), experience
(quantitative), job grade (categorical) etc.
“For a male and female in the same job grade, with the same level
of experience and education, is there a difference in salary?”

In a regression model we are now mixing categorical and


quantitative independent variables
10

5
Including categorical independent variables
• We can NOT include categorical variables as
regressors in a regression model “as is”
– Gender: 1=male, 2 = female
– Job grade: 1=lowest,…,6=highest
– Education: 1=high school,…,5=grad school
• Why’s that?
– Regression models do multiplications, additions,
subtractions which can only be done with quantitative
variables
– Regression interpretation: an one unit increase in X results
in a unit increase in Y, for all X values. This is generally
too restrictive if X is categorical (e.g. next slide)

11

[[ bad ]] Regression model salary vs job grade


Salary (quantitative)

Job grade (categorical)


Salary = 24.4 + 5.6*X (X is ‘Job grade’ with levels=1,2,…,6); R-
square = 62%
12

6
[[ bad ]] Regression model salary vs job grade
Salary (quantitative)

Job grade (categorical)


But the salary means for each job grade level show that the
linearity assumption is too restrictive (means plot)
13

Categorical (=nominal/ordinal) independent variables


Categorical variables can be included as predictors
by using “dummy variables”
– A way of representing groups of people using only
zeros and ones

Example: Gender

Gender              Value   Dummy variable 1
Female              1       1
Male (= reference)  2       0
 

14

7
Why does dummy (0/1) coding work?
Let’s consider a better model to investigate possible salary
discrimination, by controlling for experience [[ ‘YrsExper’ –
quantitative independent variable ]]
Salary = β0 + β1 * YrsExper + β2 * Femaledummy
Two cases:
1. Employee is female ---- Femaledummy = 1
Salary = β0 + β1 * YrsExper + β2 * 1 or
Salary = (β0 + β2) + β1 * YrsExper
2. Employee is male ---- Femaledummy = 0
Salary = β0 + β1 * YrsExper + β2 * 0 or
Salary = β0 + β1 * YrsExper

15

Estimating the salary model on previous slide

Males
Salary (quant)

Females
(nominal)

YrsExper (quant) R2 = 49%


Salary = 35.8 + 0.98 * YrsExper – 8.0 * Femaledummy
16

8
Detailed interpretation of the regression
coefficients on previous slide
• The intercept (b0=35.8; P-value=0.00)
– The expected starting salary for males with zero years of
experience
• The slope for years of experience (b1=0.98; P-
value=0.00)
– The expected increase in salary for one extra year of
experience at the bank for either gender
• The slope for the female dummy (b2=-8.0; P-
value=0.00)
– This is the key coefficient for this law case
– It indicates that the average salary for women is 8.0
(~$8000) lower than for men, given that they have the
same experience levels
17

Categorical (=nominal/ordinal) independent variables


Job Value
level

Lowest 1

2nd 2

3rd 3 ???

4th 4

5th 5

Highest 6
 
18

9
Categorical (=nominal/ordinal) independent variables
Job Value Dum1 Dum2 Dum 3 Dum 4 Dum 5
level

Lowest 1 1 0 0 0 0

2nd 2 0 1 0 0 0

3rd 3 0 0 1 0 0

4th 4 0 0 0 1 0

5th 5 0 0 0 0 1

Highest 6 0 0 0 0 0
 
19

Steps to create dummy variables


1. Count number of categories you have and subtract 1.
2. Create as many new variables as the value you got in step 1. These
are your dummy variables.
3. Choose one of your categories as baseline (the category against
which all other categories are compared).
4. Assign the baseline category the value 0 for all your dummy
variables.
5. For the first dummy variable, assign the value 1 to the first category
that you want to compare against the baseline. Assign all the other
categories 0 for this dummy variable.
6. For the second dummy variable, assign the value 1 to the second
category that you want to compare against the baseline. Assign all
the other categories 0 for this dummy variable.
7. Repeat this for all remaining dummy variables.
8. Include all your dummy variables in the regression model.
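A minimal sketch of these steps in Python/pandas, using job grade with grade 6 (highest) as the baseline (the file name and variable names are assumptions):

import pandas as pd

df = pd.read_csv("bank_salary.csv")                  # hypothetical export of the case data

# Steps 1-2: 6 job grades -> 5 dummy variables
dummies = pd.get_dummies(df["JobGrade"], prefix="grade")

# Steps 3-4: choose grade 6 as the baseline by dropping its column
dummies = dummies.drop(columns=["grade_6"])

# Step 8: include the dummies next to the other regressors
X = pd.concat([df[["YrsExper", "Female_dum"]], dummies], axis=1)
print(X.head())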
20

10
Salary model with YrsExp, JobGrade, Gender

R2 =74%; F=81.6, P-value=0.00


21

What can we learn from the model on previous slide?


• For instance: an employee…
– … in job grade 1 makes 27.5 ($1000s) less on average than
an employee in job grade 6 (reference), all else [[ gender and
experience ]] being equal
– … in job grade 5 makes 11.34 ($1000s) less on average than
an employee in job grade 6, all else being equal
– … with one additional year of experience makes 0.41
($1000s) more on average, regardless of gender and job
grade
• But, a female employee makes (on average) 1.93
($1000s) less than a male employee, all else being
equal [[experience and job grade]]
– But this is not significant (or, at best, marginally) after
controlling for experience and job grade! Conclusion?

22

11
@home exercise for a rainy day

Extend the previous model (slide 21) by also controlling for


education level (categorical, five categories)

– Create 4 dummy variables in SPSS and add these to the 7


independent variables

– Does education have an effect on salary, given gender,


experience, and job grade?

– Are the assumptions underlying your regression model


satisfied (session 6 slide 23)?

23

Today’s lecture

Part 1: Categorical regressors

Part 2: Interactions with a categorical regressor

Part 3: Nonlinear relations (briefly)

24

12
Salary model from slide 16: two parallel lines

Males
(n=68)
Salary (quant)

Females
(n=140)

(nominal)

YrsExper (quant)
Given experience, females earn less than men. But salary
increases at the same rate for males and females. Realistic?
25

Interaction variables in regressions

• The two parallel lines on previous slide imply that


males and females salary increases at the same
rate.
• This is unlikely to be a realistic assumption [[ and a
possible indicator of discrimination in the lawsuit! ]]
• Ideally, we would like to estimate one regression
equation using the full sample (n=208), rather than
estimating two separate regressions for males
(n=68) and females (n=140) in relatively small
samples.
• This can be done through an interaction variable

26

13
Interaction variables in regressions
• An interaction variable is a product of two
explanatory variables
– Scale level doesn’t matter (e.g. dummy×dummy,
dummy×quantitative, quantitative×quantitative)
– Useful if we believe the effect of one explanatory
variable on Y depends on the value of another
explanatory variable

• Example lawsuit case: Y=Salary, X=YrsExp,


D=Femaledummy

Interpretation? Do same analysis as on slide 14!

27

Interaction variables in regressions


Interpreting interactions with a dummy variable is tricky
and can be best done by writing separate equations
and seeing how they differ

The model: Y = β0 + β1*X + β2*D + β3*(X×D)

For FEMALES, D=1, so…
Y = β0 + β1*X + β2 + β3*X = (β0 + β2) + (β1 + β3)*X

For MALES, D=0, so…
Y = β0 + β1*X

Hence: using the interaction X×D we get a separate


line for males and females, each with its own intercept
and slope
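A minimal sketch of how the interaction variable could be created and the model estimated outside SPSS (Python/statsmodels; file name and variable names are assumptions):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("bank_salary.csv")                  # hypothetical export of the case data
df["Fem_x_Yrs"] = df["Female_dum"] * df["YrsExper"]  # the interaction variable X×D

fit = smf.ols("Salary ~ YrsExper + Female_dum + Fem_x_Yrs", data=df).fit()
print(fit.params)   # b0, b1, b2, b3 as interpreted on the next slides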
28

14
Case gender discrimination: interaction of
gender with years of experience
Y=Salary, X1=YrsExp, X2=Female (dummy), X3=X1×X2

R2 = 64%, F=120.16 (P-value=0.00)


Interpretation? Write down the separate equations for
females and males, and interpret each one

29

Case gender discrimination: interaction of


gender with years of experience
Detailed interpretation coefficients previous slide:
• The intercept (b0=30.43; p-value=0.00)
– Average salary of males at 0 yrs of experience
• Slope for experience (b1=1.528; p-value=0.00)
– Average increase in salary per extra year for males
• Slope for female dummy (b2=4.098; p-value=0.00)
– Expected salary premium ($4098) for females at 0 yrs of
experience over males
• Slope for interaction (b3=-1.248; p-value=0.00)
– The salary penalty ($1248) per extra year of experience for
females relative to males
30

15
Interaction of gender with years of experience

Males
Salary (quant)

Females

YrsExper (quant)
The effect of years of experience on salary is quite different for
male and female employees: males move up the salary ladder
much quicker!
31

@home exercise for another rainy day

Extend the previous model (slide 29) by also controlling for


job grade (categorical, 6 categories). How does the model
fit the data? How would you interpret the coefficients?

Once you have included job grade (and if it still rains), you
should include education level in the model as well (slide
23). The model now already becomes pretty complex. How
does it fit the data? How would you interpret the
coefficients?

32

16
One note of caution
While not emphasized today, using dummy variables
and interaction terms does not free you from the
diagnostic tasks discussed before! (session 6 slide 23)
– G.I.G.O.!
– Dependent variable = quantitative
– Independent variables are quantitative OTHERWISE
use DUMMIES
– Linear relation between Y and quantitative X’s
– Assess goodness of fit (R square; p-values)
– Residuals (errors) must ‘behave’
– Multicollinearity

33

Today’s lecture

Part 1: Categorical regressors

Part 2: Interactions with a categorical regressor

Part 3: Nonlinear relations (briefly)

34

17
Generalizing linear regression
• The linearity assumption is often a good and
convenient assumption, but sometimes not realistic
• How do we know things are linear or not?
– LOOK AT YOUR DATA (scatterplots of Y and Xs, and
examine residuals)
– Economic theory

• If things are not linear, what can we do?


– Categorical variables: use dummies
– Quantitative variables: transform your data such that
the relationship between f(X) and g(Y) is linear

35

Example: ad spending on sales


Sales ($100s)

Ad spending ($100s)

Would a linear regression model be a good model to


explain/predict sales using ad spending?
36

18
Example: ad spending on sales
Sales ($100s)

Fit Y against X

Ad spending ($100s)
• With SPSS: Y = 8181 + 85*X (R² = 0.66; F = 491; P-value = 0)
• Probably not: for low and large values of X we over predict Y,
for medium values of X we under predict Y. Alternatives?
37

Example: ad spending on sales


Sales ($100s)

Fit Y against X and X²

Ad spending ($100s)
• With SPSS: Y = 6773 + 190*X – 1.10*X² (R² = 0.69; etc.)
• Hard to interpret the coefficients b1=190 and b2=-1.10
• Other alternatives: fit Y against √X or LN(X) instead of X and X²
38

19
Example: ad spending on sales
Sales ($100s)

Fit LN(Y) against LN(X)


LN is the natural logarithm

Ad spending ($100s)
• With SPSS: LN(Y) = 8.5 + 0.25*LN(X).
• Interpretation: a 1% increase in X goes with a 0.25 percent
increase in Y [[ sales-advertisement elasticity ]]
39

Example: ad spending on sales


• The curve on previous slide is probably the most
reasonable alternative: the log-log regression model
• LN(Y) = β0 + β1*LN(X) + ε
LN is the natural logarithm
– (dy/Y)/(dx/X) = ∆Y%/∆X% = β1
– Constant elasticity: a 1% increase in X goes with a β1%
increase in Y (decrease if β1 is negative)
• If Y is sales and X is price, then β1 is the price-elasticity
• The log transform is used a lot in demand modeling:
– It induces nice statistical properties (e.g. making skewed
error distributions more symmetric)
– Has a convenient interpretation in terms of percentages
(elasticities)
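A minimal sketch of the “compute new variables, then regress” route outside SPSS (Python/statsmodels; file name and variable names are assumptions):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("ad_sales.csv")         # hypothetical data with Sales and AdSpend
df["ln_sales"] = np.log(df["Sales"])     # first compute the new (log) variables...
df["ln_ads"] = np.log(df["AdSpend"])

fit = smf.ols("ln_sales ~ ln_ads", data=df).fit()   # ...then regress on them
print(fit.params["ln_ads"])              # the estimated elasticity (beta_1)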

40

20
Today’s class in sum
• How to use categorical variables as independent
variables in a linear regression model
– Use dummy variables to represent the categorical variable

• Interactions: if the effect of one explanatory variable on Y


depends on the value of another explanatory variable
• One way to deal with non-linearities in regressions
– A common transformation uses the natural logarithm
– In SPSS: first compute new variables from the old
variables (Transform-Compute Variable)
– Then run a regression using the newly computed variables

• Next class meeting: how to use a categorical variable as


DEPENDENT variable
41

21
Statistics and business analytics

Logistic regressions for a categorical


dependent variable

Session 8
Warning!
Tough Lecture!

Fall 2018, Peter Ebbes

Course announcements
Music preference survey is available! Please fill out..

https://hec.az1.qualtrics.com/jfe/form/SV_0debqP4E3ppDOoB

Closes tomorrow (Friday) tonight!

1
Today’s lecture

Part 1: Logit regression main idea

Part 2: Logit regression SPSS output

Part 3: Logit regression interpretation

Part 4: Predicting probabilities and


other diagnostic tasks

Previous class meetings: linear regression


models

• Quantify the relation between one dependent


variable and one or more independent variables

– Dependent variable: quantitative (sales, salary, stock


returns, insurance claim etc)

– Independent variable: quantitative or dummy


representation (if categorical)

• Today’s class meeting: what could we do if we have


a categorical dependent variable?

2
In many business applications we have
categorical variables…

… that we want to explain and / or predict


– What drives customer retention? (customer churns or
not)

– What factors are explaining bankruptcy of startups? (a


startup goes bankrupt or not)

– What is the probability that a LinkedIn user is a job seeker


(job seeker yes vs no)

– What is the probability that an insurance claim is


fraudulent (e.g. session 4)

What to do with categorical dependent variables?

• In the previous examples, the dependent variable


has two categories, and is therefore not quantitative
– However, we still would like to be able to explain or
predict it as a function of other variables.

• The dependent variable is, in fact, ‘binary’ (1 or 0)


– E.g. buy vs not buy, job seeker vs non job seeker,
fraudulent vs not fraudulent etc

• Idea: explain / predict the probability of a 1 or 0 as a


function of X’s
• This won't go well with the linear model… Why?

3
Brilliant idea!

Don’t predict the probability (p) of a 1 or a 0, but


something on a scale of negative infinity to positive
infinity

Instead, predict the log-odds: log[ p / (1-p) ]

(see also appendix 1 for a discussion on probability and odds)


7

Idea: convert probabilities (p) to span the entire


line (log-odds)

4
Example: simple logistic regression model
(one independent variable)

log[ p / (1-p) ] = β0 + β1 * X1

Allows for the prediction / explanation of probabilities


– Prob(LI user is job seeker) = p

– Examples for X1, e.g. updated profile page in last month,


grew his/her LI network, invitations sent / invitations
received.

– More general, include multiple X’s jointly as independent


variables
9

Today’s lecture

Part 1: Logit regression main idea

Part 2: Logit regression SPSS output

Part 3: Logit regression interpretation

Part 4: Predicting probabilities and


other diagnostic tasks

10

5
Mini case: gender discrimination in salary at
large US bank

• Common application in HR and business law: a US bank


was facing a gender discrimination suit in mid 90s
• Sample n=208
• Charge: female employees receive substantially smaller
salaries than its male employees
• Prelim insights (last class meeting): there is no strong
evidence for salary discrimination after controlling for
experience, job grade etc., but it may be the case that
women don’t advance as fast
• Additional variable: employee was promoted
in the last 12 months (Y/N)

11

Bar chart for promotion

[Bar chart: not promoted 73%, promoted 27%] P(promoted)?

Are women more or less likely to be promoted than men?


Bivariate statistics: two categorical variables – cross tab (next slide)
12

6
Clustered bar chart promotion and gender

Chi-square=12.7; p-value = 0.00; only 19% of females were promoted,


versus 43% of males. But: other factors (e.g. experience, education)
are likely to affect promotion; how to control for?
13

Logistic regression for promotion


(two independent variables)

Let’s keep things simple for class discussion purpose:


– Prob(Y=1) = p
– Y = ‘Prom’ (1=Y, 0=N)
– X1 = ‘YrsExper’ (quantitative)
– X2 = ‘Gender_dum’ (1=F, 0=M)

Use SPSS to estimate the logistic regression model:

log[ p / (1-p) ] = β0 + β1 * X1 + β2 * X2
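For reference, the same model could be estimated outside SPSS; a minimal Python/statsmodels sketch (file name and variable names are assumptions):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("bank_salary.csv")      # hypothetical export with Prom, YrsExper, Gender_dum

# log[ p / (1-p) ] = b0 + b1*YrsExper + b2*Gender_dum
fit = smf.logit("Prom ~ YrsExper + Gender_dum", data=df).fit()
print(fit.summary())                     # coefficients correspond to SPSS Table 4 (column B)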

14

7
For logistic regression, SPSS produces LOTS of output
Here are the relevant tables (interpretation next slides)
Table 1 Table 2

Table 3

Table 4

15

Logistic regression for promotion


(with two independent variables)

Prob(Y=1) = p
Y = ‘Prom’ (1=Y, 0=N)
X1 = ‘YrsExper’ (quantitative)
X2 = ‘Gender_dum’ (1=F, 0=M)

Although we haven't checked significance of these
variables yet, a first-pass look at the estimated model (from
table 4 previous slide, column ‘B’):

log[ p / (1-p) ] = -1.52 + 0.13 * X1 -1.32 * X2

16

8
SPSS output for logit regression
Table 1
• Generally, as the logit regression is “almost” linear (it is linear
in log-odds), much of the reasoning / interpretation / checks is
similar to linear regressions (sessions 5—7)

• Omnibus test of model relevance: do our predictors (X1 and


X2) explain/predict anything in Y (probability of being
promoted)?
– Test statistic is the chi-square statistic (model), i.e. 2 = 39.98
with corresponding P-value = 0.00
– Conclusion: the explanatory variables have a significant impact

17

SPSS output for logit regression


Table 2

• As for linear regressions, the R2 tells us how much of the


variation in Y (the variation in the observed 1’s and 0’s)
we can explain with the independent variables
– Cox & Snell R square is lower; it cannot reach its theoretical maximum of 1
– Nagelkerke’s R-square is an adjustment
• “About 25.4% of the variation in the 0/1’s of job
promotions is explained by experience and gender”
18

9
SPSS output for logit regression
Table 3 False positive
False negative

• The classification table is also an indicator of how well


the model performs
• Interpretation?
For 145 individuals the model correctly predicts they were not promoted;
for 7 individuals the model incorrectly predicts they were promoted; for 40
individuals the model incorrectly predicts they were not promoted, for 16
individuals the model correctly predicts they were promoted

19

SPSS output for logit regression


Table 3 False positive
False negative

• Mechanics: for each one of the 208 employees..


– The model computes a probability that the employee was promoted
– Example: P(employee #12 is promoted | who’s female with 10yrs
experience) = 0.3 [[ more details on this in part 4 ]]
– Because 0.3 < 0.5 (cut value), we take 0 (not promoted)
– If P(..|..)>0.5, we take 1 (promoted)
– The cut value can be chosen
20

10
SPSS output for logit regression
Table 3 False positive
False negative

• There are no clear guidelines for the cut value. For instance it
depends on whether predicting a false negative or a false
positive is worse (e.g. more costly); see appendix 2.
• Sometimes a reasonable compromise is to set it to the observed
proportion of promotions in the sample (here: 27% of the sample
is promoted; see slide 12)

21

SPSS output for logit regression


Table 4

The omnibus test (slide 17) indicated that at least one


regression coefficient β is nonzero. Which ones are
significant (in the population)?
– Inspect the P-values (column ‘Sig.’) for the test β = 0
– Conclusion: both independent variables have a significant effect
(P-values<0.05) on Y (promoted Y/N)
– Importance of independent variables? Check Wald statistic
– Now we can move on to a detailed interpretation: either through
log odds or the odds ratio (part 3)
22

11
Today’s lecture

Part 1: Logit regression main idea

Part 2: Logit regression SPSS output

Part 3: Logit regression interpretation

Part 4: Predicting probabilities and


other diagnostic tasks

23

Detailed interpretation: log odds


log[ p / (1-p) ] = -1.52 + 0.13 * X1 -1.32 * X2
(p=probability promotion, X1 = YrsExper, X2 = Female_dummy)

• Intercept (b0 = -1.52; P-value=0.00)


– The log odds of a promotion for a male with zero years of
experience is -1.52
• Years of experience (b1 = 0.13; P-value=0.00)
– An additional year of experience increases the log odds of
a promotion by 0.13 (regardless of gender)
• Gender dummy (b2 = -1.32; P-value=0.00)
– Regardless of years of experience, the log odds of being
promoted is -1.32 lower for females than males

In other words: in terms of log-odds, interpretation is similar to linear regression


But, many researchers do not find these interpretations precise enough
24

12
A more detailed interpretation: odds ratio
log[ p / (1-p) ] = -1.52 + 0.13 * X1 -1.32 * X2
(p=probability promotion, X1 = YrsExper, X2 = Female_dummy)
• A more precise interpretation can be given through the odds
ratio (exp(B) column in table 4). When X1 is increased by 1, the
odds ratio is

odds ratio = (new odds after a unit change in X1) / (original odds) = exp(β1)
• In other words, for previous example.. See appendix 3&4 for “the
math behind this formula”
– Odds ratio = e^0.13 = 1.14
– If X1 (=YrsExper) increases with 1, then the odds of being
promoted are 1.14 times the odds before the increase (regardless
of gender)
– Or, a (1.14 – 1)*100%=14% increase in odds of promotion for
every year of experience
25

A more detailed interpretation: odds ratio


log[ p / (1-p) ] = -1.52 + 0.13 * X1 -1.32 * X2
(p=probability promotion, X1 = YrsExper, X2 = Female_dummy)
• A more precise interpretation can be given through the odds
ratio (exp(B) column in table 4). When X2 is increased by 1
(0=male1=female), the odds ratio is

odds ratio = (new odds after a unit change in X2) / (original odds) = exp(β2)
• In other words, for previous example..
– Odds ratio = e^–1.32 = 0.27
– The odds to be promoted for females are 0.27 times the odds for
men (with same level of experience)
– Or, a (0.27 – 1)*100% = 73% decrease in odds of promotion of
females compared to males

26

13
Remark: odds ratio interpretation
• Odds ratios are not easy to interpret; they are fairly
abstract
• Make them easier to interpret through ‘baseline odds’,
which translate them to a concrete situation like “the
number of successes per the number of failures”
• Example: let’s choose as baseline odds a situation
where all X variables are put to 0
– This represents a promotion of a male with 0yrs of experience

– Baseline odds = exp(-1.52) = 0.22 (or 22/100)

– “We would expect 22 males to be promoted for every 100 males


that are not, within the group that have 0 yrs of experience”

• How does this result for men compare to females?


27

Remark: odds ratio interpretation


• We found (slide 26) that the odds for females with 0 yrs
of experience decrease by 73%, hence, 0.27*0.22=0.06
– “We would expect 6 females to be promoted for every 100
females that are not, within the group that have 0 yrs experience”

– That is a quite substantial difference with the males!

• How to get a meaningful baseline value?


– One way, as we just did, put all explanatory variables to zero

– Or, if not meaningful, consider an “average” case (i.e. put the


value of all explanatory variables to their averages)

• Another advantage of working with baseline odds is that


it helps to understand the impact of the results
[[ next slide ]]
28

14
Remark: odds ratio interpretation
• What if being promoted (at 0yrs experience) were rare?
• Suppose instead that these baseline odds were 0.001:
– “One male is promoted within his first year for every 1000 males
in their first year that are not”
• Now, the odds for a female would change from 0.001 to
0.00027:
– Because: 0.27 × 0.001 = 0.00027

– Hence, we have 1 female promotion within her first year for


every 3700 females in their first year that are not promoted
– Compared to males, we have 3.7 males that are promoted for
every 3700 males that are not promoted in their first year
– This difference does not sound “as impressive” as the difference
in the example on the previous two slides

29

Remark: odds ratio interpretation

• Hence, odds ratios alone are hard to interpret.

• Besides, they cannot be used very well to assess
“impact”. This is particularly the case when the event
is rare.

• Baseline odds, however, can help provide a


convenient way of interpretation and evaluate the
size of the effect

• In sum, baseline odds should always be included in


discussing logistic regression analysis.

30

15
Today’s lecture

Part 1: Logit regression main idea

Part 2: Logit regression SPSS output

Part 3: Logit regression interpretation

Part 4: Predicting probabilities and


other diagnostic tasks

31

Using the logit model for predictions

• The real power of the logit model comes from its


ability to estimate/predict probabilities
– What is the probability that a female employee with 5
years of experience is promoted?
– What is the probability that a male employee with 5
years of experience is promoted?

• Approach: plug in the X values in your log-odds


equation and solve for p

log[ p / (1-p) ] = -1.52 + 0.13 * X1 -1.32 * X2


32

16
Using the logit model for predictions
Example: the probability that a female employee with 5
years of experience is promoted is

log[ p / (1-p) ] = -1.52 + 0.13 * 5 -1.32 * 1 = -2.19

Take exponent on both sides of equation:

p / (1-p) = exp(-2.19) = 0.111

Solve for p:

p = 0.111 / (1 + 0.111) = 0.10

For completeness, we need to compute (bit tricky!) a


(confidence) interval: the 95% interval is [0.06, 0.17]
33

Using the logit model for predictions


In sum – steps to estimate a probability (previous slide):

1. Plug in the X’s for the scenario in the log-odds equation


and complete the products and sums to get the log-odds

2. Exponentiate the number you got in step 1

3. Plug this number in the following equation to compute p


(the probability of observing a 1):

number from step 2



1 number from step 2

[[ 4. Ask for a confidence interval for the predicted probability;


need computer for that! ]]
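A minimal Python sketch of steps 1–3, reproducing the two worked examples on these slides (step 4, the interval, indeed needs the stats package):

import math

def prob_promotion(yrs_exper, female_dummy):
    log_odds = -1.52 + 0.13 * yrs_exper - 1.32 * female_dummy   # step 1
    odds = math.exp(log_odds)                                   # step 2
    return odds / (1 + odds)                                    # step 3

print(round(prob_promotion(5, 1), 2))   # female, 5 yrs: about 0.10 (slide 33)
print(round(prob_promotion(5, 0), 2))   # male, 5 yrs: about 0.30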
34

17
Using the logit model for predictions

Super-fun @home exercise: estimate the probability of


promotion for a male employee with 5 yrs of experience
log[ p / (1-p) ] = -1.52 + 0.13 * X1 -1.32 * X2

You should find: 0.30 (rounded)

You can either use the previous “manual” steps (slide


34) or use SPSS to get the prediction (See How to
guide)

35

Using the logit model for predictions

[Plot: predicted probability of promotion vs. YrsExper; blue = males, red = females; the dots mark the calculations on previous slides]
[[ What’s missing in this graph? ]]

Using this simple model, it appears that males have on


average much higher probabilities of being promoted than
females, for any level of experience
36

18
Important aspects to check in regression analysis
Before drawing conclusions about a population based on a
LOGIT regression analysis, first check (at the minimum) the
following five aspects:
1. Variable types: dependent variable is DICHOTOMOUS (1/0);
independent variables are quantitative otherwise dummies

2. Is there a linear relationship between the log odds and the Xs?
Hard to investigate! Rely on theory and model checks 3. and 4.

3. Assess goodness of fit: investigate R2 (table 2) and all the P-


values (tables 1 and 4 – significance of independent variables)

4. Residual analysis: inspect the model predictions (table 3)

5. Multicollinearity: your predictor (independent) variables should


not correlate ‘too highly’ (check with correlations)

37

Today’s class in sum


• Logistic regressions for dichotomous dependent
variables
– Quite similar to linear regressions: the model is linear in the log odds
– Interpret through log odds, or odds ratio with baseline odds
– Allows for prediction of probabilities (+ intervals!)
– Need to perform similar 5 tasks as linear regressions
before conclusions can be drawn (slide 37)
• Can be further generalized (beyond scope class):
– More than two categories: multinomial logit model
– Ordered categories (ordinal dependent variable): ordinal
logistic regression
• Next class meeting: SPSS lab 4 (of 5)!

38

19
Appendix

Appendix 1 – probability versus odds

Appendix 2 – cut value for logit prediction

Appendix 3 – odds ratio in a logit model

Appendix 4 – the math behind odds ratio

39

Appendix 1: probability versus odds


• See also optional reading listed in the prep guide ‘Odds or
Probability’ by Ronald Wasserstein
• Informally:
– Probability – the number of ways an event can occur divided by
the total number of possible outcomes
– Odds – the number of ways an event can occur divided by the
number of ways it does not occur
• Example 1: take p=Pr(win)=0.5. Then Odds(win) = p/(1-p) =
0.5/0.5 = 1/1. For every win there is one loss. Or, play two
times, you are expected to win one and lose one.
• Example 2: take p=Pr(win)=0.25. Then Odds(win) = p/(1-p) =
0.25/0.75 = 1/3. For every win there are three losses. Or, play
four times, you are expected to win one and lose three.
• Example 3: take p=Pr(win)=0.75. Then Odds(win) = p/(1-p) =
0.75/0.25 = 3/1. For every three wins there is one loss. Or,
play four times, you are expected to win three and lose one.
40

20
Appendix 2: cut-off value for logit prediction

                        Predicted No       Predicted Yes
Observed   No           Correct            False positive
           Yes          False negative     Correct

Choosing the cut value (slides 19—21) to classify predictions is not


obvious. A couple of strategies could be adopted:

First strategy: in the absence of any information, the best prediction of the
probability that a randomly drawn observation is a 'Yes' is the observed
proportion (call it p̄) of 'Yes' in the sample (e.g. slide 12).
The logit model incorporates information through your Xs, and predicts
that the probability of 'Yes' is (say) p̂. Now we could classify the
observation as a 'Yes' if p̂ > p̄ and as a 'No' if p̂ ≤ p̄.
This was the approach taken in class with p̄ = 0.27 (slide 21).

41

Appendix 2: cut-off value for logit prediction

Relative cost table:
                          Predicted No           Predicted Yes          Probability
Observed ('Truth')  No    0                      C (false positive)     1 − p̂
                    Yes   5C (false negative)    0                      p̂

Second strategy: another approach can be taken if there is some
knowledge of the relative cost of a false positive and a false
negative. Suppose that the cost of a false negative is (say) 5 times
the cost of a false positive (with no cost of a correct classification).
Assume the predicted probability of a 'Yes' is p̂ and of a 'No' is 1 − p̂.
This is summarized in the above table.
Here, we can classify a prediction either as 'Yes' or as 'No'. The
expected costs of these two actions are now given by:
E[cost | classify 'Yes'] = (1 − p̂) × C + p̂ × 0 = (1 − p̂) × C
E[cost | classify 'No']  = (1 − p̂) × 0 + p̂ × 5C = 5 × p̂ × C
(continued on next slide)
42

21
Appendix 2: cut-off value for logit prediction
Given these expected costs, it would be better to classify an
observation as ‘Yes’ if the expected cost of that action is lower than
the expected cost of classifying the observation as ‘No’. That is,
classify as ‘Yes’ if:
E[cost | classify 'Yes'] < E[cost | classify 'No'] ⟺
(1 − p̂) × C < 5 × p̂ × C ⟺
1 − p̂ < 5 × p̂ ⟺
1 < 6 × p̂ ⟺
p̂ > 1/6 ≈ 0.17.
That is, our cut-off value is now 0.17 instead of 0.27, which
acknowledges the relative cost of misclassification.

Hence, when the logit model predicts for an observation that the
probability of a 'Yes' is larger than 0.17, i.e. p̂ > 0.17, then
classify it as 'Yes', otherwise classify it as 'No'.
43

Appendix 2: cut-off value for logit prediction

Following the preceding, if the cost of a false negative is equal to


the cost of a false positive, then the cut-off value would be 0.5.
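Both cut-off values follow from one small formula; a Python sketch, purely for illustration: if a false negative costs r times as much as a false positive, classifying as 'Yes' is the cheaper action whenever p̂ > 1/(1 + r).

def cost_based_cutoff(r):
    # r = cost of a false negative relative to the cost of a false positive
    return 1 / (1 + r)

print(cost_based_cutoff(5))   # 0.1667 -> the 0.17 cut-off derived above
print(cost_based_cutoff(1))   # 0.5    -> equal costs give the usual 0.5 cut-off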
Third strategy: a third approach would run the classification table
for many choices of the cut value. In the previous two examples we
took just two values (0.27 and 0.17). But, we could create several
tables for varying choices of the cut value, and discuss the
implications for the resulting classifications.
This is the idea behind the ROC curve which is obtained by
changing the cut value and calculating the false negative and
positive rates. Some researchers advocate choosing the model that
maximizes the area under the ROC curve. For a discussion, see
e.g. Ledolter (2013) Ch. 8 (detailed reference in course syllabus).
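A sketch of this third strategy in Python with scikit-learn, for illustration only; y and p_hat are hypothetical placeholders for the observed outcomes and the predicted probabilities from the logit model (e.g. exported from SPSS).

from sklearn.metrics import roc_curve, roc_auc_score

y = [0, 0, 1, 0, 1, 1, 0, 1]                                # observed outcomes (placeholder data)
p_hat = [0.10, 0.22, 0.35, 0.18, 0.60, 0.48, 0.30, 0.75]    # predicted probabilities (placeholder)

fpr, tpr, cut_values = roc_curve(y, p_hat)                  # one (fpr, tpr) point per cut value
print(list(zip(cut_values.round(2), fpr.round(2), tpr.round(2))))
print("area under the ROC curve:", roc_auc_score(y, p_hat))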

44

22
Appendix 3: odds ratio in a logit model

The odds ratio for a regression coefficient b is

    odds ratio = odds(X + 1) / odds(X) = exp(b)

It represents the (multiplicative) change in the odds of the outcome
from increasing X by 1 unit
– If b = 0, the odds and probability are the same at all X values (exp(b) = 1)
– If b > 0, the odds and probability increase as X increases (exp(b) > 1)
– If b < 0, the odds and probability decrease as X increases (exp(b) < 1)

45

Appendix 4: the math behind odds ratio


• Let p be the probability of an event (e.g. buy product,
person is promoted).
• The logistic regression model can be given as (slide 9):
    log[ p / (1 − p) ] = a + b * X
• Exponentiating both sides gives:
    p / (1 − p) = exp(a + b * X) = exp(a) * exp(b * X)
• If we increase X with one unit, we get 'new' odds p' / (1 − p'). Call
these odds+1:
    odds+1 = p' / (1 − p') = exp(a + b * (X + 1)) = exp(a) * exp(b * X) * exp(b)
(continued on next slide)
46

23
Appendix 4: the math behind odds ratio

• Now the ratio of the new odds over the old odds is:

    [ exp(a) * exp(b * X) * exp(b) ] / [ exp(a) * exp(b * X) ] = exp(b)

• Hence,

    (odds after a unit change in X) / (original odds) = exp(b)

This is the formula on slide 25
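A quick numeric check of this algebra in Python (illustration only; the intercept and slope are just example values): increasing X by one unit multiplies the odds by exp(b), no matter where X starts.

import math

a, b = -1.52, 0.13          # example intercept and slope
for x in [0, 5, 20]:
    odds_ratio = math.exp(a + b * (x + 1)) / math.exp(a + b * x)
    print(odds_ratio, math.exp(b))   # the two numbers coincide for every x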

47

24
Statistics and business analytics

Data reduction through factor analysis

Session 9

Fall 2018, Peter Ebbes

Course announcements
• Next class meeting: session 10 of 10 (cluster analysis)
• SPSS Lab 5 of 5 (covers sessions 9 and 10): we’ll have
two lab sessions
– Lab 5.1 (Thu Nov 22)

• Final team project “review” / class wrap up

• Info about the final team project (organization, deliverables etc.)

• Start with SPSS lab 5

– Lab 5.2 (Tue Nov 27)

• Finish + hand in SPSS lab 5

• Quiz 5 of 5: opens one day before lab 5.2, closes two days
after lab 5.2
2

1
Today’s lecture

Part 1: Basics of Factor Analysis (FA)

Part 2: Running a FA

Part 3: Interpreting a FA solution

Part 4: FA goodness of fit

Part 5: Putting FA to work

How does what we are about to do relate to what we did?

• PART 2: various forms of regression analysis


– Multivariate statistical technique
– Singles out one variable (dependent) which is to be
explained or predicted by one or more other variables
(independent variables aka regressors, predictors etc)
• Today we start with PART 3: factor analysis (FA)
– Also multivariate statistical technique
– But: no variable is singled out for special treatment as a DV
– FA poses “How are these k variables interrelated?”
– It may be seen as a “data reduction” tool and sometimes
precedes regression analysis to reduce the set of regressors

2
Daimler/Chrysler seeks a new image

(Dodge Viper, 2016 model)

Daimler/Chrysler seeks a new image

Daimler/Chrysler seeks a deep understanding of the


psychological characteristics of yuppies (prime target
group for the car) in order to formulate the marketing
program for the Dodge Viper.

– How to overcome its boxcar image?


– What incentives to offer?
– What is the role of styling and prestige in promotion?
– How to exploit the Daimler-Benz merger?

3
Daimler/Chrysler seeks a new image

The company has been presented with the responses from


the attitude survey (handout p2). The criterion variable
(“I would consider buying the Dodge Viper made by
Daimler-Chrysler”) is an important measure for
behavior/interest.
Chrysler needs to know about the psychological
characteristics of the yuppies to configure the Dodge
Viper program.
Task for the data analytics team: make recommendations on the
design, brand positioning, and targeting of the Dodge
Viper to increase appeal to the yuppie market, based on
analyses of this data.

Daimler/Chrysler regression
• Regression analysis based on attitudinal data
• The full model (handout pp3-5) is not very useful
– Too many regressors
– Most regressors are not significant (P-value>0.05)
– High levels of multicollinearity (average VIF = 5.6, several VIFs
> 10, several tolerance levels < 0.07)
– Many correlations between regressors over 0.8, e.g.
o “I can do anything I set my mind to” with “Skeptical predictions are
usually wrong” (r=0.96, p-value<0.05)
o “I would like to take a trip around the world” with “I wish I could leave
my present life and do something entirely different” (r=0.90 , p-
value<0.05)
o “I usually dress for fashion, not comfort” with “I am in very good
physical condition”. (r=0.92, p-value<0.05)
8

4
Daimler/Chrysler regression

A smaller model (fewer regressors) could be better


(handout pp6-7)

– But: which regressors to choose?

– The *adjusted* R-square of the smaller model is higher than for the
previous large model (57.3% vs 56%, respectively)

– Multicollinearity is less a problem

– Is this model useful? What can we learn from it that is
relevant for this case?

Factor Analysis: in general

• Factor analysis is a data reduction technique useful in


dealing with large data sets in which quantitative
variables are (strongly) correlated

• Assumption is that the original variables are generated


by a set of underlying dimensions that cannot be
measured directly

• Factors are derived sequentially so that they are


uncorrelated and jointly describe the total variance of the
original variables in descending order of ‘importance’

• These “factors” could then be used for further analysis


(e.g. in a regression) for (strategic) decision making

10

5
Factor Analysis: to potentially help a regression
Goal: we need to run the following regression:

Y = b0 + b1*X1 + b2*X2 + b3*X3 + b4*X4 + b5*X5 + ... + bn*Xn

1. Create a small number of indices (factors F1, F2, ... – "super
variables") from a set of correlated variables that capture the
statistical information in the original set of variables
2. Understand the underlying structure
3. Use the factors for subsequent analysis, instead of the original X's:

Y = d0 + d1*F1 + d2*F2

11

Factor Analysis mechanics


• We will eschew technical details in this class and focus
on interpreting and using factor analysis
• It is based on a similar idea as regressions: the total
variance of one variable (Y) is partitioned into
components which sum to the total (SST=SSR+SSE)
• In a nutshell, for factor analysis, it is similar:
– For a set of variables (X1, X2,…,Xn), the total (co-)
variability (say, R) can be partitioned into a common
portion C which is explained by the factors and a portion U
which is unexplained by the factors: R=C+U

– Factor analytic approaches transform the original set of


variables into a new set of uncorrelated linear
combinations of these variables
12

6
Today’s lecture

Part 1: Basics of Factor Analysis (FA)

Part 2: Running a FA

Part 3: Interpreting a FA solution

Part 4: FA goodness of fit

Part 5: Putting FA to work

13

Factor Analysis in 7 (easy) steps

1. Confirm data are metric (quantitative)


2. Decide on the number of factors to be derived
3. Derive the factor solution
4. Present the (rotated) solution
5. Interpret the solution
6. Evaluate the goodness of fit
7. Save factor scores for subsequent analyses
(optional)
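A sketch of these seven steps in Python, for illustration (the class does this in SPSS). It assumes a pandas DataFrame df with the 30 quantitative attitude items; note that scikit-learn extracts factors by maximum likelihood, so the numbers will not match the SPSS principal-components output exactly.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import FactorAnalysis

X = StandardScaler().fit_transform(df)                    # step 1: metric variables, standardized

eigenvalues = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
print(eigenvalues)                                        # step 2: scree plot / eigenvalues > 1 rule

fa = FactorAnalysis(n_components=9, rotation="varimax")   # steps 3-4: derive and rotate the solution
scores = fa.fit_transform(X)                              # step 7: factor scores per respondent

loadings = pd.DataFrame(fa.components_.T, index=df.columns)   # step 5: loadings used for interpretation
communalities = (loadings ** 2).sum(axis=1)               # step 6: variance explained per original item
print(loadings.round(2))
print(communalities.round(2))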

14

7
Decide on the number of factors
• The maximum number of factors is the number of
variables (here: 30)
• The choice depends on..
a) Managerial decision (subjective)

b) Elbow in scree plot (handout p8)

c) Eigenvalues > 1 [[ sloppy: the eigenvalue represents


how much variance a factor explains ]]

d) Cumulative variance explained

• Step b) is a plot in the SPSS output, and steps c) and d)


are to be found in the SPSS table ‘Total Variance
Explained’ (handout p9)

15

Decide on the number of factors

• Inspect the scree plot and the ‘Total Variance Explained’


table
• The scree plot suggests kinks at 3, 9, and 13 (maybe 15
too) factors
– 13 factors is not acceptable because the eigenvalues for
factors 10, 11, 12, and 13 are too low (below 1)

– 3 factors is not ideal because only 34% of the variance is


explained relative to 4, 5, … factor solutions which all have
eigenvalues > 1

– 9 factors seems to be a good compromise

– Regardless, choosing the number of factors also depends


on interpretability of the factors (next part)
16

8
Today’s lecture

Part 1: Basics of Factor Analysis (FA)

Part 2: Running a FA

Part 3: Interpreting a FA solution

Part 4: FA goodness of fit

Part 5: Putting FA to work

17

Deriving and interpreting the factor solution

• The purpose of factor analysis is twofold: data reduction


but also substantive interpretation
o “What are the constructs that underlie the observed
variables? Can we give them an interpretation?”
• Interpretation is done by examining the ‘Component
Matrix’ table and/or ‘Rotated Component Matrix’ table
(handout pp10-11)
o These are called ‘Factor Loadings’, which are nothing
more than the correlations between the original variables
and each of the factors
o If a correlation is high (near -1 or 1), the variable 'loads
high' on that factor; that variable will then be used to
interpret the factor later on
18

9
Rotating factors to facilitate interpretation

Rotation is a transformation of the initial solution, i.e. the


unrotated factor loadings (handout p10), into a new solution
which is easier to interpret

(1) Orthogonal rotation (varimax is most popular)
(2) Oblique rotation

[Diagram: variables plotted on Factor 1 / Factor 2 axes; an orthogonal rotation keeps the rotated axes at 90°]

19

Interpretation of factors: in-class exercise

• Better to use the ‘Rotated Component Matrix’ table.


Identify the significant loadings in each row/column

– Row-wise (i.e. for each variable), circle highest loadings

– Examine “significance” of the circled loadings (rule-of-


thumb: should be (well) over 0.30 in absolute value)

– Underline all other “significant” loadings

• Label each factor by finding a collective name to


describe the items most strongly associated with this
factor

20

10
Interpretation of factors

Factor 1
V11 - family is not too heavily in debt today (0.896)
V12 - pay cash for everything I buy (0.902)
V13 - spend for today & let tomorrow bring what it will (0.937)
V14 - use credit cards because I can slowly pay off bill (0.937)
V15 - seldom use coupons when I shop (0.871)
V16 - interest rates are low enough so I can buy what I want
(0.758)

Possible name: Financial composure


Variance explained by factor: 15.74% (handouts p9)

21

Interpretation of factors

Factor 2

V2 - very good physical condition (0.907)


V3 - dress for fashion, not comfort (0.905)
V4 - have more stylish clothes than my friends (0.826)
V5 - want to look a little different from others (0.648)

Possible name: Style conscious


Variance explained by factor: 9.38%

22

11
Interpretation of factors

Factor 3

V7 - not concerned about the ozone layer (0.764)


V8 - the govt is doing too much to control pollution (0.837)
V9 - basically, society today is fine (0.859)
V10 - don't have time to volunteer for charities (0.859)

Possible name: Societal apathy


Variance explained by factor: 9.37%

23

Interpretation of factors

Factor 4

V22 - American-made cars can't compare with foreign-made


(0.955)
V23 - govt should restrict imports of products from Japan (0.955)
V24 - Americans should always try to buy American products
(0.915)

Possible name: Patriotism


Variance explained by factor: 9.12%

24

12
Interpretation of factors

Factor 5

V29 - sceptical predictions are usually wrong (0.950)


V30 - can do anything I set my mind to (0.955)
V31 - in five years, my income will be a lot higher (0.896)

Possible name: Optimism


Variance explained by factor: 9.10%

25

Interpretation of factors

Factor 6

V6 - life is too short not to take some gambles (0.445)


V25 - would like to take a trip around the world (0.912)
V26 - want to do something different with my life (0.923)
V27 - usually among the first to try new products (0.708)

Possible name: Adventurous


Variance explained by factor: 8.53%

26

13
Interpretation of factors

Factor 7

V17 - have more self-confidence than most of my friends (0.903)


V18 - like to be considered a leader (0.935)
V19 - others often ask me for help (0.877)

Possible name: Opinion leadership


Variance explained by factor: 8.29%

27

Interpretation of factors

Factor 8

V20 - children are the most important thing in a marriage (0.901)


V21 - would rather spend a quiet night at home than go out
(0.900)

Possible name: Family traditionalism


Variance explained by factor: 5.46%

28

14
Interpretation of factors

Factor 9

V28 - like to work hard and play hard (0.618)

Possible name: Endurance


Variance explained by factor: 3.54%

29

Interpretation of the factors -- summary


• Nine psychographic factors:
Factor 1: Financial composure
Factor 2: Style conscious
Factor 3: Societal apathy
Factor 4: Patriotism
Factor 5: Optimism
Factor 6: Adventurous
Factor 7: Opinion leadership
Factor 8: Family traditionalism
Factor 9: Endurance
• These nine factors explain 78.5% of the variance in the
original 30 variables
30

15
Today’s lecture

Part 1: Basics of Factor Analysis (FA)

Part 2: Running a FA

Part 3: Interpreting a FA solution

Part 4: FA goodness of fit

Part 5: Putting FA to work

31

Evaluate the goodness of fit (1)

As for regression models, we need to convince ourselves


that the factor analysis is “good enough” for managerial use

1. Face validity? That is, does the found solution make sense?
Can it be given a reasonable interpretation? [[ subjective! ]]

2. How much of the total variation in the original 30 variables is


explained by the 9 factor solution?

– Here: 78.5% (‘Total Variance Explained’ table handout p9)

3. How much of the variation in each variable is accounted for


by the factor solution? [[ next slide ]]

32

16
Evaluate the goodness of fit (2)
• How much of the variation in each variable is accounted
for by the factor solution?
• Inspect the ‘Communalities’ table (handout p12)
– A ‘0’ means no variation of that variable is explained by the
9 factors [[ could suggest another factor is needed ]]
– A ‘1’ means all the variation is explained by the factor(s)
and variable and factor are the same [[ defies purpose of
factor analysis ]]
– Ideal is somewhere in between
• Variable V6 (“Life is too short not to take some gambles”) is
most poorly captured (communality = 0.37)
• Variable V30 (“I can do anything I set my mind to”) is best
captured (communality = 0.94)
33

Today’s lecture

Part 1: Basics of Factor Analysis (FA)

Part 2: Running a FA

Part 3: Interpreting a FA solution

Part 4: FA goodness of fit

Part 5: Putting FA to work

34

17
Save factor scores for subsequent
analyses (optional)
So, what was all this "Factor Analyzing" good for?
1. It helps us give names to underlying factors or constructs
(‘super variables’) in sets of highly intercorrelated /
multicollinear variables
– Having data on many variables does not mean that we know
what is going on
– Instead, looking at a smaller number of transformed variables
often gives more comprehensible and useful information.

2. Factor analysis provides us with a set of transformed


variables, which may subsequently be used in another
statistical analysis, such as in a regression [[ next slides ]].

35

Save factor scores for subsequent


analyses (optional)
• Use SPSS to compute ‘Factor scores’
– For each observation, obtain an actual value for the (here)
9 factors
– SPSS adds 9 new columns in your data spreadsheet (Data
View) [[ ‘FAC1’, ‘FAC2’, …, ‘FAC9’ ]]
• How are the scores for a factor computed?
– First intuition: take the average of the X’s that load on the
factor
– Better: compute a weighted average keeping in mind the
factor loadings
• Then, use the factor scores as (here: 9) independent
variables in a regression
36

18
Configuring the Viper program using
psychological characteristics (handout pp13-14)

R2 = 0.51, F=47.01 (P-value=0.00), Max VIF=1, Average VIF=1,


all tolerances > 0.1, residuals approx normal
37

Some flavor of previous result in recent ads (Dodge website)


Adventure; patriotism

Style; adventure;
probably not family
traditionalism

Style; optimism (?)


38

19
Factor analysis popular in business research

Marketing

– E.g. Dodge Viper case

Economics

– Next month's employment rate depends on many variables


(interest rates, money supply, jobs created, consumer
confidence, wages, inventory, inflation etc.)

– Summarize state of economy in smaller number of “state” factors

Finance

– Use factor analysis to identify common factors: systematic factors
(market, industry) and non-systematic factors (firm-specific), and use
the factors to describe stock returns

39

Factor analysis review

1. To reduce the number of variables


2. To detect structure in the relationships between
variables; variables “group” together into factors
3. [[ optional ]] Create entirely new set of (quantitative)
‘super variables’ for use in subsequent analysis (e.g.
regressions or cluster analysis)
Today, we did all three, and combined factor analysis with regressions,
which led to fewer variables in our regression equation, eliminating
multicollinearity

[Diagram: the original variables X1–X5 are summarized by factors F1 and F2, which in turn predict Y]
40

20
Next class meeting

• How to reduce the number of rows (~ observations) in


the dataset?

• Group observations together in a “smart way”

– Observations within a group should be as similar as


possible

– Observations belonging to different groups should be as


dissimilar as possible

• Business application: segmentation

– Technique: cluster analysis

– Illustration: MBA music market segmentation

41

21
Statistics and business analytics

Segmentation through cluster analysis

Session 10

Fall 2018, Peter Ebbes

Course announcements
• SPSS Lab 5 of 5 (covers sessions 9 and 10): we’ll have
two lab sessions

– Lab 5.1 (Thu Nov 22) -- final team project “review” /


class wrap up; start with SPSS lab 5

– Lab 5.2 (Tue Nov 27) -- finish + hand in SPSS lab 5

• Course evaluations – please fill out!

1
Today’s lecture

Part 1: Basics of clustering and K-means

Part 2: Choosing the number of clusters

Part 3: Case – MBA music market

Part 4: Clustering wrap-up

MBA music market

A managerial question
o We wish to create CD(s) or playlist(s) for the MBA
market. What would be the best music compilation for
this market?
o Survey (handout p1)
o Assume a genre with a mean (say) over 7.0 should be
included, and one with a mean of 4.0 or less excluded
 For this class the compilation should include:

 For this class the compilation should not include:

2
HEC Paris MBA students music preference

(handout p2)
5

MBA music market

Targeting to the “average” customer: one CD with


Rock, Pop, Classical, Jazz

– Would you be happy with the album targeted to the


“average” customer?
– Could we do better?

Homogeneous market: “one shoe fits all” is old


fashioned

Heterogeneous market: customers behave differently,


and have different needs and wants

3
MBA music market

• Let's use data analytic approaches to try to divide up
the market into groups of consumers that are "alike"
within groups and "different" across groups

• Goal: increase effectiveness by designing an


appropriate business strategy for these subgroups of
consumers (“market segmentation”)

Cluster analysis

A class of statistical / mathematical techniques used to


classify objects into groups
– Objects within a group should be as similar as possible

– Objects belonging to different groups should be as dissimilar as


possible

Ideal Scenario Real-world Scenario

4
Let’s examine Rock vs. Rap for 11 students

Music preferences 11 students: four groups (handout p3)


9

Let’s examine Rock vs. Rap for 11 students


Cluster solution for four clusters
Cluster means – [scatter plot with the centers of clusters 1–4 marked]

Cluster sizes –

Cluster   Count   %
1         2       0.18
2         3       0.27
3         3       0.27
4         3       0.27
Total     11      1.00

A cluster solution consists of at least:
(a) the cluster means/centers
(b) cluster sizes
(c) for each observation, its cluster assignment (a "new" categorical variable)
10

5
Let’s examine Rock vs. Rap for 11 students

• How were they grouped? Visual inspection: employ


some measure of similarity to assess proximity
• What would be a measure of “distance” in this plot?
11

Let’s examine Rock vs. Rap for 11 students

To measure distance we
could use the Euclidean
distance measure, which
measures the length of the
line segment connecting two
points

• Squared Euclidean distance between student i and j:
    d²(i, j) = Σ_k (x_ik − x_jk)²   (here: p = 2 dimensions; x_ik is the coordinate of the i-th
    student on the k-th variable, k = 1, …, p)
• Example 1: distance between S1 and S7 is (8 − 7)² + (8 − 2)² = 37
• Example 2: distance between S1 and S3 is (8 − 9)² + (8 − 8)² = 1
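The two worked examples in a few lines of Python, for illustration (the coordinates are the Rock/Rap ratings read off the slide).

def sq_euclidean(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

s1, s3, s7 = (8, 8), (9, 8), (7, 2)   # (Rock, Rap) ratings for students S1, S3, S7
print(sq_euclidean(s1, s7))           # (8-7)^2 + (8-2)^2 = 37
print(sq_euclidean(s1, s3))           # (8-9)^2 + (8-8)^2 = 1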
12

6
Let’s examine Rock vs. Rap for 11 students

Proximity matrix: distances between all pairs of students (handout p4)

13

Clustering main idea/approach


• We managed to cluster 11 observations into four clusters
‘by the eye’
• But what if we have many more observations (students)
and many more dimensions (genres)? That is, how do we
assign n students to clusters based on p variables?
• We need a computer algorithm to do that for us
– Objective: assign observations to clusters such that there
will be as much similarity within a cluster and as much
difference between clusters as possible

• A well known approach: K-means clustering


[[ FYI there are about a “gazillion” approaches ]]

14

7
K-means clustering

• It attempts to find clusters that are most compact, in


terms of the [[ square of the Euclidian ]] distance of each
observation to the center of each cluster

• Fairly simple, fast, and reliable algorithm to assign


observations to clusters (see appendix)

• A basic example of unguided / unsupervised


“machine” learning – discover unknown groupings
just from the structure in our data

• Only input variable required: the number of clusters

15

K-means clustering

• How to choose the number of clusters?

• Good news: there are many different ways to do this

• Bad news: the different ways tend to give different


answers

• Reasonable approach: rely on techniques from


machine learning for guidance [[ next part ]]

– Note: need decent sample size and fast computer

16

8
Today’s lecture

Part 1: Basics of clustering and K-means

Part 2: Choosing the number of clusters

Part 3: Case – MBA music market

Part 4: Clustering wrap-up

17

Cross-validation to get a feel for number of


clusters in the data

• Important approach in data mining/machine learning


approaches to validate the model (e.g. Ledolter, 2013)
• Basic idea:
– split the dataset in a training dataset and test dataset

– fit the model (e.g. cluster model with K-means) on the


training dataset

– test the fitted model (e.g. the found cluster structure) on


the test dataset

• Main goal: how does the model/approach generalize to a


new independent dataset; to prevent overfitting

18

9
Cross-validation to get a feel for number of
clusters in the data
• Run K-means for k = 1, 2, …, K clusters on the training
dataset. Then, examine for each choice of k how well the cluster
solution fits in the test dataset
• Plot the fit statistics against k = 1, 2, …, K
– Disclaimer: many fit statistics, no agreement on which is best
– Distance measure: choose the k that corresponds to the elbow
or kink
– Model-based proxies based on Akaike and Bayes Information
Criteria: assumes that variables are (approximately) independent
and normally distributed within each cluster; penalizes for the
number of parameters; choose the k that gives the smallest AIC or BIC
– R-square based proxies: (sloppy) how much variance of the
variables can be explained by the cluster solution? Choose the k
that gives the highest R-square value
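A bare-bones version of this idea in Python, sketched under the assumption that `prefs` holds the rating data: fit K-means on a training half for k = 1..10 and score each solution on the held-out half by the total squared distance to the nearest fitted center; the "elbow" in that curve suggests a range for k.

from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

train, test = train_test_split(prefs, test_size=0.5, random_state=1)
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=25, random_state=1).fit(train)
    # score() returns minus the within-cluster sum of squares, evaluated on new data
    print(k, -km.score(test))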

19

In-class exercise 1&2

• I applied this approach to two (fake) datasets with n = 500,
with 16 genres (slide 5), where I know the true
number of clusters and where the cluster solution is well-
defined in the data.

• The results are in your handouts on pp5-6

• Exercise 1: the first example, is similar to the case we


discussed earlier in class (slide 9). How do the fit
statistics support the four cluster solution?

• Exercise 2: how many clusters do we seem to have in


the data for the other example, as suggested by the fit-
statistics?

20

10
In-class exercise 1
Scatter plot (handout p5)

Scatter plot: observed preferences Rock and Rap/Hip-Hop (e.g. slide 9)


21

In-class exercise 1
Fusion diagram (handout p5)

Reading from left to right:


Starting with 1 cluster,
adding a second cluster
reduces the distance
coefficient considerably.
Going from 4 to 5 clusters,
we are (relatively speaking)
not reducing the distance
coefficient much further.
A four-cluster structure
seems to describe this
data best

[[ The distance coefficient measures the average distance of each


observation to its cluster center ]]
22

11
In-class exercise 1

(handout p5)

Minimum for all three model based proxies occurs at a four cluster
solution; BIC penalizes here the most for number of parameters
23

In-class exercise 1

(handout p6)

When the number of clusters reaches 4, there is no further improvement in
the R2-based metrics (in fact, the adj. R2 measures decrease for k > 4)
24

12
Run K-means to get four cluster solution
(In-class exercise 1, cont.)

Green = cluster 1; Blue = cluster 2


Black = cluster 3; Red = cluster 4
Cl. 1 Cl. 2 Cl. 3 Cl. 4
Rock 1.91 9.02 2.00 8.97
Rap/HH 9.02 9.00 2.05 1.99
Size 24% 28% 25% 23%

The algorithm does what we


could do by the eye, it assigns
each of the 500 observations to
one of four clusters (colors)

The Rock/Rap preferences of


the 500 observations are
described well by these four
clusters
25

In-class exercise 2: how many clusters?

Fit statistics on
handout p6

Scatter plot: observed preferences Rock and Rap/Hip-Hop – 1 cluster


26

13
In-class exercise 2: how many clusters?
Cl. 1 Cl. 2
Red = cluster 1; Black = cluster 2
Rock 5.07 4.99
Rap/HH 4.88 5.12
Size 52% 48%

Warning: K-means WILL give
you a solution – here I ran it
for k = 2

Even though there are clearly


no clusters in the data, the
algorithm will give you
something

The fit statistics indicated that
k = 1 is the 'best' cluster
solution (i.e. no clusters)
27

In sum: choosing the number of clusters


• The previous examples were very clear and obvious:
real data generally does not behave like that
• Therefore, choosing the number of clusters is often
highly subjective, and should always be discussed as
part of a cluster analysis. Usually a combination of:
– Examining ‘fit metrics’ (e.g. previous approaches based on
cross-validation)
– Choosing a range for k instead of one number
– Examining cluster solutions (cluster means/sizes, cluster
assignments) for three or four choices of k

• As final solution, we choose the cluster solution that can


be supported by the data, and at the same time can be
used to develop appropriate business strategies
28

14
Implementing K-means

SPSS’ implementation of K-means is quite poor

– It does not randomize starting values of the K-means


algorithm [[ e.g. appendix ]]

– It cannot do cross-validation to get a feel for the number


of clusters (which has to be part of cluster analysis)

– Use our browser version (based on R/Shiny):

http://rstudio-test.hec.fr/kmeans/

(see “how to guide” today’s session)


29

Today’s lecture

Part 1: Basics of clustering and K-means

Part 2: Choosing the number of clusters

Part 3: Case – MBA music market

Part 4: Clustering wrap-up

30

15
So, where does this leave us for the MBA
music market case?
• Data:
– 801 students (seven cohorts) provided liking responses on
a 10-point scale for 16 musical types (slide 5; handout pp1-2)
• Managerial question:
– Are there segments of students who might best be
targeted differently with different music playlists/CDs?
– If so, who are they?
• To address the managerial question, we run a cluster
analysis. We need to provide (at the minimum): (a) a
discussion of the # clusters, and (b) a discussion of the
cluster solution (cluster centers, cluster sizes, cluster
assignments)
31

MBA music market clustering example


[[ In-class exercise 3&4&5 ]]
• In-class exercise 3 (statistics): examine the fit metrics; how
many clusters would you recommend? (handout p7)

• In-class exercise 4 (marketing/product manager): examine the


cluster solution that I ran (handout pp8—9)
– How many clusters did I choose? Do you agree?
– How would you describe each cluster?
– What type of playlists/CDs would you market? Which genre(s) would
you include? Which genre(s) would you not include? How large are the
potential clusters?

• In-class exercise 5 (individual targeting): in what cluster do


you fall (if you filled out questionnaire)? Do you agree? How
well do you fit in the cluster? (handout pp11—12)

32

16
MBA music market (in-class exercise 3)
• How many clusters could be supported by the data?

• Use the cross-validation fit statistics to get a feel for the


number of clusters
– Look for kinks in the fusion plot; find the lowest AIC/BIC; find the
highest adjusted R2

– Usually the metrics do not agree, and the best you can do is to
give a range for k

– For the class case, we could argue anywhere from 3—7 clusters

• We would then need to get the cluster solution for each of
k = 3, 4, 5, 6, and 7, and examine each for managerial relevance

• I started with the k = 4 solution to get an initial idea of the
MBA music market segments (handouts pp8—12; next slides)

33

MBA music market (in-class exercise 4)


            Cluster 1        Cluster 2        Cluster 3        Cluster 4
Include:
Exclude:
N
34

17
MBA music market (in-class exercise 4)
            Cluster 1        Cluster 2        Cluster 3        Cluster 4
Include     Rock             Pop              Classical        RapHipHop
            Jazz             Rock             BroadwayMovies   Pop
            Classical        Jazz             Jazz             RnB
                             Classical        Folk             Rock
                             Blues
                             RnB
                             RapHipHop
                             TechnoDance
                             BroadwayMovies

Exclude     Reggae           ChristianGospel  NewAge           NewAge
            RnB              Kids             Reggae           Folk
            NewAge                            TechnoDance      Country
            Folk                              RnB              ChristianGospel
            RapHipHop                         Kids             Kids
            Country                           RapHipHop
            ChristianGospel
            Kids

N           161 (20%)        246 (31%)        170 (21%)        224 (28%)

35

How would this improve the initial CD (slide 5)?


(Same include/exclude table as on the previous slide.)

36

18
MBA music market (in-class exercise 4)

Based on this cluster solution we could propose four CDs


or playlists (albums) to market at HEC Paris.

Cluster 1: CD "Classic rock" featuring Rock, Classical, Jazz

Cluster 2: CD “We want it all” featuring a mix of the most popular Pop,
Rock, Jazz, Classical, Blues, RnB, RapHipHop, TechnoDance,
BroadwayMovies

Cluster 3: CD “The Sophisticates” featuring Jazz, Classical, Folk,


Broadway movies

Cluster 4: CD “Party People” featuring Rap, Hip Hop, RnB [[ and a bit
of Techno/Dance ]]

37

MBA music market clustering

In addition to describing/naming the four clusters and


the sizes of the clusters:

• Examine visualizations of the cluster solution; these help


present the results and judge the merit of the solution
(handout p10)

• Find the cluster assignment for each respondent in the


dataset; this potentially helps to individually
market/target products, and identify which respondent
best represents the cluster (in-class exercise 5;
handouts pp11—12)

38

19
Today’s lecture

Part 1: Basics of clustering and K-means

Part 2: Choosing the number of clusters

Part 3: Case – MBA music market

Part 4: Clustering wrap-up

39

Scatter plot MBA preferences R/H vs Country

Uni/bivariate statistics for these two quantitative variables: can


we learn anything useful for decision making?
40

20
Scatter plot MBA preferences R/H vs Country

Not much! The mean (stdev) for RHH and Country are 5.9 (2.7) and 5.2 (2.6),
and the correlation is 0.03 (P-value = 0.47). Doesn't help much for decision making.
41

Scatter plot MBA preferences R/H vs Country

Black – cl 1
Blue – cl 2
Red – cl 3
Green – cl 4

But, a cluster solution with four clusters gives insights that may
be useful for marketing decision making
42

21
Cluster analysis in sum…

• Separating cases into clusters so that cases in the same


cluster are similar to one another, and different from cases
in the other clusters
• Very judgmental: what variables to cluster on [[ here: only
quantitative ]], what clustering approach [[ here: only K-means ]],
how many clusters [[ here: only cross-validation ]], and how to
interpret each cluster [[ ~segment? ]]
• Good practice: examine reliability and validity
o Run solutions for different number of clusters, cluster
approaches/algorithms, and starting values of the algorithms
o Judge the extent to which solutions can be interpreted and used
• Final solution is the one that makes the most sense and
can be used to develop appropriate business strategies
43

Cluster analysis in sum…


• Is there even more to say about cluster analysis? You
bet!
• Business strategy needs to recognize that customer
needs and wants are heterogeneous (marketing!)
– K-means clustering is useful (despite the limitations): e.g.
part-time MBA team at Air France (2014)
– A statistically more sound approach comes from finite
mixture // latent class models: e.g. we developed a strategy
for Intel’s processors based on these finite mixture models
– But need specialized software (e.g. R, Matlab, GAUSS)
• What’s up next: SPSS lab 5 and quiz 5!
– Factor analysis and K-means clustering

44

22
APPENDIX

Appendix 1: K-means algorithm

Appendix 2: K-means in SPSS

45

Appendix 1: K-means algorithm


It attempts to find clusters that are most compact, in terms of the
[[ square of the Euclidian ]] distance of each observation to the
center of each cluster

The clustering proceeds as follows:


1. Choose starting seeds; “individuals” who are quite different from
one another [[ best practice: use some randomization; multiple starts ]]

2. Go over the sample and allocate each observation to the closest


seed

3. Compute the mean of each cluster. These means will be the


new seeds.

4. If the new seeds are close to the previous, stop. Otherwise, repeat
steps 2 and 3.

[[ note: in step 2, all observations are re-assigned to the clusters ]]
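The four steps above as a compact Python sketch, for illustration only (in practice you would use scikit-learn's KMeans); X is assumed to be a NumPy array with one row per observation.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    seeds = X[rng.choice(len(X), size=k, replace=False)]        # step 1: random starting seeds
    for _ in range(n_iter):
        dist = ((X[:, None, :] - seeds[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)                            # step 2: assign each observation to the closest seed
        new_seeds = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else seeds[j]
                              for j in range(k)])               # step 3: cluster means become the new seeds
        if np.allclose(new_seeds, seeds):                       # step 4: stop when the seeds settle
            break
        seeds = new_seeds
    return seeds, labels

# Best practice (as noted above): run this from several random seeds and keep the most compact solution.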

46

23
Appendix 2: K-means in SPSS
• Two challenges: choosing [[ but cross-validation can help ]], and
more importantly, choosing the starting seeds (step 1 previous
slide) is tricky
• Final solution tends to be sensitive to choice of starting seeds
particularly for small(er) samples, many clustering variables, and
a “messy” cluster structure
• SPSS’ implementation of K-means is quite poor
– By default, it uses the first observations in your data spreadsheet
as starting values
– Hence, (randomly) re-arranging the rows of your data spreadsheet
could lead to (very) different cluster solutions
– Therefore, you should never-ever rely on a single clustering
solution from one set of starting values
• Because SPSS has no automated option for “randomizing”
starts, its K-means implementation is not recommended
47

24
