
Summarizing and Visualizing Data

Ravindra S. Gokhale IIM Indore



Some Examples of Business Situations addressed using Statistics

- A business school A claims that its average starting offers are higher than those of another business school B. Is the claim true?
- A plant X has two assembly lines. Employees face one of three kinds of accidents: sprain, cut, or burn. Do the accident patterns with respect to type differ between the two assembly lines?
- A bank wants to assess the creditworthiness of an applicant. Should it approve the loan or reject it?



- A company is considering two price points (Rs. 25000 versus Rs. 35000) for its software. It plans two different promotion campaigns (speed of the product versus computational power of the product). It proposes two types of customer incentives (30-day free trial versus a free gift of related software). Which strategy should it adopt to maximize sales?
- An HR manager wants to verify the claim that increased compensation increases the motivation of employees irrespective of their age, gender, experience level, etc. Is the claim true?

Approaches for Statistical Problems


Two approaches, used jointly in most cases:

- Descriptive statistics
- Inferential statistics

Descriptive statistics
- Used to describe the main features of a collection of data quantitatively
- Aims to summarize a data set quantitatively without employing a probabilistic formulation

Inferential statistics

- Aims at drawing conclusions from data that is subject to random variation
- Used for: Estimation; Hypothesis testing; Predicting/forecasting

Some Basic Concepts



- Types of Variables
- Scales of Measurement
- Measures of Central Tendency
- Measures of Dispersion
- Skewness and Kurtosis
- Quartiles
- Histogram
- Box Plot
- Scatter Plot
- Stem and Leaf Display
- Pie Chart and Bar Graph

Types of Variables
Quantitative variables: A quantitative variable can be described by a number for which arithmetic operations such as averaging make sense.

Qualitative variables: A qualitative (or categorical) variable simply records a quality. If a number is used to distinguish members of different categories of a qualitative variable, the number assignment is arbitrary.

Scales of Measurement
Nominal Scale
e.g. North = 1, East = 2, South = 3, West = 4

Ordinal Scale
e.g. Very good = 4, Good = 3, Fair = 2, Unacceptable = 1

Interval Scale
- The zero of the measurement is assigned arbitrarily, and therefore we cannot take ratios of two measurements.
- But we can take ratios of intervals.
- e.g. 100 deg C is not twice as hot as 50 deg C.

Ratio Scale
- We can take ratios of the measurements.
- The zero in this scale is an absolute zero.
- e.g. money: a sum of Rs. 100 is twice as large as Rs. 50.
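The interval-versus-ratio distinction can be checked numerically. The sketch below is only an illustration: it uses the standard Celsius-to-Fahrenheit and Celsius-to-Kelvin conversions, and the helper names are made up for this example.

```python
def c_to_f(celsius):
    """Convert Celsius to Fahrenheit (another interval scale)."""
    return celsius * 9 / 5 + 32


def c_to_k(celsius):
    """Convert Celsius to Kelvin (a ratio scale with an absolute zero)."""
    return celsius + 273.15


# The ratio of two readings depends on where the arbitrary zero sits.
ratio_celsius = 100 / 50                        # 2.0, but not meaningful
ratio_fahrenheit = c_to_f(100) / c_to_f(50)     # 212 / 122
ratio_kelvin = c_to_k(100) / c_to_k(50)         # 373.15 / 323.15

# The ratio of two intervals is the same on both interval scales.
interval_ratio_c = (100 - 50) / (50 - 25)
interval_ratio_f = (c_to_f(100) - c_to_f(50)) / (c_to_f(50) - c_to_f(25))

print(ratio_celsius, ratio_fahrenheit, ratio_kelvin)
print(interval_ratio_c, interval_ratio_f)
```

The three ratios of readings disagree, while the two ratios of intervals agree.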

Measures of Central Tendency


A statistician had his head in an oven and his feet in ice, and he said that on the average he feels fine.



- Relates to the way in which quantitative data tend to cluster around some value: the central value
- Most commonly used measures: Mean (μ), Median, Mode
- All have the same unit as that of the data.

What is the purpose of different measures of central tendency?



An outlier affects the mean but does not affect the median
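A quick way to see this is with Python's standard `statistics` module. The offers below are made-up numbers used only for illustration.

```python
from statistics import mean, median, mode

# Hypothetical starting offers; the values are invented for this sketch.
offers = [10, 12, 12, 14, 17]

# Same data, with the largest value replaced by an extreme one.
offers_with_outlier = [10, 12, 12, 14, 170]

print("mean:  ", mean(offers), "->", mean(offers_with_outlier))
print("median:", median(offers), "->", median(offers_with_outlier))
print("mode:  ", mode(offers), "->", mode(offers_with_outlier))
```

The mean moves from 13 to 43.6, while the median and the mode both stay at 12.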


Measures of Dispersion
- Dispersion indicates the variability or spread in a variable
- Most commonly used measures: Variance; Standard deviation; Inter-quartile range

Variance - describes how far the values lie from the mean

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$

Standard deviation: the square root of the variance

Denoted by σ. Standard deviation is given by:

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$

- Low σ: data points tend to be very close to the mean
- High σ: data is spread out over a large range of values
- Sample standard deviation (s) is used as an estimator of σ:

$$s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2}$$

[What is a sample? What is an estimator? Why N - 1 in the denominator instead of N?]
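The two denominators can be compared directly. The sketch below uses made-up data and checks the hand-written formulas against Python's `statistics` module, where `pvariance`/`pstdev` divide by N and `variance`/`stdev` divide by N - 1. Dividing by N - 1 is the usual correction that makes the sample variance an unbiased estimator of the population variance.

```python
import math
from statistics import pstdev, pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values; their mean is 5
n = len(data)
x_bar = sum(data) / n
sum_sq_dev = sum((x - x_bar) ** 2 for x in data)  # sum of squared deviations

pop_var = sum_sq_dev / n           # population formula: divide by N
sample_var = sum_sq_dev / (n - 1)  # sample formula: divide by N - 1

print("population variance, sd:", pop_var, math.sqrt(pop_var))
print("sample variance, sd:    ", sample_var, math.sqrt(sample_var))

# The library functions follow the same two conventions.
print(pvariance(data), pstdev(data), variance(data), stdev(data))
```

For a fixed data set the N - 1 version is always slightly larger, and the gap shrinks as N grows.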

Skewness and Kurtosis


Skewness: a measure of asymmetry of the data

- Can be positive, negative, or undefined
- Negative skew: the tail is to the left
- Positive skew: the tail is to the right

Kurtosis: a measure of the "peakedness" of the data

- High kurtosis: sharper peak and longer, fatter tails
- Low kurtosis: rounded peak and shorter, thinner tails
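The slides give no formulas for these two measures. The sketch below assumes one common convention, the moment-based definitions (third and fourth central moments scaled by powers of the variance), with 3 subtracted from kurtosis so that a normal distribution scores 0. Other conventions exist, and the function names and data are illustrative only.

```python
import math


def central_moment(data, order):
    """Average of (x - mean) ** order over the data."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** order for x in data) / len(data)


def skewness(data):
    """Moment-based skewness; undefined (NaN) when all values are equal."""
    m2 = central_moment(data, 2)
    if m2 == 0:
        return math.nan
    return central_moment(data, 3) / m2 ** 1.5


def excess_kurtosis(data):
    """Moment-based kurtosis minus 3; undefined (NaN) for constant data."""
    m2 = central_moment(data, 2)
    if m2 == 0:
        return math.nan
    return central_moment(data, 4) / m2 ** 2 - 3


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 2, 3, 3, 3, 4, 12]        # long tail to the right
left_tailed = [-x for x in right_tailed]        # mirror image

for name, values in [("symmetric", symmetric),
                     ("right-tailed", right_tailed),
                     ("left-tailed", left_tailed)]:
    print(name, skewness(values), excess_kurtosis(values))
```

The constant-data case shows one way skewness can be "undefined": the variance in the denominator is zero.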


Quartiles
Quartiles are the three values which divide the sorted data set into four equal parts

Each part then represents one fourth of the sampled data

The three quartiles:

- First quartile (Q1) = lower quartile: cuts off the lowest 25% of the data
- Second quartile (Q2) = median: cuts the data set in half
- Third quartile (Q3) = upper quartile: cuts off the highest 25% of the data

No universal method for calculating quartiles
One method: L_k = N × (k / 4), where k = 1 for Q1, 2 for Q2, 3 for Q3
- If L_k is a whole number, then Q_k = average of the values at positions L_k and L_k + 1 in the sorted data
- If L_k is not a whole number, then Q_k = value at the position obtained by rounding L_k up to the next whole number

Other method:
- The median of the data set gives Q2
- Divide the data set into two halves. [If the original set has an odd number of data points, include the median in both halves.] The medians of the lower and upper halves give Q1 and Q3 respectively
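The two methods above can be sketched as follows. The function names and the seven-point data set are made up for illustration; the data was chosen because the two methods disagree on it, which shows why there is no universal answer.

```python
import math
from statistics import median


def quartile_by_position(data, k):
    """First method: L_k = N * k / 4 gives a 1-based position in sorted data."""
    values = sorted(data)
    n = len(values)
    if (n * k) % 4 == 0:
        # L_k is a whole number: average positions L_k and L_k + 1.
        pos = n * k // 4
        return (values[pos - 1] + values[pos]) / 2
    # L_k is not whole: round up to the next position.
    pos = math.ceil(n * k / 4)
    return values[pos - 1]


def quartiles_by_halves(data):
    """Second method: medians of the lower and upper halves."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half + (n % 2)]  # include the median when n is odd
    upper = values[half:]
    return median(lower), median(values), median(upper)


sample = [3, 5, 7, 8, 12, 13, 14]
print([quartile_by_position(sample, k) for k in (1, 2, 3)])
print(quartiles_by_halves(sample))
```

On this sample the position method gives Q1 = 5 and Q3 = 13, while the halves method gives Q1 = 6.0 and Q3 = 12.5; both agree that Q2 = 8.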

Interquartile range: IQR = Q3 - Q1

IQR is a more robust measure of variability: it is not affected much by skewness or outliers
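A small comparison illustrates the robustness claim. This sketch uses `statistics.quantiles` with `method="inclusive"`, which interpolates between data points and is therefore yet another quartile convention, not one of the two methods described earlier. The readings are made up.

```python
from statistics import pstdev, quantiles


def iqr(data):
    """Interquartile range using the library's interpolated quartiles."""
    q1, _q2, q3 = quantiles(data, n=4, method="inclusive")
    return q3 - q1


readings = [10, 12, 13, 14, 16, 18, 20]
readings_with_outlier = [10, 12, 13, 14, 16, 18, 200]  # 20 replaced by 200

print("IQR:    ", iqr(readings), "->", iqr(readings_with_outlier))
print("std dev:", pstdev(readings), "->", pstdev(readings_with_outlier))
```

The IQR is 4.5 for both data sets, while the standard deviation grows many times over.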


Histogram
A graphical display of tabular frequencies

- Shown in the form of adjacent rectangles
- Provides a good visual representation of the distribution of data

Important factor to consider during construction:

What is the ideal number of bins (denoted by k) and the corresponding (equal) width of each interval (denoted by h)?
- Small bin width: too many bins, difficult to interpret
- Large bin width: too few bins, loss of information

Some rules of thumb:

- Sturges' formula: k = ceiling[log2(N) + 1] = ceiling[(ln N / ln 2) + 1]
- Scott's choice: h = (3.5 × sample standard deviation) / (cube root of N)
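Both rules of thumb translate directly into code. The function names and the small data set below are illustrative; the sample sizes 42 and 804 are used only as example values of N.

```python
import math
from statistics import stdev


def sturges_bins(n):
    """Number of bins k = ceiling(log2(N) + 1)."""
    return math.ceil(math.log2(n) + 1)


def scott_width(data):
    """Bin width h = 3.5 * s / N ** (1/3), with s the sample standard deviation."""
    return 3.5 * stdev(data) / len(data) ** (1 / 3)


print(sturges_bins(42))    # N = 42
print(sturges_bins(804))   # N = 804

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values
print(scott_width(data))
```

Sturges' formula depends only on N, so it gives 7 bins for any 42 observations and 11 bins for any 804, while Scott's choice also reacts to how spread out the data is.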

Identifying relation between mean, median, and shape of a histogram

- Symmetric: mean ≈ median
- Left (or negatively) skewed: mean < median (generally)
- Right (or positively) skewed: mean > median (generally)


Box Plot
Also known as a box-and-whisker plot
Provides a five-number summary:

- The smallest observation (minimum)
- Lower quartile (Q1)
- Median (Q2)
- Upper quartile (Q3)
- Largest observation (maximum)

Also indicates outlier observations, if any
Spacings between the different parts of the box help indicate the degree of dispersion (spread) and skewness in the data, and help identify outliers

Box plots are very effective for comparing values across two or more categories

Box Plot (cont)


[Figure: annotated box plot. Labels, from top to bottom:]
- Outliers*
- Upper whisker end: min(Xmax, Q3 + 1.5 × IQR)
- Q3
- Q2
- Q1
- Lower whisker end: max(Xmin, Q1 - 1.5 × IQR)

*Note: Generally this is the value beyond which readings are considered outliers. However, there is no universal definition.
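The whisker limits and the outlier rule can be sketched as below. The quartiles come from `statistics.quantiles` with `method="inclusive"`, which is an assumption of this example (any quartile convention could be substituted), and the readings are made up.

```python
from statistics import quantiles


def box_plot_limits(data):
    """Return (lower limit, upper limit, outliers) using the 1.5 x IQR rule."""
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower = max(min(data), q1 - 1.5 * iqr)
    upper = min(max(data), q3 + 1.5 * iqr)
    outliers = [x for x in data if x < lower or x > upper]
    return lower, upper, outliers


readings = [10, 12, 13, 14, 16, 18, 200]
lower, upper, outliers = box_plot_limits(readings)
print(lower, upper, outliers)
```

Here Q1 = 12.5 and Q3 = 17, so the upper limit is 17 + 1.5 × 4.5 = 23.75 and the reading of 200 is flagged as an outlier. The lower limit is the minimum itself, 10, because Q1 - 1.5 × IQR falls below it.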

Scatter Plot
- Displays values for two variables for a set of data
- Provides a visual representation of the relationship between two variables


Stem and Leaf Display


Contains the features of a histogram, and is more informative than a histogram.
e.g. The stem and leaf display for 11, 12, 12, 13, 15, 15, 15, 16, 17, 20, 21, 21, 21, 22, 22, 22, 23, 24, 26, 27, 27, 27, 28, 29, 29, 30, 31, 32, 34, 35, 37, 41, 41, 42, 45, 47, 50, 52, 53, 56, 60, 62 will be given as:

1 | 122355567
2 | 0111222346777899
3 | 012457
4 | 11257
5 | 0236
6 | 02
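The display can be generated mechanically by using the tens digit as the stem and the units digit as the leaf. This is a minimal sketch for two-digit whole numbers only; the function name is made up, and stems with no values are simply omitted.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return display lines, using tens digits as stems and units as leaves."""
    stems = defaultdict(list)
    for value in sorted(values):
        stems[value // 10].append(value % 10)
    return [
        f"{stem} | {''.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(stems.items())
    ]


data = [11, 12, 12, 13, 15, 15, 15, 16, 17, 20, 21, 21, 21, 22, 22, 22,
        23, 24, 26, 27, 27, 27, 28, 29, 29, 30, 31, 32, 34, 35, 37, 41,
        41, 42, 45, 47, 50, 52, 53, 56, 60, 62]

lines = stem_and_leaf(data)
print("\n".join(lines))
```

Unlike a histogram, every original value can be read back from the display.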


Pie Chart and Bar Graph


A pie chart is the most illustrative way of displaying quantities as percentages of a given total. Pie charts are used to present frequencies for categorical data. The scale of measurement may be nominal or ordinal.

Bar graphs are often used to display categorical data where there is no emphasis on the percentage of a total represented by each category. The scale of measurement is nominal or ordinal.
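Both charts start from the same frequency table: a bar graph plots the counts, and a pie chart plots each count as a percentage of the total. The sketch below builds that table with `collections.Counter`; the ratings are made-up responses on an ordinal scale.

```python
from collections import Counter

# Hypothetical survey responses; invented for this example.
ratings = ["Good", "Very good", "Good", "Fair", "Good",
           "Unacceptable", "Very good", "Fair"]

counts = Counter(ratings)              # heights of the bars
total = sum(counts.values())
shares = {category: 100 * count / total  # slices of the pie, in percent
          for category, count in counts.items()}

for category, count in counts.most_common():
    print(f"{category:<13} {count:>2}  {shares[category]:5.1f}%")
```

The percentages sum to 100, which is what makes the pie chart appropriate for showing parts of a whole.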


An Illustration
NAME: Car Data
TYPE: Multiple Regression
SIZE: 804 observations, 12 variables

DESCRIPTIVE ABSTRACT: Data collected from Kelley Blue Book for several hundred used 2005 GM cars allow students to develop a multivariate regression model to determine car value based on a variety of characteristics such as miles driven, make, model, engine size, interior style, cruise control, etc.

SOURCES:

For this data set, a representative sample of over eight hundred 2005 GM cars was selected; then an algorithm was developed following the 2005 Central Edition of the Kelley Blue Book to estimate retail price.

VARIABLE DESCRIPTIONS:
- Price: suggested retail price of the used 2005 GM car in excellent condition. The condition of a car can greatly affect price. All cars in this data set were less than one year old when priced and considered to be in excellent condition.
- Miles: number of miles the car has been driven
- Make: manufacturer of the car such as Saturn, Pontiac, and Chevrolet
- Model: specific models for each car manufacturer such as Ion, Vibe, Cavalier
- Trim (of car): specific type of car model such as SE Sedan 4D, Quad Coupe 2D

- Type: body type such as sedan, coupe, etc.
- Cylinder: number of cylinders in the engine
- Liter: a more specific measure of engine size
- Doors: number of doors
- Cruise: indicator variable representing whether the car has cruise control (1 = cruise)
- Sound: indicator variable representing whether the car has upgraded speakers (1 = upgraded)
- Leather: indicator variable representing whether the car has leather seats (1 = leather)
