Topic1 Summarizing and Visualizing Data PDF

Summarizing and Visualizing Data
Ravindra S. Gokhale IIM Indore

Some Examples of Business Situations addressed using Statistics
- A business school A claims that its average starting offers are higher than those of another business school B. Is the claim true?
- A plant X has two assembly lines. Employees face one of three kinds of accidents: sprains, cuts, or burns. Do the accident patterns with respect to type differ between the two assembly lines?
- A bank wants to assess the creditworthiness of an applicant. Should it approve the loan or reject it?
- A company is planning two price points (Rs. 25000 versus Rs. 35000) for its software. It plans two different promotion campaigns (speed of the product versus computational power of the product). It proposes two types of customer incentives (30-day free trial versus a free gift of related software). What strategy should be adopted to maximize sales?
- An HR manager wants to verify the claim that increased compensation increases the motivation of employees irrespective of their age, gender, experience level, etc. Is the claim true?
Approaches for Statistical Problems

Two approaches; they are used jointly in most cases:
- Descriptive statistics
- Inferential statistics
Descriptive statistics
- Used to describe the main features of a collection of data quantitatively
- Aims to summarize a data set quantitatively without employing a probabilistic formulation
Inferential statistics
- Aims at drawing conclusions from data that is subject to random variation
- Used for: estimation; hypothesis testing; predicting/forecasting
Some Basic Concepts
- Types of Variables
- Scales of Measurement
- Measures of central tendency
- Measures of dispersion
- Skewness and Kurtosis
- Quartiles
- Histogram
- Box plot
- Scatter plot
- Stem and Leaf Display
- Pie Chart and Bar Graph
Types of Variables
Quantitative Variables: A quantitative variable can be described by a number for which arithmetic operations such as averaging make sense.
Qualitative Variables: A qualitative (or categorical) variable simply records a quality. If a number is used for distinguishing members of different categories of a qualitative variable, the number assignment is arbitrary.
Scales of Measurement
Nominal Scale
e.g. North = 1, East = 2, South = 3, West = 4
Ordinal Scale
e.g. Very good = 4, Good = 3, Fair = 2, Unacceptable = 1
Interval Scale
- The value of zero is assigned arbitrarily, and therefore we cannot take ratios of two measurements.
- But we can take ratios of intervals.
- e.g. 100 deg C is not twice as hot as 50 deg C.
Ratio Scale
- We can take ratios of those measurements.
- The zero in this scale is an absolute zero.
- e.g. money: a sum of Rs. 100 is twice as large as Rs. 50.
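The difference between the interval and ratio scales can be checked numerically. The sketch below is only an illustration: the helper `c_to_f` and the chosen temperatures are assumptions, not part of the original slides. It shows that the ratio of two temperature readings changes when the arbitrary zero changes (Celsius versus Fahrenheit), while the ratio of two intervals does not.

```python
def c_to_f(celsius):
    """Convert Celsius to Fahrenheit (a different, equally arbitrary zero)."""
    return celsius * 9 / 5 + 32

# Ratio of two measurements: depends on where the zero is placed.
ratio_c = 100 / 50                    # 2.0 in Celsius
ratio_f = c_to_f(100) / c_to_f(50)    # 212 / 122 in Fahrenheit, not 2.0

# Ratio of two intervals: the same on both scales.
interval_ratio_c = (100 - 50) / (50 - 25)
interval_ratio_f = (c_to_f(100) - c_to_f(50)) / (c_to_f(50) - c_to_f(25))

print(ratio_c, ratio_f)
print(interval_ratio_c, interval_ratio_f)
```

Because the two measurement ratios disagree, "twice as hot" has no meaning on an interval scale; the interval ratios agree, which is why ratios of intervals are allowed.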
Measures of Central Tendency
A statistician had his head in an oven and his feet in ice, and he said that on the average he feels fine.

- Relates to the way in which quantitative data tend to cluster around some value (the central value)
- Most commonly used measures: mean (μ), median, mode
- All have the same unit as that of the data
What is the purpose of different measures of central tendency?

An outlier affects the mean but has little or no effect on the median
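A small numerical sketch of this point, using Python's standard `statistics` module. The salary figures are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical starting offers (in thousands); not taken from the slides.
offers = [30, 32, 35, 35, 38, 40, 41]
offers_with_outlier = offers + [400]   # one extreme offer added

mean_before = statistics.mean(offers)
mean_after = statistics.mean(offers_with_outlier)

median_before = statistics.median(offers)                # 35
median_after = statistics.median(offers_with_outlier)    # 36.5

mode_before = statistics.mode(offers)                    # 35
mode_after = statistics.mode(offers_with_outlier)        # 35

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
print("mode:  ", mode_before, "->", mode_after)
```

The single extreme value more than doubles the mean, while the median moves only from 35 to 36.5 and the mode stays at 35.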
Measures of Dispersion
- Dispersion indicates the variability or spread in a variable
- Most commonly used measures: variance; standard deviation; inter-quartile range
Variance - describes how far the values lie from the mean
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$
Standard deviation - square root of the variance
- Denoted by σ
- Standard deviation is given by:
$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$
- Low σ: data points tend to be very close to the mean; high σ: data is spread out over a large range of values
- Sample standard deviation (s) is used as an estimator of σ
$$s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2}$$
[What is a sample? What is an estimator? Why N-1 in the denominator instead of N?]
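The two denominators can be compared directly. The sketch below uses a small hypothetical data set (an assumption for illustration) and computes the variance and standard deviation both ways, then cross-checks against the standard library's `statistics.pvariance` (divides by N) and `statistics.variance` (divides by N-1).

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values
n = len(data)
mean = sum(data) / n               # 5.0

sum_sq_dev = sum((x - mean) ** 2 for x in data)   # 32.0

# Treating the data as the whole population: divide by N.
pop_var = sum_sq_dev / n           # 4.0
pop_sd = math.sqrt(pop_var)        # 2.0

# Treating the data as a sample used to estimate sigma: divide by N - 1.
sample_var = sum_sq_dev / (n - 1)
sample_sd = math.sqrt(sample_var)

print(pop_var, pop_sd)
print(sample_var, sample_sd)
```

Dividing by N-1 always gives a slightly larger value than dividing by N; this is the usual correction applied when the mean itself has been estimated from the same sample.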
Skewness and Kurtosis

Skewness - a measure of the asymmetry of the data
- Can be positive, negative, or undefined
- Negative skew: tail is to the left
- Positive skew: tail is to the right
Kurtosis - a measure of the "peakedness" of the data
- High kurtosis: sharper peak and longer, fatter tails
- Low kurtosis: rounded peak and shorter, thinner tails
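The slides do not give formulas for these two measures. One common moment-based definition (an assumption here, since several variants exist) is skewness = m3 / m2^1.5 and excess kurtosis = m4 / m2^2 - 3, where mk is the k-th central moment. A minimal sketch with hypothetical data:

```python
def central_moment(data, order):
    """k-th central moment, dividing by N (population form)."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** order for x in data) / n

def skewness(data):
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    # Subtracting 3 makes the normal distribution the zero reference.
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]         # long tail to the right
left_skewed = [-x for x in right_skewed]   # mirror image: tail to the left

print(skewness(symmetric), skewness(right_skewed), skewness(left_skewed))
print(excess_kurtosis(symmetric))
```

The symmetric set has zero skewness, the right-tailed set a positive value, and its mirror image the same value with a negative sign.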
Quartiles
Quartiles are the three values which divide the sorted data set into four equal parts
Each part then represents one fourth of the sampled data
The three quartiles:
- First quartile (Q1) = lower quartile: cuts off the lowest 25% of data
- Second quartile (Q2) = median: cuts the data set in half
- Third quartile (Q3) = upper quartile: cuts off the highest 25% of data
- No universal method for calculating quartiles
- One method: Lk = N x (k / 4), where k = 1 for Q1, 2 for Q2, 3 for Q3
  - If Lk is a whole number, then Qk = average of the values at positions Lk and Lk+1 of the sorted data
  - If Lk is not a whole number, then Qk = value at the position obtained by rounding Lk up to the next whole number
- Other method:
  - The median of the data set gives Q2
  - Divide the data set into two halves. [If the original set has an odd number of data points, include the median in both halves.] The medians of the upper and lower halves give Q3 and Q1 respectively
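Both methods can be written out in a few lines. The sketch below is one possible reading of the two rules; the function names and test data are assumptions made for illustration. Positions are 1-based on the slide and converted to 0-based list indices in the code.

```python
import statistics

def quartiles_position_method(data):
    """Q1, Q2, Q3 using L_k = N * k / 4 on the sorted data."""
    xs = sorted(data)
    n = len(xs)
    result = []
    for k in (1, 2, 3):
        whole, remainder = divmod(n * k, 4)   # L_k = whole + remainder / 4
        if remainder == 0:
            # L_k is whole: average the values at positions L_k and L_k + 1.
            result.append((xs[whole - 1] + xs[whole]) / 2)
        else:
            # L_k is fractional: round up; position whole + 1 is index whole.
            result.append(xs[whole])
    return tuple(result)

def quartiles_median_method(data):
    """Q1, Q2, Q3 as medians of the lower half, whole set, and upper half."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    # With an odd count, the median is included in both halves.
    lower = xs[:half + 1] if n % 2 else xs[:half]
    upper = xs[half:]
    return (statistics.median(lower), statistics.median(xs),
            statistics.median(upper))

seven = [1, 2, 3, 4, 5, 6, 7]
eight = [1, 2, 3, 4, 5, 6, 7, 8]
print(quartiles_position_method(seven), quartiles_median_method(seven))
print(quartiles_position_method(eight), quartiles_median_method(eight))
```

For the eight-value set both methods agree (2.5, 4.5, 6.5). For the seven-value set they give (2, 4, 6) and (2.5, 4, 5.5), which illustrates why no single method is universal.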
- Interquartile range: IQR = Q3 - Q1
- IQR is a more robust measure of variability
- It is not affected much by skewness or outliers
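The robustness claim can be illustrated by replacing one value with an extreme reading and comparing the IQR with the sample standard deviation. The readings are hypothetical. The quartiles here come from `statistics.quantiles` with `method="inclusive"`, which interpolates and is a different convention from the two methods on the slide, so the quartile values themselves may differ slightly.

```python
import statistics

def iqr(data):
    # statistics.quantiles with n=4 returns the three cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q3 - q1

clean = [10, 12, 13, 15, 16, 18, 20, 21]           # hypothetical readings
with_outlier = [10, 12, 13, 15, 16, 18, 20, 210]   # largest value replaced

print("IQR:  ", iqr(clean), "->", iqr(with_outlier))
print("stdev:", statistics.stdev(clean), "->", statistics.stdev(with_outlier))
```

The IQR is unchanged by the outlier, while the standard deviation increases many times over.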
Histogram
A graphical display of tabular frequencies
- Shown in the form of adjacent rectangles
- Provides a good visual representation of the distribution of data
Important factor to be considered during construction:
What is the ideal number of bins (denoted by k) and the corresponding (equal) width of the interval (denoted by h)?
- Small bin width -> too many bins -> difficult to interpret
- Large bin width -> fewer bins -> loss of information
Some rules of thumb:
- Sturges' formula: k = ceiling[log2(N) + 1] = ceiling[(ln N / ln 2) + 1]
- Scott's choice: h = (3.5 x sample standard deviation) / (cube root of N)
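Both rules of thumb are easy to evaluate. The sketch below applies them to the 42 values used in the stem and leaf example of this deck; the function names are assumptions for illustration.

```python
import math
import statistics

def sturges_bins(n):
    """Number of bins k = ceiling(log2(N) + 1)."""
    return math.ceil(math.log2(n) + 1)

def scott_width(data):
    """Bin width h = 3.5 * s / N^(1/3), with s the sample standard deviation."""
    return 3.5 * statistics.stdev(data) / len(data) ** (1 / 3)

values = [11, 12, 12, 13, 15, 15, 15, 16, 17,
          20, 21, 21, 21, 22, 22, 22, 23, 24, 26, 27, 27, 27, 28, 29, 29,
          30, 31, 32, 34, 35, 37,
          41, 41, 42, 45, 47,
          50, 52, 53, 56,
          60, 62]

k = sturges_bins(len(values))   # 7 bins for N = 42
h = scott_width(values)

print("Sturges bins:", k)
print("Scott width: ", h)
```

For the 804 observations of the car data set described later, Sturges' formula gives 11 bins. The two rules need not agree with each other: one fixes the number of bins, the other the bin width.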
Identifying the relation between mean, median, and shape of a histogram:
- Symmetric: mean ≈ median
- Left (or negatively) skewed: mean < median (generally)
- Right (or positively) skewed: mean > median (generally)
Box Plot
- Also known as a box-and-whisker plot
- Provides a five-number summary:
  - The smallest observation (minimum)
  - Lower quartile (Q1)
  - Median (Q2)
  - Upper quartile (Q3)
  - Largest observation (maximum)
- Also indicates outlier observations, if any
- Spacings between the different parts of the box help indicate the degree of dispersion (spread) and skewness in the data, and identify outliers
- Box plots are very effective for comparing values across two or more categories
Box Plot (cont)

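[Figure: annotated box plot. Labels shown in the figure:]
- Outliers*
- Upper whisker end: min(Xmax, {Q3 + [1.5 x IQR]})
- Q3
- Q2
- Q1
- Lower whisker end: max(Xmin, {Q1 - [1.5 x IQR]})
*Note: Generally this is the value beyond which readings are considered outliers. However, there is no universal definition.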
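The whisker ends and outliers in the figure can be computed as follows. This is a sketch under stated assumptions: the readings are hypothetical, the function name is illustrative, and the quartiles come from `statistics.quantiles` with `method="inclusive"`, which is only one of several quartile conventions.

```python
import statistics

def box_plot_summary(data):
    xs = sorted(data)
    q1, q2, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return {
        # Whisker ends as labelled in the figure.
        "whisker_low": max(xs[0], lower_fence),
        "q1": q1,
        "median": q2,
        "q3": q3,
        "whisker_high": min(xs[-1], upper_fence),
        # Readings beyond the fences are flagged as outliers.
        "outliers": [x for x in xs if x < lower_fence or x > upper_fence],
    }

readings = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # hypothetical data
summary = box_plot_summary(readings)
for name, value in summary.items():
    print(name, value)
```

Here the value 100 lies beyond Q3 + 1.5 x IQR and is reported as an outlier, while the lower whisker stops at the minimum because no reading falls below Q1 - 1.5 x IQR.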
Scatter Plot
- Displays values for two variables for a set of data
- Provides a visual representation of the relationship between two variables
Stem and Leaf Display

- Contains features of a histogram.
- More informative than a histogram.
- e.g. The stem and leaf display for 11, 12, 12, 13, 15, 15, 15, 16, 17, 20, 21, 21, 21, 22, 22, 22, 23, 24, 26, 27, 27, 27, 28, 29, 29, 30, 31, 32, 34, 35, 37, 41, 41, 42, 45, 47, 50, 52, 53, 56, 60, 62 will be given as:

1 | 122355567
2 | 0111222346777899
3 | 012457
4 | 11257
5 | 0236
6 | 02
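The display above can be generated mechanically: the tens digit is the stem and the units digit is the leaf. A minimal sketch for two-digit non-negative integers (the function name is an assumption):

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return one text line per stem; tens digit = stem, units digit = leaf."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        groups[stem].append(str(leaf))
    return [f"{stem} | {''.join(groups[stem])}" for stem in sorted(groups)]

values = [11, 12, 12, 13, 15, 15, 15, 16, 17,
          20, 21, 21, 21, 22, 22, 22, 23, 24, 26, 27, 27, 27, 28, 29, 29,
          30, 31, 32, 34, 35, 37,
          41, 41, 42, 45, 47,
          50, 52, 53, 56,
          60, 62]

lines = stem_and_leaf(values)
print("\n".join(lines))
```

Unlike a histogram, each row keeps the individual values, so the original data can be read back from the display.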
Pie Chart and Bar Graph

- A pie chart is the most illustrative way of displaying quantities as percentages of a given total.
- Pie charts are used to present frequencies for categorical data. The scale of measurement may be nominal or ordinal.
- Bar graphs are often used to display categorical data where there is no emphasis on the percentage of a total represented by each category. The scale of measurement is nominal or ordinal.
An Illustration
NAME: Car Data
TYPE: Multiple Regression
SIZE: 804 observations, 12 variables
DESCRIPTIVE ABSTRACT: Data collected from Kelley Blue Book for several hundred 2005 used GM cars allow students to develop a multivariate regression model to determine car value based on a variety of characteristics such as miles driven, make, model, engine size, interior style, cruise control, etc.
SOURCES:
For this data set, a representative sample of over eight hundred 2005 GM cars was selected; an algorithm was then developed following the 2005 Central Edition of the Kelley Blue Book to estimate retail price.
VARIABLE DESCRIPTIONS:
- Price: suggested retail price of the used 2005 GM car in excellent condition. The condition of a car can greatly affect price. All cars in this data set were less than one year old when priced and considered to be in excellent condition.
- Miles: number of miles the car has been driven
- Make: manufacturer of the car such as Saturn, Pontiac, and Chevrolet
- Model: specific models for each car manufacturer such as Ion, Vibe, Cavalier
- Trim (of car): specific type of car model such as SE Sedan 4D, Quad Coupe 2D
- Type: body type such as sedan, coupe, etc.
- Cylinder: number of cylinders in the engine
- Liter: a more specific measure of engine size
- Doors: number of doors
- Cruise: indicator variable representing whether the car has cruise control (1 = cruise)
- Sound: indicator variable representing whether the car has upgraded speakers (1 = upgraded)
- Leather: indicator variable representing whether the car has leather seats (1 = leather)