
Understanding and Describing the Data:

Descriptive Statistics

ANOL BHATTACHERJEE, PH.D.


UNIVERSITY OF SOUTH FLORIDA
Ten Questions
 What is a frequency distribution?
 If a sample mean is different from the median, what does it say about the sample?
 Which is the best measure of central tendency in a skewed sample: mean, median, or mode?
 Provide an example of a situation where we can expect two modes in a sample distribution.
 What does a standard deviation mean?
 Is it possible for a sample to have a zero standard deviation? If so, when?
 Hypothetically, if the standard deviation of a sample is zero, what does it mean?
 How would you interpret the standard deviation of a bimodal sample?
 How can you detect outliers in a sample?
 If two variables have a +1 correlation, what does it mean?
Outline
 Learning objectives:
 Describe data using statistical estimates and graphics.
 Concepts discussed:
 Univariate analysis:
 Distribution: histograms.

 Central tendency: mean, median, mode.

 Dispersion: range, interquartile range, variance, standard deviation.

 Boxplots and outlier analysis.

 Multivariate analysis:
 Correlations.

 Scatterplots.
Observed Data
Variable names:

HomeID   Price    SqFt   Bedrooms   Bathrooms   Offers   Brick   Neighborhood
1        114300   1790   2          2           2        No      East
2        114200   2030   4          2           3        No      East
3        114800   1740   3          2           1        No      East
4        94700    1980   3          2           3        No      East
5        119800   2130   3          3           3        No      East
6        114600   1780   3          2           2        No      North
7        151600   1830   3          3           3        Yes     West
8        150700   2160   4          2           2        No      West
9        119200   2110   4          2           3        No      East

Each row is one observation (case, data point). Each column holds the values of one variable, from which its frequency distribution can be tabulated.

 Variables:
 You choose which variable you want to predict (the outcome or dependent variable) from which other variables (the predictor or independent variables).
 Frequency distribution:
 How often each value of a given variable occurs; can be used to draw a histogram.
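As a small illustration, the frequency distribution of a single variable can be tallied directly. The sketch below counts the Bedrooms values from the nine sample rows on this slide; the name `bedrooms` is only an illustrative choice.

```python
from collections import Counter

# Bedrooms column from the nine sample homes listed on this slide
bedrooms = [2, 4, 3, 3, 3, 3, 3, 4, 4]

# Frequency distribution: each distinct value mapped to how often it occurs
frequency = dict(sorted(Counter(bedrooms).items()))

# A text histogram drawn from the same counts
for value, count in frequency.items():
    print(f"{value} bedrooms: {'#' * count} ({count})")
```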
Describing the Data
 What are descriptive statistics?
 Numbers that quantitatively summarize and describe a sample of data.
 Examples: Mean, median, mode, standard deviation, range, maximum, minimum, quartiles, …

Question: Which of the following classes of students is more intelligent?

Class A - IQs of 13 Students: 102 115 128 109 131 89 98 106 140 119 93 97 110

Class B - IQs of 13 Students: 127 162 131 103 96 111 80 109 93 87 120 105 109
Descriptive Statistics
Question: Which class is more intelligent now?

Class A - Average IQ: 110.54
Class B - Average IQ: 110.23

With a summary descriptive statistic, it is much easier to answer our question.

Question: Which class has a wider variation in intelligence?


Descriptive Statistics: Univariate Analysis
To Organize Data
 Tables
 Frequency Distributions
 Graphs
 Bar Chart or Histogram
 Stem and Leaf Plot

To Summarize Data
 Central Tendency (middle values)
 Mean
 Median
 Mode
 Variation (differences within a group)
 Range
 Interquartile Range
 Variance
 Standard Deviation
Frequency Distribution
Frequency Distribution and Relative Frequency Distribution of IQs for Two Classes

IQ       Frequency   Percent   Valid Percent   Cumulative Percent
80.00    1           3.8       3.8             3.8
87.00    1           3.8       3.8             7.7
89.00    1           3.8       3.8             11.5
93.00    2           7.7       7.7             19.2
96.00    1           3.8       3.8             23.1
97.00    1           3.8       3.8             26.9
98.00    1           3.8       3.8             30.8
102.00   1           3.8       3.8             34.6
103.00   1           3.8       3.8             38.5
105.00   1           3.8       3.8             42.3
106.00   1           3.8       3.8             46.2
109.00   3           11.5      11.5            57.7
110.00   1           3.8       3.8             61.5
111.00   1           3.8       3.8             65.4
115.00   1           3.8       3.8             69.2
119.00   1           3.8       3.8             73.1
120.00   1           3.8       3.8             76.9
127.00   1           3.8       3.8             80.8
128.00   1           3.8       3.8             84.6
131.00   2           7.7       7.7             92.3
140.00   1           3.8       3.8             96.2
162.00   1           3.8       3.8             100.0
Total    26          100.0     100.0
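A table like this can be rebuilt from the raw scores. The following is a minimal sketch, assuming the 26 combined IQ scores listed elsewhere in this deck; `frequency_table` is a hypothetical helper name, and percentages are rounded to one decimal.

```python
from collections import Counter

# Combined IQ scores for Classes A and B, as listed in this deck
iqs = [80, 87, 89, 93, 93, 96, 97, 98, 102, 103, 105, 106, 109, 109, 109,
       110, 111, 115, 119, 120, 127, 128, 131, 131, 140, 162]

def frequency_table(values):
    """Return (value, frequency, percent, cumulative percent) rows."""
    counts = Counter(values)
    n = len(values)
    rows, running = [], 0
    for value in sorted(counts):
        running += counts[value]
        rows.append((value, counts[value],
                     round(100 * counts[value] / n, 1),
                     round(100 * running / n, 1)))
    return rows

rows = frequency_table(iqs)
for value, freq, pct, cum in rows:
    print(f"{value:>4} {freq:>3} {pct:>6.1f} {cum:>6.1f}")
```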
Frequency Distribution
Grouped Relative Frequency Distribution of IQ for Two Classes

IQ             Frequency   Percent   Cumulative Percent
80 – 89        3           11.5      11.5
90 – 99        5           19.2      30.8
100 – 109      7           26.9      57.7
110 – 119      4           15.4      73.1
120 – 129      3           11.5      84.6
130 – 139      2           7.7       92.3
140 – 149      1           3.8       96.2
150 and over   1           3.8       100.0
Total          26          100.0

[Figure: Histogram of Frequency Distribution. x-axis: IQ (80 to 160); y-axis: Frequency (0 to 6). Figure annotation: Mean = 110.4583, Std. Dev. = 19.00338, N = 24.]
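Grouping into intervals is a counting exercise as well. The sketch below bins the combined scores listed in this deck into ten-point intervals and prints a text histogram; the bin width and the open-ended top bin (150 and over) follow the grouped table, while `grouped_counts` is a hypothetical helper name.

```python
from collections import Counter

# Combined IQ scores for Classes A and B, as listed in this deck
iqs = [80, 87, 89, 93, 93, 96, 97, 98, 102, 103, 105, 106, 109, 109, 109,
       110, 111, 115, 119, 120, 127, 128, 131, 131, 140, 162]

def grouped_counts(values, width=10, top=150):
    """Count values per bin of `width`; values >= `top` share one open-ended bin."""
    counts = Counter(min(v, top) // width * width for v in values)
    return dict(sorted(counts.items()))

groups = grouped_counts(iqs)
for start, count in groups.items():
    label = "150 and over" if start == 150 else f"{start} - {start + 9}"
    print(f"{label:<13}{'#' * count}")
```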
Stem and Leaf Plot & Bar Graphs
Stem and Leaf Plot of IQ for Two Classes

Stem   Leaf
 8     079
 9     33678
10     2356999
11     0159
12     078
13     11
14     0
15
16     2

[Figure: Bar Graph of the Number of Students in Each Class. x-axis: Class (1.00, 2.00); y-axis: Count (0 to 12).]
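A stem and leaf plot splits each score into a stem (the tens) and a leaf (the last digit). The following sketch, assuming the combined scores listed in this deck, builds the plot as text; `stem_and_leaf` is a hypothetical helper name.

```python
from collections import defaultdict

# Combined IQ scores for Classes A and B, as listed in this deck
iqs = [80, 87, 89, 93, 93, 96, 97, 98, 102, 103, 105, 106, 109, 109, 109,
       110, 111, 115, 119, 120, 127, 128, 131, 131, 140, 162]

def stem_and_leaf(values):
    """Return one text line per stem (tens), with leaves (units) in order."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    lines = []
    # Walk every stem between the smallest and largest so empty stems still show
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(d) for d in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines

plot = stem_and_leaf(iqs)
print("\n".join(plot))
```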
Measure of Central Tendency: Mean
 Also called “arithmetic mean” or average.
 Add up the values for each case and divide by the total number of cases.

 There are also the geometric mean and the harmonic mean, but we won’t get into those.
 Class example:
Class A - IQs of 13 Students: 102 115 128 109 131 89 98 106 140 119 93 97 110
Class B - IQs of 13 Students: 127 162 131 103 96 111 80 109 93 87 120 105 109

Mean IQ of Class A = 110.54
Mean IQ of Class B = 110.23
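The "add up and divide" rule is short enough to check by hand or in code. A minimal sketch using the two class lists from this slide:

```python
class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]

def mean(values):
    """Arithmetic mean: sum of the values divided by the number of cases."""
    return sum(values) / len(values)

mean_a = mean(class_a)
mean_b = mean(class_b)
print(f"Class A: {mean_a:.2f}, Class B: {mean_b:.2f}")
# prints: Class A: 110.54, Class B: 110.23
```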
What Mean Means
 The mean is a “balance point.”
 Each person’s score is like 1 pound placed at the score’s position on a see-saw.
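 A mean of 110 means that the total distance of the weights to the left of the fulcrum (17 + 4 = 21 units) equals the total distance of the weights to the right (21 units).

[Figure: See-saw with 1 lb weights at 93 cm, 106 cm, 110 cm, and 131 cm, balanced at 110: 17 units below, 4 units below, 0 units, and 21 units above the mean.]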

 This means that the mean can be easily affected by outliers.


 The mean is a poor measure of central tendency if (a) there are outliers, or (b) the sample is skewed.

[Figure: Income distribution in the U.S., with labels “All of Us”, “Mean”, and “Outlier” (Bill Gates).]
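The balance-point idea can be checked numerically: deviations below and above the mean cancel exactly. The sketch below uses the four see-saw positions from this slide, then adds one invented extreme value (1000 is an assumption, not data from the deck) to show how far a single outlier drags the mean compared with the median.

```python
from statistics import mean, median

weights = [93, 106, 110, 131]           # positions from the see-saw example
balance = mean(weights)                  # the balance point (110)
deviations = [w - balance for w in weights]

# Deviations on the two sides of the fulcrum cancel out
print(deviations, sum(deviations))       # [-17, -4, 0, 21] 0

# One hypothetical outlier moves the mean a lot and the median very little
with_outlier = weights + [1000]
print(mean(weights), median(weights))            # 110 108.0
print(mean(with_outlier), median(with_outlier))  # 288 110
```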
Median
 The middle value when a variable’s values are ranked in order
 The 50th percentile: When data are listed in order, the median is the point at which 50% of the
cases are above and 50% below it.
 Insensitive to outliers: works better with skewed data.
Class A – IQ of 13 Students
89 93 97 98 102 106 109 110 115 119 128 131 140

Median = 109
(six cases above, six below)

If the student with the lowest IQ dropped out of class, the median would shift:
X 93 97 98 102 106 109 110 115 119 128 131 140

New Median = 109.5 = (109+110)/2


(six cases above, six below)
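The ranking rule, including the average of the two middle values when the number of cases is even, can be sketched as follows. This is a hand-rolled illustration; Python's `statistics.median` does the same job.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]

all_students = median(class_a)
# Drop the student with the lowest IQ (89) and recompute
without_lowest = median(sorted(class_a)[1:])

print(all_students, without_lowest)  # 109 109.5
```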
Mode
 The most common data point is called the mode.
Combined IQ scores for Classes A & B:

80 87 89 93 93 96 97 98 102 103 105 106 109 109 109 110 111 115 119 120 127 128 131 131 140 162

Mode = 109 (it occurs three times). À la mode!!
 Note:
 It is possible to have more than one mode in a frequency distribution.
 The distribution in the figure is bimodal.
 The mode may not be at the center of a distribution.
 The mode depicts the “most likely” observation rather than the “most typical” or “most central” one.

[Figure: Bar chart of a bimodal distribution. x-axis: IQ; y-axis: Count (1.0 to 2.0).]
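Finding the mode, or several modes, is a matter of counting. The sketch below uses `statistics.multimode` (Python 3.8+) on the combined scores listed in this deck, and on an invented bimodal sample of heights in centimeters (the second list is purely hypothetical, as if two distinct groups had been measured together).

```python
from statistics import multimode

# Combined IQ scores for Classes A and B, as listed in this deck
iqs = [80, 87, 89, 93, 93, 96, 97, 98, 102, 103, 105, 106, 109, 109, 109,
       110, 111, 115, 119, 120, 127, 128, 131, 131, 140, 162]

iq_modes = multimode(iqs)
print(iq_modes)  # [109]

# Hypothetical bimodal sample: two groups with different typical heights
heights = [150, 152, 152, 152, 155, 170, 172, 172, 172, 175]
height_modes = multimode(heights)
print(height_modes)  # [152, 172]
```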
Comparing the Central Tendency Measures
 In symmetric, unimodal distributions, the mean, median, and mode are the same.
 In skewed data, the mean and median are pulled further toward the long tail (the direction of the skew) than the mode.

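[Figure: A symmetric distribution (mean, median, and mode at the same point) and a skewed distribution (mode, median, and mean marked separately).]

The bell-shaped symmetric distribution is called a normal distribution.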
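To see how skew separates the three measures, the following sketch compares a small symmetric sample with a right-skewed one. Both lists are invented for illustration and are not taken from the deck.

```python
from statistics import mean, median, mode

# Hypothetical samples, chosen only to illustrate the pattern
symmetric = [1, 2, 3, 3, 3, 4, 5]
skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 20]   # long tail to the right

for name, data in [("symmetric", symmetric), ("skewed", skewed)]:
    print(f"{name:<10} mode={mode(data)} median={median(data)} mean={mean(data)}")

# In the right-skewed sample the order is mode < median < mean:
# the mean is pulled furthest toward the tail.
```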
Measuring Variation: Range
 The spread, or the distance, between the lowest and highest values of a variable.
 Range = Maximum value – Minimum value

Class A – IQ of 13 Students: 102 115 128 109 131 89 98 106 140 119 93 97 110

Class B – IQ of 13 Students: 127 162 131 103 96 111 80 109 93 87 120 105 109
Class A Range = 140 - 89 = 51 Class B Range = 162 - 80 = 82

Note: Range is sensitive to outliers
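The range needs only the two extreme values, which is also why a single outlier changes it so much. A minimal sketch with the two class lists:

```python
class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]

def value_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)

range_a = value_range(class_a)
range_b = value_range(class_b)
print(range_a, range_b)  # 51 82

# Without its single highest score (162), Class B's range shrinks sharply
range_b_trimmed = value_range(sorted(class_b)[:-1])
print(range_b_trimmed)   # 51
```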


Interquartile Range
 A quartile represents 25% (a quarter) of a frequency distribution.
 Median: the 50th percentile that divides the cases in half.
 Lower quartile (Q1): 25th percentile that divides the first ¼ of cases from the latter ¾.
 Upper quartile (Q3): 75th percentile that divides the first ¾ of cases from the latter ¼.
 The interquartile range is the distance or range between the 25th percentile and the 75th percentile.

Combined IQ scores for Classes A & B:

80 87 89 93 93 96 97 98 102 103 105 106 109 109 109 110 111 115 119 120 127 128 131 131 140 162

Q1 (25th percentile) = 97
Median (50th percentile) = (109+109)/2 = 109
Q3 (75th percentile) = 120

Interquartile range = 120 – 97 = 23
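One way to reproduce these quartiles is the "median of each half" method: split the sorted data at the median, then take the median of the lower and upper halves. This is an assumption about the method used on the slide; other conventions (for example the default of `statistics.quantiles`) interpolate differently and can give slightly different quartiles.

```python
from statistics import median

# Combined IQ scores for Classes A and B, as listed in this deck
iqs = [80, 87, 89, 93, 93, 96, 97, 98, 102, 103, 105, 106, 109, 109, 109,
       110, 111, 115, 119, 120, 127, 128, 131, 131, 140, 162]

def quartiles(values):
    """Q1, median, Q3 using the median-of-halves method."""
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]
    # For an odd number of cases, leave the middle value out of both halves
    upper = s[half:] if len(s) % 2 == 0 else s[half + 1:]
    return median(lower), median(s), median(upper)

q1, q2, q3 = quartiles(iqs)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```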


Detecting Outliers in Data
 Any case that falls outside the interval from Q1 - 1.5*IQR to Q3 + 1.5*IQR is treated as an outlier.
 Question: What are the outliers, if any, in the following data?

Combined IQ scores for Classes A & B:

80 87 89 93 93 96 97 98 102 103 105 106 109 109 109 110 111 115 119 120 127 128 131 131 140 162

Q1=97 Median = 109 Q3=120

IQR = 120 – 97 = 23
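The 1.5*IQR rule can be applied mechanically. The sketch below computes the two fences for the combined scores and lists any case outside them; quartiles are taken as the medians of the lower and upper halves (an assumed convention that matches Q1 = 97 and Q3 = 120 above), and `iqr_outliers` is a hypothetical helper name.

```python
from statistics import median

# Combined IQ scores for Classes A and B, as listed in this deck
iqs = [80, 87, 89, 93, 93, 96, 97, 98, 102, 103, 105, 106, 109, 109, 109,
       110, 111, 115, 119, 120, 127, 128, 131, 131, 140, 162]

def iqr_outliers(values, k=1.5):
    """Return the (lower, upper) fences and the cases that fall outside them."""
    s = sorted(values)
    half = len(s) // 2
    q1 = median(s[:half])
    q3 = median(s[half:] if len(s) % 2 == 0 else s[half + 1:])
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return (low, high), [v for v in s if v < low or v > high]

fences, outliers = iqr_outliers(iqs)
print(fences, outliers)  # (62.5, 154.5) [162]
```

With fences at 62.5 and 154.5, the only case outside the interval is 162.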
Variance
 A measure of the spread (dispersion) of the recorded values on a variable.
 Larger variance means that individual cases are further from the mean.
 Smaller variance means that individual scores are closer to the mean.
 Computed as (for discrete values with equal probabilities):

$$\mathrm{Var}(X) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$$

where $\mu$ = mean of the distribution
$x_i - \mu$ = deviation of each case from the mean (residual)
$\sum_{i=1}^{n}(x_i - \mu)^2$ = sum of squared residuals (SSR)
 Population variance = SSR/n; Sample variance = SSR/(n-1).
Variance: IQ Example for Class A
Mean (µ) = 110.54

IQ (Xi)   Xi - µ    (Xi - µ)²
102       -8.54     72.91
115       4.46      19.91
128       17.46     304.91
109       -1.54     2.37
131       20.46     418.67
89        -21.54    463.91
98        -12.54    157.21
106       -4.54     20.60
140       29.46     867.98
119       8.46      71.60
93        -17.54    307.60
97        -13.54    183.29
110       -0.54     0.29

SSR = Σ(Xi - µ)² = 2891.23
Var(X) = SSR / (n-1) = 240.94
SD (σ) = sqrt(Var) = 15.52

 Standard deviation (σ):
 Roughly the typical deviation of observations from the mean (the square root of the variance).
 σ = √Var(X)
 What does it mean if SD = 0?
 There is no variance; all values are the same; and the “variable” is not a variable but a constant.
 Both mean and SD are inflated by outliers.
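The Class A calculation can be replayed step by step. The following sketch computes the sum of squared residuals, both variance versions, and the sample standard deviation; the `constant` list is an invented example for the SD = 0 case, and `spread` is a hypothetical helper name.

```python
from math import sqrt

class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]

def spread(values):
    """Return (SSR, population variance, sample variance, sample SD)."""
    n = len(values)
    mu = sum(values) / n
    ssr = sum((x - mu) ** 2 for x in values)   # sum of squared residuals
    sample_var = ssr / (n - 1)
    return ssr, ssr / n, sample_var, sqrt(sample_var)

ssr, pop_var, sample_var, sd = spread(class_a)
print(f"SSR={ssr:.2f} Var={sample_var:.2f} SD={sd:.2f}")
# prints: SSR=2891.23 Var=240.94 SD=15.52

# Hypothetical constant "variable": every case has the same value, so SD is 0
constant = [100, 100, 100]
print(spread(constant)[3])  # 0.0
```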
Box-Plots
 Portrays almost all descriptive statistics in one graphic.
 A box-plot shows:
 Upper and lower quartiles
 Mean
 Median
 Range
 Outliers (1.5 IQR)

[Figure: Box-plot of IQ (axis 80 to 180). Labeled values: 82, Q1 = 97, median = 109, Q3 = 120, IQR = 23, M = 110.5, and 162 marked as an outlier.]
Multivariate Analysis: Correlations
 Strength of association between two (or more) variables.
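 Pearson’s product-moment correlation coefficient is computed as:

$$r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum_{i=1}^{n}(x_i - \mu_x)^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \mu_y)^2}}$$

where $\mu_x$ and $\mu_y$ are the means of the two variables.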

 Ranges between -1 and + 1: Has a magnitude (size) and direction (sign).


 Questions:
 What does a correlation of 0 mean?
 What does a correlation of +1 mean?
 What does a correlation of -1 mean?
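The three questions can be explored with a hand-rolled Pearson coefficient. This is a minimal sketch using the standard definition (co-deviation divided by the product of the two spreads); the small lists are invented so that the results land on +1, -1, and 0.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's r: sum of co-deviations over the product of the two spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]

r_pos = pearson(x, [2 * v + 1 for v in x])   # y rises in lockstep with x
r_neg = pearson(x, [-v for v in x])          # y falls in lockstep with x
r_zero = pearson(x, [3, 1, 2, 1, 3])         # no linear trend at all

print(r_pos, r_neg, r_zero)
```

A correlation of +1 or -1 means the points fall exactly on a straight line (rising or falling); 0 means there is no linear trend.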
Correlation Table
 Shows pairwise associations among more than two variables (as an upper triangular matrix).

Raw data:

HomeID   Price    SqFt   Bedrooms   Bathrooms   Offers   Brick   Neighborhood
1        114300   1790   2          2           2        No      East
2        114200   2030   4          2           3        No      East
3        114800   1740   3          2           1        No      East
4        94700    1980   3          2           3        No      East
5        119800   2130   3          3           3        No      East
6        114600   1780   3          2           2        No      North
7        151600   1830   3          3           3        Yes     West
8        150700   2160   4          2           2        No      West
9        119200   2110   4          2           3        No      East

[Figure: Correlation table computed from the raw data.]
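A correlation table is the same pairwise calculation repeated for every pair of numeric variables. The sketch below does this for the nine sample rows shown on this slide. These rows appear to be only an excerpt of the data, so the printed values need not match those in the original correlation table. Brick and Neighborhood are left out because Pearson's r needs numeric values.

```python
from itertools import combinations
from math import sqrt

# Numeric columns from the nine sample homes on this slide
homes = {
    "Price": [114300, 114200, 114800, 94700, 119800, 114600, 151600, 150700, 119200],
    "SqFt": [1790, 2030, 1740, 1980, 2130, 1780, 1830, 2160, 2110],
    "Bedrooms": [2, 4, 3, 3, 3, 3, 3, 4, 4],
    "Bathrooms": [2, 2, 2, 2, 3, 2, 3, 2, 2],
    "Offers": [2, 3, 1, 3, 3, 2, 3, 2, 3],
}

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Upper triangle only: each pair of variables appears once
table = {(a, b): pearson(homes[a], homes[b]) for a, b in combinations(homes, 2)}

for (a, b), r in table.items():
    print(f"{a:>9} vs {b:<9} {r:+.2f}")
```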
Multivariate Analysis: Scatterplots
 A 2D plot of the values of two variables, showing their association graphically.
 The figure shows a scatterplot of Price vs. SqFt (recall that the correlation between these two variables was 0.55).
 Question: How would zero correlation look on a scatterplot?

[Figure: Scatterplot of Price (y-axis, 80000 to 200000) vs. SqFt (x-axis, 1600 to 2600).]
Effects of Outliers and Non-Normality
 Pearson’s correlation coefficient indicates
the strength of a linear relationship
between two variables (e.g., Figure 1).
 Non-linear associations (Figure 2) are hard
to interpret using Pearson’s correlation.
 Outliers (Figures 3 and 4) skew the correlation coefficient; outliers must be identified and removed before correlation analysis.
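Both effects are easy to reproduce with invented numbers. In the sketch below (all data hypothetical), a perfect but non-linear relationship yields a Pearson correlation of zero, and a single extreme point flips a perfect negative correlation into a strongly positive one.

```python
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Non-linear association: y is fully determined by x (y = x^2), yet r is 0
x = [-3, -2, -1, 0, 1, 2, 3]
r_curve = pearson(x, [v ** 2 for v in x])

# Outlier effect: five points on a falling line, then one extreme point added
xs, ys = [1, 2, 3, 4, 5], [5, 4, 3, 2, 1]
r_clean = pearson(xs, ys)
r_outlier = pearson(xs + [20], ys + [20])

print(r_curve, r_clean, round(r_outlier, 2))
```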
Scatterplot Matrices
 Combination of a correlation table and scatterplots.
 Too much information?
 Questions:
 What does this matrix tell you about the relationship between price and number of offers?
 Why do some plots look like straight lines?
 Why are Brick and Neighborhood not in this matrix?

[Figure: Scatterplot matrix of Price, SqFt, Bedrooms, Bathrooms, and Offers.]
One More Scatterplot Matrix

[Figure: Scatterplot matrix of Price, SqFt, Bedrooms, Bathrooms, and Offers, with pairwise correlations in the upper triangle (asterisks as printed in the figure).]

            SqFt    Bedrooms   Bathrooms   Offers
Price       0.55*   0.53*      0.52*       -0.31*
SqFt                0.48*      0.52*       0.34*
Bedrooms                       0.41*       0.11
Bathrooms                                  0.14
Key Takeaways
 We need descriptive statistics to (a) organize data, and (b) summarize data.
 Descriptive analysis can be univariate or multivariate.
 Univariate summary statistics can be grouped into central tendency measures (mean, median, mode) and variation measures (range, IQR, variance, SD).
 Summary statistics provide a general sense of the overall data sample, but to understand what individual observations look like, you need graphical depictions such as frequency distributions, histograms, etc.
 Multivariate analysis examines associations between two or more variables, again as numeric
statistics (correlation) or as graphical plots (scatterplots).
 The formulae look daunting but they are all computed today by software.
 Today’s software is also sophisticated enough to draw powerful graphical plots.
