Descriptive Statistics
Multivariate analysis:
Correlations.
Scatterplots.
Observed Data
Variable names (column headers): HomeID, Price, SqFt, Bedrooms, Bathrooms, Offers, Brick, Neighborhood

HomeID  Price   SqFt  Bedrooms  Bathrooms  Offers  Brick  Neighborhood
1       114300  1790  2         2          2       No     East
2       114200  2030  4         2          3       No     East
3       114800  1740  3         2          1       No     East
4       94700   1980  3         2          3       No     East
5       119800  2130  3         3          3       No     East
6       114600  1780  3         2          2       No     North
7       151600  1830  3         3          3       Yes    West
8       150700  2160  4         2          2       No     West
9       119200  2110  4         2          3       No     East

Each row is one observation (case/data point).
The values in one column give the frequency distribution of one variable.
Variables:
You choose the variable you want to predict (the outcome or dependent variable) and the other variables
you will predict it from (the predictor or independent variables).
Frequency distribution:
The set of values observed for a given variable, together with how often each value occurs; it can be used to draw a histogram.
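As a minimal sketch of a frequency distribution, the snippet below counts how often each value occurs in the Bedrooms column of the nine homes listed under Observed Data and prints a crude text histogram. The variable names are illustrative, and only the nine listed rows are used.

```python
from collections import Counter

# Bedrooms column from the nine homes listed under Observed Data.
bedrooms = [2, 4, 3, 3, 3, 3, 3, 4, 4]

# Frequency distribution: each distinct value mapped to how often it occurs.
freq = dict(sorted(Counter(bedrooms).items()))

# Crude text histogram: one '#' per observation.
for value, count in freq.items():
    print(f"{value} bedrooms | {'#' * count} ({count})")
```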
Describing the Data
What are descriptive statistics?
Numbers that quantitatively summarize and describe a sample of data.
Examples: Mean, median, mode, standard deviation, range, maximum, minimum, quartiles, …
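Frequency table:
Class interval  Frequency  Percent  Cumulative percent
80 – 89         3          11.6     11.6
90 – 99         5          19.3     30.9
...
110 – 119       3          11.6     73.2

Stem-and-leaf plot:
 8 | 079
 9 | 33678
10 | 23567999
11 | 0159
12 | 078
13 | 1
14 | 0
15 |
16 | 2

[Figure: histograms of the same scores (Frequency by class interval; Count by Class)]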
Measure of Central Tendency: Mean
Also called “arithmetic mean” or average.
Add up the values for each case and divide by the total number of cases.
There are also the geometric mean and the harmonic mean, but we won’t get into those.
Class example:
Class A - IQs of 13 Students: 102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110
Class B - IQs of 13 Students: 127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109
Mean IQ of Class A = 1437 / 13 = 110.54
Mean IQ of Class B = 1433 / 13 = 110.23
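The class example can be checked with a few lines of Python. This is a minimal sketch: the function name arithmetic_mean is illustrative, and the two lists are the 13 scores per class as listed above.

```python
# IQ scores for the two classes, as listed in the class example.
class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]


def arithmetic_mean(values):
    """Add up the values and divide by the number of cases."""
    return sum(values) / len(values)


mean_a = arithmetic_mean(class_a)
mean_b = arithmetic_mean(class_b)
print(f"Class A: sum = {sum(class_a)}, n = {len(class_a)}, mean = {mean_a:.2f}")
print(f"Class B: sum = {sum(class_b)}, n = {len(class_b)}, mean = {mean_b:.2f}")
```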
What Mean Means
The mean is a “balance point.”
Each person’s score is like 1 pound placed at the score’s position on a see-saw.
A mean of 110 means that the total distance of the weights to the left of the fulcrum (17 + 4 = 21 units) equals the total distance of the weights to the right (21 units).
[Figure: see-saw with the fulcrum at 110 cm and 1 lb weights at 93 cm (17 units below), 106 cm (4 units below), and 131 cm (21 units above)]
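The balance-point idea can be verified numerically. The sketch below uses the three weight positions from the see-saw example (93, 106, and 131) and shows that the distances below the mean add up to the distances above it. Variable names are illustrative.

```python
# Positions of the three 1 lb weights in the see-saw example.
positions = [93, 106, 131]

# The fulcrum sits at the mean of the positions.
fulcrum = sum(positions) / len(positions)

# Signed distance of each weight from the fulcrum.
deviations = [x - fulcrum for x in positions]

# Total distance on each side of the fulcrum.
left = sum(-d for d in deviations if d < 0)
right = sum(d for d in deviations if d > 0)

print(fulcrum)      # 110.0
print(left, right)  # 21.0 21.0
```

Because the deviations from the mean always sum to zero, the two sides balance for any data set, not only this one.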
[Figure: Income distribution in the U.S. Labels: All of Us, Mean, Outlier (Bill Gates)]
Median
The middle value when a variable’s values are ranked in order
The 50th percentile: When data are listed in order, the median is the point at which 50% of the
cases are above and 50% below it.
Insensitive to outliers: works better with skewed data.
Class A – IQ of 13 Students
89 93 97 98 102 106 109 110 115 119 128 131 140
Median = 109
(six cases above, six below)
If the student with the lowest IQ (89) dropped out of class, the median would shift. With 12 cases left, it becomes the average of the two middle values, (109 + 110) / 2 = 109.5:
X 93 97 98 102 106 109 110 115 119 128 131 140
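A short sketch of the same calculation, using Python's standard statistics module on the Class A scores. With an odd number of cases the median is the middle value; with an even number it is the average of the two middle values.

```python
from statistics import median

# Class A IQ scores, sorted in ascending order.
class_a = sorted([102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110])

# 13 cases: the median is the 7th value.
median_all = median(class_a)
print(median_all)  # 109

# Drop the lowest score (89): 12 cases remain, so the median is the
# average of the 6th and 7th values.
without_lowest = class_a[1:]
median_after = median(without_lowest)
print(median_after)  # 109.5
```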
80 87 89 93 93 96 97 98 102 103 105 106 109 109 109 110 111 115 119 120 127 128 131 131 140 162
A la mode!!
Note:
It is possible to have more than one mode in a frequency distribution.
Mode depicts “most likely” observation rather than “most typical” or …
[Figure: bar chart of Count by IQ]
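As a minimal sketch, the mode of the 26 combined IQ scores can be found with statistics.multimode (Python 3.8 or later), which returns every value tied for the highest count. The second list is a made-up example included only to show a distribution with two modes.

```python
from statistics import multimode

# The 26 IQ scores from both classes combined.
iq = [80, 87, 89, 93, 93, 96, 97, 98, 102, 103, 105, 106, 109,
      109, 109, 110, 111, 115, 119, 120, 127, 128, 131, 131, 140, 162]

modes = multimode(iq)
print(modes)  # [109]

# Hypothetical data with two modes (93 and 109 each occur twice).
bimodal = multimode([93, 93, 109, 109, 120])
print(bimodal)  # [93, 109]
```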
Comparing the Central Tendency Measures
In symmetric distributions with a single peak, the mean, median, and mode are the same.
In skewed data, the mean and median lie further toward the skew than the mode.
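[Figure: a symmetric distribution, where the mean, median, and mode coincide, next to a skewed distribution, where the mode, median, and mean are marked separately]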
80 87 89 93 93 96 97 98 102 103 105 106 109 109 109 110 111 115 119 120 127 128 131 131 140 162
IQR = 120 – 97 = 23
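The quartiles behind this IQR can be reproduced with a short sketch. It assumes the "median of each half" convention (the first quartile is the median of the lower half of the sorted data, the third quartile the median of the upper half), which yields 97 and 120 for these scores. Other quartile conventions, such as the interpolation used by default in many software packages, can give slightly different values.

```python
from statistics import median

# The 26 IQ scores from both classes combined.
iq = [80, 87, 89, 93, 93, 96, 97, 98, 102, 103, 105, 106, 109,
      109, 109, 110, 111, 115, 119, 120, 127, 128, 131, 131, 140, 162]

ordered = sorted(iq)
half = len(ordered) // 2  # for an odd count this leaves out the middle value

q1 = median(ordered[:half])   # median of the lower half
q3 = median(ordered[-half:])  # median of the upper half
iqr = q3 - q1
value_range = max(ordered) - min(ordered)

print(q1, q3, iqr)   # 97 120 23
print(value_range)   # 82
```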
Variance
A measure of the spread (dispersion) of the recorded values on a variable.
Larger variance means that individual cases are further from the mean.
Smaller variance means that individual scores are closer to the mean.
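Computed as (for discrete values with equal probabilities):
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$
where x_i is the value for case i, \mu is the mean, and N is the number of cases. The standard deviation (SD) is the square root of the variance.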
[Figure: box plot of IQ, annotated with M = 110.5 and IQR = 23; other marked values: 109, 97, 82]
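The sketch below computes the variance of the 26 combined IQ scores as the average squared distance from the mean. Dividing by N (the population form) is an assumption based on the phrase "discrete values with equal probabilities"; many packages default to the sample form, which divides by N - 1. The result is cross-checked against statistics.pvariance from the standard library.

```python
from math import sqrt
from statistics import pvariance

# The 26 IQ scores from both classes combined.
iq = [80, 87, 89, 93, 93, 96, 97, 98, 102, 103, 105, 106, 109,
      109, 109, 110, 111, 115, 119, 120, 127, 128, 131, 131, 140, 162]

n = len(iq)
mu = sum(iq) / n

# Population variance: mean of the squared deviations from the mean.
variance = sum((x - mu) ** 2 for x in iq) / n

# Standard deviation: square root of the variance, in the original units.
sd = sqrt(variance)

print(f"mean = {mu:.2f}, variance = {variance:.2f}, SD = {sd:.2f}")
print(f"statistics.pvariance gives {pvariance(iq):.2f}")
```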
Multivariate Analysis: Correlations
Strength of association between two (or more) variables.
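Pearson’s product-moment correlation coefficient computed as:
$$r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}$$
where \bar{x} and \bar{y} are the means of the two variables. r ranges from -1 (perfect negative linear association) through 0 (no linear association) to +1 (perfect positive linear association).
[Table: correlation table for the housing variables]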
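A minimal sketch of Pearson's coefficient in plain Python, applied to the Price and SqFt columns of the nine homes listed under Observed Data. The function name pearson_r is illustrative. Because only nine rows are listed, the value printed here need not match a correlation reported for the full data set.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of the deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Price and SqFt for the nine homes listed under Observed Data.
price = [114300, 114200, 114800, 94700, 119800, 114600, 151600, 150700, 119200]
sqft = [1790, 2030, 1740, 1980, 2130, 1780, 1830, 2160, 2110]

r = pearson_r(price, sqft)
print(f"r(Price, SqFt) for the nine listed homes = {r:.2f}")
```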
Multivariate Analysis: Scatterplots
A 2D plot of the values of two variables, showing their association graphically.
The figure shows a scatterplot of Price vs. SqFt (recall that the correlation between these two variables was 0.55).
Question: How would zero correlation look on a scatterplot?
[Figure: scatterplot of Price (80000 to 200000) against SqFt (1600 to 2600)]
Effects of Outliers and Non-Normality
Pearson’s correlation coefficient indicates the strength of a linear relationship between two variables (e.g., Figure 1).
Non-linear associations (Figure 2) are hard to interpret using Pearson’s correlation.
Outliers (Figures 3 and 4) skew the correlation coefficient; outliers must be identified and removed before correlation analysis.
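The effect of a single outlier can be illustrated with made-up numbers. In the sketch below (hypothetical data, not taken from the housing sample), five points lie on a perfectly decreasing line, so r is -1; adding one extreme point flips the coefficient to a strongly positive value.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: a perfectly decreasing linear relationship.
x = [1, 2, 3, 4, 5]
y = [5, 4, 3, 2, 1]
r_clean = pearson_r(x, y)

# The same data plus one extreme point far from the rest.
r_outlier = pearson_r(x + [20], y + [20])

print(f"without outlier: r = {r_clean:.2f}")
print(f"with outlier:    r = {r_outlier:.2f}")
```

Plotting the data before computing r is the simplest way to spot such points.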
Scatterplot Matrices
Combination of a correlation table and scatterplots.
Too much information?
Questions:
What does this matrix tell you about the relationship between price and number of offers?
Why do some plots look like straight lines?
Why are Brick and Neighborhood not in this matrix?
[Figure: scatterplot matrix of Price, SqFt, Bedrooms, Bathrooms, and Offers]
One More Scatterplot Matrix
[Figure: scatterplot matrix of Price, SqFt, Bedrooms, Bathrooms, and Offers, with the pairwise correlations shown in the upper panels]

Correlations shown in the figure:
           SqFt    Bedrooms  Bathrooms  Offers
Price      0.55*   0.53*     0.52*      -0.31*
SqFt               0.48*     0.52*      0.34*
Bedrooms                     0.41*      0.11
Bathrooms                               0.14
Key Takeaways
We need descriptive statistics to (a) organize data, and (b) summarize data.
Descriptive analysis can be univariate or multivariate.
Univariate summary statistics can be grouped into central tendency measures (mean,
median, mode) and dispersion measures (range, IQR, variance, SD).
Summary data provide a general sense of the overall data sample, but to understand what
individual observations look like, you need graphical depictions such as frequency
distributions, histograms, etc.
Multivariate analysis examines associations between two or more variables, again as numeric
statistics (correlation) or as graphical plots (scatterplots).
The formulae look daunting but they are all computed today by software.
Today’s software is also sophisticated enough to draw powerful graphical plots.