Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
Data analysis
14.310x
1 / 29
OK, we have data, shall we look at it?
• Plotting Histograms
• Plotting Kernels density
• Comparing two (or more) distribution
• Plotting estimates of CDFs
• Bivariate distributions
2 / 29
• A histogram is a rough estimate of the probability distribution
function of a continuous variable.
• We obtain it by binning the data (typically in bins of equal
size) and simply counting the number of observations within
each bin.
• Formally, a histogram is a function that counts the number of
observations that fit into each bin. Let n be the total number
of observations and k the P number of bins, the histogram
meets the definition: n = ki=1 mi
• Graph of a histogram: We draw, for each bin, a rectangle
proportional to the number of such cases.
• You can also divide by the total number of observations to
obtain the density: the proportion of cases that within each
bin.
3 / 29
Example: Women’s height in Bihar
1500
1000
count
500
4 / 29
Our universe in R
5 / 29
The code for the basic histogram
6 / 29
Mmmmm..Not too pretty
4000
3000
count
2000
1000
7 / 29
Let’s try again
8 / 29
That is better
1500
1000
count
500
9 / 29
Playing with the bins
Female Height in Bihar
6000
3000
4000
2000
1000 2000
0 0
120 140 160 180 200 120 140 160 180
bin width=5 bin width=10
10000
4000 7500
5000
2000
2500
0 0
125 150 175 200 100 150 200
bin width=20 bin width=50
10 / 29
Same thing for US women
300
200
count
100
11 / 29
The Kernel Density Estimation
• The histogram is a little a bit bumpy ...
• Kernel density estimation is a non-parametric way to estimate
the probability density function of a random variable.
• Straightforward extension of the histogram: at each point we
take a weighted average of the frequency of the observations.
• Formally: Let (x1, x2, ..., xn) be an independent and
identically distributed sample drawn from some distribution
with an unknown PDF f . We are interested in estimating the
shape of this function f . Its kernel density estimator is given
by:
n n
ˆ 1X 1 X x − xi
fh (x) = Kh (x − xi ) = K
n nh h
i=1 i=1
• where K () is the kernel, a non-negative function that
integrates to one and has mean zero , and h > 0 the
bandwidth
• Many choices for K (), but it is typically bell shaped
12 / 29
The Kernel Density Estimation
density.default(x = x, kernel = "gaussian")
0.30
0.25
0.20
Density
0.15
0.10
0.05
0.00
−2 0 2 4 6
N = 10 Bandwidth = 0.678
13 / 29
14 / 29
0.06
0.04
density
0.02
0.00
15 / 29
Things to choose
16 / 29
Varying the bandwidth
0.04 0.04
0.02 0.02
0.00 0.00
140 150 160 170 180 140 150 160 170 180
bw=1 bw=5
0.06 0.06
0.04 0.04
0.02 0.02
0.00 0.00
140 150 160 170 180 140 150 160 170 180
bw=10 bw=20
17 / 29
More about this distribution
• What do you think about the shape of this curve?
• It is unimodal, thin-tailed, symmetric
• Does it remind you of something we saw on the board of a
previous lecture?
• With large enough n, the binomal distribution started to have
this shape, no?
• The binomial distribution B(n, p) is approximately normal
with mean np and variance np(1 − p) for large n and for p not
too close to zero or one.
• We will come back to the Normal distribution in great detail
later, because it will turn out that many random variables can
be conveniently assumed to have a normal distribution.
• Height (in a particular population) is a canonical example in
textbooks, but why should it be normally distributed? or not?
when we define the normal distribution more formally, and
discuss where it comes from, we will discuss why heights
should or should not be normally distributed. 18 / 29
Beginning to play with a bit more
information: Female Heights in US and
Bihar
0.06
density
0.04
0.02
0.00
19 / 29
Cumulative histogram, cumulative CDF
20 / 29
Beginning to play with a bit more
information: Female Heights in US and
Bihar
1.00
0.75
0.50
y
0.25
0.00
21 / 29
When do we want to plot pdf vs cdf?
22 / 29
Haven’t you learnt something exciting??
1.00
0.75
0.50
y
0.25
0.00
23 / 29
Not too surprised? OK , let’s go back to
Steph Curry
1.00
0.75
player
Kevin Durrant
0.50
Lebron James
Stephen Curry
0.25
0.00
0 10 20 30 40
Distance to basket for shot made, CDF
24 / 29
Code to produce this graph
25 / 29
The boxplot
Stephen Curry
Kevin Durrant
Lebron James
0 10 20 30 40
Boxplots of distance of shot made, by player
26 / 29
Representing joint distributions
27 / 29
A basketball court
94
Distance from
Baseline
Y
-25 0 25
Distance from Midline
X
28 / 29
A histogram of the joint density–or the
map of a basketball court?
60
count
from baseline 40 40
30
20
20 10
0
−20 −10 0 10 20
distance_from_midline_feet
30
from baseline
20
10
−20 −10 0 10 20
distance_from_midline_feet
Now we see pretty clearly that there is bunching at the 3pt line!
29 / 29