
The Random Variable for Probabilities
Chris Piech
CS109, Stanford University
Assignment Grades
[Figure: four histograms of assignment grade frequencies; x-axis 10–100, y-axis Frequency]
We have 2055 assignment distributions from Gradescope
Flip a Coin With Unknown Probability

Demo
Four Prototypical Trajectories

Today we are going to learn something unintuitive, beautiful, and useful
Review
Conditional Events

• Recall that for events E and F:

P(E | F) = P(EF) / P(F)    where P(F) > 0
Discrete Conditional Distributions
Continuous Conditional Distributions
• Let X and Y be continuous random variables
§ Conditional PDF of X given Y (where fY(y) > 0):

P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)

fX|Y(x | y) ∂x = fX,Y(x, y) ∂x ∂y / (fY(y) ∂y)

fX|Y(x | y) ∂x = fX,Y(x, y) ∂x / fY(y)

fX|Y(x | y) = fX,Y(x, y) / fY(y)
Conditioning with a continuous random variable feels weird at first. But then it gets good.

It's like biking with a helmet…

Piech, CS106A, Stanford University


Continuous Conditional Distributions
• Let X be a continuous random variable
• Let E be an event:

P(E | X = x) = P(X = x, E) / P(X = x)
             = P(X = x | E) P(E) / P(X = x)
             = fX(x | E) P(E) ∂x / (fX(x) ∂x)
             = fX(x | E) P(E) / fX(x)
Anomaly Detection
• Let X be a measure of time to answer a question
• Let E be the event that the user is a human:

P(E | X = x) = P(X = x, E) / P(X = x)
             = P(X = x | E) P(E) / P(X = x)
             = fX(x | E) P(E) ∂x / (fX(x) ∂x)
             = fX(x | E) P(E) / fX(x)



Anomaly Detection
• Let X be a measure of time to answer a question
• Let E be the event that the user is a human
• What if you don't know the unconditional density fX(x)?
§ Model each conditional density (e.g. with a Normal pdf) and take a ratio, so that fX(x) cancels:

P(E | X = x) = fX(x | E) P(E) / fX(x)

P(E | X = x) / P(E^C | X = x) = [fX(x | E) P(E) / fX(x)] / [fX(x | E^C) P(E^C) / fX(x)]
                              = fX(x | E) P(E) / (fX(x | E^C) P(E^C))
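A minimal sketch of this Bayes computation in Python. The response-time model below is entirely made up for illustration (humans answering in roughly Normal(15, 5) seconds, bots in roughly Normal(2, 1), prior P(E) = 0.9); none of these numbers come from the lecture, and `p_human_given_x` is a hypothetical helper name.

```python
from scipy import stats

# Hypothetical model (assumed numbers, not from the lecture):
# humans answer in ~Normal(15, 5) seconds, bots in ~Normal(2, 1).
P_E = 0.9  # prior P(user is human)

def p_human_given_x(x):
    f_given_E = stats.norm.pdf(x, loc=15, scale=5)    # fX(x | E)
    f_given_Ec = stats.norm.pdf(x, loc=2, scale=1)    # fX(x | E^C)
    # Expand fX(x) with the law of total probability, as in the odds-ratio trick
    f_x = f_given_E * P_E + f_given_Ec * (1 - P_E)
    return f_given_E * P_E / f_x

print(p_human_given_x(14.0))  # a human-like response time: probability near 1
print(p_human_given_x(2.0))   # a bot-like response time: probability near 0
```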
Biometric Keystroke



Mixing Discrete and Continuous
• Let X be a continuous random variable
• Let N be a discrete random variable
§ Conditional PDF of X given N:

fX|N(x | n) = pN|X(n | x) fX(x) / pN(n)

§ Conditional PMF of N given X:

pN|X(n | x) = fX|N(x | n) pN(n) / fX(x)

§ If X and N are independent, then:

fX|N(x | n) = fX(x)        pN|X(n | x) = pN(n)
End Review
We are going to think of
probabilities as random
variables!!!



Flip a Coin With Unknown Probability
• Flip a coin (n + m) times, comes up with n heads
§ We don’t know probability X that coin comes up heads

Frequentist:
X = lim as (n + m) → ∞ of n / (n + m) ≈ n / (n + m)
X is a single value

Bayesian:
fX|N(x | n) = P(N = n | X = x) fX(x) / P(N = n)
X is a random variable


Flip a Coin With Unknown Probability
• Flip a coin (n + m) times, comes up with n heads
§ We don’t know probability X that coin comes up heads
§ Our belief before flipping coins is that: X ~ Uni(0, 1)
§ Let N = number of heads
§ Given X = x, coin flips independent: (N | X) ~ Bin(n + m, x)
fX|N(x | n) = P(N = n | X = x) fX(x) / P(N = n)    ← Bayesian "posterior"; fX(x) is the "prior" probability distribution

           = (n + m choose n) x^n (1 − x)^m / P(N = n)    (prior X ~ Uni(0, 1), so fX(x) = 1)

           = (1/c) · x^n (1 − x)^m    where c = ∫₀¹ x^n (1 − x)^m dx
Beta Random Variable
• X is a Beta Random Variable: X ~ Beta(a, b)
§ Probability Density Function (PDF), where a, b > 0:

f(x) = (1 / B(a, b)) · x^(a−1) (1 − x)^(b−1)    for 0 < x < 1
f(x) = 0                                        otherwise

where B(a, b) = ∫₀¹ x^(a−1) (1 − x)^(b−1) dx

§ Symmetric when a = b
§ E[X] = a / (a + b)        Var(X) = ab / ((a + b)² (a + b + 1))
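A quick numeric sanity check of those formulas, assuming `scipy.stats` is available. The parameters a = 8.28, b = 3.16 are the fit that appears later in the deck.

```python
from scipy import stats

a, b = 8.28, 3.16  # Beta fit from the assignment-grades demo later in the deck
X = stats.beta(a, b)

# Closed-form mean and variance from the slide
mean = a / (a + b)
var = a * b / ((a + b) ** 2 * (a + b + 1))

print(mean, X.mean())  # the closed forms match scipy's values
print(var, X.var())
```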
Meta Beta

Used to represent a distributed belief about a probability.
Beta is a distribution for probabilities.



Back to flipping coins
• Flip a coin (n + m) times, comes up with n heads
§ We don’t know probability X that coin comes up heads
§ Our belief before flipping coins is that: X ~ Uni(0, 1)
§ Let N = number of heads
§ Given X = x, coin flips independent: (N | X) ~ Bin(n + m, x)
fX|N(x | n) = P(N = n | X = x) fX(x) / P(N = n)

           = (n + m choose n) x^n (1 − x)^m / P(N = n)    (Binomial likelihood, uniform prior)

           = (1/c) · x^n (1 − x)^m    where c = ∫₀¹ x^n (1 − x)^m dx
Dude, Where’s My Beta?
• Flip a coin (n + m) times, comes up with n heads
§ Conditional density of X given N = n
fX|N(x | n) = (1/c) · x^n (1 − x)^m    where c = ∫₀¹ x^n (1 − x)^m dx

§ Note: 0 < x < 1, so fX|N(x | n) = 0 otherwise

§ Recall the Beta distribution:

f(x) = (1 / B(a, b)) · x^(a−1) (1 − x)^(b−1)    for 0 < x < 1, and 0 otherwise
where B(a, b) = ∫₀¹ x^(a−1) (1 − x)^(b−1) dx

§ Hey, that looks more familiar now...


§ X | (N = n, n + m trials) ~ Beta(n + 1, m + 1)
Understanding Beta
• X | (N = n, m + n trials) ~ Beta(n + 1, m + 1)
§ X ~ Uni(0, 1)

§ Check this out, boss:


o Beta(1, 1) =?
f(x) = (1 / B(a, b)) · x^(a−1) (1 − x)^(b−1) = (1 / B(1, 1)) · x⁰ (1 − x)⁰
     = 1 / ∫₀¹ 1 dx = 1    where 0 < x < 1

o Beta(1, 1) = Uni(0, 1)

§ So, X ~ Beta(1, 1)
If the Prior was a Beta…
If our belief about X (the random variable for a probability) was a Beta:

fX(x) = (1 / B(a, b)) · x^(a−1) (1 − x)^(b−1)

What is our belief about X after observing N = n heads?

fX|N(x | n) = ???


If the Prior was a Beta…
fX|N(x | n) = P(N = n | X = x) fX(x) / P(N = n)

           = (n + m choose n) x^n (1 − x)^m fX(x) / P(N = n)

           = (n + m choose n) x^n (1 − x)^m · (1 / B(a, b)) x^(a−1) (1 − x)^(b−1) / P(N = n)

           = K₁ · x^n (1 − x)^m · (1 / B(a, b)) x^(a−1) (1 − x)^(b−1)

           = K₂ · x^n (1 − x)^m · x^(a−1) (1 − x)^(b−1)

           = K₂ · x^(n+a−1) (1 − x)^(m+b−1)

X | N ~ Beta(n + a, m + b)
Understanding Beta

§ If “Prior” distribution of X (before seeing flips) is Beta

§ Then “Posterior” distribution of X (after flips) is Beta

• Beta is a conjugate distribution for the Binomial


§ Prior and posterior parametric forms are the same!
§ Practically, conjugate means easy update:
o Add number of “heads” and “tails” seen to Beta parameters
Further Understanding Beta
• Can set X ~ Beta(a, b) as a prior to reflect how
biased you think the coin is a priori
§ This is a subjective probability!
§ Then observe n + m trials,
where n of the trials are heads
• Update to get the posterior probability
§ X | (n heads in n + m trials) ~ Beta(a + n, b + m)
§ Sometimes a and b are called the "equivalent sample size"
§ The prior for X is based on seeing (a + b – 2)
"imaginary" trials, where (a – 1) of them were heads
§ Beta(1, 1) ~ Uni(0, 1) → we haven't seen any
"imaginary" trials, so a priori we know nothing about the coin
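The conjugate update above can be sketched in a few lines: each observed head bumps a, each tail bumps b. The coin's true bias (0.7) and the number of flips (100) are assumptions of this simulation, not values from the lecture.

```python
import random
from scipy import stats

random.seed(109)

true_p = 0.7   # hidden coin bias (assumed, for the simulation only)
a, b = 1, 1    # Beta(1, 1) prior = Uni(0, 1): no "imaginary" trials

for _ in range(100):
    if random.random() < true_p:
        a += 1  # conjugate update: each head adds 1 to a...
    else:
        b += 1  # ...each tail adds 1 to b

posterior = stats.beta(a, b)
print(a, b, posterior.mean())  # posterior mean = a/(a+b), near the true bias
```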
Check out Demo!
Damn
Next level?
Assignment Grades
[Figure: four histograms of assignment grade frequencies; x-axis 10–100, y-axis Frequency]
We have 2055 assignment distributions from Gradescope
Distributions
Binomial · Geometric · Exponential · Neg Binomial · Normal · Poisson · Uniform · Beta
Grades must be bounded


Normal: No
Poisson: No
Exponential: No
Beta: Yes
Assignment Grades Demo
Assignment id = '1613'
[Figure: histogram of grade frequencies; x-axis 10–100, y-axis Frequency]

X ~ Beta(a = 8.28, b = 3.16)
Assignment Grades
[Figure: four grade histograms (x-axis 10–100, y-axis Frequency) with fitted Beta parameters:
a = 4.03, b = 3.00 · a = 8.28, b = 3.16 · a = 5.06, b = 3.79 · a = 3.46, b = 0.35]
We have 2055 assignment distributions from Gradescope
Beta is a Better Fit

Unpublished results. Based on Gradescope data


Beta is a Better Fit For All Class Sizes

Unpublished results. Based on Gradescope data


Binomial Interpretation

Each student has the same probability of getting each
point. Generate grades by flipping a coin 100 times for
each student. The resulting distribution is binomial.

- Binomial
Normal Interpretation

What the Binomial said, but approximated.

- Normal
Beta Interpretation

Each student's ability is represented as a probability.
The distribution of probabilities is a Beta distribution.
Each student has a different probability of getting
points, and that probability is sampled from a Beta
distribution.

- Beta

• This is Chris Piech's opinion. It is open for debate
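The Beta interpretation above is a generative story, and can be sketched with the standard library alone: draw each student's point-probability from the fitted Beta, then flip a coin per point. The 100-point assignment size and the 1000 simulated students are assumptions; the parameters are the fit from one assignment shown earlier.

```python
import random

random.seed(42)

A, B = 8.28, 3.16  # Beta fit from one assignment in the deck

def sample_grade():
    # Each student's point-probability is drawn from the Beta...
    p = random.betavariate(A, B)
    # ...then the student "flips a coin" for each of the 100 points.
    return sum(random.random() < p for _ in range(100))

grades = [sample_grade() for _ in range(1000)]
print(sum(grades) / len(grades))  # close to 100 · a/(a+b) ≈ 72
```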


Assignment Grades
[Figure: four grade histograms (x-axis 10–100, y-axis Frequency) with fitted Beta parameters:
a = 4.03, b = 3.00 · a = 8.28, b = 3.16 · a = 5.06, b = 3.79 · a = 3.46, b = 0.35]
These are the distributions of student point probabilities
Assignment Grades Demo
What is the semantics of E[X]?
[Figure: histogram of grade frequencies; x-axis 10–100, y-axis Frequency]

X ~ Beta(a = 8.28, b = 3.16)

Assignment Grades
What is the probability that a student is below the mean?

X ~ Beta(a = 8.28, b = 3.16)

E[X] = a / (a + b) = 8.28 / (8.28 + 3.16) ≈ 0.7238

P(X < 0.7238) = FX(0.7238)

Wait what? Chris, are you holding out on me?

stats.beta.cdf(x, alpha, beta)

P(X < E[X]) = 0.46

As far as I know, this is an unpublished result
Implications

• Will be combined with Item Response Theory, which models how assignment difficulty and student ability combine to give point probabilities.

• Suggests a way to calculate final grades as a probabilistic most-likely estimate of "ability".

• Machine learning on education data will be more accurate.

• Analysis of "mixture" distributions can be fixed.

Will you use this on us?


Not yet ☺
Beta:
The probability density for probabilities

Any parameter of a "parameterized" random variable can be thought of as a random variable.



Course Mean

E[CS109]
This is the actual midpoint of the course
(Just wanted you to know)