Page 314 - Understanding Machine Learning
Generative Models
using the drug. To do so, the drug company sampled a training set of m people and
gave them the drug. Let S = (x_1, ..., x_m) denote the training set, where for each i,
x_i = 1 if the ith person survived and x_i = 0 otherwise. We can model the underlying
distribution using a single parameter, θ ∈ [0,1], indicating the probability of survival.
We now would like to estimate the parameter θ on the basis of the training set S.
A natural idea is to use the average number of 1's in S as an estimator. That is,

\hat{\theta} = \frac{1}{m} \sum_{i=1}^{m} x_i.   (24.1)
Clearly, E_S[\hat{\theta}] = θ. That is, \hat{\theta} is an unbiased estimator of θ. Furthermore,
since \hat{\theta} is the average of m i.i.d. binary random variables we can use Hoeffding's
inequality to get that with probability of at least 1 − δ over the choice of S we have that

|\hat{\theta} - \theta| \le \sqrt{\frac{\log(2/\delta)}{2m}}.   (24.2)
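As a concrete illustration (a minimal sketch; the sample S and the choice of δ below are hypothetical), the empirical-mean estimator of Equation (24.1) and the Hoeffding confidence radius of Equation (24.2) can be computed directly:

```python
import math

# Hypothetical training set: x_i = 1 if the ith person survived, 0 otherwise.
S = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
m = len(S)

# Empirical-mean estimator of Equation (24.1).
theta_hat = sum(S) / m

# Hoeffding confidence radius of Equation (24.2), for confidence 1 - delta.
delta = 0.05
radius = math.sqrt(math.log(2 / delta) / (2 * m))

print(theta_hat)  # -> 0.7
# With probability at least 1 - delta, the true theta lies in this interval:
print(theta_hat - radius, theta_hat + radius)
```

Note that the radius shrinks at rate 1/√m, so quadrupling the sample size halves the width of the confidence interval.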
Another interpretation of \hat{\theta} is as the Maximum Likelihood Estimator, as we
formally explain now. We first write the probability of generating the sample S:

P[S = (x_1, ..., x_m)] = \prod_{i=1}^{m} \theta^{x_i} (1-\theta)^{1-x_i} = \theta^{\sum_i x_i} (1-\theta)^{\sum_i (1-x_i)}.
We define the log likelihood of S, given the parameter θ, as the log of the preceding
expression:
L(S;\theta) = \log P[S = (x_1, ..., x_m)] = \log(\theta) \sum_i x_i + \log(1-\theta) \sum_i (1 - x_i).
The maximum likelihood estimator is the parameter that maximizes the likelihood
\hat{\theta} \in \operatorname{argmax}_{\theta} L(S;\theta).   (24.3)
Next, we show that in our case, Equation (24.1) is a maximum likelihood estimator.
To see this, we take the derivative of L(S;θ) with respect to θ and equate it to zero:

\frac{\sum_i x_i}{\theta} - \frac{\sum_i (1 - x_i)}{1 - \theta} = 0.
Solving the equation for θ we obtain the estimator given in Equation (24.1).
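This can also be checked numerically (a sketch with a hypothetical sample and a simple grid search, not part of the text's derivation): maximizing the log likelihood over a fine grid of θ values recovers the empirical mean of Equation (24.1).

```python
import math

# Hypothetical Bernoulli sample.
S = [1, 0, 1, 1, 0, 1, 0, 1]

def log_likelihood(theta, S):
    # L(S; theta) = log(theta) * sum_i x_i + log(1 - theta) * sum_i (1 - x_i)
    ones = sum(S)
    zeros = len(S) - ones
    return math.log(theta) * ones + math.log(1 - theta) * zeros

# Maximize over a fine grid of theta values in (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
theta_ml = max(grid, key=lambda t: log_likelihood(t, S))

print(theta_ml)         # -> 0.625
print(sum(S) / len(S))  # -> 0.625, the empirical mean agrees
```

The agreement is exact here because the log likelihood is concave in θ and its unique maximizer, the empirical mean, happens to lie on the grid.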
24.1.1 Maximum Likelihood Estimation for Continuous
Random Variables
Let X be a continuous random variable. Then, for most x ∈ R we have P[X = x] = 0
and therefore the definition of likelihood as given before is trivialized. To overcome
this technical problem we define the likelihood as the log of the density of the
probability of X at x. That is, given an i.i.d. training set S = (x_1, ..., x_m) sampled
according to a density distribution P_θ we define the likelihood of S given θ as
L(S;\theta) = \log \prod_{i=1}^{m} P_\theta(x_i) = \sum_{i=1}^{m} \log(P_\theta(x_i)).
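As a concrete continuous example (a sketch under assumptions not yet introduced in the text: the sample is hypothetical and the density is taken to be Gaussian with known unit variance), the log likelihood of S under P_μ can be evaluated as a sum of log densities:

```python
import math

# Hypothetical i.i.d. sample from a continuous distribution.
S = [0.5, 1.2, -0.3, 0.9, 0.1]

def gaussian_density(x, mu, sigma=1.0):
    # Density of N(mu, sigma^2) at x.
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def log_likelihood(mu, S):
    # L(S; mu) = sum_i log P_mu(x_i)
    return sum(math.log(gaussian_density(x, mu)) for x in S)

# For a Gaussian with known variance, the maximizer of L(S; mu) is the sample mean.
mu_hat = sum(S) / len(S)
print(mu_hat)  # -> 0.48
print(log_likelihood(mu_hat, S) > log_likelihood(mu_hat + 0.1, S))  # -> True
```

Working with sums of log densities rather than products of densities also avoids numerical underflow when m is large.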