Page 314 - Understanding Machine Learning

Generative Models
                 using the drug. To do so, the drug company sampled a training set of m people and
gave them the drug. Let $S = (x_1, \ldots, x_m)$ denote the training set, where for each i,
$x_i = 1$ if the ith person survived and $x_i = 0$ otherwise. We can model the underlying
distribution using a single parameter, θ ∈ [0,1], indicating the probability of survival.
                    We now would like to estimate the parameter θ on the basis of the training set S.
                 A natural idea is to use the average number of 1’s in S as an estimator. That is,
$$\hat{\theta} = \frac{1}{m}\sum_{i=1}^{m} x_i. \qquad (24.1)$$
Clearly, $\mathbb{E}_S[\hat{\theta}] = \theta$. That is, $\hat{\theta}$ is an unbiased estimator of θ. Furthermore, since $\hat{\theta}$ is
the average of m i.i.d. binary random variables, we can use Hoeffding's inequality to
get that with probability of at least 1 − δ over the choice of S we have that

$$|\hat{\theta} - \theta| \le \sqrt{\frac{\log(2/\delta)}{2m}}. \qquad (24.2)$$
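To make the estimator and its bound concrete, here is a minimal simulation sketch (not from the book; the true survival probability θ = 0.7, the sample size m, and the confidence level δ are illustrative assumptions):

```python
import numpy as np

# Illustrative sketch (assumed values, not from the book): simulate a
# Bernoulli(theta) training set, compute the empirical-mean estimator of
# Equation (24.1), and evaluate the Hoeffding bound of Equation (24.2).
rng = np.random.default_rng(0)
theta = 0.7            # true survival probability (unknown in practice)
m, delta = 10_000, 0.05

S = rng.binomial(1, theta, size=m)             # x_i in {0, 1}
theta_hat = S.mean()                           # estimator (24.1)
bound = np.sqrt(np.log(2 / delta) / (2 * m))   # Hoeffding bound (24.2)

# With probability at least 1 - delta over the draw of S,
# |theta_hat - theta| <= bound.
print(theta_hat, bound)
```

Note how the bound shrinks as $1/\sqrt{m}$: quadrupling the sample size halves the guaranteed estimation error.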
Another interpretation of $\hat{\theta}$ is as the Maximum Likelihood Estimator, as we
formally explain now. We first write the probability of generating the sample S:
$$\mathbb{P}[S=(x_1,\ldots,x_m)] = \prod_{i=1}^{m} \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{\sum_i x_i}(1-\theta)^{\sum_i (1-x_i)}.$$
                 We define the log likelihood of S, given the parameter θ, as the log of the preceding
                 expression:


$$L(S;\theta) = \log \mathbb{P}[S=(x_1,\ldots,x_m)] = \log(\theta)\sum_i x_i + \log(1-\theta)\sum_i (1-x_i).$$
                 The maximum likelihood estimator is the parameter that maximizes the likelihood
$$\hat{\theta} \in \operatorname*{argmax}_{\theta} L(S;\theta). \qquad (24.3)$$
Next, we show that in our case, Equation (24.1) is a maximum likelihood estimator.
To see this, we take the derivative of L(S;θ) with respect to θ and equate it to zero:

$$\frac{\sum_i x_i}{\theta} - \frac{\sum_i (1-x_i)}{1-\theta} = 0.$$
                 Solving the equation for θ we obtain the estimator given in Equation (24.1).
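Filling in the algebra: since $\sum_i x_i + \sum_i (1-x_i) = m$, the stationarity condition rearranges to the empirical average:

$$\frac{\sum_i x_i}{\theta} = \frac{\sum_i (1-x_i)}{1-\theta}
\;\Longrightarrow\; (1-\theta)\sum_i x_i = \theta\sum_i (1-x_i)
\;\Longrightarrow\; \sum_i x_i = \theta m
\;\Longrightarrow\; \hat{\theta} = \frac{1}{m}\sum_{i=1}^{m} x_i.$$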

24.1.1 Maximum Likelihood Estimation for Continuous Random Variables
Let X be a continuous random variable. Then, for most x ∈ ℝ we have P[X = x] = 0
and therefore the definition of likelihood as given before is trivialized. To overcome
this technical problem we define the likelihood as the log of the density of the probability
of X at x. That is, given an i.i.d. training set $S = (x_1, \ldots, x_m)$ sampled according
to a density distribution $P_\theta$, we define the likelihood of S given θ as

$$L(S;\theta) = \log \prod_{i=1}^{m} P_\theta(x_i) = \sum_{i=1}^{m} \log(P_\theta(x_i)).$$
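As a concrete illustration (an assumed model for this sketch, not the book's example), take $P_\theta$ to be the Gaussian density $N(\theta, 1)$; the log likelihood above is then maximized at the sample mean, which we can verify numerically:

```python
import numpy as np

# Sketch under an assumed model: P_theta is the N(theta, 1) density, so
# log P_theta(x) = -0.5 * (x - theta)^2 - 0.5 * log(2 * pi).
rng = np.random.default_rng(1)
S = rng.normal(loc=2.0, scale=1.0, size=1000)  # simulated training set

def log_likelihood(S, theta):
    """L(S; theta) = sum_i log P_theta(x_i) for the N(theta, 1) density."""
    return np.sum(-0.5 * (S - theta) ** 2 - 0.5 * np.log(2 * np.pi))

# Maximize over a grid of candidate theta values; the grid maximizer
# should match the sample mean, the closed-form maximum likelihood
# estimate for this model.
grid = np.linspace(0.0, 4.0, 401)
best = grid[np.argmax([log_likelihood(S, t) for t in grid])]
print(best, S.mean())
```

The grid search stands in for the calculus: differentiating the Gaussian log likelihood with respect to θ and equating to zero gives exactly the sample mean, mirroring the Bernoulli derivation above.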