Therefore, we only need to show that for any $\theta$, the solution of $\operatorname{argmax}_{Q \in \mathcal{Q}} G(Q,\theta)$ is to set $Q_{i,y} = P_\theta[Y=y \mid X=x_i]$. Indeed, by Jensen's inequality, for any $Q \in \mathcal{Q}$ we have that
$$
\begin{aligned}
G(Q,\theta) &= \sum_{i=1}^{m} \sum_{y=1}^{k} Q_{i,y} \log\!\left(\frac{P_\theta[X=x_i,\, Y=y]}{Q_{i,y}}\right) \\
&\le \sum_{i=1}^{m} \log\!\left(\sum_{y=1}^{k} Q_{i,y}\, \frac{P_\theta[X=x_i,\, Y=y]}{Q_{i,y}}\right) \\
&= \sum_{i=1}^{m} \log\!\left(\sum_{y=1}^{k} P_\theta[X=x_i,\, Y=y]\right) \\
&= \sum_{i=1}^{m} \log P_\theta[X=x_i] \;=\; L(\theta),
\end{aligned}
$$
while for $Q_{i,y} = P_\theta[Y=y \mid X=x_i]$ we have
$$
\begin{aligned}
G(Q,\theta) &= \sum_{i=1}^{m} \sum_{y=1}^{k} P_\theta[Y=y \mid X=x_i] \log\!\left(\frac{P_\theta[X=x_i,\, Y=y]}{P_\theta[Y=y \mid X=x_i]}\right) \\
&= \sum_{i=1}^{m} \sum_{y=1}^{k} P_\theta[Y=y \mid X=x_i] \log P_\theta[X=x_i] \\
&= \sum_{i=1}^{m} \log P_\theta[X=x_i] \sum_{y=1}^{k} P_\theta[Y=y \mid X=x_i] \\
&= \sum_{i=1}^{m} \log P_\theta[X=x_i] \;=\; L(\theta).
\end{aligned}
$$
This shows that setting $Q_{i,y} = P_\theta[Y=y \mid X=x_i]$ maximizes $G(Q,\theta)$ over $Q \in \mathcal{Q}$ and shows that $G(Q^{(t+1)},\theta^{(t)}) = L(\theta^{(t)})$.
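To make the preceding equality and inequality concrete, here is a minimal numerical sketch (not part of the book) on a toy one-dimensional mixture of two unit-variance Gaussians: it checks that the posterior choice $Q_{i,y} = P_\theta[Y=y \mid X=x_i]$ attains $G(Q,\theta) = L(\theta)$, while an arbitrary row-stochastic $Q$ gives a value no larger than $L(\theta)$. All names (`joint`, `G`, `Q_post`, `Q_rand`) are illustrative choices, not the book's notation.

```python
# Numerical sketch of the lemma on a toy 1-D mixture of two unit-variance Gaussians:
# the posterior Q attains G(Q, theta) = L(theta); any other Q in the simplex does not exceed it.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(size=10)                      # toy sample x_1, ..., x_m
c = np.array([0.3, 0.7])                     # mixing weights c_y
mu = np.array([-1.0, 2.0])                   # component means mu_y

# joint[i, y] = P_theta[X = x_i, Y = y] (unit-variance components for simplicity)
joint = c * norm.pdf(x[:, None], loc=mu, scale=1.0)

def G(Q):
    """G(Q, theta) = sum_{i,y} Q_{i,y} log(P_theta[X=x_i, Y=y] / Q_{i,y})."""
    return np.sum(Q * (np.log(joint) - np.log(Q)))

L = np.sum(np.log(joint.sum(axis=1)))        # log-likelihood L(theta)

Q_post = joint / joint.sum(axis=1, keepdims=True)    # posterior P_theta[Y=y | X=x_i]
Q_rand = rng.dirichlet(np.ones(2), size=len(x))      # some other row-stochastic Q

print(np.isclose(G(Q_post), L))              # True: the equality case of the lemma
print(G(Q_rand) <= L)                        # True: Jensen's inequality
```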
The preceding lemma immediately implies:
Theorem 24.3. The EM procedure never decreases the log-likelihood; namely, for all $t$,
$$L(\theta^{(t+1)}) \ge L(\theta^{(t)}).$$
Proof. By the lemma we have
$$L(\theta^{(t+1)}) = G(Q^{(t+2)},\theta^{(t+1)}) \ge G(Q^{(t+1)},\theta^{(t)}) = L(\theta^{(t)}),$$
where the two equalities follow from the lemma, and the inequality holds because $Q^{(t+2)}$ maximizes $G(Q,\theta^{(t+1)})$ over $Q$ and $\theta^{(t+1)}$ maximizes $G(Q^{(t+1)},\theta)$ over $\theta$.
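As an illustration of Theorem 24.3 (and a preview of the next subsection), below is a minimal sketch, not the book's code, of an EM loop for a toy one-dimensional mixture of two unit-variance Gaussians in which only the mixing weights $c_y$ and the means $\mu_y$ are updated; it records $L(\theta^{(t)})$ at each iteration and verifies that the sequence never decreases. The update rules used here are the standard soft k-means / GMM responsibilities and weighted averages, which may differ in details from the parameterization used in Section 24.4.2.

```python
# Sketch: run EM on a toy 1-D mixture of two unit-variance Gaussians and check
# that the log-likelihood sequence L(theta^{(t)}) is non-decreasing (Theorem 24.3).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])   # toy data
c, mu = np.array([0.5, 0.5]), np.array([-1.0, 1.0])                    # theta^{(0)}

log_likelihoods = []
for t in range(50):
    joint = c * norm.pdf(x[:, None], loc=mu, scale=1.0)   # P_theta[X=x_i, Y=y]
    log_likelihoods.append(np.sum(np.log(joint.sum(axis=1))))

    # E-step: Q_{i,y} = P_theta[Y=y | X=x_i]
    Q = joint / joint.sum(axis=1, keepdims=True)

    # M-step: maximize G(Q, theta) over theta (weights and means only)
    c = Q.mean(axis=0)
    mu = (Q * x[:, None]).sum(axis=0) / Q.sum(axis=0)

# Theorem 24.3: the sequence L(theta^{(t)}) never decreases (up to rounding error)
print(np.all(np.diff(log_likelihoods) >= -1e-10))   # True
```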
24.4.2 EM for Mixture of Gaussians (Soft k-Means)
Consider the case of a mixture of $k$ Gaussians in which $\theta$ is a triplet $(\mathbf{c}, \{\boldsymbol{\mu}_1,\ldots,\boldsymbol{\mu}_k\}, \{\Sigma_1,\ldots,\Sigma_k\})$, where $P_\theta[Y=y] = c_y$ and $P_\theta[X=x \mid Y=y]$ is as given