Therefore, we only need to show that for any $\theta$, the solution of $\operatorname{argmax}_{Q \in \mathcal{Q}} G(Q,\theta)$ is to set $Q_{i,y} = P_\theta[Y=y \mid X=x_i]$. Indeed, by Jensen's inequality, for any $Q \in \mathcal{Q}$ we have that
$$
\begin{aligned}
G(Q,\theta) &= \sum_{i=1}^{m} \sum_{y=1}^{k} Q_{i,y} \log\!\left(\frac{P_\theta[X=x_i,\, Y=y]}{Q_{i,y}}\right) \\
&\le \sum_{i=1}^{m} \log\!\left(\sum_{y=1}^{k} Q_{i,y}\, \frac{P_\theta[X=x_i,\, Y=y]}{Q_{i,y}}\right) \\
&= \sum_{i=1}^{m} \log\!\left(\sum_{y=1}^{k} P_\theta[X=x_i,\, Y=y]\right) \\
&= \sum_{i=1}^{m} \log P_\theta[X=x_i] \;=\; L(\theta),
\end{aligned}
$$
while for $Q_{i,y} = P_\theta[Y=y \mid X=x_i]$ we have
$$
\begin{aligned}
G(Q,\theta) &= \sum_{i=1}^{m} \sum_{y=1}^{k} P_\theta[Y=y \mid X=x_i] \log\!\left(\frac{P_\theta[X=x_i,\, Y=y]}{P_\theta[Y=y \mid X=x_i]}\right) \\
&= \sum_{i=1}^{m} \sum_{y=1}^{k} P_\theta[Y=y \mid X=x_i] \log P_\theta[X=x_i] \\
&= \sum_{i=1}^{m} \log P_\theta[X=x_i] \sum_{y=1}^{k} P_\theta[Y=y \mid X=x_i] \\
&= \sum_{i=1}^{m} \log P_\theta[X=x_i] \;=\; L(\theta).
\end{aligned}
$$
This shows that setting $Q_{i,y} = P_\theta[Y=y \mid X=x_i]$ maximizes $G(Q,\theta)$ over $Q \in \mathcal{Q}$ and shows that $G(Q^{(t+1)},\theta^{(t)}) = L(\theta^{(t)})$.
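To make the preceding equality and inequality concrete, here is a minimal numerical sketch (not part of the book) on a toy one-dimensional mixture of two unit-variance Gaussians: it checks that the posterior choice $Q_{i,y} = P_\theta[Y=y \mid X=x_i]$ attains $G(Q,\theta) = L(\theta)$, while an arbitrary row-stochastic $Q$ gives a value no larger than $L(\theta)$. All names (`joint`, `G`, `Q_post`, `Q_rand`) are illustrative choices, not the book's notation.

```python
# Numerical sketch of the lemma on a toy 1-D mixture of two unit-variance Gaussians:
# the posterior Q attains G(Q, theta) = L(theta); any other Q in the simplex does not exceed it.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(size=10)                      # toy sample x_1, ..., x_m
c = np.array([0.3, 0.7])                     # mixing weights c_y
mu = np.array([-1.0, 2.0])                   # component means mu_y

# joint[i, y] = P_theta[X = x_i, Y = y] (unit-variance components for simplicity)
joint = c * norm.pdf(x[:, None], loc=mu, scale=1.0)

def G(Q):
    """G(Q, theta) = sum_{i,y} Q_{i,y} log(P_theta[X=x_i, Y=y] / Q_{i,y})."""
    return np.sum(Q * (np.log(joint) - np.log(Q)))

L = np.sum(np.log(joint.sum(axis=1)))        # log-likelihood L(theta)

Q_post = joint / joint.sum(axis=1, keepdims=True)    # posterior P_theta[Y=y | X=x_i]
Q_rand = rng.dirichlet(np.ones(2), size=len(x))      # some other row-stochastic Q

print(np.isclose(G(Q_post), L))              # True: the equality case of the lemma
print(G(Q_rand) <= L)                        # True: Jensen's inequality
```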
The preceding lemma immediately implies:
Theorem 24.3. The EM procedure never decreases the log-likelihood; namely, for all $t$,
$$L(\theta^{(t+1)}) \ge L(\theta^{(t)}).$$
Proof. By the lemma we have
$$L(\theta^{(t+1)}) = G(Q^{(t+2)},\theta^{(t+1)}) \ge G(Q^{(t+1)},\theta^{(t)}) = L(\theta^{(t)}),$$
where the two equalities follow from the lemma, and the inequality holds because $Q^{(t+2)}$ maximizes $G(Q,\theta^{(t+1)})$ over $Q$ and $\theta^{(t+1)}$ maximizes $G(Q^{(t+1)},\theta)$ over $\theta$.
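As an illustration of Theorem 24.3 (and a preview of the next subsection), below is a minimal sketch, not the book's code, of an EM loop for a toy one-dimensional mixture of two unit-variance Gaussians in which only the mixing weights $c_y$ and the means $\mu_y$ are updated; it records $L(\theta^{(t)})$ at each iteration and verifies that the sequence never decreases. The update rules used here are the standard soft k-means / GMM responsibilities and weighted averages, which may differ in details from the parameterization used in Section 24.4.2.

```python
# Sketch: run EM on a toy 1-D mixture of two unit-variance Gaussians and check
# that the log-likelihood sequence L(theta^{(t)}) is non-decreasing (Theorem 24.3).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])   # toy data
c, mu = np.array([0.5, 0.5]), np.array([-1.0, 1.0])                    # theta^{(0)}

log_likelihoods = []
for t in range(50):
    joint = c * norm.pdf(x[:, None], loc=mu, scale=1.0)   # P_theta[X=x_i, Y=y]
    log_likelihoods.append(np.sum(np.log(joint.sum(axis=1))))

    # E-step: Q_{i,y} = P_theta[Y=y | X=x_i]
    Q = joint / joint.sum(axis=1, keepdims=True)

    # M-step: maximize G(Q, theta) over theta (weights and means only)
    c = Q.mean(axis=0)
    mu = (Q * x[:, None]).sum(axis=0) / Q.sum(axis=0)

# Theorem 24.3: the sequence L(theta^{(t)}) never decreases (up to rounding error)
print(np.all(np.diff(log_likelihoods) >= -1e-10))   # True
```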
24.4.2 EM for Mixture of Gaussians (Soft k-Means)
Consider the case of a mixture of $k$ Gaussians in which $\theta$ is a triplet $(\mathbf{c}, \{\boldsymbol{\mu}_1,\ldots,\boldsymbol{\mu}_k\}, \{\Sigma_1,\ldots,\Sigma_k\})$, where $P_\theta[Y=y] = c_y$ and $P_\theta[X=x \mid Y=y]$ is as given