
24.4 Latent Variables and the EM Algorithm


therefore alternates between finding θ given Q and finding Q given θ. Formally,
EM finds a sequence of solutions $(Q^{(1)},\theta^{(1)}), (Q^{(2)},\theta^{(2)}), \ldots$ where at iteration $t$, we
construct $(Q^{(t+1)},\theta^{(t+1)})$ by performing two steps.
                Expectation Step: Set

   $$Q^{(t+1)}_{i,y} = P_{\theta^{(t)}}[Y = y \mid X = x_i]. \qquad (24.10)$$
                 This step is called the Expectation step, because it yields a new probability over
                 the latent variables, which defines a new expected log-likelihood function over θ.
  Maximization Step: Set $\theta^{(t+1)}$ to be the maximizer of the expected log-
  likelihood, where the expectation is according to $Q^{(t+1)}$:

   $$\theta^{(t+1)} = \operatorname*{argmax}_{\theta} F(Q^{(t+1)}, \theta). \qquad (24.11)$$
                 By our assumption, it is possible to solve this optimization problem efficiently.
  The initial values of $\theta^{(1)}$ and $Q^{(1)}$ are usually chosen at random and the
procedure terminates after the improvement in the likelihood value stops being
significant.
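To make the two steps concrete, the following is a minimal sketch of the EM iterations for a mixture of $k$ unit-variance Gaussians in one dimension, the running example of this chapter. The function name, the random initialization scheme, and the stopping tolerance are illustrative assumptions rather than part of the text.

```python
import numpy as np

def em_gaussian_mixture(x, k, n_iter=100, tol=1e-6, seed=0):
    """Sketch of EM for a 1-D mixture of k unit-variance Gaussians.

    theta consists of the mixing weights c[y] and the means mu[y];
    Q[i, y] plays the role of P_theta[Y = y | X = x_i] from Eq. (24.10).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    m = len(x)

    # theta^(1): chosen at random, as described in the text
    c = np.full(k, 1.0 / k)
    mu = rng.choice(x, size=k, replace=False)

    prev_ll = -np.inf
    for _ in range(n_iter):
        # Expectation step: Q[i, y] = P_theta[Y = y | X = x_i]   (Eq. 24.10)
        log_joint = np.log(c) - 0.5 * (x[:, None] - mu[None, :]) ** 2
        log_marginal = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
        Q = np.exp(log_joint - log_marginal)        # each row sums to 1

        # Maximization step: theta^(t+1) maximizes the expected
        # log-likelihood (Eq. 24.11); for this model it has a closed form.
        Nk = Q.sum(axis=0)
        c = Nk / m
        mu = (Q * x[:, None]).sum(axis=0) / Nk

        # stop once the improvement in the log-likelihood is insignificant
        ll = log_marginal.sum()                     # up to an additive constant
        if ll - prev_ll < tol:
            break
        prev_ll = ll

    return c, mu, Q
```

Working with log-probabilities and logaddexp keeps the E-step numerically stable; for this particular model, normalizing the rows of the joint probabilities gives exactly the posteriors of Equation (24.10).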


              24.4.1 EM as an Alternate Maximization Algorithm

              To analyze the EM algorithm, we first view it as an alternate maximization
              algorithm. Define the following objective function

   $$G(Q,\theta) \;=\; F(Q,\theta) \;-\; \sum_{i=1}^{m}\sum_{y=1}^{k} Q_{i,y}\,\log(Q_{i,y}).$$
The second term is the sum of the entropies of the rows of Q.
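As a small numerical illustration of this objective, the sketch below evaluates $G$ given $Q$ and the matrix of log-joint probabilities $\log P_\theta[X = x_i, Y = y]$, with $F$ taken to be the expected log-likelihood $\sum_{i}\sum_{y} Q_{i,y}\log P_\theta[X = x_i, Y = y]$ used earlier in this section; the function and argument names are assumptions for illustration.

```python
import numpy as np

def G_value(Q, log_joint):
    """Evaluate G(Q, theta) = F(Q, theta) - sum_{i,y} Q[i,y] * log(Q[i,y]).

    Q         : (m, k) matrix whose rows are distributions over [k]
    log_joint : (m, k) matrix with entries log P_theta[X = x_i, Y = y];
                theta enters only through this matrix
    """
    # F(Q, theta): the expected log-likelihood under the rows of Q
    F = np.sum(Q * log_joint)
    # entropy of the rows of Q, using the convention 0 * log 0 = 0
    entropy = -np.sum(Q * np.log(np.where(Q > 0.0, Q, 1.0)))
    return F + entropy
```

Writing $G$ this way makes the alternate maximization structure of the lemma below easy to see: for a fixed $\theta$ (a fixed log_joint) only $Q$ varies, while for a fixed $Q$ only the $F$ term depends on $\theta$.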
                                                               
Let

   $$\mathcal{Q} \;=\; \left\{\, Q \in [0,1]^{m,k} \;:\; \forall i,\ \sum_{y=1}^{k} Q_{i,y} = 1 \,\right\}$$
              be the set of matrices whose rows define probabilities over [k]. The following lemma
              shows that EM performs alternate maximization iterations for maximizing G.
              Lemma 24.2. The EM procedure can be rewritten as

   $$Q^{(t+1)} = \operatorname*{argmax}_{Q \in \mathcal{Q}} G(Q, \theta^{(t)})$$
   $$\theta^{(t+1)} = \operatorname*{argmax}_{\theta} G(Q^{(t+1)}, \theta).$$

Furthermore, $G(Q^{(t+1)}, \theta^{(t)}) = L(\theta^{(t)})$.
Proof. Given $Q^{(t+1)}$ we clearly have that

   $$\operatorname*{argmax}_{\theta} G(Q^{(t+1)}, \theta) = \operatorname*{argmax}_{\theta} F(Q^{(t+1)}, \theta).$$