therefore alternates between finding θ given Q and finding Q given θ. Formally,
EM finds a sequence of solutions $(Q^{(1)},\theta^{(1)}), (Q^{(2)},\theta^{(2)}),\ldots$ where at iteration $t$, we
construct $(Q^{(t+1)},\theta^{(t+1)})$ by performing two steps.
Expectation Step: Set
$$Q^{(t+1)}_{i,y} = P_{\theta^{(t)}}[Y = y \mid X = x_i]. \tag{24.10}$$
This step is called the Expectation step, because it yields a new probability over
the latent variables, which defines a new expected log-likelihood function over θ.
Maximization Step: Set $\theta^{(t+1)}$ to be the maximizer of the expected log-likelihood, where the expectation is according to $Q^{(t+1)}$:
$$\theta^{(t+1)} = \operatorname*{argmax}_{\theta} F(Q^{(t+1)},\theta). \tag{24.11}$$
By our assumption, it is possible to solve this optimization problem efficiently.
The initial values of $\theta^{(1)}$ and $Q^{(1)}$ are usually chosen at random and the
procedure terminates after the improvement in the likelihood value stops being
significant.
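To make the two steps concrete, here is a minimal sketch of the procedure for a mixture of $k$ unit-variance Gaussians on the real line, a model for which the Maximization step can be solved in closed form. The function name, the unit-variance restriction, and the tolerance-based stopping rule are illustrative assumptions, not code from the text.

```python
# A minimal sketch of the EM iterations (24.10)-(24.11) for a mixture of k
# unit-variance Gaussians in one dimension (an illustrative model choice).
import numpy as np

def em_gaussian_mixture(x, k, n_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Random initialization of theta^(1) = (mixing weights c, means mu).
    c = np.full(k, 1.0 / k)
    mu = rng.choice(x, size=k, replace=False)
    prev_ll = -np.inf
    for _ in range(n_iters):
        # Expectation step (24.10): Q[i, y] = P_theta[Y = y | X = x_i].
        # The additive constant -0.5*log(2*pi) is dropped; it cancels in the
        # posterior and does not affect the updates or the stopping rule.
        log_joint = np.log(c) - 0.5 * (x[:, None] - mu[None, :]) ** 2
        log_marginal = np.logaddexp.reduce(log_joint, axis=1)
        Q = np.exp(log_joint - log_marginal[:, None])
        # Maximization step (24.11): for this model, the maximizer of the
        # expected log-likelihood F(Q, theta) over theta is in closed form.
        c = Q.mean(axis=0)
        mu = (Q * x[:, None]).sum(axis=0) / Q.sum(axis=0)
        # Stop once the improvement in the log-likelihood is insignificant.
        ll = log_marginal.sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return c, mu
```

The rows of Q are exactly the posteriors of (24.10), and the updates of c and mu maximize the expected log-likelihood $F(Q^{(t+1)},\theta)$ for this particular model; for other models the Maximization step is whatever efficient solver the assumption above provides.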
24.4.1 EM as an Alternate Maximization Algorithm
To analyze the EM algorithm, we first view it as an alternate maximization
algorithm. Define the following objective function
$$G(Q,\theta) = F(Q,\theta) - \sum_{i=1}^{m}\sum_{y=1}^{k} Q_{i,y}\log(Q_{i,y}).$$
The second term is the sum of the entropies of the rows of $Q$. Let
$$\mathcal{Q} = \left\{ Q \in [0,1]^{m,k} \;:\; \forall i,\ \sum_{y=1}^{k} Q_{i,y} = 1 \right\}$$
be the set of matrices whose rows define probabilities over [k]. The following lemma
shows that EM performs alternate maximization iterations for maximizing G.
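As a quick numerical illustration of this claim (not a proof), one can fix $\theta$, form the posterior matrix of (24.10), and check that it attains a larger value of $G$ than an arbitrary member of $\mathcal{Q}$ while matching the log-likelihood $L(\theta)$. The sketch below assumes, as earlier in the section, that $F(Q,\theta) = \sum_{i}\sum_{y} Q_{i,y}\log(P_\theta[X=x_i, Y=y])$, and reuses the unit-variance Gaussian mixture from the previous sketch; the data and parameter values are arbitrary.

```python
# Numerical check: with theta fixed, the posterior matrix of (24.10)
# attains G(Q, theta) = L(theta) and beats an arbitrary feasible Q.
# Data, parameter values, and variable names are illustrative.
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 50), rng.normal(3.0, 1.0, 50)])
c, mu = np.array([0.5, 0.5]), np.array([-1.0, 1.0])       # some fixed theta

# log P_theta[X = x_i, Y = y] for the unit-variance Gaussian mixture.
log_joint = (np.log(c) - 0.5 * (x[:, None] - mu[None, :]) ** 2
             - 0.5 * np.log(2 * np.pi))
L = np.logaddexp.reduce(log_joint, axis=1).sum()          # log-likelihood L(theta)

def G(Q):
    F = (Q * log_joint).sum()          # expected log-likelihood F(Q, theta)
    entropy = -(Q * np.log(Q)).sum()   # sum of the entropies of the rows of Q
    return F + entropy

Q_post = np.exp(log_joint - np.logaddexp.reduce(log_joint, axis=1)[:, None])
Q_rand = rng.dirichlet(np.ones(2), size=len(x))           # arbitrary rows in the simplex
print(np.isclose(G(Q_post), L), G(Q_post) >= G(Q_rand))   # prints: True True
```

The equality observed for the posterior matrix is precisely the "Furthermore" identity in the lemma.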
Lemma 24.2. The EM procedure can be rewritten as
$$Q^{(t+1)} = \operatorname*{argmax}_{Q \in \mathcal{Q}} G(Q,\theta^{(t)})$$
$$\theta^{(t+1)} = \operatorname*{argmax}_{\theta} G(Q^{(t+1)},\theta).$$
Furthermore, $G(Q^{(t+1)},\theta^{(t)}) = L(\theta^{(t)})$.
Proof. Given $Q^{(t+1)}$ we clearly have that
$$\operatorname*{argmax}_{\theta} G(Q^{(t+1)},\theta) = \operatorname*{argmax}_{\theta} F(Q^{(t+1)},\theta).$$