
24.4 Latent Variables and the EM Algorithm


therefore alternates between finding θ given Q and finding Q given θ. Formally,
EM finds a sequence of solutions $(Q^{(1)},\theta^{(1)}), (Q^{(2)},\theta^{(2)}), \ldots$ where at iteration $t$, we
construct $(Q^{(t+1)},\theta^{(t+1)})$ by performing two steps.
                Expectation Step: Set

   $$Q^{(t+1)}_{i,y} = P_{\theta^{(t)}}[Y = y \mid X = x_i]. \qquad (24.10)$$
                 This step is called the Expectation step, because it yields a new probability over
                 the latent variables, which defines a new expected log-likelihood function over θ.
  Maximization Step: Set $\theta^{(t+1)}$ to be the maximizer of the expected log-
  likelihood, where the expectation is according to $Q^{(t+1)}$:

   $$\theta^{(t+1)} = \operatorname*{argmax}_{\theta} F(Q^{(t+1)}, \theta). \qquad (24.11)$$
                 By our assumption, it is possible to solve this optimization problem efficiently.
  The initial values of $\theta^{(1)}$ and $Q^{(1)}$ are usually chosen at random and the
procedure terminates after the improvement in the likelihood value stops being
significant.
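To make the two steps concrete, the following is a minimal sketch of the EM iterations for a mixture of $k$ unit-variance Gaussians in one dimension, the running example of this chapter. The function name, the random initialization scheme, and the stopping tolerance are illustrative assumptions rather than part of the text.

```python
import numpy as np

def em_gaussian_mixture(x, k, n_iter=100, tol=1e-6, seed=0):
    """Sketch of EM for a 1-D mixture of k unit-variance Gaussians.

    theta consists of the mixing weights c[y] and the means mu[y];
    Q[i, y] plays the role of P_theta[Y = y | X = x_i] from Eq. (24.10).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    m = len(x)

    # theta^(1): chosen at random, as described in the text
    c = np.full(k, 1.0 / k)
    mu = rng.choice(x, size=k, replace=False)

    prev_ll = -np.inf
    for _ in range(n_iter):
        # Expectation step: Q[i, y] = P_theta[Y = y | X = x_i]   (Eq. 24.10)
        log_joint = np.log(c) - 0.5 * (x[:, None] - mu[None, :]) ** 2
        log_marginal = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
        Q = np.exp(log_joint - log_marginal)        # each row sums to 1

        # Maximization step: theta^(t+1) maximizes the expected
        # log-likelihood (Eq. 24.11); for this model it has a closed form.
        Nk = Q.sum(axis=0)
        c = Nk / m
        mu = (Q * x[:, None]).sum(axis=0) / Nk

        # stop once the improvement in the log-likelihood is insignificant
        ll = log_marginal.sum()                     # up to an additive constant
        if ll - prev_ll < tol:
            break
        prev_ll = ll

    return c, mu, Q
```

Working with log-probabilities and logaddexp keeps the E-step numerically stable; for this particular model, normalizing the rows of the joint probabilities gives exactly the posteriors of Equation (24.10).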


              24.4.1 EM as an Alternate Maximization Algorithm

              To analyze the EM algorithm, we first view it as an alternate maximization
              algorithm. Define the following objective function

   $$G(Q,\theta) \;=\; F(Q,\theta) \;-\; \sum_{i=1}^{m}\sum_{y=1}^{k} Q_{i,y}\,\log(Q_{i,y}).$$
The second term is the sum of the entropies of the rows of Q.
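As a small numerical illustration of this objective, the sketch below evaluates $G$ given $Q$ and the matrix of log-joint probabilities $\log P_\theta[X = x_i, Y = y]$, with $F$ taken to be the expected log-likelihood $\sum_{i}\sum_{y} Q_{i,y}\log P_\theta[X = x_i, Y = y]$ used earlier in this section; the function and argument names are assumptions for illustration.

```python
import numpy as np

def G_value(Q, log_joint):
    """Evaluate G(Q, theta) = F(Q, theta) - sum_{i,y} Q[i,y] * log(Q[i,y]).

    Q         : (m, k) matrix whose rows are distributions over [k]
    log_joint : (m, k) matrix with entries log P_theta[X = x_i, Y = y];
                theta enters only through this matrix
    """
    # F(Q, theta): the expected log-likelihood under the rows of Q
    F = np.sum(Q * log_joint)
    # entropy of the rows of Q, using the convention 0 * log 0 = 0
    entropy = -np.sum(Q * np.log(np.where(Q > 0.0, Q, 1.0)))
    return F + entropy
```

Writing $G$ this way makes the alternate maximization structure of the lemma below easy to see: for a fixed $\theta$ (a fixed log_joint) only $Q$ varies, while for a fixed $Q$ only the $F$ term depends on $\theta$.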
                                                               
Let

   $$\mathcal{Q} \;=\; \left\{\, Q \in [0,1]^{m,k} \;:\; \forall i,\ \sum_{y=1}^{k} Q_{i,y} = 1 \,\right\}$$
              be the set of matrices whose rows define probabilities over [k]. The following lemma
              shows that EM performs alternate maximization iterations for maximizing G.
              Lemma 24.2. The EM procedure can be rewritten as

   $$Q^{(t+1)} = \operatorname*{argmax}_{Q \in \mathcal{Q}} G(Q, \theta^{(t)})$$
   $$\theta^{(t+1)} = \operatorname*{argmax}_{\theta} G(Q^{(t+1)}, \theta).$$

Furthermore, $G(Q^{(t+1)}, \theta^{(t)}) = L(\theta^{(t)})$.
Proof. Given $Q^{(t+1)}$ we clearly have that

   $$\operatorname*{argmax}_{\theta} G(Q^{(t+1)}, \theta) = \operatorname*{argmax}_{\theta} F(Q^{(t+1)}, \theta).$$