Therefore, we only need to show that for any $\theta$, the solution of $\operatorname{argmax}_{Q\in\mathcal{Q}} G(Q,\theta)$ is to set $Q_{i,y} = P_\theta[Y=y \mid X=x_i]$. Indeed, by Jensen's inequality, for any $Q \in \mathcal{Q}$ we have that
\begin{align*}
G(Q,\theta) &= \sum_{i=1}^{m} \sum_{y=1}^{k} Q_{i,y} \log\!\left(\frac{P_\theta[X=x_i, Y=y]}{Q_{i,y}}\right) \\
 &\le \sum_{i=1}^{m} \log\!\left(\sum_{y=1}^{k} Q_{i,y}\,\frac{P_\theta[X=x_i, Y=y]}{Q_{i,y}}\right) \\
 &= \sum_{i=1}^{m} \log\!\left(\sum_{y=1}^{k} P_\theta[X=x_i, Y=y]\right) \\
 &= \sum_{i=1}^{m} \log P_\theta[X=x_i] \;=\; L(\theta),
\end{align*}
while for $Q_{i,y} = P_\theta[Y=y \mid X=x_i]$ we have
\begin{align*}
G(Q,\theta) &= \sum_{i=1}^{m} \sum_{y=1}^{k} P_\theta[Y=y \mid X=x_i] \log\!\left(\frac{P_\theta[X=x_i, Y=y]}{P_\theta[Y=y \mid X=x_i]}\right) \\
 &= \sum_{i=1}^{m} \sum_{y=1}^{k} P_\theta[Y=y \mid X=x_i] \log P_\theta[X=x_i] \\
 &= \sum_{i=1}^{m} \log P_\theta[X=x_i] \sum_{y=1}^{k} P_\theta[Y=y \mid X=x_i] \\
 &= \sum_{i=1}^{m} \log P_\theta[X=x_i] \;=\; L(\theta).
\end{align*}
This shows that setting $Q_{i,y} = P_\theta[Y=y \mid X=x_i]$ maximizes $G(Q,\theta)$ over $Q \in \mathcal{Q}$ and shows that $G(Q^{(t+1)},\theta^{(t)}) = L(\theta^{(t)})$.
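To make the lemma concrete, here is a minimal numerical sketch (not from the text) that checks both claims on a toy categorical latent-variable model: with the posterior choice $Q_{i,y}=P_\theta[Y=y\mid X=x_i]$ the value of $G(Q,\theta)$ coincides with the log-likelihood $L(\theta)$, while an arbitrary $Q\in\mathcal{Q}$ gives a value that is no larger. The model, parameter values, and variable names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent-variable model (assumed for illustration):
# Y ~ c over k classes, X | Y=y ~ categorical over d symbols.
k, d, m = 3, 4, 6
c = np.array([0.5, 0.3, 0.2])                  # P_theta[Y = y]
A = rng.dirichlet(np.ones(d), size=k)          # A[y, x] = P_theta[X = x | Y = y]
x = rng.integers(0, d, size=m)                 # observed sample x_1, ..., x_m

joint = c[None, :] * A[:, x].T                 # joint[i, y] = P_theta[X = x_i, Y = y]
px = joint.sum(axis=1)                         # P_theta[X = x_i]
L = np.log(px).sum()                           # log-likelihood L(theta)

def G(Q):
    """G(Q, theta) = sum_i sum_y Q[i, y] * log(P_theta[X = x_i, Y = y] / Q[i, y])."""
    return np.sum(Q * (np.log(joint) - np.log(Q)))

Q_post = joint / px[:, None]                   # posterior: Q[i, y] = P_theta[Y = y | X = x_i]
Q_rand = rng.dirichlet(np.ones(k), size=m)     # arbitrary member of Q (rows sum to 1)

assert np.isclose(G(Q_post), L)                # equality at the posterior
assert G(Q_rand) <= L + 1e-12                  # Jensen's inequality: G(Q, theta) <= L(theta)
print(G(Q_post), G(Q_rand), L)
```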
                    The preceding lemma immediately implies:
Theorem 24.3. The EM procedure never decreases the log-likelihood; namely, for all $t$,
\[
L(\theta^{(t+1)}) \;\ge\; L(\theta^{(t)}).
\]

Proof. By the lemma we have
\[
L(\theta^{(t+1)}) \;=\; G(Q^{(t+2)},\theta^{(t+1)}) \;\ge\; G(Q^{(t+1)},\theta^{(t)}) \;=\; L(\theta^{(t)}),
\]
where the two equalities follow from the lemma, and the inequality holds because $\theta^{(t+1)}$ maximizes $G(Q^{(t+1)},\theta)$ over $\theta$ and $Q^{(t+2)}$ maximizes $G(Q,\theta^{(t+1)})$ over $Q \in \mathcal{Q}$, so that $G(Q^{(t+2)},\theta^{(t+1)}) \ge G(Q^{(t+1)},\theta^{(t+1)}) \ge G(Q^{(t+1)},\theta^{(t)})$.
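As a quick sanity check of the theorem (not part of the text), the following Python sketch runs EM on a toy one-dimensional mixture of two unit-variance Gaussians and asserts that the log-likelihood never decreases across iterations. The data, the initialization, and the unit-variance assumption are illustrative; the update formulas are the standard E- and M-steps for this special case, corresponding to the "soft k-means" setting discussed in the next subsection.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative 1-D sample (any sample works for checking monotonicity).
x = np.concatenate([rng.normal(-2.0, 1.0, 80), rng.normal(3.0, 1.0, 120)])
m, k = len(x), 2

# Mixture of k unit-variance Gaussians: theta = (c, mu),
# P_theta[Y = y] = c_y and X | Y = y ~ N(mu_y, 1).  Initial values are arbitrary.
c = np.full(k, 1.0 / k)
mu = np.array([-1.0, 1.0])

def log_likelihood(c, mu):
    # L(theta) = sum_i log sum_y c_y * N(x_i; mu_y, 1)
    logp = -0.5 * (x[:, None] - mu[None, :]) ** 2 - 0.5 * np.log(2 * np.pi)
    return np.log(np.exp(logp) @ c).sum()

prev = -np.inf
for t in range(50):
    # E step: Q[i, y] = P_theta[Y = y | X = x_i]
    joint = c[None, :] * np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2) / np.sqrt(2 * np.pi)
    Q = joint / joint.sum(axis=1, keepdims=True)
    # M step (standard updates for this special case): mixing weights and means
    c = Q.mean(axis=0)
    mu = (Q * x[:, None]).sum(axis=0) / Q.sum(axis=0)
    cur = log_likelihood(c, mu)
    assert cur >= prev - 1e-9, "log-likelihood decreased"
    prev = cur

print("final c:", c, "final mu:", mu, "final L:", prev)
```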



                 24.4.2 EM for Mixture of Gaussians (Soft k-Means)

Consider the case of a mixture of $k$ Gaussians in which $\theta$ is a triplet $(c,\{\mu_1,\ldots,\mu_k\},\{\Sigma_1,\ldots,\Sigma_k\})$, where $P_\theta[Y=y] = c_y$ and $P_\theta[X=x \mid Y=y]$ is as given