

              This ratio is often called the log-likelihood ratio.
                 In our case, the log-likelihood ratio becomes

$$\frac{1}{2}(x-\mu_0)^T \Sigma^{-1}(x-\mu_0) \;-\; \frac{1}{2}(x-\mu_1)^T \Sigma^{-1}(x-\mu_1).$$
              We can rewrite this as ⟨w, x⟩ + b where
$$w = (\mu_1 - \mu_0)^T \Sigma^{-1} \quad\text{and}\quad b = \frac{1}{2}\left(\mu_0^T \Sigma^{-1}\mu_0 - \mu_1^T \Sigma^{-1}\mu_1\right). \tag{24.8}$$
                 As a result of the preceding derivation we obtain that under the aforementioned
              generative assumptions, the Bayes optimal classifier is a linear classifier. Addition-
              ally, one may train the classifier by estimating the parameters µ_0, µ_1, and Σ from the
              data, using, for example, the maximum likelihood estimator. With those estimators
              at hand, the values of w and b can be calculated as in Equation (24.8).
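              To make this last step concrete, here is a minimal NumPy sketch (our illustration, not code from the book; the function and variable names are ours) that estimates µ_0, µ_1, and the shared covariance matrix Σ by maximum likelihood from labeled samples and plugs them into Equation (24.8):

```python
import numpy as np

def fit_linear_gaussian_classifier(X0, X1):
    """X0, X1: arrays of shape (m0, d) and (m1, d) holding the examples of each class."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Maximum likelihood estimate of the shared covariance matrix:
    # pool the centered examples of both classes.
    centered = np.vstack([X0 - mu0, X1 - mu1])
    Sigma = centered.T @ centered / centered.shape[0]
    Sigma_inv = np.linalg.inv(Sigma)
    # Equation (24.8): w = (mu1 - mu0)^T Sigma^{-1} and
    # b = (1/2) (mu0^T Sigma^{-1} mu0 - mu1^T Sigma^{-1} mu1).
    # Sigma^{-1} is symmetric, so Sigma_inv @ (mu1 - mu0) equals (mu1 - mu0)^T Sigma^{-1}.
    w = Sigma_inv @ (mu1 - mu0)
    b = 0.5 * (mu0 @ Sigma_inv @ mu0 - mu1 @ Sigma_inv @ mu1)
    return w, b

# The resulting classifier predicts label 1 iff <w, x> + b > 0.
```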
              24.4 LATENT VARIABLES AND THE EM ALGORITHM
              In generative models we assume that the data is generated by sampling from a spe-
              cific parametric distribution over our instance space X. Sometimes, it is convenient
              to express this distribution using latent random variables. A natural example is a
              mixture of k Gaussian distributions. That is, X = R^d and we assume that each x is
              generated as follows. First, we choose a random number in {1,...,k}. Let Y be a
              random variable corresponding to this choice, and denote P[Y = y] = c_y. Second,
              we choose x on the basis of the value of Y according to a Gaussian distribution

                                        1            1        T  −1
                  P[X = x|Y = y] =    d/2   1/2  exp − (x − µ )   y  (x − µ ) .  (24.9)
                                                                        y
                                                            y
                                  (2π)  |  y |       2
              Therefore, the density of X can be written as:
$$P[X = x] = \sum_{y=1}^{k} P[Y = y]\, P[X = x \mid Y = y] = \sum_{y=1}^{k} c_y\, \frac{1}{(2\pi)^{d/2}\,|\Sigma_y|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_y)^T \Sigma_y^{-1}(x-\mu_y)\right).$$
              Note that Y is a hidden variable that we do not observe in our data. Nevertheless,
              we introduce Y since it helps us describe a simple parametric form of the probability
              of X.
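              The two-step generative process and the resulting mixture density can be written down directly. The following sketch (illustrative parameter values and names of our own choosing, not from the text) draws a point exactly this way, for k = 2 components in the plane, and evaluates the density above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Example parameters for k = 2 components in d = 2 dimensions (arbitrary values).
c = np.array([0.3, 0.7])                         # c_y = P[Y = y]
mus = np.array([[0.0, 0.0], [3.0, 3.0]])         # mu_y
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2)])  # Sigma_y

def sample():
    """First draw Y according to c, then draw x from N(mu_Y, Sigma_Y)."""
    y = rng.choice(len(c), p=c)
    return rng.multivariate_normal(mus[y], Sigmas[y]), y

def density(x):
    """P[X = x] = sum_y c_y * N(x; mu_y, Sigma_y)."""
    d = x.shape[0]
    total = 0.0
    for y in range(len(c)):
        diff = x - mus[y]
        quad = diff @ np.linalg.inv(Sigmas[y]) @ diff
        norm = (2 * np.pi) ** (d / 2) * np.linalg.det(Sigmas[y]) ** 0.5
        total += c[y] * np.exp(-0.5 * quad) / norm
    return total

x, y = sample()
print(y, density(x))
```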
                 More generally, let θ be the parameters of the joint distribution of X and Y (e.g.,
              in the preceding example, θ consists of c_y, µ_y, and Σ_y, for all y = 1,...,k). Then, the
              log-likelihood of an observation x can be written as
$$\log P_\theta[X = x] = \log\left(\sum_{y=1}^{k} P_\theta[X = x, Y = y]\right).$$
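              This quantity can be computed directly by summing over the k possible values of Y. The sketch below (our addition, reusing the parameter names c, mus, Sigmas from the previous snippet) does the sum in log space with a log-sum-exp, a standard numerical device to avoid underflow when the joint probabilities P_θ[X = x, Y = y] are tiny:

```python
import numpy as np

def log_likelihood(x, c, mus, Sigmas):
    """log P_theta[X = x] for the Gaussian mixture with parameters theta = (c, mus, Sigmas)."""
    d = x.shape[0]
    log_joint = np.empty(len(c))  # log P_theta[X = x, Y = y] for each y
    for y in range(len(c)):
        diff = x - mus[y]
        quad = diff @ np.linalg.inv(Sigmas[y]) @ diff
        log_norm = 0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(Sigmas[y])))
        log_joint[y] = np.log(c[y]) - 0.5 * quad - log_norm
    # log-sum-exp: log P_theta[X = x] = log sum_y exp(log P_theta[X = x, Y = y]),
    # computed by factoring out the maximum term to avoid underflow.
    m = log_joint.max()
    return m + np.log(np.exp(log_joint - m).sum())
```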