To describe the probability function $P[Y = y \mid X = x]$ we need $2^d$ parameters, each of which corresponds to $P[Y = 1 \mid X = x]$ for a certain value of $x \in \{0,1\}^d$. This implies that the number of examples we need grows exponentially with the number of features.
                    In the Naive Bayes approach we make the (rather naive) generative assumption
                 that given the label, the features are independent of each other. That is,

\[
P[X = x \mid Y = y] \;=\; \prod_{i=1}^{d} P[X_i = x_i \mid Y = y].
\]
                 With this assumption and using the Bayes rule, the Bayes optimal classifier can be
                 further simplified:

\begin{align}
h_{\mathrm{Bayes}}(x) &= \operatorname*{argmax}_{y \in \{0,1\}} P[Y = y \mid X = x] \nonumber \\
&= \operatorname*{argmax}_{y \in \{0,1\}} P[Y = y]\, P[X = x \mid Y = y] \nonumber \\
&= \operatorname*{argmax}_{y \in \{0,1\}} P[Y = y] \prod_{i=1}^{d} P[X_i = x_i \mid Y = y]. \tag{24.7}
\end{align}
That is, now the number of parameters we need to estimate is only $2d + 1$. Here, the generative assumption we made significantly reduced the number of parameters we need to learn.
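For a sense of scale, take a hypothetical dimension of $d = 30$ binary features (the specific number is an illustration, not taken from the text):
\[
2^{d} = 2^{30} \approx 1.07 \times 10^{9} \ \text{ parameters for the full table,}
\qquad \text{versus} \qquad
2d + 1 = 61 \ \text{ under the Naive Bayes assumption.}
\]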
                    When we also estimate the parameters using the maximum likelihood principle,
                 the resulting classifier is called the Naive Bayes classifier.
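The following is a minimal sketch of such a classifier for binary features. The class prior and the conditional probabilities are estimated from empirical counts (maximum likelihood); a small Laplace-smoothing constant is added so that feature values never seen for a class do not yield zero probabilities. The class name, the smoothing constant, and the synthetic data are illustrative choices, not part of the text.

import numpy as np


class NaiveBayesBinary:
    def fit(self, X, y, alpha=1.0):
        """X: (m, d) array of {0,1} features, y: (m,) array of {0,1} labels."""
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        # Estimate P[Y = 1]: the class prior (1 parameter).
        self.p_y1 = y.mean()
        # Estimate P[X_i = 1 | Y = y] for each feature and each label (2d parameters).
        # alpha adds Laplace smoothing so empty counts do not give zero probabilities.
        self.p_xi1 = np.array([
            (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
            for c in (0, 1)
        ])  # shape (2, d)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        log_prior = np.log([1 - self.p_y1, self.p_y1])
        # log P[Y = y] + sum_i log P[X_i = x_i | Y = y], computed for y = 0 and y = 1.
        log_post = np.stack([
            log_prior[c]
            + X @ np.log(self.p_xi1[c]) + (1 - X) @ np.log(1 - self.p_xi1[c])
            for c in (0, 1)
        ], axis=1)
        return log_post.argmax(axis=1)


# Tiny usage example on synthetic data.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = (rng.random((200, 5)) < np.where(y[:, None] == 1, 0.8, 0.3)).astype(int)
print(NaiveBayesBinary().fit(X, y).predict(X[:5]), y[:5])

The prediction step works in log space, which is the standard way to evaluate the product in Equation (24.7) without numerical underflow.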


                 24.3 LINEAR DISCRIMINANT ANALYSIS

                 Linear discriminant analysis (LDA) is another demonstration of how generative
                 assumptions simplify the learning process. As in the Naive Bayes classifier we con-
sider again the problem of predicting a label $y \in \{0,1\}$ on the basis of a vector of
features $x = (x_1, \ldots, x_d)$. But now the generative assumption is as follows. First,
                 we assume that P[Y = 1] = P[Y = 0] = 1/2. Second, we assume that the condi-
                 tional probability of X given Y is a Gaussian distribution. Finally, the covariance
                 matrix of the Gaussian distribution is the same for both values of the label. Formally,
let $\mu_0, \mu_1 \in \mathbb{R}^d$ and let $\Sigma$ be a covariance matrix. Then, the density distribution is given by
\[
P[X = x \mid Y = y] \;=\; \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\Bigl( -\tfrac{1}{2}\, (x - \mu_y)^{T} \Sigma^{-1} (x - \mu_y) \Bigr).
\]
                    As we have shown in the previous section, using the Bayes rule we can write
\[
h_{\mathrm{Bayes}}(x) = \operatorname*{argmax}_{y \in \{0,1\}} P[Y = y]\, P[X = x \mid Y = y].
\]
This means that we will predict $h_{\mathrm{Bayes}}(x) = 1$ iff


\[
\log \frac{P[Y = 1]\, P[X = x \mid Y = 1]}{P[Y = 0]\, P[X = x \mid Y = 0]} \;>\; 0.
\]
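Under these assumptions the decision rule can be evaluated directly. Below is a minimal sketch that estimates $\mu_0$, $\mu_1$ and a shared covariance matrix by maximum likelihood (pooled over both classes) and predicts $1$ iff the log-ratio above is positive; by the equal-priors assumption $P[Y = 1] = P[Y = 0] = 1/2$ the prior terms cancel. The function names, the small ridge term added for numerical stability, and the synthetic data are illustrative choices, not part of the text.

import numpy as np


def fit_lda(X, y, ridge=1e-6):
    """X: (m, d) real features, y: (m,) labels in {0,1}. Returns (mu0, mu1, Sigma)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    # Pooled (shared) covariance: average outer product of deviations from the class mean.
    centered = X - np.where(y[:, None] == 1, mu1, mu0)
    Sigma = centered.T @ centered / len(X) + ridge * np.eye(X.shape[1])
    return mu0, mu1, Sigma


def predict_lda(X, mu0, mu1, Sigma):
    """Predict 1 iff log( P[X=x|Y=1] / P[X=x|Y=0] ) > 0 (equal priors assumed)."""
    X = np.asarray(X, dtype=float)
    Sinv = np.linalg.inv(Sigma)
    # Quadratic form (x - mu_y)^T Sigma^{-1} (x - mu_y) for each row of X.
    q0 = np.einsum('ij,jk,ik->i', X - mu0, Sinv, X - mu0)
    q1 = np.einsum('ij,jk,ik->i', X - mu1, Sinv, X - mu1)
    # The normalizing constants cancel in the ratio, so the sign depends only on q0 - q1.
    return (0.5 * (q0 - q1) > 0).astype(int)


# Tiny usage example on synthetic Gaussian data with a shared covariance.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X = rng.normal(size=(300, 2)) + np.where(y[:, None] == 1, 1.5, -1.5)
print(predict_lda(X[:10], *fit_lda(X, y)), y[:10])

Since the normalizing constants and the priors cancel, the sign of the log-ratio depends only on the difference of the two quadratic forms.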