parameter. Then,

$$
\begin{aligned}
\mathop{\mathbb{E}}_{x \sim N(\mu,1)}\big[\ell(\hat{\mu},x) - \ell(\mu,x)\big]
  &= \mathop{\mathbb{E}}_{x \sim N(\mu,1)} \log \frac{P_{\mu}[x]}{P_{\hat{\mu}}[x]} \\
  &= \mathop{\mathbb{E}}_{x \sim N(\mu,1)} \left[ -\tfrac{1}{2}(x-\mu)^2 + \tfrac{1}{2}(x-\hat{\mu})^2 \right] \\
  &= \frac{\hat{\mu}^2}{2} - \frac{\mu^2}{2} + (\mu - \hat{\mu}) \mathop{\mathbb{E}}_{x \sim N(\mu,1)}[x] \\
  &= \frac{\hat{\mu}^2}{2} - \frac{\mu^2}{2} + (\mu - \hat{\mu})\mu \\
  &= \frac{1}{2}(\hat{\mu} - \mu)^2.
\end{aligned}
\tag{24.6}
$$
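Equation (24.6) is easy to check numerically. The following sketch (Python; the values of µ and µ̂ are illustrative, not taken from the text) compares a Monte Carlo estimate of the expected excess log-loss with the closed form (µ̂ − µ)²/2:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, mu_hat = 1.0, 1.3  # true and estimated means (illustrative values)
x = rng.normal(loc=mu, scale=1.0, size=1_000_000)

# For N(m, 1), the log-loss is -log P_m[x] = (x - m)^2 / 2 + const;
# the constant cancels in the difference of the two losses.
excess = 0.5 * (x - mu_hat) ** 2 - 0.5 * (x - mu) ** 2

print(excess.mean())             # Monte Carlo estimate, ~0.045
print(0.5 * (mu_hat - mu) ** 2)  # closed form from Equation (24.6): 0.045
```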

Next, we note that µ̂ is the average of m Gaussian variables and therefore it is also distributed normally with mean µ and variance σ²/m. From this fact we can derive bounds of the form: with probability of at least 1 − δ we have that |µ̂ − µ| ≤ ε, where ε depends on σ²/m and on δ.
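For a concrete choice of ε (one of several; the text leaves it unspecified), standardizing gives √m(µ̂ − µ)/σ ∼ N(0,1), so ε can come from the exact Gaussian quantile, or from the tail bound P[|Z| ≥ t] ≤ 2e^(−t²/2), which yields ε = √(2σ² log(2/δ)/m). A minimal sketch, with hypothetical function names:

```python
import numpy as np
from scipy.stats import norm

def epsilon_exact(sigma, m, delta):
    """Exact radius: P(|mu_hat - mu| <= eps) = 1 - delta for mu_hat ~ N(mu, sigma^2/m)."""
    return sigma / np.sqrt(m) * norm.ppf(1 - delta / 2)

def epsilon_tail_bound(sigma, m, delta):
    """Looser radius from the Gaussian tail bound P[|Z| >= t] <= 2*exp(-t^2/2)."""
    return np.sqrt(2 * sigma**2 * np.log(2 / delta) / m)

print(epsilon_exact(1.0, 100, 0.05))       # ~0.196
print(epsilon_tail_bound(1.0, 100, 0.05))  # ~0.272
```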
   In some situations, the maximum likelihood estimator clearly overfits. For example, consider a Bernoulli random variable X and let P[X = 1] = θ. As we saw previously, using Hoeffding's inequality we can easily derive a guarantee on |θ − θ̂| that holds with high probability (see Equation (24.2)). However, if our goal is to obtain a small value of the expected log-loss function as defined in Equation (24.5), we might fail. For example, assume that θ is nonzero but very small. Then, the probability that no element of a sample of size m will be 1 is (1 − θ)^m, which is greater than e^(−2θm). It follows that whenever m ≤ log(2)/(2θ), the probability that the sample is all zeros is at least 50%, and in that case, the maximum likelihood rule will set θ̂ = 0. But the true risk of the estimate θ̂ = 0 is

$$
\begin{aligned}
\mathop{\mathbb{E}}_{x \sim \theta}[\ell(\hat{\theta},x)]
  &= \theta\, \ell(\hat{\theta},1) + (1-\theta)\, \ell(\hat{\theta},0) \\
  &= \theta \log(1/\hat{\theta}) + (1-\theta)\log(1/(1-\hat{\theta})) \\
  &= \theta \log(1/0) = \infty.
\end{aligned}
$$
              This simple example shows that we should be careful in applying the maximum
              likelihood principle.
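A short simulation makes this failure mode concrete. The following sketch (Python; θ and the number of trials are illustrative) checks that with m ≤ log(2)/(2θ) the all-zeros sample indeed occurs at least half the time, and that the resulting estimate θ̂ = 0 has infinite true risk:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.01                          # nonzero but very small (illustrative)
m = int(np.log(2) / (2 * theta))      # m <= log(2)/(2*theta); here m = 34

# Empirical frequency of the all-zeros sample; should be at least 1/2.
trials = 100_000
samples = rng.random((trials, m)) < theta   # Bernoulli(theta) samples
all_zeros = ~samples.any(axis=1)
print(all_zeros.mean())               # ~0.71, i.e. (1 - theta)^m >= 1/2

# On an all-zeros sample the maximum likelihood estimate is theta_hat = 0,
# whose true risk theta*log(1/0) + (1 - theta)*log(1/1) is infinite.
theta_hat = 0.0
with np.errstate(divide="ignore"):
    risk = -theta * np.log(theta_hat) - (1 - theta) * np.log(1 - theta_hat)
print(risk)                           # inf
```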
   To overcome overfitting, we can use a variety of the tools we encountered previously in the book. A simple regularization technique is outlined in Exercise 24.2.
              24.2 NAIVE BAYES
The Naive Bayes classifier is a classical demonstration of how generative assumptions and parameter estimations simplify the learning process. Consider the problem of predicting a label y ∈ {0,1} on the basis of a vector of features x = (x_1, ..., x_d), where we assume that each x_i is in {0,1}. Recall that the Bayes optimal classifier is

$$
h_{\text{Bayes}}(x) = \operatorname*{argmax}_{y \in \{0,1\}} \mathbb{P}[Y = y \mid X = x].
$$
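As a minimal sketch of this rule (Python; the joint distribution table below is hypothetical, and in practice it is exactly what we do not know and must estimate), h_Bayes maximizes the joint probability P[X = x, Y = y] over y, since P[X = x] does not depend on y:

```python
from itertools import product

# Hypothetical joint distribution over (x1, x2, y), with x_i, y in {0, 1}.
outcomes = list(product([0, 1], repeat=3))
probs = [0.20, 0.05, 0.10, 0.15, 0.05, 0.20, 0.05, 0.20]  # sums to 1
joint = dict(zip(outcomes, probs))

def h_bayes(x):
    """Bayes optimal classifier: argmax_y P[Y = y | X = x]; since P[X = x]
    is common to both labels, it suffices to maximize P[X = x, Y = y]."""
    return max((0, 1), key=lambda y: joint[(*x, y)])

print(h_bayes((0, 0)))  # -> 0, because P[(0,0,0)] = 0.20 > P[(0,0,1)] = 0.05
print(h_bayes((1, 0)))  # -> 1, because P[(1,0,1)] = 0.20 > P[(1,0,0)] = 0.05
```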