parameter. Then,
\[
\begin{aligned}
\mathbb{E}_{x\sim N(\mu^\star,1)}\big[\ell(\hat\mu,x)-\ell(\mu^\star,x)\big]
  &= \mathbb{E}_{x\sim N(\mu^\star,1)} \log\frac{P_{\mu^\star}[x]}{P_{\hat\mu}[x]} \\
  &= \mathbb{E}_{x\sim N(\mu^\star,1)}\Big[ -\tfrac{1}{2}(x-\mu^\star)^2 + \tfrac{1}{2}(x-\hat\mu)^2 \Big] \\
  &= \frac{\hat\mu^2}{2} - \frac{(\mu^\star)^2}{2} + (\mu^\star-\hat\mu)\,\mathbb{E}_{x\sim N(\mu^\star,1)}[x] \\
  &= \frac{\hat\mu^2}{2} - \frac{(\mu^\star)^2}{2} + (\mu^\star-\hat\mu)\,\mu^\star \\
  &= \tfrac{1}{2}\,(\hat\mu-\mu^\star)^2 .
\end{aligned}
\tag{24.6}
\]
Next, we note that $\hat\mu$ is the average of $m$ Gaussian variables and therefore it is also
distributed normally with mean $\mu^\star$ and variance $\sigma^2/m$. From this fact we can derive
bounds of the form: with probability of at least $1-\delta$ we have that $|\hat\mu-\mu^\star| \le \epsilon$,
where $\epsilon$ depends on $\sigma^2/m$ and on $\delta$.
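To make Equation (24.6) and the concentration of $\hat\mu$ concrete, here is a minimal numerical sketch (an illustration, not part of the book's text); the values of $\mu^\star$ and $m$ are arbitrary choices, and the log-loss is implemented only up to its additive constant $\log\sqrt{2\pi}$, which cancels in the difference:

```python
import numpy as np

rng = np.random.default_rng(0)

mu_star = 1.5   # true mean (assumed for this illustration)
m = 50          # sample size

# Maximum likelihood estimate of the mean: the sample average.
sample = rng.normal(loc=mu_star, scale=1.0, size=m)
mu_hat = sample.mean()

# Excess expected log-loss of mu_hat over mu_star, per Equation (24.6):
# E[l(mu_hat, x) - l(mu_star, x)] = (mu_hat - mu_star)^2 / 2.
closed_form = 0.5 * (mu_hat - mu_star) ** 2

# Monte Carlo estimate of the same quantity, for comparison.
x = rng.normal(loc=mu_star, scale=1.0, size=1_000_000)
log_loss = lambda mu: 0.5 * (x - mu) ** 2   # log-loss up to the additive constant
monte_carlo = np.mean(log_loss(mu_hat) - log_loss(mu_star))

print(f"mu_hat = {mu_hat:.4f}")
print(f"(mu_hat - mu_star)^2 / 2 = {closed_form:.6f}")
print(f"Monte Carlo excess risk  = {monte_carlo:.6f}")
```

Repeating the experiment with larger $m$ shows $\hat\mu$ concentrating around $\mu^\star$, and hence the excess risk shrinking, in line with the bound sketched above.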
In some situations, the maximum likelihood estimator clearly overfits. For example, consider a Bernoulli random variable $X$ and let $P[X=1]=\theta^\star$. As we saw previously, using Hoeffding's inequality we can easily derive a guarantee on $|\theta^\star-\hat\theta|$ that holds with high probability (see Equation (24.2)). However, if our goal is to obtain a small value of the expected log-loss function as defined in Equation (24.5), we might fail. For example, assume that $\theta^\star$ is nonzero but very small. Then, the probability that no element of a sample of size $m$ will be 1 is $(1-\theta^\star)^m$, which is greater than $e^{-2\theta^\star m}$. It follows that whenever $m \le \frac{\log(2)}{2\theta^\star}$, the probability that the sample is all zeros is at least 50%, and in that case the maximum likelihood rule will set $\hat\theta = 0$. But the true risk of the estimate $\hat\theta = 0$ is
\[
\begin{aligned}
\mathbb{E}_{x\sim\theta^\star}\big[\ell(\hat\theta,x)\big]
  &= \theta^\star\,\ell(\hat\theta,1) + (1-\theta^\star)\,\ell(\hat\theta,0) \\
  &= \theta^\star\log(1/\hat\theta) + (1-\theta^\star)\log\big(1/(1-\hat\theta)\big) \\
  &= \theta^\star\log(1/0) = \infty .
\end{aligned}
\]
This simple example shows that we should be careful in applying the maximum
likelihood principle.
To overcome overfitting, we can use the variety of tools we encountered previ-
ously in the book. A simple regularization technique is outlined in Exercise 24.2.
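The failure mode above is easy to reproduce numerically. The sketch below is an illustration rather than part of the book's text; the choice $\theta^\star = 0.01$ and the use of add-one (Laplace) smoothing as the regularizer are assumptions of the example, and may differ from the scheme proposed in Exercise 24.2:

```python
import numpy as np

rng = np.random.default_rng(0)

theta_star = 0.01                      # true (small) Bernoulli parameter, chosen for illustration
m = int(np.log(2) / (2 * theta_star))  # sample size at which all-zero samples occur w.p. >= 1/2

def expected_log_loss(theta_hat, theta_star):
    """Expected log-loss E_{x ~ theta_star}[l(theta_hat, x)]."""
    with np.errstate(divide="ignore"):
        return (theta_star * -np.log(theta_hat)
                + (1 - theta_star) * -np.log(1 - theta_hat))

# Estimate how often the maximum likelihood rule sees an all-zero sample.
trials = 10_000
all_zero = sum(int(rng.binomial(1, theta_star, size=m).sum() == 0) for _ in range(trials))
print(f"m = {m}, fraction of all-zero samples: {all_zero / trials:.3f}")  # roughly >= 0.5

# Maximum likelihood vs. a smoothed (add-one) estimate on one all-zero sample.
sample = np.zeros(m)
theta_ml = sample.mean()                       # = 0  -> infinite expected log-loss
theta_smooth = (sample.sum() + 1) / (m + 2)    # never exactly 0 or 1
print(f"ML estimate: {theta_ml}, risk = {expected_log_loss(theta_ml, theta_star)}")
print(f"Smoothed estimate: {theta_smooth:.4f}, risk = {expected_log_loss(theta_smooth, theta_star):.4f}")
```

The smoothed estimate pays a small bias but keeps the expected log-loss finite, which is exactly the trade-off a regularizer of this kind is meant to achieve.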
24.2 NAIVE BAYES
The Naive Bayes classifier is a classical demonstration of how generative assump-
tions and parameter estimations simplify the learning process. Consider the problem
of predicting a label $y \in \{0,1\}$ on the basis of a vector of features $x = (x_1,\ldots,x_d)$,
where we assume that each $x_i$ is in $\{0,1\}$. Recall that the Bayes optimal classifier is
\[
h_{\text{Bayes}}(x) = \operatorname*{argmax}_{y\in\{0,1\}} \; P[Y=y \mid X=x].
\]
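As an illustrative companion to this setup (not a prescription from the text), the sketch below implements a plug-in version of this classifier for binary features under the naive Bayes assumption that the features are conditionally independent given the label; the toy data-generating parameters and the add-one smoothing of the estimated probabilities are choices of the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_naive_bayes(X, y):
    """Estimate P[Y=1] and P[X_i=1 | Y=y] for binary features,
    with add-one smoothing (a choice of this sketch)."""
    p_y1 = y.mean()
    p_x1 = np.empty((2, X.shape[1]))   # p_x1[y_val, i] ~ P[X_i = 1 | Y = y_val]
    for y_val in (0, 1):
        rows = X[y == y_val]
        p_x1[y_val] = (rows.sum(axis=0) + 1) / (len(rows) + 2)
    return p_y1, p_x1

def predict(x, p_y1, p_x1):
    """h(x) = argmax_y P[Y=y] * prod_i P[X_i = x_i | Y=y], computed in log space."""
    scores = []
    for y_val, prior in ((0, 1 - p_y1), (1, p_y1)):
        log_lik = np.sum(x * np.log(p_x1[y_val]) + (1 - x) * np.log(1 - p_x1[y_val]))
        scores.append(np.log(prior) + log_lik)
    return int(np.argmax(scores))

# Toy data: d = 3 binary features whose distribution depends on the label.
d, m = 3, 2000
y = rng.binomial(1, 0.5, size=m)
feature_probs = np.where(y[:, None] == 1, 0.8, 0.2)   # P[X_i = 1 | Y]
X = rng.binomial(1, feature_probs, size=(m, d))

p_y1, p_x1 = fit_naive_bayes(X, y)
print(predict(np.array([1, 1, 0]), p_y1, p_x1))  # likely 1: two of the three features are on
```

The point of the conditional-independence assumption is that only $O(d)$ probabilities need to be estimated, rather than one per possible feature vector $x$.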