parameter. Then,
\[
\begin{aligned}
\mathbb{E}_{x\sim N(\mu^\star,1)}\big[\ell(\hat\mu,x)-\ell(\mu^\star,x)\big]
  &= \mathbb{E}_{x\sim N(\mu^\star,1)} \log\frac{P_{\mu^\star}[x]}{P_{\hat\mu}[x]} \\
  &= \mathbb{E}_{x\sim N(\mu^\star,1)}\Big[ -\tfrac{1}{2}(x-\mu^\star)^2 + \tfrac{1}{2}(x-\hat\mu)^2 \Big] \\
  &= \frac{\hat\mu^2}{2} - \frac{(\mu^\star)^2}{2} + (\mu^\star-\hat\mu)\,\mathbb{E}_{x\sim N(\mu^\star,1)}[x] \\
  &= \frac{\hat\mu^2}{2} - \frac{(\mu^\star)^2}{2} + (\mu^\star-\hat\mu)\,\mu^\star \\
  &= \tfrac{1}{2}\,(\hat\mu-\mu^\star)^2 .
\end{aligned}
\tag{24.6}
\]
Next, we note that $\hat\mu$ is the average of $m$ Gaussian variables and therefore it is also
distributed normally with mean $\mu^\star$ and variance $\sigma^2/m$. From this fact we can derive
bounds of the form: with probability of at least $1-\delta$ we have that $|\hat\mu-\mu^\star| \le \epsilon$,
where $\epsilon$ depends on $\sigma^2/m$ and on $\delta$.
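To make Equation (24.6) and the concentration of $\hat\mu$ concrete, here is a minimal numerical sketch (an illustration, not part of the book's text); the values of $\mu^\star$ and $m$ are arbitrary choices, and the log-loss is implemented only up to its additive constant $\log\sqrt{2\pi}$, which cancels in the difference:

```python
import numpy as np

rng = np.random.default_rng(0)

mu_star = 1.5   # true mean (assumed for this illustration)
m = 50          # sample size

# Maximum likelihood estimate of the mean: the sample average.
sample = rng.normal(loc=mu_star, scale=1.0, size=m)
mu_hat = sample.mean()

# Excess expected log-loss of mu_hat over mu_star, per Equation (24.6):
# E[l(mu_hat, x) - l(mu_star, x)] = (mu_hat - mu_star)^2 / 2.
closed_form = 0.5 * (mu_hat - mu_star) ** 2

# Monte Carlo estimate of the same quantity, for comparison.
x = rng.normal(loc=mu_star, scale=1.0, size=1_000_000)
log_loss = lambda mu: 0.5 * (x - mu) ** 2   # log-loss up to the additive constant
monte_carlo = np.mean(log_loss(mu_hat) - log_loss(mu_star))

print(f"mu_hat = {mu_hat:.4f}")
print(f"(mu_hat - mu_star)^2 / 2 = {closed_form:.6f}")
print(f"Monte Carlo excess risk  = {monte_carlo:.6f}")
```

Repeating the experiment with larger $m$ shows $\hat\mu$ concentrating around $\mu^\star$, and hence the excess risk shrinking, in line with the bound sketched above.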
In some situations, the maximum likelihood estimator clearly overfits. For example, consider a Bernoulli random variable $X$ and let $P[X=1]=\theta^\star$. As we saw previously, using Hoeffding's inequality we can easily derive a guarantee on $|\theta^\star-\hat\theta|$ that holds with high probability (see Equation (24.2)). However, if our goal is to obtain a small value of the expected log-loss function as defined in Equation (24.5), we might fail. For example, assume that $\theta^\star$ is nonzero but very small. Then, the probability that no element of a sample of size $m$ will be 1 is $(1-\theta^\star)^m$, which is greater than $e^{-2\theta^\star m}$. It follows that whenever $m \le \frac{\log(2)}{2\theta^\star}$, the probability that the sample is all zeros is at least 50%, and in that case the maximum likelihood rule will set $\hat\theta = 0$. But the true risk of the estimate $\hat\theta = 0$ is
\[
\begin{aligned}
\mathbb{E}_{x\sim\theta^\star}\big[\ell(\hat\theta,x)\big]
  &= \theta^\star\,\ell(\hat\theta,1) + (1-\theta^\star)\,\ell(\hat\theta,0) \\
  &= \theta^\star\log(1/\hat\theta) + (1-\theta^\star)\log\big(1/(1-\hat\theta)\big) \\
  &= \theta^\star\log(1/0) = \infty .
\end{aligned}
\]
This simple example shows that we should be careful in applying the maximum
likelihood principle.
To overcome overfitting, we can use the variety of tools we encountered previ-
ously in the book. A simple regularization technique is outlined in Exercise 24.2.
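The failure mode above is easy to reproduce numerically. The sketch below is an illustration rather than part of the book's text; the choice $\theta^\star = 0.01$ and the use of add-one (Laplace) smoothing as the regularizer are assumptions of the example, and may differ from the scheme proposed in Exercise 24.2:

```python
import numpy as np

rng = np.random.default_rng(0)

theta_star = 0.01                      # true (small) Bernoulli parameter, chosen for illustration
m = int(np.log(2) / (2 * theta_star))  # sample size at which all-zero samples occur w.p. >= 1/2

def expected_log_loss(theta_hat, theta_star):
    """Expected log-loss E_{x ~ theta_star}[l(theta_hat, x)]."""
    with np.errstate(divide="ignore"):
        return (theta_star * -np.log(theta_hat)
                + (1 - theta_star) * -np.log(1 - theta_hat))

# Estimate how often the maximum likelihood rule sees an all-zero sample.
trials = 10_000
all_zero = sum(int(rng.binomial(1, theta_star, size=m).sum() == 0) for _ in range(trials))
print(f"m = {m}, fraction of all-zero samples: {all_zero / trials:.3f}")  # roughly >= 0.5

# Maximum likelihood vs. a smoothed (add-one) estimate on one all-zero sample.
sample = np.zeros(m)
theta_ml = sample.mean()                       # = 0  -> infinite expected log-loss
theta_smooth = (sample.sum() + 1) / (m + 2)    # never exactly 0 or 1
print(f"ML estimate: {theta_ml}, risk = {expected_log_loss(theta_ml, theta_star)}")
print(f"Smoothed estimate: {theta_smooth:.4f}, risk = {expected_log_loss(theta_smooth, theta_star):.4f}")
```

The smoothed estimate pays a small bias but keeps the expected log-loss finite, which is exactly the trade-off a regularizer of this kind is meant to achieve.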
24.2 NAIVE BAYES
The Naive Bayes classifier is a classical demonstration of how generative assump-
tions and parameter estimations simplify the learning process. Consider the problem
of predicting a label $y \in \{0,1\}$ on the basis of a vector of features $x = (x_1,\ldots,x_d)$,
where we assume that each $x_i$ is in $\{0,1\}$. Recall that the Bayes optimal classifier is
\[
h_{\text{Bayes}}(x) = \operatorname*{argmax}_{y\in\{0,1\}} \; P[Y=y \mid X=x].
\]
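As an illustrative companion to this setup (not a prescription from the text), the sketch below implements a plug-in version of this classifier for binary features under the naive Bayes assumption that the features are conditionally independent given the label; the toy data-generating parameters and the add-one smoothing of the estimated probabilities are choices of the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_naive_bayes(X, y):
    """Estimate P[Y=1] and P[X_i=1 | Y=y] for binary features,
    with add-one smoothing (a choice of this sketch)."""
    p_y1 = y.mean()
    p_x1 = np.empty((2, X.shape[1]))   # p_x1[y_val, i] ~ P[X_i = 1 | Y = y_val]
    for y_val in (0, 1):
        rows = X[y == y_val]
        p_x1[y_val] = (rows.sum(axis=0) + 1) / (len(rows) + 2)
    return p_y1, p_x1

def predict(x, p_y1, p_x1):
    """h(x) = argmax_y P[Y=y] * prod_i P[X_i = x_i | Y=y], computed in log space."""
    scores = []
    for y_val, prior in ((0, 1 - p_y1), (1, p_y1)):
        log_lik = np.sum(x * np.log(p_x1[y_val]) + (1 - x) * np.log(1 - p_x1[y_val]))
        scores.append(np.log(prior) + log_lik)
    return int(np.argmax(scores))

# Toy data: d = 3 binary features whose distribution depends on the label.
d, m = 3, 2000
y = rng.binomial(1, 0.5, size=m)
feature_probs = np.where(y[:, None] == 1, 0.8, 0.2)   # P[X_i = 1 | Y]
X = rng.binomial(1, feature_probs, size=(m, d))

p_y1, p_x1 = fit_naive_bayes(X, y)
print(predict(np.array([1, 1, 0]), p_y1, p_x1))  # likely 1: two of the three features are on
```

The point of the conditional-independence assumption is that only $O(d)$ probabilities need to be estimated, rather than one per possible feature vector $x$.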