To describe the probability function $P[Y = y \mid X = x]$ we need $2^d$ parameters, each of which corresponds to $P[Y = 1 \mid X = x]$ for a certain value of $x \in \{0,1\}^d$. This implies that the number of examples we need grows exponentially with the number of features.
                    In the Naive Bayes approach we make the (rather naive) generative assumption
                 that given the label, the features are independent of each other. That is,

\[
P[X = x \mid Y = y] \;=\; \prod_{i=1}^{d} P[X_i = x_i \mid Y = y].
\]
                 With this assumption and using the Bayes rule, the Bayes optimal classifier can be
                 further simplified:

\begin{align}
h_{\mathrm{Bayes}}(x) &= \operatorname*{argmax}_{y \in \{0,1\}} P[Y = y \mid X = x] \nonumber \\
&= \operatorname*{argmax}_{y \in \{0,1\}} P[Y = y]\, P[X = x \mid Y = y] \nonumber \\
&= \operatorname*{argmax}_{y \in \{0,1\}} P[Y = y] \prod_{i=1}^{d} P[X_i = x_i \mid Y = y]. \tag{24.7}
\end{align}
That is, now the number of parameters we need to estimate is only $2d + 1$. Here, the generative assumption we made significantly reduced the number of parameters we need to learn.
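For a sense of scale, take a hypothetical dimension of $d = 30$ binary features (the specific number is an illustration, not taken from the text):
\[
2^{d} = 2^{30} \approx 1.07 \times 10^{9} \ \text{ parameters for the full table,}
\qquad \text{versus} \qquad
2d + 1 = 61 \ \text{ under the Naive Bayes assumption.}
\]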
                    When we also estimate the parameters using the maximum likelihood principle,
                 the resulting classifier is called the Naive Bayes classifier.
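The following is a minimal sketch of such a classifier for binary features. The class prior and the conditional probabilities are estimated from empirical counts (maximum likelihood); a small Laplace-smoothing constant is added so that feature values never seen for a class do not yield zero probabilities. The class name, the smoothing constant, and the synthetic data are illustrative choices, not part of the text.

import numpy as np


class NaiveBayesBinary:
    def fit(self, X, y, alpha=1.0):
        """X: (m, d) array of {0,1} features, y: (m,) array of {0,1} labels."""
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        # Estimate P[Y = 1]: the class prior (1 parameter).
        self.p_y1 = y.mean()
        # Estimate P[X_i = 1 | Y = y] for each feature and each label (2d parameters).
        # alpha adds Laplace smoothing so empty counts do not give zero probabilities.
        self.p_xi1 = np.array([
            (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
            for c in (0, 1)
        ])  # shape (2, d)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        log_prior = np.log([1 - self.p_y1, self.p_y1])
        # log P[Y = y] + sum_i log P[X_i = x_i | Y = y], computed for y = 0 and y = 1.
        log_post = np.stack([
            log_prior[c]
            + X @ np.log(self.p_xi1[c]) + (1 - X) @ np.log(1 - self.p_xi1[c])
            for c in (0, 1)
        ], axis=1)
        return log_post.argmax(axis=1)


# Tiny usage example on synthetic data.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = (rng.random((200, 5)) < np.where(y[:, None] == 1, 0.8, 0.3)).astype(int)
print(NaiveBayesBinary().fit(X, y).predict(X[:5]), y[:5])

The prediction step works in log space, which is the standard way to evaluate the product in Equation (24.7) without numerical underflow.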


                 24.3 LINEAR DISCRIMINANT ANALYSIS

                 Linear discriminant analysis (LDA) is another demonstration of how generative
                 assumptions simplify the learning process. As in the Naive Bayes classifier we con-
sider again the problem of predicting a label $y \in \{0,1\}$ on the basis of a vector of
features $x = (x_1, \ldots, x_d)$. But now the generative assumption is as follows. First,
                 we assume that P[Y = 1] = P[Y = 0] = 1/2. Second, we assume that the condi-
                 tional probability of X given Y is a Gaussian distribution. Finally, the covariance
                 matrix of the Gaussian distribution is the same for both values of the label. Formally,
let $\mu_0, \mu_1 \in \mathbb{R}^d$ and let $\Sigma$ be a covariance matrix. Then, the density distribution is given by
\[
P[X = x \mid Y = y] \;=\; \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\Bigl( -\tfrac{1}{2}\, (x - \mu_y)^{T} \Sigma^{-1} (x - \mu_y) \Bigr).
\]
                    As we have shown in the previous section, using the Bayes rule we can write
\[
h_{\mathrm{Bayes}}(x) = \operatorname*{argmax}_{y \in \{0,1\}} P[Y = y]\, P[X = x \mid Y = y].
\]
This means that we will predict $h_{\mathrm{Bayes}}(x) = 1$ iff


\[
\log \frac{P[Y = 1]\, P[X = x \mid Y = 1]}{P[Y = 0]\, P[X = x \mid Y = 0]} \;>\; 0.
\]
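Under these assumptions the decision rule can be evaluated directly. Below is a minimal sketch that estimates $\mu_0$, $\mu_1$ and a shared covariance matrix by maximum likelihood (pooled over both classes) and predicts $1$ iff the log-ratio above is positive; by the equal-priors assumption $P[Y = 1] = P[Y = 0] = 1/2$ the prior terms cancel. The function names, the small ridge term added for numerical stability, and the synthetic data are illustrative choices, not part of the text.

import numpy as np


def fit_lda(X, y, ridge=1e-6):
    """X: (m, d) real features, y: (m,) labels in {0,1}. Returns (mu0, mu1, Sigma)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    # Pooled (shared) covariance: average outer product of deviations from the class mean.
    centered = X - np.where(y[:, None] == 1, mu1, mu0)
    Sigma = centered.T @ centered / len(X) + ridge * np.eye(X.shape[1])
    return mu0, mu1, Sigma


def predict_lda(X, mu0, mu1, Sigma):
    """Predict 1 iff log( P[X=x|Y=1] / P[X=x|Y=0] ) > 0 (equal priors assumed)."""
    X = np.asarray(X, dtype=float)
    Sinv = np.linalg.inv(Sigma)
    # Quadratic form (x - mu_y)^T Sigma^{-1} (x - mu_y) for each row of X.
    q0 = np.einsum('ij,jk,ik->i', X - mu0, Sinv, X - mu0)
    q1 = np.einsum('ij,jk,ik->i', X - mu1, Sinv, X - mu1)
    # The normalizing constants cancel in the ratio, so the sign depends only on q0 - q1.
    return (0.5 * (q0 - q1) > 0).astype(int)


# Tiny usage example on synthetic Gaussian data with a shared covariance.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X = rng.normal(size=(300, 2)) + np.where(y[:, None] == 1, 1.5, -1.5)
print(predict_lda(X[:10], *fit_lda(X, y)), y[:10])

Since the normalizing constants and the priors cancel, the sign of the log-ratio depends only on the difference of the two quadratic forms.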