Page 318 - Understanding Machine Learning
P. 318
Generative Models
300
d
To describe the probability function P[Y = y|X = x] we need 2 parameters, each
d
of which corresponds to P[Y = 1|X = x] for a certain value of x ∈{0,1} .This
implies that the number of examples we need grows exponentially with the number
of features.
In the Naive Bayes approach we make the (rather naive) generative assumption
that given the label, the features are independent of each other. That is,
d
P[X = x|Y = y] = P[X i = x i |Y = y].
i=1
With this assumption and using the Bayes rule, the Bayes optimal classifier can be
further simplified:
h Bayes (x) = argmaxP[Y = y|X = x]
y∈{0,1}
= argmaxP[Y = y]P[X = x|Y = y]
y∈{0,1}
d
= argmaxP[Y = y] P[X i = x i |Y = y]. (24.7)
y∈{0,1} i=1
That is, now the number of parameters we need to estimate is only 2d +1. Here, the
generative assumption we made reduced significantly the number of parameters we
need to learn.
When we also estimate the parameters using the maximum likelihood principle,
the resulting classifier is called the Naive Bayes classifier.
24.3 LINEAR DISCRIMINANT ANALYSIS
Linear discriminant analysis (LDA) is another demonstration of how generative
assumptions simplify the learning process. As in the Naive Bayes classifier we con-
sider again the problem of predicting a label y ∈{0,1} on the basis of a vector of
features x = (x 1 ,...,x d ). But now the generative assumption is as follows. First,
we assume that P[Y = 1] = P[Y = 0] = 1/2. Second, we assume that the condi-
tional probability of X given Y is a Gaussian distribution. Finally, the covariance
matrix of the Gaussian distribution is the same for both values of the label. Formally,
d
let µ ,µ ∈ R and let be a covariance matrix. Then, the density distribution is
1
0
given by
1 1 T −1
P[X = x|Y = y] = exp − (x − µ ) (x − µ ) .
y
y
(2π) d/2 | | 1/2 2
As we have shown in the previous section, using the Bayes rule we can write
h Bayes (x) = argmaxP[Y = y]P[X = x|Y = y].
y∈{0,1}
This means that we will predict h Bayes (x) = 1iff
P[Y = 1]P[X = x|Y = 1]
log > 0.
P[Y = 0]P[X = x|Y = 0]