

              This ratio is often called the log-likelihood ratio.
                 In our case, the log-likelihood ratio becomes

$$\frac{1}{2}(x-\mu_0)^T \Sigma^{-1}(x-\mu_0) \;-\; \frac{1}{2}(x-\mu_1)^T \Sigma^{-1}(x-\mu_1).$$
              We can rewrite this as ⟨w, x⟩ + b where
$$w = (\mu_1 - \mu_0)^T \Sigma^{-1} \quad\text{and}\quad b = \frac{1}{2}\left(\mu_0^T \Sigma^{-1}\mu_0 - \mu_1^T \Sigma^{-1}\mu_1\right). \tag{24.8}$$
                 As a result of the preceding derivation we obtain that under the aforementioned
              generative assumptions, the Bayes optimal classifier is a linear classifier. Addition-
              ally, one may train the classifier by estimating the parameters µ_0, µ_1, and Σ from the
              data, using, for example, the maximum likelihood estimator. With those estimators
              at hand, the values of w and b can be calculated as in Equation (24.8).
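              To make this last step concrete, here is a minimal NumPy sketch (our illustration, not code from the book; the function and variable names are ours) that estimates µ_0, µ_1, and the shared covariance matrix Σ by maximum likelihood from labeled samples and plugs them into Equation (24.8):

```python
import numpy as np

def fit_linear_gaussian_classifier(X0, X1):
    """X0, X1: arrays of shape (m0, d) and (m1, d) holding the examples of each class."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Maximum likelihood estimate of the shared covariance matrix:
    # pool the centered examples of both classes.
    centered = np.vstack([X0 - mu0, X1 - mu1])
    Sigma = centered.T @ centered / centered.shape[0]
    Sigma_inv = np.linalg.inv(Sigma)
    # Equation (24.8): w = (mu1 - mu0)^T Sigma^{-1} and
    # b = (1/2) (mu0^T Sigma^{-1} mu0 - mu1^T Sigma^{-1} mu1).
    # Sigma^{-1} is symmetric, so Sigma_inv @ (mu1 - mu0) equals (mu1 - mu0)^T Sigma^{-1}.
    w = Sigma_inv @ (mu1 - mu0)
    b = 0.5 * (mu0 @ Sigma_inv @ mu0 - mu1 @ Sigma_inv @ mu1)
    return w, b

# The resulting classifier predicts label 1 iff <w, x> + b > 0.
```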
              24.4 LATENT VARIABLES AND THE EM ALGORITHM
              In generative models we assume that the data is generated by sampling from a spe-
              cific parametric distribution over our instance space X. Sometimes, it is convenient
              to express this distribution using latent random variables. A natural example is a
              mixture of k Gaussian distributions. That is, X = R^d and we assume that each x is
              generated as follows. First, we choose a random number in {1,...,k}. Let Y be a
              random variable corresponding to this choice, and denote P[Y = y] = c_y. Second,
              we choose x on the basis of the value of Y according to a Gaussian distribution

                                        1            1        T  −1
                  P[X = x|Y = y] =    d/2   1/2  exp − (x − µ )   y  (x − µ ) .  (24.9)
                                                                        y
                                                            y
                                  (2π)  |  y |       2
              Therefore, the density of X can be written as:
$$P[X = x] = \sum_{y=1}^{k} P[Y = y]\, P[X = x \mid Y = y] = \sum_{y=1}^{k} c_y\, \frac{1}{(2\pi)^{d/2}\,|\Sigma_y|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_y)^T \Sigma_y^{-1}(x-\mu_y)\right).$$
              Note that Y is a hidden variable that we do not observe in our data. Nevertheless,
              we introduce Y since it helps us describe a simple parametric form of the probability
              of X.
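              The two-step generative process and the resulting mixture density can be written down directly. The following sketch (illustrative parameter values and names of our own choosing, not from the text) draws a point exactly this way, for k = 2 components in the plane, and evaluates the density above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Example parameters for k = 2 components in d = 2 dimensions (arbitrary values).
c = np.array([0.3, 0.7])                         # c_y = P[Y = y]
mus = np.array([[0.0, 0.0], [3.0, 3.0]])         # mu_y
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2)])  # Sigma_y

def sample():
    """First draw Y according to c, then draw x from N(mu_Y, Sigma_Y)."""
    y = rng.choice(len(c), p=c)
    return rng.multivariate_normal(mus[y], Sigmas[y]), y

def density(x):
    """P[X = x] = sum_y c_y * N(x; mu_y, Sigma_y)."""
    d = x.shape[0]
    total = 0.0
    for y in range(len(c)):
        diff = x - mus[y]
        quad = diff @ np.linalg.inv(Sigmas[y]) @ diff
        norm = (2 * np.pi) ** (d / 2) * np.linalg.det(Sigmas[y]) ** 0.5
        total += c[y] * np.exp(-0.5 * quad) / norm
    return total

x, y = sample()
print(y, density(x))
```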
                 More generally, let θ be the parameters of the joint distribution of X and Y (e.g.,
              in the preceding example, θ consists of c_y, µ_y, and Σ_y, for all y = 1,...,k). Then, the
              log-likelihood of an observation x can be written as
$$\log P_\theta[X = x] = \log\left(\sum_{y=1}^{k} P_\theta[X = x, Y = y]\right).$$
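              This quantity can be computed directly by summing over the k possible values of Y. The sketch below (our addition, reusing the parameter names c, mus, Sigmas from the previous snippet) does the sum in log space with a log-sum-exp, a standard numerical device to avoid underflow when the joint probabilities P_θ[X = x, Y = y] are tiny:

```python
import numpy as np

def log_likelihood(x, c, mus, Sigmas):
    """log P_theta[X = x] for the Gaussian mixture with parameters theta = (c, mus, Sigmas)."""
    d = x.shape[0]
    log_joint = np.empty(len(c))  # log P_theta[X = x, Y = y] for each y
    for y in range(len(c)):
        diff = x - mus[y]
        quad = diff @ np.linalg.inv(Sigmas[y]) @ diff
        log_norm = 0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(Sigmas[y])))
        log_joint[y] = np.log(c[y]) - 0.5 * quad - log_norm
    # log-sum-exp: log P_theta[X = x] = log sum_y exp(log P_theta[X = x, Y = y]),
    # computed by factoring out the maximum term to avoid underflow.
    m = log_joint.max()
    return m + np.log(np.exp(log_joint - m).sum())
```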