Generative Models
   As an example, let us consider again the drug company which developed a new drug. On the basis of past experience, the statisticians at the drug company believe that whenever a drug has reached the level of clinical experiments on people, it is likely to be effective. They model this prior belief by defining a density distribution on θ such that

                 P[θ] = 0.8  if θ > 0.5,   and   P[θ] = 0.2  if θ ≤ 0.5.        (24.15)
As before, given a specific value of θ, it is assumed that the conditional probability, P[X = x|θ], is known. In the drug company example, X takes values in {0, 1} and P[X = x|θ] = θ^x (1 − θ)^(1−x).
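In code, this conditional probability is simply the Bernoulli likelihood. The following minimal Python sketch makes it explicit; the function name is an illustrative choice, not part of the text:

    def p_x_given_theta(x: int, theta: float) -> float:
        # P[X = x | theta] = theta^x * (1 - theta)^(1 - x), for x in {0, 1}
        return theta**x * (1.0 - theta)**(1 - x)

    # For example, with theta = 0.7: P[X = 1 | theta] = 0.7 and P[X = 0 | theta] = 0.3.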
   Once the prior distribution over θ and the conditional distribution over X given θ are defined, we again have complete knowledge of the distribution over X. This is because we can write the probability over X as a marginal probability


                       P[X = x] = Σ_θ P[X = x, θ] = Σ_θ P[θ] P[X = x|θ],
                 where the last equality follows from the definition of conditional probability. If θ
                 is continuous we replace P[θ] with the density function and the sum becomes an
                 integral:
                             P[X = x] = ∫_θ P[θ] P[X = x|θ] dθ.
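As an illustration, the following minimal Python sketch evaluates this integral numerically for the piecewise prior in (24.15). Treating that prior as an unnormalized density over [0, 1] (and normalizing it in code so that it integrates to 1) is an assumption of this sketch, as is the grid resolution:

    import numpy as np

    def prior(theta):
        # Piecewise prior from (24.15), treated as an unnormalized density on [0, 1].
        return np.where(theta > 0.5, 0.8, 0.2)

    theta = np.linspace(0.0, 1.0, 10001)
    p_theta = prior(theta)
    p_theta = p_theta / np.trapz(p_theta, theta)   # normalize to a proper density

    # Marginal: P[X = 1] = ∫ P[theta] P[X = 1 | theta] dtheta, and P[X = 1 | theta] = theta
    p_x1 = np.trapz(p_theta * theta, theta)
    print(p_x1)   # ≈ 0.65 under this normalization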
   Seemingly, once we know P[X = x], a training set S = (x_1, ..., x_m) tells us nothing, as we are already experts who know the distribution over a new point X. However, the Bayesian view introduces a dependency between S and X. This is because we now refer to θ as a random variable. A new point X and the previous points in S are independent only conditioned on θ. This is different from the frequentist philosophy, in which θ is a parameter that we might not know, but since it is just a parameter of the distribution, a new point X and previous points S are always independent.
   In the Bayesian framework, since X and S are not independent anymore, what we would like to calculate is the probability of X given S, which by the chain rule can be written as follows:


                P[X = x|S] = Σ_θ P[X = x|θ, S] P[θ|S] = Σ_θ P[X = x|θ] P[θ|S].
The second equality follows from the assumption that X and S are independent when we condition on θ. Using Bayes' rule we have

                                  P[θ|S] = P[S|θ] P[θ] / P[S],

and together with the assumption that points are independent conditioned on θ, we can write

                 P[θ|S] = P[S|θ] P[θ] / P[S] = (1/P[S]) Π_{i=1}^{m} P[X = x_i|θ] P[θ].
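Putting the pieces together, a minimal Python sketch of this posterior-predictive calculation (continuous θ, so the sum becomes an integral) could look as follows. The sample S and the grid resolution are illustrative assumptions, and the prior is again the piecewise one from (24.15); the normalization constant P[S] is obtained numerically, so any unnormalized scaling of the prior cancels:

    import numpy as np

    def prior(theta):
        # Piecewise prior from (24.15), used as an (unnormalized) density on [0, 1].
        return np.where(theta > 0.5, 0.8, 0.2)

    def likelihood(S, theta):
        # prod_i P[X = x_i | theta] = theta^(#ones) * (1 - theta)^(#zeros)
        ones = sum(S)
        return theta**ones * (1.0 - theta)**(len(S) - ones)

    S = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1]          # assumed sample: 7 successes in 10 trials
    theta = np.linspace(0.0, 1.0, 10001)

    unnorm_post = likelihood(S, theta) * prior(theta)
    post = unnorm_post / np.trapz(unnorm_post, theta)   # P[theta | S], via Bayes' rule

    # Posterior predictive: P[X = 1 | S] = ∫ P[X = 1 | theta] P[theta | S] dtheta
    p_x1_given_S = np.trapz(theta * post, theta)
    print(p_x1_given_S)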