

                                                               
    P(v_i = 1 \mid h; \theta) = \sigma\left( \sum_{j=1}^{J} w_{ij} h_j + a_i \right)                              (5)

where \sigma(x) = \frac{1}{1 + e^{-x}} is a sigmoid function (Hinton, 2006; Hinton et al., 2006).
Real-valued GRBMs have a conditional probability for h_j = 1, a hidden variable turned on, given the evidence vector v, of the form:

                                                                  
    P(h_j = 1 \mid v; \theta) = \sigma\left( \sum_{i=1}^{I} w_{ij} v_i + b_j \right)                              (6)
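
To make the two sigmoid conditionals concrete, the short NumPy sketch below (not taken from the chapter; the names W, a, b, v, and h simply mirror the notation above, and the parameter values are arbitrary) evaluates equations (5) and (6) for a small binary RBM:

import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x)), applied element-wise
    return 1.0 / (1.0 + np.exp(-x))

def p_v_given_h(h, W, a):
    # Equation (5): P(v_i = 1 | h) = sigma(sum_j w_ij h_j + a_i)
    return sigmoid(W @ h + a)       # shape (I,)

def p_h_given_v(v, W, b):
    # Equation (6): P(h_j = 1 | v) = sigma(sum_i w_ij v_i + b_j)
    return sigmoid(W.T @ v + b)     # shape (J,)

# Toy usage with arbitrary parameters (illustrative only).
rng = np.random.default_rng(0)
I, J = 6, 4                                   # numbers of visible and hidden units
W = 0.01 * rng.standard_normal((I, J))        # weights w_ij
a, b = np.zeros(I), np.zeros(J)               # visible and hidden biases
v = rng.integers(0, 2, size=I).astype(float)  # one binary evidence vector
print(p_h_given_v(v, W, b))                   # J hidden activation probabilities

Note that the single weight matrix W is shared between the two directions: equation (5) uses W directly, while equation (6) uses its transpose.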

The GRBM conditional probability for v_i, given the evidence vector h, is continuous-normal in nature and has the form

                                                                
    P(v_i \mid h; \theta) = \mathcal{N}\left( \sum_{j=1}^{J} w_{ij} h_j + a_i,\; 1 \right)                              (7)

where \mathcal{N}(\mu_i, 1) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(v_i - \mu_i)^2}{2}} is a Gaussian distribution with mean \mu_i = \sum_{j=1}^{J} w_{ij} h_j + a_i and variance of unity (Mohamed et al., 2012; Cho, Ilin, & Raiko, 2011).
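
A minimal sketch of the GRBM visible conditional in equation (7), in the same assumed NumPy setting as above (illustrative only, not from the chapter): each real-valued visible unit is drawn from a unit-variance Gaussian centred on \mu_i = \sum_j w_{ij} h_j + a_i.

import numpy as np

def sample_v_given_h(h, W, a, rng):
    # Equation (7): v_i ~ N(mu_i, 1) with mu = W h + a
    mu = W @ h + a                             # Gaussian means, shape (I,)
    return mu + rng.standard_normal(mu.shape)  # add unit-variance noise

rng = np.random.default_rng(1)
I, J = 6, 4
W = 0.01 * rng.standard_normal((I, J))        # weights w_ij
a = np.zeros(I)                               # visible biases a_i
h = rng.integers(0, 2, size=J).astype(float)  # a binary hidden sample
print(sample_v_given_h(h, W, a, rng))         # real-valued visible sample
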
Learning from input data in an RBM can be summarized as calculating a good set of neuron connection weight vectors, w, that produce the smallest error for the training (input-data) vectors. This also implies that a good set of bias (b and a) vectors must be determined. Because learning the weights and biases is done iteratively, the weight update rule is given by Δw_ij (equation 8). This is the partial derivative of the log-likelihood probability of a training vector with respect to the weights,

    \frac{\partial \log[p(v)]}{\partial w_{ij}} = \Delta w_{ij} = \langle v_i h_j \rangle_{\mathrm{data}} - \langle v_i h_j \rangle_{\mathrm{model}}                              (8)

This is well explained by Salakhutdinov and Murray (2008), Hinton (2010), and Zhang et al. (2014). However, this exact computation is intractable because 〈v_i h_j〉_model takes exponential time to calculate exactly (Mohamed et al., 2011). In practice, the gradient of the log-likelihood is approximated.

The contrastive divergence learning rule is used to approximate the gradient of the log-likelihood probability of a training vector with respect to the neuron connection weights. The simplified learning rule for an RBM has the form (Längkvist et al., 2014):

    \Delta w_{ij} \propto \langle v_i h_j \rangle_{\mathrm{data}} - \langle v_i h_j \rangle_{\mathrm{reconstruction}}                              (9)
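
A minimal CD-1 sketch for a binary RBM, again assuming NumPy and not taken from the chapter, shows how equation (9) approximates the gradient of equation (8): the data term 〈v_i h_j〉 is estimated from a training batch, the reconstruction term from a single Gibbs step, and their difference scales the weight and bias updates.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, a, b, lr, rng):
    # One contrastive-divergence (k = 1) step over a batch V of shape (N, I).
    # Positive phase: hidden probabilities driven by the data.
    ph_data = sigmoid(V @ W + b)                        # (N, J)
    h_sample = (rng.random(ph_data.shape) < ph_data).astype(float)
    # Negative phase: one-step reconstruction of the visible units.
    pv_recon = sigmoid(h_sample @ W.T + a)              # (N, I)
    ph_recon = sigmoid(pv_recon @ W + b)                # (N, J)
    # Equation (9): data correlations minus reconstruction correlations.
    dW = (V.T @ ph_data - pv_recon.T @ ph_recon) / len(V)
    da = (V - pv_recon).mean(axis=0)
    db = (ph_data - ph_recon).mean(axis=0)
    return W + lr * dW, a + lr * da, b + lr * db

# Toy usage on a random binary batch (illustrative only).
rng = np.random.default_rng(2)
N, I, J = 16, 6, 4
V = rng.integers(0, 2, size=(N, I)).astype(float)       # training batch
W = 0.01 * rng.standard_normal((I, J))
a, b = np.zeros(I), np.zeros(J)
W, a, b = cd1_update(V, W, a, b, lr=0.1, rng=rng)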