Page 87 - Multicloud Workshop - Prework

MapReduce












              Distributed Computing





       •      MapReduce
       •      A computing task is parallelized by distributing data onto multiple worker nodes
       •      The dataset cannot be stored on a single physical node
       •      Data is stored local to the compute process

              MapReduce is a programming model for processing and generating large data
              sets with a parallel, distributed algorithm on a cluster. In the first step,
              each worker node applies the map() function to its local data, producing
              output data. The output data is then reshuffled so that all data belonging to
              one key is located on the same worker node. The worker nodes can then process
              each group of output data, one group per key, in parallel.
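
              The three steps described here (map, reshuffle by key, process per key)
              can be illustrated with a minimal single-process sketch in Python. It is
              not how a real cluster framework is implemented: the word-count task, the
              sample chunks, and the function names map_words, shuffle, and reduce_counts
              are assumptions chosen for illustration, and each list entry merely stands
              in for the data local to one worker node.

```python
from collections import defaultdict


def map_words(chunk):
    """Map step: emit a (word, 1) pair for every word in one node's local data."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_outputs):
    """Shuffle step: regroup the map output so all values for one key sit together."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return dict(groups)


def reduce_counts(key, values):
    """Reduce step: process one key's group; groups are independent of each other."""
    return key, sum(values)


# Hypothetical input: each string stands in for the data stored on one worker node.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]

mapped = [map_words(chunk) for chunk in chunks]   # would run in parallel, one per node
grouped = shuffle(mapped)                         # data for one key ends up in one place
word_counts = dict(reduce_counts(k, v) for k, v in grouped.items())

print(word_counts)
```

              Because each key's group is reduced independently of the others, the
              final step could be spread across worker nodes in the same way as the
              map step.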










       © 2016 Engage ESM All Rights Reserved