Page 397 - Using MIS
Q6 How Do Organizations Use BigData Applications?
MapReduce
Because BigData is huge, fast, and varied, it cannot be processed using traditional techniques.
MapReduce is a technique for harnessing the power of thousands of computers working in parallel.
The basic idea is that the BigData collection is broken into pieces, and hundreds or thousands
of independent processors search these pieces for something of interest. That process is
referred to as the Map phase. In Figure 9-23, for example, a data set having the logs of Google
searches is broken into pieces, and each independent processor is instructed to search for and
count search keywords. Figure 9-23, of course, shows just a small portion of the data; here you
can see a portion of the keywords that begin with H.
As the processors finish, their results are combined in what is referred to as the Reduce
phase. The result is a list of all the terms searched for on a given day and the count of each. The
process is considerably more complex than described here, but this is the gist of the idea.
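To make the two phases concrete, here is a minimal sketch in Python. It is an illustration only, not how Google or Hadoop actually implements MapReduce. The names LOG_SEGMENTS, map_segment, reduce_counts, and map_reduce are hypothetical, the segment contents are made up from a few keywords that appear in Figure 9-23, and the "processors" run one after another on a single computer rather than in parallel.

```python
from collections import Counter
from functools import reduce

# Hypothetical log segments; the keywords are borrowed from Figure 9-23,
# but the way they are grouped into segments is an assumption.
LOG_SEGMENTS = [
    ["Halon", "Wolverine", "Abacus", "Poodle", "Fence"],
    ["Acura", "Healthcare", "Cassandra", "Belltown", "Hadoop", "Geranium"],
    ["Stonework", "Healthcare", "Honda", "Hadoop", "Congress", "Healthcare"],
    ["Frigate", "Metric", "Clamp", "Dell", "Salmon", "Hadoop"],
]


def map_segment(segment):
    """Map phase: one processor counts the keywords in its own segment."""
    return Counter(segment)


def reduce_counts(left, right):
    """Reduce phase: combine two partial counts into a single count."""
    return left + right


def map_reduce(segments):
    # In a real cluster, each map_segment call would run on a separate
    # machine; here the calls run sequentially to keep the sketch simple.
    partial_counts = [map_segment(segment) for segment in segments]
    return reduce(reduce_counts, partial_counts, Counter())


totals = map_reduce(LOG_SEGMENTS)

# Show only the keywords that begin with H, as Figure 9-23 does.
for keyword, count in sorted(totals.items()):
    if keyword.startswith("H"):
        print(keyword, count)
```

Notice that no map_segment call depends on any other. That independence is what allows the Map phase to be spread across thousands of computers, with only the much smaller partial counts brought together in the Reduce phase.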
By the way, you can visit Google Trends to see an application of MapReduce. There you can
obtain a trend line of the number of searches for a particular term or terms. Figure 9-24 shows the
search trend for the term Web 2.0. The vertical axis is scaled; a value of 1.0 represents the average
number of searches over that time period. This particular trend line, by the way, supports the
contention that the term Web 2.0 is fading from use. Go to www.google.com/trends and enter the
terms Big Data, BigData, and Hadoop to see why learning about them is a better use of your time!
Hadoop
Hadoop is an open source program supported by the Apache Foundation[15] that implements
MapReduce on potentially thousands of computers. Hadoop could drive the process of finding
and counting the Google search terms, but Google uses its own proprietary version of
MapReduce to do so instead.
Hadoop began as part of Nutch, an open source Web search project, but the Apache Foundation split it off to become its own
product. Hadoop is written in Java and originally ran on Linux. Recently, Microsoft announced a
Figure 9-23 MapReduce Processing
Search log, broken into log segments (for example: Halon; Wolverine; … Abacus; Poodle; Fence; …)
Map phase (keyword counts per processor; only a portion of the keywords beginning with H is shown):
  Processor 1: Hadoop 14; Healthcare 85; Hiccup 17; Hurricane 8; …
  Processor 2: Hadoop 3; Healthcare 2; Honda 1; …
  …
  Processor 9,555 (+ or –): Halon 11; Hotel 175; Honda 87; Hurricane 53; …
Reduce phase (summary of keyword total counts):
  Hadoop 10,418; Halon 4,788; Healthcare 12,487,318; Hiccup 7,435; Honda 127,489;
  Hotel 237,654; Hurricane 2,799; …
15 A nonprofit corporation that supports open source software projects, originally those for the Apache Web
server, but today for a large number of additional major software projects.