Page 196 - Big Data Analytics for Connected Vehicles and Smart Cities



side note, experience in working with departments of transportation and cities
in the United States indicates that disproportionate value can be achieved
through the creation of a data catalogue. While data analysts and data scientists
may see this as a mere stepping stone toward more advanced analytics, from
a practical point of view simply knowing what data the organization has
collected and where it is located is extremely valuable to smart city and
transportation professionals.

9.4 Definition of a Data Lake

Up to this point in the book, the focus has been on describing big data and
analytics and the questions to be addressed. This chapter focuses on the
techniques required to build and manage a big data repository. One explanation
of the term data lake is as follows:

The idea of data lake is to have a single store of all data in the enterprise
ranging from raw data (which implies [an] exact copy of source system
data) to transformed data, which is used for various tasks, including
reporting, visualization, analytics, and machine learning. The [data lake]
includes structured data from relational databases (rows and columns),
semistructured data (CSV, logs, XML, JSON), unstructured data (e-mails,
documents, PDFs) and even binary data (images, audio, video) thus
creating a centralized data store accommodating all forms of data [2].

The data repository is often referred to as a data lake, and this analogy
will be used in this chapter. The analogy explains the centralization of data
into a single repository: a collection of data from multiple sources that is
accessible on an enterprise- or organization-wide basis and that takes
advantage of the dramatically reduced cost of storing and manipulating data
made possible by technologies such as Hadoop. It is a hardware and software
environment that supports both data sharing and the creation of a data
catalogue. The data catalogue is an important dimension of the data lake
because it informs the entire organization about the data that is available.
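
To make the idea of a data catalogue concrete, the following is a minimal sketch in Python, not a description of any particular product. It assumes a catalogue only needs to record what each data set is called, which department supplied it, where it sits in the lake, and which of the data forms named in the definition above it belongs to. The class names, the file-extension mapping, and the example data sets are all hypothetical.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to the data forms named in the
# quoted definition; the extensions chosen are illustrative assumptions.
FORM_BY_EXTENSION = {
    ".csv": "semistructured", ".log": "semistructured",
    ".xml": "semistructured", ".json": "semistructured",
    ".eml": "unstructured", ".docx": "unstructured", ".pdf": "unstructured",
    ".png": "binary", ".wav": "binary", ".mp4": "binary",
}


@dataclass(frozen=True)
class CatalogueEntry:
    name: str      # what the data set is called
    source: str    # which department or system supplied it
    location: str  # where it is stored in the lake
    form: str      # structured, semistructured, unstructured, binary


def classify(location: str) -> str:
    """Guess the data form from the file extension, if it is recognized."""
    suffix = PurePosixPath(location).suffix.lower()
    return FORM_BY_EXTENSION.get(suffix, "unknown")


class DataCatalogue:
    """In-memory record of what data exists and where it is located."""

    def __init__(self):
        self._entries = {}

    def register(self, name, source, location, form=None):
        # An explicit form (e.g. "structured" for a database table)
        # overrides the extension-based guess.
        entry = CatalogueEntry(name, source, location, form or classify(location))
        self._entries[name] = entry
        return entry

    def find(self, keyword):
        """Return entries whose name or source mentions the keyword."""
        keyword = keyword.lower()
        return [e for e in self._entries.values()
                if keyword in e.name.lower() or keyword in e.source.lower()]


# Hypothetical data sets a city might register.
catalogue = DataCatalogue()
catalogue.register("signal_timing", "traffic operations", "lake/raw/signal_timing.csv")
catalogue.register("crash_reports", "police department", "lake/raw/crash_reports.pdf")
catalogue.register("transit_ridership", "transit agency", "lake/raw/ridership", form="structured")

for entry in catalogue.find("traffic"):
    print(entry.name, entry.location, entry.form)
```

Even a record this simple answers the two practical questions raised earlier: what data the organization has collected and where it is located. A production catalogue would typically add ownership, update frequency, and access rules, which are outside the scope of this sketch.
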
Data science capabilities continue to evolve. The latest evolution allows
real-time analytics to be conducted on data as it is streamed from the
collection point to the storage area. For the purposes of this book, this
technique is referred to as the data river, as it involves processing a stream
rather than a static body of data. It is feasible that real-time processing on data streams
on the way to the data lake and analytics conducted on static data already in a
data lake can be supported within one framework for managing data and infor-
