Page 194 - Big Data Analytics for Connected Vehicles and Smart Cities
Many alternative technologies can be deployed to establish and maintain a
data lake. However, the most important aspect of developing a data lake is a
robust planning approach. Such an approach should take full advantage of previous
experience and lessons learned, and should accommodate a flexible choice of
technologies and solutions.
The emergence of large-scale data storage and manipulation technologies
such as Hadoop [1] enables a new philosophy of data aggregation and consoli-
dation into a single repository, rather than the earlier approach where data had
to be divided and partitioned to make it manageable. A data lake is a virtual
concept in the sense that data can remain in its existing source while a copy is
made available for use in the data lake. Whether this is done depends on the
exact choice of solution or technology to be deployed.
The contents of this chapter will be akin to a waterskiing adventure across
the data lake, rather than a deep dive into specific technologies and products
within data science and analytics.
As stated in Chapter 2, the data lake analogy is useful in that it suggests a
clean or filtered body of water that contains useful and accessible data. It is not
a data swamp, which would contain both useful and not so useful items in a
mixture that would make the data less accessible. The data lake concept places
an emphasis on bringing data together, making it accessible and visible across
an organization or enterprise.
The creation of a data lake involves the removal of silos and partitions
that are present because of the way the data has been collected, managed, and
utilized in the past. Work assignments with several transportation agencies have
revealed a natural tendency for data to be collected in what could be referred
to as cockpits, with a cockpit being an array of data that is assembled by an
individual or team with the objective of supporting a specific job function. For
example, a traffic engineer might collect intersection turning movement and
highway flow data to support the calculation of traffic signal timings. Making
use of spreadsheets and other tools, the engineer can create a toolbox of data
that is specifically designed to support the tasks involved in the job. Unfortu-
nately, while this provides specialist support for the job in hand, it prevents an
enterprise-wide view of data.
With the capabilities of data science today, it is possible to leave the cock-
pit intact while also copying the data to the data lake. Just like a real lake, it is
then possible to use tools to waterski and to deep dive, exploring the data and
revealing insights. It is worth noting that the concept of a data lake also implies
that early judgment should not be applied regarding the usefulness of data. It is
possible that a seemingly useless piece of data can combine with another piece
of data in the data lake to create a valuable insight.
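The idea of leaving a cockpit intact while copying its data to the lake, and then combining two otherwise separate pieces of data, can be illustrated with a minimal Python sketch. It uses only the standard library and is not tied to any particular data lake product. The function names (copy_to_lake, read_rows, join_on), the folder layout, the file names, and the sample values are all hypothetical and invented purely for illustration.

```python
import csv
import shutil
import tempfile
from pathlib import Path


def copy_to_lake(source_file, lake_root, owner):
    """Copy a cockpit file into the lake; the source file is left untouched."""
    target_dir = Path(lake_root) / owner
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(source_file).name
    shutil.copy2(source_file, target)  # copy, never move
    return target


def read_rows(path):
    """Read a CSV file into a list of dictionaries keyed by column name."""
    with Path(path).open(newline="") as handle:
        return list(csv.DictReader(handle))


def join_on(left, right, key):
    """Inner join of two lists of dictionaries on a shared key."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def demo():
    """Build two tiny, made-up cockpit files, copy them, and combine them."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cockpit = root / "cockpit"
        cockpit.mkdir()

        # Hypothetical sample data, invented for this sketch.
        turning = cockpit / "turning_counts.csv"
        turning.write_text("intersection_id,left_turns\nA1,120\nB2,45\n")
        timings = cockpit / "signal_timings.csv"
        timings.write_text("intersection_id,cycle_seconds\nA1,90\n")

        lake = root / "lake"
        lake_turning = copy_to_lake(turning, lake, "traffic_engineering")
        lake_timings = copy_to_lake(timings, lake, "signal_operations")

        # The cockpit files still exist after the copy.
        cockpit_intact = turning.exists() and timings.exists()

        joined = join_on(
            read_rows(lake_turning), read_rows(lake_timings), "intersection_id"
        )
        return cockpit_intact, joined
```

Calling demo() returns a flag confirming that the original cockpit files are still in place, together with the rows that appear in both datasets. In this sketch, neither file is filtered or judged for usefulness before it enters the lake; the value appears only when the two are combined.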
A fragmented data collection and management approach is analogous to a
skilled worker such as a carpenter, who has assembled a collection of tools over