Page 194 - Big Data Analytics for Connected Vehicles and Smart Cities

Building a Data Lake


                 Many technologies can be deployed to establish and maintain a data lake.
            However, the most important aspect of developing a data lake is a robust
            planning approach. Such an approach should take full advantage of previous
            experience and lessons learned to accommodate a flexible choice of
            technologies and solutions.
                 The emergence of large-scale data storage and manipulation technologies
            such as Hadoop [1] enables a new philosophy of aggregating and consolidating
            data into a single repository, rather than the earlier approach in which data
            had to be divided and partitioned to make it manageable. A data lake is a
            virtual concept, as it is feasible to allow data to remain in the existing
            source while making a copy available for use in the data lake. Again, this
            depends on the exact choice of solution or technology to be deployed.
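                 The idea of leaving data in its existing source while making a copy
            available in the data lake can be sketched in a few lines. This is only an
            illustration under stated assumptions, not a recommended product or
            architecture: the function name ingest_copy, the manifest.jsonl file, the
            directory layout, and the sample file are all hypothetical, and a plain
            directory stands in for whatever storage technology is actually chosen.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def ingest_copy(source: Path, lake_dir: Path) -> dict:
    """Copy one source file into the lake and record where it came from.

    Hypothetical helper: a local directory stands in for the data lake.
    """
    lake_dir.mkdir(parents=True, exist_ok=True)
    target = lake_dir / source.name
    shutil.copy2(source, target)  # the source is only read, never modified

    entry = {
        "source": str(source),
        "lake_copy": str(target),
        "sha256": hashlib.sha256(target.read_bytes()).hexdigest(),
    }
    # Append a provenance record so the copy can be traced back to its source.
    with (lake_dir / "manifest.jsonl").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


def demo() -> dict:
    """Run the sketch against a throwaway directory with made-up data."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        source = root / "signal_timing" / "turning_counts.csv"
        source.parent.mkdir()
        source.write_text(
            "intersection,left,through,right\nA1,12,140,9\n", encoding="utf-8"
        )
        entry = ingest_copy(source, root / "lake")
        # Check the original is still in place and the copy is identical.
        entry["source_still_exists"] = source.exists()
        entry["contents_match"] = (
            source.read_bytes() == Path(entry["lake_copy"]).read_bytes()
        )
        return entry


if __name__ == "__main__":
    print(demo())
```

                 Recording the source path and a checksum alongside each copy is one
            simple way to keep the lake traceable back to the systems that feed it.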
                 The contents of this chapter will be akin to a waterskiing adventure across
            the data lake, rather than a deep dive into specific technologies and products
            within data science and analytics.
                 As stated in Chapter 2, the data lake analogy is useful in that it suggests a
            clean or filtered body of water that contains useful and accessible data. It is not
            a data swamp, which would contain both useful and not so useful items in a
            mixture that would make the data less accessible. The data lake concept places
            an emphasis on bringing data together, making it accessible and visible across
            an organization or enterprise.
                 The creation of a data lake involves the removal of silos and partitions
            that are present because of the way the data has been collected, managed, and
            utilized in the past. Work assignments with several transportation agencies have
            revealed a natural tendency for data to be collected in what could be referred
            to as cockpits, with a cockpit being an array of data that is assembled by an
            individual or team with the objective of supporting a specific job function. For
            example, a traffic engineer might collect intersection turning movement and
            highway flow data to support the calculation of traffic signal timings. Making
            use of spreadsheets and other tools, the engineer can create a toolbox of data
            that is specifically designed to support the tasks involved in the job.
            Unfortunately, while this provides specialist support for the job at hand, it
            prevents an enterprise-wide view of data.
                 With the capabilities of data science today, it is possible to leave the
            cockpit intact while also copying the data to the data lake. Just like a real lake, it is
            then possible to use tools to waterski and to deep dive, exploring the data and
            revealing insights. It is worth noting that the concept of a data lake also implies
            that early judgment should not be applied regarding the usefulness of data. It is
            possible that a seemingly useless piece of data can combine with another piece
            of data in the data lake to create a valuable insight.
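                 The point about combining seemingly unrelated pieces of data can be
            illustrated with a small sketch. Everything in it is invented for
            illustration: the two datasets, their column names, and the values are
            hypothetical, and a simple key match on intersection and hour stands in for
            whatever analytics tooling would actually be used.

```python
import csv
import io

# Hypothetical data from a traffic engineer's "cockpit".
TURNING_COUNTS = """intersection,hour,vehicles
A1,8,950
A1,14,420
B2,8,610
"""

# Hypothetical data from a different team, of no obvious use for signal timing.
MAINTENANCE_LOG = """intersection,hour,activity
A1,8,lane closure
B2,14,sign repair
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def combine(counts, log):
    """Attach any maintenance activity logged for the same intersection and hour."""
    activity_by_key = {(r["intersection"], r["hour"]): r["activity"] for r in log}
    combined = []
    for row in counts:
        key = (row["intersection"], row["hour"])
        combined.append(
            {
                "intersection": row["intersection"],
                "hour": int(row["hour"]),
                "vehicles": int(row["vehicles"]),
                "activity": activity_by_key.get(key),  # None when nothing matches
            }
        )
    return combined


combined = combine(read_rows(TURNING_COUNTS), read_rows(MAINTENANCE_LOG))
# Counts taken while maintenance was under way may not reflect normal flow.
overlaps = [row for row in combined if row["activity"] is not None]

if __name__ == "__main__":
    for row in overlaps:
        print(row)
```

                 Neither dataset flags a problem on its own. Only once both are
            available in one place does it become visible that a count coincided with a
            lane closure, which is the kind of insight a siloed collection would hide.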
                 A fragmented data collection and management approach is analogous to a
            skilled worker such as a carpenter, who has assembled a collection of tools over