Page 196 - Big Data Analytics for Connected Vehicles and Smart Cities

Building a Data Lake


            side note, experience in working with departments of transportation and cities
            in the United States indicates that disproportionate value can be achieved
            through the creation of a data catalogue. While data analysts and data scientists
            may see this as a mere stepping stone toward the good stuff, from a practical
            point of view simply knowing what data the organization has collected and
            where it is located is extremely valuable to smart city and transportation
            professionals.



            9.4  Definition of a Data Lake

            Up to this point in the book, the focus has been on describing big data and
            analytics and the questions to be addressed. This chapter focuses on the
            techniques required to build and manage a big data repository. One explanation
            of the term data lake is as follows:

                 The idea of data lake is to have a single store of all data in the enterprise
                 ranging from raw data (which implies [an] exact copy of source system
                 data) to transformed data, which is used for various tasks, including
                 reporting, visualization, analytics, and machine learning. The [data lake]
                 includes structured data from relational databases (rows and columns),
                 semistructured data (CSV, logs, XML, JSON), unstructured data (e-mails,
                 documents, PDFs) and even binary data (images, audio, video) thus
                 creating a centralized data store accommodating all forms of data [2].
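
            The data forms named in the quotation can be illustrated with a minimal
            sketch that labels incoming files by form. The extension table and the
            function name classify_form are assumptions made for illustration only;
            a real data lake would more likely rely on source metadata or content
            inspection than on file names. Structured data from relational databases
            arrives as tables rather than files, so it is not covered here.

```python
from pathlib import Path

# Hypothetical lookup table: file extension -> data form, following the
# examples listed in the quotation. It is illustrative, not exhaustive.
FORM_BY_EXTENSION = {
    ".csv": "semistructured",
    ".log": "semistructured",
    ".xml": "semistructured",
    ".json": "semistructured",
    ".eml": "unstructured",
    ".docx": "unstructured",
    ".pdf": "unstructured",
    ".jpg": "binary",
    ".png": "binary",
    ".wav": "binary",
    ".mp4": "binary",
}


def classify_form(filename: str) -> str:
    """Return the data form for a file name, or 'unknown' if unrecognized."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match
    return FORM_BY_EXTENSION.get(suffix, "unknown")
```

            For example, classify_form("intersection_camera.mp4") yields "binary",
            while a file with an unlisted extension is reported as "unknown" rather
            than being guessed at.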


                 The data repository is often referred to as a data lake, and this analogy
            will be used in this chapter. The analogy describes the centralization of data
            into a single repository. A data lake is a collection of data from multiple
            sources that is accessible on an enterprise- or organization-wide basis and
            that takes advantage of the dramatically reduced cost of storing and
            manipulating data made possible by technologies such as Hadoop. It is a hardware
            and software environment that supports data sharing and the creation of a
            data catalogue. The data catalogue is an important dimension of a data lake
            because it tells the entire organization what data is available.
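
                 The role of the data catalogue can be sketched in a few lines of Python.
            This is a minimal illustration under stated assumptions, not a description
            of any particular product: the class names CatalogueEntry and DataCatalogue,
            their fields, and the methods register, search, and locate are all
            hypothetical. The sketch captures the two things the text identifies as
            valuable: what data the organization has collected and where it is located.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogueEntry:
    """One dataset known to the organization (all fields are illustrative)."""
    name: str
    source: str        # department or system that collected the data
    location: str      # where the data sits in the repository
    description: str = ""


class DataCatalogue:
    """In-memory catalogue answering 'what data do we have, and where?'"""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogueEntry] = {}

    def register(self, entry: CatalogueEntry) -> None:
        # Refuse duplicates so each dataset name maps to exactly one location.
        if entry.name in self._entries:
            raise ValueError(f"dataset already catalogued: {entry.name}")
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list[CatalogueEntry]:
        # Case-insensitive match against dataset names and descriptions.
        kw = keyword.lower()
        return [
            e for e in self._entries.values()
            if kw in e.name.lower() or kw in e.description.lower()
        ]

    def locate(self, name: str) -> str:
        # Raises KeyError if the dataset has never been catalogued.
        return self._entries[name].location
```

                 A transportation agency might, for instance, register a hypothetical
            "loop_detector_counts" dataset with its owning department and storage path,
            after which any analyst in the organization could call search("detector")
            to discover that the data exists and locate(...) to find it.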
                 Data science capabilities continue to evolve. The latest evolution allows
            real-time analytics to be performed on data as it is being streamed from the
            collection point to the storage area. For the purposes of this book, this
            technique is referred to as the data river, as it involves processing a stream
            rather than a static body. It is feasible that real-time processing of data streams
            on the way to the data lake and analytics conducted on static data already in a
            data lake can be supported within one framework for managing data and infor-