Page 199 - Big Data Analytics for Connected Vehicles and Smart Cities

          Building a Data Lake


               • Other data programs: Programs that provide access to federal data sets
                and tools across government agencies.
               • Real-time data capture and management: The data resources testbed
                (DRT), the concepts and analysis testbed (CAT), and the cooperative
                vehicle highway testbed (CVHT).


               Data coming into the data lake could also be unstructured data such as
          PDF files, e-mail, and other documents.


          Data Ingestion
          Data can take the form of static archived data or real-time streams from field
          devices and other sources. The Internet of Things will generate large volumes
          of data from sensors and other connected devices. The data is ingested into the
          data lake to create a single repository that can be accessed for data exchange and
          for analytics purposes. This activity also includes establishing suitable
          data-sharing agreements so that source data can be accessed, shared, and
          made available to the data lake.
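
          As a minimal sketch of the ingestion step, the Python below loads a static
          archive and a simulated real-time stream into one repository, tagging each
          record with its source. The ingest function, the sensor_stream generator,
          the field names, and the sample values are all hypothetical and are not
          taken from any system described here; a production data lake would use a
          distributed store rather than an in-memory list.

```python
import csv
import io
from datetime import datetime, timezone


def ingest(lake, records, source):
    """Append raw records to the lake, tagging each with its source and ingestion time."""
    for record in records:
        lake.append({
            "source": source,
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "payload": dict(record),  # keep the raw record unchanged
        })
    return lake


def sensor_stream():
    """Stand-in for a real-time feed from a field device (hypothetical values)."""
    yield {"sensor_id": "S1", "speed_mph": "52"}
    yield {"sensor_id": "S2", "speed_mph": "47"}


# Static archived data, represented here as an in-memory CSV file.
archive = io.StringIO("sensor_id,speed_mph\nS1,55\nS2,49\n")

lake = []
ingest(lake, csv.DictReader(archive), source="archive")
ingest(lake, sensor_stream(), source="stream")
print(len(lake), "records from", sorted({r["source"] for r in lake}))
```

          Storing the raw payload alongside its source and ingestion time keeps both
          kinds of input in a single repository while leaving format conversion to
          the data preparation step.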

          Data Preparation
          The data preparation element consists of wrangling, cleansing, and defining
          governance arrangements for the data. Data wrangling can be a manual or
          semiautomated process that uses decision support tools to bring data to
          common formats and locational referencing systems. Data is also verified at
          this stage by comparing the same data from different sources and identifying
          potential gaps or weaknesses in the data. Duplicates and errors are removed
          in a cleansing process as part of data preparation. At this stage, data
          governance arrangements are identified to manage data sourcing, data access,
          and data distribution. These also include arrangements for sharing analytics
          that are derived from the data lake during the data discovery process. In
          developing a roadway transportation data business plan [3], the U.S. DOT
          also noted the need to address data quality. U.S. DOT recommendations
          include developing a policy to define responsibilities for data quality and
          adopting data quality standards for data collection, processing,
          application, and reporting.
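
          The wrangling, cleansing, and verification steps above can be sketched in
          Python as follows. This is an illustration under stated assumptions, not a
          prescribed method: the function names, the record fields (segment, hour,
          speed, source), the 10 km/h tolerance, and the sample records are all
          hypothetical.

```python
def to_common_format(record):
    """Wrangle a raw record into common field names and units (km/h)."""
    if "speed_mph" in record:
        speed_kmh = float(record["speed_mph"]) * 1.609344
    else:
        speed_kmh = float(record["speed_kmh"])
    return {
        "segment": record["segment"].strip().upper(),
        "hour": int(record["hour"]),
        "speed_kmh": round(speed_kmh, 1),
        "source": record["source"],
    }


def cleanse(records):
    """Drop records with negative speeds and exact duplicates, keeping first occurrences."""
    seen, clean = set(), []
    for r in records:
        key = (r["segment"], r["hour"], r["source"], r["speed_kmh"])
        if r["speed_kmh"] < 0 or key in seen:
            continue
        seen.add(key)
        clean.append(r)
    return clean


def cross_check(records, tolerance_kmh=10.0):
    """Return segment/hour pairs whose reported speeds differ by more than the tolerance."""
    by_key = {}
    for r in records:
        by_key.setdefault((r["segment"], r["hour"]), []).append(r["speed_kmh"])
    return sorted(k for k, v in by_key.items() if max(v) - min(v) > tolerance_kmh)


raw = [
    {"segment": "a1 ", "hour": "8", "speed_mph": "50", "source": "loop"},
    {"segment": "A1", "hour": "8", "speed_mph": "50", "source": "loop"},   # duplicate
    {"segment": "A1", "hour": "8", "speed_kmh": "62", "source": "probe"},  # disagrees
    {"segment": "B2", "hour": "8", "speed_kmh": "-5", "source": "probe"},  # invalid
]
prepared = cleanse([to_common_format(r) for r in raw])
flagged = cross_check(prepared)
print(len(prepared), "clean records; flagged for review:", flagged)
```

          Flagged pairs are reported rather than corrected automatically, which
          leaves the decision on which source to trust to the governance
          arrangements.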

          Data Discovery
          In the data discovery element, data is searched, accessed, and analyzed. A
          range of search and access tools, such as structured query language (SQL),
          can be used together with statistical functions to detect trends and
          patterns in the data and reveal new insights and understanding. Analytics
          functions that could be used during data discovery include statistical
          analysis, cluster analysis, data transformation, path, pattern, and time
          series analysis, decision trees, text analysis, and graphic analysis.
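
          As a small illustration of SQL-based discovery, the sketch below uses
          Python's built-in sqlite3 module to compute average speeds by road segment
          and hour, a simple way to surface a recurring slowdown. The speeds table,
          its columns, and the sample values are hypothetical; the same query
          pattern would apply to whatever SQL engine sits on top of the data lake.

```python
import sqlite3

# In-memory database standing in for a SQL interface to the data lake.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE speeds (segment TEXT, hour INTEGER, speed_kmh REAL)")
conn.executemany(
    "INSERT INTO speeds VALUES (?, ?, ?)",
    [("A1", 7, 88.0), ("A1", 8, 54.0), ("A1", 8, 50.0),
     ("B2", 7, 91.0), ("B2", 8, 86.0), ("B2", 8, 88.0)],
)

# Average speed per segment and hour, slowest first; low averages
# point to segment/hour combinations worth investigating further.
rows = conn.execute(
    """
    SELECT segment, hour, AVG(speed_kmh) AS avg_speed, COUNT(*) AS n
    FROM speeds
    GROUP BY segment, hour
    ORDER BY avg_speed
    """
).fetchall()

for segment, hour, avg_speed, n in rows:
    print(f"{segment} hour={hour} avg={avg_speed:.1f} km/h (n={n})")
```

          The first row returned is the slowest segment and hour in the sample, which
          is the kind of pattern that the statistical and time series functions
          listed above would then examine in more depth.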