Page 199 - Big Data Analytics for Connected Vehicles and Smart Cities
• Other data programs: Programs that provide access to federal data sets
and tools across government agencies.
• Real-time data capture and management: The data resources testbed
(DRT), the concepts and analysis testbed (CAT), and the cooperative
vehicle highway testbed (CVHT).
Data coming into the data lake could also be unstructured data such as
PDF files, e-mail, and other documents.
Data Ingestion
Data can take the form of static archived data or real-time streams from field
devices and other sources. The Internet of Things will generate large volumes
of data from sensors and other connected devices. The data is ingested into the
data lake to create a single repository that can be accessed for data exchange and
for analytics purposes. This activity would also include establishing suitable
data-sharing agreements so that the data can be accessed, shared, and made
available to the data lake.
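The ingestion step described above can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular data lake product: the `DataLake` class, the `sensor_stream` generator, the source names, and all sample values are hypothetical. It shows static archived records and a simulated real-time feed landing unchanged in one repository, each tagged with its source and arrival time.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Iterable, Iterator, List


class DataLake:
    """Hypothetical in-memory stand-in for a data lake's raw storage."""

    def __init__(self) -> None:
        self.records: List[Dict[str, Any]] = []

    def ingest(self, source: str, mode: str,
               payloads: Iterable[Dict[str, Any]]) -> int:
        """Store each payload unchanged, tagged with source and arrival time."""
        count = 0
        for payload in payloads:
            self.records.append({
                "source": source,
                "mode": mode,  # "archive" or "stream" (assumed labels)
                "ingested_at": datetime.now(timezone.utc).isoformat(),
                "payload": payload,
            })
            count += 1
        return count


def sensor_stream() -> Iterator[Dict[str, Any]]:
    """Simulated real-time feed from a field device (illustrative values)."""
    yield {"sensor_id": "S1", "speed_mph": 42.0}
    yield {"sensor_id": "S2", "speed_mph": 37.5}


lake = DataLake()
archived = [{"segment": "A-12", "volume": 1180}]  # static archived data
n_archive = lake.ingest("traffic_archive", "archive", archived)
n_stream = lake.ingest("roadside_sensors", "stream", sensor_stream())
print(n_archive, n_stream, len(lake.records))
```

Keeping the payload untouched and recording provenance alongside it is one common design choice; it leaves format conversion and quality checks to the preparation step.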
Data Preparation
The data preparation element consists of wrangling, cleansing, and defining
governance arrangements for the data. Data wrangling can be a manual or
semi-automated process making use of decision support tools to bring data to
common formats and locational referencing systems. Data would also be verified at
this stage by comparing the same data from different sources and identifying
potential gaps or weaknesses in the data. Duplication and errors are removed
in a cleansing process as part of data preparation. At this stage, data governance
arrangements are identified to manage data sourcing, data access, and data
distribution. This will also include arrangements for sharing analytics that are
derived from the data lake during the data discovery process. The U.S. DOT, in
developing a roadway transportation data business plan [3], also noted the need
to address data quality. U.S. DOT recommendations include the development
of a policy to define responsibilities for data quality and adopting data quality
standards for data collection, processing, application, and reporting.
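The wrangling, cleansing, and verification activities described in this section might look like the following Python sketch. Everything in it is an assumption made for illustration: the record fields (`segment`, `hour`, `speed_mph`, `speed_kmh`, `source`), the plausibility range of 0 to 250 km/h, the 10 km/h disagreement tolerance, and the sample values. Wrangling brings records to a common unit and segment identifier, cleansing drops duplicates and implausible values, and verification compares the same measurement from different sources.

```python
from typing import Dict, List, Tuple

MPH_TO_KMH = 1.609344


def wrangle(record: Dict) -> Dict:
    """Bring a raw record to a common format: tidy segment ID, speed in km/h."""
    if "speed_mph" in record:
        speed = record["speed_mph"] * MPH_TO_KMH
    else:
        speed = record["speed_kmh"]
    return {
        "segment": record["segment"].strip().upper(),
        "hour": record["hour"],
        "speed_kmh": round(speed, 1),
        "source": record["source"],
    }


def cleanse(records: List[Dict]) -> List[Dict]:
    """Drop exact duplicates and records with implausible speeds."""
    seen = set()
    clean = []
    for r in records:
        key = (r["segment"], r["hour"], r["source"], r["speed_kmh"])
        if key in seen or not (0 <= r["speed_kmh"] <= 250):
            continue
        seen.add(key)
        clean.append(r)
    return clean


def verify(records: List[Dict],
           tolerance_kmh: float = 10.0) -> List[Tuple[str, int]]:
    """Flag (segment, hour) pairs where sources disagree beyond the tolerance."""
    by_key: Dict[Tuple[str, int], List[float]] = {}
    for r in records:
        by_key.setdefault((r["segment"], r["hour"]), []).append(r["speed_kmh"])
    return sorted(k for k, v in by_key.items()
                  if max(v) - min(v) > tolerance_kmh)


raw = [
    {"segment": "a-12 ", "hour": 8, "speed_mph": 30.0, "source": "loop"},
    {"segment": "A-12", "hour": 8, "speed_mph": 30.0, "source": "loop"},   # duplicate once wrangled
    {"segment": "A-12", "hour": 8, "speed_kmh": 70.0, "source": "probe"},
    {"segment": "B-7", "hour": 8, "speed_kmh": -5.0, "source": "probe"},   # erroneous reading
]
prepared = cleanse([wrangle(r) for r in raw])
conflicts = verify(prepared)
print(len(prepared), conflicts)
```

In this sample the two sources for segment A-12 disagree by more than the tolerance, so the pair is flagged for review rather than silently corrected, which keeps the quality decision with whoever holds responsibility for the data.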
Data Discovery
In the data discovery element, data is searched, accessed, and analyzed. A range
of search and access tools, such as structured query language (SQL) and
statistical functions, can be used to detect trends and patterns in the data to
reveal new insights and understanding. Analytics functions that could be used
during data discovery include statistical, cluster analysis, data transformation,
path, pattern and time series, decision tree, text, and graphic.
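As a small illustration of using SQL to detect a pattern during data discovery, the sketch below loads a handful of prepared speed records into an in-memory SQLite database and aggregates them by hour. The table name `speeds`, its columns, and the sample values are hypothetical, and SQLite stands in for whatever query engine the data lake actually exposes.

```python
import sqlite3

# In-memory database as a stand-in for the data lake's query layer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE speeds (segment TEXT, hour INTEGER, speed_kmh REAL)"
)
rows = [
    ("A-12", 7, 62.0), ("A-12", 7, 58.0),
    ("A-12", 8, 35.0), ("A-12", 8, 41.0),
    ("A-12", 9, 55.0), ("A-12", 9, 57.0),
]
conn.executemany("INSERT INTO speeds VALUES (?, ?, ?)", rows)

# Average speed and sample count per hour for one road segment.
hourly = conn.execute(
    "SELECT hour, AVG(speed_kmh), COUNT(*) "
    "FROM speeds WHERE segment = ? "
    "GROUP BY hour ORDER BY hour",
    ("A-12",),
).fetchall()

# The hour with the lowest average speed hints at a recurring slowdown.
slowest_hour = min(hourly, key=lambda r: r[1])[0]
print(hourly, slowest_hour)
conn.close()
```

A grouped average like this is only a first pass; the other analytics functions listed above would typically follow once a pattern of interest has been spotted.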