Page 56 - CITP Review
Balance skewed data
Balanced databases typically produce better prediction models than unbalanced ones. When data is
skewed, the values do not follow a normal statistical distribution, and some values are far more heavily
represented than others. If the data in the data sources is skewed, stratified sampling would probably
lead to a more balanced database than random sampling. Another approach is to over-sample the
less-represented data values or under-sample the more-represented ones.
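The three balancing approaches can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the record layout, the `fraud` label, and the helper names are hypothetical, and the stratified sample uses equal allocation (the same number of records from each class), which is only one of several ways to stratify.

```python
import random
from collections import Counter, defaultdict


def group_by_label(records, label_key):
    """Split records into one list per distinct label value."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[label_key]].append(rec)
    return groups


def stratified_sample(records, label_key, per_class, seed=0):
    """Draw up to per_class records from each class (equal allocation)."""
    rng = random.Random(seed)
    sample = []
    for group in group_by_label(records, label_key).values():
        sample.extend(rng.sample(group, min(per_class, len(group))))
    return sample


def oversample(records, label_key, seed=0):
    """Duplicate random records of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    groups = group_by_label(records, label_key)
    target = max(len(g) for g in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


def undersample(records, label_key, seed=0):
    """Randomly drop records of larger classes down to the smallest class size."""
    rng = random.Random(seed)
    groups = group_by_label(records, label_key)
    target = min(len(g) for g in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced


# Hypothetical skewed data set: 5 fraud records out of 100.
records = [{"id": i, "fraud": i % 20 == 0} for i in range(100)]

print(Counter(r["fraud"] for r in records))
print(Counter(r["fraud"] for r in stratified_sample(records, "fraud", per_class=5)))
print(Counter(r["fraud"] for r in oversample(records, "fraud")))
print(Counter(r["fraud"] for r in undersample(records, "fraud")))
```

Over-sampling keeps every original record but repeats minority records, while under-sampling discards majority records; which trade-off is acceptable depends on how much data is available.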
Reviewing the scope of the effort needed for data preparation explains why data preparation probably
accounts for about 80% of the work in the average data analysis and reporting system.
Data processing
Once data has been extracted, transformed, and loaded into the data analysis and reporting database
(DARB), tools and techniques can make effective use of the information it can produce.
There are three key functions of the data analysis and reporting database: extraction, data mining, and
querying.
Extraction
One basic function is extraction. Users can extract data from the DARB for analysis or reporting. The
result is similar to a data mart, except that it is user-defined, ad hoc, and on demand: when a business
professional has an emerging need for a particular set of data in the DARB, that person can extract the
relevant data and perform the necessary processing.
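As a rough illustration of ad hoc, on-demand extraction, the sketch below uses Python's built-in `sqlite3` module with an in-memory database standing in for the DARB. The `sales` table, its columns, and the `extract_region` helper are all hypothetical; a real DARB would have its own schema and access tools.

```python
import sqlite3

# In-memory database standing in for the DARB (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("West", "Widget", 120.0),
        ("East", "Widget", 80.0),
        ("West", "Gadget", 75.0),
    ],
)


def extract_region(connection, region):
    """Pull only the rows a user needs, using a parameterized query."""
    cursor = connection.execute(
        "SELECT product, amount FROM sales WHERE region = ? ORDER BY product",
        (region,),
    )
    return cursor.fetchall()


# A user-defined, on-demand extract for one region.
west = extract_region(conn, "West")
print(west)
```

The extract is a small, user-defined subset that can then be analyzed or reported on without touching the rest of the database.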
Data mining
Data mining is the process of examining large data sets for the strategic purpose of learning something
previously unknown from the data itself. Data mining applications fall into two broad types:
hypothesis testing and knowledge (or pattern) discovery. The objective can be to profile certain people
or entities, improve marketing, discover knowledge, or detect fraud.
Banks apply data mining to check the effectiveness of loan and credit card application decisions;
insurance companies apply it to accident prevention and premium pricing; Amazon.com has used it to
determine inventory policy; and hotels use it to make better use of their capacity. Customer
relationship management is frequently the goal of a data mining system.
Data mining models generally follow one of several data mining (DM) methodologies, including the following:
Memory-based reasoning
Cluster detection
Decision trees
Market basket analysis
Link analysis
Artificial intelligence (AI) tools, such as neural networks and genetic algorithms, are also used for
data mining.
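To make one of the listed methodologies concrete, the sketch below shows the core counting step of what is commonly called market basket analysis: finding which pairs of items appear together in the same transaction, and how often. The baskets and the `pair_support` helper are hypothetical, and a full analysis would go on to derive association rules, which this sketch does not attempt.

```python
from collections import Counter
from itertools import combinations


def pair_support(transactions):
    """Return the fraction of transactions that contain each pair of items."""
    if not transactions:
        return {}
    counts = Counter()
    for basket in transactions:
        # Sort and de-duplicate so each pair has one canonical form per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    total = len(transactions)
    return {pair: count / total for pair, count in counts.items()}


# Hypothetical purchase transactions.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
    ["bread", "milk"],
]

support = pair_support(baskets)
for pair, share in sorted(support.items(), key=lambda item: -item[1]):
    print(pair, share)
```

Pairs with high support are candidates for further study, for example when deciding which products to promote together.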
© 2019 Association of International Certified Professional Accountants. All rights reserved. 2-10