Page 383 - Using MIS

Q3  How Do Organizations Use Data Warehouses and Data Marts to Acquire Data?


            Figure 9-14  Possible Problems with Source Data

               •  Dirty data
               •  Missing values
               •  Inconsistent data
               •  Data not integrated
               •  Wrong granularity
                    – Too fine
                    – Not fine enough
               •  Too much data
                    – Too many attributes
                    – Too many data points




                                       missing values. Most data vendors state the percentage of missing values for each attribute in
                                       the data they sell. An organization buys such data because for some uses, some data is better
                                       than no data at all. This is especially true for data items whose values are difficult to obtain,
                                       such as Number of Adults in Household, Household Income, Dwelling Type, and Education of
                                       Primary Income Earner. However, care is required here because for some BI applications a few
                                       missing or erroneous data points can seriously bias the analysis.
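
The kind of per-attribute missing-value percentage that vendors report can be sketched as follows. This is a minimal illustration, not any vendor's method: the function name `missing_percentages`, the field names, and the sample records are all hypothetical, and a missing value is assumed to be stored as `None`.

```python
from typing import Optional


def missing_percentages(rows: list[dict[str, Optional[object]]]) -> dict[str, float]:
    """Return the percentage of missing (None or absent) values per attribute."""
    if not rows:
        return {}
    # Collect every attribute name that appears in any record.
    attributes = {key for row in rows for key in row}
    total = len(rows)
    result = {}
    for attr in sorted(attributes):
        missing = sum(1 for row in rows if row.get(attr) is None)
        result[attr] = 100.0 * missing / total
    return result


# Hypothetical purchased household records; None marks a missing value.
households = [
    {"household_income": 72000, "dwelling_type": "house"},
    {"household_income": None, "dwelling_type": "apartment"},
    {"household_income": 51000, "dwelling_type": None},
    {"household_income": None, "dwelling_type": "house"},
]
report = missing_percentages(households)
```

A report like this lets an analyst decide, attribute by attribute, whether the data is complete enough for a given BI application.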
                                           Inconsistent data, the third problem in Figure 9-14, is particularly common for data that
                                       has been gathered over time. When an area code changes, for example, the phone number for
                                       a given customer before the change will not match the customer’s number afterward. Likewise,
                                       part codes can change, as can sales territories. Before such data can be used, it must be recoded
                                       for consistency over the period of the study.
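
Recoding for consistency can be sketched as a lookup from old codes to current ones, applied before any totals are computed. The part codes, the mapping `PART_CODE_CHANGES`, and the sales rows below are hypothetical; a real recoding would also need to know when each change took effect.

```python
from collections import defaultdict

# Hypothetical mapping from retired part codes to their current codes.
PART_CODE_CHANGES = {"A-100": "P-1100", "A-200": "P-1200"}


def recode_part(code: str, changes: dict[str, str]) -> str:
    """Return the current code for a part; unchanged codes pass through."""
    return changes.get(code, code)


def units_by_part(sales, changes):
    """Total units per part after recoding every row to the current code."""
    totals = defaultdict(int)
    for _year, code, units in sales:
        totals[recode_part(code, changes)] += units
    return dict(totals)


# Hypothetical sales history: (year, part code, units sold).
sales_history = [
    (2019, "A-100", 5),   # sold under the old code
    (2021, "P-1100", 7),  # same part, sold under the new code
    (2020, "B-300", 2),   # code never changed
]
totals = units_by_part(sales_history, PART_CODE_CHANGES)
```

Without the recoding step, the same part would appear as two unrelated parts in the totals.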
                                           Some data inconsistencies occur from the nature of the business activity. Consider a Web-
                                       based order-entry system used by customers worldwide. When the Web server records the
                                       time of order, which time zone does it use? The server’s system clock time is irrelevant to an
                                       analysis of customer behavior. Coordinated Universal Time (the successor to Greenwich Mean
                                       Time) is no more meaningful for this purpose. Somehow, Web server time must be adjusted to
                                       the time zone of the customer.
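
One way to make that adjustment is sketched below. It assumes, hypothetically, that the server stores order times in UTC and that each order also records the customer's offset from UTC in minutes; the function name `to_customer_local` and the sample values are illustrative only. A fixed offset does not track daylight saving changes, so a production system would more likely store a named time zone.

```python
from datetime import datetime, timedelta, timezone


def to_customer_local(server_utc: datetime, customer_offset_minutes: int) -> datetime:
    """Convert a UTC order timestamp to the customer's local clock time."""
    if server_utc.tzinfo is None:
        raise ValueError("server timestamp must be timezone-aware")
    customer_tz = timezone(timedelta(minutes=customer_offset_minutes))
    return server_utc.astimezone(customer_tz)


# Hypothetical order recorded by the server at 02:30 UTC.
order_utc = datetime(2024, 3, 15, 2, 30, tzinfo=timezone.utc)

# The same instant as seen by a customer at UTC+9 and by one at UTC-7.
local_plus9 = to_customer_local(order_utc, 9 * 60)
local_minus7 = to_customer_local(order_utc, -7 * 60)
```

The two customers placed their orders at the same instant, yet one ordered in the late morning and the other in the evening of the previous day, which is the distinction that matters for an analysis of customer behavior.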
                                           Another problem is nonintegrated data. A particular BI analysis might require data from
                                       an ERP system, an e-commerce system, and a social networking application. Analysts may wish
                                       to integrate that organizational data with purchased consumer data. Such a data collection will
                                       likely have relationships that are not represented in primary key/foreign key relationships.
                                       Integrating such data is the job of the data warehouse personnel.
                                           Data can also have the wrong granularity, a term that refers to the level of detail
                                       represented by the data. Granularity can be too fine or too coarse. For the former, suppose we
                                       want to analyze the placement of graphics and controls on an order-entry Web page. It is
                                       possible to capture the customers’ clicking behavior in what is termed clickstream data. That
                                       data, however, includes everything the customer does at the Web site. In the middle of the
                                       order stream are data for clicks on the news, email, instant chat, and a weather check. Although all of that data
                                       may be useful for a study of consumer browsing behavior, it will be overwhelming if all we want
                                       to know is how customers respond to an ad located differently on the screen. To proceed, the
                                       data analysts must throw away millions and millions of clicks.
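
Discarding the irrelevant clicks amounts to a filter on the clickstream. The sketch below assumes a hypothetical record layout (`session`, `page`, `target`) and hypothetical page paths; real clickstream data would be far larger and would have its own schema.

```python
from collections import Counter


def clicks_on_page(events: list[dict[str, str]], page: str) -> list[dict[str, str]]:
    """Keep only the click events that occurred on the given page."""
    return [event for event in events if event["page"] == page]


# Hypothetical clickstream: order-page clicks mixed with unrelated activity.
clickstream = [
    {"session": "s1", "page": "/news", "target": "headline"},
    {"session": "s1", "page": "/order", "target": "ad_top"},
    {"session": "s1", "page": "/weather", "target": "forecast"},
    {"session": "s2", "page": "/order", "target": "ad_side"},
    {"session": "s2", "page": "/order", "target": "ad_top"},
]

# Throw away everything except clicks on the order-entry page,
# then count how often each ad placement was clicked.
order_clicks = clicks_on_page(clickstream, "/order")
ad_counts = Counter(event["target"] for event in order_clicks)
```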
                                           Data can also be too coarse. For example, a file of regional sales totals cannot be used to
                                       investigate the sales in a particular store in a region, and total sales for a store cannot be used to
                                       determine the sales of particular items within a store. Instead, we need to obtain data that is fine
                                       enough for the lowest-level report we want to produce.
                                           In general, it is better to have too fine a granularity than too coarse. If the granularity is
                                       too fine, the data can be made coarser by summing and combining. This is what Addison and
                                       Drew did with the sales data in Figure 9-6. Sales by Bill Year were too fine for their needs, so they
                                       summed sales data over those years. If the granularity is too coarse, however, there is no way to
                                       separate the data into constituent parts.
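
Making fine data coarser by summing can be sketched as a roll-up over progressively fewer grouping fields. The row layout (region, store, item, amount), the function name `roll_up`, and the figures are hypothetical.

```python
from collections import defaultdict


def roll_up(sales, depth: int) -> dict[tuple, float]:
    """Sum amounts grouped by the first `depth` key fields of each row.

    Each row is (region, store, item, amount); depth=3 keeps item detail,
    depth=2 gives store totals, and depth=1 gives regional totals.
    """
    totals = defaultdict(float)
    for *keys, amount in sales:
        totals[tuple(keys[:depth])] += amount
    return dict(totals)


# Hypothetical item-level sales: the finest granularity available.
item_sales = [
    ("West", "S1", "widget", 100.0),
    ("West", "S1", "gadget", 50.0),
    ("West", "S2", "widget", 70.0),
    ("East", "S3", "widget", 30.0),
]

by_store = roll_up(item_sales, 2)   # coarser: totals per store
by_region = roll_up(item_sales, 1)  # coarsest: totals per region
```

The roll-up runs in one direction only: `by_region` can be computed from `item_sales`, but nothing in `by_region` allows the store or item figures to be recovered.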
                                           The final problem listed in Figure 9-14 is having too much data. As shown in the figure, we
                                       can have either too many attributes or too many data points. In terms of the tables discussed
                                       in Chapter 5, we can have too many columns (attributes) or too many rows (data points).