Page 189 - Six Sigma Advanced Tools for Black Belts and Master Black Belts
August 31, 2006
Introduction to the Analysis of Categorical Data
probabilistic statement that the joint probability is equivalent to the product of the
marginal probabilities. The conclusion of this hypothesis test applies in more general
two-variable cases without reference to any distinction between response and
explanatory variables. Specifically, if the null hypothesis holds, the probability of an
observation falling in any row is independent of which column that observation
already belongs to. Hence, if the null hypothesis is rejected, there is evidence suggesting
the presence of a relationship between the variables.
The marginal probabilities, $\pi_{i+}$ and $\pi_{+j}$, can be estimated by the sample marginal
probabilities, $p_{i+}$ and $p_{+j}$, respectively:
\[ p_{i+} = \frac{n_{i+}}{n}, \tag{12.1} \]
\[ p_{+j} = \frac{n_{+j}}{n}. \tag{12.2} \]
Under the null hypothesis, the joint probability in each cell, $\pi^{0}_{ij}$, is given by the product
of these marginal probabilities and can be estimated by the sample estimators for these
marginal probabilities as follows:
\[ p^{0}_{ij} = p_{i+}\, p_{+j} = \frac{n_{i+}}{n} \cdot \frac{n_{+j}}{n}. \tag{12.3} \]
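As a minimal sketch of equations (12.1) to (12.3), the snippet below estimates the marginal probabilities and the joint probabilities under independence. The 2 x 3 table of counts and all variable names are hypothetical and chosen only for illustration; they do not come from the text.

```python
# Hypothetical 2 x 3 table of observed counts n_ij (illustrative data only).
table = [[20, 30, 50],
         [30, 30, 40]]

row_totals = [sum(row) for row in table]         # n_{i+}
col_totals = [sum(col) for col in zip(*table)]   # n_{+j}
n = sum(row_totals)                              # grand total n

p_row = [r / n for r in row_totals]              # p_{i+}, equation (12.1)
p_col = [c / n for c in col_totals]              # p_{+j}, equation (12.2)

# Estimated joint probabilities under the null hypothesis, equation (12.3).
p0 = [[pi * pj for pj in p_col] for pi in p_row]

# Observed joint proportions n_ij / n, the quantities p0 is compared against.
p_obs = [[n_ij / n for n_ij in row] for row in table]
```

Both `p0` and `p_obs` sum to 1 over all cells; any gap between them in a given cell is what the tests of independence try to assess.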
With the estimated joint probabilities under the null hypothesis, comparisons can be
made with the estimated joint probabilities of each cell obtained from the actual data.
Statistical tests based on the expected frequencies can be used to implement these
comparisons. The estimated expected frequency for each (i, j)th cell under the null
hypothesis is given by
\[ \hat{\mu}_{ij} = n\, p_{i+}\, p_{+j} = \frac{n_{i+} \cdot n_{+j}}{n}. \tag{12.4} \]
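Equation (12.4) can be sketched in a few lines. The helper name `expected_frequencies` and the example counts are assumptions made for illustration, not something taken from the text.

```python
def expected_frequencies(table):
    """Estimated expected counts mu_hat_ij = n_{i+} * n_{+j} / n, equation (12.4)."""
    row_totals = [sum(row) for row in table]        # n_{i+}
    col_totals = [sum(col) for col in zip(*table)]  # n_{+j}
    n = sum(row_totals)                             # grand total n
    return [[r * c / n for c in col_totals] for r in row_totals]


# Hypothetical 2 x 3 table of observed counts (illustrative data only).
table = [[20, 30, 50],
         [30, 30, 40]]
mu_hat = expected_frequencies(table)
```

For this table the two row totals are equal, so both rows of `mu_hat` are identical. The expected counts also reproduce the observed row and column totals, which is a convenient sanity check.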
Given that the actual cell frequencies are $n_{ij}$, the following Pearson $X^2$ statistic can be
used to assess the independence between the two variables:
\[ X^2 = \sum_{i,j} \frac{(n_{ij} - \hat{\mu}_{ij})^2}{\hat{\mu}_{ij}}. \tag{12.5} \]
Another statistic based on likelihood ratios can also be used:
\[ G^2 = 2 \sum_{i,j} n_{ij} \ln\!\left( \frac{n_{ij}}{\hat{\mu}_{ij}} \right). \tag{12.6} \]
For both of these statistics, the large-sample reference distribution is $\chi^2$ with $(I - 1) \times
(J - 1)$ degrees of freedom. The null hypothesis is rejected when the computed statistic
exceeds the critical value at a given significance level.
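The two statistics and the decision rule can be sketched as follows. The function name, the example counts, and the 5% significance level are assumptions for illustration. The sketch also assumes every row and column total is positive, so that no expected count is zero. To stay within the standard library, the critical value uses a shortcut that holds only for 2 degrees of freedom, where the upper-tail probability of the chi-square distribution is exp(-x/2). For other degrees of freedom a table or a routine such as `scipy.stats.chi2.ppf` would be needed.

```python
import math


def independence_statistics(table):
    """Return (X2, G2, df) for a two-way table of counts, equations (12.5) and (12.6)."""
    row_totals = [sum(row) for row in table]        # n_{i+}
    col_totals = [sum(col) for col in zip(*table)]  # n_{+j}
    n = sum(row_totals)                             # grand total n
    x2 = 0.0
    g2 = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            mu = row_totals[i] * col_totals[j] / n  # expected count, equation (12.4)
            x2 += (n_ij - mu) ** 2 / mu
            if n_ij > 0:  # an empty cell contributes 0 to G2 (limit of n ln n as n -> 0)
                g2 += 2.0 * n_ij * math.log(n_ij / mu)
    df = (len(table) - 1) * (len(table[0]) - 1)     # (I - 1) x (J - 1)
    return x2, g2, df


# Hypothetical 2 x 3 table of observed counts (illustrative data only).
table = [[20, 30, 50],
         [30, 30, 40]]
x2, g2, df = independence_statistics(table)

alpha = 0.05
# Valid for df = 2 only: solving exp(-x / 2) = alpha gives x = -2 ln(alpha), about 5.99.
critical = -2.0 * math.log(alpha)
reject_x2 = x2 > critical
reject_g2 = g2 > critical
```

With these counts both statistics are close to 3.1, below the critical value, so independence would not be rejected at the 5% level.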
More detailed information can be obtained by looking at the contribution to the
overall $X^2$ and $G^2$ statistics from each variable combination (or cell) and the difference
between the observed and expected cell frequencies in each cell. The differences
between observed and expected frequencies in each cell are also known as the residuals.
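A cell-by-cell breakdown of this kind might be sketched as below. The example counts and variable names are hypothetical, and the residuals shown are the raw differences described above, with no standardization.

```python
# Hypothetical 2 x 3 table of observed counts (illustrative data only).
table = [[20, 30, 50],
         [30, 30, 40]]

row_totals = [sum(row) for row in table]        # n_{i+}
col_totals = [sum(col) for col in zip(*table)]  # n_{+j}
n = sum(row_totals)                             # grand total n

# Expected counts under independence: n_{i+} * n_{+j} / n.
mu_hat = [[r * c / n for c in col_totals] for r in row_totals]

# Raw residuals: observed minus expected count in each cell.
residuals = [[n_ij - mu for n_ij, mu in zip(obs_row, mu_row)]
             for obs_row, mu_row in zip(table, mu_hat)]

# Contribution of each cell to the Pearson X2 statistic.
x2_contrib = [[res ** 2 / mu for res, mu in zip(res_row, mu_row)]
              for res_row, mu_row in zip(residuals, mu_hat)]

# Print one line per cell: indices, residual, and X2 contribution.
for i, (res_row, con_row) in enumerate(zip(residuals, x2_contrib), start=1):
    for j, (res, con) in enumerate(zip(res_row, con_row), start=1):
        print(f"cell ({i}, {j}): residual = {res:+.1f}, X2 contribution = {con:.3f}")
```

In this example the middle column matches its expected counts exactly and contributes nothing, so any departure from independence would be driven by the first and third columns.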