The preceding observations are based on the assumption that all other possible explanatory variables are kept constant. As observed in Table 12.7, there are most probably other factors which simultaneously affect the presence of satellite crabs. More complex logistic regression models that take multiple explanatory variables into account may therefore be necessary to obtain a model with higher predictive power. The plausibility of such models is investigated in the next subsection.
12.4.4 Multiple logistic regression
Simple logistic regression models with a single explanatory variable generalize to multiple logistic regression models with several explanatory variables in much the same way that simple linear regression models generalize to multiple linear regression models in OLS regression. A typical multiple logistic regression model for a binary response with k categorical variables and l continuous variables can be represented as follows:
\ln\frac{\pi}{1-\pi} = \alpha + \beta_{C1}c_1 + \beta_{C2}c_2 + \cdots + \beta_{Ck}c_k + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_l x_l,
where α is a constant, c_i is the ith categorical explanatory variable and β_Ci its slope parameter, x_i is the ith continuous explanatory variable and β_i its slope parameter, and π is the probability of success.
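Although the computations in this chapter are carried out in MINITAB, a minimal Python sketch (the function name below is arbitrary) makes explicit how the linear predictor on the right-hand side is mapped back to the success probability π through the inverse logit transform:

import numpy as np

def inverse_logit(eta):
    """Recover pi from the linear predictor eta = ln(pi / (1 - pi))."""
    return 1.0 / (1.0 + np.exp(-eta))

# A linear predictor of 0 corresponds to pi = 0.5; larger predictors
# push the success probability towards 1.
print(inverse_logit(0.0))   # 0.5
print(inverse_logit(2.0))   # ~0.88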
In the horseshoe crab data shown in Table 12.7, the possible categorical variables are the crab color and spine condition. The continuous explanatory variables are the crab's weight and carapace width. The fitted model could potentially be of the following form:
\ln\frac{\pi_{\mathrm{Pres}}}{1-\pi_{\mathrm{Pres}}} = \alpha + \beta_{C1}c_1 + \beta_{C2}c_2 + \beta_1 x_1 + \beta_2 x_2,    (12.17)
where c_1 is the color variable, c_2 the spine condition, x_1 the width, x_2 the weight, and π_Pres the probability of finding a satellite crab nearby. The categorical variable c_1 has four levels whereas c_2 has three; hence, three dummy variables are necessary to completely describe c_1 and two are needed for c_2. The three dummy variables for c_1 are denoted c_1i for i = 1, 2, 3, and the two for c_2 are denoted c_2j for j = 1, 2.
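As a small illustration of this dummy coding (the level names below are hypothetical, not the labels used in the book's data set), pandas can generate the dummy columns directly:

import pandas as pd

# Hypothetical level names for the two categorical variables.
crabs = pd.DataFrame({
    "color": ["light", "medium", "dark", "darker", "medium"],
    "spine": ["both good", "one worn", "both worn", "both good", "one worn"],
})

# Dropping one reference level per factor leaves 4 - 1 = 3 dummy columns
# for color (c_11, c_12, c_13) and 3 - 1 = 2 for spine (c_21, c_22).
dummies = pd.get_dummies(crabs, columns=["color", "spine"], drop_first=True)
print(dummies.columns.tolist())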
The MLEs of the parameters, evaluated using MINITAB, are shown in Table 12.11.
From Table 12.11, none of the explanatory variables appears to be significant. This contradicts the earlier analyses with single categorical and continuous variables.
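An analysis of this kind can be reproduced in spirit outside MINITAB. The sketch below fits a model of the form (12.17) with statsmodels on synthetic stand-in data (the column names, level labels and generated values are assumptions, not the book's data set); the fitted summary gives the MLEs with their Wald z-tests, and the likelihood-ratio statistic discussed next is also available directly.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the horseshoe crab data; values are illustrative only.
rng = np.random.default_rng(1)
n = 173
crabs = pd.DataFrame({
    "color": rng.choice(["light", "medium", "dark", "darker"], size=n),
    "spine": rng.choice(["both good", "one worn", "both worn"], size=n),
    "width": rng.normal(26.0, 2.0, size=n),
    "weight": rng.normal(2.4, 0.6, size=n),
})
# Arbitrary "true" effects used only to generate a plausible binary response.
eta = -17.0 + 0.5 * crabs["width"] + 2.0 * crabs["weight"]
crabs["present"] = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

# C() expands each categorical factor into its dummy variables automatically.
fit = smf.logit("present ~ C(color) + C(spine) + width + weight", data=crabs).fit()
print(fit.summary())              # MLEs, standard errors and Wald z-tests per term
print(fit.llr, fit.df_model)      # likelihood-ratio (G^2) statistic and its 7 model d.f.
print(fit.llr_pvalue)             # p-value from the asymptotic chi-square distribution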
Furthermore, the G² statistic calculated with Equation (12.16) using MINITAB is 26.3. The null hypothesis for the test based on this statistic states that the response is jointly independent of all the explanatory variables. Based on an asymptotic null χ² distribution with 7 degrees of freedom, there is very strong evidence that at least one of the explanatory variables has a significant effect on the response. The fact that all the effects in the logistic regression table in Table 12.11 nevertheless show up as insignificant could be indicative of significant multicollinearity between the explanatory variables. Multicollinearity essentially refers to the presence of significant relationships between the explanatory variables such that some or all of the