Page 201 - Six Sigma Advanced Tools for Black Belts and Master Black Belts

                          Introduction to the Analysis of Categorical Data
        likelihood ratio statistic is given by equation (12.16). The logarithm of a ratio of
        two likelihoods can be written as a difference between two log likelihoods. A special
        case of the likelihood ratio, written as a difference of two log likelihoods and used
        in likelihood ratio GOF tests, is the 'deviance' statistic given by
          $$\text{Deviance} = -2\,(L_N - L_S)$$
        where $L_S$ is the maximum log likelihood for a saturated model (with a separate
        parameter for each observation, resulting in a perfect fit of the observed data) and
        $L_N$ is the maximum log likelihood for a non-saturated model. For logistic regression
        involving binomial random variables, the deviance statistic has an asymptotic null
        $\chi^2$ distribution with $N - p$ degrees of freedom, where $N$ is the number of
        distinct observations and $p$ is the number of parameters in the model. By utilizing
        the $\chi^2$ distribution, a statistical hypothesis test can be conducted to validate
        the explanatory power of a postulated non-saturated model compared to the perfect
        explanatory power of a saturated model for a given data set. The null hypothesis for
        this test is that all parameters that are in the saturated model but not in the
        postulated non-saturated model are zero. For the horseshoe crab example with a single
        continuous explanatory variable (carapace width) the deviance of this model is
        calculated as 33.28 with 33 degrees of freedom. This gives a p-value of 0.454. Hence,
        the null hypothesis cannot be rejected at the 5% significance level.
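
        The p-value quoted above is the upper-tail area of the $\chi^2$ distribution beyond
        the observed deviance. The sketch below is one way to check that figure using only the
        Python standard library. The function name chi2_sf and the power-series evaluation of
        the incomplete gamma function are choices made for this illustration and are not taken
        from the text or from MINITAB. The result should be close to the 0.454 quoted above.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with df degrees of freedom.

    Illustrative helper (hypothetical name). It evaluates the regularized lower
    incomplete gamma function P(a, z), with a = df / 2 and z = x / 2, through the series
        P(a, z) = z^a * exp(-z) / Gamma(a + 1) * sum_{n >= 0} z^n / ((a + 1)...(a + n))
    and returns 1 - P(a, z). The series is adequate when x is not far above df.
    """
    if x <= 0:
        return 1.0
    a = df / 2.0
    z = x / 2.0
    term = 1.0   # n = 0 term of the series
    total = 1.0
    n = 1
    while term > 1e-15 * total:
        term *= z / (a + n)
        total += term
        n += 1
    log_prefactor = a * math.log(z) - z - math.lgamma(a + 1.0)
    return 1.0 - math.exp(log_prefactor) * total


# Values quoted in the text for the carapace-width model.
deviance = 33.28
degrees_of_freedom = 33

p_value = chi2_sf(deviance, degrees_of_freedom)
reject_null = p_value < 0.05  # test at the 5% significance level

print(f"p-value = {p_value:.3f}, reject null at 5%: {reject_null}")
```

        Statistical packages expose the same tail probability directly, so the hand-written
        series is only needed when no such package is available.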
          It should be noted that there are other GOF tests for categorical data analysis with
        continuous explanatory variables such as the Hosmer–Lemeshow tests² and those
        based on Pearson residuals. Such tests can be easily evaluated with statistical software
        such as MINITAB. Details of these tests are given by Agresti.¹ The p-value is evaluated
        together with the GOF statistics in MINITAB and can be used to provide a more
        comprehensive judgment on the GOF of the estimated model.


        12.4.3 Logistic regression with single categorical explanatory variable
        Logistic regression can easily be extended to model categorical explanatory variables
        via the use of dummy variables. The procedure for handling categorical explanatory
        variables is described in this section using the above horseshoe crab example. The
        explanatory variable of interest is taken to be the spine condition. From Table 12.7, it
        can be observed that spine condition is essentially a categorical variable with three
        levels. Two binary dummy variables which take values of 0 or 1 are necessary to fully
        describe the three levels of spine condition. The two dummy variables are denoted
        as $c_{S1}$ and $c_{S2}$. The logistic regression model with a single categorical
        explanatory variable is given by

          $$\ln\frac{\pi_{\mathrm{Pres}}(c_{S1}, c_{S2})}{1 - \pi_{\mathrm{Pres}}(c_{S1}, c_{S2})} = \alpha_S + \beta_{S1} c_{S1} + \beta_{S2} c_{S2}.$$
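
        The dummy coding and the logit model above can be sketched in a few lines of Python.
        This is a minimal illustration under stated assumptions: the levels are labeled 1, 2
        and 3, level 3 is taken as the reference level (both dummies zero), and the
        coefficient values are hypothetical placeholders, not the MINITAB estimates. The
        function names spine_dummies and pi_pres are likewise invented for this example.

```python
import math


def spine_dummies(level):
    """Map a three-level spine condition to the dummy pair (c_S1, c_S2).

    Assumption for this sketch: level 3 is the reference level, coded (0, 0).
    """
    if level not in (1, 2, 3):
        raise ValueError("spine condition must be 1, 2 or 3")
    c_s1 = 1 if level == 1 else 0
    c_s2 = 1 if level == 2 else 0
    return c_s1, c_s2


def pi_pres(alpha_s, beta_s1, beta_s2, level):
    """Invert the logit: pi = 1 / (1 + exp(-(alpha_S + beta_S1*c_S1 + beta_S2*c_S2)))."""
    c_s1, c_s2 = spine_dummies(level)
    logit = alpha_s + beta_s1 * c_s1 + beta_s2 * c_s2
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical coefficients, for illustration only.
alpha_s, beta_s1, beta_s2 = 0.5, -0.3, 0.8

for level in (1, 2, 3):
    prob = pi_pres(alpha_s, beta_s1, beta_s2, level)
    print(level, spine_dummies(level), round(prob, 3))
```

        With this coding the intercept $\alpha_S$ is the logit for the reference level, and
        each $\beta$ is the change in logit for its level relative to that reference. A
        different choice of reference level changes the coefficients but not the fitted
        probabilities.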
          MINITAB offers the facility to analyze logistic regression models with categorical
        explanatory variables. The output from MINITAB using the horseshoe crab dataset is
        reproduced in Table 12.9. From the output, the MLE coefficients of both the constant
        and the parameters of the two dummy variables, $\hat{\alpha}_S$, $\hat{\beta}_{S1}$ and
        $\hat{\beta}_{S2}$, are found to be significant at the 5% level, judging by their
        z-statistics and p-values. As with the logistic regression model with a single
        continuous explanatory variable, the p-value is based on an asymptotic normal
        distribution. The logit function for the spine conditions