From: VJ on
Hello All,
Apologies in advance for the long mail, but I figure the more
information, the better the response.
I am analyzing count data with a high (~40%) proportion of zeroes; the
remaining counts range from 1 to 100 events in the observation period.
The response variable is the number of events a patient experiences in
a fixed observation period after receiving one of 2 different
treatments (parallel group). So I have just one record per individual,
where the number of events is recorded from time 0 to t hours, along
with a bunch of predictors for every individual.

After failing to fit the data well with a regular Poisson and negative
binomial, I shifted to analyzing the data with a ZIP and a Poisson
hurdle (logit-Poisson) model. There was a definite improvement in model
fit in terms of a drop in the -2 log likelihood / AIC values. Comparing
the ZIP and the hurdle model (fitting an intercept-only model), the
intercept from the logit part of the ZIP model was not significant
whereas that from the hurdle was significant, even though the former
had a lower -2LL and AIC, which I don't understand.
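A toy numeric sketch (not my data; the 40% zero rate and the lambda
value are just assumed for illustration) of why the two intercepts can
behave so differently: the hurdle logit part models P(Y=0) directly, so
its intercept is pinned to the observed zero fraction, while in the ZIP
the Poisson component already generates some zeros, so the mixing
proportion can legitimately sit near zero (and its logit intercept can
be large, negative, and poorly determined):

```python
import math

# With 40% zeros, the intercept-only hurdle logit part just encodes
# the zero fraction directly.
p_zero = 0.40
hurdle_intercept = math.log(p_zero / (1 - p_zero))  # logit(0.40)

# In a ZIP model, zeros come from two sources:
#   P(Y=0) = pi + (1 - pi) * exp(-lam)
def zip_p0(pi, lam):
    return pi + (1 - pi) * math.exp(-lam)

# e.g. lam = 1.2 gives exp(-1.2) ~ 0.30 zeros from the Poisson part
# alone, so the mixing proportion pi only needs to cover the remainder:
lam = 1.2
pi_needed = (p_zero - math.exp(-lam)) / (1 - math.exp(-lam))
```

Under these made-up numbers pi_needed is only about 0.14, much closer to
zero than the 0.40 the hurdle logit part has to reproduce.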

The logit part of the hurdle model (code adapted from a discussion post
by Dale and Wensui in May 2007) is giving me estimates similar to the
probability of zero, P(y=0), from a logistic model, as expected.
However, the zero-truncated Poisson part is not being explained well by
the full set of predictors in hand. I determined this by plotting the
predicted lambda (lambda hat) vs. the observed seizure count for each
individual, which you would expect to fall on a straight line; it did
not. I am not sure if this is the right diagnostic to look at. I tested
for overdispersion using the formula of Cameron and Trivedi, by
checking the significance of the regression of dep on predicted without
an intercept:
dep = ((observed - predicted) ** 2 - observed) / predicted;
model dep = predicted / noint; /* FIT AN OLS REGRESSION WITHOUT
INTERCEPT */
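For what it's worth, here is a rough Python sketch of that
Cameron-Trivedi auxiliary regression, run on simulated, deliberately
overdispersed counts (the data-generating values are made up purely for
illustration, not taken from my data):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000

# Simulate overdispersed counts via gamma-Poisson mixing;
# 'predicted' plays the role of the fitted Poisson means (mu-hats).
predicted = np.exp(rng.normal(0.5, 0.3, n))
observed = rng.poisson(predicted * rng.gamma(2.0, 0.5, n))

# Cameron-Trivedi auxiliary regression:
# regress dep on predicted, with no intercept.
dep = ((observed - predicted) ** 2 - observed) / predicted
alpha = (dep @ predicted) / (predicted @ predicted)  # no-intercept OLS slope

# crude t-statistic for H0: alpha = 0 (no overdispersion)
resid = dep - alpha * predicted
se = np.sqrt((resid @ resid) / (n - 1) / (predicted @ predicted))
t_stat = alpha / se
```

With this gamma mixing the true variance is mu + 0.5*mu^2, so the slope
estimate lands near 0.5 with a large t-statistic, flagging
overdispersion, which is the same conclusion the SAS regression gave me.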

The above regression implied the presence of overdispersion. So I
introduced a random effect (eta1) into the Poisson part of the model,
even though there is only one observation per individual.
(Motivated by this SAS example:
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/statug_nlmixed_sect042.htm
; surprisingly, the results in the example show that most of the
parameters are not significant, or am I interpreting this
incorrectly?)
Now the diagnostic plot that I just talked about shows observed vs.
predicted passing through the line of identity. Can I include this
random effect in my case? A few questions come to mind:
1. Do I have the right approach for model diagnosis?
2. What is the interpretation of eta1 (both in the code below and in
the SAS example), and how is it different from k in NB? Is it a random
effect adjusting for overdispersion at an individual level?
3. What kind of model fit plots (diagnostics) should we be looking at
when we deal with such models?
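For reference on question 2, here is a small simulation (with an
assumed sigma and intercept, not values fitted to my data) of how a
normal random effect on the log scale induces NB-like quadratic
overdispersion, with exp(sigma^2) - 1 playing the same role as k in the
NB variance mu + k*mu^2:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400_000
sigma = 0.5
eta0 = np.log(2.0)  # hypothetical fixed-effect linear predictor

# eta1 ~ N(0, sigma^2) on the log scale: a subject-level frailty
eta1 = rng.normal(0.0, sigma, n)
lam = np.exp(eta0 + eta1)
y = rng.poisson(lam)

# Marginal moments of the resulting Poisson-lognormal:
#   E[Y]   = exp(eta0 + sigma^2 / 2)
#   Var[Y] = E[Y] + E[Y]^2 * (exp(sigma^2) - 1)
# i.e. exp(sigma^2) - 1 is the analogue of k in NB's Var = mu + k*mu^2.
mu = np.exp(eta0 + sigma**2 / 2)
theo_var = mu + mu**2 * (np.exp(sigma**2) - 1)
```

So my reading is yes: eta1 soaks up individual-level extra-Poisson
variation, much as the gamma frailty does in the NB, just with a
lognormal mixing distribution instead.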
Finally, here is the code I used, pretty much the same as what Dale
posted earlier (thanks, Dale):

proc nlmixed data=radarscap tech=newrap maxiter=2000 maxfunc=2000;
   /* Linear function for the zero-probability (logit) model */
   eta_0 = b0_0 + a1*trt;
   exp_eta_0 = exp(eta_0);
   p_0 = exp_eta_0 / (1 + exp_eta_0);

   /* Linear function for the Poisson probability model */
   eta_P = bP_0 + b2*fourhrs + eta1;
   mu_P = exp(eta_P);
   pYyP = exp(-mu_P) * (mu_P**tsz) / fact(tsz); /* Poisson pmf (not used below) */
   pY0P = exp(-mu_P);                           /* P(Y=0) under the Poisson part */

   /* Hurdle log likelihood: logit part for the zeros,
      zero-truncated Poisson for the positive counts */
   if tsz = 0 then
      loglike = eta_0 - log(1 + exp_eta_0);
   else
      loglike = log(1 - p_0) - mu_P + tsz*eta_P - lgamma(tsz + 1)
                - log(1 - pY0P);

   model tsz ~ general(loglike);
   random eta1 ~ normal(0, sd1*sd1) subject=id;
   predict eta1 out=eta1 (rename=(pred=eta1));
   predict mu_P out=lambda (rename=(pred=lambda));
   predict p_0 out=p0 (rename=(pred=p0));
run; quit;
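One note on my lambda-hat vs. observed plot: for the positive counts,
the hurdle model's conditional mean is the zero-truncated mean
lambda/(1 - exp(-lambda)), not lambda itself, so plotting the raw mu_P
predictions against observed counts would sit off the identity line at
small lambda even for a well-fitting model. A quick sketch:

```python
import math

def truncated_poisson_mean(lam):
    """Mean of Y | Y > 0 when Y ~ Poisson(lam)."""
    return lam / (1.0 - math.exp(-lam))

# For small lam the truncation correction is large, so comparing raw
# lam-hat against observed positive counts is misleading:
small = truncated_poisson_mean(0.5)  # ~1.27, well above 0.5
big = truncated_poisson_mean(5.0)    # ~5.03, close to 5
```

So a fairer version of that diagnostic would plot the truncated mean
against the observed positive counts, rather than mu_P itself.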

Thanks all,
VJ
From: Eli Y. Kling on
Sounds like a very interesting challenge. Your question is very well
described and merits an answer. I would be very interested in what
others offer as a response. As for my small contribution:
a. I think your gut feeling is correct. If you have 40% zeros then you
would expect the relevant modelling component to be significant. As
you are finding it hard to fit a meaningful model, I would suggest a
back-to-basics approach:
1. Divide the data by some meaningful strata, such as time or centre,
and re-analyze.
2. Look out for extreme values.
3. Try just modelling the probability of response/non-response. You
might find a confounded variable or an ill-defined one.
4. Check your code step by step; you might have done something in the
data management you did not intend. For instance, are all missing
values replaced with a zero?
5. The last point suggests another possibility: maybe all the non-
responders (zeros) have a missing value for one of the independent
variables. This would cause the whole record to be dropped from the
analysis and result in an insignificant 'zero' effect.

b. Another possible approach is balanced sampling, where you take equal
numbers of observations from the responders and non-responders. This
will most likely reduce the noise, but you must remember to recalibrate
the predicted probabilities with the prior sampling ratios.
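One common way to do that recalibration (a sketch of the standard odds
correction for outcome-dependent sampling; the sampling rates below are
made-up numbers) works on the odds scale:

```python
def recalibrate(p_sample, rate_pos, rate_neg):
    """Map a probability fitted on a balanced (case-control-style) sample
    back to the population scale, given the sampling rates of responders
    (rate_pos) and non-responders (rate_neg).

    Under outcome-dependent sampling,
        sample odds = population odds * (rate_pos / rate_neg),
    so we divide that factor back out."""
    odds_sample = p_sample / (1.0 - p_sample)
    odds_pop = odds_sample * (rate_neg / rate_pos)
    return odds_pop / (1.0 + odds_pop)

# e.g. if responders were sampled 5x as heavily as non-responders,
# a balanced-sample probability of 0.5 maps back to 1/6:
p_pop = recalibrate(0.5, rate_pos=0.5, rate_neg=0.1)
```

Equivalently, you can bake the same factor into the fitted logit
intercept as a fixed offset rather than adjusting each prediction.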