dev first guess returns nan for Poisson family GLM

Bug #603306 reported by amcmorl
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
statsmodels
Fix Committed
Undecided
Unassigned

Bug Description

Poisson famiy GLM fit to data with 0s in endog variable fails with error:

"ValueError: The first guess on the deviance function returned a nan. This could be a boundary problem and should be reported."

Problem can be reproduced with following code:

  import numpy as np
  import scikits.statsmodels as sm

  size = 1e5
  nbeta = 3

  beta = np.random.rand(nbeta)
  x = np.random.rand(size, nbeta)
  y = np.random.poisson(np.dot(x, beta))

  fam = sm.families.Poisson()
  glm = sm.GLM(y,x, family=fam)
  res = glm.fit()

If family=fam is left out of GLM constructor, then glm.fit() works as expected.

Revision history for this message
joep (josef-pktd) wrote :

family.Poisson.deviance has (Y*np.log(Y/mu)) which is nan for Y=0 which is a valid value for Poisson

a solution:

set 0*np.log(0) = 0 using for example numpy.where

(I'm surprised this wasn't caught by any tests)

Revision history for this message
Skipper Seabold (jsseabold) wrote :

Ah, yes. The Poisson example doesn't have any zeros in it. I'm not sure I want to use np.where because it still raises those horribly annoying warnings.

y = np.arange(10)
np.where(y==0, 0,np.log(y))
Warning: divide by zero encountered in log
array([ 0. , 0. , 0.69314718, 1.09861229, 1.38629436,
        1.60943791, 1.79175947, 1.94591015, 2.07944154, 2.19722458])

How does this sound for Deviance in Poisson

Yin = np.clip(Y, 1e-12, np.inf)
return 2*np.sum(Y*np.log(Yin/mu))/scale

That way 0*np.log(1e-12) is zero and it shouldn't effect anything else, just a little slower. Tests still pass.

Revision history for this message
joep (josef-pktd) wrote :

Creating a np.zeros array and then use putmask to assign nonzero values would avoid the unnecessary log computation.
I don't like np.where much for more expensive calculations, because it always evaluates both branches for every element.
Checking the timing whether clip or putmask is faster, would be useful

(In general, I think zero division warnings should be set to ignore. I have all warnings at ignore, since 1/0. 1/inf are valid operations on the real line)

Revision history for this message
Skipper Seabold (jsseabold) wrote :

putmask is faster by a factor of about 1.5 (though actually still pretty slow) for arrays of size 1e6, but it still spits out the warnings. I don't think we should do np.seterr(all='ignore') globally. Sometimes I like the warnings, but often they are a huge nuisance.

Revision history for this message
joep (josef-pktd) wrote :

I'm not getting any warnings even after changing the np.seterr to raise, but I think the following does not evaluate any log(0) and shouldn't raise a warning:

>>> a = np.array([1, 2, 4, 0, 2, 3, 0, 0, 1, 2])
>>> b = np.zeros(a.shape)
>>> mask = a!=0
>>> ared = a[mask]
>>> b[mask] = ared*np.log(ared)
>>> b
array([ 0. , 1.38629436, 5.54517744, 0. , 1.38629436,
        3.29583687, 0. , 0. , 0. , 1.38629436])

>>> c = np.zeros(a.shape)
>>> np.putmask(c, mask, ared*np.log(ared))
>>> c
array([ 0. , 1.38629436, 5.54517744, 0. , 3.29583687,
        0. , 0. , 0. , 1.38629436, 5.54517744])

using out in log and inplace multiplication would save some intermediate arrays, but would reduce readability a bit.

Revision history for this message
Skipper Seabold (jsseabold) wrote :

Oh, ok. I was using putmask differently. That works for me.

Changed in statsmodels:
status: New → Fix Committed
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.