statsmodels

dev first guess returns nan for Poisson family GLM

Bug #603306 reported by amcmorl on 2010-07-08

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	statsmodels	Fix Committed	Undecided	Unassigned

Bug Description

Poisson famiy GLM fit to data with 0s in endog variable fails with error:

"ValueError: The first guess on the deviance function returned a nan. This could be a boundary problem and should be reported."

Problem can be reproduced with following code:

import numpy as np
import scikits.statsmodels as sm

size = 1e5
nbeta = 3

  beta = np.random.rand(nbeta)
  x = np.random.rand(size, nbeta)
  y = np.random.poisson(np.dot(x, beta))

  fam = sm.families.Poisson()
  glm = sm.GLM(y,x, family=fam)
  res = glm.fit()

If family=fam is left out of GLM constructor, then glm.fit() works as expected.

Revision history for this message

joep (josef-pktd) wrote on 2010-07-08:

family.Poisson.deviance has (Y*np.log(Y/mu)) which is nan for Y=0 which is a valid value for Poisson

a solution:

set 0*np.log(0) = 0 using for example numpy.where

(I'm surprised this wasn't caught by any tests)

Revision history for this message

Skipper Seabold (jsseabold) wrote on 2010-07-08:

Ah, yes. The Poisson example doesn't have any zeros in it. I'm not sure I want to use np.where because it still raises those horribly annoying warnings.

y = np.arange(10)
np.where(y==0, 0,np.log(y))
Warning: divide by zero encountered in log
array([ 0. , 0. , 0.69314718, 1.09861229, 1.38629436,
1.60943791, 1.79175947, 1.94591015, 2.07944154, 2.19722458])

How does this sound for Deviance in Poisson

Yin = np.clip(Y, 1e-12, np.inf)
return 2*np.sum(Y*np.log(Yin/mu))/scale

That way 0*np.log(1e-12) is zero and it shouldn't effect anything else, just a little slower. Tests still pass.

Revision history for this message

joep (josef-pktd) wrote on 2010-07-08:

Creating a np.zeros array and then use putmask to assign nonzero values would avoid the unnecessary log computation.
I don't like np.where much for more expensive calculations, because it always evaluates both branches for every element.
Checking the timing whether clip or putmask is faster, would be useful

(In general, I think zero division warnings should be set to ignore. I have all warnings at ignore, since 1/0. 1/inf are valid operations on the real line)

Revision history for this message

Skipper Seabold (jsseabold) wrote on 2010-07-08:

putmask is faster by a factor of about 1.5 (though actually still pretty slow) for arrays of size 1e6, but it still spits out the warnings. I don't think we should do np.seterr(all='ignore') globally. Sometimes I like the warnings, but often they are a huge nuisance.

Revision history for this message

joep (josef-pktd) wrote on 2010-07-09:

I'm not getting any warnings even after changing the np.seterr to raise, but I think the following does not evaluate any log(0) and shouldn't raise a warning:

>>> a = np.array([1, 2, 4, 0, 2, 3, 0, 0, 1, 2])
>>> b = np.zeros(a.shape)
>>> mask = a!=0
>>> ared = a[mask]
>>> b[mask] = ared*np.log(ared)
>>> b
array([ 0. , 1.38629436, 5.54517744, 0. , 1.38629436,
3.29583687, 0. , 0. , 0. , 1.38629436])

>>> c = np.zeros(a.shape)
>>> np.putmask(c, mask, ared*np.log(ared))
>>> c
array([ 0. , 1.38629436, 5.54517744, 0. , 3.29583687,
0. , 0. , 0. , 1.38629436, 5.54517744])

using out in log and inplace multiplication would save some intermediate arrays, but would reduce readability a bit.

Revision history for this message

Skipper Seabold (jsseabold) wrote on 2010-07-09:

Oh, ok. I was using putmask differently. That works for me.

Skipper Seabold (jsseabold) on 2010-07-09

Changed in statsmodels:
status:	New → Fix Committed

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.