nova-api fails to query ServiceGroup status from Zookeeper

Bug #1239864 reported by Jeff Dutton
This bug affects 4 people
Affects                   Status        Importance  Assigned to  Milestone
OpenStack Compute (nova)  Fix Released  Low         Sean Dague

Bug Description

I am running with the ZooKeeper servicegroup driver on CentOS 6.4 (Python 2.6) with the RDO distro of Grizzly.

All nova services are successfully connecting to ZooKeeper, which I've verified using zkCli.

However, when I run `nova service-list` I get an HTTP 500 error from nova-api. The nova-api log (/var/log/nova/api.log) shows:

2013-10-14 16:33:15.110 6748 TRACE nova.api.openstack   File "/usr/lib/python2.6/site-packages/nova/servicegroup/api.py", line 93, in service_is_up
2013-10-14 16:33:15.110 6748 TRACE nova.api.openstack     return self._driver.is_up(member)
2013-10-14 16:33:15.110 6748 TRACE nova.api.openstack   File "/usr/lib/python2.6/site-packages/nova/servicegroup/drivers/zk.py", line 116, in is_up
2013-10-14 16:33:15.110 6748 TRACE nova.api.openstack     all_members = self.get_all(group_id)
2013-10-14 16:33:15.110 6748 TRACE nova.api.openstack   File "/usr/lib/python2.6/site-packages/nova/servicegroup/drivers/zk.py", line 141, in get_all
2013-10-14 16:33:15.110 6748 TRACE nova.api.openstack     raise exception.ServiceGroupUnavailable(driver="ZooKeeperDriver")
2013-10-14 16:33:15.110 6748 TRACE nova.api.openstack ServiceGroupUnavailable: The service from servicegroup driver ZooKeeperDriver is temporarily unavailable.

The problem appears to lie in how evzookeeper is used (I am running version 0.4.0).

To isolate the problem, I added some synchronous evzookeeper.ZKSession get() calls to test the roundtrip communication with ZooKeeper. When I call `self._session.get(CONF.zookeeper.sg_prefix)` in the ZooKeeperDriver __init__() method in zk.py, it works fine. The logs show that this call runs immediately before the WSGI server starts up.
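
A roundtrip check like the one described above can be wrapped in a timeout so that a hang is reported instead of stalling the request. The following is only a minimal sketch: `probe` and `slow_call` are hypothetical names, not part of nova or evzookeeper, and it assumes plain OS threads. Under eventlet monkey patching (which nova-api uses) threads become green threads, so a call that blocks inside a C library may still block the whole process and defeat the timeout.

```python
import threading
import time


def probe(call, timeout):
    """Run call() in a daemon thread and report how it ended.

    Returns ("ok", value), ("failed", exception) or ("hung", None).
    Hypothetical diagnostic helper; assumes real OS threads.
    """
    result = {}

    def runner():
        try:
            result["value"] = call()
        except Exception as exc:  # report the failure instead of losing it
            result["error"] = exc

    worker = threading.Thread(target=runner)
    worker.daemon = True  # a hung call must not keep the process alive
    worker.start()
    worker.join(timeout)

    if worker.is_alive():
        return ("hung", None)
    if "error" in result:
        return ("failed", result["error"])
    return ("ok", result["value"])


def slow_call():
    """Stand-in for a call that does not return in time."""
    time.sleep(0.5)


fast_status = probe(lambda: "pong", timeout=1.0)
slow_status = probe(slow_call, timeout=0.05)
```

In the driver, the callable would be something like `lambda: self._session.get(CONF.zookeeper.sg_prefix)`, called once from __init__() and once from get_all() to compare the two results.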

When I make the same get() call from within the ZooKeeperDriver get_all() method, the web request hangs indefinitely. However, if I recreate the evzookeeper.ZKSession within the get_all() method (that is, after the WSGI server has started), the nova-api request succeeds:

diff --git a/nova/servicegroup/drivers/zk.py b/nova/servicegroup/drivers/zk.py
index 2a3edae..7de2488 100644
--- a/nova/servicegroup/drivers/zk.py
+++ b/nova/servicegroup/drivers/zk.py
@@ -122,7 +122,14 @@ class ZooKeeperDriver(api.ServiceGroupDriver):
         monitor = self._monitors.get(group_id, None)
         if monitor is None:
             path = "%s/%s" % (CONF.zookeeper.sg_prefix, group_id)
-            monitor = membership.MembershipMonitor(self._session, path)
+
+            null = open(os.devnull, "w")
+            local_session = evzookeeper.ZKSession(CONF.zookeeper.address,
+                                                  recv_timeout=
+                                                  CONF.zookeeper.recv_timeout,
+                                                  zklog_fd=null)
+
+            monitor = membership.MembershipMonitor(local_session, path)
             self._monitors[group_id] = monitor
             # Note(maoy): When initialized for the first time, it takes a
             # while to retrieve all members from zookeeper. To prevent
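
The diff above builds a new session whenever a monitor is first created for a group, so the session is constructed after the WSGI server has started rather than in __init__(). One way to get the same deferral while sharing a single session is to create it lazily on first use. The sketch below is an illustration only: `LazySession` and `factory` are hypothetical names, and this is not presented as the change that was eventually merged.

```python
import threading


class LazySession(object):
    """Build a session on first use instead of at construction time.

    Hypothetical helper; `factory` is any zero-argument callable that
    returns a new session object.
    """

    def __init__(self, factory):
        self._factory = factory
        self._session = None
        self._lock = threading.Lock()

    def get(self):
        # The lock keeps concurrent first calls from building two sessions.
        with self._lock:
            if self._session is None:
                self._session = self._factory()
            return self._session


# Demonstration with a stand-in factory that records each call.
created = []


def fake_factory():
    created.append("session")
    return object()


lazy = LazySession(fake_factory)
calls_before_use = len(created)  # nothing is built at construction time
first = lazy.get()
second = lazy.get()              # reuses the session built by the first call
```

In the driver, the factory would wrap the same `evzookeeper.ZKSession(CONF.zookeeper.address, recv_timeout=CONF.zookeeper.recv_timeout, zklog_fd=null)` call that appears in the diff, and get_all() would pass `lazy.get()` to `membership.MembershipMonitor`.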

Paul Green (paul-green-u) wrote :

Update: I work for the same company that Jeff worked for when he entered this report. This problem still manifests itself in Havana and Icehouse. We have reproduced the issue using the following packages:

CentOS 6.5
Python 2.6
evzookeeper 0.4.0
zc-zookeeper-static 3.4.4

The stack trace of our current failures is similar to the original report.

The code change posted by Jeff and included with this report resolves the issue.

Changed in nova:
status: New → Confirmed
tags: added: compute
tags: added: low-hanging-fruit
Paul Green (paul-green-u) wrote :

Posting the patch as a file attachment.

Changed in nova:
assignee: nobody → Paul Green (paul-green-u)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/102639

Changed in nova:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Joe Gordon (<email address hidden>) on branch: master
Review: https://review.openstack.org/102639
Reason: Is this still active? The patch hasn't been updated in over a month. Feel free to restore this.

Sean Dague (sdague)
Changed in nova:
importance: Undecided → Low
Changed in nova:
assignee: Paul Green (paul-green-u) → Sean Dague (sdague)
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/102639
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f37905a3c1fd09597898c93a1cbc3050f335cf61
Submitter: Jenkins
Branch: master

commit f37905a3c1fd09597898c93a1cbc3050f335cf61
Author: Paul Green <email address hidden>
Date: Mon Jun 23 18:31:20 2014 -0400

    Fix service groups with zookeeper

    Service groups using zookeeper don't work due to apparently improper
    handling of the zookeeper session creation.

    No additional unit tests, as mocking these interfaces probably
    doesn't actually provide substantial future guarantees. The correct
    long term fix is to enable zk for unit testing upstream.

    Change-Id: I21691843cb4936d10a0b82df41aef3afe5bf2519
    Closes-bug: #1239864

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → juno-rc1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: juno-rc1 → 2014.2