Servicegroups: Multi process nova-conductor is unable to join servicegroups when zk driver is used

Bug #1389782 reported by Pawel Palucki
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Undecided
Assigned to: Michal Dulko
Milestone: 2015.1.0

Bug Description

I have found that when nova-conductor is run as multiple processes (the default), the worker processes share the parent's zookeeper handle, which causes a lock, probably inside zookeeper.c. Probably some internal zookeeper structures, such as sockets, are shared, and this is not allowed by zookeeper.

See the consequences below.

There is a similar, complementary bug with different effects (multiple unnecessary registrations and over-use of resources):

https://bugs.launchpad.net/nova/+bug/1382153

How to reproduce:
-----------------

devstack + ubuntu 14.04 + zookeeper 3.4.5

nova.conf:

[DEFAULT]
servicegroup_driver = zk

[conductor]
workers = 2

then run nova-conductor.

We can observe in the logs (with debug=True):

DEBUG evzookeeper.membership [req-xxx None None] Membership._join on /servicegroups/conductor/somehost

but the expected follow-up message never appears:

DEBUG evzookeeper.membership [req-xxx None None ] created zknode /servicegroups/conductor/somehost

We can confirm that the conductor node wasn't created in zookeeper:

/usr/share/zookeeper/bin/zkCli.sh ls /servicegroups

I investigated and found that the problem lies solely in the zookeeper C library implementation and is not caused by the python zookeeper bindings (evzookeeper).

Here is a little snippet that shows the program blocking when a zookeeper handle is used by a child process (it requires only a zookeeper server and python).

http://paste.openstack.org/show/129636/ (attached)
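
For reference, here is a minimal sketch of that kind of reproducer (not the paste itself), assuming the zkpython ("zookeeper") bindings and a zookeeper server on localhost:2181:

import os
import time
import zookeeper

ACL = [{"perms": zookeeper.PERM_ALL, "scheme": "world", "id": "anyone"}]

# The handle is opened in the parent process, before the fork.
handle = zookeeper.init("localhost:2181")
time.sleep(2)  # crude wait for the session to be established

pid = os.fork()
if pid == 0:
    # Child: reuses the parent's handle. This call blocks indefinitely;
    # fork() copies neither the C client's IO thread nor its completion
    # thread, so the create request is never actually sent to the server.
    zookeeper.create(handle, "/test-fork", "", ACL, zookeeper.EPHEMERAL)
    print("child: node created")  # never reached
    os._exit(0)
else:
    os.waitpid(pid, 0)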

We can check the logs on the zookeeper server and observe that the create request from the client isn't sent to the zookeeper server at all.

I tried digging into the internals of zookeeper.c, but I couldn't find a clue as to why it isn't working.

From the point of view of evzookeeper (the zk driver), the callback is never called, so the green thread just waits infinitely for a response.

Consequences
------------

Nova-conductor itself works fine (because communication with zookeeper happens in a background green thread), but:

a) the namespace /servicegroups/conductor isn't created in zookeeper (if the namespace wasn't created before)
b) the ephemeral node for conductors isn't created in the namespace (if the namespace somehow exists)

The effects from the perspective of OpenStack cluster are:

* effect a) causes internal exceptions in the nova-api service, so novaclient's 'service-list' and Horizon's "System Information"/"Compute services" don't work because of
  'NoNodeException: no node' exceptions followed by 'ServiceGroupUnavailable: The service from servicegroup driver ZooKeeperDriver is temporarily unavailable.'
  So it isn't possible to list any working services, only because the namespace for conductors wasn't prepared (in reality all services are working and zookeeper is up).

  Additionally, it causes an internal 500 TemplateSyntaxError in Horizon when trying to list all hypervisors at /admin/hypervisors/.

* effect b) causes service-list or "System Information" to give a false negative: it shows a service as down when in reality the service is working

AFAIK only nova-conductor is affected by this for now, because it is the only nova service that passes a `workers` argument to openstack.common.service.launch(server, workers) and is based on service.Service (not WSGIService).
If workers > 1, the `launch` function starts the service via ProcessLauncher, which is responsible for forking. The problem is that the service object is created, with an already-initialized zk driver object, in the parent process.
The zk driver object is thus initialized with a connection (handle) that will be shared by the child processes. Then, in Service.start (in the fork), the attempt to join the servicegroup doesn't work.

I checked how sharing a common resource (socket) affects the other drivers. It's not a problem for the memcache or db driver, because the connection to memcache/db is created in a lazy manner (the connection/socket isn't created until required by the child process).

Possible solutions:
1. simple but not clean: initialize the zookeeper driver in a lazy manner (like the db/memcache drivers), so each process creates its own handle to zookeeper, ignoring the problem that each process then tries to create the same node in zookeeper (see the sketch after this list)
2. refactor the base nova.service.Service so that only the parent process is responsible for joining the servicegroups - requires a lot of work and maybe even a blueprint
3. based on the first solution, but with the difference that the parent process registers the parent node (host) and each subprocess registers a subnode (pid), for example /servicegroups/conductor/HOST/PID - then get_all shouldn't check whether the HOST node exists but whether it is empty
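
As a rough illustration of solution 1, a minimal sketch of lazy driver initialization; the internals shown here (_session, _init_session) are hypothetical names, not the actual nova.servicegroup.drivers.zk code:

class ZooKeeperDriver(object):
    def __init__(self, conf):
        self._conf = conf
        # Do NOT connect here: __init__ runs in the parent process,
        # before ProcessLauncher forks the workers.
        self._session = None

    @property
    def _api(self):
        # First access happens in Service.start(), i.e. inside the child
        # process, so each worker opens its own handle/socket instead of
        # inheriting the parent's.
        if self._session is None:
            self._session = self._init_session()  # hypothetical helper
        return self._session

This mirrors how the memcache and db drivers avoid the problem: no socket exists yet at fork time, so nothing is shared.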

The problem with zookeeper and forking isn't new for OpenStack:

http://qnalist.com/questions/27169/how-to-deal-with-fork-properly-when-using-the-zkc-mt-lib

but the right solution wasn't found.

Changed in nova:
assignee: nobody → Pawel Palucki (pawel-palucki-q)
tags: added: servicegroups
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/133479

Changed in nova:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/133500

Changed in nova:
assignee: Pawel Palucki (pawel-palucki) → Sean Dague (sdague)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/133500
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e61bf70146c47a99394a143c598ebd73409eca47
Submitter: Jenkins
Branch: master

commit e61bf70146c47a99394a143c598ebd73409eca47
Author: Pawel Palucki <email address hidden>
Date: Fri Nov 7 14:41:49 2014 +0100

    Fix conductor processes race trying to join servicegroup (zk driver)

    When conductor is run in a multi-process manner and the zk (zookeeper)
    driver is used as the servicegroup driver, there is a problem: each
    process tries to manage its own Membership object for the same
    zookeeper path.

    This ends with the following exception being raised:

    RuntimeError: Duplicated membership name /servicegroups/conductor/MEMBER_ID

    The zookeeper driver uses the Membership (evzookeeper) class with a path
    derived from the service type, and AFAIK it isn't correct for many
    processes to be responsible for the same ephemeral node. From my
    research it is not supported by evzookeeper (the Membership class) - so
    we can either ignore the exception or give each process its own node.

    If we ignore the exception (silence it), then when the first registered
    process dies and the ephemeral node disappears, another process will
    create it. That would work, but it hides information about the overall
    structure of services and also causes each process to endlessly try to
    create the node (sending invalid create-node requests to zookeeper).
    IMO it is not a clean solution.

    So there is another solution: give each process its own node. This
    fix does that.

    The best unique identifier for a process is its pid, so the chosen
    solution reorganizes the structure of the zookeeper tree by adding one
    more level with process ids.

    The zookeeper tree before the fix looks like this:

    /servicegroups/SERVICE/MEMBER

    and after the fix the path looks like this:

    /servicegroups/SERVICE/MEMBER/PID
    e.g.
    /servicegroups/conductor/foo/12345

    This solution also assumes that the servicegroup driver will not check
    for the existence of the member node, but for the existence of its
    subnodes (pids) - which corresponds to the existence of processes of
    the given service.

    In general we will have more granular information about the whole
    system - for example we can check the number of processes of a given
    service on each node.

    To answer the question "does the service on a given node work?", we
    have to check the number of ephemeral "pid" nodes in the get_all()
    method.

    Closes-bug: #1390511

    Related-bug: #1389782
    Related-bug: #1382153

    Change-Id: I478845b6921dcfb9e9af5a45283a8569051b4f4f
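
A hypothetical sketch of the membership check described in the commit above (get_children stands in for whatever the evzookeeper-based driver actually calls):

def get_all(zk, group_id):
    # A member is considered alive iff its node has at least one
    # ephemeral "pid" child (one per running worker process).
    alive = []
    for member in zk.get_children("/servicegroups/%s" % group_id):
        path = "/servicegroups/%s/%s" % (group_id, member)
        if zk.get_children(path):
            alive.append(member)
    return alive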

Changed in nova:
assignee: Sean Dague (sdague) → Michal Dulko (michal-dulko-f)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/133479
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=afe86b6f29033a472cab1b52dd0724bb3c6dfb82
Submitter: Jenkins
Branch: master

commit afe86b6f29033a472cab1b52dd0724bb3c6dfb82
Author: Michal Dulko <email address hidden>
Date: Wed Feb 4 12:44:12 2015 +0100

    Fix conductor servicegroup joining when zk driver is used

    When conductor is run as multiprocess (the default on a multi-core
    system) and zk (zookeeper) is used as the servicegroup_driver, the
    conductor is unable to join the servicegroup because the zookeeper
    handle (and probably the socket) is shared between the parent and
    child processes.

    The problem was found to lie in the zookeeper C library implementation.
    Proof can be seen in related bug #1389782.

    This fix follows the idea used by the memcache and db drivers: the
    servicegroup_api._driver object is used in a lazy manner. This means
    that, like the connection to memcache and the session to the database,
    the zookeeper handle (the zk session in the driver) isn't created until
    required by a worker (child process).

    Additional note: before the fix, the prefix in zookeeper was created
    during Service object creation. That was probably the reason the
    session was established so early. In my opinion this eagerness is not
    necessary and the namespace can be created by the child process as
    well.

    Closes-Bug: #1389782

    Related-bug: #1390511
    Related-bug: #1382153

    Change-Id: I9b386ef1f9268d19d04879ec89e5684170f3862a

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → kilo-3
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: kilo-3 → 2015.1.0