Servicegroups: Multi process nova-conductor is unable to join servicegroups when zk driver is used
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | Undecided | Michal Dulko |
Bug Description
I have found that nova-conductor, when run as multiple processes (the default), shares its zookeeper handle between processes, which causes a lock, probably inside zookeeper.c. Probably some internal zookeeper structures, such as sockets, end up shared, and zookeeper does not allow this.
See the Consequences section below.
There is a similar, complementary bug, but with different effects: multiple unnecessary registrations and over-use of resources.
https:/
How to reproduce:
-----------------
devstack + ubuntu 14.04 + zookeeper 3.4.5
nova.conf:
[DEFAULT]
servicegroup_driver = zk
[conductor]
workers = 2
then run nova-conductor.
We can observe in the logs (with debug=True):
DEBUG evzookeeper.
but the expected follow-up line never appears:
DEBUG evzookeeper.
We can check that zookeeper conductor node wasn't created:
/usr/share/
My investigation indicates that the problem lies only in the zookeeper C library implementation and is not caused by the python zookeeper bindings or evzookeeper.
Here is a small snippet showing that a program blocks when the zookeeper handle is used by a child process (it requires only a zookeeper server and python).
http://
We can check the zookeeper-server logs and observe that the creation request from the client is never sent to zookeeper-server at all.
I tried to dig into the internals of zookeeper.c but couldn't find a clue as to why it isn't working.
From the point of view of evzookeeper (zk.driver), the callback is never called, so the green thread just waits indefinitely for a response.
Consequences
------------
Nova-conductor itself works fine (because communication with zookeeper happens in a background green thread), but:
a) the namespace in zookeeper /servicegroups/
b) the ephemeral node for conductors in the namespace isn't created (if the namespace somehow exists)
The effects from the perspective of OpenStack cluster are:
* effect of a) causes internal exceptions in nova-api service and therefore 'novaclient service-list' and horizon/"System Information"
exceptions 'NoNodeException: no node' followed by 'ServiceGroupUn
So it isn't possible to list any working services, only because the namespace for conductors wasn't prepared (in reality all services are working, and zookeeper is working).
Additionally, it causes an internal 500 TemplateSyntaxError in horizon when trying to list all hypervisors at /admin/
* effect of b) makes service-list or "System Information" give a false negative: the service is shown as down when in reality it is working
AFAIK only nova-conductor is affected by this for now, because it is the only nova service that passes the `workers` argument to openstack.
If workers>1, the `launch` function starts the service via ProcessLauncher. ProcessLauncher is responsible for forking. The problem is that the service object has already been created, with an initialized zk driver object, in the parent process.
The zk driver object is already initialized with a connection (handle) that will be shared by the child processes. Then, in Service.start (after the fork), there is an attempt to join the servicegroup, which doesn't work.
I checked how sharing a common resource (socket) affects the other drivers. It's not a problem for the memcache or db drivers, because the connection to memcache/db is created lazily (the connection/socket isn't created until a child process requires it).
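The difference between eager and lazy initialization across a fork can be sketched in plain Python. This is only an illustration under assumptions: `FakeHandle`, `EagerDriver` and `LazyDriver` are hypothetical stand-ins, not the nova servicegroup drivers or the zookeeper bindings, and the "handle" merely records the pid of the process that created it.

```python
import os


class FakeHandle(object):
    """Hypothetical stand-in for a connection; remembers its creator's pid."""

    def __init__(self):
        self.owner_pid = os.getpid()


class EagerDriver(object):
    """Creates the handle at construction time, i.e. in the parent process."""

    def __init__(self):
        self.handle = FakeHandle()

    def get_handle(self):
        return self.handle


class LazyDriver(object):
    """Creates the handle on first use, i.e. in whichever process uses it."""

    def __init__(self):
        self._handle = None

    def get_handle(self):
        if self._handle is None:
            self._handle = FakeHandle()
        return self._handle


def handle_owner_seen_by_child(driver):
    """Fork, then return (child_pid, owner_pid of the handle the child used)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: report which process created the handle it is about to use.
        os.close(read_fd)
        os.write(write_fd, str(driver.get_handle().owner_pid).encode())
        os.close(write_fd)
        os._exit(0)
    # Parent: read the child's report and reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        owner = int(pipe.read())
    os.waitpid(pid, 0)
    return pid, owner


if __name__ == "__main__":
    child, owner = handle_owner_seen_by_child(EagerDriver())
    print("eager: child", child, "uses handle created by", owner)
    child, owner = handle_owner_seen_by_child(LazyDriver())
    print("lazy: child", child, "uses handle created by", owner)
```

With the eager driver the child reuses a handle created by the parent, which is the situation described for the zk driver; with the lazy driver each child ends up with a handle of its own.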
Possible solutions:
1. simple but not clean: initialize the zookeeper driver lazily (like db/memcache), so each process creates its own handle to zookeeper, ignoring the problem that each process tries to create the same node in zookeeper
2. refactor base nova.service.
3. based on the first solution, but with the difference that the parent process registers the parent node (host) and each subprocess registers a subnode (pid), for example: /servicesgroups
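The third option can be sketched with an in-memory stand-in for the node tree. Everything here is an assumption made for illustration: the `<group>/<host>/<pid>` layout under the `/servicegroups` prefix, the `FakeTree` class (which is not a zookeeper client), and the rule that a host counts as up while at least one worker subnode exists.

```python
ROOT = "/servicegroups"  # prefix from the report; the layout below it is assumed


def host_path(group, host):
    """Path of the node the parent process would register."""
    return "/".join([ROOT, group, host])


def worker_path(group, host, pid):
    """Path of the subnode a forked worker would register."""
    return "/".join([host_path(group, host), str(pid)])


class FakeTree(object):
    """In-memory stand-in for a node store; not a zookeeper client."""

    def __init__(self):
        self.nodes = set()

    def create(self, path):
        if path in self.nodes:
            raise ValueError("node exists: %s" % path)
        self.nodes.add(path)

    def children(self, path):
        """Return the names of the direct children of `path`, sorted."""
        prefix = path + "/"
        names = [p[len(prefix):] for p in self.nodes if p.startswith(prefix)]
        return sorted(n for n in names if "/" not in n)


def host_is_up(tree, group, host):
    """Assumed rule: a host is up while at least one worker subnode exists."""
    return bool(tree.children(host_path(group, host)))
```

In this sketch the parent would call `tree.create(host_path("conductor", "node1"))` once, and each worker would call `tree.create(worker_path("conductor", "node1", os.getpid()))` after the fork, so no two processes ever try to create the same node.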
The problem with zookeeper and forking isn't new for openstack:
http://
but the right solution wasn't found.
Changed in nova: | |
assignee: | nobody → Pawel Palucki (pawel-palucki-q) |
description: | updated |
tags: | added: servicegroups |
Changed in nova: | |
assignee: | Pawel Palucki (pawel-palucki) → Sean Dague (sdague) |
Changed in nova: | |
assignee: | Sean Dague (sdague) → Michal Dulko (michal-dulko-f) |
Changed in nova: | |
milestone: | none → kilo-3 |
status: | Fix Committed → Fix Released |
Changed in nova: | |
milestone: | kilo-3 → 2015.1.0 |
Fix proposed to branch: master
Review: https://review.openstack.org/133479