Detaching health policy fails to remove health check

Bug #1811161 reported by Duc Truong
This bug affects 1 person
Affects: senlin
Status: Fix Released
Importance: Undecided
Assigned to: Duc Truong
Milestone: (none listed)

Bug Description

Steps to reproduce:
1. Create cluster with min size 1 and desired capacity 1
2. Create health policy and attach to cluster
3. Scale-in cluster
4. Detach health policy. This will generate a traceback in logs:

2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base Traceback (most recent call last):
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base File "/opt/stack/senlin/senlin/engine/actions/base.py", line 646, in ActionProc
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base result, reason = action.execute()
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base File "/opt/stack/senlin/senlin/engine/actions/cluster_action.py", line 1185, in execute
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base res, reason = self._execute(**kwargs)
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base File "/opt/stack/senlin/senlin/engine/actions/cluster_action.py", line 1152, in _execute
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base result, reason = method()
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base File "/usr/local/lib/python2.7/dist-packages/osprofiler/profiler.py", line 159, in wrapper
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base result = f(*args, **kwargs)
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base File "/opt/stack/senlin/senlin/engine/actions/cluster_action.py", line 1063, in do_detach_policy
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base res, reason = self.entity.detach_policy(self.context, policy_id)
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base File "/opt/stack/senlin/senlin/engine/cluster.py", line 411, in detach_policy
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base res, reason = policy.detach(self)
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base File "/opt/stack/senlin/senlin/policies/health_policy.py", line 404, in detach
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base ret = health_manager.unregister(cluster.id)
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base File "/opt/stack/senlin/senlin/engine/health_manager.py", line 828, in unregister
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base return notify(engine_id, 'unregister_cluster', cluster_id=cluster_id)
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base File "/opt/stack/senlin/senlin/engine/health_manager.py", line 806, in notify
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base call_context.call(ctx, method, **kwargs)
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 179, in call
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base retry=self.retry)
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 128, in _send
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base retry=retry)
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 645, in send
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base call_monitor_timeout, retry=retry)
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 636, in _send
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base raise result
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base ValueError: list.remove(x): x not in list

The scale-in request fails because it would take the cluster size below the min size. Before attempting the scale-in, the operation disables the cluster's health checks, but it does not re-enable them after the scale-in fails. Detaching the health policy then tries to remove a health check that has already been disabled, which raises the ValueError shown in the traceback above.
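
The failure mode can be sketched in a few lines of Python. This is only an illustration, not Senlin's code: `TimerGroup`, `remove_strict`, `remove_tolerant` and `reproduce` are hypothetical names, and a plain list stands in for whatever structure holds the health check timers. It shows why a second removal raises `ValueError`, and how tolerating that error (as the eventual fix describes) makes detach succeed.

```python
class TimerGroup:
    """Hypothetical stand-in for the container that holds health check timers."""

    def __init__(self):
        self.timers = []

    def add_timer(self, timer):
        self.timers.append(timer)

    def remove_strict(self, timer):
        # list.remove raises ValueError when the timer is no longer present.
        self.timers.remove(timer)

    def remove_tolerant(self, timer):
        # Treat "already removed" as success instead of failing the caller.
        try:
            self.timers.remove(timer)
        except ValueError:
            pass


def reproduce(remove):
    """Replay the bug's sequence; return the error message, or None on success."""
    group = TimerGroup()
    timer = "health-check-timer"  # placeholder object for a real timer
    group.add_timer(timer)

    # Scale-in disables the health check before running.
    remove(group, timer)
    # The scale-in fails here, and nothing re-enables the health check.

    # Detaching the policy tries to remove the same timer a second time.
    try:
        remove(group, timer)
    except ValueError as exc:
        return str(exc)
    return None


strict_error = reproduce(TimerGroup.remove_strict)
tolerant_error = reproduce(TimerGroup.remove_tolerant)
```

With the strict removal, `strict_error` holds the `ValueError` message; with the tolerant removal, `tolerant_error` is `None` and the detach path completes.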

Duc Truong (dtruong)
Changed in senlin:
assignee: nobody → Duc Truong (dtruong)
status: New → In Progress
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to senlin (master)

Fix proposed to branch: master
Review: https://review.openstack.org/629689

OpenStack Infra (hudson-openstack) wrote : Fix merged to senlin (master)

Reviewed: https://review.openstack.org/629689
Committed: https://git.openstack.org/cgit/openstack/senlin/commit/?id=f2fc46ddc4292c03e80182c8037542aeb868b7ea
Submitter: Zuul
Branch: master

commit f2fc46ddc4292c03e80182c8037542aeb868b7ea
Author: Duc Truong <email address hidden>
Date: Thu Jan 10 00:03:13 2019 +0000

    Enable health checks after failed operation

    - Always call policy post_op and set 'action_result' before post_op
      call for both cluster actions and node actions.
    - Each policy needs to decide inside post_op if it needs to perform its
      operation depending on action_result
    - Ignore ValueError exception when removing timer from threadgroup

    Change-Id: I9d5880f8e5aa12792eabe7509b2bb5626e27179c
    Closes-Bug: #1811161
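
The control flow described in the commit message might look roughly like the sketch below. It is a simplified illustration under stated assumptions, not the merged change: `ActionSketch`, `HealthPolicySketch`, `AuditPolicySketch`, the `RES_OK`/`RES_ERROR` constants, and storing the result under `action.data['action_result']` are all hypothetical choices made for the example. The point is that `post_op` runs even when the operation fails, and each policy inspects the result to decide what to do.

```python
RES_OK = "OK"        # hypothetical result constants
RES_ERROR = "ERROR"


class HealthPolicySketch:
    """Re-enables health checks in post_op whether or not the action succeeded."""

    def __init__(self):
        self.checks_enabled = True

    def pre_op(self, action):
        # Pause health checks while the cluster is being modified.
        self.checks_enabled = False

    def post_op(self, action):
        # Always resume checks, including after a failed operation.
        self.checks_enabled = True


class AuditPolicySketch:
    """Example of a policy that only acts when the action succeeded."""

    def __init__(self):
        self.recorded = []

    def pre_op(self, action):
        pass

    def post_op(self, action):
        # The policy decides for itself based on the recorded result.
        if action.data.get("action_result") == RES_OK:
            self.recorded.append(action.name)


class ActionSketch:
    def __init__(self, name, policies, operation):
        self.name = name
        self.policies = policies
        self.operation = operation  # callable returning RES_OK or RES_ERROR
        self.data = {}

    def execute(self):
        for policy in self.policies:
            policy.pre_op(self)
        result = self.operation()
        # Record the result before post_op so policies can read it.
        self.data["action_result"] = result
        # Call post_op unconditionally, not only on success.
        for policy in self.policies:
            policy.post_op(self)
        return result


health = HealthPolicySketch()
audit = AuditPolicySketch()
failed = ActionSketch("scale_in", [health, audit], lambda: RES_ERROR)
failed_result = failed.execute()
```

After the failed scale-in, `health.checks_enabled` is `True` again, so a later detach finds the health check in the state it expects, while `audit.recorded` stays empty because that policy chose to skip failed actions.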

Changed in senlin:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/senlin 7.0.0.0b1

This issue was fixed in the openstack/senlin 7.0.0.0b1 development milestone.
