senlin

Detaching health policy fails to remove health check

Bug #1811161 reported by Duc Truong on 2019-01-09

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
senlin	Fix Released	Undecided	Duc Truong

Bug Description

Steps to reproduce:
1. Create a cluster with min size 1 and desired capacity 1.
2. Create a health policy and attach it to the cluster.
3. Scale in the cluster.
4. Detach the health policy. This generates the following traceback in the logs:

2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base Traceback (most recent call last):
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base File "/opt/stack/senlin/senlin/engine/actions/base.py", line 646, in ActionProc
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base result, reason = action.execute()
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base File "/opt/stack/senlin/senlin/engine/actions/cluster_action.py", line 1185, in execute
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base res, reason = self._execute(**kwargs)
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base File "/opt/stack/senlin/senlin/engine/actions/cluster_action.py", line 1152, in _execute
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base result, reason = method()
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base File "/usr/local/lib/python2.7/dist-packages/osprofiler/profiler.py", line 159, in wrapper
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base result = f(*args, **kwargs)
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base File "/opt/stack/senlin/senlin/engine/actions/cluster_action.py", line 1063, in do_detach_policy
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base res, reason = self.entity.detach_policy(self.context, policy_id)
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base File "/opt/stack/senlin/senlin/engine/cluster.py", line 411, in detach_policy
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base res, reason = policy.detach(self)
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base File "/opt/stack/senlin/senlin/policies/health_policy.py", line 404, in detach
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base ret = health_manager.unregister(cluster.id)
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base File "/opt/stack/senlin/senlin/engine/health_manager.py", line 828, in unregister
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base return notify(engine_id, 'unregister_cluster', cluster_id=cluster_id)
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base File "/opt/stack/senlin/senlin/engine/health_manager.py", line 806, in notify
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base call_context.call(ctx, method, **kwargs)
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 179, in call
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base retry=self.retry)
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 128, in _send
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base retry=retry)
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 645, in send
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base call_monitor_timeout, retry=retry)
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 636, in _send
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base raise result
2019-01-09 23:09:13.645 TRACE senlin.engine.actions.base ValueError: list.remove(x): x not in list

The scale-in fails because it would shrink the cluster below its min size of 1. Before executing, the scale-in operation disables the cluster's health checks; when the scale-in then fails, the health checks are not re-enabled. As a result, detaching the health policy fails with the traceback above: the health check was already disabled, so the health manager's attempt to unregister it (removing a timer that is no longer in its list) raises ValueError.
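A minimal Python model of this sequence may make the mechanism easier to see. It is a sketch, not Senlin code: `ToyHealthManager`, `scale_in` and the plain list of timers are hypothetical stand-ins, and the model assumes (as the fix's commit message suggests) that disabling a health check removes its timer from a list, and that unregistering later tries to remove the same timer again.

```python
class ToyHealthManager:
    """Hypothetical stand-in for the health manager's timer bookkeeping."""

    def __init__(self):
        self.timers = []    # timers for health checks that are currently active
        self.registry = {}  # cluster_id -> that cluster's timer

    def register(self, cluster_id):
        timer = object()    # placeholder for a real periodic timer
        self.registry[cluster_id] = timer
        self.timers.append(timer)

    def disable(self, cluster_id):
        # Pausing checks removes the timer from the active list.
        self.timers.remove(self.registry[cluster_id])

    def enable(self, cluster_id):
        self.timers.append(self.registry[cluster_id])

    def unregister(self, cluster_id):
        timer = self.registry.pop(cluster_id)
        # Raises ValueError if the timer was already removed by disable().
        self.timers.remove(timer)


def scale_in(hm, cluster_id, size, min_size, count=1):
    """Buggy flow (hypothetical): on failure, checks are never re-enabled."""
    hm.disable(cluster_id)             # before the action: pause health checks
    if size - count < min_size:
        return False, "would shrink cluster below min_size"
    hm.enable(cluster_id)              # reached only when the action succeeds
    return True, "ok"


hm = ToyHealthManager()
hm.register("c1")
print(scale_in(hm, "c1", size=1, min_size=1))  # step 3: scale-in fails
try:
    hm.unregister("c1")                        # step 4: detach the policy
except ValueError as exc:
    print("detach failed:", exc)
```

Running it prints the failed scale-in result and then the ValueError raised by the detach step, mirroring the last line of the traceback.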

Duc Truong (dtruong) on 2019-01-09

Changed in senlin:
assignee:	nobody → Duc Truong (dtruong)
status:	New → In Progress
description:	updated

OpenStack Infra (hudson-openstack) wrote on 2019-01-10: Fix proposed to senlin (master)

Fix proposed to branch: master
Review: https://review.openstack.org/629689

OpenStack Infra (hudson-openstack) wrote on 2019-01-29: Fix merged to senlin (master)

Reviewed: https://review.openstack.org/629689
Committed: https://git.openstack.org/cgit/openstack/senlin/commit/?id=f2fc46ddc4292c03e80182c8037542aeb868b7ea
Submitter: Zuul
Branch: master

commit f2fc46ddc4292c03e80182c8037542aeb868b7ea
Author: Duc Truong <email address hidden>
Date: Thu Jan 10 00:03:13 2019 +0000

Enable health checks after failed operation

    - Always call policy post_op and set 'action_result' before post_op
      call for both cluster actions and node actions.
    - Each policy needs to decide inside post_op if it needs to perform its
      operation depending on action_result
    - Ignore ValueError exception when removing timer from threadgroup

Change-Id: I9d5880f8e5aa12792eabe7509b2bb5626e27179c
Closes-Bug: #1811161
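The three changes listed in the commit can be sketched on the same kind of toy model. This is a hypothetical illustration, not the code from review 629689: `ToyHealthManager`, `run_action`, `HealthPolicyToy`, `SuccessOnlyPolicyToy` and the `"OK"`/`"ERROR"` result strings are invented here. The health policy is assumed to re-enable checks regardless of the result, which is what the commit title ("Enable health checks after failed operation") describes; the second policy shows one that skips its work when the action failed.

```python
class ToyHealthManager:
    """Hypothetical stand-in for the health manager's timer bookkeeping."""

    def __init__(self):
        self.timers = []    # timers for health checks that are currently active
        self.registry = {}  # cluster_id -> that cluster's timer

    def register(self, cluster_id):
        timer = object()
        self.registry[cluster_id] = timer
        self.timers.append(timer)

    def disable(self, cluster_id):
        self.timers.remove(self.registry[cluster_id])

    def enable(self, cluster_id):
        self.timers.append(self.registry[cluster_id])

    def unregister(self, cluster_id):
        timer = self.registry.pop(cluster_id)
        try:
            self.timers.remove(timer)
        except ValueError:
            pass  # commit bullet 3: timer already gone, nothing to remove


class HealthPolicyToy:
    def __init__(self, hm):
        self.hm = hm

    def pre_op(self, cluster_id, action):
        self.hm.disable(cluster_id)

    def post_op(self, cluster_id, action):
        # Commit bullet 2: this policy re-enables checks whatever the result.
        self.hm.enable(cluster_id)


class SuccessOnlyPolicyToy:
    def __init__(self):
        self.applied = 0

    def pre_op(self, cluster_id, action):
        pass

    def post_op(self, cluster_id, action):
        # Commit bullet 2: a policy may skip its work when the action failed.
        if action["action_result"] == "OK":
            self.applied += 1


def run_action(cluster_id, policies, body):
    action = {"action_result": None}
    for policy in policies:
        policy.pre_op(cluster_id, action)
    ok = body()
    # Commit bullet 1: record the result first, then always call post_op.
    action["action_result"] = "OK" if ok else "ERROR"
    for policy in policies:
        policy.post_op(cluster_id, action)
    return ok


hm = ToyHealthManager()
hm.register("c1")
other = SuccessOnlyPolicyToy()
# A scale-in that fails (e.g. below min_size) still re-enables health checks.
ok = run_action("c1", [HealthPolicyToy(hm), other], body=lambda: False)
print(ok, len(hm.timers), other.applied)  # prints: False 1 0
hm.unregister("c1")  # detaching the health policy now succeeds
```

Because `action_result` is set before any `post_op` runs, each policy can make its own decision, and the `ValueError` guard in `unregister` keeps a detach from failing even if a timer was already removed by some other path.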

Changed in senlin:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2019-02-28: Fix included in openstack/senlin 7.0.0.0b1

This issue was fixed in the openstack/senlin 7.0.0.0b1 development milestone.

Bug watches keep track of this bug in other bug trackers.