Node check cannot properly handle resource deletion

Bug #1680758 reported by yangyide
This bug affects 1 person
Affects   Status         Importance   Assigned to   Milestone
senlin    Fix Released   High         yangyide

Bug Description

My Senlin version is master.

My cluster has a health policy (health_policy_poll) attached. When I delete a VM that belongs to this cluster, the node is not automatically recovered by the policy.

The error log is below. I expect the node to be recovered automatically, because that is the purpose of a health policy.

2017-04-07 15:04:32.331 21107 INFO senlin.engine.service [req-9801a74c-e95d-4ed3-9748-ff8feabbfdc5 - - - - -] Checking cluster '25018444-961b-4301-a86a-52452ec6a718'.
2017-04-07 15:04:33.738 21107 INFO senlin.engine.service [req-9801a74c-e95d-4ed3-9748-ff8feabbfdc5 - - - - -] Cluster check action queued: 21dd55d7-eb13-43be-a30d-14039f27fa15.
2017-04-07 15:04:35.103 21107 INFO senlin.engine.event [req-9801a74c-e95d-4ed3-9748-ff8feabbfdc5 - - - - -] test_cluster [25018444] CLUSTER_CHECK - start: None
2017-04-07 15:04:36.205 21107 INFO senlin.engine.event [req-9801a74c-e95d-4ed3-9748-ff8feabbfdc5 - - - - -] node-25018444-232 [81187992] NODE_CHECK - start: None
2017-04-07 15:04:36.578 21107 ERROR senlin.drivers.openstack.sdk [req-9801a74c-e95d-4ed3-9748-ff8feabbfdc5 - - - - -] ResourceNotFound: No Server found for 42377008-aca4-4bb0-829e-5afcf9c93a61, Instance 42377008-aca4-4bb0-829e-5afcf9c93a61 could not be found.
2017-04-07 15:04:36.578 21107 ERROR senlin.drivers.openstack.sdk Traceback (most recent call last):
2017-04-07 15:04:36.578 21107 ERROR senlin.drivers.openstack.sdk File "/opt/senlin/senlin/drivers/openstack/sdk.py", line 96, in invoke_with_catch
2017-04-07 15:04:36.578 21107 ERROR senlin.drivers.openstack.sdk return func(driver, *args, **kwargs)
2017-04-07 15:04:36.578 21107 ERROR senlin.drivers.openstack.sdk File "/opt/senlin/senlin/drivers/openstack/nova_v2.py", line 50, in server_get
2017-04-07 15:04:36.578 21107 ERROR senlin.drivers.openstack.sdk return self.conn.compute.get_server(server)
2017-04-07 15:04:36.578 21107 ERROR senlin.drivers.openstack.sdk File "/usr/lib/python2.7/site-packages/openstack/compute/v2/_proxy.py", line 371, in get_server
2017-04-07 15:04:36.578 21107 ERROR senlin.drivers.openstack.sdk return self._get(_server.Server, server)
2017-04-07 15:04:36.578 21107 ERROR senlin.drivers.openstack.sdk File "/usr/lib/python2.7/site-packages/openstack/proxy2.py", line 37, in check
2017-04-07 15:04:36.578 21107 ERROR senlin.drivers.openstack.sdk return method(self, expected, actual, *args, **kwargs)
2017-04-07 15:04:36.578 21107 ERROR senlin.drivers.openstack.sdk File "/usr/lib/python2.7/site-packages/openstack/proxy2.py", line 225, in _get
2017-04-07 15:04:36.578 21107 ERROR senlin.drivers.openstack.sdk http_status=e.http_status, cause=e.cause)
2017-04-07 15:04:36.578 21107 ERROR senlin.drivers.openstack.sdk ResourceNotFound: ResourceNotFound: No Server found for 42377008-aca4-4bb0-829e-5afcf9c93a61, Instance 42377008-aca4-4bb0-829e-5afcf9c93a61 could not be found.
2017-04-07 15:04:36.578 21107 ERROR senlin.drivers.openstack.sdk
2017-04-07 15:04:36.705 21107 ERROR senlin.engine.event [req-9801a74c-e95d-4ed3-9748-ff8feabbfdc5 - - - - -] node-25018444-232 [81187992] NODE_CHECK - error: Node check failed.
2017-04-07 15:04:37.082 21107 WARNING senlin.engine.health_manager [req-9801a74c-e95d-4ed3-9748-ff8feabbfdc5 - - - - -] Cluster check action failed

yangyide (yangyide01) wrote :
Changed in senlin:
assignee: nobody → yangyide (yangyide01)
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to senlin (master)

Fix proposed to branch: master
Review: https://review.openstack.org/455503

Changed in senlin:
status: New → In Progress
Qiming Teng (tengqim) wrote : Re: health_policy_poll cannot recover node which vm was deleted

Please describe how you deleted the VM.

yangyide (yangyide01) wrote :

I deleted it directly with nova delete vm-uuid. In other words, if someone else deletes my servers while my cluster has a health policy attached, the policy should recover my nodes automatically.

yangyide (yangyide01) wrote :

I think do_check in senlin/profiles/os/nova/server.py should work as follows: if the checked VM exists in the nova instances table, the function should return true when the VM status is active and false otherwise, even if the VM was deleted. If the VM does not exist in the nova instances table, the function should raise an exception.

Qiming Teng (tengqim) wrote :

To clarify, Senlin never peeks into nova's (or any other service's) DB tables. We don't care how data is organized and stored by the corresponding service. We do, however, trust a service's public APIs.

In this context, a GET on nova's servers/<server_id> returns 404 if the server has been deleted. openstacksdk captures that response and raises a NotFound exception. Senlin then translates the exception type to keep the code easier to manage.
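The translation step described above can be sketched roughly as follows. This is only an illustration, not Senlin's actual code: SDKNotFound, InternalError, translate_exception and the dictionary-backed server_get are hypothetical stand-ins, and the sketch assumes the SDK error exposes an http_status attribute (as the traceback in the description suggests).

```python
import functools


class SDKNotFound(Exception):
    """Hypothetical stand-in for the SDK's 'resource not found' error."""

    def __init__(self, message, http_status=404):
        super().__init__(message)
        self.http_status = http_status


class InternalError(Exception):
    """Hypothetical stand-in for the engine's own exception type."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code
        self.message = message


def translate_exception(func):
    """Re-raise SDK errors as a single engine-level exception type."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except SDKNotFound as ex:
            # Keep the HTTP status so callers can tell a 404 from other errors.
            raise InternalError(code=ex.http_status, message=str(ex)) from ex

    return wrapper


@translate_exception
def server_get(server_id, known_servers):
    """Look up a server in a dict that plays the role of the compute API."""
    if server_id not in known_servers:
        raise SDKNotFound("No Server found for %s" % server_id)
    return known_servers[server_id]
```

With this shape, callers only ever handle InternalError and can inspect its code to decide whether the resource is gone.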

There is no scenario in which a nova server is gone but Senlin misunderstands that.

The problem you described could be a bug in how Senlin treats a ResourceNotFound exception when it does a node check. If the node is not found, we should still treat it as a node failure and proceed with a recovery.
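A minimal sketch of that behavior, under stated assumptions: the ResourceNotFound class, fetch_status, check_node and nodes_to_recover below are hypothetical names, and a plain dict stands in for the compute service. It is not the patch that was merged, only an illustration of treating "not found" as an unhealthy node rather than a failed check.

```python
class ResourceNotFound(Exception):
    """Hypothetical stand-in for the translated 'not found' error."""


def fetch_status(server_id, servers):
    """Hypothetical lookup that raises when the server no longer exists."""
    try:
        return servers[server_id]
    except KeyError:
        raise ResourceNotFound("No Server found for %s" % server_id) from None


def check_node(server_id, servers):
    """Return True only if the server exists and is ACTIVE."""
    try:
        status = fetch_status(server_id, servers)
    except ResourceNotFound:
        # A deleted server means an unhealthy node, not a broken check.
        return False
    return status == "ACTIVE"


def nodes_to_recover(node_ids, servers):
    """Collect every node whose check reports it as unhealthy."""
    return [n for n in node_ids if not check_node(n, servers)]
```

Here a node whose server was deleted ends up in the recovery list alongside nodes in an error state, so the cluster check can complete and hand the list to the recovery step.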

With that said, the patch proposed is an incorrect solution.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to senlin (master)

Fix proposed to branch: master
Review: https://review.openstack.org/456187

yangyide (yangyide01) wrote : Re: health_policy_poll cannot recover node which vm was deleted

I don't know why the do_check function in senlin/profiles/os/nova/server.py raises an exception. Can we return false when an exception occurs?

Qiming Teng (tengqim) wrote :

The reason we raise exceptions instead of returning False is that:

1. An exception can carry more contextual information about the error that occurred.

2. A 'False' return value may or may not convey the complete message. Sometimes it causes confusion.
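The two points above can be illustrated with a small comparison. Everything here is hypothetical (CheckFailed, check_bool, check_raising, and the dict of server statuses); it only shows that a boolean collapses distinct failures into one value while an exception can say which one happened.

```python
class CheckFailed(Exception):
    """Hypothetical error that records which resource failed and why."""

    def __init__(self, resource_id, reason):
        super().__init__("%s: %s" % (resource_id, reason))
        self.resource_id = resource_id
        self.reason = reason


def check_bool(server_id, servers):
    """Boolean style: a deleted server and a broken one look the same."""
    return servers.get(server_id) == "ACTIVE"


def check_raising(server_id, servers):
    """Exception style: each failure mode carries its own context."""
    if server_id not in servers:
        raise CheckFailed(server_id, "not found")
    if servers[server_id] != "ACTIVE":
        raise CheckFailed(server_id, "status is %s" % servers[server_id])
    return True
```

A caller of check_bool only learns that something is wrong, whereas a caller of check_raising can read the reason attribute and choose between, say, recreating a missing server and rebooting a broken one.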

Qiming Teng (tengqim)
Changed in senlin:
importance: Undecided → High
milestone: none → pike-2
summary: - health_policy_poll cannot recover node which vm was deleted
+ Node check cannot properly handle resource deletion
OpenStack Infra (hudson-openstack) wrote : Fix merged to senlin (master)

Reviewed: https://review.openstack.org/456187
Committed: https://git.openstack.org/cgit/openstack/senlin/commit/?id=616d4ed2778ac9cb9f171010e5355a0e4b599f34
Submitter: Jenkins
Branch: master

commit 616d4ed2778ac9cb9f171010e5355a0e4b599f34
Author: YiDe Yang <yangyide01@126.com>
Date: Wed Apr 12 19:41:10 2017 +0800

    Improve check_object for health_policy_poll recover

    check_object return false when happen exception,
    for recover nodes which cluster policy binding health_policy_poll.

    Closes-Bug: #1680758

    Change-Id: I34d189e0a8f29191deb3b1722de66bd51bd471fb

Changed in senlin:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Change abandoned on senlin (master)

Change abandoned by yangyide (yangyide01@126.com) on branch: master
Review: https://review.openstack.org/455503

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/senlin 4.0.0.0b2

This issue was fixed in the openstack/senlin 4.0.0.0b2 development milestone.
