masakari

If the failure host is the reserved host, the compute service of the failure host ends up in the enabled state

Bug #1670940 reported by takahara.kengo on 2017-03-08

This bug affects 1 person

Affects   Status        Importance  Assigned to     Milestone
masakari  Fix Released  Undecided   takahara.kengo

Bug Description

* Actual behavior:
If the failure host is the reserved host, the compute service of the failure host ends up enabled.

* Expected behavior:
If the failure host is the reserved host, I think the compute service of the failure host should end up disabled.
Furthermore, in this case, I think the evacuate API should not be executed.

* Environment
** compute-node1, compute-node2, and compute-node3 are in a cluster.
** The 'recovery_method' of the segment to which the failure host belongs is 'reserved_host'.
** compute-node3 is the only host whose 'reserved' is True.
** The masakari version is v3.0.0.

* Reproduction procedure
1. Create an instance on compute-node3.
2. Shut down compute-node3.

* Analysis
I checked the logs and code of masakari-engine; the processing flows as follows.
1. Disable the compute service of compute-node3 (= failure host).
2. Get the list of instances on compute-node3 (= failure host).
3. Enable the compute service of compute-node3 (= reserved host).
4. Execute the evacuate API with compute-node3 (= reserved host) as the target, which fails.
5. The notification ends with status 'error'.
6. A periodic task reprocesses the notification in 'error' status.
7. The notification ends with status 'failed' because 'reserved' of compute-node3 has already been changed to False and no reserved hosts are available for evacuation.

As a result, the compute service of compute-node3 ends up enabled.
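
The flow above can be reproduced with a small toy model. This is only a sketch, not masakari code: the `Host` class and `process_notification` function are hypothetical names, and the model assumes that the reserved-host lookup happens before the compute service is disabled and that 'reserved' is cleared when the reserved host is enabled (both inferred from steps 3 and 7).

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical stand-in for a host record plus its nova-compute state."""
    name: str
    reserved: bool = False
    service_enabled: bool = True
    up: bool = True
    instances: list = field(default_factory=list)


def process_notification(failed, hosts):
    """One pass over a host-failure notification (toy model, assumptions above)."""
    # Assumption: reserved hosts are looked up before the recovery flow starts.
    reserved = [h for h in hosts if h.reserved]
    if not reserved:
        return "failed"                  # step 7: nothing to evacuate to

    failed.service_enabled = False       # step 1: disable the failure host
    instances = list(failed.instances)   # step 2: list instances to recover

    target = reserved[0]
    target.service_enabled = True        # step 3: enable the reserved host
    target.reserved = False              # it is no longer a spare

    if not target.up:                    # step 4: evacuating to a dead host fails
        return "error"                   # step 5

    for instance in instances:           # successful evacuation (not reached here)
        failed.instances.remove(instance)
        target.instances.append(instance)
    return "finished"


hosts = [
    Host("compute-node1"),
    Host("compute-node2"),
    Host("compute-node3", reserved=True, up=False, instances=["vm1"]),
]
failed = hosts[2]

first = process_notification(failed, hosts)   # initial processing
second = process_notification(failed, hosts)  # step 6: periodic retry
print(first, second, failed.service_enabled)
```

In this model the first pass ends with 'error', the retry ends with 'failed', and the compute service of the failure host is left enabled, because step 3 re-enabled the same host that step 1 had disabled.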

takahara.kengo (takahara.kengo) on 2017-03-13

Changed in masakari:
assignee:	nobody → takahara.kengo (takahara.kengo)

OpenStack Infra (hudson-openstack) wrote on 2017-03-13: Fix proposed to masakari (master)

Fix proposed to branch: master
Review: https://review.openstack.org/444801

Changed in masakari:
status:	New → In Progress

Abhishek Kekane (abhishek-kekane) wrote on 2017-03-22:

Hi All,

As mentioned in the bug description, compute-node3 is the reserved host, which means the compute service should be disabled on that host (we need to ensure that nova-scheduler does not select the reserved host to launch instances). The next step says to create an instance on compute-node3. If the compute service is disabled on that host, an instance can never be spawned there.

This issue can be reproduced when the reserved host goes down: the pacemaker host-monitor will notify masakari, and masakari will start executing the workflow. In the task 'PrepareHAEnabledInstancesTask' it will try to build the list of instances on that host, which will be empty, and the taskflow will be marked as finished because there are no instances on that host to recover.

IMO, for this scenario, as commented by Tushar San, we should check whether the failed host is present in the list of available reserved hosts and, if so, remove it from that list.

takahara.kengo (takahara.kengo) wrote on 2017-03-22:

Hi Abhishek San,

In my understanding, the following two ideas produce the same result.
- Your idea: Get all reserved hosts, and then remove the failure host from the list before passing it to the execute_host_failure method.
- My idea: Remove the failure host by updating 'reserved' before getting all available reserved hosts.

I think that the empty list created in 'PrepareHAEnabledInstancesTask' and the removal of the failure host from reserved_host_list are unrelated. Is that correct?

I also think my idea is the simpler fix.
Could you tell me why you think your idea is better?
Does a problem occur in the subsequent processing?
If there is a reason, I would like to fix it according to your idea.
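
A minimal sketch of the two ideas compared above, to make the comparison concrete. It is not masakari code: hosts are plain dictionaries, both function names are hypothetical, and compute-node2 is marked as reserved purely for illustration so that the resulting list is not empty (in the reported environment only compute-node3 is reserved).

```python
import copy


def candidates_by_filtering(hosts, failed_name):
    """First idea: get all reserved hosts, then drop the failure host from the list."""
    reserved = [h for h in hosts if h["reserved"]]
    return [h["name"] for h in reserved if h["name"] != failed_name]


def candidates_by_unreserving(hosts, failed_name):
    """Second idea: set 'reserved' to False on the failure host, then get the list."""
    for h in hosts:
        if h["name"] == failed_name:
            h["reserved"] = False
    return [h["name"] for h in hosts if h["reserved"]]


# Illustrative segment; compute-node2 being reserved is an assumption.
segment = [
    {"name": "compute-node1", "reserved": False},
    {"name": "compute-node2", "reserved": True},
    {"name": "compute-node3", "reserved": True},
]

hosts_a = copy.deepcopy(segment)
hosts_b = copy.deepcopy(segment)

list_a = candidates_by_filtering(hosts_a, "compute-node3")
list_b = candidates_by_unreserving(hosts_b, "compute-node3")

print(list_a == list_b)                                 # True
print(hosts_a[2]["reserved"], hosts_b[2]["reserved"])   # True False
```

Both approaches pass the same candidate list on. In this sketch the only difference is the stored state: the second approach also leaves 'reserved' set to False on the failure host, while the first leaves the flag untouched.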

OpenStack Infra (hudson-openstack) on 2017-03-30

Changed in masakari:
assignee:	takahara.kengo (takahara.kengo) → Rikimaru Honjo (honjo-rikimaru-c6)

Rikimaru Honjo (honjo-rikimaru-c6) on 2017-03-30

Changed in masakari:
assignee:	Rikimaru Honjo (honjo-rikimaru-c6) → takahara.kengo (takahara.kengo)

OpenStack Infra (hudson-openstack) wrote on 2017-04-03: Fix merged to masakari (master)

Reviewed: https://review.openstack.org/444801
Committed: https://git.openstack.org/cgit/openstack/masakari/commit/?id=5cc52525dbfad44b9579f85001dc59cf80317c3d
Submitter: Jenkins
Branch: master

commit 5cc52525dbfad44b9579f85001dc59cf80317c3d
Author: Kengo Takahara <email address hidden>
Date: Mon Mar 13 19:19:49 2017 +0900

Delete the failure host from reserved_host

    If the failure host is selected as reserved_host,
    evacuation to the failure host will be executed.
    This patch added processing to change the 'reserved' column
    of hosts to False so that the failure host will not be selected
    as reserved_host.

Change-Id: I6e5f1087baf64787d759ed7c5b318d4a14aecb4c
Closes-Bug: #1670940

Changed in masakari:
status:	In Progress → Fix Released
