[Nova] Evacuation doesn't respect anti-affinity rules

Bug #1735407 reported by Alexander Rubtsov
16
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Mirantis OpenStack
High
Oleksiy Molchanov
9.x
High
Oleksiy Molchanov
OpenStack Compute (nova)
Medium
Balazs Gibizer

Bug Description

--- Environment ---
MOS: 9.2
Nova: 13.1.1-7~u14.04+mos20
3 compute nodes

--- Steps to reproduce ---
1. Create a new server group:
nova server-group-create anti anti-affinity

2. Launch 2 VMs in this server group:
nova boot --image TestVM --flavor m1.tiny --nic net-id=889e4e01-9b38-4007-829d-b69d53269874 --hint group=def58398-4a00-4066-a2aa-13f1b6e7e327 vm-1
nova boot --image TestVM --flavor m1.tiny --nic net-id=889e4e01-9b38-4007-829d-b69d53269874 --hint group=def58398-4a00-4066-a2aa-13f1b6e7e327 vm-2

3. Stop nova-compute on the nodes where these 2 VMs are running:
nova show vm-1 | grep "hypervisor"
OS-EXT-SRV-ATTR:hypervisor_hostname | node-12.domain.tld
nova show vm-2 | grep "hypervisor"
OS-EXT-SRV-ATTR:hypervisor_hostname | node-13.domain.tld
[root@node-12 ~]$ service nova-compute stop
nova-compute stop/waiting
[root@node-13 ~]$ service nova-compute stop
nova-compute stop/waiting

4. Evacuate both VMs almost at once:
nova evacuate vm-1
nova evacuate vm-2

5. Check where these 2 VMs are running:
nova show vm-1 | grep "hypervisor"
nova show vm-2 | grep "hypervisor"

--- Actual behavior ---
Both VMs have been evacuated on the same node:
[root@node-11 ~]$ virsh list
 Id Name State
----------------------------------------------------
 2 instance-00000001 running
 3 instance-00000002 running

--- Expected behavior ---
According to the anti-affinity rule, only 1 VM is evacuated.
Another one failed to evacuate with the appropriate message.

Revision history for this message
Alexander Rubtsov (arubtsov) wrote :

sla1 for 9.0-updates

Changed in mos:
importance: Undecided → High
tags: added: customer-found sla1
Changed in mos:
assignee: nobody → Oleksiy Molchanov (omolchanov)
Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

The issue actually only reproduces on simultaneous evacuating of several instances. It looks like there is a race condition. If you evacuate instances one by one with a good time gap between them - everything works as expected. IMHO this is barely fixable without a big change in the threads logic.

Revision history for this message
Matt Riedemann (mriedem) wrote :

This is a valid bug, and the analysis in comment #2 is correct.

Since you can't evacuate multiple instances in the same request, the scheduler doesn't know about their relationship to each other and therefore might pick a host that ultimately is wrong.

There is a late affinity policy check in the compute for this latent scheduler behavior issue but it's only applied on the initial server create:

https://github.com/openstack/nova/blob/mitaka-eol/nova/compute/manager.py#L1447

The reason this "works" (at least a higher chance of working) when you space the evacuate calls out is that the computes have time to send information about their instances to the scheduler:

https://github.com/openstack/nova/blob/mitaka-eol/nova/compute/manager.py#L1777

https://github.com/openstack/nova/blob/mitaka-eol/nova/compute/manager.py#L1803

So when the subsequent evacuate requests are made, the scheduler has the information to pick a host properly based on the anti-affinity policy.

At this point, even in master (queens right now), we don't have a solution for this. Long-term we want to model affinity (distance between resource providers) in the Placement service and have the scheduler use that for these filtering decisions, but that's not something being worked on in Queens and I'm not even sure if we'll get the for the Rocky release.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Some potential options (even bad ones):

1. We add the late-affinity check on the compute during evacuate. This could make the evacuate fail, but at least we wouldn't be breaking the server group policy. Since we don't have a reschedule loop for evacuate, the evacuate would just end there and the user would have to retry, but maybe that's OK in this case (it's maybe better than the alternative). I don't want to add a reschedule loop for the evacuate operation since reschedule loops are basically technical debt at this point, papering over bad scheduling decisions that we're trying to fix with Placement.

2. We could modify the evacuate API to allow you to evacuate an entire server group, which would get you the multi-instance evacuate in the same request functionality, which should allow the ServerGroupAntiAffinityFilter to work properly, but I doubt people would like that idea.

3. We wait for Placement to model affinity and hope that cures all problems...

Changed in nova:
importance: Undecided → Medium
status: New → Triaged
tags: added: evacuate scheduler
Changed in nova:
assignee: nobody → Balazs Gibizer (balazs-gibizer)
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I think we should go with option 1) now and try to prioritize the work for option 3) in the future.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/525242

Changed in nova:
status: Triaged → In Progress
Changed in mos:
status: New → Won't Fix
Revision history for this message
Alexander Rubtsov (arubtsov) wrote :

Oleksiy,

Could you please clarify why is this bug marked as "Won't fix" for 9.x?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/526095

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

So, I proposed a fix that adds a late server group policy check to the evacuation code path. However that fix only decrease the window of the race condition between two parallel evacuations.

So the fix [1] eliminates the race between two evacuates if one of the rebuild request reaches the target compute significantly later than the other rebuild request. Therefore the other rebuild request can finish in the driver before the late check introduced in [1] is run for the first request.

In the other hand if the second rebuild request reaches the compute manager while the first rebuild request is handled by the virt driver then the race can still occur as the late check is added before the virt driver rebuild call.

[1] https://review.openstack.org/525242

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

... and the instance.host only set to the target host _after_ the virt driver rebuild returned.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

Options to further limit the window of the race condition
1) Move setting the instance.host before the virt driver rebuild call. I'm afraid it will result is a lot of unwanted side effect.
2) Move the late check after the virt driver rebuild call. This might cause that for a short period of time the second instance will run on the target host violating the policy before it is destroyed by the late check. This sound bad for the user.
3) During the late check consider the target hosts of ongoing migrations as host used by the server group.

By eliminating the first two options I only left with the third to consider. But I'm open for suggestions.

Revision history for this message
Matt Riedemann (mriedem) wrote :

I think in general, adding the late affinity check during rebuild is a positive step forward in trying to handle this issue. As noted by gibi, we can still race and fail, so I'd leave option #3 from comment 11 in our pocket as a further improvement opportunity if it turns out that simply doing the late affinity check isn't good enough for the majority of cases.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

As Matt pointed out on IRC [1] the fix has a problem. It introduce a new up call from the cell to the api db to get the RequestSpec object. We want to get rid of all such upcalls so adding a new one is not acceptable. This means that we have to pass down the RequestSpec object via RPC instead. For that we need to bump the RPC version which also means that the fix will not be backportable to stable branches even if the problem is visible on MOS 9.x based on Mitaka.

[1] http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2017-12-06.log.html#t2017-12-06T16:31:20

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Alexander, I can clarify. Unfortunately after a deep investigation we don't see how it can be fixed in MOS 9.x, since it's a stable branch and API changes are prohibited there. Also the workaround is to add more time between evacuation of VMs.

Revision history for this message
Chris Friesen (cbf123) wrote :

For what it's worth, we've been maintaining some local patches to robustify server group affinity checks. One of them basically does what you describe in item #3 in comment #11 (consider both src/dest hosts of migrating instances as hosts that are part of the group). This is actually needed for cold and live migration as well as evacuation.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

@Chris: Thanks for the info. I think item #3 is worth implementing as a follow up after the current bugfix.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/526095
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=33d87707d8f36eea3dd18ecb7770cf306bfcfbbb
Submitter: Zuul
Branch: master

commit 33d87707d8f36eea3dd18ecb7770cf306bfcfbbb
Author: Balazs Gibizer <email address hidden>
Date: Wed Dec 6 16:05:21 2017 +0100

    Add regression test for bug 1735407

    Parallel evacuation can violate server group policy. This patch adds
    functional test to recreate the bug.

    The _wait_for_migration_status helper needed to be moved upper in the
    test calls hierarchy so it can be reused for this test too.

    Change-Id: I5bc50f9c0913a5b02c42f4ab201e9bf097395b54
    Related-Bug: #1735407

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/525242
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=edeeaf9102eccb78f1a2555c7e18c3d706f07639
Submitter: Zuul
Branch: master

commit edeeaf9102eccb78f1a2555c7e18c3d706f07639
Author: Balazs Gibizer <email address hidden>
Date: Mon Dec 4 16:18:30 2017 +0100

    Add late server group policy check to rebuild

    The affinity and anti-affinity server group policy is enforced by the
    scheduler but two parallel scheduling could cause that such policy is
    violated. During instance boot a late policy check was performed in
    the compute manager to prevent this. This check was missing in case
    of rebuild. Therefore two parallel evacuate command could cause that
    the server group policy is violated. This patch introduces the late
    policy check to rebuild to prevent such situation. When the violation
    is detected during boot a re-scheduling happens. However the rebuild
    action does not have the re-scheduling implementation so in this case
    the rebuild will fail and the evacuation needs to be retried by the
    user. Still this is better than allowing a parallel evacuation to
    break the server group affinity policy.

    To make the late policy check possible in the compute/manager the
    rebuild_instance compute RPC call was extended with a request_spec
    parameter.

    Co-Authored-By: Richard Zsarnoczai <email address hidden>

    Change-Id: I752617066bb2167b49239ab9d17b0c89754a3e12
    Closes-Bug: #1735407

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.0.0rc1

This issue was fixed in the openstack/nova 17.0.0.0rc1 release candidate.

Revision history for this message
jichenjc (jichenjc) wrote :

looks like still some race situation per bug

https://bugs.launchpad.net/nova/+bug/1768473

found the code link to this bug after open 1768473, so feel free to take that bug if it does have some problem

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/649953

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers