Simultaneous live migrations break the anti-affinity policy of a server group

Bug #1821755 reported by Boxiang Zhu on 2019-03-26
This bug affects 8 people
Affects: OpenStack Compute (nova) (Importance: Medium, Assigned to: Boxiang Zhu)
  Victoria series (Importance: Undecided, Unassigned)
  Wallaby series (Importance: Undecided, Unassigned)

Bug Description

Description
===========
If we live migrate two instances simultaneously, they can end up violating their server group's anti-affinity policy.

Steps to reproduce
==================
An OpenStack environment with three compute nodes (node1, node2 and node3). Create two VMs (vm1, vm2) in a server group with the anti-affinity policy, then live migrate both VMs simultaneously.

Before the live migrations, the VMs are located as follows:
node1 -> vm1
node2 -> vm2
node3

* nova live-migration vm1
* nova live-migration vm2
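
For completeness, the setup can be sketched with the classic novaclient CLI (group name, image and flavor names are placeholders; adjust for your deployment):

```shell
# Create a server group with the anti-affinity policy
nova server-group-create antigroup anti-affinity

# Boot both VMs into the group (pass the group UUID as a scheduler hint)
nova boot --image cirros --flavor m1.tiny --hint group=<group-uuid> vm1
nova boot --image cirros --flavor m1.tiny --hint group=<group-uuid> vm2

# Kick off both live migrations at (nearly) the same time
nova live-migration vm1 &
nova live-migration vm2 &
wait
```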

Expected result
===============
The live migrations of vm1 and vm2 fail, or at least do not place both VMs on the same host.

Actual result
=============
node1
node2
node3 -> vm1,vm2

Environment
===========
master branch of openstack

As described above, live migration does not account for other in-progress live migrations and simply selects a host via the scheduler filters, so both VMs can be migrated to the same host.

Boxiang Zhu (bxzhu-5355) on 2019-03-26
description: updated
Matt Riedemann (mriedem) on 2019-03-27
tags: added: live-migration scheduler
Revision history for this message
Matt Riedemann (mriedem) wrote :

This is a long-standing known issue I believe, same for server build and evacuate (evacuate was fixed later in Rocky I think). There is a late affinity check in the compute service to check for the race in the scheduler and then reschedule for server create to another host, or fail in the case of evacuate. There is no such late affinity check for other move operations like live migration, cold migration (resize) or unshelve.

I believe StarlingX's nova fork has some server group checks in the live migration task though, so maybe those fixes could be 'upstreamed' to nova:

https://github.com/starlingx-staging/stx-nova/blob/3155137b8a0f00cfdc534e428037e1a06e98b871/nova/conductor/tasks/live_migrate.py#L88

Looking at that StarlingX code, they basically check to see if the server being live migrated is in an anti-affinity group and if so they restrict scheduling via external lock to one live migration at a time, which might be OK in a small edge node with 1-2 compute nodes but would be pretty severe in a large public cloud with lots of concurrent live migrations. Granted it's only the scheduling portion of the live migration task, not the actual live migration of the guest itself once a target host is selected. I'm also not sure if that external lock would be sufficient if you have multiple nova-conductors running on different hosts unless you were using a distributed lock manager like etcd, which nova upstream does not use (I'm not sure if oslo.concurrency can be configured for etcd under the covers or not).
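The serialization idea described above can be sketched as follows. This is not the actual StarlingX code: the real implementation uses oslo.concurrency external (file-based) locks, and `schedule_live_migration` and its arguments are hypothetical names. A `threading.Lock` stands in here for illustration and, as noted above, would NOT coordinate conductors running on separate hosts.

```python
import threading
from collections import defaultdict

# One lock per server group; an external/distributed lock plays this role
# in a real multi-host deployment.
_group_locks = defaultdict(threading.Lock)

def schedule_live_migration(group_id, group_hosts, candidates):
    """Pick a destination for one group member, serialized per group.

    group_hosts is the mutable list of hosts the group already occupies.
    Holding the lock while we both read it and record our claim closes the
    window in which two concurrent migrations pick the same host.
    """
    with _group_locks[group_id]:
        for host in candidates:
            if host not in group_hosts:
                group_hosts.append(host)  # claim before releasing the lock
                return host
        raise RuntimeError("no host satisfies anti-affinity for %s" % group_id)
```

With the lock, two concurrent calls for the same group are forced to pick different hosts; without it, both could read the same `group_hosts` snapshot and return the same destination.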

Long-term this should all be resolved with placement when we can model affinity and anti-affinity in the placement service.

tags: added: starlingx
Changed in nova:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Chris Friesen (cbf123) wrote :

Just to add a comment...I can confirm the external lock would not be sufficient if you have nova-conductor services running on multiple physical hosts.

Revision history for this message
Boxiang Zhu (bxzhu-5355) wrote :

Hi Matt

To summarize your comments, I think there are three ways to fix this issue:
1. Do the same as build and evacuate (both of which were fixed before Stein): add a late affinity check in the compute service to catch the scheduler race, then reschedule or fail. [1] We could add the same validation for the other move operations.
2. Like the StarlingX code, add an external lock (or replace it with a distributed lock, e.g. via the Tooz library used by Cinder, which nova does not currently use).
3. Long-term, model affinity and anti-affinity in the placement service.

For the short term, I'd like to go with the first option. What do you think?

[1] https://github.com/openstack/nova/blob/stable/stein/nova/compute/manager.py#L1358-L1411
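
The late check in option #1 reduces to something like the following sketch (hypothetical names, loosely modeled on the linked `_validate_instance_group_policy` code): re-check the policy on the chosen destination just before the migration proceeds, against fresh data.

```python
class GroupAffinityError(Exception):
    """Raised when a move would violate the group's anti-affinity policy."""

def validate_anti_affinity(dest_host, group_policy, member_hosts):
    """Late check run for the destination host of a move operation.

    member_hosts should be re-read from the database at check time (not
    taken from the scheduler's earlier view), so a concurrent migration
    that just landed another group member here is caught.
    """
    if group_policy == "anti-affinity" and dest_host in member_hosts:
        raise GroupAffinityError(
            "anti-affinity group already has a member on %s" % dest_host)
```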

Revision history for this message
Matt Riedemann (mriedem) wrote :

I agree that option #1 (late affinity check in compute - probably during ComputeManager.pre_live_migration) is the easiest way to go, but could still potentially be racy although it should (for the most part anyway) solve a scheduling race where concurrent live migration requests are made and there are multiple schedulers running which pick the same host for servers in an anti-affinity group.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/651969

Changed in nova:
assignee: nobody → Boxiang Zhu (bxzhu-5355)
status: Triaged → In Progress
Revision history for this message
Tomi Juvonen (tomi-juvonen-q) wrote :

Also, if you live migrate to a host from which a member of the same anti-affinity group was live migrated away within the last ~70 seconds, the migration will fail because the anti-affinity filter still thinks an instance of the group is present there. I have carried a manual fix in the code for a couple of years that makes the AntiAffinity filter read up-to-date information straight from the DB. Sending SIGHUP to nova-scheduler also seems to work. So checking for parallel migrations might not be enough if the information used is not up to date.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/784166
Committed: https://opendev.org/openstack/nova/commit/33c8af1f8c46c9c37fcc28fb3409fbd3a78ae39f
Submitter: "Zuul (22348)"
Branch: master

commit 33c8af1f8c46c9c37fcc28fb3409fbd3a78ae39f
Author: Rodrigo Barbieri <email address hidden>
Date: Wed Mar 31 11:06:49 2021 -0300

    Error anti-affinity violation on migrations

    Error-out the migrations (cold and live) whenever the
    anti-affinity policy is violated. This addresses
    violations when multiple concurrent migrations are
    requested.

    Added detection on:
    - prep_resize
    - check_can_live_migrate_destination
    - pre_live_migration

    The improved method of detection now locks based on group_id
    and considers other migrations in-progress as well.

    Closes-bug: #1821755
    Change-Id: I32e6214568bb57f7613ddeba2c2c46da0320fabc
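
The detection described in the commit message (lock on the group id, count in-progress migrations as well as current hosts) can be sketched roughly as below. Function names, the migration dict shape, and the status values are illustrative assumptions, not nova's actual API.

```python
import threading
from collections import defaultdict

_group_locks = defaultdict(threading.Lock)

def hosts_in_use(member_hosts, migrations):
    """Hosts occupied by group members, plus the destinations of their
    in-progress migrations, so a migration that has been granted a host
    but not yet finished still counts against anti-affinity."""
    hosts = set(member_hosts)
    hosts.update(m["dest"] for m in migrations
                 if m["status"] in ("preparing", "running"))
    return hosts

def check_destination(group_id, dest, member_hosts, migrations):
    # Serialize per group so two concurrent checks cannot both pass
    # for the same destination host.
    with _group_locks[group_id]:
        if dest in hosts_in_use(member_hosts, migrations):
            raise RuntimeError("anti-affinity violated on %s" % dest)
```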

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/794328

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/795542

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by "Boxiang Zhu <zhu.boxiang@99cloud.net>" on branch: master
Review: https://review.opendev.org/c/openstack/nova/+/651969

