libvirt CPU power management does not support live migration

Bug #2056613 reported by Artom Lifshitz
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

Description
===========
libvirt CPU power management does not support live migration

Steps to reproduce
==================
1. Turn on libvirt CPU power management ([libvirt]cpu_power_management=True in nova.conf) on the source and destination compute hosts
2. Boot an instance with hw:cpu_policy=dedicated
3. Live migrate the instance

Expected result
===============
Live migration succeeds.

Actual result
=============
Live migration fails with the following libvirt error in the source nova-compute logs:

[instance: afdd5e62-2a97-4b58-a7e7-bb92152f4165] Migration operation thread notification {{(pid=103809) thread_finished /opt/stack/nova/nova/virt/libvirt/driver.py:10668}}
Feb 21 19:21:15.045216 np0036828692 nova-compute[103809]: Traceback (most recent call last):
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: File "/opt/stack/data/venv/lib/python3.10/site-packages/eventlet/hubs/hub.py", line 471, in fire_timers
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: timer()
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: File "/opt/stack/data/venv/lib/python3.10/site-packages/eventlet/hubs/timer.py", line 59, in __call__
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: cb(*args, **kw)
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: File "/opt/stack/data/venv/lib/python3.10/site-packages/eventlet/event.py", line 173, in _do_send
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: waiter.switch(result)
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: File "/opt/stack/data/venv/lib/python3.10/site-packages/eventlet/greenthread.py", line 264, in main
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: result = function(*args, **kwargs)
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: File "/opt/stack/nova/nova/utils.py", line 664, in context_wrapper
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: return func(*args, **kwargs)
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 10322, in _live_migration_operation
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: with excutils.save_and_reraise_exception():
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_utils/excutils.py", line 227, in __exit__
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: self.force_reraise()
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: raise self.value
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 10311, in _live_migration_operation
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: guest.migrate(self._live_migration_uri(dest),
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: File "/opt/stack/nova/nova/virt/libvirt/guest.py", line 648, in migrate
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: self._domain.migrateToURI3(
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: File "/opt/stack/data/venv/lib/python3.10/site-packages/eventlet/tpool.py", line 186, in doit
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: result = proxy_call(self._autowrap, f, *args, **kwargs)
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: File "/opt/stack/data/venv/lib/python3.10/site-packages/eventlet/tpool.py", line 144, in proxy_call
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: rv = execute(f, *args, **kwargs)
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: File "/opt/stack/data/venv/lib/python3.10/site-packages/eventlet/tpool.py", line 125, in execute
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: raise e.with_traceback(tb)
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: File "/opt/stack/data/venv/lib/python3.10/site-packages/eventlet/tpool.py", line 82, in tworker
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: rv = meth(*args, **kwargs)
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: File "/usr/lib/python3/dist-packages/libvirt.py", line 2126, in migrateToURI3
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: raise libvirtError('virDomainMigrateToURI3() failed')
Feb 21 19:21:15.045387 np0036828692 nova-compute[103809]: libvirt.libvirtError: cannot set CPU affinity on process 48279: Invalid argument

Environment
===========
This was originally noticed in a whitebox CI job [1] on devstack master.

Additional info
===============
Regardless of whether NUMA live migration has changed the underlying CPU pinnings, the cores must be powered up on the destination before the instance starts there; otherwise libvirt attempts to pin the instance to an offline core, which is what produces the "cannot set CPU affinity" error above. Nova doesn't handle that. With some refactoring of the code itself, it's possible to observe in functional tests that the cores are not being powered on.
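
To make the failure mode concrete: on Linux, a core's hotplug state is exposed in /sys/devices/system/cpu/cpuN/online, and a core that reads 0 there cannot be used as an affinity target. The sketch below is not Nova code. `offline_cores` and `make_fake_sysfs` are hypothetical helpers, and a temporary directory stands in for sysfs so the example runs on any machine.

```python
import tempfile
from pathlib import Path


def offline_cores(pinned_cores, sysfs_root):
    """Return the subset of pinned_cores whose 'online' file reads 0."""
    offline = set()
    for core in pinned_cores:
        online_file = Path(sysfs_root) / f"cpu{core}" / "online"
        # Some cores (often cpu0) have no 'online' file because they
        # cannot be offlined; a missing file is treated as online.
        if online_file.exists() and online_file.read_text().strip() == "0":
            offline.add(core)
    return offline


def make_fake_sysfs(states):
    """Build a throwaway directory that mimics the sysfs CPU layout."""
    root = Path(tempfile.mkdtemp())
    for core, online in states.items():
        cpu_dir = root / f"cpu{core}"
        cpu_dir.mkdir()
        (cpu_dir / "online").write_text("1\n" if online else "0\n")
    return root


# Hypothetical destination host: cores 2 and 3 are powered down.
fake_dest = make_fake_sysfs({0: True, 1: True, 2: False, 3: False})

# An instance pinned to cores 2 and 3 would hit the affinity error.
blocked = offline_cores({2, 3}, fake_dest)
usable = offline_cores({0, 1}, fake_dest)
```

On a real compute host, passing /sys/devices/system/cpu as `sysfs_root` would perform the same check against the actual core state.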

[1] https://zuul.opendev.org/t/openstack/build/532b30767df54147a01508e7616930f5/logs

Changed in nova:
importance: Undecided → Critical
Changed in nova:
status: New → In Progress
Artom Lifshitz (notartom) wrote :

Seems to me that without live migration, libvirt CPU power management is of very limited use, so setting this to critical.

Changed in nova:
importance: Critical → High
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/912320
Committed: https://opendev.org/openstack/nova/commit/29dc044a7aa1b1dd35ea4695c055feb5136ba1e5
Submitter: "Zuul (22348)"
Branch: master

commit 29dc044a7aa1b1dd35ea4695c055feb5136ba1e5
Author: Artom Lifshitz <email address hidden>
Date: Fri Mar 8 13:00:30 2024 -0500

    pwr mgmt: make API into a per-driver object

    We want to test power management in our functional tests in multinode
    scenarios (ex: live migration).

    This was previously impossible because all the methods in
    nova.virt.libvirt.cpu.api were at the module level, meaning both
    source and destination libvirt drivers would call the same method to
    online and offline cores. This made it impossible to maintain distinct
    core power state between source and destination.

    This patch inserts a nova.virt.libvirt.cpu.api.API class, and gives
    the libvirt driver a cpu_api attribute with an instance of that
    class. Along with the tiny API.core() helper, this allows new
    functional tests in the subsequent patches to stub out the core
    "model" code with distinct objects on the source and destination
    libvirt drivers, and enables a whole bunch of testing (and fixes!)
    around live migration.

    Related-bug: 2056613
    Change-Id: I052535249b9a3e144bb68b8c588b5995eb345b97
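
The refactor described in this commit message can be illustrated with a minimal sketch. Only the names `API`, `cpu_api` and `core()` come from the commit message; `Core`, `FakeDriver`, `power_up` and `power_down` are hypothetical, and the real Nova classes may look quite different. The point is that once power state hangs off a per-driver object, two drivers in one test process no longer share it.

```python
class Core:
    """Hypothetical stand-in for one host CPU core's power state."""

    def __init__(self, ident):
        self.ident = ident
        self.online = False


class API:
    """Per-driver power management object (sketch, not Nova's code)."""

    def __init__(self):
        self._cores = {}

    def core(self, ident):
        # Return the same Core object on repeated lookups so that
        # state changes persist within this API instance only.
        if ident not in self._cores:
            self._cores[ident] = Core(ident)
        return self._cores[ident]

    def power_up(self, idents):
        for ident in idents:
            self.core(ident).online = True

    def power_down(self, idents):
        for ident in idents:
            self.core(ident).online = False


class FakeDriver:
    """Hypothetical driver holding its own API instance."""

    def __init__(self):
        self.cpu_api = API()


source = FakeDriver()
dest = FakeDriver()

# Powering up cores on the source leaves the destination untouched,
# which module-level functions sharing one state could not guarantee.
source.cpu_api.power_up({1, 2})
source_state = source.cpu_api.core(1).online
dest_state = dest.cpu_api.core(1).online
```

With distinct objects per driver, a functional test can assert on the source and destination core state separately.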

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/nova/+/910022
Committed: https://opendev.org/openstack/nova/commit/1f5e3421ec1e15a58b9d9bdb9fc4312373ec4408
Submitter: "Zuul (22348)"
Branch: master

commit 1f5e3421ec1e15a58b9d9bdb9fc4312373ec4408
Author: Artom Lifshitz <email address hidden>
Date: Fri Feb 23 11:03:22 2024 -0500

    Reproducer test for live migration with power management

    Building on the previous patch's refactor, we can now do functional
    testing of live migration with CPU power management. We quickly notice
    that it's mostly broken, leaving the CPUs powered up on the source,
    and not powering them up on the dest.

    Related-bug: 2056613
    Change-Id: Ib4de77d68ceeffbc751bca3567ada72228b750af

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/909806
Committed: https://opendev.org/openstack/nova/commit/c1ccc1a3165ec1556c605b3b036274e992b0a09d
Submitter: "Zuul (22348)"
Branch: master

commit c1ccc1a3165ec1556c605b3b036274e992b0a09d
Author: Artom Lifshitz <email address hidden>
Date: Wed Feb 21 19:58:32 2024 -0500

    pwr mgmt: handle live migrations correctly

    Previously, live migrations completely ignored CPU power management.
    This patch makes sure that we correctly:

    * Power up the cores on the destination during pre_live_migration, as
      we need them powered up before the instance starts on the
      destination.
    * If the live migration is successful, power down the vacated cores on
      the source.
    * In case of a rollback, power down the cores previously powered up on
      pre_live_migration.

    Closes-bug: 2056613
    Change-Id: I787bd7807950370cd865f29b95989d489d4826d0
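
The three cases listed in this commit message can be sketched as a single function. This is an illustration under stated assumptions, not the actual patch: `HostCores`, `live_migrate` and the `migrate` callable are hypothetical, and in Nova these steps run on different hosts at different points in the migration rather than in one function.

```python
class HostCores:
    """Hypothetical record of which cores are powered up on one host."""

    def __init__(self, online=()):
        self.online = set(online)

    def power_up(self, cores):
        self.online |= set(cores)

    def power_down(self, cores):
        self.online -= set(cores)


def live_migrate(source, dest, source_cores, dest_cores, migrate):
    """Run migrate() with the power handling the commit describes."""
    # Before the guest starts on the destination, its cores must be up.
    dest.power_up(dest_cores)
    try:
        migrate()
    except Exception:
        # Rollback: undo the power-up done before the migration.
        dest.power_down(dest_cores)
        return False
    # Success: the source cores are vacated and can be powered down.
    source.power_down(source_cores)
    return True


def failing_migration():
    raise RuntimeError("simulated migration failure")


# Successful migration from source cores 4,5 to destination cores 2,3.
src_ok, dst_ok = HostCores({4, 5}), HostCores()
succeeded = live_migrate(src_ok, dst_ok, {4, 5}, {2, 3}, lambda: None)

# Failed migration: the destination cores are powered back down and
# the source cores stay up because the guest is still running there.
src_bad, dst_bad = HostCores({4, 5}), HostCores()
rolled_back = live_migrate(src_bad, dst_bad, {4, 5}, {2, 3},
                           failing_migration)
```

Keeping the rollback branch symmetric with the initial power-up avoids leaving destination cores powered on for an instance that never arrived.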

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/2023.2)

Related fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/nova/+/913196

OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/nova/+/913197

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/2023.2)

Fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/nova/+/913198

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/2023.1)

Related fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/nova/+/913225

OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/nova/+/913226

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/nova/+/913227

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 29.0.0.0rc1

This issue was fixed in the openstack/nova 29.0.0.0rc1 release candidate.

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/2023.2)

Reviewed: https://review.opendev.org/c/openstack/nova/+/913196
Committed: https://opendev.org/openstack/nova/commit/2a0e63828b6fefc1e83349f0bdac7f9eb8ea6678
Submitter: "Zuul (22348)"
Branch: stable/2023.2

commit 2a0e63828b6fefc1e83349f0bdac7f9eb8ea6678
Author: Artom Lifshitz <email address hidden>
Date: Fri Mar 8 13:00:30 2024 -0500

    pwr mgmt: make API into a per-driver object

    We want to test power management in our functional tests in multinode
    scenarios (ex: live migration).

    This was previously impossible because all the methods in
    nova.virt.libvirt.cpu.api were at the module level, meaning both
    source and destination libvirt drivers would call the same method to
    online and offline cores. This made it impossible to maintain distinct
    core power state between source and destination.

    This patch inserts a nova.virt.libvirt.cpu.api.API class, and gives
    the libvirt driver a cpu_api attribute with an instance of that
    class. Along with the tiny API.core() helper, this allows new
    functional tests in the subsequent patches to stub out the core
    "model" code with distinct objects on the source and destination
    libvirt drivers, and enables a whole bunch of testing (and fixes!)
    around live migration.

    Related-bug: 2056613
    Change-Id: I052535249b9a3e144bb68b8c588b5995eb345b97
    (cherry picked from commit 29dc044a7aa1b1dd35ea4695c055feb5136ba1e5)

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/nova/+/913197
Committed: https://opendev.org/openstack/nova/commit/95bbb0432ae6507b417aa3a16d8df56aa71f004e
Submitter: "Zuul (22348)"
Branch: stable/2023.2

commit 95bbb0432ae6507b417aa3a16d8df56aa71f004e
Author: Artom Lifshitz <email address hidden>
Date: Fri Feb 23 11:03:22 2024 -0500

    Reproducer test for live migration with power management

    Building on the previous patch's refactor, we can now do functional
    testing of live migration with CPU power management. We quickly notice
    that it's mostly broken, leaving the CPUs powered up on the source,
    and not powering them up on the dest.

    Related-bug: 2056613
    Change-Id: Ib4de77d68ceeffbc751bca3567ada72228b750af
    (cherry picked from commit 1f5e3421ec1e15a58b9d9bdb9fc4312373ec4408)

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/2023.2)

Reviewed: https://review.opendev.org/c/openstack/nova/+/913198
Committed: https://opendev.org/openstack/nova/commit/c5a73e6c7227f199bfa0b66c6ef0c2730d70c3b2
Submitter: "Zuul (22348)"
Branch: stable/2023.2

commit c5a73e6c7227f199bfa0b66c6ef0c2730d70c3b2
Author: Artom Lifshitz <email address hidden>
Date: Wed Feb 21 19:58:32 2024 -0500

    pwr mgmt: handle live migrations correctly

    Previously, live migrations completely ignored CPU power management.
    This patch makes sure that we correctly:

    * Power up the cores on the destination during pre_live_migration, as
      we need them powered up before the instance starts on the
      destination.
    * If the live migration is successful, power down the vacated cores on
      the source.
    * In case of a rollback, power down the cores previously powered up on
      pre_live_migration.

    NOTE(artom) Conflicts in nova/compute/manager.py around the do_cleanup
    determination because mdev live migration is not in Bobcat.

    Closes-bug: 2056613
    Change-Id: I787bd7807950370cd865f29b95989d489d4826d0
    (cherry picked from commit c1ccc1a3165ec1556c605b3b036274e992b0a09d)

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/nova/+/913225
Committed: https://opendev.org/openstack/nova/commit/874acc1ed16d3125db16add1e03dea65d8c542ca
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 874acc1ed16d3125db16add1e03dea65d8c542ca
Author: Artom Lifshitz <email address hidden>
Date: Fri Mar 8 13:00:30 2024 -0500

    pwr mgmt: make API into a per-driver object

    We want to test power management in our functional tests in multinode
    scenarios (ex: live migration).

    This was previously impossible because all the methods in
    nova.virt.libvirt.cpu.api were at the module level, meaning both
    source and destination libvirt drivers would call the same method to
    online and offline cores. This made it impossible to maintain distinct
    core power state between source and destination.

    This patch inserts a nova.virt.libvirt.cpu.api.API class, and gives
    the libvirt driver a cpu_api attribute with an instance of that
    class. Along with the tiny API.core() helper, this allows new
    functional tests in the subsequent patches to stub out the core
    "model" code with distinct objects on the source and destination
    libvirt drivers, and enables a whole bunch of testing (and fixes!)
    around live migration.

    Related-bug: 2056613
    Change-Id: I052535249b9a3e144bb68b8c588b5995eb345b97
    (cherry picked from commit 29dc044a7aa1b1dd35ea4695c055feb5136ba1e5)
    (cherry picked from commit 2a0e63828b6fefc1e83349f0bdac7f9eb8ea6678)

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/nova/+/913226
Committed: https://opendev.org/openstack/nova/commit/6d48c129ca06f269620e64d7950662b396b740b1
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 6d48c129ca06f269620e64d7950662b396b740b1
Author: Artom Lifshitz <email address hidden>
Date: Fri Feb 23 11:03:22 2024 -0500

    Reproducer test for live migration with power management

    Building on the previous patch's refactor, we can now do functional
    testing of live migration with CPU power management. We quickly notice
    that it's mostly broken, leaving the CPUs powered up on the source,
    and not powering them up on the dest.

    Related-bug: 2056613
    Change-Id: Ib4de77d68ceeffbc751bca3567ada72228b750af
    (cherry picked from commit 1f5e3421ec1e15a58b9d9bdb9fc4312373ec4408)
    (cherry picked from commit 95bbb0432ae6507b417aa3a16d8df56aa71f004e)

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/nova/+/913227
Committed: https://opendev.org/openstack/nova/commit/c6e1b44134d4c82c81b34e90cdb68644c7485f08
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit c6e1b44134d4c82c81b34e90cdb68644c7485f08
Author: Artom Lifshitz <email address hidden>
Date: Wed Feb 21 19:58:32 2024 -0500

    pwr mgmt: handle live migrations correctly

    Previously, live migrations completely ignored CPU power management.
    This patch makes sure that we correctly:

    * Power up the cores on the destination during pre_live_migration, as
      we need them powered up before the instance starts on the
      destination.
    * If the live migration is successful, power down the vacated cores on
      the source.
    * In case of a rollback, power down the cores previously powered up on
      pre_live_migration.

    Closes-bug: 2056613
    Change-Id: I787bd7807950370cd865f29b95989d489d4826d0
    (cherry picked from commit c1ccc1a3165ec1556c605b3b036274e992b0a09d)
    (cherry picked from commit c5a73e6c7227f199bfa0b66c6ef0c2730d70c3b2)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 27.3.0

This issue was fixed in the openstack/nova 27.3.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 28.1.0

This issue was fixed in the openstack/nova 28.1.0 release.
