network-vif-plugged event timeouts during resize-confirm can result in VMs entering an error state with a mix of the old and new flavor

Bug #2003377 reported by sean mooney
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

If a network-vif-plugged event times out during resize confirm, the VM will enter an error state.
If the VM is not using NUMA, a hard reboot should be enough to fix that.
If it has a NUMA topology, the instance_numa_topology and the flavor can disagree on the number of vCPUs requested, depending on when the failure happened.

In this case the VM can try to boot with the instance NUMA topology for the new flavor on the destination host, but with flavor.vcpus from the old flavor.

Ideally, after such a failure the VM should either revert to verify_resize, or it should be possible to run resize confirm again to finish the resize.
Alternatively, we could provide a nova-manage command to help fix the embedded flavor and/or the flavor in the request spec and reconcile those with the instance NUMA topology.

The intent would be to ensure it is possible to recover the VM either with a second confirm, or by using the nova-manage command and then hard rebooting the instance (see the sketch below).
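
As a rough illustration only: no such nova-manage command exists today, and the function name below is made up. This is a minimal sketch, assuming nova's internal objects API, of how such a reconcile command could detect the vCPU disagreement between the embedded flavor and the instance NUMA topology:

    # Hypothetical sketch; assumes nova config is already loaded so DB
    # access works, and that this runs inside the nova virtualenv.
    from nova import context as nova_context
    from nova import objects

    objects.register_all()

    def flavor_numa_vcpu_mismatch(instance_uuid):
        """Return True if the embedded flavor and the instance NUMA
        topology disagree on the number of vCPUs (this bug's failure mode)."""
        ctxt = nova_context.get_admin_context()
        instance = objects.Instance.get_by_uuid(
            ctxt, instance_uuid, expected_attrs=['flavor', 'numa_topology'])
        if instance.numa_topology is None:
            # Non-NUMA instances are not affected by this mismatch.
            return False
        numa_vcpus = 0
        for cell in instance.numa_topology.cells:
            # Depending on the release, dedicated CPUs live in cell.pcpuset
            # and shared CPUs in cell.cpuset; the sets are disjoint.
            numa_vcpus += len(cell.cpuset)
            if 'pcpuset' in cell:
                numa_vcpus += len(cell.pcpuset)
        return numa_vcpus != instance.flavor.vcpus

A real fix would of course also have to reconcile memory, the request spec and the pinning, not just report the mismatch.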

Tags: compute resize
Revision history for this message
sean mooney (sean-k-mooney) wrote (last edit ):

If the instance NUMA topology and the flavor disagree on the CPU count, the error presents in the log like this:

2023-01-17 23:34:32.972 7 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2023-01-17 23:34:32.972 7 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
2023-01-17 23:34:32.972 7 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2023-01-17 23:34:32.972 7 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 274, in dispatch
2023-01-17 23:34:32.972 7 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2023-01-17 23:34:32.972 7 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch
2023-01-17 23:34:32.972 7 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2023-01-17 23:34:32.972 7 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/nova/exception_wrapper.py", line 79, in wrapped
2023-01-17 23:34:32.972 7 ERROR oslo_messaging.rpc.server function_name, call_dict, binary, tb)
2023-01-17 23:34:32.972 7 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2023-01-17 23:34:32.972 7 ERROR oslo_messaging.rpc.server self.force_reraise()
2023-01-17 23:34:32.972 7 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2023-01-17 23:34:32.972 7 ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
2023-01-17 23:34:32.972 7 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/six.py", line 675, in reraise
2023-01-17 23:34:32.972 7 ERROR oslo_messaging.rpc.server raise value
2023-01-17 23:34:32.972 7 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/nova/exception_wrapper.py", line 69, in wrapped
2023-01-17 23:34:32.972 7 ERROR oslo_messaging.rpc.server return f(self, context, *args, **kw)
2023-01-17 23:34:32.972 7 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 191, in decorated_function
2023-01-17 23:34:32.972 7 ERROR oslo_messaging.rpc.server "Error: %s", e, instance=instance)
2023-01-17 23:34:32.972 7 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2023-01-17 23:34:32.972 7 ERROR oslo_messaging.rpc.server self.force_reraise()
2023-01-17 23:34:32.972 7 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2023-01-17 23:34:32.972 7 ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
2023-01-17 23:34:32.972 7 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/six.py", line 675, in reraise
2023-01-17 23:34:32.972 7 ERROR oslo_messaging.rpc.server raise value
2023-01-17 23:34:32.972 7 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-pac...

Revision history for this message
sean mooney (sean-k-mooney) wrote :

the XML looks kind of like this:

 <domain type="kvm">
  <uuid>******</uuid>
  <name>****</name>
  <memory>8388608</memory>
  <memoryBacking>
    <hugepages>
      <page size="1048576" nodeset="0" unit="KiB"/>
    </hugepages>
  </memoryBacking>
  <numatune>
    <memory mode="strict" nodeset="0"/>
    <memnode cellid="0" mode="strict" nodeset="0"/>
  </numatune>
  <vcpu>2</vcpu>
  <metadata>
    <nova:instance xmlns:nova="http://openstack.org/xmlns/libvirt/nova/1.0">
      ....
      <nova:flavor name="******">
        <nova:memory>8192</nova:memory>
        <nova:disk>80</nova:disk>
        <nova:swap>0</nova:swap>
        <nova:ephemeral>0</nova:ephemeral>
        <nova:vcpus>2</nova:vcpus>
      </nova:flavor>
    ....
    </nova:instance>
  </metadata>
  ...
  <cputune>
    <shares>2048</shares>
    <emulatorpin cpuset="28,32,84,88"/>
    <vcpupin vcpu="0" cpuset="32"/>
    <vcpupin vcpu="1" cpuset="88"/>
    <vcpupin vcpu="2" cpuset="28"/>
    <vcpupin vcpu="3" cpuset="84"/>
  </cputune>
  ...
  <cpu mode="host-model" match="exact">
    <topology sockets="1" cores="1" threads="2"/>
    <numa>
      <cell id="0" cpus="0-3" memory="16777216" memAccess="shared"/>
    </numa>
  </cpu>
  <devices>
  ...
  </devices>
</domain>
: libvirt.libvirtError: internal error: Number of CPUs in <numa> exceeds the <vcpu> count

Note that the flavor metadata and the CPU topology reference 2 vCPUs, but the CPU pinning and NUMA affinity have 4.

The non-NUMA memory assignment is 8 GiB (<memory>8388608</memory>, in KiB), while the NUMA cell memory assignment is 16 GiB (16777216 KiB).

So the NUMA elements used the info from the new flavor and the non-NUMA ones used the old flavor.

This was from a resize from 2 vCPUs and 8GB of RAM to 4 vCPUs and 16GB of RAM.
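
A quick way to spot this inconsistency from the outside is to cross-check the counts straight from the domain XML (e.g. the output of virsh dumpxml). A minimal standalone sketch, assuming only the element layout shown above:

    import sys
    import xml.etree.ElementTree as ET

    # Namespace used by the nova metadata block in the domain XML above.
    NOVA_NS = {'nova': 'http://openstack.org/xmlns/libvirt/nova/1.0'}

    def check_vcpu_consistency(xml_path):
        root = ET.parse(xml_path).getroot()
        vcpu = int(root.findtext('vcpu'))              # <vcpu> element
        pins = len(root.findall('./cputune/vcpupin'))  # pinned vCPU count
        flavor_vcpus = int(root.findtext(
            './metadata/nova:instance/nova:flavor/nova:vcpus',
            default='-1', namespaces=NOVA_NS))
        print('<vcpu>=%d vcpupin entries=%d flavor vcpus=%d'
              % (vcpu, pins, flavor_vcpus))
        # In the failure above, <vcpu> and the flavor metadata still show
        # the old flavor (2) while the pinning shows the new one (4).
        return vcpu == pins == flavor_vcpus

    if __name__ == '__main__':
        sys.exit(0 if check_vcpu_consistency(sys.argv[1]) else 1)

An inconsistent XML like this is also exactly what makes libvirt reject the domain with the "Number of CPUs in <numa> exceeds the <vcpu> count" error quoted above.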

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I agree that not being able to retry the failed (resize confirm) operation is not optimal. But I believe hard reboot was never officially advertised as capable of fixing every VM in an ERROR state. Anyhow, I agree we should improve the state handling when a resize confirm fails because the vif plug times out, so that the VM is put back to verify_resize and the confirm can be retried.

tags: added: compute resize
Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
status: Confirmed → Triaged