cpu power management can fail with OSError: [Errno 16] Device or resource busy

Bug #2065927 reported by sean mooney
This bug affects 1 person
Affects                   Status         Importance  Assigned to  Milestone
OpenStack Compute (nova)  Fix Released   Low         sean mooney
2024.1                    Fix Committed  Low         Unassigned
Antelope                  Triaged        Low         Unassigned
Bobcat                    Triaged        Low         Unassigned

Bug Description

as reported downstream in https://issues.redhat.com/browse/OSPRH-7103

if you create a vm, reboot the host, start the vm,
and finally delete it, the delete may fail:

May 16 15:54:26 edpm-compute-0 nova_compute[3396]: Traceback (most recent call last):
May 16 15:54:26 edpm-compute-0 nova_compute[3396]: File "/usr/lib/python3.9/site-packages/nova/filesystem.py", line 57, in write_sys
May 16 15:54:26 edpm-compute-0 nova_compute[3396]: fd.write(data)
May 16 15:54:26 edpm-compute-0 nova_compute[3396]: OSError: [Errno 16] Device or resource busy

this prevents the VM from being deleted on the initial request, but it can then be deleted if you try again.

this race condition with the kernel is unlikely to happen and appears to be timing related.

i.e. there is a short period of time where onlining or offlining a CPU may not be possible.

to mitigate this, nova should retry the operation with a backoff and then eventually squash the error, allowing the vm to be deleted without failing if we can't offline the core.

power management of the core should never block or cause the vm delete to fail.
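For context, the failing operation can be sketched as a plain write to a sysfs file. This is only an illustration, not nova's code: it assumes the write targets the standard Linux CPU hotplug file (`/sys/devices/system/cpu/cpuN/online`), and the helper names `set_cpu_online` and `is_device_busy` as well as the `sysfs_root` parameter are made up for the example.

```python
import errno
import os


def set_cpu_online(cpu_id, online, sysfs_root="/sys/devices/system/cpu"):
    """Write "1" or "0" to the CPU's hotplug control file.

    sysfs_root is a parameter only so the sketch can be exercised against
    a temporary directory; it is not taken from nova.
    """
    path = os.path.join(sysfs_root, f"cpu{cpu_id}", "online")
    with open(path, "w") as fd:
        # On a real host the kernel may reject this write with EBUSY.
        fd.write("1" if online else "0")


def is_device_busy(exc):
    """Return True for the transient error seen in the traceback (errno 16)."""
    return isinstance(exc, OSError) and exc.errno == errno.EBUSY
```

A caller that wants to treat the failure as transient would check `is_device_busy(exc)` before deciding whether to retry or give up.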

Tags: libvirt
Changed in nova:
assignee: nobody → sean mooney (sean-k-mooney)
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/920119

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/920203

Changed in nova:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/920119
Committed: https://opendev.org/openstack/nova/commit/ee581a5c9d1c0b7c0d8830a08f55fe8bc2fbcd0f
Submitter: "Zuul (22348)"
Branch: master

commit ee581a5c9d1c0b7c0d8830a08f55fe8bc2fbcd0f
Author: Sean Mooney <email address hidden>
Date: Tue May 21 17:53:07 2024 +0100

    add functional reproducer for bug 2065927

    Today if the write sys call to offline a cpu when
    deleting an instance fails due to an OSError or ValueError
    the instance delete fails and the instance goes to error.

    as reported in bug: #2065927 this can happen as a result of
    OSError: [Errno 16] Device or resource busy if the vm is
    deleted shortly after it's started.

    Related-Bug: #2065927
    Change-Id: I1352a3a1e28cfe14ec8f32042ed35cb25e70338e

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/2024.1)

Related fix proposed to branch: stable/2024.1
Review: https://review.opendev.org/c/openstack/nova/+/922950

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/920203
Committed: https://opendev.org/openstack/nova/commit/44c1b48b3121682cf959c90b3adaf2a3f92e318c
Submitter: "Zuul (22348)"
Branch: master

commit 44c1b48b3121682cf959c90b3adaf2a3f92e318c
Author: Sean Mooney <email address hidden>
Date: Wed May 22 18:59:02 2024 +0100

    retry write_sys call on device busy

    This change adds a retry_if_busy decorator
    to the read_sys and write_sys functions in the filesystem
    module that will retry reads and writes up to 5 times with
    a linear backoff.

    This allows nova to tolerate short periods of time where
    sysfs returns device busy. If the retries are exhausted
    and offlining a core fails, a warning is logged and the failure is
    ignored. Onlining a core is always treated as a hard error if
    retries are exhausted.

    Closes-Bug: #2065927
    Change-Id: I2a6a9f243cb403167620405e167a8dd2bbf3fa79
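The approach in this commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the merged nova code: only the decorator name `retry_if_busy` comes from the commit message, while its parameters (`max_retries`, `delay_step`, `sleep`), the interpretation of "up to 5 times" as five attempts in total, the helper `set_core_online` and the sysfs path are assumptions made for the example.

```python
import errno
import functools
import logging
import time

LOG = logging.getLogger(__name__)


def retry_if_busy(max_retries=5, delay_step=0.1, sleep=time.sleep):
    """Retry the wrapped call while it raises OSError with errno EBUSY.

    The delay grows linearly: delay_step, 2 * delay_step, and so on.
    The last failure, or any other error, is re-raised to the caller.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except OSError as exc:
                    if exc.errno != errno.EBUSY or attempt == max_retries:
                        raise
                    sleep(delay_step * attempt)  # linear backoff
        return wrapper
    return decorator


def set_core_online(cpu_id, online, write_sys):
    """Apply the policy from the commit message on top of a retried writer.

    A failure to offline a core is logged and ignored; a failure to
    online a core is propagated as a hard error.
    """
    path = f"/sys/devices/system/cpu/cpu{cpu_id}/online"
    try:
        write_sys(path, "1" if online else "0")
    except OSError as exc:
        if online:
            raise
        LOG.warning("Could not offline CPU %s, ignoring: %s", cpu_id, exc)
        return False
    return True
```

Here `write_sys` is expected to be a callable already wrapped with `retry_if_busy()`, so `set_core_online` only sees an error once the retries are exhausted. Injecting `sleep` keeps the sketch testable without real delays.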

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/2024.1)

Fix proposed to branch: stable/2024.1
Review: https://review.opendev.org/c/openstack/nova/+/922984

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/2023.2)

Related fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/nova/+/922985

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/2023.2)

Fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/nova/+/922986

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/2023.1)

Related fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/nova/+/922987

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/nova/+/922988

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/2024.1)

Reviewed: https://review.opendev.org/c/openstack/nova/+/922950
Committed: https://opendev.org/openstack/nova/commit/f1c46802b109db0b8e62f461ef0b432fa7c0984e
Submitter: "Zuul (22348)"
Branch: stable/2024.1

commit f1c46802b109db0b8e62f461ef0b432fa7c0984e
Author: Sean Mooney <email address hidden>
Date: Tue May 21 17:53:07 2024 +0100

    add functional reproducer for bug 2065927

    Today if the write sys call to offline a cpu when
    deleting an instance fails due to an OSError or ValueError
    the instance delete fails and the instance goes to error.

    as reported in bug: #2065927 this can happen as a result of
    OSError: [Errno 16] Device or resource busy if the vm is
    deleted shortly after it's started.

    Related-Bug: #2065927
    Change-Id: I1352a3a1e28cfe14ec8f32042ed35cb25e70338e
    (cherry picked from commit ee581a5c9d1c0b7c0d8830a08f55fe8bc2fbcd0f)

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/2024.1)

Reviewed: https://review.opendev.org/c/openstack/nova/+/922984
Committed: https://opendev.org/openstack/nova/commit/1581f6695f00c6b4fb694b6e946a4ed204c1680f
Submitter: "Zuul (22348)"
Branch: stable/2024.1

commit 1581f6695f00c6b4fb694b6e946a4ed204c1680f
Author: Sean Mooney <email address hidden>
Date: Wed May 22 18:59:02 2024 +0100

    retry write_sys call on device busy

    This change adds a retry_if_busy decorator
    to the read_sys and write_sys functions in the filesystem
    module that will retry reads and writes up to 5 times with
    a linear backoff.

    This allows nova to tolerate short periods of time where
    sysfs returns device busy. If the retries are exhausted
    and offlining a core fails, a warning is logged and the failure is
    ignored. Onlining a core is always treated as a hard error if
    retries are exhausted.

    Closes-Bug: #2065927
    Change-Id: I2a6a9f243cb403167620405e167a8dd2bbf3fa79
    (cherry picked from commit 44c1b48b3121682cf959c90b3adaf2a3f92e318c)
