update_available_resource_for_node racing instance deletion

Bug #1662867 reported by Lee Yarwood
Affects                    Status          Importance   Assigned to       Milestone
OpenStack Compute (nova)   Fix Released    Medium       Matt Riedemann
Ocata                      Fix Committed   Medium       Lee Yarwood
Pike                       Fix Committed   Medium       Lee Yarwood
Queens                     Fix Committed   Medium       Lee Yarwood

Bug Description

Description
===========
The following trace was seen multiple times during a CI run for https://review.openstack.org/#/c/383859/:

http://logs.openstack.org/09/395709/7/check/gate-tempest-dsvm-full-devstack-plugin-nfs-nv/a4c1057/logs/screen-n-cpu.txt.gz?level=ERROR#_2017-02-07_19_10_25_548
http://logs.openstack.org/09/395709/7/check/gate-tempest-dsvm-full-devstack-plugin-nfs-nv/a4c1057/logs/screen-n-cpu.txt.gz?level=ERROR#_2017-02-07_19_15_26_004

In the first example, a request to terminate instance 60b7cb32 appears to race with an existing run of the update_available_resource_for_node periodic task:

req-fa96477b-34d2-4ab6-83bf-24c269ed7c28

http://logs.openstack.org/09/395709/7/check/gate-tempest-dsvm-full-devstack-plugin-nfs-nv/a4c1057/logs/screen-n-cpu.txt.gz?#_2017-02-07_19_10_25_478

req-dc60ed89-d3da-45f6-b98c-8f57c767d751

http://logs.openstack.org/09/395709/7/check/gate-tempest-dsvm-full-devstack-plugin-nfs-nv/a4c1057/logs/screen-n-cpu.txt.gz?#_2017-02-07_19_10_25_548

Steps to reproduce
==================
Delete an instance while update_available_resource_for_node is running

Expected result
===============
Either swallow the exception and move on or lock instances in such a way that they can't be removed while this periodic task is running.
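
A minimal sketch of the first option (swallow the exception and move on), assuming the periodic task loops over compute node names and a resource tracker object; the names below are illustrative, not nova's actual code:

    # Illustrative sketch only, not the nova implementation: catch and log
    # per-node failures so a racing instance deletion cannot abort the
    # whole periodic run.
    import logging

    LOG = logging.getLogger(__name__)

    def update_available_resource(context, nodenames, resource_tracker):
        for nodename in nodenames:
            try:
                resource_tracker.update_available_resource(context, nodename)
            except Exception:
                # The instance backing files may have been removed between
                # listing the guests and inspecting their disks; log the
                # failure and continue with the remaining nodes.
                LOG.exception('Error updating resources for node %s',
                              nodename)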

Actual result
=============
update_available_resource_for_node fails and stops.

Environment
===========
1. Exact version of OpenStack you are running. See the following
  list for all releases: http://docs.openstack.org/releases/

   https://review.openstack.org/#/c/383859/ - but it should reproduce against master.

2. Which hypervisor did you use?
   (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
   What's the version of that?

   Libvirt

3. Which storage type did you use?
   (For example: Ceph, LVM, GPFS, ...)
   What's the version of that?

   n/a

4. Which networking type did you use?
   (For example: nova-network, Neutron with OpenVSwitch, ...)

   n/a

tags: added: placement resource-tracker
Changed in nova:
status: New → Confirmed
Revision history for this message
Lee Yarwood (lyarwood) wrote :
Revision history for this message
Matt Riedemann (mriedem) wrote :

The logs are gone. What did this actually look like? Can you find a version of it on stable/ocata and paste the actual error in here?

Changed in nova:
status: Confirmed → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired
Revision history for this message
Matt Riedemann (mriedem) wrote :

We still see this in CI runs periodically:

http://logs.openstack.org/70/552170/2/check/legacy-tempest-dsvm-multinode-live-migration/474e6b5/logs/screen-n-cpu.txt.gz?level=INFO#_Mar_13_22_24_03_425580

We can see that in this case, the instance in question has its files deleted right before the libvirt driver, via the update_available_resource periodic task, gets around to processing that instance:

Mar 13 22:24:03.367217 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: INFO nova.virt.libvirt.driver [None req-d0181533-0010-4268-bea5-fda2a392e8f1 tempest-LiveMigrationRemoteConsolesV26Test-2100428245 tempest-LiveMigrationRemoteConsolesV26Test-2100428245] [instance: 0460cf87-16f1-4aa3-a964-7cab159327dc] Deleting instance files /opt/stack/data/nova/instances/0460cf87-16f1-4aa3-a964-7cab159327dc_del
Mar 13 22:24:03.367725 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: INFO nova.virt.libvirt.driver [None req-d0181533-0010-4268-bea5-fda2a392e8f1 tempest-LiveMigrationRemoteConsolesV26Test-2100428245 tempest-LiveMigrationRemoteConsolesV26Test-2100428245] [instance: 0460cf87-16f1-4aa3-a964-7cab159327dc] Deletion of /opt/stack/data/nova/instances/0460cf87-16f1-4aa3-a964-7cab159327dc_del complete
Mar 13 22:24:03.425580 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager [None req-d0181533-0010-4268-bea5-fda2a392e8f1 tempest-LiveMigrationRemoteConsolesV26Test-2100428245 tempest-LiveMigrationRemoteConsolesV26Test-2100428245] Error updating resources for node ubuntu-xenial-rax-iad-0002937850.: InvalidDiskInfo: Disk info file is invalid: qemu-img failed to execute on /opt/stack/data/nova/instances/0460cf87-16f1-4aa3-a964-7cab159327dc/disk : Unexpected error while running command.
Mar 13 22:24:03.425969 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: Command: /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=30 -- env LC_ALL=C LANG=C qemu-img info /opt/stack/data/nova/instances/0460cf87-16f1-4aa3-a964-7cab159327dc/disk --force-share
Mar 13 22:24:03.426340 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: Exit code: 1
Mar 13 22:24:03.426705 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: Stdout: u''
Mar 13 22:24:03.427044 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: Stderr: u"qemu-img: Could not open '/opt/stack/data/nova/instances/0460cf87-16f1-4aa3-a964-7cab159327dc/disk': Could not open '/opt/stack/data/nova/instances/0460cf87-16f1-4aa3-a964-7cab159327dc/disk': No such file or directory\n"
Mar 13 22:24:03.427380 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager Traceback (most recent call last):
Mar 13 22:24:03.427724 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager File "/opt/stack/new/nova/nova/compute/manager.py", line 7320, in update_available_resource_for_node
Mar 13 22:24:03.428084 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager rt.update_available_resource(context, nodename)
Mar 13 22:24:03.428410 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager File "/opt/stack/new/nova/nova/compute/resource_tracker.py", line 664, in update_available_resour...


Changed in nova:
status: Expired → Triaged
importance: Undecided → Medium
Matt Riedemann (mriedem)
Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/553067

Changed in nova:
status: Triaged → In Progress
Changed in nova:
assignee: Matt Riedemann (mriedem) → sahid (sahid-ferdjaoui)
Matt Riedemann (mriedem)
Changed in nova:
assignee: sahid (sahid-ferdjaoui) → Matt Riedemann (mriedem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/553067
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5f16e714f58336344752305f94451e7c7c55742c
Submitter: Zuul
Branch: master

commit 5f16e714f58336344752305f94451e7c7c55742c
Author: Matt Riedemann <email address hidden>
Date: Wed Mar 14 16:43:22 2018 -0400

    libvirt: handle DiskNotFound during update_available_resource

    The update_available_resource periodic task in the compute manager
    eventually calls through to the resource tracker and virt driver
    get_available_resource method, which gets the guests running on
    the hypervisor, and builds up a set of information about the host.
    This includes disk information for the active domains.

    However, the periodic task can race with instances being deleted
    concurrently and the hypervisor can report the domain but the driver
    has already deleted the backing files as part of deleting the
    instance, and this leads to failures when running "qemu-img info"
    on the disk path which is now gone.

    When that happens, the entire periodic update fails.

    This change simply tries to detect the specific failure from
    'qemu-img info' and translate it into a DiskNotFound exception which
    the driver can handle. In this case, if the associated instance is
    undergoing a task state transition such as moving to another host or
    being deleted, we log a message and continue. If the instance is in
    steady state (task_state is not set), then we consider it a failure
    and re-raise it up.

    Note that we could add the deleted=False filter to the instance query
    in _get_disk_over_committed_size_total but that doesn't help us in
    this case because the hypervisor says the domain is still active
    and the instance is not actually considered deleted in the DB yet.

    Change-Id: Icec2769bf42455853cbe686fb30fda73df791b25
    Closes-Bug: #1662867
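
A hedged Python sketch of the approach described in this commit message: translate the specific qemu-img "No such file or directory" failure into a DiskNotFound-style exception, then skip the instance if it is mid task-state transition and re-raise otherwise. The helper names and the local DiskNotFound class are illustrative; nova's own exception class and driver code differ in detail:

    # Illustrative sketch, not the exact nova change.
    from oslo_concurrency import processutils

    class DiskNotFound(Exception):
        """Stand-in for nova.exception.DiskNotFound in this sketch."""

    def qemu_img_info(path):
        try:
            out, _err = processutils.execute(
                'qemu-img', 'info', path, '--force-share')
        except processutils.ProcessExecutionError as exc:
            # qemu-img exits non-zero with this message when a concurrent
            # delete has already removed the disk file.
            if 'No such file or directory' in (exc.stderr or ''):
                raise DiskNotFound(path)
            raise
        return out

    def disk_info_or_skip(instance, path):
        try:
            return qemu_img_info(path)
        except DiskNotFound:
            if instance.task_state is not None:
                # Instance is being deleted or moved; log upstream and
                # carry on with the rest of the guests.
                return None
            # A steady-state instance with a missing disk is a real error.
            raise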

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.0.0.0b1

This issue was fixed in the openstack/nova 18.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/571424

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/571426

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/571432

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.openstack.org/571424
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5a4c6913a37f912489543abd5e12a54feeeb89e2
Submitter: Zuul
Branch: stable/queens

commit 5a4c6913a37f912489543abd5e12a54feeeb89e2
Author: Matt Riedemann <email address hidden>
Date: Wed Mar 14 16:43:22 2018 -0400

    libvirt: handle DiskNotFound during update_available_resource

    The update_available_resource periodic task in the compute manager
    eventually calls through to the resource tracker and virt driver
    get_available_resource method, which gets the guests running on
    the hypervisor, and builds up a set of information about the host.
    This includes disk information for the active domains.

    However, the periodic task can race with instances being deleted
    concurrently and the hypervisor can report the domain but the driver
    has already deleted the backing files as part of deleting the
    instance, and this leads to failures when running "qemu-img info"
    on the disk path which is now gone.

    When that happens, the entire periodic update fails.

    This change simply tries to detect the specific failure from
    'qemu-img info' and translate it into a DiskNotFound exception which
    the driver can handle. In this case, if the associated instance is
    undergoing a task state transition such as moving to another host or
    being deleted, we log a message and continue. If the instance is in
    steady state (task_state is not set), then we consider it a failure
    and re-raise it up.

    Note that we could add the deleted=False filter to the instance query
    in _get_disk_over_committed_size_total but that doesn't help us in
    this case because the hypervisor says the domain is still active
    and the instance is not actually considered deleted in the DB yet.

    Change-Id: Icec2769bf42455853cbe686fb30fda73df791b25
    Closes-Bug: #1662867
    (cherry picked from commit 5f16e714f58336344752305f94451e7c7c55742c)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/571426
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d251b95083731829ba104dc5c7f642dd5097d510
Submitter: Zuul
Branch: stable/pike

commit d251b95083731829ba104dc5c7f642dd5097d510
Author: Matt Riedemann <email address hidden>
Date: Wed Mar 14 16:43:22 2018 -0400

    libvirt: handle DiskNotFound during update_available_resource

    The update_available_resource periodic task in the compute manager
    eventually calls through to the resource tracker and virt driver
    get_available_resource method, which gets the guests running on
    the hypervisor, and builds up a set of information about the host.
    This includes disk information for the active domains.

    However, the periodic task can race with instances being deleted
    concurrently and the hypervisor can report the domain but the driver
    has already deleted the backing files as part of deleting the
    instance, and this leads to failures when running "qemu-img info"
    on the disk path which is now gone.

    When that happens, the entire periodic update fails.

    This change simply tries to detect the specific failure from
    'qemu-img info' and translate it into a DiskNotFound exception which
    the driver can handle. In this case, if the associated instance is
    undergoing a task state transition such as moving to another host or
    being deleted, we log a message and continue. If the instance is in
    steady state (task_state is not set), then we consider it a failure
    and re-raise it up.

    Note that we could add the deleted=False filter to the instance query
    in _get_disk_over_committed_size_total but that doesn't help us in
    this case because the hypervisor says the domain is still active
    and the instance is not actually considered deleted in the DB yet.

    Change-Id: Icec2769bf42455853cbe686fb30fda73df791b25
    Closes-Bug: #1662867
    (cherry picked from commit 5f16e714f58336344752305f94451e7c7c55742c)
    (cherry picked from commit 5a4c6913a37f912489543abd5e12a54feeeb89e2)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.5

This issue was fixed in the openstack/nova 17.0.5 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.1.4

This issue was fixed in the openstack/nova 16.1.4 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/ocata)

Reviewed: https://review.openstack.org/571432
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e0c7d6c8816d0c21b2913adc7a3d6cbd59604cd0
Submitter: Zuul
Branch: stable/ocata

commit e0c7d6c8816d0c21b2913adc7a3d6cbd59604cd0
Author: Matt Riedemann <email address hidden>
Date: Wed Mar 14 16:43:22 2018 -0400

    libvirt: handle DiskNotFound during update_available_resource

    The update_available_resource periodic task in the compute manager
    eventually calls through to the resource tracker and virt driver
    get_available_resource method, which gets the guests running on
    the hypervisor, and builds up a set of information about the host.
    This includes disk information for the active domains.

    However, the periodic task can race with instances being deleted
    concurrently and the hypervisor can report the domain but the driver
    has already deleted the backing files as part of deleting the
    instance, and this leads to failures when running "qemu-img info"
    on the disk path which is now gone.

    When that happens, the entire periodic update fails.

    This change simply tries to detect the specific failure from
    'qemu-img info' and translate it into a DiskNotFound exception which
    the driver can handle. In this case, if the associated instance is
    undergoing a task state transition such as moving to another host or
    being deleted, we log a message and continue. If the instance is in
    steady state (task_state is not set), then we consider it a failure
    and re-raise it up.

    Note that we could add the deleted=False filter to the instance query
    in _get_disk_over_committed_size_total but that doesn't help us in
    this case because the hypervisor says the domain is still active
    and the instance is not actually considered deleted in the DB yet.

    Conflicts:
            nova/virt/libvirt/driver.py
            nova/tests/unit/virt/libvirt/test_driver.py

    NOTE(lyarwood): Conflicts due to the substantial refactoring of
    _get_instance_disk_info via I9616a602ee0605f7f1dc1f47b6416f01895e025b
    and removal of _LW etc. during Pike.

    Change-Id: Icec2769bf42455853cbe686fb30fda73df791b25
    Closes-Bug: #1662867
    (cherry picked from commit 5f16e714f58336344752305f94451e7c7c55742c)
    (cherry picked from commit 5a4c6913a37f912489543abd5e12a54feeeb89e2)
    (cherry picked from commit d251b95083731829ba104dc5c7f642dd5097d510)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 15.1.3

This issue was fixed in the openstack/nova 15.1.3 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.1.5

This issue was fixed in the openstack/nova 16.1.5 release.
