update_available_resource_for_node racing instance deletion

Bug #1662867 reported by Lee Yarwood
Affects                    Status          Importance   Assigned to       Milestone
OpenStack Compute (nova)   Fix Released    Medium       Matt Riedemann
Ocata                      Fix Committed   Medium       Lee Yarwood
Pike                       Fix Committed   Medium       Lee Yarwood
Queens                     Fix Committed   Medium       Lee Yarwood

Bug Description

Description
===========
The following trace was seen multiple times during a CI run for https://review.openstack.org/#/c/383859/:

http://logs.openstack.org/09/395709/7/check/gate-tempest-dsvm-full-devstack-plugin-nfs-nv/a4c1057/logs/screen-n-cpu.txt.gz?level=ERROR#_2017-02-07_19_10_25_548
http://logs.openstack.org/09/395709/7/check/gate-tempest-dsvm-full-devstack-plugin-nfs-nv/a4c1057/logs/screen-n-cpu.txt.gz?level=ERROR#_2017-02-07_19_15_26_004

In the first example, a request to terminate instance 60b7cb32 appears to race with an existing run of the update_available_resource_for_node periodic task:

req-fa96477b-34d2-4ab6-83bf-24c269ed7c28

http://logs.openstack.org/09/395709/7/check/gate-tempest-dsvm-full-devstack-plugin-nfs-nv/a4c1057/logs/screen-n-cpu.txt.gz?#_2017-02-07_19_10_25_478

req-dc60ed89-d3da-45f6-b98c-8f57c767d751

http://logs.openstack.org/09/395709/7/check/gate-tempest-dsvm-full-devstack-plugin-nfs-nv/a4c1057/logs/screen-n-cpu.txt.gz?#_2017-02-07_19_10_25_548

Steps to reproduce
==================
Delete an instance while update_available_resource_for_node is running

Expected result
===============
Either swallow the exception and move on or lock instances in such a way that they can't be removed while this periodic task is running.
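
A minimal sketch of the first option (swallow the exception and move on), assuming the periodic task loops over compute node names and a resource tracker object; the names below are illustrative, not nova's actual code:

    # Illustrative sketch only, not the nova implementation: catch and log
    # per-node failures so a racing instance deletion cannot abort the
    # whole periodic run.
    import logging

    LOG = logging.getLogger(__name__)

    def update_available_resource(context, nodenames, resource_tracker):
        for nodename in nodenames:
            try:
                resource_tracker.update_available_resource(context, nodename)
            except Exception:
                # The instance backing files may have been removed between
                # listing the guests and inspecting their disks; log the
                # failure and continue with the remaining nodes.
                LOG.exception('Error updating resources for node %s',
                              nodename)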

Actual result
=============
update_available_resource_for_node fails and stops.

Environment
===========
1. Exact version of OpenStack you are running. See the following
  list for all releases: http://docs.openstack.org/releases/

   https://review.openstack.org/#/c/383859/ - but it should reproduce against master.

2. Which hypervisor did you use?
   (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
   What's the version of that?

   Libvirt

3. Which storage type did you use?
   (For example: Ceph, LVM, GPFS, ...)
   What's the version of that?

   n/a

4. Which networking type did you use?
   (For example: nova-network, Neutron with OpenVSwitch, ...)

   n/a

tags: added: placement resource-tracker
Changed in nova:
status: New → Confirmed
Revision history for this message
Lee Yarwood (lyarwood) wrote :
Revision history for this message
Matt Riedemann (mriedem) wrote :

The logs are gone. What did this actually look like? Can you find a version of it on stable/ocata and paste the actual error in here?

Changed in nova:
status: Confirmed → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired
Revision history for this message
Matt Riedemann (mriedem) wrote :

We still see this in CI runs periodically:

http://logs.openstack.org/70/552170/2/check/legacy-tempest-dsvm-multinode-live-migration/474e6b5/logs/screen-n-cpu.txt.gz?level=INFO#_Mar_13_22_24_03_425580

We can see that in this case, the instance in question has its files deleted right before the libvirt driver, via the update_available_resource periodic task, gets around to processing that instance:

Mar 13 22:24:03.367217 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: INFO nova.virt.libvirt.driver [None req-d0181533-0010-4268-bea5-fda2a392e8f1 tempest-LiveMigrationRemoteConsolesV26Test-2100428245 tempest-LiveMigrationRemoteConsolesV26Test-2100428245] [instance: 0460cf87-16f1-4aa3-a964-7cab159327dc] Deleting instance files /opt/stack/data/nova/instances/0460cf87-16f1-4aa3-a964-7cab159327dc_del
Mar 13 22:24:03.367725 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: INFO nova.virt.libvirt.driver [None req-d0181533-0010-4268-bea5-fda2a392e8f1 tempest-LiveMigrationRemoteConsolesV26Test-2100428245 tempest-LiveMigrationRemoteConsolesV26Test-2100428245] [instance: 0460cf87-16f1-4aa3-a964-7cab159327dc] Deletion of /opt/stack/data/nova/instances/0460cf87-16f1-4aa3-a964-7cab159327dc_del complete
Mar 13 22:24:03.425580 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager [None req-d0181533-0010-4268-bea5-fda2a392e8f1 tempest-LiveMigrationRemoteConsolesV26Test-2100428245 tempest-LiveMigrationRemoteConsolesV26Test-2100428245] Error updating resources for node ubuntu-xenial-rax-iad-0002937850.: InvalidDiskInfo: Disk info file is invalid: qemu-img failed to execute on /opt/stack/data/nova/instances/0460cf87-16f1-4aa3-a964-7cab159327dc/disk : Unexpected error while running command.
Mar 13 22:24:03.425969 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: Command: /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=30 -- env LC_ALL=C LANG=C qemu-img info /opt/stack/data/nova/instances/0460cf87-16f1-4aa3-a964-7cab159327dc/disk --force-share
Mar 13 22:24:03.426340 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: Exit code: 1
Mar 13 22:24:03.426705 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: Stdout: u''
Mar 13 22:24:03.427044 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: Stderr: u"qemu-img: Could not open '/opt/stack/data/nova/instances/0460cf87-16f1-4aa3-a964-7cab159327dc/disk': Could not open '/opt/stack/data/nova/instances/0460cf87-16f1-4aa3-a964-7cab159327dc/disk': No such file or directory\n"
Mar 13 22:24:03.427380 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager Traceback (most recent call last):
Mar 13 22:24:03.427724 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager File "/opt/stack/new/nova/nova/compute/manager.py", line 7320, in update_available_resource_for_node
Mar 13 22:24:03.428084 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager rt.update_available_resource(context, nodename)
Mar 13 22:24:03.428410 ubuntu-xenial-rax-iad-0002937850 nova-compute[30634]: ERROR nova.compute.manager File "/opt/stack/new/nova/nova/compute/resource_tracker.py", line 664, in update_available_resour...


Changed in nova:
status: Expired → Triaged
importance: Undecided → Medium
Matt Riedemann (mriedem)
Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/553067

Changed in nova:
status: Triaged → In Progress
Changed in nova:
assignee: Matt Riedemann (mriedem) → sahid (sahid-ferdjaoui)
Matt Riedemann (mriedem)
Changed in nova:
assignee: sahid (sahid-ferdjaoui) → Matt Riedemann (mriedem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/553067
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5f16e714f58336344752305f94451e7c7c55742c
Submitter: Zuul
Branch: master

commit 5f16e714f58336344752305f94451e7c7c55742c
Author: Matt Riedemann <email address hidden>
Date: Wed Mar 14 16:43:22 2018 -0400

    libvirt: handle DiskNotFound during update_available_resource

    The update_available_resource periodic task in the compute manager
    eventually calls through to the resource tracker and virt driver
    get_available_resource method, which gets the guests running on
    the hypervisor, and builds up a set of information about the host.
    This includes disk information for the active domains.

    However, the periodic task can race with instances being deleted
    concurrently and the hypervisor can report the domain but the driver
    has already deleted the backing files as part of deleting the
    instance, and this leads to failures when running "qemu-img info"
    on the disk path which is now gone.

    When that happens, the entire periodic update fails.

    This change simply tries to detect the specific failure from
    'qemu-img info' and translate it into a DiskNotFound exception which
    the driver can handle. In this case, if the associated instance is
    undergoing a task state transition such as moving to another host or
    being deleted, we log a message and continue. If the instance is in
    steady state (task_state is not set), then we consider it a failure
    and re-raise it up.

    Note that we could add the deleted=False filter to the instance query
    in _get_disk_over_committed_size_total but that doesn't help us in
    this case because the hypervisor says the domain is still active
    and the instance is not actually considered deleted in the DB yet.

    Change-Id: Icec2769bf42455853cbe686fb30fda73df791b25
    Closes-Bug: #1662867
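
A hedged Python sketch of the approach described in this commit message: translate the specific qemu-img "No such file or directory" failure into a DiskNotFound-style exception, then skip the instance if it is mid task-state transition and re-raise otherwise. The helper names and the local DiskNotFound class are illustrative; nova's own exception class and driver code differ in detail:

    # Illustrative sketch, not the exact nova change.
    from oslo_concurrency import processutils

    class DiskNotFound(Exception):
        """Stand-in for nova.exception.DiskNotFound in this sketch."""

    def qemu_img_info(path):
        try:
            out, _err = processutils.execute(
                'qemu-img', 'info', path, '--force-share')
        except processutils.ProcessExecutionError as exc:
            # qemu-img exits non-zero with this message when a concurrent
            # delete has already removed the disk file.
            if 'No such file or directory' in (exc.stderr or ''):
                raise DiskNotFound(path)
            raise
        return out

    def disk_info_or_skip(instance, path):
        try:
            return qemu_img_info(path)
        except DiskNotFound:
            if instance.task_state is not None:
                # Instance is being deleted or moved; log upstream and
                # carry on with the rest of the guests.
                return None
            # A steady-state instance with a missing disk is a real error.
            raise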

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.0.0.0b1

This issue was fixed in the openstack/nova 18.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/571424

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/571426

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/571432

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.openstack.org/571424
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5a4c6913a37f912489543abd5e12a54feeeb89e2
Submitter: Zuul
Branch: stable/queens

commit 5a4c6913a37f912489543abd5e12a54feeeb89e2
Author: Matt Riedemann <email address hidden>
Date: Wed Mar 14 16:43:22 2018 -0400

    libvirt: handle DiskNotFound during update_available_resource

    The update_available_resource periodic task in the compute manager
    eventually calls through to the resource tracker and virt driver
    get_available_resource method, which gets the guests running on
    the hypervisor, and builds up a set of information about the host.
    This includes disk information for the active domains.

    However, the periodic task can race with instances being deleted
    concurrently and the hypervisor can report the domain but the driver
    has already deleted the backing files as part of deleting the
    instance, and this leads to failures when running "qemu-img info"
    on the disk path which is now gone.

    When that happens, the entire periodic update fails.

    This change simply tries to detect the specific failure from
    'qemu-img info' and translate it into a DiskNotFound exception which
    the driver can handle. In this case, if the associated instance is
    undergoing a task state transition such as moving to another host or
    being deleted, we log a message and continue. If the instance is in
    steady state (task_state is not set), then we consider it a failure
    and re-raise it up.

    Note that we could add the deleted=False filter to the instance query
    in _get_disk_over_committed_size_total but that doesn't help us in
    this case because the hypervisor says the domain is still active
    and the instance is not actually considered deleted in the DB yet.

    Change-Id: Icec2769bf42455853cbe686fb30fda73df791b25
    Closes-Bug: #1662867
    (cherry picked from commit 5f16e714f58336344752305f94451e7c7c55742c)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/571426
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d251b95083731829ba104dc5c7f642dd5097d510
Submitter: Zuul
Branch: stable/pike

commit d251b95083731829ba104dc5c7f642dd5097d510
Author: Matt Riedemann <email address hidden>
Date: Wed Mar 14 16:43:22 2018 -0400

    libvirt: handle DiskNotFound during update_available_resource

    The update_available_resource periodic task in the compute manager
    eventually calls through to the resource tracker and virt driver
    get_available_resource method, which gets the guests running on
    the hypervisor, and builds up a set of information about the host.
    This includes disk information for the active domains.

    However, the periodic task can race with instances being deleted
    concurrently and the hypervisor can report the domain but the driver
    has already deleted the backing files as part of deleting the
    instance, and this leads to failures when running "qemu-img info"
    on the disk path which is now gone.

    When that happens, the entire periodic update fails.

    This change simply tries to detect the specific failure from
    'qemu-img info' and translate it into a DiskNotFound exception which
    the driver can handle. In this case, if the associated instance is
    undergoing a task state transition such as moving to another host or
    being deleted, we log a message and continue. If the instance is in
    steady state (task_state is not set), then we consider it a failure
    and re-raise it up.

    Note that we could add the deleted=False filter to the instance query
    in _get_disk_over_committed_size_total but that doesn't help us in
    this case because the hypervisor says the domain is still active
    and the instance is not actually considered deleted in the DB yet.

    Change-Id: Icec2769bf42455853cbe686fb30fda73df791b25
    Closes-Bug: #1662867
    (cherry picked from commit 5f16e714f58336344752305f94451e7c7c55742c)
    (cherry picked from commit 5a4c6913a37f912489543abd5e12a54feeeb89e2)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.5

This issue was fixed in the openstack/nova 17.0.5 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.1.4

This issue was fixed in the openstack/nova 16.1.4 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/ocata)

Reviewed: https://review.openstack.org/571432
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e0c7d6c8816d0c21b2913adc7a3d6cbd59604cd0
Submitter: Zuul
Branch: stable/ocata

commit e0c7d6c8816d0c21b2913adc7a3d6cbd59604cd0
Author: Matt Riedemann <email address hidden>
Date: Wed Mar 14 16:43:22 2018 -0400

    libvirt: handle DiskNotFound during update_available_resource

    The update_available_resource periodic task in the compute manager
    eventually calls through to the resource tracker and virt driver
    get_available_resource method, which gets the guests running on
    the hypervisor, and builds up a set of information about the host.
    This includes disk information for the active domains.

    However, the periodic task can race with instances being deleted
    concurrently and the hypervisor can report the domain but the driver
    has already deleted the backing files as part of deleting the
    instance, and this leads to failures when running "qemu-img info"
    on the disk path which is now gone.

    When that happens, the entire periodic update fails.

    This change simply tries to detect the specific failure from
    'qemu-img info' and translate it into a DiskNotFound exception which
    the driver can handle. In this case, if the associated instance is
    undergoing a task state transition such as moving to another host or
    being deleted, we log a message and continue. If the instance is in
    steady state (task_state is not set), then we consider it a failure
    and re-raise it up.

    Note that we could add the deleted=False filter to the instance query
    in _get_disk_over_committed_size_total but that doesn't help us in
    this case because the hypervisor says the domain is still active
    and the instance is not actually considered deleted in the DB yet.

    Conflicts:
            nova/virt/libvirt/driver.py
            nova/tests/unit/virt/libvirt/test_driver.py

    NOTE(lyarwood): Conflicts due to the substantial refactoring of
    _get_instance_disk_info via I9616a602ee0605f7f1dc1f47b6416f01895e025b
    and removal of _LW etc. during Pike.

    Change-Id: Icec2769bf42455853cbe686fb30fda73df791b25
    Closes-Bug: #1662867
    (cherry picked from commit 5f16e714f58336344752305f94451e7c7c55742c)
    (cherry picked from commit 5a4c6913a37f912489543abd5e12a54feeeb89e2)
    (cherry picked from commit d251b95083731829ba104dc5c7f642dd5097d510)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 15.1.3

This issue was fixed in the openstack/nova 15.1.3 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.1.5

This issue was fixed in the openstack/nova 16.1.5 release.
