InstanceNotFound stack trace in nova-compute logs if instance is deleted during snapshot

Bug #1737024 reported by Matt Riedemann
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Low
jichenjc

Bug Description

I saw this in some zKVM CI logs, the instance is deleted while it's being snapshot:

http://sng01.objectstorage.softlayer.net/v1/AUTH_1940ea10-6e82-4501-b2f9-eb236510e575/ibmzkvmci/production/525787/4/check-tempest-dsvm-neutron-full-ubuntu-xenial-s390x/f737a5b/logs/screen-n-cpu.txt.gz

Dec 07 17:45:42.360741 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: ERROR nova.virt.libvirt.driver [None req-76cb1ae9-b516-49d4-bb52-625ad96cb932 tempest-ImagesTestJSON-1741486695 tempest-ImagesTestJSON-1741486695] Failed to snapshot image: InstanceNotFound_Remote: Instance 89568634-e208-4973-b5e6-71fb262c49b5 could not be found.
Dec 07 17:45:42.360885 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: Traceback (most recent call last):
Dec 07 17:45:42.361123 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: File "/opt/stack/new/nova/nova/conductor/manager.py", line 123, in _object_dispatch
Dec 07 17:45:42.361219 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: return getattr(target, method)(*args, **kwargs)
Dec 07 17:45:42.361314 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 226, in wrapper
Dec 07 17:45:42.361409 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: return fn(self, *args, **kwargs)
Dec 07 17:45:42.361513 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: File "/opt/stack/new/nova/nova/objects/instance.py", line 799, in save
Dec 07 17:45:42.361795 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: columns_to_join=_expected_cols(expected_attrs))
Dec 07 17:45:42.361892 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: File "/opt/stack/new/nova/nova/db/api.py", line 875, in instance_update_and_get_original
Dec 07 17:45:42.361988 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: expected=expected)
Dec 07 17:45:42.362099 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: File "/opt/stack/new/nova/nova/db/sqlalchemy/api.py", line 182, in wrapper
Dec 07 17:45:42.362205 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: return f(*args, **kwargs)
Dec 07 17:45:42.362411 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: File "/usr/local/lib/python2.7/dist-packages/oslo_db/api.py", line 147, in wrapper
Dec 07 17:45:42.362530 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: ectxt.value = e.inner_exc
Dec 07 17:45:42.362635 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
Dec 07 17:45:42.362747 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: self.force_reraise()
Dec 07 17:45:42.362850 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
Dec 07 17:45:42.362961 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: six.reraise(self.type_, self.value, self.tb)
Dec 07 17:45:42.363065 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: File "/usr/local/lib/python2.7/dist-packages/oslo_db/api.py", line 135, in wrapper
Dec 07 17:45:42.363172 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: return f(*args, **kwargs)
Dec 07 17:45:42.363274 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: File "/opt/stack/new/nova/nova/db/sqlalchemy/api.py", line 253, in wrapped
Dec 07 17:45:42.363377 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: return f(context, *args, **kwargs)
Dec 07 17:45:42.363490 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: File "/opt/stack/new/nova/nova/db/sqlalchemy/api.py", line 2736, in instance_update_and_get_original
Dec 07 17:45:42.363593 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: columns_to_join=columns_to_join)
Dec 07 17:45:42.363700 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: File "/opt/stack/new/nova/nova/db/sqlalchemy/api.py", line 1894, in _instance_get_by_uuid
Dec 07 17:45:42.363806 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: raise exception.InstanceNotFound(instance_id=uuid)
Dec 07 17:45:42.363909 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: InstanceNotFound: Instance 89568634-e208-4973-b5e6-71fb262c49b5 could not be found.
Dec 07 17:45:42.364018 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: ERROR nova.virt.libvirt.driver Traceback (most recent call last):
Dec 07 17:45:42.364126 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: ERROR nova.virt.libvirt.driver File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 1711, in snapshot
Dec 07 17:45:42.364239 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: ERROR nova.virt.libvirt.driver expected_state=task_states.IMAGE_PENDING_UPLOAD)
Dec 07 17:45:42.364404 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: ERROR nova.virt.libvirt.driver File "/opt/stack/new/nova/nova/compute/manager.py", line 3290, in update_task_state
Dec 07 17:45:42.364519 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: ERROR nova.virt.libvirt.driver instance.save(expected_task_state=expected_state)
Dec 07 17:45:42.364626 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: ERROR nova.virt.libvirt.driver File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 210, in wrapper
Dec 07 17:45:42.364732 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: ERROR nova.virt.libvirt.driver ctxt, self, fn.__name__, args, kwargs)
Dec 07 17:45:42.364842 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: ERROR nova.virt.libvirt.driver File "/opt/stack/new/nova/nova/conductor/rpcapi.py", line 245, in object_action
Dec 07 17:45:42.364952 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: ERROR nova.virt.libvirt.driver objmethod=objmethod, args=args, kwargs=kwargs)
Dec 07 17:45:42.365100 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: ERROR nova.virt.libvirt.driver File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 174, in call
Dec 07 17:45:42.365211 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: ERROR nova.virt.libvirt.driver retry=self.retry)
Dec 07 17:45:42.365384 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: ERROR nova.virt.libvirt.driver File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 131, in _send
Dec 07 17:45:42.365498 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: ERROR nova.virt.libvirt.driver timeout=timeout, retry=retry)
Dec 07 17:45:42.365606 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: ERROR nova.virt.libvirt.driver File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 559, in send
Dec 07 17:45:42.365710 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: ERROR nova.virt.libvirt.driver retry=retry)
Dec 07 17:45:42.365860 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: ERROR nova.virt.libvirt.driver File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 550, in _send
Dec 07 17:45:42.365971 ubuntu-xenial-s390x-cloud-zfwcec175-46 nova-compute[11373]: ERROR nova.virt.libvirt.driver raise result

This is due to this code in the libvirt driver calling the update_task_state callback which updates the instance in the database but if the instance has been deleted it raises InstanceNotFound, which the libvirt driver isn't handling:

https://github.com/openstack/nova/blob/81544829d14aebaaa717ffbe6ec33c67a2f11b7f/nova/virt/libvirt/driver.py#L1710

https://github.com/openstack/nova/blob/81544829d14aebaaa717ffbe6ec33c67a2f11b7f/nova/virt/libvirt/driver.py#L1772

jichenjc (jichenjc)
Changed in nova:
assignee: nobody → jichenjc (jichenjc)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/541152

Changed in nova:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/557152

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/557152
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=825529a3de45606839f6d44a924ca6263c970dfe
Submitter: Zuul
Branch: master

commit 825529a3de45606839f6d44a924ca6263c970dfe
Author: jichen <email address hidden>
Date: Wed Mar 28 13:47:20 2018 +0800

    Move update_task_state out of try/except

    During snapshot there might be some race condition lead
    to instance be deleted, compute layer has handle on the
    InstanceNotFound exception so move update_task_state
    out of try/except in snapshot function to avoid callback
    trace.

    Also, note follow up patch will handle additional race
    condition on instance disappear during other exception
    such as NotImplemented occurs scenario.

    Change-Id: Id9a009c8ab2a02f6f860399eeb145337de3c92e9
    Closes-Bug: 1737024

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/562901

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.0.0.0b1

This issue was fixed in the openstack/nova 18.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.openstack.org/562901
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2d66f6acad033f16a251fb41bcf054dd0021a081
Submitter: Zuul
Branch: stable/queens

commit 2d66f6acad033f16a251fb41bcf054dd0021a081
Author: jichen <email address hidden>
Date: Wed Mar 28 13:47:20 2018 +0800

    Move update_task_state out of try/except

    During snapshot there might be some race condition lead
    to instance be deleted, compute layer has handle on the
    InstanceNotFound exception so move update_task_state
    out of try/except in snapshot function to avoid callback
    trace.

    Also, note follow up patch will handle additional race
    condition on instance disappear during other exception
    such as NotImplemented occurs scenario.

    Change-Id: Id9a009c8ab2a02f6f860399eeb145337de3c92e9
    Closes-Bug: 1737024
    (cherry picked from commit 825529a3de45606839f6d44a924ca6263c970dfe)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.5

This issue was fixed in the openstack/nova 17.0.5 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.