ceph job fails intermittently with "ImageNotFound: [errno 2] error opening image volume-ac0586ce-26c6-4e04-a4d6-78a941a93ad7 at snapshot None"

Bug #1765845 reported by melanie witt
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Cinder
Fix Released
Medium
Ivan Kolodyazhny
Nominated for Ocata by melanie witt
Nominated for Pike by melanie witt
Nominated for Queens by melanie witt

Bug Description

Saw this on a change I'm working on to fix the ceph job on master:

  https://review.openstack.org/563238

where the ceph job failed with this trace [0]:

Apr 20 19:47:40.765248 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server Traceback (most recent call last):
Apr 20 19:47:40.765384 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 163, in _process_incoming
Apr 20 19:47:40.765508 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
Apr 20 19:47:40.765633 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 220, in dispatch
Apr 20 19:47:40.765780 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
Apr 20 19:47:40.765917 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 190, in _do_dispatch
Apr 20 19:47:40.766046 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
Apr 20 19:47:40.766174 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server File "<decorator-gen-242>", line 2, in delete_volume
Apr 20 19:47:40.766294 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server File "/opt/stack/new/cinder/cinder/coordination.py", line 151, in _synchronized
Apr 20 19:47:40.766413 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server return f(*a, **k)
Apr 20 19:47:40.766539 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server File "<decorator-gen-241>", line 2, in delete_volume
Apr 20 19:47:40.766658 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server File "/opt/stack/new/cinder/cinder/objects/cleanable.py", line 207, in wrapper
Apr 20 19:47:40.766788 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server result = f(*args, **kwargs)
Apr 20 19:47:40.766923 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server File "/opt/stack/new/cinder/cinder/volume/manager.py", line 872, in delete_volume
Apr 20 19:47:40.767054 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server self.publish_service_capabilities(context)
Apr 20 19:47:40.767168 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server File "/opt/stack/new/cinder/cinder/volume/manager.py", line 2527, in publish_service_capabilities
Apr 20 19:47:40.767314 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server self._report_driver_status(context)
Apr 20 19:47:40.767487 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server File "/opt/stack/new/cinder/cinder/volume/manager.py", line 2424, in _report_driver_status
Apr 20 19:47:40.767611 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server volume_stats = self.driver.get_volume_stats(refresh=True)
Apr 20 19:47:40.767747 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server File "/opt/stack/new/cinder/cinder/volume/drivers/rbd.py", line 528, in get_volume_stats
Apr 20 19:47:40.767877 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server self._update_volume_stats()
Apr 20 19:47:40.768006 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server File "/opt/stack/new/cinder/cinder/volume/drivers/rbd.py", line 514, in _update_volume_stats
Apr 20 19:47:40.768124 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server total_gbi = self._get_usage_info()
Apr 20 19:47:40.768250 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server File "/opt/stack/new/cinder/cinder/volume/drivers/rbd.py", line 408, in _get_usage_info
Apr 20 19:47:40.768375 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server ioctx=client.ioctx) as v:
Apr 20 19:47:40.768495 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server File "/opt/stack/new/cinder/cinder/volume/drivers/rbd.py", line 147, in __init__
Apr 20 19:47:40.768647 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server read_only=read_only)
Apr 20 19:47:40.768777 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server File "rbd.pyx", line 1392, in rbd.Image.__init__ (/build/ceph-7hHYeL/ceph-12.2.4/obj-x86_64-linux-gnu/src/pybind/rbd/pyrex/rbd.c:13545)
Apr 20 19:47:40.768897 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server ImageNotFound: [errno 2] error opening image volume-ac0586ce-26c6-4e04-a4d6-78a941a93ad7 at snapshot None
Apr 20 19:47:40.769016 ubuntu-xenial-ovh-bhs1-0003655341 cinder-volume[2422]: ERROR oslo_messaging.rpc.server

I found an old related bug:

  https://bugs.launchpad.net/cinder/+bug/1698786

that was fixed in the past by:

  https://review.openstack.org/475400

but then I found this change re-introduced the problem:

  https://review.openstack.org/486734/4/cinder/volume/drivers/rbd.py@381

Chatting on IRC with e0ne, he pointed out this is likely a race:

[13:50:31] <e0ne> do you see this error on each patch?
[13:50:49] <melwitt> no, it's intermittent
[13:55:27] <e0ne> it could be some race between delete_volume and get_pool_stats calls
[13:56:56] <melwitt> ah yeah, that would make sense
[13:57:34] <e0ne> 19:47:40.530030 - deleting volume
[13:57:53] <e0ne> 19:47:40.765109 - ImageNotFound exception raised
[14:00:50] <melwitt> yeah. definitely makes sense to be resilient there because all it's doing is listing images, then assuming each item in the list exists, when really it could be deleted out from underneath
[14:02:22] <melwitt> between the RBDProxy().list and the RBDVolumeProxy()

[0] http://logs.openstack.org/38/563238/1/check/legacy-tempest-dsvm-full-devstack-plugin-ceph/e42009b/logs/screen-c-vol.txt.gz?level=TRACE#_Apr_20_19_47_40_765248

Revision history for this message
melanie witt (melwitt) wrote :
Changed in cinder:
status: New → In Progress
Ivan Kolodyazhny (e0ne)
Changed in cinder:
assignee: nobody → Ivan Kolodyazhny (e0ne)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.openstack.org/563313
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=3cc8cbdee7c98e1fcf41bb27d81e34fef3a2d032
Submitter: Zuul
Branch: master

commit 3cc8cbdee7c98e1fcf41bb27d81e34fef3a2d032
Author: Ivan Kolodyazhny <email address hidden>
Date: Sat Apr 21 00:02:13 2018 +0300

    RBD: Handle ImageNotFound exception in _get_usage_info correctly

    Change https://review.openstack.org/#/c/486734 moved try-except
    statement to the wrong place. We need to open image in Ceph inside
    try-except block.

    This patch also fixes race condition between volume deletion and
    _get_usage_info calls.

    Closed-Bug: #1765845
    Change-Id: I7d3d006b023ca4b7963c4c684e4c036399d1295c
    Co-Authored-By: Melanie Witt <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/564615

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/564618

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/564619

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (stable/queens)

Reviewed: https://review.openstack.org/564615
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=c935eb8305f21745480abb49e1e380f1c9053d9a
Submitter: Zuul
Branch: stable/queens

commit c935eb8305f21745480abb49e1e380f1c9053d9a
Author: Ivan Kolodyazhny <email address hidden>
Date: Sat Apr 21 00:02:13 2018 +0300

    RBD: Handle ImageNotFound exception in _get_usage_info correctly

    Change https://review.openstack.org/#/c/486734 moved try-except
    statement to the wrong place. We need to open image in Ceph inside
    try-except block.

    This patch also fixes race condition between volume deletion and
    _get_usage_info calls.

    Closed-Bug: #1765845
    Change-Id: I7d3d006b023ca4b7963c4c684e4c036399d1295c
    Co-Authored-By: Melanie Witt <email address hidden>
    (cherry picked from commit 3cc8cbdee7c98e1fcf41bb27d81e34fef3a2d032)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (stable/pike)

Reviewed: https://review.openstack.org/564618
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=c9ede461d794d26b9bb647f04e4625d7ec3c2b28
Submitter: Zuul
Branch: stable/pike

commit c9ede461d794d26b9bb647f04e4625d7ec3c2b28
Author: Ivan Kolodyazhny <email address hidden>
Date: Sat Apr 21 00:02:13 2018 +0300

    RBD: Handle ImageNotFound exception in _get_usage_info correctly

    Change https://review.openstack.org/#/c/486734 moved try-except
    statement to the wrong place. We need to open image in Ceph inside
    try-except block.

    This patch also fixes race condition between volume deletion and
    _get_usage_info calls.

    Co-Authored-By: Melanie Witt <email address hidden>

    Closed-Bug: #1765845

     Conflicts:
     cinder/tests/unit/volume/drivers/test_rbd.py
     cinder/volume/drivers/rbd.py

    NOTE(melwitt): The conflicts were from the fact that the RBDVolumeProxy
    in Pike does not take the client and ioctx kwargs when constructed.

    Change-Id: I7d3d006b023ca4b7963c4c684e4c036399d1295c
    (cherry picked from commit 3cc8cbdee7c98e1fcf41bb27d81e34fef3a2d032)
    (cherry picked from commit c935eb8305f21745480abb49e1e380f1c9053d9a)

tags: added: in-stable-pike
Revision history for this message
melanie witt (melwitt) wrote :
Changed in cinder:
status: In Progress → Fix Released
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (stable/ocata)

Reviewed: https://review.openstack.org/564619
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=cecb44ee66f751be789c8a666a1dde825a277780
Submitter: Zuul
Branch: stable/ocata

commit cecb44ee66f751be789c8a666a1dde825a277780
Author: Ivan Kolodyazhny <email address hidden>
Date: Sat Apr 21 00:02:13 2018 +0300

    RBD: Handle ImageNotFound exception in _get_usage_info correctly

    Change https://review.openstack.org/#/c/486734 moved try-except
    statement to the wrong place. We need to open image in Ceph inside
    try-except block.

    This patch also fixes race condition between volume deletion and
    _get_usage_info calls.

    Co-Authored-By: Melanie Witt <email address hidden>

    Closed-Bug: #1765845

     Conflicts:
     cinder/tests/unit/volume/drivers/test_rbd.py

    NOTE(melwitt): The conflict was from the fact that test__get_usage_info
    did not already exist in Ocata.

    Change-Id: I7d3d006b023ca4b7963c4c684e4c036399d1295c
    (cherry picked from commit 3cc8cbdee7c98e1fcf41bb27d81e34fef3a2d032)
    (cherry picked from commit c935eb8305f21745480abb49e1e380f1c9053d9a)
    (cherry picked from commit c9ede461d794d26b9bb647f04e4625d7ec3c2b28)

tags: added: in-stable-ocata
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.