Can't create incremental backups when using python3 and newer Ceph versions.

Bug #1838691 reported by Sofia Enriquez on 2019-08-01
22
This bug affects 5 people
Affects Status Importance Assigned to Milestone
Cinder
Undecided
Sofia Enriquez
Ubuntu Cloud Archive
High
Unassigned
Rocky
High
Unassigned
Stein
Undecided
Unassigned
Train
High
Unassigned
cinder (Ubuntu)
High
Unassigned
Disco
High
Unassigned
Eoan
High
Unassigned

Bug Description

During the incremental backups creation, unexpectedly the '_cleanup_incomplete_backup_operations' method is called aborting the backup:

DEBUG cinder.backup.drivers.ceph Volume file is RBD: attempting incremental backup. backup /usr/lib/python3.6/site-packages/cinder/backup/drivers/ceph.py:955

INFO cinder.service [-] Starting cinder-backup node (version 14.0.1)

DEBUG oslo_concurrency.lockutils Lock "cleanup_incomplete_backups_1" acquired by "cinder.backup.manager.BackupManager._cleanup_incomplete_backup_operations" :: waited 0.001s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:327

INFO cinder.backup.manager Cleaning up incomplete backup operations.

This happens when using ceph mimic with Python 3.x. The connections with Librbd[1] suddenly abort connections before accessing the data. [2]

How to reproduce:

>>> import rbd
>>> import rados
>>> cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
>>> cluster.connect()
>>> ioctx = cluster.open_ioctx('backups')
>>> image = rbd.Image(ioctx, 'volume-2d2c563d-c567-431a-9b25-ea06af83c598.backup.base')
>>> snaps = image.list_snaps()
>>> image.close()
>>> list(snaps)
Aborted (core dumped)

Expected output: snapshots list
>>> list(snaps)
[{'id': 4, 'size': 1073741824, 'name': 'backup.0dbaab38-595e-4e8a-8ecd-7ab60a879903.snap.1563912753.3520913', 'namespace': 0}, {'id': 5, 'size': 1073741824, 'name': 'backup.c9aaaf88-9ce1-4502-aa23-1dddfb1d45bb.snap.1564609930.8021717', 'namespace': 0}]

Actual output: core dumped

[1] https://docs.ceph.com/docs/giant/rbd/librbdpy/
[2] https://opendev.org/openstack/cinder/src/branch/master/cinder/backup/drivers/ceph.py#L649

Changed in cinder:
assignee: nobody → Sofia Enriquez (lsofia-enriquez)

Fix proposed to branch: master
Review: https://review.opendev.org/674152

Changed in cinder:
status: New → In Progress

Change abandoned by Sofia Enriquez (<email address hidden>) on branch: master
Review: https://review.opendev.org/674152
Reason: I just discovered this fix was already published. Moving to https://review.opendev.org/#/c/625396/.

Ceph upstream bug report : https://tracker.ceph.com/issues/41076

Reviewed: https://review.opendev.org/625396
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=76569672599eda764f946ad6ee2ce0bdaddad36b
Submitter: Zuul
Branch: master

commit 76569672599eda764f946ad6ee2ce0bdaddad36b
Author: Sven Wegener <email address hidden>
Date: Fri Dec 14 12:56:49 2018 +0100

    Fix ceph: only close rbd image after snapshot iteration is finished

    list_snaps() returns an iterator object that requires resources from the
    Image object. Closing the Image object deallocates these resources.

    On Ceph mimic, which contains changes to the functions being invoked,
    iterating over the snapshots of a closed image results in the following
    assertion being triggered:

    cinder-backup[29383]: /build/ceph-13.2.2/src/common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7f35a4812700 time 2018-12-14 10:10:49.843120
    cinder-backup[29383]: /build/ceph-13.2.2/src/common/Mutex.cc: 110: FAILED assert(r == 0)
    cinder-backup[29383]: ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
    cinder-backup[29383]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f35903440f2]
    cinder-backup[29383]: 2: (()+0x3162b7) [0x7f35903442b7]
    cinder-backup[29383]: 3: (Mutex::Lock(bool)+0x1de) [0x7f359031901e]
    cinder-backup[29383]: 4: (()+0x8f452) [0x7f358dafc452]
    cinder-backup[29383]: 5: (()+0x11280d) [0x7f358db7f80d]
    cinder-backup[29383]: 6: (rbd_snap_get_namespace_type()+0x29) [0x7f358dacf549]

    The lock being tried to acquire and the object that contained it have
    already been destroyed by closing the image. The assertion terminates
    the cinder-backup service via SIGABRT.

    On Ceph luminous we've seen occasional segfaults of the cinder-backup
    service, which might have the same underlying cause.

    Closes-Bug: #1838691

    Change-Id: If6043d82cc57d9247a290816d90b95cdf719eaac
    Signed-off-by: Sven Wegener <email address hidden>

Changed in cinder:
status: In Progress → Fix Released

Reviewed: https://review.opendev.org/676195
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=e78b220f448c7eb0de35aafd1d49b00547384d41
Submitter: Zuul
Branch: stable/stein

commit e78b220f448c7eb0de35aafd1d49b00547384d41
Author: Sven Wegener <email address hidden>
Date: Fri Dec 14 12:56:49 2018 +0100

    Fix ceph: only close rbd image after snapshot iteration is finished

    list_snaps() returns an iterator object that requires resources from the
    Image object. Closing the Image object deallocates these resources.

    On Ceph mimic, which contains changes to the functions being invoked,
    iterating over the snapshots of a closed image results in the following
    assertion being triggered:

    cinder-backup[29383]: /build/ceph-13.2.2/src/common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7f35a4812700 time 2018-12-14 10:10:49.843120
    cinder-backup[29383]: /build/ceph-13.2.2/src/common/Mutex.cc: 110: FAILED assert(r == 0)
    cinder-backup[29383]: ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
    cinder-backup[29383]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f35903440f2]
    cinder-backup[29383]: 2: (()+0x3162b7) [0x7f35903442b7]
    cinder-backup[29383]: 3: (Mutex::Lock(bool)+0x1de) [0x7f359031901e]
    cinder-backup[29383]: 4: (()+0x8f452) [0x7f358dafc452]
    cinder-backup[29383]: 5: (()+0x11280d) [0x7f358db7f80d]
    cinder-backup[29383]: 6: (rbd_snap_get_namespace_type()+0x29) [0x7f358dacf549]

    The lock being tried to acquire and the object that contained it have
    already been destroyed by closing the image. The assertion terminates
    the cinder-backup service via SIGABRT.

    On Ceph luminous we've seen occasional segfaults of the cinder-backup
    service, which might have the same underlying cause.

    Closes-Bug: #1838691

    Change-Id: If6043d82cc57d9247a290816d90b95cdf719eaac
    Signed-off-by: Sven Wegener <email address hidden>
    (cherry picked from commit 76569672599eda764f946ad6ee2ce0bdaddad36b)

tags: added: in-stable-stein

Reviewed: https://review.opendev.org/676211
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=57b127459189e39e993426f8e401e288f6394d4b
Submitter: Zuul
Branch: stable/rocky

commit 57b127459189e39e993426f8e401e288f6394d4b
Author: Sven Wegener <email address hidden>
Date: Fri Dec 14 12:56:49 2018 +0100

    Fix ceph: only close rbd image after snapshot iteration is finished

    list_snaps() returns an iterator object that requires resources from the
    Image object. Closing the Image object deallocates these resources.

    On Ceph mimic, which contains changes to the functions being invoked,
    iterating over the snapshots of a closed image results in the following
    assertion being triggered:

    cinder-backup[29383]: /build/ceph-13.2.2/src/common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7f35a4812700 time 2018-12-14 10:10:49.843120
    cinder-backup[29383]: /build/ceph-13.2.2/src/common/Mutex.cc: 110: FAILED assert(r == 0)
    cinder-backup[29383]: ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
    cinder-backup[29383]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f35903440f2]
    cinder-backup[29383]: 2: (()+0x3162b7) [0x7f35903442b7]
    cinder-backup[29383]: 3: (Mutex::Lock(bool)+0x1de) [0x7f359031901e]
    cinder-backup[29383]: 4: (()+0x8f452) [0x7f358dafc452]
    cinder-backup[29383]: 5: (()+0x11280d) [0x7f358db7f80d]
    cinder-backup[29383]: 6: (rbd_snap_get_namespace_type()+0x29) [0x7f358dacf549]

    The lock being tried to acquire and the object that contained it have
    already been destroyed by closing the image. The assertion terminates
    the cinder-backup service via SIGABRT.

    On Ceph luminous we've seen occasional segfaults of the cinder-backup
    service, which might have the same underlying cause.

    Closes-Bug: #1838691

    Change-Id: If6043d82cc57d9247a290816d90b95cdf719eaac
    Signed-off-by: Sven Wegener <email address hidden>
    (cherry picked from commit 76569672599eda764f946ad6ee2ce0bdaddad36b)
    (cherry picked from commit e78b220f448c7eb0de35aafd1d49b00547384d41)

tags: added: in-stable-rocky

This issue was fixed in the openstack/cinder 15.0.0.0rc1 release candidate.

This issue was fixed in the openstack/cinder 14.0.2 release.

This issue was fixed in the openstack/cinder 13.0.7 release.

Reviewed: https://review.opendev.org/676272
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=9e5d33ac244e8dce31ff6d2c7119ee4bd53a9908
Submitter: Zuul
Branch: stable/queens

commit 9e5d33ac244e8dce31ff6d2c7119ee4bd53a9908
Author: Sven Wegener <email address hidden>
Date: Fri Dec 14 12:56:49 2018 +0100

    Fix ceph: only close rbd image after snapshot iteration is finished

    list_snaps() returns an iterator object that requires resources from the
    Image object. Closing the Image object deallocates these resources.

    On Ceph mimic, which contains changes to the functions being invoked,
    iterating over the snapshots of a closed image results in the following
    assertion being triggered:

    cinder-backup[29383]: /build/ceph-13.2.2/src/common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7f35a4812700 time 2018-12-14 10:10:49.843120
    cinder-backup[29383]: /build/ceph-13.2.2/src/common/Mutex.cc: 110: FAILED assert(r == 0)
    cinder-backup[29383]: ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
    cinder-backup[29383]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f35903440f2]
    cinder-backup[29383]: 2: (()+0x3162b7) [0x7f35903442b7]
    cinder-backup[29383]: 3: (Mutex::Lock(bool)+0x1de) [0x7f359031901e]
    cinder-backup[29383]: 4: (()+0x8f452) [0x7f358dafc452]
    cinder-backup[29383]: 5: (()+0x11280d) [0x7f358db7f80d]
    cinder-backup[29383]: 6: (rbd_snap_get_namespace_type()+0x29) [0x7f358dacf549]

    The lock being tried to acquire and the object that contained it have
    already been destroyed by closing the image. The assertion terminates
    the cinder-backup service via SIGABRT.

    On Ceph luminous we've seen occasional segfaults of the cinder-backup
    service, which might have the same underlying cause.

    Closes-Bug: #1838691

    Change-Id: If6043d82cc57d9247a290816d90b95cdf719eaac
    Signed-off-by: Sven Wegener <email address hidden>
    (cherry picked from commit 76569672599eda764f946ad6ee2ce0bdaddad36b)
    (cherry picked from commit e78b220f448c7eb0de35aafd1d49b00547384d41)
    (cherry picked from commit 57b127459189e39e993426f8e401e288f6394d4b)

tags: added: in-stable-queens

Reviewed: https://review.opendev.org/676273
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=0fdff12b260550190be01c59080a8be0c109db28
Submitter: Zuul
Branch: stable/pike

commit 0fdff12b260550190be01c59080a8be0c109db28
Author: Sven Wegener <email address hidden>
Date: Fri Dec 14 12:56:49 2018 +0100

    Fix ceph: only close rbd image after snapshot iteration is finished

    list_snaps() returns an iterator object that requires resources from the
    Image object. Closing the Image object deallocates these resources.

    On Ceph mimic, which contains changes to the functions being invoked,
    iterating over the snapshots of a closed image results in the following
    assertion being triggered:

    cinder-backup[29383]: /build/ceph-13.2.2/src/common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7f35a4812700 time 2018-12-14 10:10:49.843120
    cinder-backup[29383]: /build/ceph-13.2.2/src/common/Mutex.cc: 110: FAILED assert(r == 0)
    cinder-backup[29383]: ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
    cinder-backup[29383]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f35903440f2]
    cinder-backup[29383]: 2: (()+0x3162b7) [0x7f35903442b7]
    cinder-backup[29383]: 3: (Mutex::Lock(bool)+0x1de) [0x7f359031901e]
    cinder-backup[29383]: 4: (()+0x8f452) [0x7f358dafc452]
    cinder-backup[29383]: 5: (()+0x11280d) [0x7f358db7f80d]
    cinder-backup[29383]: 6: (rbd_snap_get_namespace_type()+0x29) [0x7f358dacf549]

    The lock being tried to acquire and the object that contained it have
    already been destroyed by closing the image. The assertion terminates
    the cinder-backup service via SIGABRT.

    On Ceph luminous we've seen occasional segfaults of the cinder-backup
    service, which might have the same underlying cause.

    Closes-Bug: #1838691

    Change-Id: If6043d82cc57d9247a290816d90b95cdf719eaac
    Signed-off-by: Sven Wegener <email address hidden>
    (cherry picked from commit 76569672599eda764f946ad6ee2ce0bdaddad36b)
    (cherry picked from commit e78b220f448c7eb0de35aafd1d49b00547384d41)
    (cherry picked from commit 57b127459189e39e993426f8e401e288f6394d4b)
    (cherry picked from commit 9e5d33ac244e8dce31ff6d2c7119ee4bd53a9908)

tags: added: in-stable-pike

Reviewed: https://review.opendev.org/676274
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=a2e79da5e18b2e65e56d1dead5c0aba03415b45f
Submitter: Zuul
Branch: stable/ocata

commit a2e79da5e18b2e65e56d1dead5c0aba03415b45f
Author: Sven Wegener <email address hidden>
Date: Fri Dec 14 12:56:49 2018 +0100

    Fix ceph: only close rbd image after snapshot iteration is finished

    list_snaps() returns an iterator object that requires resources from the
    Image object. Closing the Image object deallocates these resources.

    On Ceph mimic, which contains changes to the functions being invoked,
    iterating over the snapshots of a closed image results in the following
    assertion being triggered:

    cinder-backup[29383]: /build/ceph-13.2.2/src/common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7f35a4812700 time 2018-12-14 10:10:49.843120
    cinder-backup[29383]: /build/ceph-13.2.2/src/common/Mutex.cc: 110: FAILED assert(r == 0)
    cinder-backup[29383]: ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
    cinder-backup[29383]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f35903440f2]
    cinder-backup[29383]: 2: (()+0x3162b7) [0x7f35903442b7]
    cinder-backup[29383]: 3: (Mutex::Lock(bool)+0x1de) [0x7f359031901e]
    cinder-backup[29383]: 4: (()+0x8f452) [0x7f358dafc452]
    cinder-backup[29383]: 5: (()+0x11280d) [0x7f358db7f80d]
    cinder-backup[29383]: 6: (rbd_snap_get_namespace_type()+0x29) [0x7f358dacf549]

    The lock being tried to acquire and the object that contained it have
    already been destroyed by closing the image. The assertion terminates
    the cinder-backup service via SIGABRT.

    On Ceph luminous we've seen occasional segfaults of the cinder-backup
    service, which might have the same underlying cause.

    Closes-Bug: #1838691

    Change-Id: If6043d82cc57d9247a290816d90b95cdf719eaac
    Signed-off-by: Sven Wegener <email address hidden>
    (cherry picked from commit 76569672599eda764f946ad6ee2ce0bdaddad36b)
    (cherry picked from commit e78b220f448c7eb0de35aafd1d49b00547384d41)
    (cherry picked from commit 57b127459189e39e993426f8e401e288f6394d4b)
    (cherry picked from commit 9e5d33ac244e8dce31ff6d2c7119ee4bd53a9908)
    (cherry picked from commit 0fdff12b260550190be01c59080a8be0c109db28)

tags: added: in-stable-ocata
James Page (james-page) on 2019-10-14
Changed in cinder (Ubuntu Eoan):
status: New → Fix Released
importance: Undecided → High
Changed in cinder (Ubuntu Disco):
status: New → Triaged
importance: Undecided → High
James Page (james-page) wrote :

Ubuntu SRU information

[Impact]
Unable to create backups of ceph volumes via cinder.

[Test Case]
<Deploy OpenStack using charms>
openstack volume create --size 5 testvolume
openstack volume backup create testvolume
openstack volume backup create testvolume
openstack volume backup create testvolume

at least one backup will go into error status.

[Regression Potential]
Part of the 13.0.7 stable release from upstream openstack so low regression potential.

James Page (james-page) wrote :

Test packages for Bionic/Rocky (UCA):

ppa:james-page/rocky

Hello Sofia, or anyone else affected,

Accepted cinder into rocky-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:rocky-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-rocky-needed to verification-rocky-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-rocky-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

The verification of the Stable Release Update for cinder has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

James Page (james-page) wrote :

This bug was fixed in the package cinder - 2:13.0.7-0ubuntu1~cloud0
---------------

 cinder (2:13.0.7-0ubuntu1~cloud0) bionic; urgency=medium
 .
   * New upstream point release for OpenStack Rocky (LP: #1848153):
     - Includes fix for issue with cinder-backup/ceph (LP: #1838691).
     - d/p/*: Drop all patches as included in release or not used.

This issue was fixed in the openstack/cinder 12.0.10 release.

Hello Sofia, or anyone else affected,

Accepted cinder into disco-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cinder/2:14.0.2-0ubuntu1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-disco to verification-done-disco. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-disco. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in cinder (Ubuntu Disco):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-disco

All autopkgtests for the newly accepted cinder (2:14.0.2-0ubuntu1) for disco have finished running.
The following regressions have been reported in tests triggered by the package:

cinder/2:14.0.2-0ubuntu1 (armhf)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/disco/update_excuses.html#cinder

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Hello Sofia, or anyone else affected,

Accepted cinder into stein-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:stein-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-stein-needed to verification-stein-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-stein-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-stein-needed
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.