[SRU] cinder-volume fails on start when rbd pool contains partially deleted images
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Cinder | Fix Released | Undecided | Ivan Kolodyazhny |
Ubuntu Cloud Archive | Invalid | Undecided | Unassigned |
Ocata | Fix Released | High | Unassigned |
Bug Description
[Impact]
* The cinder-volume service gets marked as down when an RBD pool contains partially deleted images.
[Test Case]
1) Use this bundle (pastebin: http://
2) Force a volume deletion:
root@juju-
rbd image 'volume-
size 10240 MB in 2560 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
flags:
root@juju-
root@juju-
rbd: error opening image volume-
3) Wait a few seconds. The following exception is raised and the cinder-volume service gets reported as down.
2020-02-10 20:42:47.491 14050 ERROR cinder.
(traceback truncated in the original report)
[Regression Potential]
* Minimal backport: the change only adds one more exception type to an existing try/except block.
[Other Info]
If an `rbd_remove` image operation fails for some reason [*], the image being deleted may be left in a state where its data and part of its metadata (the rbd_header object) are deleted, but it still has an entry in the rbd_directory object. As a result the image appears in `rbd_list` output, but `open` fails.
To calculate RBD pool capacity, the cinder-volume service scans the pool's images using `rbd_list` and then obtains each image's size by opening it. A partially removed image in the pool therefore causes a cinder-volume failure like the one below:
2017-06-15 17:47:58.045 26352 ERROR cinder.
2017-06-15 17:47:58.050 26352 ERROR oslo_service.
(tracebacks truncated in the original report)
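The capacity scan, and the shape of the fix, can be sketched roughly as follows. This is a minimal, self-contained sketch: `pool_usage_bytes`, the callables, and the local `ImageNotFound` class are illustrative stand-ins for the Cinder RBD driver's usage-collection code and the `rbd` library's exception, not the actual code.

```python
class ImageNotFound(Exception):
    """Stand-in for rbd.ImageNotFound: raised when an image is listed
    in rbd_directory but its rbd_header object is already gone."""


def pool_usage_bytes(rbd_list, open_image):
    """Sum image sizes across a pool, skipping partially deleted images.

    rbd_list:   callable returning the image names (`rbd_list` output)
    open_image: callable(name) -> size in bytes; raises ImageNotFound
                for a partially removed image
    """
    total = 0
    for name in rbd_list():
        try:
            total += open_image(name)
        except ImageNotFound:
            # Partially removed image: visible in rbd_list, but open fails.
            # Skip it instead of letting the exception kill the whole scan.
            continue
    return total
```

Before the fix, the image-not-found error escaped the loop and took down the periodic usage task; catching it per image lets the scan complete with the remaining images.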
[*] Situations where `rbd_remove` fails and leaves a partially removed image linked in rbd_directory are not avoidable in the general case. The operation involves scanning and removing many objects and cannot be atomic; it may be interrupted for many different reasons: user intervention, a client crash, a network error, or a Ceph cluster error. For this reason, removal from rbd_directory is done as the last step, so that users can still see such images and can complete the removal by rerunning `rbd remove`.
Note that if `rbd_remove` fails for some reason it should return an error, so the failure can be detected.
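The ordering described above can be sketched as follows. All names here are hypothetical stand-ins, not the actual librbd code; the point is that the rbd_directory unlink comes last, so an interrupted removal leaves the image visible in `rbd_list` and the removal can simply be rerun.

```python
class InterruptedRemoval(Exception):
    """Stand-in for any mid-removal failure (crash, network, user abort)."""


def remove_image(pool, name, crash_before_dir_removal=False):
    """Illustrative crash-safe removal order over a toy pool model.

    pool: {"data": {name: [...]}, "headers": {name: {...}}, "directory": set()}
    """
    # 1) Delete the data objects (the long, non-atomic part).
    pool["data"].pop(name, None)
    # 2) Delete the rbd_header object; after this, open() fails.
    pool["headers"].pop(name, None)
    if crash_before_dir_removal:
        # Simulated interruption: data and header are gone, but the image
        # is still linked in rbd_directory, i.e. still visible in rbd_list.
        raise InterruptedRemoval(name)
    # 3) Last step: unlink from rbd_directory so the image leaves rbd_list.
    pool["directory"].discard(name)
```

Because every step is idempotent, rerunning `remove_image` after an interruption completes the cleanup, which mirrors the advice to rerun `rbd remove`.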
Changed in cinder: | |
assignee: | nobody → Ivan Kolodyazhny (e0ne) |
status: | New → Confirmed |
Changed in cinder: | |
assignee: | Ivan Kolodyazhny (e0ne) → Jon Bernard (jbernard) |
Changed in cinder: | |
assignee: | Jon Bernard (jbernard) → Ivan Kolodyazhny (e0ne) |
tags: | added: regression-update |
Fix proposed to branch: master
Review: https://review.openstack.org/475400