ceph incremental backup fails in mitaka

Bug #1578036 reported by dob
This bug affects 18 people
Affects   Status         Importance  Assigned to   Milestone
Cinder    Fix Released   Medium      Eric Harney
os-brick  Fix Committed  Low         Xiaojun Liao

Bug Description

When I try to back up a volume (Ceph backend) via "cinder backup" to a second Ceph cluster, cinder creates a full backup each time instead of a differential backup.

mitaka release

cinder-backup 2:8.0.0-0ubuntu1 all Cinder storage service - Scheduler server
cinder-common 2:8.0.0-0ubuntu1 all Cinder storage service - common files
cinder-volume 2:8.0.0-0ubuntu1 all Cinder storage service - Volume server
python-cinder 2:8.0.0-0ubuntu1 all Cinder Python libraries

My steps are:
1. cinder backup-create a3bacaf5-6cf8-480d-a5db-5ecdf4223b6a
2. cinder backup-create --incremental --force a3bacaf5-6cf8-480d-a5db-5ecdf4223b6a

and what I have in Ceph backup cluster:
rbd --cluster bak -p backups du
volume-a3bacaf5-6cf8-480d-a5db-5ecdf4223b6a.backup.37cddcbf-4a18-4f44-927d-5e925b37755f 1024M 1024M
volume-a3bacaf5-6cf8-480d-a5db-5ecdf4223b6a.backup.55e5c1a3-8c0c-4912-b98a-1ea7e6396f85 1024M 1024M

Revision history for this message
dob (nnex) wrote :

I tried to debug this error.

ceph.py at line 859:
do_full_backup = False
if self._file_is_rbd(volume_file):
    # If the volume is an RBD, attempt an incremental backup.
    try:
        self._backup_rbd(backup_id, volume_id, volume_file,
                         volume_name, length)
    except exception.BackupRBDOperationFailed:
        LOG.debug("Forcing full backup of volume %s.", volume_id)
        do_full_backup = True
else:
    do_full_backup = True

if do_full_backup:
    self._full_backup(backup_id, volume_id, volume_file,
                      volume_name, length)

but something goes wrong and the function _file_is_rbd returns False.

Let's look at the '_file_is_rbd' function in ceph.py at line 683:

def _file_is_rbd(self, volume_file):
    """Returns True if the volume_file is actually an RBD image."""
    return hasattr(volume_file, 'rbd_image')

This means the 'rbd_image' attribute was never assigned.

The 'rbd_image' attribute belongs to the class 'RBDImageIOWrapper' in cinder/volume/drivers/rbd.py.

I printed out the contents of dir(volume_file):

['__abstractmethods__', '__class__', '__delattr__', '__doc__', '__enter__', '__exit__', '__format__', '__getattribute__', '__hash__', '__init__', '__iter__', '__metaclass__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_abc_cache', '_abc_negative_cache', '_abc_negative_cache_version', '_abc_registry', '_checkClosed', '_checkReadable', '_checkSeekable', '_checkWritable', '_inc_offset', 'close', 'closed', 'fileno', 'flush', 'isatty', 'next', 'read', 'readable', 'readall', 'readline', 'readlines', 'seek', 'seekable', 'tell', 'truncate', 'writable', 'write', 'writelines']

These attributes belong to the class RBDVolumeIOWrapper in os_brick/initiator/linuxrbd.py, and there is no 'rbd_image' among them.

In sum, cinder-backup does not identify the image as RBD.
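The mismatch can be reproduced in isolation. The following is a minimal sketch with hypothetical stand-in classes (not the real cinder/os-brick code); it only shows why a hasattr()-based check passes for one wrapper and fails for the other:

```python
class CinderRBDImageIOWrapper:
    """Stand-in for cinder.volume.drivers.rbd.RBDImageIOWrapper."""
    def __init__(self, rbd_image):
        self.rbd_image = rbd_image  # public attribute the backup driver probes


class BrickRBDVolumeIOWrapper:
    """Stand-in for os_brick.initiator.linuxrbd.RBDVolumeIOWrapper
    (the private attribute name here is illustrative)."""
    def __init__(self, rbd_image):
        self._rbd_image = rbd_image  # no public 'rbd_image' attribute


def file_is_rbd(volume_file):
    """Same check as CephBackupDriver._file_is_rbd."""
    return hasattr(volume_file, 'rbd_image')


assert file_is_rbd(CinderRBDImageIOWrapper(object()))      # incremental path
assert not file_is_rbd(BrickRBDVolumeIOWrapper(object()))  # forces full backup
```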

my cinder.conf at block node:
[DEFAULT]
debug = true
rootwrap_config = /etc/cinder/rootwrap.conf
api_paste_confg = /etc/cinder/api-paste.ini
iscsi_helper = tgtadm
volume_name_template = volume-%s
volume_group = cinder-volumes
verbose = True
auth_strategy = keystone
state_path = /var/lib/cinder
lock_path = /var/lock/cinder
volumes_dir = /var/lib/cinder/volumes
rpc_backend = rabbit
auth_strategy = keystone
my_ip = 10.30.17.21
enabled_backends = lvm,rbd
glance_host = controller
control_exchange = cinder
notification_driver = messagingv2
backup_driver = cinder.backup.drivers.ceph
backup_ceph_conf = /etc/ceph/bak.conf
backup_ceph_user = cinder-backup
backup_ceph_chunk_size = 134217728
backup_ceph_pool = backups
backup_ceph_stripe_unit = 0
backup_ceph_stripe_count = 0
restore_discard_excess_bytes = true
[database]
connection = mysql://cinder:XXXXXXX@controller/cinder
[oslo_messaging_rabbit]
rabbit_host = controller
rabbit_userid = openstack
rabbit_password = XXXXXXXXX
[keystone_authtoken]
auth_uri = http://controller:5000/v2.0
identity_uri = http://controller:35357
admin_tenant_name = service
admin_user = cinder
admin_password = XXXXXXX
[lvm]
volume_driver = cinder.volume.drivers.lvm.LVMVolumeDriver
volume_group = cinder-vo...


Tom Barron (tpb)
Changed in cinder:
assignee: nobody → Tom Barron (tpb)
summary: - backup to 2nd ceph cluster
+ ceph incremental backup fails in mitaka
Revision history for this message
Tom Barron (tpb) wrote :

Thanks for the great debugging.

I've confirmed that this issue still exists in master and is present even with a single ceph cluster. Hence I've changed the bug title a bit.

Working hypothesis is that since the ceph backup driver code has not changed, this issue was introduced by the work done in mitaka to decouple backup and volume services. That work relies on using brick connector and rpc to the volume service to get the volume that is to be backed up and likely something in that chain is losing the 'rbd_image' attribute of the volume file so that the ceph backup driver treats ceph volumes as if they were from foreign backends and does a full backup every time.

I'll instrument the code and see if we can figure out a fix.

Changed in cinder:
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to cinder (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/319554

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to cinder (master)

Reviewed: https://review.openstack.org/319554
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=876d797a1f5207ba905fdc23413b6cd1cb4074c9
Submitter: Jenkins
Branch: master

commit 876d797a1f5207ba905fdc23413b6cd1cb4074c9
Author: Tom Barron <email address hidden>
Date: Sat May 21 05:46:58 2016 -0400

    Add debug messages and comments for ceph backup

    Help operators and engineers analyzing issues with
    ceph backup driver by adding some useful debug messages
    and comments. Related to this fix [1].

    TrivialFix

    [1] I3d6833f8c1665272d914c95654775662df007aa9

    Change-Id: Ib23388d0f698ecf6931d4f4d5dfbe354f4f746bd
    Related-bug: # 1578036

Tom Barron (tpb)
tags: added: backup-service mitaka-backport-potential
tags: added: ceph
Revision history for this message
dob (nnex) wrote :

I got an error after applying the patch:
2016-05-23 10:28:25.088 17726 INFO cinder.backup.drivers.ceph [req-06a0e581-c999-4efe-af86-31b3a47b3d7d 8d384cc45d0c467eb46ef3f980cf314d 3029462986ac400ca2c2c6ac4c9af5a7 - - -] RBD diff op failed - (ret=33 stderrImporting image diff: 0% complete...failed.
rbd: import-diff failed: (33) Numerical argument out of domain
)

Each backup is still a full backup.

Revision history for this message
Lisa Li (lisali) wrote :

Yes, this is a problem. Tom, it seems it would be better to make RBDImageIOWrapper the same in both linuxrbd and the rbd driver.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Hi @nnex, I have also observed this, and it appears to be a result of the switch (I think in Mitaka) to using os_brick.initiator.linuxrbd.RBDImageIOWrapper instead of cinder.volume.drivers.rbd.RBDImageIOWrapper in cinder.backup.manager. The problem is that the newer brick RBDImageIOWrapper class does not have an 'rbd_image' attribute, so the ceph backup driver fails to establish that the source volume is of type rbd and therefore falls back to doing a full copy. One way to fix this would be to add that 'rbd_image' attribute to the brick class, making it the same as its cinder counterpart.
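A hedged sketch of that suggestion, with the class reduced to the relevant part (the real os-brick wrapper carries more state, and the private attribute name is an assumption):

```python
class RBDVolumeIOWrapper:
    """Reduced stand-in for os_brick.initiator.linuxrbd.RBDVolumeIOWrapper."""
    def __init__(self, rbd_image):
        self._rbd_image = rbd_image

    @property
    def rbd_image(self):
        # Expose the image under the same public name as the cinder-side
        # wrapper, so that hasattr(volume_file, 'rbd_image') in the ceph
        # backup driver succeeds again.
        return self._rbd_image


wrapper = RBDVolumeIOWrapper(rbd_image=object())
assert hasattr(wrapper, 'rbd_image')
```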

Revision history for this message
dob (nnex) wrote :

Tom, do you have any plans to fix it?

Revision history for this message
alan zhang (alan-zhang) wrote :

Hi, has this been fixed yet?

Revision history for this message
dob (nnex) wrote :

Hi, Alan.

No, the problem still exists.

Revision history for this message
Jon Bernard (jbernard) wrote :

Hey Tom, is this still in your queue? If not I can take this one if you want.

Revision history for this message
alan zhang (alan-zhang) wrote :

Hi, I applied this patch https://review.openstack.org/319554 and got this:

[root@controller ~]# cinder backup-create --incremental --force 64ad570c-9556-43f1-91eb-16f506efe4ea
+-----------+--------------------------------------+
| Property | Value |
+-----------+--------------------------------------+
| id | 3df9976c-0f01-4548-9e45-ed9ad672775b |
| name | None |
| volume_id | 64ad570c-9556-43f1-91eb-16f506efe4ea |
+-----------+--------------------------------------+

2016-10-21 10:33:40.945 2661171 DEBUG oslo_messaging._drivers.amqpdriver [req-c32eb48c-3109-4500-ae17-228bc69d533e 597e55d40136460680b3039b7087eb43 98189fbdeb8d4db998a4d2eb3afa4d68 - - -] received message msg_id: None reply to None __call__ /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:201
2016-10-21 10:33:41.033 2661171 INFO cinder.backup.manager [req-8437d817-9d29-4954-8836-8c609cdf1e17 597e55d40136460680b3039b7087eb43 98189fbdeb8d4db998a4d2eb3afa4d68 - - -] Create backup started, backup: 3df9976c-0f01-4548-9e45-ed9ad672775b volume: 64ad570c-9556-43f1-91eb-16f506efe4ea.
2016-10-21 10:33:41.049 2661171 DEBUG oslo_concurrency.processutils [req-8437d817-9d29-4954-8836-8c609cdf1e17 597e55d40136460680b3039b7087eb43 98189fbdeb8d4db998a4d2eb3afa4d68 - - -] Running cmd (subprocess): sudo cinder-rootwrap /etc/cinder/rootwrap.conf cat /etc/iscsi/initiatorname.iscsi execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:344
2016-10-21 10:33:41.187 2661171 DEBUG oslo_concurrency.processutils [req-8437d817-9d29-4954-8836-8c609cdf1e17 597e55d40136460680b3039b7087eb43 98189fbdeb8d4db998a4d2eb3afa4d68 - - -] CMD "sudo cinder-rootwrap /etc/cinder/rootwrap.conf cat /etc/iscsi/initiatorname.iscsi" returned: 0 in 0.138s execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:374
2016-10-21 10:33:41.189 2661171 DEBUG oslo_concurrency.processutils [req-8437d817-9d29-4954-8836-8c609cdf1e17 597e55d40136460680b3039b7087eb43 98189fbdeb8d4db998a4d2eb3afa4d68 - - -] Running cmd (subprocess): sudo cinder-rootwrap /etc/cinder/rootwrap.conf systool -c fc_host -v execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:344
2016-10-21 10:33:41.329 2661171 DEBUG oslo_concurrency.processutils [req-8437d817-9d29-4954-8836-8c609cdf1e17 597e55d40136460680b3039b7087eb43 98189fbdeb8d4db998a4d2eb3afa4d68 - - -] CMD "sudo cinder-rootwrap /etc/cinder/rootwrap.conf systool -c fc_host -v" returned: 1 in 0.140s execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:374
2016-10-21 10:33:41.331 2661171 DEBUG oslo_concurrency.processutils [req-8437d817-9d29-4954-8836-8c609cdf1e17 597e55d40136460680b3039b7087eb43 98189fbdeb8d4db998a4d2eb3afa4d68 - - -] u'sudo cinder-rootwrap /etc/cinder/rootwrap.conf systool -c fc_host -v' failed. Not Retrying. execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:422
2016-10-21 10:33:41.333 2661171 DEBUG oslo_concurrency.processutils [req-8437d817-9d29-4954-8836-8c609cdf1e17 597e55d40136460680b3039b7087eb43 98189fbdeb8d4db998a4d2eb3afa4d68 - - -] Running cmd (subprocess): sudo cinder-rootwrap /etc/cinder/rootwra...

Revision history for this message
Tom Barron (tpb) wrote :

Jon, It would be great if you take this one. My initial idea for a fix didn't work out and I'm pretty busy now working on manila.

Changed in cinder:
assignee: Tom Barron (tpb) → nobody
Revision history for this message
alan zhang (alan-zhang) wrote :

Hi, Jon. Please take this one. Thanks.

Revision history for this message
dob (nnex) wrote :

Hello, all.

I tried again with the patch:
1. cinder backup-create --force b7054216-e902-44bd-b1ea-6df8580808ce:
2016-11-04 21:55:44.726 167493 DEBUG oslo_messaging._drivers.amqpdriver [req-d7b6a1c5-d062-4b3f-afc9-9d128feb7067 8d384cc45d0c467eb46ef3f980cf314d 3029462986ac400ca2c2c6ac4c9af5a7 - - -] received message msg_id: None reply to None __call__ /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:201
2016-11-04 21:55:44.786 167493 INFO cinder.backup.manager [req-2f3fe860-2bc0-4a23-9700-4de20a017920 8d384cc45d0c467eb46ef3f980cf314d 3029462986ac400ca2c2c6ac4c9af5a7 - - -] Create backup started, backup: 09feeaf1-6bbc-4107-93e8-bd90a224e9ee volume: b7054216-e902-44bd-b1ea-6df8580808ce.
2016-11-04 21:55:44.788 167493 DEBUG oslo_messaging._drivers.amqpdriver [req-2f3fe860-2bc0-4a23-9700-4de20a017920 8d384cc45d0c467eb46ef3f980cf314d 3029462986ac400ca2c2c6ac4c9af5a7 - - -] CAST unique_id: 7a15e8c19ed74801a63cdf515962994e NOTIFY exchange 'cinder' topic 'notifications.info' _send /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:438
2016-11-04 21:55:44.825 167493 DEBUG oslo_concurrency.processutils [req-2f3fe860-2bc0-4a23-9700-4de20a017920 8d384cc45d0c467eb46ef3f980cf314d 3029462986ac400ca2c2c6ac4c9af5a7 - - -] Running cmd (subprocess): sudo cinder-rootwrap /etc/cinder/rootwrap.conf cat /etc/iscsi/initiatorname.iscsi execute /usr/lib/python2.7/dist-packages/oslo_concurrency/processutils.py:326
2016-11-04 21:55:44.947 167493 DEBUG oslo_concurrency.processutils [req-2f3fe860-2bc0-4a23-9700-4de20a017920 8d384cc45d0c467eb46ef3f980cf314d 3029462986ac400ca2c2c6ac4c9af5a7 - - -] CMD "sudo cinder-rootwrap /etc/cinder/rootwrap.conf cat /etc/iscsi/initiatorname.iscsi" returned: 1 in 0.122s execute /usr/lib/python2.7/dist-packages/oslo_concurrency/processutils.py:356
2016-11-04 21:55:44.948 167493 DEBUG oslo_concurrency.processutils [req-2f3fe860-2bc0-4a23-9700-4de20a017920 8d384cc45d0c467eb46ef3f980cf314d 3029462986ac400ca2c2c6ac4c9af5a7 - - -] u'sudo cinder-rootwrap /etc/cinder/rootwrap.conf cat /etc/iscsi/initiatorname.iscsi' failed. Not Retrying. execute /usr/lib/python2.7/dist-packages/oslo_concurrency/processutils.py:404
2016-11-04 21:55:44.949 167493 WARNING os_brick.initiator.connector [req-2f3fe860-2bc0-4a23-9700-4de20a017920 8d384cc45d0c467eb46ef3f980cf314d 3029462986ac400ca2c2c6ac4c9af5a7 - - -] Could not find the iSCSI Initiator File /etc/iscsi/initiatorname.iscsi
2016-11-04 21:55:44.950 167493 DEBUG oslo_concurrency.processutils [req-2f3fe860-2bc0-4a23-9700-4de20a017920 8d384cc45d0c467eb46ef3f980cf314d 3029462986ac400ca2c2c6ac4c9af5a7 - - -] Running cmd (subprocess): sudo cinder-rootwrap /etc/cinder/rootwrap.conf systool -c fc_host -v execute /usr/lib/python2.7/dist-packages/oslo_concurrency/processutils.py:326
2016-11-04 21:55:45.065 167493 DEBUG oslo_concurrency.processutils [req-2f3fe860-2bc0-4a23-9700-4de20a017920 8d384cc45d0c467eb46ef3f980cf314d 3029462986ac400ca2c2c6ac4c9af5a7 - - -] CMD "sudo cinder-rootwrap /etc/cinder/rootwrap.conf systool -c fc_host -v" returned: 1 in 0.116s execute /usr/lib/python2.7/dist-packages/oslo_concurrency/processutils.py:356
2016-11-04 21:55:45.066 167493 DEBUG oslo_concurren...

Jon Bernard (jbernard)
Changed in cinder:
assignee: nobody → Jon Bernard (jbernard)
Revision history for this message
Gaudenz Steinlin (gaudenz-debian) wrote :

I stumbled over this issue today. In mitaka there are actually 3 related issues which prevent differential backups for Ceph from working:

* https://review.openstack.org/#/c/326696/ fixed the detection of RBD volumes and is fixed in Newton.

Once this issue is fixed, differential backups of volumes that are in the 'available' state work. For volumes that are in use, the issue from comment #5 appears. This is because CephBackupDriver.backup tries to get the id and rbd image name of the volume by querying the database for the original volume to back up. But since change
https://review.openstack.org/#/c/262395/ the original volume is cloned if it is in use and the backup is taken from the cloned volume. When calling rbd export-diff, the ceph driver mixes up this cloned volume and the original volume.

The third issue is not directly related to this bug:

* https://review.openstack.org/#/c/378945/ prevents leaving all connections to the Ceph cluster open

Jon Bernard (jbernard)
Changed in cinder:
assignee: Jon Bernard (jbernard) → nobody
importance: Undecided → Medium
Revision history for this message
Shubham (shubham0d) wrote :

I will try to fix this issue, but I can only start working on this bug after some time, probably from the beginning of December.

Changed in cinder:
assignee: nobody → Shubham (shubham0d)
Revision history for this message
Alexander Kashirin (ax.kashirin) wrote :

Hi, Shubham.

In master there are at least two independent errors that prevent differential ceph backups. Each is masked as exception.BackupRBDOperationFailed in the CephBackupDriver, and each forces a full backup.

1. CephBackupDriver._snap_exists:
 ...
    if snap.name == snap_name: # => AttributeError: 'dict' object has no attribute 'name'
 ...

2. CephBackupDriver.backup is called with a wrong 'volume_file' argument: volume_file.rbd_conf provides the pathname of a nonexistent ceph conf file for the source volume. As a result, _rbd_diff_transfer fails. The reason is in os_brick/initiator/connectors/rbd.py: RBDConnector._create_ceph_conf writes the conf to a temp file and deletes this file right afterwards.

After these bugs were fixed in my devstack, I was able to create incremental backups without problems. My patches are attached; I hope they help.
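Both failure modes can be sketched in isolation (everything below is an illustrative stand-in, not the actual cinder/os-brick code):

```python
import os
import tempfile

# --- Failure 1: snapshots come back as dicts, not objects ----------------
# The librbd python bindings return list_snaps() entries shaped like
# {'id': ..., 'size': ..., 'name': ...}, so attribute access raises
# AttributeError; index by key instead.
def snap_exists(snaps, snap_name):
    """Return True if snap_name is among the image's snapshots."""
    return any(snap['name'] == snap_name for snap in snaps)

snaps = [{'id': 1, 'size': 1024, 'name': 'backup.abc.snap.1'}]
assert snap_exists(snaps, 'backup.abc.snap.1')
assert not snap_exists(snaps, 'backup.abc.snap.2')

# --- Failure 2: the temp ceph conf disappears too early ------------------
# Writing the conf with delete=True removes the file the moment it is
# closed, so any later process reading that pathname fails.
with tempfile.NamedTemporaryFile(mode='w', delete=True) as f:
    f.write('[global]\n')
    short_lived = f.name
assert not os.path.exists(short_lived)  # rbd export-diff would fail here

# The fix: keep the file until the backup finishes, then clean up.
f = tempfile.NamedTemporaryFile(mode='w', delete=False)
f.write('[global]\n')
f.close()
assert os.path.exists(f.name)           # still readable later
os.remove(f.name)                       # delayed cleanup
```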

Eric Harney (eharney)
tags: removed: mitaka-backport-potential
Revision history for this message
int32bit (int32bit) wrote :

@ax.kashirin, have you tested your code when the volume is attached to a VM (status 'in-use')? I picked up your code and it does not seem to work in that situation.

Revision history for this message
Fikry (dxtxteam) wrote :

Hi, dob. Did you solve the problem? I also encountered this problem on Newton.

Revision history for this message
lihaijing (lihaijing) wrote :

Hi, we hit this problem too.
Openstack: Newton
os-brick:1.13.2
ceph:10.2.5

With the patch in comment #18, volumes in 'available' status can do an rbd incremental backup. But it doesn't work for 'in-use' volumes and raises this problem:
https://bugs.launchpad.net/oslo.privsep/+bug/1593743

Revision history for this message
Jeffrey Zhang (jeffrey4l) wrote :

Same issue here
OpenStack: Ocata
Ceph: 10.2.3

With ax.kashirin's solution, it works.

Revision history for this message
alan zhang (alan-zhang) wrote :

This patch is for mitaka.
I have tested it, and it works for me.

You have to patch os_brick.

Eric Harney (eharney)
Changed in cinder:
assignee: Shubham (shubham0d) → nobody
Xiaojun Liao (wwba)
Changed in cinder:
status: Confirmed → New
status: New → Confirmed
Xiaojun Liao (wwba)
Changed in cinder:
assignee: nobody → Xiaojun Liao (wwba)
Xiaojun Liao (wwba)
Changed in cinder:
assignee: Xiaojun Liao (wwba) → nobody
Changed in os-brick:
assignee: nobody → Xiaojun Liao (wwba)
dob (nnex)
information type: Public → Public Security
information type: Public Security → Public
information type: Public → Private Security
information type: Private Security → Public
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-brick (master)

Fix proposed to branch: master
Review: https://review.openstack.org/476503

Changed in os-brick:
status: New → In Progress
Eric Harney (eharney)
Changed in cinder:
assignee: nobody → Chaynika Saikia (csaikia)
Eric Harney (eharney)
Changed in cinder:
status: Confirmed → In Progress
zheng yin (yin-zheng)
Changed in cinder:
assignee: Chaynika Saikia (csaikia) → zheng yin (yin-zheng)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/480062

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-brick (master)

Reviewed: https://review.openstack.org/476503
Committed: https://git.openstack.org/cgit/openstack/os-brick/commit/?id=1fe844efaf324ed7b9b64c21f1c3f8af9f616738
Submitter: Jenkins
Branch: master

commit 1fe844efaf324ed7b9b64c21f1c3f8af9f616738
Author: Xiaojun Liao <email address hidden>
Date: Thu Jun 22 19:54:01 2017 +0800

    Fix ceph incremental backup fail

    Cinder _rbd_diff_transfer() uses "import-diff" and "export-diff" cmdline
    to do a incremental backup, it will fail without a ceph-conf file. Delay
    to delete temporary ceph-conf file in class RBDConnector during ceph
    volume backup.

    Change-Id: Ib74c85266b8c812f7a40dac293847a28768eae9a
    Partial-Bug: #1578036

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on cinder (master)

Change abandoned by zheng yin (yin__zheng@163.com) on branch: master
Review: https://review.openstack.org/480062

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/482816

zheng yin (yin-zheng)
Changed in cinder:
assignee: zheng yin (yin-zheng) → nobody
assignee: nobody → zheng yin (yin-zheng)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on cinder (master)

Change abandoned by zheng yin (yin__zheng@163.com) on branch: master
Review: https://review.openstack.org/482816

zheng yin (yin-zheng)
Changed in cinder:
assignee: zheng yin (yin-zheng) → nobody
Xiaojun Liao (wwba)
Changed in os-brick:
status: In Progress → Fix Committed
Changed in cinder:
status: In Progress → New
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-brick (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/511232

Revision history for this message
weiguo sun (wsun2) wrote :

I am still seeing the error with the above commit (https://review.openstack.org/511232) for an "in-use" volume with a ceph volume & backup backend. The 2nd issue pointed out by Gaudenz in post #16 doesn't seem to be resolved: the ceph driver is trying to perform the differential export with a snapshot created against the original ceph volume instead of the snap-clone of the original ceph volume returned by "get_backup_device" in "cinder/backup/manager.py".

When "rbd export-diff --id xxx --conf /tmp/tmpN0XVXO --pool xxx <email address hidden> -" is attempted, it fails with "Numerical argument out of domain" since the snapshot (backup.737df596-3922-4f60-8028-81e22a22a57f.snap.1507841205.47) generated by "source_rbd_image.create_snap(new_snap)" is made against the snap-clone instead.

I am testing this against our Newton tree but I don't see any code in the latest master branch addressing this mix-up issue.

Revision history for this message
Xiaojun Liao (wwba) wrote : Re:[Bug 1578036] Re: ceph incremental backup fails in mitaka

I have some time recently, let me see.


Revision history for this message
Jason Dillaman (jdillaman) wrote :

Please ensure you are testing this against the latest Ceph Jewel release (or later) since there have been some bug fixes related to export-diff/import-diff.

Revision history for this message
weiguo sun (wsun2) wrote :

Hi Jason,

PDB clearly shows this is not related to version of export-diff/import-diff. The ceph.py is trying to export-diff a snapshot made against the snapshot against the original volume, but backup manager.py passes the snap-clone of the original volume for the snapshot and hence export-diff can't find the actual snapshot.

As per the finding from Gaudenz in post #16, there is a mix-up in ceph.py which has not been addressed. This leads to failure of the incremental (ceph differential) backup and forces a full backup against the original volume (chunk by chunk), which is mostly not a valid backup for an "in-use" volume given that it is not crash consistent.

Revision history for this message
weiguo sun (wsun2) wrote :

Made some typos in my above response to Jason, 1st paragraph should read as,

"PDB clearly shows this is not related to version of export-diff/import-diff. The ceph.py is trying to export-diff a snapshot made against the original ceph volume, but backup manager.py passes a cloned temp volume to ceph.py for snapshot creation. Hence export-diff, being executed, can't find the actual snapshot against the original ceph volume."

Revision history for this message
Jason Dillaman (jdillaman) wrote :

Cool, the "Numerical argument out of domain" error from the rbd CLI is still surprising and I really wouldn't expect it. If that is still occurring on up-to-date releases of Ceph, we (Ceph) would like to fix it.

Revision history for this message
Xiaojun Liao (wwba) wrote :

Hi weiguo,

A volume should not be attached to two different nodes at the same time, so for an 'in-use' original volume, a temp snapshot/clone is attached to the backup service node, as we see from change https://review.openstack.org/#/c/262395. Because of the temp snapshot/clone, it is impossible to successfully do an incremental/differential backup if the source volume is an RBD, considering it needs a 'from-snap' (a recently created snapshot for backup).

In the rbd backup driver, maybe it needs a check to skip attempting an incremental backup when the temp snapshot/clone is found.
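A sketch of that check, under the assumption that the backup service can tell the temporary clone's image name apart from the original volume's (the naming below is hypothetical):

```python
def should_attempt_incremental(source_image_name, original_volume_name):
    """Only attempt an incremental backup when exporting from the original
    image; a temporary snapshot/clone of an in-use volume carries no
    'from-snap' to diff against, so fall back to a full backup up front."""
    return source_image_name == original_volume_name


assert should_attempt_incremental('volume-abc', 'volume-abc')
assert not should_attempt_incremental('volume-abc.temp-clone', 'volume-abc')
```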

Revision history for this message
weiguo sun (wsun2) wrote :

Hi Xiaojun,

Thanks for clarifying the design context of scaling the backup service. It seems that there are two potential fixes,

(1) The Ceph backup driver skips the incremental/differential backup once it determines the volume is in "in-use" status or the returned volume is a snap-clone volume; however, the ceph backup driver would then need to use the snap-clone volume as the source data, which is not the case with the driver right now (see the following debug output): the driver is transferring data from the original in-use volume (f8f6c0a3-b19c-43f3-965f-59945f4dc4b3), which won't be crash-recovery consistent.

2017-10-12 20:31:24.207 104248 DEBUG cinder.backup.drivers.ceph [req-82609901-3596-4c28-905b-63cbc056c3a9 f4c2cd21cd1841f4bd87e4910291f930 729810f6d86f467082ec3fe9a70c84df - default default] Copying data from volume f8f6c0a3-b19c-43f3-965f-59945f4dc4b3. _full_backup /usr/lib/python2.7/site-packages/cinder/backup/drivers/ceph.py:712
2017-10-12 20:31:24.251 104248 DEBUG cinder.backup.drivers.ceph [req-82609901-3596-4c28-905b-63cbc056c3a9 f4c2cd21cd1841f4bd87e4910291f930 729810f6d86f467082ec3fe9a70c84df - default default] Transferring data between 'volume-f8f6c0a3-b19c-43f3-965f-59945f4dc4b3' and 'volume-f8f6c0a3-b19c-43f3-965f-59945f4dc4b3.backup.af9a04a4-7778-454c-9ca7-30c6f1dec4bb' _transfer_data /usr/lib/python2.7/site-packages/cinder/backup/drivers/ceph.py:304

(2) The 2nd option is that the ceph backup driver ignores the returned backup volume object (snap-clone) and takes a 'from-snap' against the original cinder/ceph volume. I would think this 'from-snap' is good enough for the incremental/differential backup. I don't think this goes against the principle of "scaling the backup service", but I am willing to be convinced otherwise. This option doesn't require the 'in-use' volume to be mounted on the backup service node, only the snapshot to be visible, which is how an "available" cinder/ceph volume is backed up based on my debugging observation. To my understanding, a ceph rbd volume snapshot is atomic and hence crash-recovery consistent, so a regular snapshot of an 'in-use' volume should be sufficient for crash recovery.
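For illustration only, option (2) amounts to building the export-diff command against the original volume's image with a '--from-snap' base. The helper below mirrors the shape of the rbd CLI call quoted earlier in this thread; all names and defaults are assumptions:

```python
def build_export_diff_cmd(pool, volume, to_snap, from_snap=None,
                          conf='/etc/ceph/ceph.conf', user='cinder-backup'):
    """Assemble an 'rbd export-diff' command that streams the diff to stdout."""
    cmd = ['rbd', 'export-diff', '--id', user, '--conf', conf, '--pool', pool]
    if from_snap:
        cmd += ['--from-snap', from_snap]  # base snapshot of the incremental
    cmd += ['%s@%s' % (volume, to_snap), '-']  # '-' writes the diff to stdout
    return cmd


cmd = build_export_diff_cmd('volumes', 'volume-abc', 'snap2', from_snap='snap1')
assert cmd[:2] == ['rbd', 'export-diff'] and '--from-snap' in cmd
```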

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-brick (stable/ocata)

Reviewed: https://review.openstack.org/511232
Committed: https://git.openstack.org/cgit/openstack/os-brick/commit/?id=94728e37a62622442f86866f2a805a838bec9675
Submitter: Zuul
Branch: stable/ocata

commit 94728e37a62622442f86866f2a805a838bec9675
Author: Xiaojun Liao <email address hidden>
Date: Thu Jun 22 19:54:01 2017 +0800

    Fix ceph incremental backup fail

    Cinder _rbd_diff_transfer() uses "import-diff" and "export-diff" cmdline
    to do a incremental backup, it will fail without a ceph-conf file. Delay
    to delete temporary ceph-conf file in class RBDConnector during ceph
    volume backup.

    Change-Id: Ib74c85266b8c812f7a40dac293847a28768eae9a
    Partial-Bug: #1578036
    (cherry picked from commit 1fe844efaf324ed7b9b64c21f1c3f8af9f616738)

tags: added: in-stable-ocata
Eric Harney (eharney)
Changed in cinder:
assignee: nobody → Alan Bishop (alan-bishop)
Revision history for this message
Xiaojun Liao (wwba) wrote :

Hi weiguo,

You are right, there really is a bug where a full backup is taken after the incremental backup attempt fails for an 'in-use' original volume. As for the 2nd option, I think an 'in-use' original volume would fail to create a new snapshot on the backup service node.

Revision history for this message
weiguo sun (wsun2) wrote :

Hi Xiaojun,

Can you explain why it would fail when trying to create a new snapshot on the backup service node? Does the design of "scaling the backup service" forbid the backup service node from even connecting to the ceph cluster when the original volume is in 'in-use' status? My basic understanding is that for an original volume in 'available' status, the backup service node does exactly that, i.e., it connects to the ceph cluster hosting the original volume and requests a new snapshot via "source_rbd_image.create_snap(new_snap)". So I fail to see the difference between the "in-use" and "available" states of a ceph/cinder volume when requesting a snapshot from the backup service node.

Revision history for this message
Xiaojun Liao (wwba) wrote :

Hi weiguo,

You are right; the "CEPH ISCSI GATEWAY" feature used with ceph block devices shows that an 'in-use' original volume can be attached to different client nodes.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on cinder (master)

Change abandoned by Sean McGinnis (<email address hidden>) on branch: master
Review: https://review.openstack.org/480062

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/579606

Changed in cinder:
assignee: Alan Bishop (alan-bishop) → Eric Harney (eharney)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.openstack.org/579606
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=53522164dc11a1f8316de4cf023abfa113d35704
Submitter: Zuul
Branch: master

commit 53522164dc11a1f8316de4cf023abfa113d35704
Author: Eric Harney <email address hidden>
Date: Thu Apr 26 15:18:35 2018 -0400

    Fix RBD incremental backup

    Since the default get_backup_device method doesn't
    return enough Ceph connection information, RBD
    incremental backup will fail, resulting in full backups
    being performed when an incremental is requested.

    Closes-Bug: #1578036
    Co-Author: Gorka Eguileor <email address hidden>
    Change-Id: I41d1ec84db58f4bf0f7362cec022f2c3380e5ee2

Changed in cinder:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/580675

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (stable/queens)

Reviewed: https://review.openstack.org/580675
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=10937da92e55d6ff78db3d3637c35a6d98657107
Submitter: Zuul
Branch: stable/queens

commit 10937da92e55d6ff78db3d3637c35a6d98657107
Author: Eric Harney <email address hidden>
Date: Thu Apr 26 15:18:35 2018 -0400

    Fix RBD incremental backup

    Since the default get_backup_device method doesn't
    return enough Ceph connection information, RBD
    incremental backup will fail, resulting in full backups
    being performed when an incremental is requested.

    Closes-Bug: #1578036
    Co-Author: Gorka Eguileor <email address hidden>
    Change-Id: I41d1ec84db58f4bf0f7362cec022f2c3380e5ee2
    (cherry picked from commit 53522164dc11a1f8316de4cf023abfa113d35704)
    Conflicts:
     cinder/volume/drivers/rbd.py

tags: added: in-stable-queens
Revision history for this message
Chris Martin (6-chris-z) wrote :

Thanks for fixing! Can we please have a backport to Pike?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 13.0.0.0b3

This issue was fixed in the openstack/cinder 13.0.0.0b3 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 12.0.4

This issue was fixed in the openstack/cinder 12.0.4 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/610645

Changed in os-brick:
importance: Undecided → Low
dob (nnex)
information type: Public → Public Security
Jeremy Stanley (fungi)
information type: Public Security → Public