Cinder-Backup stuck in creating after RBD export-diff existed and turned into zombie / <defunct>

Bug #2031897 reported by Christian Rohmann
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Cinder
Fix Released
Medium
Unassigned

Bug Description

I just observed a volume backup (RBD driver) which is in creating for DAYS already.
I went to the host running cinder-backup and looked at the RBD processes there:

```
[...]
cinder 3396548 1 3396548 3396548 0 Jul31 ? 03:40:51 /usr/bin/python3 /usr/bin/cinder-backup --config-file=/etc/cinder/cinder.conf --log-file=/var/log/cinder/cinder-backup.log
cinder 3396576 3396548 3396548 3396548 0 Jul31 ? 01:05:29 /usr/bin/python3 /usr/bin/cinder-backup --config-file=/etc/cinder/cinder.conf --log-file=/var/log/cinder/cinder-backup.log
cinder 3098464 3396576 3396548 3396548 0 Aug13 ? 00:00:01 [rbd] <defunct>
cinder 3098465 3396576 3396548 3396548 0 Aug13 ? 00:01:49 rbd import-diff --id cinder-backup --conf /etc/ceph/backup.conf --pool backups - backups/volume-fcc9390e-1b6d-48f1-858f-5ed6557cf7ec.backup.54fcb969-87e7-4594-becc-bd8c855a1cbf
root 3398716 1 3396548 3396548 0 Jul31 ? 00:01:14 /usr/bin/python3 /bin/privsep-helper --config-file /etc/cinder/cinder.conf --privsep_context os_brick.privileged.default --privsep_sock_path /tmp/tmpssrnnwgk/privsep.sock
[...]
```

Looking at the piece of code doing the piping of rbd export-diff into rbd import-idff at https://opendev.org/openstack/cinder/src/commit/3b9cc13a94668018d8e0aec2dc997262c9c654a8/cinder/backup/drivers/ceph.py#L685, it seems that this case is not handled properly.

Achieving something the likes of "-o pipefail" seems to be a little more challenging.
See e.g.

* https://stackoverflow.com/questions/68228199/make-a-shell-pipeline-started-from-subprocess-popen-fail-if-the-left-hand-side-o

summary: Cinder-Backup stuck in creating after RBD export-diff existed and turned
- into zombie / <defunct> properly
+ into zombie / <defunct>
Eric Harney (eharney)
tags: added: backup ceph drivers rbd
Changed in cinder:
importance: Undecided → Medium
Revision history for this message
Tony Liu (tonyliu0592) wrote :

Could you run ps to check if that rbd is zombie? And parent process may also be stuck.
Typically, rbd will return, success or fail, but in this case rbd is stuck. Need to find out how.
"-o pipefail" only affects the rc of whole command. I don't think it will help in this case,
where rbd doesn't return at all.

Revision history for this message
Tony Liu (tonyliu0592) wrote :

I had the same issue and figured out it's a bug in parent process, which caused infinite loop, which leaves sub-process zombie. Once that bug was fixed, zombie rbd doesn't happen any more.

Revision history for this message
Christian Rohmann (christian-rohmann) wrote :

Tony,

1) If you read my initial bug report, there you'll find the ps output clearly showing that the rbd export-diff (left-hand side cmd) was defunct. This was due to this command exiting without causing the cmd2 (rbd import-diff) to fail. This case is simply not handled properly in the current code.

2) You are referring to a bug in the "parent process", so I suspect cinder-backup? Could you please share a reference to that bug?

Revision history for this message
Simon Hensel (shensel) wrote :

Christian's observation is right, the pipeline gets set up and both processes get started, but the code never looks at the returncode of the first process (p1) and only returns the returncode of the second process (p2): https://opendev.org/openstack/cinder/src/commit/3b9cc13a94668018d8e0aec2dc997262c9c654a8/cinder/backup/drivers/ceph.py#L616

A simple solution would be to check that p1 actually exited (whether successful or because of a timeout) and then stop the pipeline to avoid defunct processes. The subprocess module offers the wait() function for this: https://docs.python.org/3/library/subprocess.html#subprocess.Popen.wait

Revision history for this message
Tony Liu (tonyliu0592) wrote :

I'm not using Cinder for this.
Here is my code.

    p = await asyncio.create_subprocess_shell(cmd,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE)
    stdout, stderr = await p.communicate()
    rc = p.returncode

When cmd is "p1 | p2", and p1 exits error, rc will be that error.
It works fine for me.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/cinder/+/897245

Changed in cinder:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/cinder/+/899488

Revision history for this message
Pete Zaitcev (zaitcev) wrote :

Using non-shell Popen is more secure. I considered an equivalent of what Tony proposed, but I think we don't have to go that far.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on cinder (master)

Change abandoned by "Pete Zaitcev <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/cinder/+/899488
Reason: Squashed into https://review.opendev.org/c/openstack/cinder/+/897245

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.opendev.org/c/openstack/cinder/+/897245
Committed: https://opendev.org/openstack/cinder/commit/f8e13a883d961a4a5ba1cf6f557a96ebe6f540ad
Submitter: "Zuul (22348)"
Branch: master

commit f8e13a883d961a4a5ba1cf6f557a96ebe6f540ad
Author: Jan Hartkopf <email address hidden>
Date: Tue Jan 16 15:34:55 2024 +0100

    Ceph: Catch more failure conditions on volume backup

    This fixes issues for volume backups with the Ceph driver
    where failures of the first process ("rbd export-diff") were
    not caught. Instead, only the return code of the second
    process ("rbd import-diff") was recognized.

    This change also preserves the stderr that was lost previously
    in order to ease debugging.

    Closes-Bug: 2031897
    Co-Authored-By: Pete Zaitcev <email address hidden>
    Change-Id: I53b573bfff64e7460ef34f1355d3a9d52a8879f9
    Signed-off-by: Jan Hartkopf <email address hidden>

Changed in cinder:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (stable/2024.1)

Fix proposed to branch: stable/2024.1
Review: https://review.opendev.org/c/openstack/cinder/+/916764

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (stable/2024.1)

Reviewed: https://review.opendev.org/c/openstack/cinder/+/916764
Committed: https://opendev.org/openstack/cinder/commit/52f885b7770d048cd48a4fecedff88bf353ff15f
Submitter: "Zuul (22348)"
Branch: stable/2024.1

commit 52f885b7770d048cd48a4fecedff88bf353ff15f
Author: Jan Hartkopf <email address hidden>
Date: Tue Jan 16 15:34:55 2024 +0100

    Ceph: Catch more failure conditions on volume backup

    This fixes issues for volume backups with the Ceph driver
    where failures of the first process ("rbd export-diff") were
    not caught. Instead, only the return code of the second
    process ("rbd import-diff") was recognized.

    This change also preserves the stderr that was lost previously
    in order to ease debugging.

    Closes-Bug: 2031897
    Co-Authored-By: Pete Zaitcev <email address hidden>
    Change-Id: I53b573bfff64e7460ef34f1355d3a9d52a8879f9
    Signed-off-by: Jan Hartkopf <email address hidden>
    (cherry picked from commit f8e13a883d961a4a5ba1cf6f557a96ebe6f540ad)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 24.1.0

This issue was fixed in the openstack/cinder 24.1.0 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.