swift-object-replicator "Killing long-running rsync" fails to actually kill rsync, causing a replication backlog

Bug #1691570 reported by Drew Freiberger
This bug affects 7 people
Affects                           Status        Importance  Assigned to      Milestone
OpenStack Object Storage (swift)  New           Undecided   Unassigned
OpenStack Swift Storage Charm     Fix Released  Medium      Drew Freiberger

Bug Description

In an environment with 3 swift-storage zones, 3 hosts per zone, 20 x 1TB disk drives per host, and roughly 500 million objects, we're finding that object replication hangs on one or more of the nodes.

Digging into it more deeply, I find hung rsync processes which appear to be stuck trying to write to the pipe opened by swift-object-replicator for IPC. See the listings below, and the sketch that follows them:

ubuntu@myhost-cs-011:~$ ps -ef |grep 2993422|cut -c1-230
ubuntu 1987011 1977905 0 20:31 pts/0 00:00:00 grep --color=auto 2993422
swift 2993422 1 0 May11 ? 00:04:06 /usr/bin/python /usr/bin/swift-object-replicator /etc/swift/object-server.conf
swift 2993891 2993422 0 May11 ? 00:14:29 [rsync] <defunct>
swift 2993898 2993422 0 May11 ? 00:15:04 rsync --recursive --whole-file --human-readable --xattrs --itemize-changes --ignore-existing --timeout=30 --contimeout=30 --bwlimit=0 /srv/node/bcache11/objects/250/b3e /srv/node.....
*snip*
swift 3000030 2993422 0 May11 ? 00:19:05 rsync --recursive --whole-file --human-readable --xattrs --itemize-changes --ignore-existing --timeout=30 --contimeout=30 --bwlimit=0 /srv/node/bcache1/objects/226/000 /srv/node/.....

ubuntu@myhost-cs-011:~$ sudo strace -p 2993898
Process 2993898 attached
write(1, "<f+++++++++ 194/fa85c4fdeac6158d"..., 71^CProcess 2993898 detached
 <detached ...>

ubuntu@myhost-cs-011:~$ sudo ls -l /proc/2993898/fd/1
l-wx------ 1 swift swift 64 May 17 20:16 /proc/2993898/fd/1 -> pipe:[70538904]
ubuntu@myhost-cs-011:~$ sudo lsof |grep 70538904|grep -v rsync
swift-obj 2993422 swift 42r FIFO 0,10 0t0 70538904 pipe
swift-obj 2993422 2993435 swift 42r FIFO 0,10 0t0 70538904 pipe
swift-obj 2993422 2993436 swift 42r FIFO 0,10 0t0 70538904 pipe
swift-obj 2993422 2993437 swift 42r FIFO 0,10 0t0 70538904 pipe
*snip*
swift-obj 2993422 2993453 swift 42r FIFO 0,10 0t0 70538904 pipe
swift-obj 2993422 2993454 swift 42r FIFO 0,10 0t0 70538904 pipe

A snippet from syslog appears to show swift-object-replicator trying to kill these processes:

/var/log/syslog.7.gz:May 11 06:04:52 myhost-cs-011 object-replicator: Killing long-running rsync: ['rsync', '--recursive', '--whole-file', '--human-readable', '--xattrs', '--itemize-changes', '--ignore-existing', '--timeout=30', '--contimeout=30', '--bwlimit=0', u'/srv/node/bcache11/objects/250/b3e', u'/srv/node/bcache11/objects/250/000', u'/srv/node/bcache11/objects/250/001', u'/srv/node/bcache11/objects/250/002', u'/srv/node/bcache11/objects/250/003', u'/srv/node/bcache11/objects/250/004', u'/srv/node/bcache11/objects/250/005', u'/srv/node/bcache11/objects/250/006', u'/srv/node/bcache11/objects/250/007', u'/srv/node/bcache11/objects/250/008', u'/srv/node/bcache11/objects/250/c5d', u'/srv/node/bcache11/objects/250/009', u'/srv/node/bcache11/objects/250/00a', u'/srv/node/bcache11/objects/250/00b', u'/srv/node/bcache11/objects/250/00c', u'/srv/node/bcache11/objects/250/00d', u'/srv/node/bcache11/objects/250/00e', u'/srv/node/bcache11/objects/250/00f', u'/srv/node/bcache11/objects/250/010', u'/srv/node/bcache11/objects/250/011', u'/srv/node/bcache11/objects/250/012', u'/srv/node/bcache11/objects/250/013', u'/srv/node/bcache11/objects/250/014', u'/srv/node/bcache11/objects/250/015', u'/srv/node/bcache11/objects/250/016', u'/srv/node/bcache11/objects/250/017', u'/srv/node/bcache11/objects/250/018', u'/srv/node/bcache11/objects/250/019', u'/srv/node/bcache11/objects/250/01a', u'/srv/node/bcache11/objects/250/01b', u'/srv/node/bcache11/objects/250/01c', u'/srv/node/bcache11/objects/250/01d', u'/srv/node/bcache11/objects/250/01e', u'/srv/node/bcache11/objects/250/01f', u'/srv/node/bcache11/objects/250/020', u'/srv/node/bcache11/objects/250/021', u'/srv/node/bcache11/objects/250/022', u'/srv/node/bcache11/objects/250/023', u'/srv/node/bcache11/objects/250/024', u'/srv/node/bcache11/objects/250/025', u'/srv/node/bcache11/objects/250/026', u'/srv/node/bcache11/objects/250/027', u'/srv/node/bcache11/objects/250/028', u'/srv/node/bcache11/objects/250/029', u'/srv/node/bcache11/objects/250/02a', u'/srv/node/bcache11/objects/250/02b', u'/srv/node/bcache11/objects/250/02c', u'/srv/node/bcache11/objects/250/02d', u'/srv/node/bcache11/objects/250/02e', u'/srv/node/bcache11/objects/250/02f', u'/srv/node/bcache11/objects/250/030', u'/srv/node/bcache11/objects/250/031', u'/srv/node/bcache11/objects/250/032', u'/srv/node/bcache11/objects/250/033', u'/srv/node/bcache11/objects/250/034', u'/srv/node/bcache11/objects/250/035', u'/srv/node/bcache11/objects/250/036', u'/srv/node/bcache11/objects/250/037', u'/srv/node/bcache11/objects/250/038', u'/srv/node/bcache11/objects/250/289', u'/srv/node/bcache11/objects/250/039', u'/srv/node/bcache11/objects/250/03a', u'/srv/node/bcache11/objects/250/03b', u'/srv/node/bcache11/objects/250/03c', u'/srv/node/bcache11/objects/250/03d', u'/srv/node/bcache11/objects/250/03e', u'/srv/node/bcache11/objects/250/03f', u'/srv/node/bcache11/objects/250/040', u'/srv/node/bcache11/objects/250/041', u'/srv/node/bcache11/objects/250/042', u'/srv/node/bcache11/objects/250/043', u'/srv/node/bcache11/objects/250/044', u'/srv/node/bcache11/objects/250/045', u'/srv/node/bcache11/objects/250/046', u'/srv/node/bcache11/objects/250/047', u'/srv/node/bcache11/objects/250/048', u'/srv/node/bcache11/objects/250/049', u'/srv/node/bcache11/objects/250/04a', u'/srv/node/bcache11/objects/250/04b', u'/srv/node/bcache11/objects/250/04c', u'/srv/node/bcache11/objects/250/04d', u'/srv/node/bcache11/objects/250/04e', u'/srv/node/bcache11/objects/250/04f', u'/srv/node/bcache11/objects/250/050', 
u'/srv/node/bcache11/objects/250/051', u'/srv/node/bcache11/objects/250/052', u'/srv/node/bcache11/objects/250/053', u'/srv/node/bcache11/objects/250/054', u'/srv/node/bcache11/objects/250/055', u'/srv/node/bcache11/objects/250/056', u'/srv/node/bcache11/objects/250/057', u'/srv/node/bcache11/objects/250/058', u'/srv/node/bcache11/objects/250/059', u'/srv/node/bcache11/objects/250/05a', u'/srv/node/bcache11/objects/250/05b', u'/srv/node/bcache11/objects/250/05c', u'/srv/node/bcache11/objects/250/05d', u'/srv/node/bcache11/objects/250/05e', u'/srv/node/bcache11/objects/250/05f', u'/srv/node/bcache11/objects/250/060', u'/srv/node/bcache11/objects/250/061', u'/srv/node/bcache11/objects/250/062', u'/srv/node/bcache11/objects/250/063', u'/srv/node/bcache11/objects/250/064', u'/srv/node/bcache11/objects/250/065', u'/srv/node/bcache11/objects/250/066', u'/srv/node/bcache11/objects/250/067', u'/srv/node/bcache11/objects/250/068', u'/srv/node/bcache11/objects/250/069', u'/srv/node/bcache11/objects/250/06a', u'/srv/node/bcache11/objects/250/06b', u'/srv/node/bcache11/objects/250/06c', u'/srv/node/bcache11/objects/250/06d', u'/srv/node/bcache11/objects/250/06e', u'/srv/node/bcache11/objects/250/06f', u'/srv/node/bcache11/objects/250/070', u'/srv/node/bcache11/objects/250/071', u'/srv/node/bcache11/objects/250/072', u'/srv/node/bcache11/objects/250/073', u'/srv/node/bcache11/objects/250/074', u'/srv/node/bcache11/objects/250/075', u'/srv/node/bcache11/objects/250/076', u'/srv/node/bcache11/objects/250/077', u'/srv/node/bcache11/objects/250/078', u'/srv/node/bcache11/objects/250/079', u'/srv/node/bcache11/objects/250/07a', u'/srv/node/bcache11/objects/250/07b', u'/srv/node/bcache11/objects/250/07c', u'/srv/node/bcache11/objects/250/07d', u'/srv/node/bcache11/objects/250/07e', u'/srv/node/bcache11/objects/250/07f', u'/srv/node/bcache11/objects/250/080', u'/srv/node/bcache11/objects/250/081', u'/srv/node/bcache11/objects/250/082', u'/srv/node/bcache11/objects/250/083', u'/srv/node/bcache11/objects/250/084', u'/srv/node/bcache11/objects/250/085', u'/srv/node/bcache11/objects/250/086', u'/srv/node/bcache11/objects/250/087', u'/srv/node/bcache11/objects/250/088', u'/srv/node/bcache11/objects/250/089', u'/srv/node/bcache11/objects/250/08a', u'/srv/node/bcache11/objects/250/08b', u'/srv/node/bcache11/objects/250/08c', u'/srv/node/bcache11/objects/250/08d', u'/srv/node/bcache11/objects/250/08e', u'/srv/node/bcache11/objects/250/08f', u'/srv/node/bcache11/objects/250/090', u'/srv/node/bcache11/objects/250/091', u'/srv/node/bcache11/objects/250/092', u'/srv/node/bcache11/objects/250/093', u'/srv/node/bcache11/objects/250/094', u'/srv/node/bcache11/objects/250/095', u'/srv/node/bcache11/objects/250/096', u'/srv/node/bcache11/objects/250/097', u'/srv/node/bcache11/objects/250/098', u'/srv/node/bcache11/objects/250/099', u'/srv/node/bcache11/objects/250/09a', u'/srv/node/bcache11/objects/250/09b', u'/srv/node/bcache11/objects/250/09c', u'/srv/node/bcache11/objects/250/09d', u'/srv/node/bcache11/objects/250/09e', u'/srv/node/bcache11/objects/250/09f', u'/srv/node/bcache11/objects/250/0a0', u'/srv/node/bcache11/objects/250/0a1', u'/srv/node/bcache11/objects/250/0a2', u'/srv/node/bcache11/objects/250/0a3', u'/srv/node/bcache11/objects/250/0a4', u'/srv/node/bcache11/objects/250/0a5', u'/srv/node/bcache11/objects/250/0a6', u'/srv/node/bcache11/objects/250/0a7', u'/srv/node/bcache11/objects/250/0a8', u'/srv/node/bcache11/objects/250/0a9', u'/srv/node/bcache11/objects/250/0aa', u'/srv/node/bcache11/objects/250/0ab', 
u'/srv/node/bcache11/objects/250/0ac', u'/srv/node/bcache11/objects/250/0ad', u'/srv/node/bcache11/objects/250/0ae', u'/srv/node/bcache11/objects/250/0af', u'/srv/node/bcache11/objects/250/0b0', u'/srv/node/bcache11/objects/250/0b1', u'/srv/node/bcache11/objects/250/0b2', u'/srv/node/bcache11/objects/250/0b3', u'/srv/node/bcache11/objects/250/0b4', u'/srv/node/bcache11/objects/250/0b5', u'/srv/node/bcache11/objects/250/0b6', u'/srv/node/bcache11/objects/250/0b7', u'/srv/node/bcache11/objects/250/0b8', u'/srv/node/bcache11/objects/250/0b9', u'/srv/node/bcache11/objects/250/0ba', u'/srv/node/bcache11/objects/250/0bb', u'/srv/node/bcache11/objects/250/0bc', u'/srv/node/bcache11/objects/250/0bd', u'/srv/node/bcache11/objects/250/0be', u'/srv/node/bcache11/objects/250/0bf', u'/srv/node/bcache11/objects/250/0c0', u'/srv/node/bcache11/objects/250/0c1', u'/srv/node/bcache11/objects/250/0c2', u'/srv/node/bcache11/objects/250/0c3', u'/srv/node/bcache11/objects/250/0c4', u'/srv/node/bcache11/objects/250/0c5', u'/srv/node/bcache11/objects/250/0c6', u
/var/log/syslog.7.gz:May 11 06:05:02 myhost-cs-011 object-replicator: STDERR: Traceback (most recent call last):
/var/log/syslog.7.gz:May 11 06:05:02 myhost-cs-011 object-replicator: STDERR: File "/usr/lib/python2.7/logging/handlers.py", line 860, in emit
/var/log/syslog.7.gz:May 11 06:05:02 myhost-cs-011 object-replicator: STDERR: self.socket.send(msg)
/var/log/syslog.7.gz:May 11 06:05:02 myhost-cs-011 object-replicator: STDERR: error: [Errno 105] No buffer space available
/var/log/syslog.7.gz:May 11 06:05:02 myhost-cs-011 object-replicator: STDERR: Logged from file replicator.py, line 185
/var/log/syslog.7.gz:May 11 06:07:05 myhost-cs-011 object-replicator: 118/118 (100.00%) partitions replicated in 3300.93s (0.04/sec, 0s remaining)
/var/log/syslog.7.gz:May 11 06:07:05 myhost-cs-011 object-replicator: 0 successes, 3865 failures

Somehow the kill signal doesn't appear to actually end these processes; this host has 28 such rsync processes stuck.

Full logs available at:
https://pastebin.canonical.com/188526/
https://pastebin.canonical.com/188527/

Trusty/Mitaka

ubuntu@myhost-cs-011:~$ dpkg -l|grep swift-object
ii swift-object 2.7.0-0ubuntu2~cloud0 all distributed virtual object store - object server
ubuntu@myhost-cs-011:~$ swift-recon -r
===============================================================================
--> Starting reconnaissance on 9 hosts
===============================================================================
[2017-05-17 21:18:05] Checking on replication
[replication_failure] low: 180, high: 3689, avg: 569.9, total: 5129, Failed: 0.0%, no_result: 0, reported: 9
[replication_success] low: 0, high: 7, avg: 3.2, total: 29, Failed: 0.0%, no_result: 0, reported: 9
[replication_time] low: 30, high: 90, avg: 61.5, total: 553, Failed: 0.0%, no_result: 0, reported: 9
[replication_attempted] low: 28, high: 118, avg: 41.6, total: 374, Failed: 0.0%, no_result: 0, reported: 9
Oldest completion was 2017-05-09 21:29:09 (7 days ago) by 10.101.140.59:6000.
Most recent completion was 2017-05-17 21:00:25 (17 minutes ago) by 10.101.140.54:6000.
===============================================================================
ubuntu@myhost-cs-011:~$ uname -a
Linux myhost-cs-011 4.4.0-51-generic #72~14.04.1-Ubuntu SMP Thu Nov 24 19:22:30 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

ubuntu@myhost-cs-011:~$ cat /etc/os-release
NAME="Ubuntu"
VERSION="14.04.5 LTS, Trusty Tahr"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 14.04.5 LTS"
VERSION_ID="14.04"

I believe the charm version is 17.02.

I see there's a swift 2.7.1 available for trusty. Would this have a fix for this issue?

Revision history for this message
Drew Freiberger (afreiberger) wrote :

FYI, after updating the above server to swift 2.7.1, we are still seeing exactly the same issue.

Revision history for this message
Drew Freiberger (afreiberger) wrote :

I believe this issue was the result of processes hanging on non-interruptible storage hardware errors.

Revision history for this message
Drew Freiberger (afreiberger) wrote :

After updating to a version of the charm that can tune the rsync timeout, we raised the timeout to more than 12 hours, and replication is now able to complete on many hosts. However, some rsync processes are still hanging or timing out on the server side, which leaves swift-object-replicator stuck polling the file handles for the rsync pipes and interrupts its every-five-minutes log dump of current replication status. This shows up after "lockup detected, killing coros..." appears in the logs (the relevant config settings are shown below).
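
The timeouts involved here live in the [object-replicator] section of /etc/swift/object-server.conf; the snippet below only illustrates raising the rsync timeout as described above (values illustrative; if memory serves, the stock defaults are 900s for rsync_timeout and 1800s for lockup_timeout):

    [object-replicator]
    # per-rsync wall-clock limit; raised to >12 hours as described above
    rsync_timeout = 43200
    # how long the replicator tolerates no coroutine progress before the
    # lockup detector kicks in and kills the running coros
    lockup_timeout = 1800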

Revision history for this message
Drew Freiberger (afreiberger) wrote :

I have a theory that the threading employed in swift-object-replicator may not be properly waiting for termination of the rsync processes when they are stuck in uninterruptible I/O, as is often the case in the environment we are having issues with. I believe this is because the many writes arriving from other nodes over the inbound rsync connections are effectively denying service to the outbound rsync traffic at the disk controller. I think the "killing coros" section of code may need some improvement to ensure the rsyncs are actually culled before it returns. We are ending up with a lot of zombie rsync processes under the swift-object-replicator parent; see the diagnostic below.
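
A quick way to see both symptoms at once (the hung and the defunct children) is to list the replicator's rsync children together with their process state, e.g. against the parent PID from the listing above:

ubuntu@myhost-cs-011:~$ ps -o pid,stat,wchan:24,etime,cmd --ppid 2993422

In the STAT column, "D" is uninterruptible I/O wait (SIGKILL delivery is deferred until the I/O returns) and "Z" is a defunct/zombie child that has exited but has not yet been reaped by the replicator.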

Revision history for this message
James Page (james-page) wrote :

I've marked this as medium priority for the charm; however, I'm not sure we've really gotten to the bottom of the actual issue yet - this might be a swift bug, rather than something the charm is doing with regard to the configuration of swift.

Or our 7-year-old swift/rsync config may be way out of date.

Changed in charm-swift-storage:
importance: Undecided → Medium
Revision history for this message
James Page (james-page) wrote :

(but the docs would indicate we're still pretty good on alignment)

Changed in charm-swift-storage:
status: New → Incomplete
Revision history for this message
Drew Freiberger (afreiberger) wrote :

It appears that this is an upstream swift issue. I fear that the greenthread coro-reaping process used in swift fails to account for processes that don't die because they are stuck in uninterruptible I/O wait (D state).

Revision history for this message
clayg (clay-gerrard) wrote :

yes! that! I think if the "watcher" was moved into a separate OS thread it'd have a better chance of detecting the lockup. But to do anything about it we'd need to be able to respawn a *process* - so the control needs to be moved *way* up. The "workers" design [1] recently added to the reconstructor is ripe to be ported/extended/unified with the replicator and could help with this problem in two ways (rough config note after the list):

1) the workers option is designed to isolate processes to a subset of disks - a dead/hung worker will only prevent progress on a smaller subset of devices (the bad one(s)!)
2) the workers are isolated processes, and there's a controller process that is able to detect when they die and restart them - if a watchdog decides one isn't making progress any more it could just shoot it in the head and the framework would automatically start up a new one in its place!
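
For reference, the reconstructor-side option clayg is describing is, as far as I recall, reconstructor_workers (later swift releases added an analogous replicator_workers option for the replicator); treat the snippet below as an approximate illustration rather than a drop-in config:

    [object-reconstructor]
    # split the work across N isolated worker processes, each handling a
    # subset of the local disks, supervised by a controller process
    reconstructor_workers = 4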

Revision history for this message
Drew Freiberger (afreiberger) wrote :

james-page:

The only thing I can think of for the actual charm in this bug would be sane defaults or "required config" with no default.

Unfortunately, this specific client's use case is very odd, but the ultimate issue is that, given the size of the build (9 hosts, 3 replicas, 20 disks each), it should never have had a partition-power of 8 (see the rough sizing math below).
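
For context, the usual sizing rule of thumb (target on the order of 100 partitions per disk at the cluster's eventual size, then round the base-2 log up) lands well above 8 for this build. A rough back-of-the-envelope check, assuming 100 partitions per disk:

    import math

    total_disks = 9 * 20                   # hosts * disks per host = 180
    target_partitions = total_disks * 100  # ~100 partitions/disk rule of thumb
    print(math.ceil(math.log(target_partitions, 2)))  # 15 -> 2**15 = 32768 partitions

With part-power 8 there are only 2**8 = 256 partitions in total, so each one carries a huge share of the ~500 million objects, which is consistent with the very long rsync argument lists in the log above.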

I'm not sure whether you could set that to auto and require num_machines and num_disks to be populated, or find some other way to safeguard against not thinking about a setting that can only be set once for a given swift installation.

Maybe make it something set by an action like "initialize rings" that takes in non-defaulted arguments only?

Changed in charm-swift-storage:
status: Incomplete → In Progress
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Hi Drew, I don't see any patches pending for this bug, so as it's been a few months I will mark this back to New. If you have a patch pending or landed for this LP, please go ahead and paste a link. Thanks.

Changed in charm-swift-storage:
status: In Progress → New
Revision history for this message
David Ames (thedac) wrote :

For the charm, at a minimum, let's create a sanity check on the replication configuration.

Partition-power defaults to 8. Predicting the correct default for all situations seems nearly impossible. Ultimately, the value is configurable and we can't stop people from failing to adjust it.

But since we have seen this fail in the wild, let's explore Drew's suggestion of making ring initialization an action that can be performed post-deployment. This would force consideration of the implications of all pertinent replication settings.

Another option is to remove the defaults from the replication configuration settings, making the charm go into a blocked state until they are set, which would also force consideration of their implications. (A rough, purely illustrative sketch of such a sanity check follows.)
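
Purely as an illustration of what such a sanity check could look like (hypothetical sketch; none of these names exist in the charm), the idea is to compare the configured partition power against the expected disk count before the rings are first built and refuse to proceed if it is clearly too small:

    import math

    # Hypothetical sketch only -- not actual charm code.
    def check_partition_power(part_power, num_hosts, disks_per_host,
                              parts_per_disk=100):
        """Return a (status, message) pair for the charm/action to act on."""
        total_disks = num_hosts * disks_per_host
        recommended = int(math.ceil(math.log(total_disks * parts_per_disk, 2)))
        if part_power < recommended:
            return ('blocked',
                    'partition-power %d is too small for %d disks; '
                    'recommend at least %d'
                    % (part_power, total_disks, recommended))
        return ('ready', '')

    # e.g. check_partition_power(8, 9, 20)
    #   -> ('blocked', 'partition-power 8 is too small for 180 disks; recommend at least 15')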

Changed in charm-swift-storage:
status: New → Triaged
milestone: none → 19.04
David Ames (thedac)
Changed in charm-swift-storage:
milestone: 19.04 → 19.07
David Ames (thedac)
Changed in charm-swift-storage:
milestone: 19.07 → 19.10
David Ames (thedac)
Changed in charm-swift-storage:
milestone: 19.10 → 20.01
James Page (james-page)
Changed in charm-swift-storage:
milestone: 20.01 → 20.05
Changed in charm-swift-storage:
assignee: nobody → Drew Freiberger (afreiberger)
Revision history for this message
Drew Freiberger (afreiberger) wrote :

For the swift-storage charm, I've added an NRPE monitoring check that validates that swift-object-replicator is actively logging to syslog. The symptom of this bug is that swift-object-replicator stops emitting its "replicated X of Y partitions (X%)" messages every five minutes. This monitor's alert will allow cloud operators to identify and restart hung swift-object-replicator processes; the idea is sketched after the review link below.

https://review.opendev.org/#/c/700444/
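
The gist of the check is sketched below for illustration only; the review above contains the actual implementation, and the log path and threshold here are assumptions:

    #!/usr/bin/env python
    # Illustrative sketch of the check's idea, not the merged plugin:
    # alert if syslog shows no object-replicator "replicated" progress
    # line within the last 15 minutes.
    import sys
    from datetime import datetime

    SYSLOG = '/var/log/syslog'   # assumed log path
    MAX_AGE = 15 * 60            # seconds

    def main():
        now = datetime.now()
        newest = None
        with open(SYSLOG) as log:
            for line in log:
                if 'object-replicator' not in line or 'replicated' not in line:
                    continue
                # classic syslog prefix: "May 11 06:07:05 myhost-cs-011 ..."
                stamp = ' '.join(line.split()[:3])
                try:
                    # syslog omits the year; assume the current one
                    newest = datetime.strptime(
                        stamp, '%b %d %H:%M:%S').replace(year=now.year)
                except ValueError:
                    continue
        if newest is None or (now - newest).total_seconds() > MAX_AGE:
            print('CRITICAL: no object-replicator progress logged '
                  'in the last 15 minutes')
            return 2
        print('OK: last replication progress line at %s' % newest)
        return 0

    if __name__ == '__main__':
        sys.exit(main())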

Changed in charm-swift-storage:
status: Triaged → In Progress
Revision history for this message
Andrea Ieri (aieri) wrote :

Subscribing field-medium since we need the NRPE check urgently for a customer.
We have a merge proposal ready for it; we only need a review and a merge.

David Ames (thedac)
Changed in charm-swift-storage:
milestone: 20.05 → 20.08
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-swift-storage (master)

Reviewed: https://review.opendev.org/700444
Committed: https://git.openstack.org/cgit/openstack/charm-swift-storage/commit/?id=74daa465d626f04f30685f46025fbd6799be2657
Submitter: Zuul
Branch: master

commit 74daa465d626f04f30685f46025fbd6799be2657
Author: Drew Freiberger <email address hidden>
Date: Mon Dec 23 13:18:56 2019 -0600

    Add monitoring for object-replicator logs disappearing

    As noted in lp#1691570, there is an issue with storage I/O during
    coro-thread cleanup that affects the swift-object-replicator's ability
    to complete replication successfully. This is most easily witnessed by
    the lack of the every 5 minute replicated percentage complete messages
    that should come from the swift-object-replicator daemon to syslog.
    This patch monitors for and alerts on the condition of the "replicated"
    line missing from syslog within the past 15 minutes.

    Change-Id: Ieb15da3f3f67fa9bcad03151e36c70faae4c36c9
    Closes-Bug: 1691570

Changed in charm-swift-storage:
status: In Progress → Fix Committed
Changed in charm-swift-storage:
status: Fix Committed → Fix Released