Object replicator hang

Bug #1659712 reported by Mark Kirkwood
This bug affects 2 people

Affects: OpenStack Object Storage (swift)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

We have a 6 node swift cluster running 2.7.0, with 2 regions. Write affinity is enabled. The regions are connected via VPN WAN (one network for storage, one for replication).

We are seeing one or more object replicators frequently entering a state where:
- no more stats progress messages are logged
- they require kill -9 before they can be shut down

This looks to be similar to https://bugs.launchpad.net/swift/+bug/1575277 , but I'm not seeing any "lockup" messages in the logs, simply a lack of further progress. I do see that some rsyncs are continuing.

I also see some lock timeouts for objects - often as the last replication log message (though I've spotted others where the replicator continues happily afterwards, so this may be irrelevant):

Jan 27 13:33:19 cat-por-ostor003 object-server: ERROR __call__ error with REPLICATE /obj02/2077350 : LockTimeout (10s) /srv/node/obj02/objects/2077350/.lock
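
For context on that error: the per-partition .lock file the message names is an advisory lock the object-server takes around REPLICATE work, and a contender that cannot acquire it within the 10s limit gives up with a LockTimeout. Below is a rough standalone sketch of that pattern using plain fcntl - it is not Swift's code, and the temp directory and 1s timeout are just for the demo:

import errno
import fcntl
import os
import tempfile
import time

partition_dir = tempfile.mkdtemp()                     # stand-in for /srv/node/obj02/objects/2077350
lock_file = os.path.join(partition_dir, '.lock')

def try_lock(path, timeout):
    """Retry a non-blocking exclusive flock until `timeout` seconds pass."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT)
    deadline = time.time() + timeout
    while True:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd                                  # lock acquired
        except IOError as err:
            if err.errno not in (errno.EAGAIN, errno.EACCES):
                os.close(fd)
                raise
            if time.time() >= deadline:
                os.close(fd)
                raise RuntimeError('LockTimeout (%ss) %s' % (timeout, path))
            time.sleep(0.1)

holder = try_lock(lock_file, timeout=1)                # first caller (e.g. a long REPLICATE) holds the lock
try:
    try_lock(lock_file, timeout=1)                     # second caller gives up after the timeout
except RuntimeError as err:
    print(err)                                         # -> LockTimeout (1s) /tmp/.../.lock
finally:
    fcntl.flock(holder, fcntl.LOCK_UN)
    os.close(holder)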

I'll attach a segment of log for perusal.

We have not adjusted either the rsync or lockup timeouts. I wonder if we have specified too much concurrency for the replicator (I'll attach the config); here's a snippet (nodes have 32 hyperthreaded CPUs and 4 disks):

[DEFAULT]
...
workers = 64

[pipeline:main]
pipeline = healthcheck recon object-server

[object-replicator]
concurrency = 32

Revision history for this message
Mark Kirkwood (mark-kirkwood) wrote :

The log as promised.

Revision history for this message
Mark Kirkwood (mark-kirkwood) wrote :

I am seeing a top-level exception being raised occasionally:

Feb 2 14:45:02 cat-por-ostor003 object-server: Exception in top-level replication loop: Timeout (1800s)

Revision history for this message
Mark Kirkwood (mark-kirkwood) wrote :

I note that the replicator refuses to shut down via 'service swift-object-replicator stop'. Running strace on it shows:

$ sudo strace -p 143759
Process 143759 attached
restart_syscall(<... resuming interrupted call ...>) = 0
poll([{fd=5, events=POLLIN|POLLPRI|POLLERR|POLLHUP}], 1, 60000) = 0 (Timeout)
poll([{fd=5, events=POLLIN|POLLPRI|POLLERR|POLLHUP}], 1, 60000) = 0 (Timeout)
poll([{fd=5, events=POLLIN|POLLPRI|POLLERR|POLLHUP}], 1, 60000) = 0 (Timeout)

This suggests that we might need to increase one of the timeouts (http_timeout, I think), and/or that there is some interaction between the replicator concurrency in one region and the rsync max connections in the other (I've noticed we have the latter set too low in the target region).
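
To make that suspected interaction concrete, here is a rough back-of-the-envelope check - not a Swift tool. The node count for the pushing region and the target region's rsyncd "max connections" value are assumptions to be filled in; the config path and section name are the usual defaults:

# Rough check of the suspected interaction between replicator concurrency
# and the remote region's rsyncd "max connections" (not a Swift utility).
try:
    from configparser import RawConfigParser   # Python 3
except ImportError:
    from ConfigParser import RawConfigParser   # Python 2

cfg = RawConfigParser()
cfg.read('/etc/swift/object-server.conf')

# Replicator jobs per node (Swift's default is 1 if unset).
if cfg.has_option('object-replicator', 'concurrency'):
    concurrency = cfg.getint('object-replicator', 'concurrency')
else:
    concurrency = 1

sending_nodes = 3            # nodes pushing from the local region (assumption)
rsync_max_connections = 5    # "max connections" in the remote rsyncd.conf (assumption)

# Worst case: every replicator job in the sending region rsyncs to the same
# remote node at once.
worst_case = concurrency * sending_nodes
if worst_case > rsync_max_connections:
    print('Up to %d concurrent rsyncs can hit one remote node, but its '
          'rsyncd only admits %d connections; the rest wait or are refused.'
          % (worst_case, rsync_max_connections))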

Revision history for this message
clayg (clay-gerrard) wrote :

The attached log is showing a bunch of replication contention - but that's frankly pretty common during a large rebalance.

I don't *normally* have any problems killing swift-object-replicator with SIGTERM when it gets in a stuck state. The attached conf doesn't specify sync_method, so the default should be rsync, and the logs look like rsync. Which is fine - I don't use ssync for replicated policies either.

I don't think the timeout in that strace is indicating any particular configurable timeout; I think that's just the eventlet hub doing its poll thing. You could lsof and track down whether fd=5 is a running rsync process - or maybe the eventlet tpool socket?

Revision history for this message
Mark Kirkwood (mark-kirkwood) wrote :

To check the concurrency thing I reduced replicator concurrency to 4. It took longer, but I have a hang again.

As before:
$ sudo strace -p 143938
Process 143938 attached
restart_syscall(<... resuming interrupted call ...>) = 0
poll([{fd=5, events=POLLIN|POLLPRI|POLLERR|POLLHUP}], 1, 60000) = 0 (Timeout)
poll([{fd=5, events=POLLIN|POLLPRI|POLLERR|POLLHUP}], 1, 60000) = 0 (Timeout)
poll([{fd=5, events=POLLIN|POLLPRI|POLLERR|POLLHUP}], 1, 60000) = 0 (Timeout)
...

Looking at the fds for it:
$ for file in `sudo ls /proc/143938/fd`;do
> sudo file /proc/143938/fd/$file
> done
/proc/143938/fd/0: symbolic link to `/dev/null'
/proc/143938/fd/1: symbolic link to `/dev/null'
/proc/143938/fd/10: broken symbolic link to `anon_inode:[eventpoll]'
/proc/143938/fd/11: broken symbolic link to `pipe:[1508803856]'
/proc/143938/fd/12: broken symbolic link to `anon_inode:[eventpoll]'
/proc/143938/fd/13: broken symbolic link to `anon_inode:[eventpoll]'
/proc/143938/fd/14: broken symbolic link to `pipe:[1508655189]'
/proc/143938/fd/15: broken symbolic link to `anon_inode:[eventpoll]'
/proc/143938/fd/16: broken symbolic link to `anon_inode:[eventpoll]'
/proc/143938/fd/17: broken symbolic link to `anon_inode:[eventpoll]'
/proc/143938/fd/18: broken symbolic link to `anon_inode:[eventpoll]'
/proc/143938/fd/19: broken symbolic link to `anon_inode:[eventpoll]'
/proc/143938/fd/2: symbolic link to `/dev/null'
/proc/143938/fd/20: broken symbolic link to `anon_inode:[eventpoll]'
/proc/143938/fd/21: broken symbolic link to `anon_inode:[eventpoll]'
/proc/143938/fd/22: broken symbolic link to `anon_inode:[eventpoll]'
/proc/143938/fd/23: broken symbolic link to `anon_inode:[eventpoll]'
/proc/143938/fd/24: broken symbolic link to `anon_inode:[eventpoll]'
/proc/143938/fd/25: broken symbolic link to `anon_inode:[eventpoll]'
/proc/143938/fd/26: broken symbolic link to `anon_inode:[eventpoll]'
/proc/143938/fd/27: broken symbolic link to `anon_inode:[eventpoll]'
/proc/143938/fd/28: broken symbolic link to `anon_inode:[eventpoll]'
/proc/143938/fd/29: broken symbolic link to `pipe:[1508640614]'
/proc/143938/fd/30: broken symbolic link to `pipe:[1508660889]'
/proc/143938/fd/4: broken symbolic link to `socket:[2796838036]'
/proc/143938/fd/5: broken symbolic link to `socket:[2797025638]'
/proc/143938/fd/6: broken symbolic link to `socket:[2797025639]'
/proc/143938/fd/7: broken symbolic link to `anon_inode:[eventpoll]'
/proc/143938/fd/8: broken symbolic link to `anon_inode:[eventpoll]'
/proc/143938/fd/9: broken symbolic link to `anon_inode:[eventpoll]'

Checking out the fd 5 socket:
$ sudo lsof|grep swift|grep 2797025638
swift-obj 143938 swift 5u IPv4 2797025638 0t0 TCP localhost.localdomain:48358->localhost.localdomain:38256 (ESTABLISHED)
swift-obj 143938 143968 swift 5u IPv4 2797025638 0t0 TCP localhost.localdomain:48358->localhost.localdomain:38256 (ESTABLISHED)
swift-obj 143938 143969 swift 5u IPv4 2797025638 0t0 TCP localhost.localdomain:48358->localhost.localdomain:38256 (ESTABLISHED)
swift-obj 143938 143970 swift 5u IPv4 2797025638 0t0...


Revision history for this message
Mark Kirkwood (mark-kirkwood) wrote :

...and I should have added:
$ ps auxww|grep 143938
markir 113603 0.0 0.0 10444 2180 pts/0 S+ 14:34 0:00 grep 143938
swift 143938 4.8 3.7 3975872 2448772 ? Ssl Feb10 362:53 /opt/cat/openstack/swift/bin/python /usr/bin/swift-object-replicator /etc/swift/object-server.conf

Revision history for this message
Mark Kirkwood (mark-kirkwood) wrote :

We have made some changes:
- reducing object replicator and updater concurrency to 4
- increasing the rsync and lockup timeouts

So far (10 days) no hanging replicators...

New config is:

$ cat /etc/swift/object-server.conf
...
[object-replicator]
log_name = object-replicator
concurrency = 4
stats_interval = 30
rsync_timeout = 1800
lockup_timeout = 3600

[object-updater]
log_name = object-updater
concurrency = 4
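
For reference, here is a rough standalone sketch of what those two settings bound (this is not Swift's implementation; the rsync invocation and the plain threading are only illustrative): rsync_timeout caps a single partition transfer, while lockup_timeout is a watchdog on overall progress.

# Standalone sketch (not Swift's implementation) of what the two knobs bound.
import subprocess
import threading
import time

RSYNC_TIMEOUT = 1800     # seconds one rsync may run (rsync_timeout above)
LOCKUP_TIMEOUT = 3600    # seconds without any completed partition (lockup_timeout above)

progress = {'last_change': time.time()}

def replicate_partition(rsync_args):
    """Run one rsync and kill it if it exceeds RSYNC_TIMEOUT."""
    proc = subprocess.Popen(['rsync'] + rsync_args)
    killer = threading.Timer(RSYNC_TIMEOUT, proc.kill)
    killer.start()
    try:
        proc.wait()
    finally:
        killer.cancel()
    progress['last_change'] = time.time()   # a partition finished: progress was made

def lockup_watchdog():
    """Complain if nothing has finished for LOCKUP_TIMEOUT seconds."""
    while True:
        time.sleep(60)
        idle = time.time() - progress['last_change']
        if idle > LOCKUP_TIMEOUT:
            print('Lockup detected: no progress for %d seconds' % idle)

if __name__ == '__main__':
    watchdog = threading.Thread(target=lockup_watchdog)
    watchdog.daemon = True
    watchdog.start()
    # ... each replication job would call replicate_partition([...]) here ...

Swift's replicator does roughly the same kind of thing with eventlet coroutines, and it is that watchdog which emits the "lockup" messages mentioned earlier when it fires.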
