Object replicator hang
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Object Storage (swift) | New | Undecided | Unassigned |
Bug Description
We have a 6-node Swift cluster running 2.7.0, with 2 regions. Write affinity is enabled. The regions are connected via VPN WAN (one network for storage, one for replication).
We are seeing one or more object replicators frequently entering a state where:
- no more stats progress messages are logged
- they require kill -9 before they can be shut down
This looks to be similar to https:/
I see some lock timeouts for objects too, often as the last replication log message. However, I've also spotted others where the replicator continues happily afterwards, so this may be irrelevant:
Jan 27 13:33:19 cat-por-ostor003 object-server: ERROR __call__ error with REPLICATE /obj02/2077350 : LockTimeout (10s) /srv/node/
I'll attach a segment of log for perusal.
We have not adjusted either the rsync or the lockup timeout. I wonder if we have specified too much concurrency for the replicator (I'll attach the config). Here's a snippet (nodes have 32 hyperthreaded CPUs and 4 disks):
[DEFAULT]
...
workers = 64
[pipeline:main]
pipeline = healthcheck recon object-server
[object-replicator]
concurrency = 32
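
To put the concurrency question in numbers, here is a minimal sketch. It only restates the values from the snippet above as ratios against the stated hardware; `load_ratios` and the constant names are made up for illustration, and the elided `...` lines of the config are left out rather than guessed.

```python
import configparser

# Values copied from the snippet above; the elided "..." lines are omitted.
CONFIG_TEXT = """
[DEFAULT]
workers = 64

[pipeline:main]
pipeline = healthcheck recon object-server

[object-replicator]
concurrency = 32
"""

CPUS = 32   # hyperthreaded CPUs per node, as stated in the report
DISKS = 4   # disks per node, as stated in the report


def load_ratios(text, cpus, disks):
    """Hypothetical helper: express the configured counts per CPU and per disk."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    workers = parser.getint("DEFAULT", "workers")
    concurrency = parser.getint("object-replicator", "concurrency")
    return {
        "workers_per_cpu": workers / cpus,
        "replicator_concurrency_per_disk": concurrency / disks,
    }


ratios = load_ratios(CONFIG_TEXT, CPUS, DISKS)
print(ratios)
```

With the values above this works out to 2 object-server workers per CPU and a replicator concurrency of 8 per disk. Whether that is actually too much for these nodes is the open question; the sketch does not answer it.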
The log as promised.