Reconstructor should not hash suffixes after failure
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
OpenStack Object Storage (swift) | Fix Released | Medium | Unassigned | | |
Bug Description
During a big rebalance, the SSYNC replication server can suffer badly from eventlet hub starvation (i.e. a blocking stat/write in one coroutine can cause a timeout in another) - especially when there are no IO limits per device.
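The starvation effect can be reproduced in miniature. This is only an illustrative sketch: it uses the standard library's asyncio as a stand-in for eventlet (both are cooperative schedulers), and the function names and timings are made up for the example rather than taken from Swift.

```python
import asyncio
import time


async def blocking_worker(block_seconds):
    # Stands in for a coroutine doing a blocking stat/write:
    # time.sleep() never yields control back to the event loop.
    time.sleep(block_seconds)


async def short_task():
    # Would finish almost immediately if the loop were free to run it.
    await asyncio.sleep(0.01)
    return "done"


async def demo(block_seconds=0.5, timeout=0.2):
    # Schedule the blocking coroutine, then wait on an unrelated short task.
    blocker = asyncio.create_task(blocking_worker(block_seconds))
    try:
        result = await asyncio.wait_for(short_task(), timeout)
    except asyncio.TimeoutError:
        # The short task did nothing wrong; the loop was simply starved.
        result = "timed out"
    await blocker
    return result


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With the default arguments the short task should hit its timeout even though it only needs about 10 ms of wall-clock time; calling `demo(block_seconds=0.0)` lets it finish normally.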
The SSYNC protocol *is* limited per device by default (replication_
Even when running in handoffs_first - any reconstructor ssync request (even ones that get rejected for concurrency) will trigger a rehash [1]
In the happy path, this unbounded IO is optimistic at best; at worst it's maybe only marginally helpful - but unlike rsync, ssync invalidates hashes as it goes - so it is *not* required. In the error path, it is actively harmful, committing valuable IOPS to zero-value work and actively contributing to the very IO contention that is causing the problem. Is there a word for a positive feedback loop with a negative outcome? Downward death spiral?
Fix is trivial - attached.
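The attached patch is not reproduced in this report. As a rough sketch of the behaviour the title asks for, and assuming hypothetical callables `sync` and `rehash_remote` (illustrative names only, not Swift's actual API), the idea is to ask the remote to hash suffixes only when the sync itself succeeded:

```python
def revert_suffixes(node, suffixes, sync, rehash_remote):
    """Hypothetical sketch: skip the remote rehash when the sync failed.

    `sync` and `rehash_remote` are assumed callables supplied by the
    caller; they are placeholders, not functions from Swift.
    """
    success = sync(node, suffixes)
    if success:
        # Only spend remote IO on hashing when the sync actually worked.
        rehash_remote(node, suffixes)
    return success


if __name__ == "__main__":
    calls = []
    # A sync rejected by the remote (e.g. for concurrency) triggers no rehash.
    revert_suffixes("node-a", ["abc"], lambda n, s: False,
                    lambda n, s: calls.append((n, s)))
    print(calls)
```

A rejected or failed request then costs the remote no extra hashing IO, which is what breaks the feedback loop described above.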
Changed in swift: | |
importance: | Undecided → High |
summary: |
- Reconstructor should not sync after failure
+ Reconstructor should not hash suffixes after failure |
Changed in swift: | |
importance: | High → Medium |
https://review.openstack.org/#/c/435152/