Ring refuses to save even when 100% parts move
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
OpenStack Object Storage (swift) | Fix Released | Medium | Unassigned | |
Bug Description
If you're gradually adding weight while adding devices in multiple zones to an EC ring with multiple replicas in each zone, the ring's preference for moving replicas of parts that need to disperse to the new zone *first* can leave some new capacity in the original zone waiting for enough frags to move before it can start taking the part-replicas it wants.
A script that duplicates the problem is attached.
Workarounds are either to run "rebalance -f" multiple times, until enough frags get assigned to the other zone that the balance-change detection code starts working again,
or to change weights before you rebalance until the over-assignment in the standing zone becomes more apparent in the balance.
The fix would be to look at dispersion or changed_parts coming out of rebalance (in addition to delta balance) before we decide whether the rebalance is worth saving.
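As a rough illustration of the proposed check, here is a minimal sketch. It is not the swift-ring-builder code: the function names, argument names and the `min_delta` parameter are hypothetical, and the 1% threshold is taken only from the "did not change at least 1%" message. The example values assume the refused rebalance would have produced roughly the same balance and dispersion as the forced one in this report.

```python
def balance_only_check(last_balance, balance, min_delta=1.0):
    """Hypothetical stand-in for a check that only looks at delta balance."""
    return abs(last_balance - balance) >= min_delta


def should_save_rebalance(parts_moved, last_balance, balance,
                          last_dispersion, dispersion,
                          force=False, min_delta=1.0):
    """Hypothetical check that also considers delta dispersion.

    All names here are assumptions made for illustration only.
    """
    if force:
        # The operator asked for the save explicitly (like "rebalance -f").
        return True
    if parts_moved == 0:
        # Nothing was reassigned, so there is nothing worth saving.
        return False
    delta_balance = abs(last_balance - balance)
    delta_dispersion = abs(last_dispersion - dispersion)
    # Save if either metric moved by at least the threshold.
    return delta_balance >= min_delta or delta_dispersion >= min_delta


# Values from the report: balance stays at 100.00 while
# dispersion drops from 100.00 to 0.00 and all 256 parts move.
print(balance_only_check(100.0, 100.0))
print(should_save_rebalance(256, 100.0, 100.0, 100.0, 0.0))
```

With a balance-only check the rebalance is rejected even though every part moved; adding the dispersion delta lets the same rebalance through without "-f".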
With enough replicas and a failed device it's easier to see that we should look at delta_dispersion in addition to delta_balance:
https://gist.github.com/clayg/b0d0d41a382e70356bb58a1ee94d1b73
With the failed device on the server that's desperately trying to shed parts, and enough replicas, balance will not change significantly from one invocation to the next while rebalance is busy fixing dispersion.
We should expect that as desirable behavior and use delta_dispersion to get over the hump:
ubuntu@saio:/vagrant/.scratch/rings/tata$ swift-ring-builder stuck.builder rebalance
Cowardly refusing to save rebalance as it did not change at least 1%.
ubuntu@saio:/vagrant/.scratch/rings/tata$ swift-ring-builder stuck.builder |head
stuck.builder, build version 63, id a5b9fbd213bb4c20ab60eff2a2bb3a75
256 partitions, 13.000000 replicas, 1 regions, 1 zones, 52 devices, 100.00 balance, 100.00 dispersion
...
ubuntu@saio:/vagrant/.scratch/rings/tata$ swift-ring-builder stuck.builder rebalance
Cowardly refusing to save rebalance as it did not change at least 1%.
ubuntu@saio:/vagrant/.scratch/rings/tata$ swift-ring-builder stuck.builder rebalance -f
Reassigned 256 (100.00%) partitions. Balance is now 100.00. Dispersion is now 0.00
-------------------------------------------------------------------------------
NOTE: Balance of 100.00 indicates you should push this
ring, wait at least 0 hours, and rebalance/repush.
-------------------------------------------------------------------------------
ubuntu@saio:/vagrant/.scratch/rings/tata$ swift-ring-builder stuck.builder rebalance
Reassigned 255 (99.61%) partitions. Balance is now 1.56. Dispersion is now 0.00
Notice that the delta_dispersion when "cowardly refusing to save rebalance" is *HUGE*.