min_part_hours of 0 is not helpful - fix it
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
OpenStack Object Storage (swift) |
Fix Released
|
Medium
|
Cheng Li |
Bug Description
As more people are moving to monitoring partition placement in their clusters independently to control when it's "safe" to rebalance a ring - min_part_hours is starting to create more and more surprising behaviors (e.g. lp bug #1558754)
Unfortunately a min_part_hours of 0 is a non-starter because of it's tendency to move more than one replica of a part in the same rebalance [1] - which is *related* to the constraint of min_part_hours but somewhat orthogonal. Even if a part hasn't moved in >24 hours it's still never sane to move more than one part-replica in the same rebalance cycle. No matter how small your min_part_hours it should still require at least N rebalances to move N part-replicas of the same part.
I wouldn't necessarily *recommend* running your builders with min-part-hours of 0 - but if you're fully automating dispersion report style partition placement monitoring and tying that into automated ring rebalancing it's starting to look more attractive to take full control of how frequently part-replicas are allowed to move separately from min_part_hours. Essentially as soon as parts are on the right nodes - if you have pending changes to the ring (changes weights of devices your gradually adding/removing, remove failed devices or entire nodes, etc.) it's best to allow those changes to proceed as soon as the replication from the previous rebalance is "finished". Worse still if a topology change coupled with device failures causes new capacity to receive parts that are not optimum for dispersion (because the best parts are locked down from min_part_hours) - you may end up in a placement hole that's tedious for the current rebalance algorithm to dig out of (see all the "part_swapping" tests in test.unit.
It's possible to run with a *non-zero* min_part_hours and use pretend_
Discuss.
1. here's an obvious example:
#!/bin/bash
rm test.builder || true
swift-ring-builder test.builder create 8 3 0
swift-ring-builder test.builder add r1z1-127.
swift-ring-builder test.builder add r1z1-127.
swift-ring-builder test.builder add r1z1-127.
swift-ring-builder test.builder rebalance
swift-ring-builder test.builder add r1z1-127.
swift-ring-builder test.builder add r1z1-127.
swift-ring-builder test.builder add r1z1-127.
swift-ring-builder test.builder rebalance
if [ $? -eq 0 ]; then
echo "Should not have reassigned >100% of part-replicas!"
exit 1
fi
Changed in swift: | |
importance: | Undecided → Medium |
Hi clayg,
I can think of two solutions for this problem. And I prefer the first one, but I don't know if this will cause other problems.
Proposal 1.
One hour is very long, but one **second** is not. Cloud we change the unit of elapsed time to second instead of hour?
for example: /github. com/openstack/ swift/blob/ master/ swift/common/ ring/builder. py#L847- L858
elapsed_ seconds = int(time() - self._last_ part_moves_ epoch)
last_ plus_elapsed = self._last_ part_moves[ part] + elapsed_seconds
self. _last_part_ moves[part] = last_plus_elapsed
self. _last_part_ moves[part] = 0xff * 3600
self._ last_part_ moves_epoch = int(time())
https:/
these code could be replaced with follow lines:
```
if elapsed_seconds <= 0:
return
for part in range(self.parts):
if last_plus_elapsed < 0xff * 3600:
else:
```
and part_moves[ part] < self.min_part_hours part_moves[ part] <= self.min_part_hours * 3600
```
self._last_
```
be replaced by
```
self._last_
```
Proposal 2.
Adding an array to track if one partition had ever been moved in the rebalance.
``` part_moves[ part] = 0
self._last_
```
will be replaced with:
``` part_moves[ part] = 0 moved[part] = True
self._last_
self._part_
```
``` part_moves[ part] < self.min_part_hours
if self._last_
```
be replaces with: part_moves[ part] < self.min_part_hours or self._part_ moved[part]
```
if self._last_
```