min_part_hours of 0 is not helpful - fix it

Bug #1586167 reported by clayg
This bug affects 1 person
Affects: OpenStack Object Storage (swift)
Status: Fix Released
Importance: Medium
Assigned to: Cheng Li
Milestone: (none)

Bug Description

As more people move to monitoring partition placement in their clusters independently to control when it's "safe" to rebalance a ring, min_part_hours is starting to create more and more surprising behaviors (e.g. lp bug #1558754).

Unfortunately a min_part_hours of 0 is a non-starter because of its tendency to move more than one replica of a part in the same rebalance [1] - which is *related* to the constraint of min_part_hours but somewhat orthogonal. Even if a part hasn't moved in >24 hours it's still never sane to move more than one part-replica in the same rebalance cycle. No matter how small your min_part_hours, it should still require at least N rebalances to move N part-replicas of the same part.

I wouldn't necessarily *recommend* running your builders with min-part-hours of 0 - but if you're fully automating dispersion-report-style partition placement monitoring and tying that into automated ring rebalancing, it's starting to look more attractive to take full control of how frequently part-replicas are allowed to move, separately from min_part_hours. Essentially, as soon as parts are on the right nodes - if you have pending changes to the ring (changed weights on devices you're gradually adding/removing, removed failed devices or entire nodes, etc.) - it's best to allow those changes to proceed as soon as the replication from the previous rebalance is "finished". Worse still, if a topology change coupled with device failures causes new capacity to receive parts that are not optimal for dispersion (because the best parts are locked down by min_part_hours), you may end up in a placement hole that's tedious for the current rebalance algorithm to dig out of (see all the "part_swapping" tests in test.unit.common.ring.test_builder).

It's possible to run with a *non-zero* min_part_hours and use pretend_min_part_hours_passed before each rebalance to achieve essentially this behavior, but it's less obvious why that's better than just "fixing" min_part_hours 0.
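
For illustration, that workaround looks roughly like this against the RingBuilder API (a sketch only; the object.builder path is just an example, and it assumes external monitoring has already confirmed replication from the previous rebalance is finished):

```
# Sketch: force a rebalance once external monitoring says it's safe,
# regardless of how recently parts last moved.
from swift.common.ring import RingBuilder

builder = RingBuilder.load('object.builder')
builder.pretend_min_part_hours_passed()  # reset the per-part move timers
builder.rebalance()
builder.save('object.builder')
```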

Discuss.

1. Here's an obvious example:

#!/bin/bash
rm test.builder || true
# part_power=8, replicas=3, min_part_hours=0
swift-ring-builder test.builder create 8 3 0
swift-ring-builder test.builder add r1z1-127.0.0.1:6000/d1 1.0
swift-ring-builder test.builder add r1z1-127.0.0.1:6000/d2 1.0
swift-ring-builder test.builder add r1z1-127.0.0.1:6000/d3 1.0
swift-ring-builder test.builder rebalance
# double the capacity on a second node, then rebalance again immediately
swift-ring-builder test.builder add r1z1-127.0.0.2:6000/d4 1.0
swift-ring-builder test.builder add r1z1-127.0.0.2:6000/d5 1.0
swift-ring-builder test.builder add r1z1-127.0.0.2:6000/d6 1.0
swift-ring-builder test.builder rebalance
if [ $? -eq 0 ]; then
    echo "Should not have reassigned >100% of part-replicas!"
    exit 1
fi

Revision history for this message
Cheng Li (shcli) wrote :

Hi clayg,

I can think of two solutions to this problem. I prefer the first one, but I don't know whether it will cause other problems.

Proposal 1.
One hour is very long, but one *second* is not. Could we change the unit of elapsed time to seconds instead of hours?

For example, this code:
https://github.com/openstack/swift/blob/master/swift/common/ring/builder.py#L847-L858
could be replaced with the following lines:
```
        elapsed_seconds = int(time() - self._last_part_moves_epoch)
        if elapsed_seconds <= 0:
            return
        for part in range(self.parts):
            last_plus_elapsed = self._last_part_moves[part] + elapsed_seconds
            if last_plus_elapsed < 0xff * 3600:
                self._last_part_moves[part] = last_plus_elapsed
            else:
                self._last_part_moves[part] = 0xff * 3600
        self._last_part_moves_epoch = int(time())
```

and
```
self._last_part_moves[part] < self.min_part_hours
```
would be replaced by
```
self._last_part_moves[part] <= self.min_part_hours * 3600
```

Proposal 2.
Add an array to track whether a partition has already been moved during the current rebalance (a combined sketch follows these fragments).

```
self._last_part_moves[part] = 0
```

will be replaced with:

```
self._last_part_moves[part] = 0
self._part_moved[part] = True
```

```
if self._last_part_moves[part] < self.min_part_hours
```

would be replaced with:
```
if self._last_part_moves[part] < self.min_part_hours or self._part_moved[part]
```
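
Pulling those fragments together, here's a minimal self-contained sketch of the idea (the class and method names are illustrative, not the actual builder code):

```
class MoveGate(object):
    """Illustrative sketch of Proposal 2, not actual swift code."""
    def __init__(self, parts, min_part_hours):
        self.parts = parts
        self.min_part_hours = min_part_hours
        self._last_part_moves = bytearray([0xff] * parts)  # hours since last move
        self._part_moved = [False] * parts  # reset at the start of every rebalance

    def start_rebalance(self):
        self._part_moved = [False] * self.parts

    def can_move(self, part):
        # movable only if min_part_hours has passed *and* no replica of this
        # part has already moved in the current rebalance
        return (self._last_part_moves[part] >= self.min_part_hours
                and not self._part_moved[part])

    def record_move(self, part):
        self._last_part_moves[part] = 0
        self._part_moved[part] = True
```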

Changed in swift:
assignee: nobody → Cheng Li (shcli)
Revision history for this message
clayg (clay-gerrard) wrote :

I'm not sure exactly how the first proposal works; the trick seems to be adding granularity to _last_part_moves - which is currently just a cheap byte array.

The second proposal seems quite reasonable to me and was indeed how I was thinking the problem would be approached. We'd only need to keep it around during the rebalance; depending on how many parts move in a rebalance, a 2 ** part_power bitmap might be more attractive than a huge Python dictionary of booleans. This would change the behavior of rings with min_part_hours = 0, but for the better, as I reason it.

Revision history for this message
Cheng Li (shcli) wrote :

Yes, the second one is clearer. And the idea of a bitmap is great.

Revision history for this message
Cheng Li (shcli) wrote :

I have searched the current swift code but didn't find any bitmap-related code.
Does swift/openstack have a standard for using bitmaps?

Revision history for this message
clayg (clay-gerrard) wrote :

... not to my knowledge? Maybe just "do it"?

parts_moved = 0

# set part
parts_moved |= 1 << part

# test part
parts_moved & 1 << part

I was poking at it with sys.getsizeof(parts_moved), comparing parts_moved[random.randint(0, 2 ** 16)] = True (a dict) vs parts_moved |= 1 << random.randint(0, 2 ** 16) (a big int), and even after only a few hundred parts I think we might be better off with the big num than the dict?

... maybe there's something better.
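
Roughly, that comparison looks like this (a sketch; the part power and the number of moved parts are arbitrary, and getsizeof() on a dict doesn't count its keys and values, so it understates the dict's real footprint):

```
import random
import sys

part_power = 16
moved_dict = {}   # dict-of-booleans approach
moved_bits = 0    # big-int bitmap approach

for _ in range(500):  # "a few hundred parts"
    part = random.randint(0, 2 ** part_power - 1)
    moved_dict[part] = True
    moved_bits |= 1 << part

print('dict: %d bytes, big int: %d bytes' % (
    sys.getsizeof(moved_dict), sys.getsizeof(moved_bits)))
```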

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (master)

Fix proposed to branch: master
Review: https://review.openstack.org/346475

Changed in swift:
status: New → In Progress
clayg (clay-gerrard)
Changed in swift:
importance: Undecided → Medium
Revision history for this message
Matthew Oliver (matt-0) wrote :

I remember playing with a bitmap idea back in sharding POC #1, where I'd track which container shards existed at certain shard powers. I don't know how useful it would be, but here is a link to the basic Bitmap class I created for it. It was never fully tested though, and unlike just a number, it used a string - really only a hex string (each character represents 4 partitions). I don't know why I didn't just go one step further and use ints and bit shifting (you'd have to ask past Matt) - so you could say this is somewhere in between the 2 solutions.

But either way, seeing as POC #1 is a no-go, maybe the Bitmap class that was never used could be fixed, tuned and improved: https://github.com/matthewoliver/swift/blob/sharding/swift/common/utils.py#L3324

Matt

Revision history for this message
Cheng Li (shcli) wrote :

Thanks, Matt. In the patch I used bytes directly, because I didn't need many bitmap features, just set and get methods.
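
For readers following along, a bytes-backed "moved" bitmap with just set and get looks something like this (an illustrative sketch along the lines described above, not the exact patch code - see the review for the real implementation):

```
class PartMovedBitmap(object):
    """Sketch of a one-bit-per-partition "moved" map backed by a bytearray."""
    def __init__(self, part_power):
        # one bit per partition, packed eight to a byte
        self.bitmap = bytearray(max(2 ** part_power // 8, 1))

    def set(self, part):
        self.bitmap[part // 8] |= 128 >> (part % 8)

    def get(self, part):
        return bool(self.bitmap[part // 8] & (128 >> (part % 8)))

# e.g.
# bm = PartMovedBitmap(8)
# bm.set(42)
# bm.get(42)  # True
```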

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.openstack.org/346475
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=ce26e78902a8c2feff5706a07750e34e9a94536e
Submitter: Jenkins
Branch: master

commit ce26e78902a8c2feff5706a07750e34e9a94536e
Author: cheng <email address hidden>
Date: Sun Jul 24 01:10:36 2016 +0000

    For any part, only one replica can move in a rebalance

    With a min_part_hours of zero, it's possible to move more than one
    replica of the same part in a single rebalance.

    This change in behavior only affects min_part_hours zero rings, which
    are understood to be uncommon in production, mostly because of this
    very specific and strange behavior of min_part_hours zero rings.

    With this change, no matter how small your min_part_hours, it will
    always require at least N rebalances to move N part-replicas of the
    same part.

    To supplement the existing persisted _last_part_moves structure that
    enforces min_part_hours, this change adds a _part_moved_bitmap that
    exists only for the life of the rebalance, to track when rebalance
    moves a part in order to prevent other replicas of the same part
    from being moved in the same rebalance.

    Add authors: Clay Gerrard, <email address hidden>
                 Christian Schwede, <email address hidden>

    Closes-bug: #1586167

    Change-Id: Ia1629abd5ce6e1b3acc2e94f818ed8223eed993a

Changed in swift:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/419816

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/swift 2.13.0

This issue was fixed in the openstack/swift 2.13.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (stable/newton)

Reviewed: https://review.openstack.org/419816
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=e5dd050113646a93a0fe7fb1aed4f1cafdc9139f
Submitter: Jenkins
Branch: stable/newton

commit e5dd050113646a93a0fe7fb1aed4f1cafdc9139f
Author: cheng <email address hidden>
Date: Sun Jul 24 01:10:36 2016 +0000

    For any part, only one replica can move in a rebalance

    With a min_part_hours of zero, it's possible to move more than one
    replica of the same part in a single rebalance.

    This change in behavior only affects min_part_hours zero rings, which
    are understood to be uncommon in production, mostly because of this
    very specific and strange behavior of min_part_hours zero rings.

    With this change, no matter how small your min_part_hours, it will
    always require at least N rebalances to move N part-replicas of the
    same part.

    To supplement the existing persisted _last_part_moves structure that
    enforces min_part_hours, this change adds a _part_moved_bitmap that
    exists only for the life of the rebalance, to track when rebalance
    moves a part in order to prevent other replicas of the same part
    from being moved in the same rebalance.

    Add authors: Clay Gerrard, <email address hidden>
                 Christian Schwede, <email address hidden>

    Closes-bug: #1586167

    Change-Id: Ia1629abd5ce6e1b3acc2e94f818ed8223eed993a
    (cherry picked from commit ce26e78)

tags: added: in-stable-newton
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/swift 2.10.2

This issue was fixed in the openstack/swift 2.10.2 release.
