swift-ring-builder not distributing partitions evenly between zones

Bug #1400497 reported by Tim Leak
Affects: OpenStack Object Storage (swift)
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

There seems to be a change in swift-ring-builder behavior between Swift versions 1.10.0 and 2.2.0 that leads to multiple copies of an object being stored in the same zone.

Using the swift-ring-builder utility in Swift version 1.10.0, I could use the following commands to build an object ring for a single node with 2 replicas, 2 zones, and 3 equally weighted devices, and see the partitions distributed evenly between the 2 zones. Using the same commands after updating to Swift version 2.2.0, the partitions are distributed evenly among the 3 devices but not evenly between the 2 zones, which seems to break a fundamental concept of Swift.

Here is the script that executes the commands:

    #!/bin/bash
    REPLICA_COUNT=2
    swift-ring-builder object.builder create 15 $REPLICA_COUNT 1
    swift-ring-builder object.builder add z1-172.1.1.25:6000/d1 99.0
    swift-ring-builder object.builder add z2-172.1.1.25:6000/d2 99.0
    swift-ring-builder object.builder add z2-172.1.1.25:6000/d3 99.0
    swift-ring-builder object.builder rebalance
    swift-ring-builder object.builder set_min_part_hours 1

With version 1.10.0, I see the following output:

    WARNING: No region specified for z1-172.1.1.25:6000/d1. Defaulting to region 1.
    Device d0r1z1-172.1.1.25:6000R172.1.1.25:6000/d1_"" with 99.0 weight got id 0
    WARNING: No region specified for z2-172.1.1.25:6000/d2. Defaulting to region 1.
    Device d1r1z2-172.1.1.25:6000R172.1.1.25:6000/d2_"" with 99.0 weight got id 1
    WARNING: No region specified for z2-172.1.1.25:6000/d3. Defaulting to region 1.
    Device d2r1z2-172.1.1.25:6000R172.1.1.25:6000/d3_"" with 99.0 weight got id 2
    Reassigned 32768 (100.00%) partitions. Balance is now 50.00.
    -------------------------------------------------------------------------------
    NOTE: Balance of 50.00 indicates you should push this
          ring, wait at least 1 hours, and rebalance/repush.
    -------------------------------------------------------------------------------
    The minimum number of hours before a partition can be reassigned is now set to 1

And with version 1.10.0, I see the following builder file created:

    $ swift-ring-builder object.builder
    object.builder, build version 3
    32768 partitions, 2.000000 replicas, 1 regions, 2 zones, 3 devices, 50.00 balance
    The minimum number of hours before a partition can be reassigned is 1
    Devices: id region zone ip address port replication ip replication port name weight partitions balance meta
                 0 1 1 172.1.1.25 6000 172.1.1.25 6000 d1 99.00 32768 50.00
                 1 1 2 172.1.1.25 6000 172.1.1.25 6000 d2 99.00 16384 -25.00
                 2 1 2 172.1.1.25 6000 172.1.1.25 6000 d3 99.00 16384 -25.00

With version 2.2.0, I see the following output:

    WARNING: No region specified for z1-172.1.1.25:6000/d1. Defaulting to region 1.
    Device d0r1z1-172.1.1.25:6000R172.1.1.25:6000/d1_"" with 99.0 weight got id 0
    WARNING: No region specified for z2-172.1.1.25:6000/d2. Defaulting to region 1.
    Device d1r1z2-172.1.1.25:6000R172.1.1.25:6000/d2_"" with 99.0 weight got id 1
    WARNING: No region specified for z2-172.1.1.25:6000/d3. Defaulting to region 1.
    Device d2r1z2-172.1.1.25:6000R172.1.1.25:6000/d3_"" with 99.0 weight got id 2
    Reassigned 32768 (100.00%) partitions. Balance is now 0.00.
    The minimum number of hours before a partition can be reassigned is now set to 1

And with version 2.2.0, I see the following builder file created:

    $ swift-ring-builder object.builder
    object.builder, build version 3
    32768 partitions, 2.000000 replicas, 1 regions, 2 zones, 3 devices, 0.00 balance
    The minimum number of hours before a partition can be reassigned is 1
    Devices: id region zone ip address port replication ip replication port name weight partitions balance meta
                 0 1 1 172.1.1.25 6000 172.1.1.25 6000 d1 99.00 21846 0.00
                 1 1 2 172.1.1.25 6000 172.1.1.25 6000 d2 99.00 21845 -0.00
                 2 1 2 172.1.1.25 6000 172.1.1.25 6000 d3 99.00 21845 -0.00

Can anyone tell me what has changed, and whether additional configuration is required with the updated code to resolve this?

Thanks in advance.

Revision history for this message
Christian Schwede (cschwede) wrote :

Thanks for your bug report!

Indeed there is a change in the calculation:

Swift now also takes the device weight into account when assigning partitions. Your zone 1 has a total weight of 99, but zone 2 has a total weight of 198, so more partitions are assigned to zone 2.
If you increase the weight of device 1 to 198 (or lower the weight of devices 2 and 3 to 49.5) and rebalance the ring, you'll see that zone 1 has the same number of partitions assigned as zone 2.
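
For illustration, here is a minimal sketch of the first option, reusing the device names, IP, and port from the script in this report (untested, adjust to your environment):

    #!/bin/bash
    # Give zone 1 the same total weight as zone 2 (99 + 99 = 198),
    # so the two replicas can land in two different zones.
    REPLICA_COUNT=2
    swift-ring-builder object.builder create 15 $REPLICA_COUNT 1
    swift-ring-builder object.builder add z1-172.1.1.25:6000/d1 198.0
    swift-ring-builder object.builder add z2-172.1.1.25:6000/d2 99.0
    swift-ring-builder object.builder add z2-172.1.1.25:6000/d3 99.0
    swift-ring-builder object.builder rebalance

With equal zone weights, each zone should again receive one replica of every partition.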

This example is also a bit special because there are two zones and two replicas. Suppose you added another zone, also with a single device and a weight of 99: you would then end up with one replica in zone 2 and the other replica in zone 1 or 3.

I'll start working on a patch for this that will raise a warning if there is a problem (like in this case), and I'll update the docs as well.

Changed in swift:
assignee: nobody → Christian Schwede (cschwede)
status: New → Incomplete
status: Incomplete → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (master)

Fix proposed to branch: master
Review: https://review.openstack.org/140478

Revision history for this message
Tim Leak (tim-leak) wrote :

Thanks for the response, Christian. I understand now that the device weight is being considered when the partitions are assigned, but what I am seeing is that this behavior can cause both copies of ingested content to be stored on devices in the same zone. Again, this seems to break a fundamental concept of swift:

<from the Swift Architectural Overview, under The Ring>
"Data can be isolated with the concept of zones in the ring. Each replica of a partition is guaranteed to reside in a different zone. A zone could represent a drive, a server, a cabinet, a switch, or even a datacenter."

I would expect that the weight of the devices would need to be considered when assigning partitions within a zone, say to allow a 3TB disk to be allocated more partitions than a 1TB disk. I would not expect that the cumulative weights of the zones would need to be the same in order to guarantee that copies of ingested data are isolated.

If this is by design, and I do need to guarantee that different copies of a piece of ingested content are stored on devices in different zones, then what specific characteristics do I need to configure for the devices/zones in order to make that happen?

Is there any difference when I introduce different regions for further isolation?

Thanks again for your help.

Revision history for this message
Christian Schwede (cschwede) wrote : Re: [Bug 1400497] Re: swift-ring-builder not distributing partitions evenly between zones

Hello Tim,

On 09.12.14 23:43, Tim Leak wrote:
> Thanks for the response, Christian. I understand now that the
> device weight is being considered when the partitions are assigned,
> but what I am seeing is that this behavior can cause both copies of
> ingested content to be stored on devices in the same zone. Again,
> this seems to break a fundamental concept of swift:
>
> <from the Swift Architectural Overview, under The Ring> "Data can be
> isolated with the concept of zones in the ring. Each replica of a
> partition is guaranteed to reside in a different zone. A zone could
> represent a drive, a server, a cabinet, a switch, or even a
> datacenter."

Yes, you're right - unfortunately this section in the documents was not
updated with the patch. I submitted a patch for this:

https://review.openstack.org/#/c/140478/

> I would expect that the weight of the devices would need to be
> considered when assigning partitions within a zone, say to allow a
> 3TB disk to be allocated more partitions than a 1TB disk. I would
> not expect that the cumulative weights of the zones would need to be
> the same in order to guarantee that copies of ingested data are
> isolated.

So, let's take the following example: you have one zone with 10 x 4 TB
disks and an assigned weight of 10 x 4000, and another zone with 10 x 2 TB
disks and a weight of 10 x 2000. With Swift < 2.1 the second zone would run
out of space very quickly, so you would have to add more capacity to that
zone, and by doing so you would also add more weight.
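
To put rough numbers on that example under the new weight-based placement (plain shell arithmetic, not actual ring-builder output; the part power of 15 is borrowed from the script in this report):

    # 2 replicas * 2^15 partitions = 65536 partition replicas to place.
    # Zone weights: 10 x 4000 = 40000 and 10 x 2000 = 20000.
    PARTS=32768; REPLICAS=2
    TOTAL=$((PARTS * REPLICAS))
    echo $((TOTAL * 40000 / 60000))   # ~43690 partition replicas in the 4 TB zone
    echo $((TOTAL * 20000 / 60000))   # ~21845 partition replicas in the 2 TB zone

Ignoring any dispersion constraints, roughly a third of the partitions would then have both replicas in the 4 TB zone.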

> If this is by design, and I do need to guarantee that different
> copies of a piece of ingested content are stored on devices in
> different zones, then what specific characteristics do I need to
> configure for the devices/zones in order to make that happen?

This is by design. The problem arises when you have fewer regions/zones/nodes
than replicas and add a new region/zone/replica. Before Swift 2.1, one
replica of each partition was moved to the new region/zone, and that has
a huge impact on larger deployments or deployments with a lot of
requests. By including the weight in the calculation it is now
possible to control how much data will be moved to other regions/zones.

The documentation recommends 5 zones, and in that case (or with rings that
have even more zones) you only have to ensure that no zone is assigned more
than 1/replicas of the total weight. This should be the case if all of your
zones are similarly sized.
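
As a rough illustration of that rule of thumb, here is a hypothetical shell check (not a swift-ring-builder feature), using the zone weights from this report:

    REPLICAS=2
    ZONE_WEIGHTS="99 198"   # zone 1 and zone 2 totals from the reporter's ring
    TOTAL=$(echo "$ZONE_WEIGHTS" | awk '{for (i = 1; i <= NF; i++) s += $i} END {print s}')
    echo "$ZONE_WEIGHTS" | awk -v total="$TOTAL" -v replicas="$REPLICAS" '
      {for (i = 1; i <= NF; i++)
         if ($i > total / replicas)
           printf "zone %d weight %s > %.2f (total/replicas): replicas may share a zone\n", i, $i, total / replicas}'

For this ring it flags zone 2 (198 > 148.50), which matches the uneven distribution described in the report.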

The submitted patch will now raise a warning after rebalancing if there
is any tier at risk.

> Is there any difference when I introduce different regions for
> further isolation?

No, it's the same behavior.

Let me know if you have more questions on this. I can also have a look
at a specific ring if you can share that data.

Best,

Christian

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on swift (master)

Change abandoned by Christian Schwede (<email address hidden>) on branch: master
Review: https://review.openstack.org/140478

Revision history for this message
Samuel Merritt (torgomatic) wrote :

I'm closing this one as invalid; with the new overload stuff, users can tune their rings for dispersion vs. even weighting to exactly the degree they require. If this is incorrect, please reopen the bug.
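
For reference, the overload is set on the builder itself; a minimal sketch against the object.builder from this report (the 0.1 value is only an example, and the exact syntax may differ between Swift versions):

    # Allow each device to take up to 10% more than its weight-proportional
    # share so replicas can still be spread across zones and regions.
    swift-ring-builder object.builder set_overload 0.1
    swift-ring-builder object.builder rebalance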

Changed in swift:
assignee: Christian Schwede (cschwede) → nobody
status: In Progress → Invalid