swift-ringbuilder rebalance moves 100% partitions when adding a new node to a new region

Bug #1367826 reported by Florent Flament
Affects: OpenStack Object Storage (swift)
Status: Fix Released
Importance: Undecided
Assigned to: Christian Schwede
Milestone: 2.2.0

Bug Description

When adding a new node to a new region of an existing Swift cluster,
100% of the partitions are moved. By contrast, when a node is added to
the current region, only the required partitions are moved.

Patch https://review.openstack.org/#/c/115441/ allows moving
partitions to a new region progressively. The drawback is that a lot
of useless traffic is generated inside the initial region.

Starting from a 5-node Swift cluster, when I add a new device with
weight 1000 to the current region 'r1', 18.75% of the partitions are
moved to this new device (as expected).
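The expected 18.75% follows directly from the weights; here is a quick sanity check (a back-of-the-envelope sketch, assuming the weights shown in the builder output below):

```python
# Five existing devices of weight 3000, one new device of weight 1000.
partitions = 262144
replicas = 3
total_weight = 5 * 3000 + 1000            # 16000
replica_slots = partitions * replicas     # 786432 replica assignments
# Share of replica slots the new device deserves by weight:
new_device_parts = replica_slots * 1000 // total_weight
moved_fraction = new_device_parts / partitions
print(new_device_parts)    # 49152
print(moved_fraction)      # 0.1875, i.e. 18.75%
```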

  $ swift-ring-builder object.builder
  object.builder, build version 5
  262144 partitions, 3.000000 replicas, 1 regions, 1 zones, 5 devices, 0.00 balance
  The minimum number of hours before a partition can be reassigned is 0
  Devices: id region zone ip address port replication ip replication port name weight partitions balance meta
               0 1 1 192.168.100.10 6000 192.168.100.10 6000 d1 3000.00 157287 0.00
               1 1 1 192.168.100.11 6000 192.168.100.11 6000 d1 3000.00 157286 -0.00
               2 1 1 192.168.100.12 6000 192.168.100.12 6000 d1 3000.00 157286 -0.00
               3 1 1 192.168.100.13 6000 192.168.100.13 6000 d1 3000.00 157287 0.00
               4 1 1 192.168.100.14 6000 192.168.100.14 6000 d1 3000.00 157286 -0.00
  $ swift-ring-builder object.builder add r1z1-192.168.100.15:6000/d1 1000
  Device d5r1z1-192.168.100.15:6000R192.168.100.15:6000/d1_"" with 1000.0 weight got id 5
  $ swift-ring-builder object.builder rebalance
  Reassigned 49152 (18.75%) partitions. Balance is now 0.00.
  $ swift-ring-builder object.builder
  object.builder, build version 7
  262144 partitions, 3.000000 replicas, 1 regions, 1 zones, 6 devices, 0.00 balance
  The minimum number of hours before a partition can be reassigned is 0
  Devices: id region zone ip address port replication ip replication port name weight partitions balance meta
               0 1 1 192.168.100.10 6000 192.168.100.10 6000 d1 3000.00 147456 0.00
               1 1 1 192.168.100.11 6000 192.168.100.11 6000 d1 3000.00 147456 0.00
               2 1 1 192.168.100.12 6000 192.168.100.12 6000 d1 3000.00 147456 0.00
               3 1 1 192.168.100.13 6000 192.168.100.13 6000 d1 3000.00 147456 0.00
               4 1 1 192.168.100.14 6000 192.168.100.14 6000 d1 3000.00 147456 0.00
               5 1 1 192.168.100.15 6000 192.168.100.15 6000 d1 1000.00 49152 0.00
  $

Now I start from the same builder as before, with my 5-node
cluster. When I add a new device with weight 1000 to a new region
'r2', 100% of the partitions (262144) are moved, whereas 18.75% were
expected, as when adding the node to region 'r1'. While it is true
that only 49152 partitions end up in the new region, many partitions
seem to have been moved uselessly between nodes of region 'r1'. This
would most likely generate heavy traffic on a running cluster.

  $ swift-ring-builder object.builder
  object.builder, build version 5
  262144 partitions, 3.000000 replicas, 1 regions, 1 zones, 5 devices, 0.00 balance
  The minimum number of hours before a partition can be reassigned is 0
  Devices: id region zone ip address port replication ip replication port name weight partitions balance meta
               0 1 1 192.168.100.10 6000 192.168.100.10 6000 d1 3000.00 157287 0.00
               1 1 1 192.168.100.11 6000 192.168.100.11 6000 d1 3000.00 157286 -0.00
               2 1 1 192.168.100.12 6000 192.168.100.12 6000 d1 3000.00 157286 -0.00
               3 1 1 192.168.100.13 6000 192.168.100.13 6000 d1 3000.00 157287 0.00
               4 1 1 192.168.100.14 6000 192.168.100.14 6000 d1 3000.00 157286 -0.00
  $ swift-ring-builder object.builder add r2z1-192.168.100.15:6000/d1 1000
  Device d5r2z1-192.168.100.15:6000R192.168.100.15:6000/d1_"" with 1000.0 weight got id 5
  $ swift-ring-builder object.builder rebalance
  Reassigned 262144 (100.00%) partitions. Balance is now 0.00.
  $ swift-ring-builder object.builder
  object.builder, build version 7
  262144 partitions, 3.000000 replicas, 2 regions, 2 zones, 6 devices, 0.00 balance
  The minimum number of hours before a partition can be reassigned is 0
  Devices: id region zone ip address port replication ip replication port name weight partitions balance meta
               0 1 1 192.168.100.10 6000 192.168.100.10 6000 d1 3000.00 147456 0.00
               1 1 1 192.168.100.11 6000 192.168.100.11 6000 d1 3000.00 147456 0.00
               2 1 1 192.168.100.12 6000 192.168.100.12 6000 d1 3000.00 147456 0.00
               3 1 1 192.168.100.13 6000 192.168.100.13 6000 d1 3000.00 147456 0.00
               4 1 1 192.168.100.14 6000 192.168.100.14 6000 d1 3000.00 147456 0.00
               5 2 1 192.168.100.15 6000 192.168.100.15 6000 d1 1000.00 49152 0.00
  $

The initial cluster was built with the following script:

  #!/bin/bash
  swift-ring-builder object.builder create 18 3 0

  IPs="192.168.100.10 192.168.100.11 192.168.100.12 192.168.100.13 192.168.100.14"

  for ip in $IPs
  do
      swift-ring-builder object.builder add r1z1-${ip}:6000/d1 3000
  done

  swift-ring-builder object.builder
  swift-ring-builder object.builder rebalance

Changed in swift:
assignee: nobody → Christian Schwede (cschwede)
status: New → In Progress
Revision history for this message
Christian Schwede (cschwede) wrote :

Florent, thanks for your extensive test.

I just had a quick look into this, and currently I think only the displayed percentage is wrong. The actual number of moved partitions from one device to another is quite low; I attached a small shell script that diffs the assigned partitions after each rebalance, and these numbers are very low (as expected).

Will have a deeper look into this tomorrow.
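The comparison described above can be sketched roughly as follows (a hypothetical stand-in for the attached script, which is not reproduced here; the table shapes mirror the ring's replica-to-partition-to-device assignment table):

```python
def count_moved_replicas(before, after):
    """Count replica assignments whose device actually changed.

    before/after: one list per replica, mapping partition index ->
    device id, captured before and after a rebalance.
    """
    moved = 0
    for old_row, new_row in zip(before, after):
        for old_dev, new_dev in zip(old_row, new_row):
            if old_dev != new_dev:
                moved += 1
    return moved

# Four partitions, three replicas; only partition 2 of replica 0 moved.
before = [[0, 1, 0, 1], [2, 3, 2, 3], [4, 4, 4, 4]]
after  = [[0, 1, 5, 1], [2, 3, 2, 3], [4, 4, 4, 4]]
print(count_moved_replicas(before, after))  # 1
```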

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (master)

Fix proposed to branch: master
Review: https://review.openstack.org/120713

Revision history for this message
Florent Flament (florentflament) wrote :

Hi Christian,
Thank you very much for your feedback.

I did some more testing based on the `sim.sh` script that you
provided. I uploaded an updated version, `sim2.sh`, which allows you
to specify the number of devices in the initial region. The partition
count is computed based on this number.

My findings are the following:

* You are right, the reassigned-partition count displayed by the
  `rebalance` command is wrong.

* In the case of a 3-device initial region, the number of partitions
  actually moved is almost optimal: 16 partitions are assigned to the
  new node, with a total of 17 partitions moved (3.3% of partitions).

* However, as the device count in the initial region grows, so does
  the number of partitions moved. In the following example, with a
  10-device initial region, 61 partitions were added to the new node,
  while a total of 1806 partitions were moved (88.2% of the
  partitions).

What do you think about that?

Best regards,
Florent Flament

Details
-------

$ sh sim2.sh 3
Devices count in the initial region: 3
Partitions count is: 512
+ swift-ring-builder test.builder add r2z0-127.0.0.1:6000/sdb1 3
Device d3r2z0-127.0.0.1:6000R127.0.0.1:6000/sdb1_"" with 3.0 weight got id 3
+ swift-ring-builder test.builder rebalance
Reassigned 512 (100.00%) partitions. Balance is now 5.21.
-------------------------------------------------------------------------------
NOTE: Balance of 5.21 indicates you should push this
      ring, wait at least 0 hours, and rebalance/repush.
-------------------------------------------------------------------------------
Partitions changes inside initial region: 18
Partitions count for new device: 16
Total partitions moved: 17 (3.3203125%)

$ sh sim2.sh 10
Devices count in the initial region: 10
Partitions count is: 2048
+ swift-ring-builder test.builder add r2z0-127.0.0.1:6000/sdb1 10
Device d10r2z0-127.0.0.1:6000R127.0.0.1:6000/sdb1_"" with 10.0 weight got id 10
+ swift-ring-builder test.builder rebalance
Reassigned 2048 (100.00%) partitions. Balance is now 0.28.
Partitions changes inside initial region: 3551
Partitions count for new device: 61
Total partitions moved: 1806 (88.18359375%)

Revision history for this message
Christian Schwede (cschwede) wrote :

Hello Florent,

you're right - there is still a lot of partition movement in the initial region, depending on the size of the cluster (it gets worse with bigger clusters).

I reopened a former abandoned patch and updated it: https://review.openstack.org/#/c/105666/

Best,

Christian

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/121422

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on swift (master)

Change abandoned by Christian Schwede (<email address hidden>) on branch: master
Review: https://review.openstack.org/105666
Reason: Abandoning; https://review.openstack.org/#/c/121422/ is the better solution IMO.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.openstack.org/120713
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=09bdc87cbc1e7bc1918f9b5094bec266b6761d75
Submitter: Jenkins
Branch: master

commit 09bdc87cbc1e7bc1918f9b5094bec266b6761d75
Author: Christian Schwede <email address hidden>
Date: Thu Sep 11 08:01:51 2014 +0000

    Return correct number of changed partitions

    When a ring is rebalanced the number of changed partitions is counted.
    Before this patch partitions might be rebalanced, but actually no data
    is moved - for example, when a partition is assigned to the same device
    as before. This results in a wrong number of reassigned partitions that
    is shown to the user.

    This patch remembers the partition allocation before the rebalance, and
    compares it to the new allocation after a rebalance. Only partitions
    that are stored on a different device than before are counted.

    Partial-Bug: 1367826
    Also-By: Florent Flament <email address hidden>
    Change-Id: Iacfd514df3af351791f9191cef78cff1b3e2645f
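The idea of this fix can be illustrated with a toy version (illustrative only, not Swift's actual implementation): a partition counts as reassigned only if the set of devices holding its replicas changed, so replica swaps that move no data are ignored.

```python
def changed_partitions(old_table, new_table):
    """old_table/new_table: one list per replica, partition -> device id.
    A partition counts as changed only if its device set differs."""
    num_parts = len(old_table[0])
    changed = 0
    for part in range(num_parts):
        old_devs = {row[part] for row in old_table}
        new_devs = {row[part] for row in new_table}
        if old_devs != new_devs:
            changed += 1
    return changed

# Partition 0: replicas swapped between devices 1 and 2 -> no data moves.
# Partition 1: device 3 replaced by device 9 -> data moves.
old = [[1, 3], [2, 4]]
new = [[2, 9], [1, 4]]
print(changed_partitions(old, new))  # 1
```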

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (feature/ec)

Fix proposed to branch: feature/ec
Review: https://review.openstack.org/124503

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (feature/ec)

Reviewed: https://review.openstack.org/124503
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=bcaa00f25f3e8bd4123645256c535d0efe8a6733
Submitter: Jenkins
Branch: feature/ec

commit 15fbf9fe7cf33ed4b56569078400a2ba070b6bce
Author: paul luse <email address hidden>
Date: Thu Sep 11 06:55:45 2014 -0700

    Add container_count to policy_stat table

    Start tracking the container count per policy including reporting
    it in account HEAD and supporting installations where the DB
    existed before the updated schema.

    Migration is triggered by the account auditor; if the database is
    un-migrated it will continue to report policy_stats without the per
    policy container_count keys.

    Closes-Bug: #1367514
    Change-Id: I07331cea177e19b3df303609a4ac510765a19162

commit d10462e8704e7f5efe03c4024b541a3780348615
Author: Darrell Bishop <email address hidden>
Date: Tue Sep 23 09:11:44 2014 -0700

    Fix profile tests to clean up its tempdirs.

    Change-Id: I363651046cf414e14f15affd834043aabd5427c0

commit b68258a322cb004151b84584d00b3c76ee01bc29
Author: Mahati Chamarthy <email address hidden>
Date: Fri Sep 5 01:42:17 2014 +0530

    Added instructions to create a label or UUID to the XFS volume and mount using it.

    Change-Id: Idcaf16a278d6c34770af9b1f17d69bdd94fb86b7

commit 4d23a0fcf5faa6339a1a58fcbdab8687a6c88feb
Author: Samuel Merritt <email address hidden>
Date: Thu Aug 28 09:39:38 2014 -0800

    Reject overly-taxing ranged-GET requests

    RFC 7233 says that servers MAY reject egregious range-GET requests
    such as requests with hundreds of ranges, requests with non-ascending
    ranges, and so on.

    Such requests are fairly hard for Swift to process. Consider a Range
    header that asks for the first byte of every 10th MiB in a 4 GiB
    object, but in some random order. That'll cause a lot of seeks on the
    object server, but the corresponding response body is quite small in
    comparison to the workload.

    This commit makes Swift reject, with a 416 response, any ranged GET
    request with more than fifty ranges, more than three overlapping
    ranges, or more than eight non-increasing ranges.

    This is a necessary prerequisite for supporting multi-range GETs on
    large objects. Otherwise, a malicious user could construct a Range
    header with hundreds of byte ranges where each individual byterange
    requires the proxy to contact a different object server. If seeking
    all over a disk is bad, connecting all over the cluster is way worse.

    DocImpact

    Change-Id: I4dcedcaae6c3deada06a0223479e611094d57234

commit 5efdab6055bc99638e4e1388bef685b19682025d
Author: OpenStack Proposal Bot <email address hidden>
Date: Mon Sep 22 06:07:56 2014 +0000

    Imported Translations from Transifex

    Change-Id: Ibd8882766a87c6d77e786f7635b1290391e43f10

commit 4faf1702706b289521329243e5802cf86e54c7f7
Author: Lorcan <email address hidden>
Date: Tue Sep 2 18:12:18 2014 +0100

    Add "--no-overlap" option to swift-dispersion populate

    This change allows the user to use a "--no-overlap" parameter w...


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.openstack.org/121422
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=20e9ad538b6ba90674a75310d9cc184162ba9398
Submitter: Jenkins
Branch: master

commit 20e9ad538b6ba90674a75310d9cc184162ba9398
Author: Christian Schwede <email address hidden>
Date: Sun Sep 14 20:48:32 2014 +0000

    Limit partition movement when adding a new tier

    When adding a new tier (region, zone, node, device) to an existing,
    already balanced ring all existing partitions in the existing tiers of
    the same level are gathered for reassigning, even when there is
    not enough space in the new tier. This will create a lot of unnecessary
    replication traffic in the backend network.

    For example, when only one region exists in the ring and a new region is
    added, all existing parts are selected to reassign, even when the new
    region has a total weight of 0. Same for zones, nodes and devices.

    This patch limits the number of partitions that are chosen to reassign
    by checking for devices on other tiers that are asking for more
    partitions.

    Failed devices are not considered when applying the limit.

    Co-Authored By: Florent Flament <email address hidden>
    Change-Id: I6178452e47492da4677a8ffe4fb24917b5968cd9
    Closes-Bug: 1367826
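A toy sketch of the limiting idea (names are illustrative, not Swift's internals): only gather as many partitions for reassignment as devices in other tiers are actually asking for.

```python
def partitions_to_gather(parts_wanted):
    """parts_wanted: device id -> how many more partitions that device
    wants (negative means it already has too many). Only positive
    demand should trigger gathering partitions from existing devices."""
    return sum(max(0, wanted) for wanted in parts_wanted.values())

# One small device in a new region asks for 49152 partitions; the old
# devices want none. At most 49152 partitions should be gathered,
# instead of all 262144 as before the fix.
demand = {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 49152}
print(partitions_to_gather(demand))  # 49152
```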

Changed in swift:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in swift:
milestone: none → 2.2.0-rc1
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (feature/ec)

Fix proposed to branch: feature/ec
Review: https://review.openstack.org/126595

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (feature/ec)

Reviewed: https://review.openstack.org/126595
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=06800cbe446ce4c937a57b69517b55c3bba9b6e1
Submitter: Jenkins
Branch: feature/ec

commit 7528f2b22169e90fe8ddd19b7ef7d46ecff5d231
Author: Christian Schwede <email address hidden>
Date: Mon Oct 6 10:01:03 2014 +0000

    Fix minor typo

    Fixes minor typo in one method and adds missing parameter in other
    method. Only checked swift/container/reconciler.py for now.

    Change-Id: I5c648010f09b6e4b1fb0380bc97b266e680602f8

commit 94fd95ba30c72fbcb03367aaa8da407a408948d5
Author: OpenStack Proposal Bot <email address hidden>
Date: Sat Oct 4 06:07:47 2014 +0000

    Imported Translations from Transifex

    Change-Id: I31b5e6b0f2922150902e1bfa52144302ee0c7a8e

commit d6a827792619f3343af07fc2519f4253fbdc67f7
Author: John Dickinson <email address hidden>
Date: Fri Oct 3 10:17:00 2014 -0400

    updated AUTHORS and CHANGELOG for 2.2.0

    Change-Id: I6c0bc1570f6a48439de5a029a86f1b582f30f8a6

commit 5b2c27a5874c2b5b0a333e4955b03544f6a8119f
Author: Richard (Rick) Hawkins <email address hidden>
Date: Wed Oct 1 09:37:47 2014 -0400

    Fix metadata overall limits bug

    Currently metadata limits are checked on a per request basis. If
    multiple requests are sent within the per request limits, it is
    possible to exceed the overall limits. This patch adds an overall
    metadata check to ensure that multiple requests to add metadata to
    an account/container will check overall limits before adding
    the additional metadata.

    Change-Id: Ib9401a4ee05a9cb737939541bd9b84e8dc239c70
    Closes-Bug: 1365350

commit 301a96f664d58b4ccad8e3cbf5d5a889cc76790f
Author: Jay S. Bryant <email address hidden>
Date: Tue Sep 30 15:08:59 2014 -0500

    Ensure sys.exit called in fork_child after exception

    Currently, the fork_child() function in auditor.py does not
    handle the case where run_audit() encounters an exception
    properly.

    A simple case is where the /srv directory is set
    with permissions such that the 'swift' user cannot access it.
    Such a situation causes a os.listdir() to return an OSError
    exception. When this happens the fork_child() process does not
    run to completion and sys.exit() is not executed. The process
    that was forked off continues to run as a result. Execution goes
    back up to the audit_loop function which restarts the whole process. The
    end result is an increasing number of processes on the system
    until the parent is terminated. This can quickly exhaust the
    process descriptors on a system.

    This change wraps run_audit() in a try block and adds an
    exception handler that prints what exception was encountered.
    The sys.exit() was moved to a finally: block so that it will
    always be run, avoiding the creation of zombies.

    Change-Id: I89d7cd27112445893852e62df857c3d5262c27b3
    Closes-bug: 1375348

commit 6d49cc3092168de6d22378557b2c37ea4063beeb
Author: Samuel Merritt <email address hidden>
Date: Thu Oct 2 17:14:58 2014 -0400

    Fix ring-builder crash.

    If you adjust ...

Thierry Carrez (ttx)
Changed in swift:
milestone: 2.2.0-rc1 → 2.2.0