some parts replicas assigned to duplicate devices in the ring

Bug #1452431 reported by clayg
This bug affects 2 people
Affects: OpenStack Object Storage (swift)
Status: Fix Released
Importance: High
Assigned to: Samuel Merritt

Bug Description

With rings whose replica count is constrained by the number of devices - when the weights of the available devices are not well distributed - the RingBuilder will sometimes place more than one replica of a partition on the same device.

This is more likely with EC rings, where the replica count tends to approach the device count.

You can work around the issue if you set overload very high before the initial balance. After a rebalance it can take some time for overload to pull the duplicated replicas off the hungry devices.
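For illustration, here is a minimal sketch (not from the bug report) of the reported setup and that workaround using the Python RingBuilder API; it assumes a Swift development install so swift.common.ring.RingBuilder is importable, and the overload value of 10.0 is just an arbitrary stand-in for "very high":

    from swift.common.ring import RingBuilder

    # 2**10 = 1024 partitions, 14 replicas, min_part_hours = 0, matching the
    # example builder output below
    builder = RingBuilder(10, 14, 0)
    for i in range(15):
        builder.add_dev({'id': i, 'region': 1, 'zone': 1,
                         'ip': '127.0.0.1', 'port': 6000 + i,
                         'device': 'd%d' % i,
                         'weight': 2.0 if i < 10 else 1.0})

    # Workaround: raise overload *before* the initial rebalance so the lighter
    # devices are allowed to take more than their weight-fair share of parts
    # instead of a heavy device taking two replicas of the same part.
    builder.overload = 10.0   # assumed equivalent of swift-ring-builder's set_overload
    builder.rebalance()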

Here's an example ring builder output that demonstrates the issue:

ec-test.builder, build version 15
1024 partitions, 14.000000 replicas, 1 regions, 1 zones, 15 devices, 0.10 balance, 29.98 dispersion
The minimum number of hours before a partition can be reassigned is 0
The overload factor is 0.00% (0.000000)
Devices: id region zone ip address port replication ip replication port name weight partitions balance meta
             0 1 1 127.0.0.1 6000 127.0.0.1 6000 d0 2.00 1146 -0.08
             1 1 1 127.0.0.1 6001 127.0.0.1 6001 d1 2.00 1147 0.01
             2 1 1 127.0.0.1 6002 127.0.0.1 6002 d2 2.00 1147 0.01
             3 1 1 127.0.0.1 6003 127.0.0.1 6003 d3 2.00 1146 -0.08
             4 1 1 127.0.0.1 6004 127.0.0.1 6004 d4 2.00 1146 -0.08
             5 1 1 127.0.0.1 6005 127.0.0.1 6005 d5 2.00 1146 -0.08
             6 1 1 127.0.0.1 6006 127.0.0.1 6006 d6 2.00 1147 0.01
             7 1 1 127.0.0.1 6007 127.0.0.1 6007 d7 2.00 1147 0.01
             8 1 1 127.0.0.1 6008 127.0.0.1 6008 d8 2.00 1147 0.01
             9 1 1 127.0.0.1 6009 127.0.0.1 6009 d9 2.00 1147 0.01
            10 1 1 127.0.0.1 6010 127.0.0.1 6010 d10 1.00 574 0.10
            11 1 1 127.0.0.1 6011 127.0.0.1 6011 d11 1.00 574 0.10
            12 1 1 127.0.0.1 6012 127.0.0.1 6012 d12 1.00 574 0.10
            13 1 1 127.0.0.1 6013 127.0.0.1 6013 d13 1.00 574 0.10
            14 1 1 127.0.0.1 6014 127.0.0.1 6014 d14 1.00 574 0.10

15 devices and 14 replicas, but some of the devices have half the weight. Towards the end of partition placement the hungry devices get assigned multiple replicas of the same part. Note that the partition counts still add up to replicas * parts (4 x 1146 + 6 x 1147 + 5 x 574 = 14336 = 14 x 1024) - because a device holding multiple replicas of the same part happily increments its part count for each of them.

I've attached a script that I think would be a good addition to the RingBuilder's post-rebalance validate method after we decide how we want to fix it.
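The attached script isn't reproduced here, but a rough sketch of the kind of check it performs might look like the following: given a replica-to-partition-to-device table shaped like the builder's internal _replica2part2dev (one row per replica, each row indexed by partition and holding device ids), report every partition with more than one replica on the same device. The function name and the attribute access in the trailing comment are illustrative assumptions, not the attached script:

    from collections import Counter

    def find_duplicate_part_replicas(replica2part2dev):
        """Map part -> list of device ids holding more than one replica of it."""
        duplicates = {}
        num_parts = len(replica2part2dev[0])
        for part in range(num_parts):
            # the last row can be shorter when the replica count is fractional
            devs = [row[part] for row in replica2part2dev if part < len(row)]
            repeated = [dev for dev, count in Counter(devs).items() if count > 1]
            if repeated:
                duplicates[part] = repeated
        return duplicates

    # e.g. find_duplicate_part_replicas(builder._replica2part2dev) after a
    # rebalance; an empty dict means no device holds two replicas of one part.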

Revision history for this message
clayg (clay-gerrard) wrote :
clayg (clay-gerrard)
summary: - some parts replicas assigned to duplicate devices
+ some parts replicas assigned to duplicate devices in the ring
tags: added: ec
Revision history for this message
clayg (clay-gerrard) wrote :

@torgomatic should be able to confirm this bug

Revision history for this message
Samuel Merritt (torgomatic) wrote :

Yeah, this'll happen.

Note that fixing it means you can't have an N-replica ring with fewer than N devices. That's probably not important any more now that you can change a builder's replica count.
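For illustration, a hedged sketch of that alternative with the RingBuilder API: shrink the builder's replica count to fit the devices you actually have via set_replicas, rather than keeping an N-replica ring on fewer than N devices. The part power, device details, and target replica count below are made-up example values:

    from swift.common.ring import RingBuilder

    builder = RingBuilder(10, 14, 1)      # builder created asking for 14 replicas
    for i in range(5):                    # ...but only 5 devices are ever added
        builder.add_dev({'id': i, 'region': 1, 'zone': 1,
                         'ip': '127.0.0.1', 'port': 6000 + i,
                         'device': 'sdb%d' % i, 'weight': 100})
    builder.set_replicas(5)               # explicitly acknowledge the real replica count
    builder.rebalance()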

Changed in swift:
status: New → Confirmed
Changed in swift:
assignee: nobody → Samuel Merritt (torgomatic)
Changed in swift:
importance: Undecided → Medium
Revision history for this message
paul luse (paul-e-luse) wrote :

Sam, are you going to knock this one out?

Revision history for this message
Samuel Merritt (torgomatic) wrote :

I don't know about "knock out"; I've gone six rounds with it and it's winning.

tags: removed: ec
Revision history for this message
clayg (clay-gerrard) wrote :

I don't know if we'll "fix" this anytime soon - but we have got to stop letting people deploy rings like this. At a minimum, add the validator so we can fail early and fail loudly.

Changed in swift:
importance: Medium → Critical
Revision history for this message
clayg (clay-gerrard) wrote :

I think if we can get the validator in we can reduce the importance of this bug to high.

Revision history for this message
John Dickinson (notmyname) wrote :

With patch https://review.openstack.org/#/c/222799/ we can lower this priority.

Changed in swift:
importance: Critical → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to swift (master)

Reviewed: https://review.openstack.org/222799
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=5070869ac0e6a2d577dd4054ffbcbffd06db3c5b
Submitter: Jenkins
Branch: master

commit 5070869ac0e6a2d577dd4054ffbcbffd06db3c5b
Author: Clay Gerrard <email address hidden>
Date: Fri Sep 11 16:24:52 2015 -0700

    Validate against duplicate device part replica assignment

    We should never assign multiple replicas of the same partition to the
    same device - our on-disk layout can only support a single replica of a
    given part on a single device. We should not do this, so we validate
    against it and raise a loud warning if this terrible state is ever
    observed after a rebalance.

    Unfortunately there are currently a couple of not-uncommon scenarios
    which will trigger this observed state today:

     1. If we have fewer devices than replicas
     2. If a server's or zone's aggregate device weight makes it the most
        appropriate candidate for multiple replicas and you're a bit unlucky

    Fixing #1 would be easy: we should just not allow that state anymore.
    Really we never did - if you have a 3-replica ring with one device, you
    have one replica. Everything that iter_nodes'd would de-dupe. We
    should just be insisting that you explicitly acknowledge your replica
    count with set_replicas.

    I have been lost in the abyss for days searching for a general solution
    to #2. I'm sure it exists, but I will not have wrestled it to
    submission by RC1. In the meantime we can eliminate a great deal of the
    luck required simply by refusing to place more than one replica of a
    part on a device in assign_parts.

    The meat of the change is a small update to the .validate method in
    RingBuilder. It basically unrolls a pre-existing (part, replica) loop
    so that all the replicas of the part come out in order so that we can
    build up the set of dev_id's for which all the replicas of a given part
    are assigned part-by-part.

    If we observe any duplicates - we raise a warning.

    To clean the cobwebs out of the rest of the corner cases we're going to
    delay get_required_overload from kicking in until we achieve dispersion,
    and a small check was added when selecting a device subtier to validate
    if it's already being used - picking any other device in the tier works
    out much better. If no other devices are available in the tier - we
    raise a warning. A more elegant or optimized solution may exist.

    Many unit tests did not meet criterion #1, but the fix was
    straightforward once identified by the pigeonhole check.

    However, many more tests were affected by #2 - but again the fix turned
    out to be simply adding more devices. The fantasy that all failure domains
    contain at least replica-count devices is prevalent in both our ring
    placement algorithm and its tests. These tests were trying to
    demonstrate some complex characteristics of our ring placement algorithm
    and I believe we just got a bit too carried away trying to find the
    simplest possible example to demonstrate the...


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to swift (feature/crypto)

Related fix proposed to branch: feature/crypto
Review: https://review.openstack.org/234065

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to swift (feature/crypto)

Reviewed: https://review.openstack.org/234065
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=c80229fd86853f5f8541aeef4b5044170572640d
Submitter: Jenkins
Branch: feature/crypto

commit 9cafa472a336f66d149a20c12f4251703d96f04d
Author: Ondřej Nový <email address hidden>
Date: Sat Oct 10 20:57:07 2015 +0200

    Autodetect systemctl in SAIO and use it on systemd distros

    Change-Id: I84a9b27baac89327749d8774032860f8ad5166f2

commit 92767f28d668643bc2affee7b2fd46fd9349656a
Author: Emile Snyder <email address hidden>
Date: Sun Oct 11 21:24:54 2015 -0700

    Fix 'swift-ring-builder write_builder' after you remove a device

    clayg already posted the code fix in the bug, but noted it needs a test.

    Closes-Bug: #1487280
    Change-Id: I07317754afac7165baac4e696f07daeba2e72adc

commit a48649002970b2150d24d0622a100f54045443c5
Author: Lisak, Peter <email address hidden>
Date: Mon Oct 12 14:42:01 2015 +0200

    swift-recon fails with socket.timeout exception

    If some server is overloaded or the timeout is set too low, swift-recon
    fails with a raised socket.timeout exception.

    This error should be processed the same way as HTTPError/URLError.

    Change-Id: Ide8843977ab224fa866097d0f0b765d6899c66b8

commit 767fac8186ea4541f4466ac9a55c03abea6a878b
Author: Christian Schwede <email address hidden>
Date: Mon Oct 12 07:09:00 2015 +0000

    Enable H234 check (assertEquals is deprecated, use assertEqual)

    All usages of assertEquals and assertNotEquals are fixed now, so let's enable
    the H234 check to avoid regressions in the future.

    Change-Id: I2c2ccb3b268cf9eb11f2db045378ab125a02bc31

commit 1882801be1d8983cd718786bd409cf09f65a00b0
Author: janonymous <email address hidden>
Date: Mon Aug 31 21:49:49 2015 +0530

    pep8 fix: assertNotEquals -> assertNotEqual

    assertNotEquals is deprecated in py3

    Change-Id: Ib611351987bed1199fb8f73a750955a61d022d0a

commit f5f9d791b0b8b32350bd9a47fbc00ff86a65f09d
Author: janonymous <email address hidden>
Date: Wed Aug 5 23:58:14 2015 +0530

    pep8 fix: assertEquals -> assertEqual

    assertEquals is deprecated in py3, replacing it.

    Change-Id: Ida206abbb13c320095bb9e3b25a2b66cc31bfba8
    Co-Authored-By: Ondřej Nový <email address hidden>

commit 1ba7641c794104de57e5010f76cecbf146a2a63b
Author: Zack M. Davis <email address hidden>
Date: Thu Oct 8 16:16:18 2015 -0700

    minutæ: port ClientException tweaks from swiftclient; dict .pop

    openstack/python-swiftclient@5ae4b423 changed python-swiftclient's
    ClientException to have its http_status attribute default to
    None (rather than 0) and to use super in its __init__ method. For
    consistency's sake, it's nice for Swift's inlined copy of
    ClientException to receive the same patch. Also, the retry function in
    direct_client (a major user of ClientException) was using a somewhat
    awkward conditional-assignment-and-delete construction where the .pop
    method of dictionaries would be more idiomatic.

    Change-Id: I70a12f934f84f57549617af28b86f7f5637bd8fa

commit 01f9d15045129d09...

tags: added: in-feature-crypto
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to swift (feature/hummingbird)

Related fix proposed to branch: feature/hummingbird
Review: https://review.openstack.org/236162

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to swift (feature/hummingbird)

Reviewed: https://review.openstack.org/236162
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=18ddcaf0d6b67fcbb6b0a4cf4a9a99c72f6f3a08
Submitter: Jenkins
Branch: feature/hummingbird

commit a9ddc2d9ea402eaac7ccd8992387f77855968ab5
Author: Mahati Chamarthy <email address hidden>
Date: Fri Oct 16 18:18:33 2015 +0530

    Hyperlink fix to first contribution doc

    Change-Id: I19fc1abc89f888233b80a57c68a152c1c1758640

commit 83a1151d13e096b480aefe6ec18259f2d7d021db
Author: Pete Zaitcev <email address hidden>
Date: Fri Oct 9 16:45:20 2015 -0600

    Interpolate the explanation string not whole HTML body

    The only reason this exists is that I promised to do it.
    But in our case, there's no big advantage, and here's why.

    The general thinking goes that strings must be interpolated
    first because the body may contain a syntax that confuses the
    interpolation. So this patch makes the code "more correct".
    However, our HTML template is tightly controlled. It's not
    like it contains additional percents.

    So I'll just leave this here for now while I'm asking if
    the content type is set correctly.

    Change-Id: Ia18aeb0f94ef389f8b95450986a566e5fa06aa10

commit 384b91eb824376659989b904f9396cbf2e02d2bd
Author: asettle <email address hidden>
Date: Thu Sep 3 15:11:46 2015 +1000

    Moving DLO functionality doc to the middleware code

    This change moves the RST DLO documentation from
    statically inside overview_large_objects.rst and moves it
    to middleware/dlo.py.
    This is where all middleware RST documentation is defined.

    The overview_large_objects.rst is still the main page
    for information on large objects, so now dynamically
    points to both the DLO and SLO middleware RST
    documentation and the relevant middleware.rst page
    simply points to it.

    Change-Id: I40d918c8b7bc608ab945805d69fe359521df038a
    Closes-bug: #1276852

commit 2996974e5d48b4efaa1b271b8fbd0387bced7242
Author: Ondřej Nový <email address hidden>
Date: Sat Oct 10 14:56:30 2015 +0200

    Script for running unit, func and probe tests at once

    When developing Swift it's often needed to run all tests.
    This script makes it much simpler.

    Change-Id: I67e6f7cc05ebd0475001c1b56e8f6fd09c8c644f

commit c2182fd4163050a5f76eb3dedb7703dc821fa83d
Author: janonymous <email address hidden>
Date: Fri Jul 17 20:20:15 2015 +0530

    Python3: do not use im_self/im_func/func_closure

    Use __self__, __func__ and __closure__ instead, as they work
    with both Python 2 and 3.

    Modifying usage of __func__ in codebase.

    Change-Id: I57e907c28c1d4646605e70194ea3650806730b83

commit c0866ceaac2f69ae01345a795520141f59ec64f5
Author: Samuel Merritt <email address hidden>
Date: Fri Sep 25 17:26:37 2015 -0700

    Improve SLO PUT error checking

    This commit tries to give the user a reason that their SLO manifest
    was invalid instead of just saying "Invalid SLO Manifest File". It
    doesn't get every error condition, but it's better than before.

    Examples of things that now have real error...

tags: added: in-feature-hummingbird
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.openstack.org/241571
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=7035639dfd239b52d4ed46aae50f78d16ec8cbfe
Submitter: Jenkins
Branch: master

commit 7035639dfd239b52d4ed46aae50f78d16ec8cbfe
Author: Clay Gerrard <email address hidden>
Date: Thu Oct 15 16:20:58 2015 -0700

    Put part-replicas where they go

    It's harder than it sounds. There were really three challenges.

    Challenge #1 Initial Assignment
    ===============================

    Before starting to assign parts on this new shiny ring you've
    constructed, maybe we'll pause for a moment up front and consider the
    lay of the land. This process is called the replica_plan.

    The replica_plan approach separates part assignment failures into
    two modes:

     1) we considered the cluster topology and its weights and came up with
        the wrong plan

     2) we failed to execute on the plan

    I failed at both parts plenty of times before I got it this close. I'm
    sure a counter example still exists, but when we find it the new helper
    methods will let us reason about where things went wrong.

    Challenge #2 Fixing Placement
    =============================

    With a sound plan in hand, it's much easier to fail to execute on it the
    less material you have to execute with - so we gather up as many parts
    as we can - as long as we think we can find them a better home.

    Picking the right parts for gather is a black art - when you notice a
    balance is slow it's because it's spending so much time iterating over
    replica2part2dev trying to decide just the right parts to gather.

    The replica plan can help at least in the gross dispersion collection to
    gather up the worst offenders first before considering balance. I think
    trying to avoid picking up parts that are stuck to the tier before
    falling into a forced grab on anything over parts_wanted helps with
    stability generally - but depending on where the parts_wanted are in
    relation to the full devices it's pretty easy pick up something that'll
    end up really close to where it started.

    I tried to break the gather methods into smaller pieces so it looked
    like I knew what I was doing.

    Going with a MAXIMUM gather iteration instead of balance (which doesn't
    reflect the replica_plan) doesn't seem to be costing me anything - most
    of the time the exit condition is either solved or all the parts are overly
    aggressively locked up on min_part_hours. So far, it mostly seems that if
    the thing is going to balance this round it'll get it in the first
    couple of shakes.

    Challenge #3 Crazy replica2part2dev tables
    ==========================================

    I think there are lots of ways "scars" can build up in a ring which can
    result in very particular replica2part2dev tables that are physically
    difficult to dig out of. It's repairing these scars that will take
    multiple rebalances to resolve.

    ... but at this point ...

    ... lacking a counter example ...

    I've been able to close up all the edg...


Changed in swift:
status: Confirmed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (feature/crypto)

Fix proposed to branch: feature/crypto
Review: https://review.openstack.org/257416

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (feature/crypto)

Reviewed: https://review.openstack.org/257416
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=b38681f64613ff976b8f1137fad0a7dd28e27410
Submitter: Jenkins
Branch: feature/crypto

commit 52b534c9d13efde15841402dc86c99685334250f
Author: hgangwx <email address hidden>
Date: Fri Dec 11 17:31:36 2015 +0800

    Fix some inconsistency in docstrings

    Added colon after ":returns" according to:
    http://docs.openstack.org/developer/hacking

     blank lines are added before the param lines

    Change-Id: I008bea270c6a4b7c80d1a84a2cb9fcc9b5a7264a

commit 19c7dbc0ba49ff7e1f59190a7ead9ae9d3d80238
Author: Mahati Chamarthy <email address hidden>
Date: Fri Dec 11 12:27:18 2015 +0530

    Update versioned_writes doc

    Change-Id: Ibe53c79cf49330332112001c02a2a6b078764130

commit 70bc3d1a3a240c35eba934010fdef99a775b8715
Author: John Dickinson <email address hidden>
Date: Tue Dec 8 16:08:04 2015 -0800

    added a few ruby clients to associated projects

    Change-Id: I4764ba505646949ff694f8939947f83c6940128a

commit 7035639dfd239b52d4ed46aae50f78d16ec8cbfe
Author: Clay Gerrard <email address hidden>
Date: Thu Oct 15 16:20:58 2015 -0700

    Put part-replicas where they go

    It's harder than it sounds. There were really three challenges.

    Challenge #1 Initial Assignment
    ===============================

    Before starting to assign parts on this new shiny ring you've
    constructed, maybe we'll pause for a moment up front and consider the
    lay of the land. This process is called the replica_plan.

    The replica_plan approach separates part assignment failures into
    two modes:

     1) we considered the cluster topology and its weights and came up with
        the wrong plan

     2) we failed to execute on the plan

    I failed at both parts plenty of times before I got it this close. I'm
    sure a counter example still exists, but when we find it the new helper
    methods will let us reason about where things went wrong.

    Challenge #2 Fixing Placement
    =============================

    With a sound plan in hand, it's much easier to fail to execute on it the
    less material you have to execute with - so we gather up as many parts
    as we can - as long as we think we can find them a better home.

    Picking the right parts for gather is a black art - when you notice a
    balance is slow it's because it's spending so much time iterating over
    replica2part2dev trying to decide just the right parts to gather.

    The replica plan can help at least in the gross dispersion collection to
    gather up the worst offenders first before considering balance. I think
    trying to avoid picking up parts that are stuck to the tier before
    falling into a forced grab on anything over parts_wanted helps with
    stability generally - but depending on where the parts_wanted are in
    relation to the full devices it's pretty easy pick up something that'll
    end up really close to where it started.

    I tried to break the gather methods into smaller pieces so it looked
    like I knew what I was doing.

    Going wi...


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (feature/hummingbird)

Fix proposed to branch: feature/hummingbird
Review: https://review.openstack.org/264517

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (feature/hummingbird)

Reviewed: https://review.openstack.org/264517
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=93ddaffaebb620bc712d52d0267194562dce02be
Submitter: Jenkins
Branch: feature/hummingbird

commit f53cf1043d078451c4b9957027bf3af378aa0166
Author: Ondřej Nový <email address hidden>
Date: Tue Jan 5 20:20:15 2016 +0100

    Fixed few misspellings in comments

    Change-Id: I8479c85cb8821c48b5da197cac37c80e5c1c7f05

commit 79222e327f9df6335b58e17a6c8dd0dc44b86c17
Author: ChangBo Guo(gcb) <email address hidden>
Date: Sat Dec 26 13:13:37 2015 +0800

    Fix AttributeError for LogAdapter

    LogAdapter object has no attribute 'warn' but has attribute
    'warning'.

    Closes-Bug: #1529321
    Change-Id: I0e0bd0a3dbc4bb5c1f0b343a8809e53491a1da5f

commit 684c4c04592278a280032002b5313b171ee7a4c0
Author: janonymous <email address hidden>
Date: Sun Aug 2 22:47:42 2015 +0530

    Python 3 deprecated the logger.warn method in favor of warning

    DeprecationWarning: The 'warn' method is deprecated, use 'warning'
    instead

    Change-Id: I35df44374c4521b1f06be7a96c0b873e8c3674d8

commit d0a026fcb8e8a9f5475699cc56e1998bdc4cd5ca
Author: Hisashi Osanai <email address hidden>
Date: Wed Dec 16 18:50:37 2015 +0900

    Fix duplication for headers in Access-Control-Expose-Headers

    There are following problems with Access-Control-Expose-Headers.

    * If headers in X-Container-Meta-Access-Control-Expose-Headers are
      configured, the headers are kept as case-sensitive strings.
      When a CORS request comes, the headers are merged into
      Access-Control-Expose-Headers as case-sensitive strings even if
      the same header is already present with different case.

    * Access-Control-Expose-Headers is handled as a list.
      If X-Container/Object-Meta-XXX is configured on the container/object
      and in X-Container-Meta-Access-Control-Expose-Headers, the same header
      is listed more than once in Access-Control-Expose-Headers.

    This patch provides a fix for the problems.

    Change-Id: Ifc1c14eb3833ec6a851631cfc23008648463bd81

commit 0bcd7fd50ec0763dcb366dbf43a9696ca3806f15
Author: Bill Huber <email address hidden>
Date: Fri Nov 20 12:09:26 2015 -0600

    Update Erasure Coding Overview doc to remove Beta version

    The major functionality of EC has been released for Liberty and
    the beta version of the code has been removed since it is now
    in production.

    Change-Id: If60712045fb1af803093d6753fcd60434e637772

commit 84ba24a75640be4212e0f984c284faf4c894e7c6
Author: Alistair Coles <email address hidden>
Date: Fri Dec 18 11:24:34 2015 +0000

    Fix rst errors so that html docs are complete

    rst table format errors don't break the gate job
    but do cause sections of the documents to go missing
    from the html output.

    Change-Id: Ic8c9953c93d03dcdafd8f47b271d276c7b356dc3

commit 87f7e907ee412f5847f1f9ffca7a566fb148c6b1
Author: Matthew Oliver <email address hidden>
Date: Wed Dec 16 17:19:24 2015 +1100

    Pass HTTP_REFERER down to subrequests

    Currently a HTTP_REFERER (Referer) header isn't passed down to
    subreq...

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/swift 2.6.0

This issue was fixed in the openstack/swift 2.6.0 release.

