Proxy server ignores additional fragments on primary nodes

Bug #1484598 reported by clayg
Affects: OpenStack Object Storage (swift)
Status: Fix Released
Importance: High
Assigned to: paul luse

Bug Description

You can't get multiple fragments on the same node unless you rebalance the ring.

During a rebalance, it's possible that a node holding a fragment is not holding the correct fragment index according to its current position in the ring.

If this happens, the node that *is* holding the fragment index that belongs on that primary will push it over, and the primary will end up holding multiple fragments.

However, the proxy will never consider multiple fragments from the same node. It's even possible that a fragment could be reconstructed onto a primary node and the proxy could see duplicates of the same index.

None of these extremely rare and not well understood failure conditions are handled gracefully in the proxy. We've been able to concoct some contrived failure modes that we know would result in 503s and 404s in the proxy.
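As a purely illustrative sketch (this is not Swift's proxy code; the helper name and the 4+2 scheme below are made up), the following shows why counting responding nodes is not the same as counting unique fragment indexes, which is the gap described above:

    # Hypothetical illustration: why one-response-per-node accounting can be
    # fooled when fragments land on the "wrong" nodes after a rebalance.

    def have_enough_fragments(responses, ec_ndata):
        """responses: list of (node_id, frag_index) pairs the proxy got back.

        Returns (naive_check, unique_index_check). The naive check only
        counts responses; the second counts distinct fragment indexes,
        which is what EC decoding actually needs.
        """
        naive = len(responses) >= ec_ndata
        unique = len({frag_index for _, frag_index in responses}) >= ec_ndata
        return naive, unique

    # Example: a 4+2 scheme needs 4 distinct fragments to decode, but node 3
    # answered with the same fragment index as node 1 after a rebalance.
    responses = [(1, 0), (2, 1), (3, 0), (4, 2)]
    print(have_enough_fragments(responses, ec_ndata=4))  # (True, False)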

paul luse (paul-e-luse)
Changed in swift:
assignee: nobody → paul luse (paul-e-luse)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (master)

Fix proposed to branch: master
Review: https://review.openstack.org/213147

Changed in swift:
status: New → In Progress
Changed in swift:
importance: Undecided → Critical
importance: Critical → High
Changed in swift:
importance: High → Critical
Changed in swift:
importance: Critical → High
Revision history for this message
Alistair Coles (alistair-coles) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on swift (master)

Change abandoned by paul luse (<email address hidden>) on branch: master
Review: https://review.openstack.org/213147
Reason: capability has since been rolled into another patch

Revision history for this message
clayg (clay-gerrard) wrote :

I disagree with the priority of this bug. This is a low priority for me.

IMHO there is no way to code up a *reasonable* probe test to demonstrate the described availability failure. Without a plausible scenario in context it's hard to really decide on the priority.

In reality, at worst you would expect only a single node to end up holding two fragments because of a rebalance [1]. It's only during this temporary rebalance window that having one fewer node available than normal could potentially impact the ability to service reads.

But that's the same as replicated today.

If you normally have two primaries that can service a read, during a rebalance you only have one; if that one is down, you have none.

Moreover, the situation where a rebalance results in a primary node holding two frags is the extremely uncommon case! Generally speaking you would expect the extra frag to be floating around on some old handoff node in the cluster with *no* chance of being found! We're optimizing in the wrong place! Bigger Fish!

[1] unless you're rebalancing faster than your reconstructor cycle time

Revision history for this message
clayg (clay-gerrard) wrote :

So there are two cases we might consider covered by this bug, and we might have to split them up because they have different consequences and likelihood.

One way to have "multiple fragments" is a rebalance - this is multiple fragments *for the same timestamp*.

Another way to consider "multiple fragments" would be multiple data files with the *same* fragment index and different timestamps. This happens when you have non-durable fragments piled over a durable fragment:

https://gist.github.com/clayg/1f2590ea4608a13823865e27a6666c2a
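A minimal sketch of the distinction (a hypothetical helper, not taken from the gist above): grouping what one node holds by timestamp and by fragment index separates the rebalance case from the non-durable-over-durable case.

    # Hypothetical classification of the two kinds of "multiple fragments".
    from collections import defaultdict

    def classify(frags):
        """frags: list of (timestamp, frag_index) pairs found on one node."""
        by_timestamp = defaultdict(set)
        by_index = defaultdict(set)
        for ts, index in frags:
            by_timestamp[ts].add(index)
            by_index[index].add(ts)

        # Several indexes at one timestamp -> fragments pushed here by a rebalance.
        rebalance_case = any(len(indexes) > 1 for indexes in by_timestamp.values())
        # Same index at several timestamps -> non-durable frags over a durable one.
        non_durable_case = any(len(timestamps) > 1 for timestamps in by_index.values())
        return rebalance_case, non_durable_case

    print(classify([("t1", 0), ("t1", 3)]))  # (True, False)  rebalance case
    print(classify([("t1", 0), ("t2", 0)]))  # (False, True)  non-durable case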

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.openstack.org/215276
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=44a861787a60346aded8be2a54ef4e16accccbb6
Submitter: Jenkins
Branch: master

commit 44a861787a60346aded8be2a54ef4e16accccbb6
Author: Alistair Coles <email address hidden>
Date: Mon Oct 5 17:40:06 2015 +0100

    Enable object server to return non-durable data

    This patch improves EC GET response handling:

    - The proxy no longer requires all object servers to have a
      durable file for the fragment archive that they return in
      response to a GET. The proxy will now be satisfied if just
      one object server has a durable file at the same timestamp
      as fragments from other object servers.

      This means that the proxy can now successfully GET an
      object that had missing durable files when it was PUT.

    - The proxy will now ensure that it has a quorum of *unique*
      fragment indexes from object servers before considering a
      GET to be successful.

    - The proxy is now able to fetch multiple fragment archives
      having different indexes from the same node. This enables
      the proxy to successfully GET an object that has some
      fragments that have landed on the same node, for example
      after a rebalance.

    This new behavior is facilitated by an exchange of new
    headers on a GET request and response between the proxy and
    object servers.

    An object server now includes with a GET (or HEAD) response:

    - X-Backend-Fragments: the value of this describes all
      fragment archive indexes that the server has for the
      object by encoding a map of the form: timestamp -> <list
      of fragment indexes>

    - X-Backend-Durable-Timestamp: the value of this is the
      internal form of the timestamp of the newest durable file
      that was found, if any.

    - X-Backend-Data-Timestamp: the value of this is the
      internal form of the timestamp of the data file that was
      used to construct the diskfile.

    A proxy server now includes with a GET request:

    - X-Backend-Fragment-Preferences: the value of this
      describes the proxy's current preference with respect to
      those fragments that it would have object servers
      return. It encodes a list of timestamp, and for each
      timestamp a list of fragment indexes that the proxy does
      NOT require (because it already has them).

      The presence of a X-Backend-Fragment-Preferences header
      (even one with an empty list as its value) will cause the
      object server to search for the most appropriate fragment
      to return, disregarding the existence or not of any
      durable file. The object server assumes that the proxy
      knows best.

    Closes-Bug: 1469094
    Closes-Bug: 1484598

    Change-Id: I2310981fd1c4622ff5d1a739cbcc59637ffe3fc3
    Co-Authored-By: Paul Luse <email address hidden>
    Co-Authored-By: Clay Gerrard <email address hidden>
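
As a rough illustration of the header exchange the commit message describes (the JSON encoding, timestamps, and variable names below are assumptions for readability, not the patch's actual wire format):

    # Illustrative sketch only -- JSON is assumed here for clarity.
    import json

    # Object server side: advertise every fragment index it holds, per timestamp,
    # plus the newest durable timestamp and the timestamp of the data file used.
    frags_on_disk = {"1444061893.22222": [2, 5], "1444061880.11111": [2]}
    response_headers = {
        "X-Backend-Fragments": json.dumps(frags_on_disk),
        "X-Backend-Durable-Timestamp": "1444061880.11111",
        "X-Backend-Data-Timestamp": "1444061893.22222",
    }

    # Proxy side: list, per timestamp, the fragment indexes it does NOT need
    # because it already has them.
    preferences = [{"timestamp": "1444061893.22222", "exclude": [2]}]
    request_headers = {"X-Backend-Fragment-Preferences": json.dumps(preferences)}

    # An object server receiving the preferences header would pick the most
    # useful fragment it holds (index 5 here), regardless of durable state.
    wanted = json.loads(request_headers["X-Backend-Fragment-Preferences"])
    excluded = set(wanted[0]["exclude"])
    available = set(frags_on_disk[wanted[0]["timestamp"]])
    print(sorted(available - excluded))  # [5]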

Changed in swift:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/swift 2.10.0

This issue was fixed in the openstack/swift 2.10.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (feature/hummingbird)

Fix proposed to branch: feature/hummingbird
Review: https://review.openstack.org/400985

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (feature/hummingbird)

Reviewed: https://review.openstack.org/400985
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=0c3f8f87104af8717115c5badffd243dbaa1c430
Submitter: Jenkins
Branch: feature/hummingbird

commit 2d25fe6ad3573b2a06b6b3e5e66493d7b0c55693
Author: Tim Burke <email address hidden>
Date: Mon Jul 25 15:06:23 2016 -0700

    Reduce backend requests for SLO If-Match / HEAD requests

    ... by storing SLO Etag and size in sysmeta.

    Previously, we had to GET the manifest for every HEAD or conditional
    request to an SLO. Worse, since SLO PUTs require that we HEAD every
    segment, we'd GET all included sub-SLO manifests. This was necessary so
    we could recompute the large object's Etag and content-length.

    Since we already know both of those during PUT, we'll now store it in
    object sysmeta. This allows us to:

     * satisfy HEAD requests based purely off of the manifest's HEAD
       response, and
     * perform the If-(None-)Match comparison on the object server, without
       any additional subrequests.

    Note that the large object content-length can't just be parsed from
    content-type -- with fast-POST enabled, the content-type coming out of
    the object-server won't necessarily include swift_bytes.

    Also note that we must still fall back to GETting the manifest if the
    sysmeta headers were not found. Otherwise, we'd break existing large
    objects.

    Change-Id: Ia6ad32354105515560b005cea750aa64a88c96f9

commit ae7dddd801e28217d7dc46bd45cd6b621f29340c
Author: Ondřej Nový <email address hidden>
Date: Mon Nov 21 22:13:11 2016 +0100

    Added comment for "user" option in drive-audit config

    Change-Id: I24362826bee85ac3304e9b63504c9465da673014

commit c3e1d847f4b9d6cc6212aae4dc1b1e6dff45fb40
Author: Thiago da Silva <email address hidden>
Date: Thu Nov 17 17:17:00 2016 -0500

    breaking down tests.py into smaller pieces

    tests.py is currently at ~5500 lines of code, it's
    time to break it down into smaller files.

    I started with an easy middleware set of tests
    (i.e., versioned writes, ~600 lines of code ) so I can get
    some feedback. There are more complicated tests that cover
    multiple middlewares for example, it is not so clear where
    those should go.

    Change-Id: I2aa6c18ee5b68d0aae73cc6add8cac6fbf7f33da
    Signed-off-by: Thiago da Silva <email address hidden>

commit 5d7a3a4172f0f11ab870252eec784cf24b247dea
Author: Ondřej Nový <email address hidden>
Date: Sat Nov 19 23:24:30 2016 +0100

    Removed "in-process-" from func env tox name

    This shorten shebang in infra, because we are hitting 128 bytes limit.

    Change-Id: I02477d81b836df71780942189d37d616944c4dce

commit 9ea340256996a03c8c744201297b47a0e91fe65b
Author: Kota Tsuyuzaki <email address hidden>
Date: Fri Nov 18 01:50:11 2016 -0800

    Don't overwrite built-in 'id'

    This is a follow up for https://review.openstack.org/#/c/399237

    'id' is assigned as a builtin function so that we should not use 'id'
    for the local variable name.

    Change-Id: Ic27460d49e68f6cd50bda1d5b3810e01ccb07a37

commit bf...

tags: added: in-feature-hummingbird