EC missing durable can prevent reconstruction

Bug #1624088 reported by clayg
Affects: OpenStack Object Storage (swift)
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

If you do not have enough durable [1] fragments available, the reconstructor will fail to rebuild a missing fragment:

object-6020: Unable to get enough responses (2/4) to reconstruct 127.0.0.1:6040/sdb4/892/AUTH_test/ec-test/delete.me policy#1 frag#5 with ETag 0195c438677f2c0acfe05338b1bd1488

Luckily, if a non-durable fragment's neighbor has a durable fragment, it will make the non-durable fragment durable when it syncs. In this manner a durable can be "passed" around to all the nodes until it encounters a missing fragment. When a node is missing a fragment, the reconstructor will attempt reconstruction, which might fail (if there are < ndata durable fragments available), and it will not attempt to mark other nodes as durable.

This behavior allows a non-durable fragment to be "trapped" by adjacent missing fragments - which can cause it to *never* be marked durable by its partners - and therefore the missing fragments *never* get rebuilt - so the durability is *never* repaired.

There's an attached probetest.

1. It doesn't matter if you have *enough* fragments - if they're not durable, their object servers won't give them to the reconstructor.
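
To make the trap concrete, here is a toy simulation (plain Python, not Swift code) of the propagation rules described above, assuming a 4+2 policy (ndata = 4), a ring where durability is only passed between adjacent *present* fragments, and reconstruction that needs at least ndata durable fragments (per [1], non-durable fragments are not served to the reconstructor):

    NDATA = 4

    # Per fragment index: 'D' = durable, 'n' = non-durable, None = missing.
    # Frag #3 is non-durable and bracketed by two missing neighbors (#2, #4).
    frags = ['D', 'D', None, 'n', None, 'D']

    def one_pass(frags):
        changed = False
        n = len(frags)
        # A durable fragment makes a non-durable ring partner durable on sync.
        for i, state in enumerate(frags):
            if state != 'D':
                continue
            for j in ((i - 1) % n, (i + 1) % n):
                if frags[j] == 'n':
                    frags[j] = 'D'
                    changed = True
        # Reconstruction of a missing fragment needs >= NDATA durable frags.
        if frags.count('D') >= NDATA:
            for i, state in enumerate(frags):
                if state is None:
                    frags[i] = 'D'
                    changed = True
        return changed

    while one_pass(frags):
        pass
    print(frags)  # ['D', 'D', None, 'n', None, 'D'] -- stuck forever: only 3
                  # durables, so #2/#4 are never rebuilt and #3 is never
                  # marked durable.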

Revision history for this message
clayg (clay-gerrard) wrote :

Making the Object Server prefer to serve non-durable frags rather than 404 seems like the most straightforward way to address this terrible terrible failure:

https://gist.github.com/clayg/2988c8804ab4b704905d203d534a5342
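
A conceptual sketch of the idea (this is not the linked gist and not Swift's real DiskFile code; it assumes the old on-disk layout of <timestamp>.data files plus <timestamp>.durable markers):

    import os

    def choose_data_file(hashdir, require_durable=False):
        """Pick the .data file a toy object server would serve on GET."""
        files = sorted(os.listdir(hashdir), reverse=True)  # newest first
        durable_ts = {f[:-len('.durable')]
                      for f in files if f.endswith('.durable')}
        for f in files:
            if not f.endswith('.data'):
                continue
            ts = f[:-len('.data')]
            if ts in durable_ts:
                return f   # newest durable frag: always servable
            if not require_durable:
                return f   # proposed behavior: serve the non-durable frag
            # old behavior: skip it and fall through to any older durable
            # frag, or 404 if there is none
        return None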

I'm not aware of any situation where the proxy or reconstructor would ask for a fragment and not be robust to a potentially uncommitted minority response. If there was an uncommitted *majority* response, it would be served while it's available - but the eventual durability would still be subject to the existence and propagation of the durable file.

As such, this solution should be considered at *least* a partial fix for lp bug #1469094 - additional work to improve availability of a minority overwrite while additional primaries are offline (i.e. servicing a durable majority from "under" a non-durable minority using secondary frags from the same nodes) should fall under lp bug #1484598.

Revision history for this message
Alistair Coles (alistair-coles) wrote :

If we just make the object server always return the most recent fragment even if it is not durable (which is how I understood your attached patch), then don't we just swap one "bug" for another?

On master, if a partial PUT results in >= ec_ndata frags at t1 but < ec_ndata durables, then we cannot GET the object at t1

With your change, if a partial PUT at t1 results in < ec_ndata frags with no durables, then we cannot GET the object at t1 *nor* an underlying object at t0 which remains on disk (no durable -> old durable data remains on disk), because optimism failed and there is no fallback. In this case no reconstruction can happen either.
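
To make the two scenarios concrete, a toy model (illustrative only, not Swift code) assuming a 4+2 policy (ec_ndata = 4), six primary nodes, and object servers that serve either the newest fragment or only the newest *durable* fragment:

    EC_NDATA = 4

    def served(node, require_durable):
        """node is a newest-first list of (timestamp, durable) fragments."""
        for ts, durable in node:
            if durable or not require_durable:
                return ts
        return None

    def can_get(nodes, ts, require_durable):
        """Toy proxy: can we assemble >= ec_ndata frags at timestamp ts?"""
        frags = [served(n, require_durable) for n in nodes]
        return frags.count(ts) >= EC_NDATA

    # Scenario 1 (master): PUT at t1 left 5 frags but only 3 durables.
    nodes = [[('t1', True)]] * 3 + [[('t1', False)]] * 2 + [[]]
    print(can_get(nodes, 't1', require_durable=True))   # False on master
    print(can_get(nodes, 't1', require_durable=False))  # True with the change

    # Scenario 2 (with the change): PUT at t1 left only 3 frags, none
    # durable, on top of a fully durable t0 object.
    nodes = [[('t1', False), ('t0', True)]] * 3 + [[('t0', True)]] * 3
    print(can_get(nodes, 't1', require_durable=False))  # False: only 3 t1 frags
    print(can_get(nodes, 't0', require_durable=False))  # False: t0 hidden on 3 nodes
    print(can_get(nodes, 't0', require_durable=True))   # True on master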

With patch https://review.openstack.org/#/c/215276/ both scenarios are "fixed" because the proxy is smart enough to see that optimism failed and fall back to whatever else can be fetched from disks. Although that patch does not cover the reconstructor, I think the strategy (smart proxy plus adaptive object server) will allow us to fix more cases.

Revision history for this message
clayg (clay-gerrard) wrote :

> If we just make the object server always return most recent fragment
> even if it is not durable, (which is how I understood your attached
> patch), then don't we just swap one "bug" for another?

That would definitely not be acceptable; we'd need unit tests for the
suggested change to assuage any concerns that there are no bad behaviors.

> On master, if a partial PUT results in >= ec_ndata frags at t1 but <
> ec_ndata durables, then we cannot GET the object at t1

If we have >= ec_ndata frags at t1 we should serve it. If the worry
is that some of those frags would be unavailable on a subsequent read
but an older stale read might be *possible*, I think we'd have to
concede that any improvements would have to include a second request
back to those nodes.

So, I think the question is about what the object server should do in
the default case (which is most likely to achieve our goal of highly
available read-your-writes?). I have learned that missing durables can
happen under load, and failing to serve non-durable fragments is
causing us a lot of pain.

I'm not really sure about servicing stale durable fragments from "under"
more recent writes - now we're getting into lp bug #1484598 - which adds
a bunch of complexity.

> With your change, if a partial PUT at t1 results in < ec_ndata frags
> with no durables then we cannot GET the object at t1 *nor* an
> underlying object at t0, which remains on disk (no durable -> old
> durable data remains on disk) because optimism failed and there is no
> fallback. In this case no reconstruction can happen either.

I think this is correct. It's not good. Worth considering...

https://gist.github.com/clayg/1f2590ea4608a13823865e27a6666c2a

Yes, I think we would have to address this case. Still feels like lp
bug #1484598

If a GET at t2 were able to get *something* from the t0 write (which
might not be available until that TOR switch is fixed), the ideal
behavior would be for the proxy to make a second round of requests to
the nodes - I'd probably advocate for an "X-Require-Durable" header so
Nodes 1-6 could return the "stale" write.
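
A rough sketch of that two-round idea (hypothetical only: "X-Require-Durable" is just the suggestion above, not a real Swift header), using the same kind of toy node model as the earlier example:

    from collections import Counter

    EC_NDATA = 4

    def served(node, require_durable):
        """node is a newest-first list of (timestamp, durable) fragments."""
        for ts, durable in node:
            if durable or not require_durable:
                return ts
        return None

    def two_round_get(nodes):
        # Round 1: newest frags, durable or not (the proposed default).
        # Round 2: durable-only, so nodes can return the "stale" write.
        for require_durable in (False, True):
            frags = [served(n, require_durable) for n in nodes]
            hits = Counter(ts for ts in frags if ts is not None)
            for ts, count in hits.most_common():
                if count >= EC_NDATA:
                    return ts, ('durable-only round' if require_durable
                                else 'round 1')
        return None

    # Partial PUT at t1 left 3 non-durable frags over a fully durable t0.
    nodes = [[('t1', False), ('t0', True)]] * 3 + [[('t0', True)]] * 3
    print(two_round_get(nodes))   # ('t0', 'durable-only round')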

> With patch https://review.openstack.org/#/c/215276/ both scenarios are
> "fixed" because the proxy is smart enough to see that optimism failed
> and fallback to whatever else can be fetched from disks. Although that
> patch does not cover the reconstructor, I think the strategy (smart
> proxy plus adaptive object server) will allow to fix more cases.

Yes, there do keep seeming to be some failure scenarios that would
require a proxy smart enough to reach back into an object server for
"something else" that it might be holding. The API and the adaptiveness
that the object server might need to express are more contentious -
I'm definitely triaging failure modes based on likelihood and trying
to deal with them in as simple a manner as possible. I'm not scared of
complexity, but I think complex APIs are complicated to get right. So I
think it's reasonable to only let in as much complexity as needed into
the API.

Revision history for this message
clayg (clay-gerrard) wrote :

No one has ever reported this failure from the wild. The probetest proves it's possible - but it goes out of its way to mess with on-disk state, so it's kind of unfair.

Most everyone seems convinced it can happen with timeouts writing durables and a couple of unlucky disk failures - so it should be fixed; but the solution is not straightforward (it took us a year to fix essentially the same problem in the proxy [1]).

It being "hard" isn't a good reason to reduce the severity of the bug - but a failure that is not known to have ever occurred is probably worth a little breathing room.

1. lp bug #1484598

Changed in swift:
importance: Critical → High
tags: added: ec
Revision history for this message
Alistair Coles (alistair-coles) wrote :
Changed in swift:
importance: High → Critical
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.openstack.org/376630
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=6574ce31eeda1bdffa2534f21eed7ab0d62bce97
Submitter: Jenkins
Branch: master

commit 6574ce31eeda1bdffa2534f21eed7ab0d62bce97
Author: Alistair Coles <email address hidden>
Date: Mon Sep 26 17:25:09 2016 +0100

    EC: reconstruct using non-durable fragments

    Previously the reconstructor would only reconstruct a missing fragment
    when a set of ec_ndata other fragments was available, *all* of which
    were durable. Since change [1] it has been possible to retrieve
    non-durable fragments from object servers. This patch changes the
    reconstructor to take advantage of [1] and use non-durable fragments.

    A new probe test is added to test scenarios with a mix of failed and
    non-durable nodes. The existing probe tests in
    test_reconstructor_rebuild.py and test_reconstructor_durable.py were
    broken. These were intended to simulate cases where combinations of
    nodes were either failed or had non-durable fragments, but the test
    scenarios defined were not actually created - every test scenario
    broke only one node instead of the intent of breaking multiple
    nodes. The existing tests have been refactored to re-use most of their
    setup and assertion code, and merged with the new test into a single
    class in test_reconstructor_rebuild.py.

    test_reconstructor_durable.py is removed.

    [1] Related-Change: I2310981fd1c4622ff5d1a739cbcc59637ffe3fc3

    Change-Id: Ic0cdbc7cee657cea0330c2eb1edabe8eb52c0567
    Co-Authored-By: Clay Gerrard <email address hidden>
    Closes-Bug: #1624088
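
A rough sketch (not the actual patch) of the behavioral change the commit describes, assuming the reconstructor gathers (frag_index, timestamp, durable) responses and needs ec_ndata distinct fragment indexes at one timestamp:

    def usable_frag_set(responses, ec_ndata, allow_non_durable=True):
        """responses: iterable of (frag_index, timestamp, durable) tuples."""
        by_ts = {}
        for frag_index, ts, durable in responses:
            if durable or allow_non_durable:
                by_ts.setdefault(ts, set()).add(frag_index)
        # Prefer the newest timestamp with enough distinct fragment indexes.
        for ts in sorted(by_ts, reverse=True):
            if len(by_ts[ts]) >= ec_ndata:
                return ts, by_ts[ts]
        return None

    # 3 durable + 1 non-durable fragments with ec_ndata = 4:
    resp = [(0, 't1', True), (1, 't1', True), (2, 't1', True), (3, 't1', False)]
    print(usable_frag_set(resp, 4, allow_non_durable=False))  # None (old behavior)
    print(usable_frag_set(resp, 4, allow_non_durable=True))   # ('t1', {0, 1, 2, 3})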

Changed in swift:
status: Confirmed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/swift 2.11.0

This issue was fixed in the openstack/swift 2.11.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (feature/hummingbird)

Fix proposed to branch: feature/hummingbird
Review: https://review.openstack.org/400985

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (feature/hummingbird)

Reviewed: https://review.openstack.org/400985
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=0c3f8f87104af8717115c5badffd243dbaa1c430
Submitter: Jenkins
Branch: feature/hummingbird

commit 2d25fe6ad3573b2a06b6b3e5e66493d7b0c55693
Author: Tim Burke <email address hidden>
Date: Mon Jul 25 15:06:23 2016 -0700

    Reduce backend requests for SLO If-Match / HEAD requests

    ... by storing SLO Etag and size in sysmeta.

    Previously, we had to GET the manifest for every HEAD or conditional
    request to an SLO. Worse, since SLO PUTs require that we HEAD every
    segment, we'd GET all included sub-SLO manifests. This was necessary so
    we could recompute the large object's Etag and content-length.

    Since we already know both of those during PUT, we'll now store it in
    object sysmeta. This allows us to:

     * satisfy HEAD requests based purely off of the manifest's HEAD
       response, and
     * perform the If-(None-)Match comparison on the object server, without
       any additional subrequests.

    Note that the large object content-length can't just be parsed from
    content-type -- with fast-POST enabled, the content-type coming out of
    the object-server won't necessarily include swift_bytes.

    Also note that we must still fall back to GETting the manifest if the
    sysmeta headers were not found. Otherwise, we'd break existing large
    objects.

    Change-Id: Ia6ad32354105515560b005cea750aa64a88c96f9

commit ae7dddd801e28217d7dc46bd45cd6b621f29340c
Author: Ondřej Nový <email address hidden>
Date: Mon Nov 21 22:13:11 2016 +0100

    Added comment for "user" option in drive-audit config

    Change-Id: I24362826bee85ac3304e9b63504c9465da673014

commit c3e1d847f4b9d6cc6212aae4dc1b1e6dff45fb40
Author: Thiago da Silva <email address hidden>
Date: Thu Nov 17 17:17:00 2016 -0500

    breaking down tests.py into smaller pieces

    tests.py is currently at ~5500 lines of code, it's
    time to break it down into smaller files.

    I started with an easy middleware set of tests
    (i.e., versioned writes, ~600 lines of code ) so I can get
    some feedback. There are more complicated tests that cover
    multiple middlewares for example, it is not so clear where
    those should go.

    Change-Id: I2aa6c18ee5b68d0aae73cc6add8cac6fbf7f33da
    Signed-off-by: Thiago da Silva <email address hidden>

commit 5d7a3a4172f0f11ab870252eec784cf24b247dea
Author: Ondřej Nový <email address hidden>
Date: Sat Nov 19 23:24:30 2016 +0100

    Removed "in-process-" from func env tox name

    This shorten shebang in infra, because we are hitting 128 bytes limit.

    Change-Id: I02477d81b836df71780942189d37d616944c4dce

commit 9ea340256996a03c8c744201297b47a0e91fe65b
Author: Kota Tsuyuzaki <email address hidden>
Date: Fri Nov 18 01:50:11 2016 -0800

    Don't overwrite built-in 'id'

    This is a follow up for https://review.openstack.org/#/c/399237

    'id' is assigned as a builtin function so that we should not use 'id'
    for the local variable name.

    Change-Id: Ic27460d49e68f6cd50bda1d5b3810e01ccb07a37

commit bf...

tags: added: in-feature-hummingbird