Overloaded object primaries cause 404s on GET

Bug #1837819 reported by Tim Burke
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Object Storage (swift)
Fix Released
Medium
Tim Burke

Bug Description

If a proxy sees

  (Timeout, Timeout, Timeout, 404, 404, 404)

for a triple-replicated object GET, at the moment, we take *a* response over *no* response and return 404 to the client.

This seems like a poor choice, though: we have no reason to expect the data to be on handoffs! *Maybe* the 404 would be OK if we could assume that all data would have been written recently (ie, while the primaries were unavailable) -- but it seems like a much safer assumption to think that if there's data it would have been written long enough ago that the replicators would have gotten it settled on the primaries. The correct response (if the primaries weren't overloaded) may *still be* a 404 -- but the system isn't healthy enough for us to say either way with any confidence. 503 is more appropriate.

(That is, assuming those 404s aren't from tombstones -- if there's a timestamp, we *should* give that 404 consideration.)

The story gets even funkier with EC -- on a 3+2 policy, you could get something like

  (frag#0#d.data, frag#1#d.data, Timeout, Timeout, Timeout, 404, 404, 404)

so *we see durable data* and yet still 404. If we know we should be able to reconstruct but can't right now, 503 seems like the *only* appropriate response.

Changed in swift:
importance: Undecided → High
importance: High → Medium
assignee: nobody → Tim Burke (1-tim-z)
Changed in swift:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.opendev.org/672186
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=3189410f9d5be3a845dbd47fdd77584f4b76da8a
Submitter: Zuul
Branch: master

commit 3189410f9d5be3a845dbd47fdd77584f4b76da8a
Author: Tim Burke <email address hidden>
Date: Mon Jul 22 12:38:30 2019 -0700

    Ignore 404s from handoffs for objects when calculating quorum

    We previously realized we needed to do that for accounts and containers
    where the consequences of treating the 404 as authoritative were more
    obvious: we'd cache the non-existence which prevented writes until it
    fell out of cache.

    The same basic logic applies for objects, though: if we see

        (Timeout, Timeout, Timeout, 404, 404, 404)

    on a triple-replica policy, we don't really have any reason to think
    that a 404 is appropriate. In fact, it seems reasonably likely that
    there's a thundering-herd problem where there are too many concurrent
    requests for data that *definitely is there*. By responding with a 503,
    we apply some back-pressure to clients, who hopefully have some
    exponential backoff in their retries.

    The situation gets a bit more complicated with erasure-coded data, but
    the same basic principle applies. We're just more likely to have
    confirmation that there *is* data out there, we just can't reconstruct
    it (right now).

    Note that we *still want to check* those handoffs, of course. Our
    fail-in-place strategy has us replicate (and, more recently,
    reconstruct) to handoffs to maintain durability; it'd be silly *not* to
    look.

    UpgradeImpact:
    --------------
    Be aware that this may cause an increase in 503 Service Unavailable
    responses served by proxy-servers. However, this should more accurately
    reflect the state of the system.

    Co-Authored-By: Thiago da Silva <email address hidden>
    Change-Id: Ia832e9bab13167948f01bc50aa8a61974ce189fb
    Closes-Bug: #1837819
    Related-Bug: #1833612
    Related-Change: I53ed04b5de20c261ddd79c98c629580472e09961
    Related-Change: Ief44ed39d97f65e4270bf73051da9a2dd0ddbaec

Changed in swift:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (feature/losf)

Fix proposed to branch: feature/losf
Review: https://review.opendev.org/677848

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (feature/losf)
Download full text (20.8 KiB)

Reviewed: https://review.opendev.org/677848
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=6204a2bc571a51e91f97ad4342b8ae216aaf6f36
Submitter: Zuul
Branch: feature/losf

commit ff5ea003b3a1b37d8417aa17d3521237768dfe62
Author: Tim Burke <email address hidden>
Date: Tue Aug 20 22:20:44 2019 -0700

    ec: log durability of frags that fail to reconstruct

    Whether the frag is durable or non-durable greatly affects how much I
    care whether I can reconstruct it.

    Change-Id: Ie6f46267d4bb567ecc0cc195d1fd7ce55c8cb325

commit 94366fd00ec1e71b4f67ef8595b31162c6945c28
Author: Thiago da Silva <email address hidden>
Date: Mon Aug 19 22:31:41 2019 +0200

    Add Dockerfile to build a py3 swift docker image

    This patch contains only a new Dockerfile and few changes to
    be able to build both py2 and py3 images.

    Next patch should contain changes to add the gate jobs to build
    a py3 docker image

    Change-Id: Ifdebde9597a787abcd553756e22261e2faaeedfc

commit e92191306587335ad0aab3a3b9dffef21765e17b
Author: Clay Gerrard <email address hidden>
Date: Mon Aug 19 11:06:20 2019 -0500

    Rename symlink method

    Related-Change-Id: I179ea6180d31146bb947061c69b1807c59529ac8
    Change-Id: I770ea1be25e339e5ce4341b532a1fff9e1373152

commit 82e427a8b14e3efee2da99d622fad04ed6461a50
Author: Clay Gerrard <email address hidden>
Date: Mon Aug 19 10:56:50 2019 -0500

    Fix symlink docstring

    Related-Change-Id: I179ea6180d31146bb947061c69b1807c59529ac8
    Change-Id: I41cbf445057be09a102ef5515a9ce6a4f14e7458

commit a21c0be70ca6c200076d274c54f6906e1cf92a61
Author: Tim Burke <email address hidden>
Date: Wed Jun 5 12:02:07 2019 -0700

    dlo: Respond 200 on multi-range GETs

    We already had a unit test for this, but it was inadvertently retesting
    what test_get_suffix_range_many_segments tested.

    Change-Id: I43eee7029ca884268c3f2ad74300b94b299fd8d2
    Closes-Bug: #1831790

commit 7f42309a335370f567aefdf0236cfe01cbbfca5d
Author: Tim Burke <email address hidden>
Date: Thu Mar 21 15:13:36 2019 -0700

    py3: Cover account/container func tests

    Change-Id: I9a754f1eb06debbe16800a05fc8e969af2f89043

commit 3e7752d8c80473e93d537c7ba30419ff9313ebb8
Author: Thiago da Silva <email address hidden>
Date: Fri Aug 16 18:30:13 2019 +0200

    Update docker image to latest Alpine 3.10.1

    Change-Id: I888c51a965ee27175584f575560a1d50f4534663

commit 1abc9c4f9d4c2f9d8ac996cca489b0ec65660a05
Author: Tim Burke <email address hidden>
Date: Thu Jan 24 00:23:01 2019 +0000

    Allow "static symlinks"

    ... by embedding something like `If-Match: <etag>` semantics in the
    symlink.

    When creating a symlink, users may now specify an optional
    X-Symlink-Target-Etag header. If present, the etag of the final object
    returned to the client will be checked; if it does not match the
    X-Symlink-Target-Etag header, a 409 Conflict error will be returned to
    the client.

    Note that, unlike "dynamic symlink" behavior, the target object must
    exist with the matching Etag for the "static symlink" to be created.

    Since we're v...

tags: added: in-feature-losf
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/swift 2.23.0

This issue was fixed in the openstack/swift 2.23.0 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.