sharded root containers can lose their epoch, turning them unsharded

Bug #1980451 reported by Matthew Oliver
Affects: OpenStack Object Storage (swift)
Status: Fix Released
Importance: Medium
Assigned to: Matthew Oliver

Bug Description

We've now noticed it twice in production: an already sharded root suddenly has no epoch. This tells the state machine that the container isn't sharded. However, when we look at the db on disk it has an epoch in the filename, which means this was a sharded container, and sure enough it has shards out there.
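
For the record, the mismatch can be seen straight off the disk. Roughly something like this (a sketch only; it assumes the fresh db of a sharded container is named <hash>_<epoch>.db, a ContainerBroker-ish API, and epoch_mismatch is a made-up helper):

    import os

    def epoch_mismatch(broker):
        # The fresh db of a sharded container carries the epoch in its
        # filename, e.g. "<hash>_<epoch>.db"; an unsharded db is "<hash>.db".
        name = os.path.basename(broker.db_file)
        has_epoch_in_name = '_' in name.rsplit('.db', 1)[0]
        # Don't fabricate a default here; we only want what is persisted.
        osr = broker.get_own_shard_range(no_default=True)
        return has_epoch_in_name and (osr is None or osr.epoch is None)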

We are not sure how this can happen. The only way for an own_shard_range (OSR) to be created without an epoch is if the container doesn't have one _AND_ it hits a bit of the sharder code that calls `get_own_shard_range(no_default=False)`.
In theory this could happen on a new container PUT for a recently rebalanced replica, or maybe a handoff.
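
A rough sketch of that suspected path (simplified, not the actual sharder code; whether the default really gets persisted like this is exactly the open question):

    # A freshly created replica/handoff db has no persisted own_shard_range,
    # so no_default=False hands back a default OSR: epoch is None and its
    # timestamp is "now".
    osr = broker.get_own_shard_range(no_default=False)
    assert osr.epoch is None            # the default OSR carries no epoch
    # If that default is then persisted, it replicates out and, thanks to
    # its fresh timestamp, can win against the epoched OSRs on the peers.
    broker.merge_shard_ranges([osr])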

In any case, a new OSR would have a timestamp and state_timestamp of the time it was created, so it will overwrite others on replication, meaning it can take out the OSRs of other replicas.
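
The "newest timestamp wins" reconciliation is roughly this (a simplified illustration, not swift's actual shard-range merge code):

    def pick_row(local_osr, remote_osr):
        # Rows for the same shard range are reconciled by timestamp, newest
        # wins, with no regard to whether the newer row carries an epoch.
        return remote_osr if remote_osr.timestamp > local_osr.timestamp \
            else local_osr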

This comes with some problems: the sharded roots will start looking in their own object tables for objects in the container and ignore the shards, meaning a difference in stats and object listings.

Possible solutions:

  - Set a default OSR to have an older timestamp, but at some point it needs to update to now, so it might not be super useful in blocking the bug.
  - Once an OSR has an epoch we shouldn't be able to lose it.. EVER.. as it's an important piece of the sharding state machine.

We have cut the calls to get_own_shard_range(no_default=True) down to the bare minimum, and have also added device names to the db_ids so we can track who was replicating to a container, in an attempt to help track down this issue.

I like the latter solution: we only ever go "no epoch" -> "epoch" and never backwards, so all we should need to do is stop merge_shard_ranges from merging any shard range without an epoch over one that has one.
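
Something like the guard below is what I have in mind; a sketch only, not an actual patch (the helper name is made up):

    def should_accept_remote_osr(local_osr, remote_osr):
        # Epochs only ever go "no epoch" -> "epoch", never backwards, so a
        # remote row without an epoch must not replace a local row that has
        # one, no matter how fresh its timestamp is.
        if (local_osr is not None and local_osr.epoch is not None
                and remote_osr.epoch is None):
            return False
        # Otherwise fall back to the usual newest-timestamp-wins behaviour.
        return local_osr is None or remote_osr.timestamp > local_osr.timestamp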

Still I'd love to know the root cause of this bug.

Matthew Oliver (matt-0)
Changed in swift:
assignee: nobody → Matthew Oliver (matt-0)
Changed in swift:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.opendev.org/c/openstack/swift/+/809969
Committed: https://opendev.org/openstack/swift/commit/8227f4539cbdf4b9b59def8a344084796622248c
Submitter: "Zuul (22348)"
Branch: master

commit 8227f4539cbdf4b9b59def8a344084796622248c
Author: Matthew Oliver <email address hidden>
Date: Wed Sep 8 16:29:30 2021 +1000

    sharding: don't replace own_shard_range without an epoch

    We've observed a root container suddenly think it's unsharded when its
    own_shard_range is reset. This patch blocks a remote OSR with an epoch
    of None from overwriting a local epoched OSR.

    The only way we've observed this happen is when a new replica or handoff
    node creates a container and its new own_shard_range is created without
    an epoch and is then replicated to older primaries.

    However, if a bad node with a non-epoched OSR is on a primary, its
    newer timestamp would prevent pulling the good OSR from its peers, so
    it'll be left stuck with its bad one.

    When this happens expect to see a bunch of:
        Ignoring remote osr w/o epoch: x, from: y

    When an OSR comes in from a replica without an epoch when it should
    have one, we do a pre-flight check to see if it would remove the epoch
    before emitting the error above. We do this because when sharding is
    first initiated it's perfectly valid to get OSRs without epochs from
    replicas; that is expected and harmless.

    Closes-bug: #1980451
    Change-Id: I069bdbeb430e89074605e40525d955b3a704a44f
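
The pre-flight check described in the commit message boils down to roughly this sketch (illustrative only, not the merged patch; filter_remote_osr and its arguments are made-up names):

    def filter_remote_osr(broker, remote_osr, source, logger):
        # Pre-flight check: only reject (and warn) when accepting the
        # epoch-less remote OSR would actually strip our local epoch.
        local_osr = broker.get_own_shard_range(no_default=True)
        if (remote_osr.epoch is None and local_osr is not None
                and local_osr.epoch is not None):
            logger.warning('Ignoring remote osr w/o epoch: %s, from: %s',
                           remote_osr, source)
            return None
        # Early in sharding it is perfectly valid for replicas to send OSRs
        # without epochs, so anything else passes through untouched.
        return remote_osr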

Changed in swift:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/swift 2.33.0

This issue was fixed in the openstack/swift 2.33.0 release.
