make reconstructor handoffs_first mode more useful

Bug #1653018 reported by clayg
Affects: OpenStack Object Storage (swift)
Status: Fix Released
Importance: Medium
Assigned to: Unassigned
Milestone: (none)

Bug Description

The handoffs_first mode on the reconstructor doesn't work very well in typical deployments with multiple disks because of lp bug #1491605: the part jobs are processed per disk, in order.

So the handoff parts on the first disk will be processed, but then the reconstructor gets bogged down in the sync jobs for the other parts on that disk, so it's not really doing handoffs first.

Also, as a lesson learned from the replicator: handoffs_first should be a mode of operation in which an operator can get a reconstructor to process *only* handoff parts, rather than doing handoffs for a while at the start and then doing regular work.
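
To make the ordering problem concrete, here is a minimal sketch of the two behaviors described above. It is not Swift's code: the `Job` record and the function names `per_disk_handoffs_first` and `handoffs_only` are hypothetical, and real reconstructor jobs carry many more fields. Only the "revert" (handoff) and "sync" job kinds come from the report.

```python
from collections import namedtuple

# Hypothetical job record for illustration only.
# kind is "revert" for a handoff part or "sync" for a primary part.
Job = namedtuple("Job", ["disk", "partition", "kind"])


def per_disk_handoffs_first(jobs):
    """Order jobs disk by disk, with handoffs first only within each disk."""
    ordered = []
    for disk in sorted({job.disk for job in jobs}):
        disk_jobs = [job for job in jobs if job.disk == disk]
        # Stable sort: revert jobs (key False) come before sync jobs (key True).
        ordered.extend(sorted(disk_jobs, key=lambda job: job.kind != "revert"))
    return ordered


def handoffs_only(jobs):
    """Keep only the handoff (revert) jobs from every disk; drop sync jobs."""
    return [job for job in jobs if job.kind == "revert"]


jobs = [
    Job("sda", 1, "sync"),
    Job("sda", 2, "revert"),
    Job("sdb", 3, "sync"),
    Job("sdb", 4, "revert"),
]

# Per-disk ordering: the handoff on sdb waits behind the sync job on sda.
print([(job.disk, job.kind) for job in per_disk_handoffs_first(jobs)])
# Handoffs-only: sync jobs are skipped on every disk.
print([(job.disk, job.kind) for job in handoffs_only(jobs)])
```

With per-disk ordering, every sync job on an earlier disk delays the handoffs on later disks, which is the behavior this bug asks to change.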

clayg (clay-gerrard) wrote :
Changed in swift:
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.openstack.org/425493
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=da557011ecc1ec46bce24bd896ec50cbed914ba6
Submitter: Jenkins
Branch: master

commit da557011ecc1ec46bce24bd896ec50cbed914ba6
Author: Clay Gerrard <email address hidden>
Date: Wed Jan 25 11:51:03 2017 -0800

    Deprecate broken handoffs_first in favor of handoffs_only

    The handoffs_first mode in the replicator has the useful behavior of
    processing all handoff parts across all disks until there aren't any
    handoffs anymore on the node [1] and then it seemingly tries to drop
    back into normal operation. In practice I've only ever heard of
    handoffs_first used while rebalancing and turned off as soon as the
    rebalance finishes - it's not recommended to run with handoffs_first
    mode turned on and it emits a warning on startup if the option is enabled.

    The handoffs_first mode on the reconstructor doesn't work - it was
    prioritizing handoffs *per-part* [2] - which is really unfortunate
    because in the reconstructor during a rebalance it's often *much* more
    attractive from a disk/network efficiency perspective to revert a
    partition from a handoff than it is to rebuild an entire partition from
    another primary using the other EC fragments in the cluster.

    This change deprecates handoffs_first in favor of handoffs_only in the
    reconstructor which is far more useful - and just like handoffs_first
    mode in the replicator - it gives the operator the option of forcing the
    consistency engine to focus on rebalance. The handoffs_only behavior is
    somewhat consistent with the replicator's handoffs_first option (any
    error on any handoff in the replicator will make it essentially handoff
    only forever) but the option does what you want and is named correctly
    in the reconstructor.

    For consistency with the replicator the reconstructor will mostly honor
    the handoffs_first option, but if you set handoffs_only in the config it
    always takes precedence. Having handoffs_first in your config always
    results in a warning, but if handoffs_only is not set and handoffs_first
    is true the reconstructor will assume you need handoffs_only and behave
    as such.

    When running in handoffs_only mode the reconstructor will start to log a
    warning every cycle if you leave it running in handoffs_only after it
    finishes reverting handoffs. However, you should be monitoring on-disk
    partitions and disable the option as soon as the cluster finishes the
    full rebalance cycle.

    1. Ia324728d42c606e2f9e7d29b4ab5fcbff6e47aea fixed replicator
    handoffs_first "mode"

    2. Unlike replication, each partition in an EC policy can have a different
    kind of job per frag_index, but the cardinality of jobs is typically
    only one (either sync or revert) unless there's been a bunch of errors
    during write and then handoff partitions may hold a number of
    different fragments.

    Known-Issues:

    handoffs_only is not documented outside of the example config, see lp
    bug #1626290

    Closes-Bug: #1653018
    ...
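
The option precedence described in the commit message can be sketched as below. This is an illustration under stated assumptions, not the reconstructor's actual implementation: the function `resolve_handoffs_only`, the set of accepted truthy strings, and the warning text are all hypothetical. Only the rules themselves come from the commit message: handoffs_only wins when set, handoffs_first always warns, and handoffs_first alone is treated as handoffs_only.

```python
# Hypothetical helper; names, truthy values and warning text are assumptions.
TRUTHY = {"true", "1", "yes", "on", "t", "y"}


def as_bool(value):
    """Interpret a config string as a boolean."""
    return str(value).strip().lower() in TRUTHY


def resolve_handoffs_only(conf):
    """Return (handoffs_only, warnings) for a dict of config strings."""
    warnings = []
    if "handoffs_first" in conf:
        # Having handoffs_first in the config always results in a warning.
        warnings.append("handoffs_first is deprecated; use handoffs_only")
    if "handoffs_only" in conf:
        # An explicit handoffs_only setting always takes precedence.
        return as_bool(conf["handoffs_only"]), warnings
    # Otherwise a true handoffs_first is treated as handoffs_only.
    return as_bool(conf.get("handoffs_first", "false")), warnings


print(resolve_handoffs_only({"handoffs_first": "true"}))
print(resolve_handoffs_only({"handoffs_first": "true", "handoffs_only": "false"}))
```

In this sketch the second call still returns the deprecation warning, but handoffs_only is False because the explicit setting overrides handoffs_first.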


Changed in swift:
status: New → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/swift 2.13.0

This issue was fixed in the openstack/swift 2.13.0 release.
