auditor status files cause replicator to report errors

Bug #1583305 reported by clayg
This bug affects 5 people
Affects: OpenStack Object Storage (swift)
Status: Fix Released
Importance: Undecided
Assigned to: Charles Hsu
Milestone: (none listed)

Bug Description

On a totally healthy working system the replicator finishes with log lines reporting errors:

May 18 18:38:48 saio object-6010: 954 successes, 8 failures
May 18 18:38:48 saio object-6010: 748 suffixes checked - 100.00% hashed, 0.00% synced

This is because build_jobs is incrementing errors for *all* devices when it encounters an unexpected file in the objects dir (in this case the auditor_status_*.json files).

There's no good logging around the event.

The object-replicator should expect the auditor status files to be in the objects dir. It should not report failures when there are none. If it encounters a condition which results in incrementing failures, it should include good logging around the nature of the failure.
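
The behavior requested above can be sketched roughly as follows. This is an illustrative, standalone sketch and not the actual Swift code: `collect_partitions`, `AUDITOR_STATUS_PATTERN`, and the logger name are hypothetical, and it assumes partition directories have all-digit names. Auditor status files are skipped silently, while anything else that is not a partition directory is counted as a failure and logged with its path.

```python
import fnmatch
import logging
import os

logger = logging.getLogger("replicator-sketch")  # hypothetical logger name

# Hypothetical constant; the pattern comes from the file names in this report.
AUDITOR_STATUS_PATTERN = "auditor_status_*.json"


def collect_partitions(obj_path):
    """Return (partition_paths, failures) for a single objects dir.

    Assumption: partition directories have all-digit names.
    """
    partitions = []
    failures = 0
    for entry in sorted(os.listdir(obj_path)):
        if fnmatch.fnmatch(entry, AUDITOR_STATUS_PATTERN):
            # Expected file left behind by the auditor: skip it, no failure.
            continue
        path = os.path.join(obj_path, entry)
        if not entry.isdigit() or not os.path.isdir(path):
            # Genuinely unexpected entry: count it and log what it was.
            failures += 1
            logger.warning("Unexpected entry in objects dir: %r", path)
            continue
        partitions.append(path)
    return partitions, failures
```

With this shape, a healthy objects dir that contains only partition directories and auditor status files yields zero failures, and any non-zero count is accompanied by a log line naming the offending path.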

This was reported on the ML and I confirmed it:

http://lists.openstack.org/pipermail/openstack/2016-May/016218.html

Revision history for this message
clayg (clay-gerrard) wrote :

gah, reconstructor needs the same smarts:

May 17 16:29:35 STACO2 object-reconstructor: Unexpected entity in data dir: u'/srv/node/s02z2ecd02/objects-1/auditor_status_ZBF.json'

Revision history for this message
Tim Burke (1-tim-z) wrote :

Reconstructor should have those smarts already: https://review.openstack.org/#/c/315334/

Revision history for this message
Mark Kirkwood (mark-kirkwood) wrote :

Also the calling of _add_failure_stats in the finally clause needs something a bit like the attached (suggestion only).

Revision history for this message
Mark Kirkwood (mark-kirkwood) wrote :

Note - the patch I uploaded previously is meant to work in addition to Clay's!

Although, testing with *just* Clay's patch applied, I'm not seeing those other (timeout logic related) error counts. Not sure if my test setup is merely not exhibiting the same bugs now or if Clay's fix stops those code paths being visited.

Revision history for this message
clayg (clay-gerrard) wrote :

Yeah, with the continue lines, when it encounters these expected files it never gets into those other blocks that do the failure increments. Although FWIW I think those other lines are somewhat dubious: they seem to be blanket catch-alls for unexpected behavior, but I can't imagine when what they do is the correct general behavior. Skipping these files in this known and explicit case is certainly the right way to handle *this* situation. Thinking harder about the general behavior when that block of code goes off the rails might be a separate bug.

Glad the patch worked out - hopefully we can slide something like that fix into a release soon.

Revision history for this message
Charles Hsu (charles0126) wrote :

@Mark Did you submit the code to review?

Changed in swift:
assignee: nobody → Charles Hsu (charles0126)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (master)

Fix proposed to branch: master
Review: https://review.openstack.org/353648

Changed in swift:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.openstack.org/353648
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=65b1820407ea40bd7d65a5356a58a689befe3cb5
Submitter: Jenkins
Branch: master

commit 65b1820407ea40bd7d65a5356a58a689befe3cb5
Author: Charles Hsu <email address hidden>
Date: Thu Aug 11 00:53:13 2016 +0800

    Ignore auditor status files to prevent replicator reports errors

    Ignore `auditor_status_*.json` files during the collecting jobs
    and replicator won't use these wrong paths to find objects that
    causes an exception to increase failure count in replicator report.

    Co-Authored-By: Clay Gerrard <email address hidden>
    Co-Authored-By: Mark Kirkwood <email address hidden>

    Change-Id: Ib15a0987288d9ee32432c1998aefe638ca3b223b
    Closes-Bug: #1583305

Changed in swift:
status: In Progress → Fix Released
Revision history for this message
Mark Kirkwood (mark-kirkwood) wrote :

Can we possibly backport this to 2.7.x series? Would be great to not have to work around this bug in production.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/362514

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (feature/hummingbird)

Fix proposed to branch: feature/hummingbird
Review: https://review.openstack.org/363111

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (feature/hummingbird)

Reviewed: https://review.openstack.org/363111
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=1ab2a296f58ae76aeffef9f3f0fb90e15358be27
Submitter: Jenkins
Branch: feature/hummingbird

commit 3b5850836c59c46f2507a7f62aceccf4c37e5d41
Author: gecong1973 <email address hidden>
Date: Tue Aug 30 15:08:49 2016 +0800

    Remove white space between print and ()

    There is a white space between print and ()
    in /tempauth.py, This patch fix it

    Change-Id: Id3493bdef12223aa3a2bffc200db8710f5949101

commit f88e7fc0da2ed6a63e0ea3c3459d80772b3068cd
Author: zheng yin <email address hidden>
Date: Mon Aug 29 20:21:44 2016 +0800

    Clarify test case in common/ring/test_builder

    They use a bare assertRaises(ValueError, ring.RingBuilder, *,*,*), but
    it's not clear which one raises which ValueError(), so I extend them to
    validate the error strings as well.

    Change-Id: I63280a9fc47ff678fe143e635046a0b402fd4506

commit d68b1bd6ddf44c5088e9d02dcb2f1b802c71411b
Author: zhufl <email address hidden>
Date: Mon Aug 29 14:31:27 2016 +0800

    Remove unnecessary tearDown

    This is to remove unnecessary tearDown to keep code clean.

    Change-Id: Ie70e40d6b55f379b0cc9bc372a35705462cade8b

commit d2fc2614575b04fd9cab5ae589880b92eee9b186
Author: Matthew Oliver <email address hidden>
Date: Fri Aug 19 16:17:31 2016 +1000

    Authorise versioned write PUTs before copy

    Currently a versioned write PUT uses a pre-authed request to move
    it into the versioned container before checking whether the
    user is authorised. This can lead to some interesting behaviour
    whereby a user can select a versioned object path that it does not
    have access to, request a put on that versioned object, and this
    request will execute the copy part of the request before it fails
    due to lack of permissions.

    This patch changes the behaviour to be the same as versioned DELETE
    where the request is authorised before anything is moved.

    Change-Id: Ia8b92251718d10b1eb44a456f28d3d2569a30003
    Closes-Bug: #1562175

commit c1ef6539b6ba9d2e4354c9cd2eec8a0195cdb19f
Author: Clay Gerrard <email address hidden>
Date: Thu Aug 25 11:00:49 2016 -0700

    add test for expirer processes == process

    This is a follow up from a change that improved the error message.

    Related-Change: I3d12b79470d122b2114f9ee486b15d381f290f95

    Change-Id: I093801f3516a60b298c13e2aa026c11c68a63792

commit 01477c78c1163822de41484e914a0736e622085b
Author: zheng yin <email address hidden>
Date: Thu Aug 25 15:37:42 2016 +0800

    Fix ValueError information in obj/expirer

    I fix error information in raise ValueError(...)
    For example:
        if a>=b:
            # It should be under below and not 'a must be less than or equal to b'
            raise ValueError('a must be less than b')

    Change-Id: I3d12b79470d122b2114f9ee486b15d381f290f95

commit b81f53b964fdb8f3b50dd369ce2e194ee4dbb0b7
Author: zheng yin <email address hidden>
Date: Tue Aug 23 14:26:47 2016 +0800

    Improve readability in the obj server's unit tests

    This change improves the reada...

tags: added: in-feature-hummingbird
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/363855

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (stable/mitaka)

Reviewed: https://review.openstack.org/362514
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=2c39cc9990fc2a2f3fac9a15c79c10e8fdb1b1fc
Submitter: Jenkins
Branch: stable/mitaka

commit 2c39cc9990fc2a2f3fac9a15c79c10e8fdb1b1fc
Author: Charles Hsu <email address hidden>
Date: Thu Aug 11 00:53:13 2016 +0800

    Ignore auditor status files to prevent replicator reports errors

    Ignore `auditor_status_*.json` files during the collecting jobs
    and replicator won't use these wrong paths to find objects that
    causes an exception to increase failure count in replicator report.

    Co-Authored-By: Clay Gerrard <email address hidden>
    Co-Authored-By: Mark Kirkwood <email address hidden>

    Change-Id: Ib15a0987288d9ee32432c1998aefe638ca3b223b
    Closes-Bug: #1583305
    (cherry picked from commit 65b1820407ea40bd7d65a5356a58a689befe3cb5)

tags: added: in-stable-mitaka
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/363855
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=06c385cac36223e044df72d9e74d07809cb1f0e7
Submitter: Jenkins
Branch: stable/mitaka

commit 06c385cac36223e044df72d9e74d07809cb1f0e7
Author: Tim Burke <email address hidden>
Date: Wed May 11 19:54:47 2016 -0700

    Stop complaining about auditor_status files

    Following fd86d5a, the object-auditor would leave status files so it
    could resume where it left off if restarted. However, this would also
    cause the object-reconstructor to print warnings like:

      Unexpected entity in data dir: u'/srv/node4/sdb8/objects/auditor_status_ZBF.json'

    ...which isn't actually terribly useful or actionable. The auditor will
    clean it up (eventually); the operator doesn't have to do anything.

    Now, the reconstructor will specifically ignore those status files.

    Partial-Bug: 1583305
    Change-Id: I2f3d0bd2f1e242db6eb263c7755f1363d1430048
    (cherry picked from commit ad16e2c77bb61bdf51a7d3b2c258daf69bfc74da)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/swift 2.10.0

This issue was fixed in the openstack/swift 2.10.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/swift 2.7.1

This issue was fixed in the openstack/swift 2.7.1 release.
