logic error in ssync_rcvr when getting EC frags from a handoff

Bug #1489546 reported by paul luse
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Object Storage (swift)
Fix Released
Critical
paul luse

Bug Description

Back from patch https://review.openstack.org/#/c/191521/ we failed to recognize the situation where a header value of string 'None' is valid and use a comparison to None instead causing an INT conversion failure when handling sender requests from a handoff node

Revision history for this message
clayg (clay-gerrard) wrote :

Looking at the diff from patchset 3 => 4 looks like maybe the X-Backend-Ssync-Node-Index can be 'None' - so that's probably the suck.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (master)

Fix proposed to branch: master
Review: https://review.openstack.org/217830

Changed in swift:
status: New → In Progress
Revision history for this message
clayg (clay-gerrard) wrote :

Well, but then looking at Ie9df9927ff4d3dd3f334647f883b2937d0d81030 [1] it doesn't look like the header should ever be 'None'

1. https://review.openstack.org/#/c/195457/

Revision history for this message
paul luse (paul-e-luse) wrote :

yeah looks that way but it is... will see if we can determine how on this cluster where its hapening

Revision history for this message
paul luse (paul-e-luse) wrote :

OK so Caleb just pointed out in the relevant code:

            self.connection.putheader(
                'X-Backend-Ssync-Frag-Index', self.node.get(
                    'index', self.job.get('frag_index', '')))
            # a revert job to a handoff will not have a node index
            self.connection.putheader('X-Backend-Ssync-Node-Index',
                                      self.node.get('index', ''))

that if 'frag_index' key exists and is None then we'll end up setting the 1st header above to 'None' which is what we're seeing happen. frag_index can be None (non string) in the Ec recon, see _get_part_jobs()

Revision history for this message
clayg (clay-gerrard) wrote :

ok, so I double confirmed that node.get('index', '') will never return None - a ring node will never have an 'index' key that is None. Handoffs just don't have the 'index' key.

But as stated, a reconstructor job may have a 'frag_index' key that is None - but it's a specific subset of REVERT jobs that include *only* tombstones.

However, most of time these tombstone REVERT jobs will go to primary nodes, specifically *all* primary nodes (this is probably a bit pessimistic given that tombstones are essentially replicated data) - further if *any* of the primary nodes throws an ssync failure (any reason will do, e.g. replication lock timeout) - then the revert job will keep processing nodes - even digging into handoffs (N.B. this behavior is sort of nuts, the partition is already on a handoff, and we're trying to move it to a *better* handoff - which is a somewhat dubious proposition). But in this case as Caleb & Paul pointed out ssync_sender.Sender will write a 'None' on the wire which is bad.

test.probe.test_reconstructor_revert.TestReconstructorRevert.test_delete_propogate is very close to covering this scenario, but it doesn't go as far as to introduce a failure on one of the primaries. I'm going to add that as well as make the needed fixups to ssync_sender to avoid writing 'None' to the wire, and update ssync_receiver to raise a cleaner BadRequest error if it gets a non-empty & non-digit X-Backend-Ssync-*-Index header. To go even one step further we can make the ssync_sender better a logging the details of the error and closing down the chunked transfer ssync protocol.

Changed in swift:
importance: Undecided → High
Revision history for this message
Caleb Tennis (ctennis) wrote :

Hit this again today, backtrace:

Sep 2 17:29:39 localhost object-server: ERROR __call__ error with SSYNC /d4/952 : #012Traceback (most recent call last):#012 File "/usr/local/lib/python2.7/dist-packages/swift/obj/server.py", line 938, in __call__#012 res = method(req)#012 File "/usr/local/lib/python2.7/dist-packages/swift/common/utils.py", line 2668, in wrapped#012 return func(*a, **kw)#012 File "/usr/local/lib/python2.7/dist-packages/swift/common/utils.py", line 1208, in _timing_stats#012 resp = func(ctrl, *args, **kwargs)#012 File "/usr/local/lib/python2.7/dist-packages/swift/obj/server.py", line 918, in SSYNC#012 return Response(app_iter=ssync_receiver.Receiver(self, request)())#012 File "/usr/local/lib/python2.7/dist-packages/swift/obj/ssync_receiver.py", line 72, in __init__#012 self.initialize_request()#012 File "/usr/local/lib/python2.7/dist-packages/swift/obj/ssync_receiver.py", line 162, in initialize_request#012 self.request.headers['X-Backend-Ssync-Frag-Index'])#012ValueError: invalid literal for int() with base 10: 'None'

Sep 2 17:29:39 localhost object-reconstructor: 172.30.3.43:6003/d4/952 Expected status 200; got 500

Revision history for this message
Caleb Tennis (ctennis) wrote :

Above backtrace still happens with https://review.openstack.org/#/c/217830/ patchset2 applied to latest swift master.

Revision history for this message
Caleb Tennis (ctennis) wrote :

Nevermind, I was incorrect. Above patchset does indeed fix this issue.

clayg (clay-gerrard)
Changed in swift:
importance: High → Critical
Revision history for this message
clayg (clay-gerrard) wrote :

I marked this as high just because - well the reconstructor sucks really bad w/o this patch.

https://review.openstack.org/#/c/218023/

is along the same theme

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.openstack.org/217830
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=a3facce53cac0497edb326a93a97f24e02c603ab
Submitter: Jenkins
Branch: master

commit a3facce53cac0497edb326a93a97f24e02c603ab
Author: paul luse <email address hidden>
Date: Thu Aug 27 11:02:27 2015 -0700

    Fix invalid frag_index header in ssync_sender when reverting EC tombstones

    Back in d124ce [1] we failed to recognize the situation where a revert
    job would have an explicit frag_index key wth the literal value None
    which would take precedence over the dict.get's default value of ''.
    Later in ssync_receiver we'd bump into the ValueError converting 'None'
    to an int (again).

    In ssync_sender we now handle literal None's correctly and should
    hopefully no longer put this invalid headers on the wire - but for belts
    and braces we'll also update ssync_receiver to raise a 400 series error
    and ssync_sender to better log the error messages.

    1. https://review.openstack.org/#/c/195457/

    Co-Author: Clay Gerrard <email address hidden>
    Co-Author: Alistair Coles <email address hidden>
    Change-Id: Ic71ba7cc82487773214030207bb193f425319449
    Closes-Bug: 1489546

Changed in swift:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (feature/crypto)

Fix proposed to branch: feature/crypto
Review: https://review.openstack.org/225540

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on swift (feature/crypto)

Change abandoned by John Dickinson (<email address hidden>) on branch: feature/crypto
Review: https://review.openstack.org/225540
Reason: jrichli will do this

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (feature/crypto)
Download full text (12.2 KiB)

Reviewed: https://review.openstack.org/225540
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=1936299943116abcedd8b36fb8b64c724e539ab7
Submitter: Jenkins
Branch: feature/crypto

commit 9e95613d717489da584531e220db20d1b8f43c51
Author: Alistair Coles <email address hidden>
Date: Mon Sep 21 10:01:54 2015 +0100

    Fix .coveragrc to prevent nose tests error

    Since v4.0 the coverage package raises an error if unrecognized
    options are found in .coveragrc [1]. Previously they were ignored. The
    ignore-errors option therefore causes nosetests with coverage to error
    because the option should be ignore_errors (underscore not hyphen).

    [1] https://bitbucket.org/ned/coveragepy/src/tip/CHANGES.rst

    Change-Id: Ic488801b7cc43217f9e2a4ed945e06505b667608

commit 1fe8e4327b15f8933efef185d09b6067d1f28716
Author: Clay Gerrard <email address hidden>
Date: Fri Sep 18 13:54:52 2015 -0700

    Fix recon tests on SAIO with multiple policies

    Recon middleware used to only look on rings that exist on disk when it was
    started, so if a test didn't create a ring in the temp swift_dir it can expect
    the middleware to not report it.

    However, after we started looking at policies to determine rings [1] - we need
    to be more careful to patch policies to match up with the test requirements.
    On development environments with only the legacy default polices the existing
    recon tests were passing by accident - but not in my environment.

    This change will patch policies for the TestCase so that tests will pass for
    me. Individual test methods that have more specific policy requirements for
    the test can continue to @patch_policies just for those tests but in general
    the existing test_methods all seem to expect legacy policies - so we just make
    the default for the TestCase legacy_only.

    Change-Id: I778a0a59091ca8870e1cab985f3ec426eb045ab7

commit a63f70c17d392379ec75045a94c38a96878b4c5c
Author: Minwoo Bae <email address hidden>
Date: Wed Sep 9 15:33:45 2015 -0500

    Reconstructor logging to omit 404 warnings

    Currently, the replicator does not log warning messages
    for 404 responses. We would like the reconstructor to
    do the same, as 404s are not considered unusual, and
    are already handled by the object server.

    Change-Id: Ia927bf30362548832e9f451923ff94053e11b758
    Closes-Bug: #1491883

commit 530102ae07ea27b4a994e4b1fb2f590700cfef0c
Author: Bill Huber <email address hidden>
Date: Mon Sep 14 16:01:39 2015 -0500

    Update EC Support on how to build an EC ring with replicas count

    This doc is being updated to specify the replicas count parameter
    to build an EC ring that enforces both data and parity placements
    for each partition.

    Change-Id: I770ad268e4017e610be3357e89b89f0b7d3c18af
    Closes-Bug: 1487203

commit 0d9142abd45f189b15eaec1a25464ed7d1859f97
Author: Clay Gerrard <email address hidden>
Date: Mon Sep 14 17:17:29 2015 -0700

    Fix typo in Deployment Guide and add some formatting

    Change-Id: I58703162bf3e9f39656a5e511bd8fe845793bca2

commit 460a7e4b64d134d1fd47f09924d594196b6...

tags: added: in-feature-crypto
Thierry Carrez (ttx)
Changed in swift:
milestone: none → 2.5.0
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (feature/hummingbird)

Fix proposed to branch: feature/hummingbird
Review: https://review.openstack.org/236162

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (feature/hummingbird)
Download full text (52.0 KiB)

Reviewed: https://review.openstack.org/236162
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=18ddcaf0d6b67fcbb6b0a4cf4a9a99c72f6f3a08
Submitter: Jenkins
Branch: feature/hummingbird

commit a9ddc2d9ea402eaac7ccd8992387f77855968ab5
Author: Mahati Chamarthy <email address hidden>
Date: Fri Oct 16 18:18:33 2015 +0530

    Hyperlink fix to first contribution doc

    Change-Id: I19fc1abc89f888233b80a57c68a152c1c1758640

commit 83a1151d13e096b480aefe6ec18259f2d7d021db
Author: Pete Zaitcev <email address hidden>
Date: Fri Oct 9 16:45:20 2015 -0600

    Interpolate the explanation string not whole HTML body

    The only reason this exists is that I promised to do it.
    But in our case, there's no big advantage, and here's why.

    The general thinking goes that strings must be interpolated
    first because the body may contain a syntax that confuses the
    interpolation. So this patch makes the code "more correct".
    However, our HTML template is tightly controlled. It's not
    like it contains additional percents.

    So I'll just leave this here for now while I'm asking if
    the content type is set correctly.

    Change-Id: Ia18aeb0f94ef389f8b95450986a566e5fa06aa10

commit 384b91eb824376659989b904f9396cbf2e02d2bd
Author: asettle <email address hidden>
Date: Thu Sep 3 15:11:46 2015 +1000

    Moving DLO functionality doc to the middleware code

    This change moves the RST DLO documentation from
    statically inside overview_large_objects.rst and moves it
    to middleware/dlo.py.
    This is where all middleware RST documentation is defined.

    The overview_large_objects.rst is still the main page
    for information on large objects, so now dynamically
    points to both the DLO and SLO middleware RST
    documentation and the relevant middleware.rst page
    simply points to it.

    Change-Id: I40d918c8b7bc608ab945805d69fe359521df038a
    Closes-bug: #1276852

commit 2996974e5d48b4efaa1b271b8fbd0387bced7242
Author: Ondřej Nový <email address hidden>
Date: Sat Oct 10 14:56:30 2015 +0200

    Script for running unit, func and probe tests at once

    When developing Swift it's often needed to run all tests.
    This script makes it much simpler.

    Change-Id: I67e6f7cc05ebd0475001c1b56e8f6fd09c8c644f

commit c2182fd4163050a5f76eb3dedb7703dc821fa83d
Author: janonymous <email address hidden>
Date: Fri Jul 17 20:20:15 2015 +0530

    Python3: do not use im_self/im_func/func_closure

    Use __self__, __func__ and __closure__ instead, as they work
    with both Python 2 and 3.

    Modifying usage of __func__ in codebase.

    Change-Id: I57e907c28c1d4646605e70194ea3650806730b83

commit c0866ceaac2f69ae01345a795520141f59ec64f5
Author: Samuel Merritt <email address hidden>
Date: Fri Sep 25 17:26:37 2015 -0700

    Improve SLO PUT error checking

    This commit tries to give the user a reason that their SLO manifest
    was invalid instead of just saying "Invalid SLO Manifest File". It
    doesn't get every error condition, but it's better than before.

    Examples of things that now have real error...

tags: added: in-feature-hummingbird
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/swift 2.5.0

This issue was fixed in the openstack/swift 2.5.0 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.