EC: Client Disconnect leaves inaccessible data on disk

Bug #1496205 reported by clayg
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Object Storage (swift)
Fix Released
Critical
clayg

Bug Description

If a client disconnects from a regular content-length put without sending all of the data a non-durable datafile is left on disk [1]

Since we're not returning non-druable fragments the proxy will still 404 a subsequent GET - but the data on disk should not be there - that object is not valid.

I attached a python script to play around with the disconnect scenario - but you still need to `find /srv/node*/sdb*/object* -name \*.data` to see the on disk fragments.

This is probably the root cause of lp bug #1469094 - I found it riffing on tests Kota's attached to patch 211338 [2]

I'll probably try to codify this into failing probetest tomorrow if the overnight crew doesn't beat me to it.

1. replication on disconnect, I verified, will not only *not* drop the .data file in to the hashpath - it'll even clean up the tmpfile!
2. https://review.openstack.org/#/c/211338/ => http://paste.openstack.org/show/449973/

Revision history for this message
clayg (clay-gerrard) wrote :

I like this test - it uses a real proxy and a real object server on a real socket to really disconnect - and then it looks at the entire state of the world on-disk to show whats wrong. It has a test for repl that passes. It re-phrases the same test for ec with little code duplication and it fails. it reads ok I think. it cleans up after itself.

They say most of the time writing the test is the hard part... although I'm not sure on this one. We'll see!

Revision history for this message
clayg (clay-gerrard) wrote :

The earlier script to demonstrate disconnect had some math/comment bugs - this one is better-ish.

Worse (and I don't understand this at all) - I'm getting .durable files even tho the proxy logged 499?

clayg (clay-gerrard)
tags: added: ec
Bill Huber (wbhuber)
Changed in swift:
status: New → Confirmed
Revision history for this message
Minwoo Bae (minwoob) wrote :

Originally, the script that Clay posted didn't work for me. Try this if it helps:

Changed in swift:
importance: Undecided → Critical
Revision history for this message
Matthew Oliver (matt-0) wrote :

Clay has a patch to fix this bug, for some reason lauchpad hasn't updated. The patch is: https://review.openstack.org/#/c/225357/

I have tested it against the script clay (which uses the an EC policy called ec) or Minwoo's which uses the standard SAIO policy ec42. And is cleaning up as expected.

Changed in swift:
assignee: nobody → clayg (clay-gerrard)
status: Confirmed → In Progress
Revision history for this message
Matthew Oliver (matt-0) wrote :

Since launchpad failed to update the bug, I have, assigned it to clay and marked in progress.

I have tested his patch as it does clean up. So I for one will make sure the patch is reviewed. Noting because LP failed I was going to grab this and try and fix it.. but clay is a machine, go clay!

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.openstack.org/225357
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=3afdcf6b8fa74051fb6da0bf24799c80399624fd
Submitter: Jenkins
Branch: master

commit 3afdcf6b8fa74051fb6da0bf24799c80399624fd
Author: Clay Gerrard <email address hidden>
Date: Thu Sep 17 09:54:30 2015 -0700

    Fix proxy handling of EC client disconnect

    The ECObjectController was unconditionally sending down the frag archive
    commit document after the client source stream terminated - even if the
    client disconnected early.

    We can detect early disconnect in two ways:

      1. Content-Length and not enough bytes_transfered

         When eventlet.wsgi is reading from a Content-Length body the
         readable returns the empty string and our iterable raises
         StopIteration - but we can check content-length against
         bytes_transfered and know if the client disconnected.

      2. Transfer-Encoding: chunked - w/o a 0\r\n\r\n

         When eventlet.wsgi is reading from a Transfer-Encoding: chunked
         body the socket read returns the empty string, eventlet.wsgi's
         chunked parser raises ValueError (which we translate to
         ChunkReadError*) and we know we know the client disconnected.

    ... if we detect either of these conditions the proxy should:

      1. *not* send down the commit document to object servers
      2. disconnect from backend servers
      3. log the client disconnect

    Oddly the code on master was only messing up the first part. Backend
    connections were terminated (gracefully after the commit document), and
    then the disconnect was being logged as 499.

    So now we only send down the commit document on a successful complete
    client HTTP request (either whole Content-Length, or clean
    Transfer-Encoding: chunked 0\r\n\r\n).

     * To detect the early disconnect on Transfer-Encoding: chunked a new
    swift.common.exceptions.ChunkReadError is used to translate
    eventlet.wsgi's more general IOError and ValueErrors into something
    more appropriate to catch and handle closer to our generic
    ChunkReadTimeout handling.

    Co-Author: Alistair Coles <email address hidden>
    Closes-Bug: #1496205
    Change-Id: I028a530aba82d50baa4ee1d05ddce18d4cce4e81

Changed in swift:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to swift (master)

Reviewed: https://review.openstack.org/211338
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=696186c680540606a8cea6495b39e23085b6863f
Submitter: Jenkins
Branch: master

commit 696186c680540606a8cea6495b39e23085b6863f
Author: paul luse <email address hidden>
Date: Mon Aug 10 14:37:10 2015 -0700

    Better error handling for EC PUT path when client goes away

    There are a few places in the PUT path where the object server is
    reading WSGI input and can find that there's nothing there. e.g. in the
    middle of a 2 phase commit and the proxy goes away for whatever reason,
    like maybe it timed out because things are really busy. Anyway, this
    results in the ugly ValueError coming out of eventlet.wsgi about a
    zillion levels away from the PUT path.

    Expanding on the test cases from lp bug #1496205 and lp bug #1469094
    this change carefully narrows into our read/readline calls to
    wsgi_input and makes sure to tranlsate the ValueError to a
    ChunkReadError - which the object.server can handle along with
    ChunkReadTimeout. When it made sense, this change attempts to stay
    consistent throughout the code path in logging/raising client disconnect
    instead of timeout.

    It's unfortunate the error coming out of eventlet is so generic, but
    that will be improved in future versions [1].

    1. https://github.com/eventlet/eventlet/commit/c3ce3eef0b4d0dfdbfb1ec0186d4bb204fb8ecd5

    Related-Bug: #1469094
    Related-Bug: #1496205
    Co-Authored-By: Clay Gerrard <email address hidden>
    Change-Id: I9e4dbf26623c0c6fc5c87afd14349466aa157385

Thierry Carrez (ttx)
Changed in swift:
milestone: none → 2.5.0
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to swift (feature/crypto)

Related fix proposed to branch: feature/crypto
Review: https://review.openstack.org/234065

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to swift (feature/crypto)
Download full text (36.8 KiB)

Reviewed: https://review.openstack.org/234065
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=c80229fd86853f5f8541aeef4b5044170572640d
Submitter: Jenkins
Branch: feature/crypto

commit 9cafa472a336f66d149a20c12f4251703d96f04d
Author: Ondřej Nový <email address hidden>
Date: Sat Oct 10 20:57:07 2015 +0200

    Autodetect systemctl in SAIO and use it on systemd distros

    Change-Id: I84a9b27baac89327749d8774032860f8ad5166f2

commit 92767f28d668643bc2affee7b2fd46fd9349656a
Author: Emile Snyder <email address hidden>
Date: Sun Oct 11 21:24:54 2015 -0700

    Fix 'swift-ring-builder write_builder' after you remove a device

    clayg already posted the code fix in the bug, but noted it needs a test.

    Closes-Bug: #1487280
    Change-Id: I07317754afac7165baac4e696f07daeba2e72adc

commit a48649002970b2150d24d0622a100f54045443c5
Author: Lisak, Peter <email address hidden>
Date: Mon Oct 12 14:42:01 2015 +0200

    swift-recon fails with socket.timeout exception

    If some server is overloaded or timeout set too low, swift-recon fails with
    raised socket.timeout exception.

    This error should be processed the same way as HTTPError/URLError.

    Change-Id: Ide8843977ab224fa866097d0f0b765d6899c66b8

commit 767fac8186ea4541f4466ac9a55c03abea6a878b
Author: Christian Schwede <email address hidden>
Date: Mon Oct 12 07:09:00 2015 +0000

    Enable H234 check (assertEquals is deprecated, use assertEqual)

    All usages of assertEquals and assertNotEquals are fixed now, so let's enable
    the H234 check to avoid regressions in the future.

    Change-Id: I2c2ccb3b268cf9eb11f2db045378ab125a02bc31

commit 1882801be1d8983cd718786bd409cf09f65a00b0
Author: janonymous <email address hidden>
Date: Mon Aug 31 21:49:49 2015 +0530

    pep8 fix: assertNotEquals -> assertNotEqual

    assertNotEquals is deprecated in py3

    Change-Id: Ib611351987bed1199fb8f73a750955a61d022d0a

commit f5f9d791b0b8b32350bd9a47fbc00ff86a65f09d
Author: janonymous <email address hidden>
Date: Wed Aug 5 23:58:14 2015 +0530

    pep8 fix: assertEquals -> assertEqual

    assertEquals is deprecated in py3, replacing it.

    Change-Id: Ida206abbb13c320095bb9e3b25a2b66cc31bfba8
    Co-Authored-By: Ondřej Nový <email address hidden>

commit 1ba7641c794104de57e5010f76cecbf146a2a63b
Author: Zack M. Davis <email address hidden>
Date: Thu Oct 8 16:16:18 2015 -0700

    minutæ: port ClientException tweaks from swiftclient; dict .pop

    openstack/python-swiftclient@5ae4b423 changed python-swiftclient's
    ClientException to have its http_status attribute default to
    None (rather than 0) and to use super in its __init__ method. For
    consistency's sake, it's nice for Swift's inlined copy of
    ClientException to receive the same patch. Also, the retry function in
    direct_client (a major user of ClientException) was using a somewhat
    awkward conditional-assignment-and-delete construction where the .pop
    method of dictionaries would be more idiomatic.

    Change-Id: I70a12f934f84f57549617af28b86f7f5637bd8fa

commit 01f9d15045129d09...

tags: added: in-feature-crypto
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to swift (feature/hummingbird)

Related fix proposed to branch: feature/hummingbird
Review: https://review.openstack.org/236162

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to swift (feature/hummingbird)
Download full text (52.0 KiB)

Reviewed: https://review.openstack.org/236162
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=18ddcaf0d6b67fcbb6b0a4cf4a9a99c72f6f3a08
Submitter: Jenkins
Branch: feature/hummingbird

commit a9ddc2d9ea402eaac7ccd8992387f77855968ab5
Author: Mahati Chamarthy <email address hidden>
Date: Fri Oct 16 18:18:33 2015 +0530

    Hyperlink fix to first contribution doc

    Change-Id: I19fc1abc89f888233b80a57c68a152c1c1758640

commit 83a1151d13e096b480aefe6ec18259f2d7d021db
Author: Pete Zaitcev <email address hidden>
Date: Fri Oct 9 16:45:20 2015 -0600

    Interpolate the explanation string not whole HTML body

    The only reason this exists is that I promised to do it.
    But in our case, there's no big advantage, and here's why.

    The general thinking goes that strings must be interpolated
    first because the body may contain a syntax that confuses the
    interpolation. So this patch makes the code "more correct".
    However, our HTML template is tightly controlled. It's not
    like it contains additional percents.

    So I'll just leave this here for now while I'm asking if
    the content type is set correctly.

    Change-Id: Ia18aeb0f94ef389f8b95450986a566e5fa06aa10

commit 384b91eb824376659989b904f9396cbf2e02d2bd
Author: asettle <email address hidden>
Date: Thu Sep 3 15:11:46 2015 +1000

    Moving DLO functionality doc to the middleware code

    This change moves the RST DLO documentation from
    statically inside overview_large_objects.rst and moves it
    to middleware/dlo.py.
    This is where all middleware RST documentation is defined.

    The overview_large_objects.rst is still the main page
    for information on large objects, so now dynamically
    points to both the DLO and SLO middleware RST
    documentation and the relevant middleware.rst page
    simply points to it.

    Change-Id: I40d918c8b7bc608ab945805d69fe359521df038a
    Closes-bug: #1276852

commit 2996974e5d48b4efaa1b271b8fbd0387bced7242
Author: Ondřej Nový <email address hidden>
Date: Sat Oct 10 14:56:30 2015 +0200

    Script for running unit, func and probe tests at once

    When developing Swift it's often needed to run all tests.
    This script makes it much simpler.

    Change-Id: I67e6f7cc05ebd0475001c1b56e8f6fd09c8c644f

commit c2182fd4163050a5f76eb3dedb7703dc821fa83d
Author: janonymous <email address hidden>
Date: Fri Jul 17 20:20:15 2015 +0530

    Python3: do not use im_self/im_func/func_closure

    Use __self__, __func__ and __closure__ instead, as they work
    with both Python 2 and 3.

    Modifying usage of __func__ in codebase.

    Change-Id: I57e907c28c1d4646605e70194ea3650806730b83

commit c0866ceaac2f69ae01345a795520141f59ec64f5
Author: Samuel Merritt <email address hidden>
Date: Fri Sep 25 17:26:37 2015 -0700

    Improve SLO PUT error checking

    This commit tries to give the user a reason that their SLO manifest
    was invalid instead of just saying "Invalid SLO Manifest File". It
    doesn't get every error condition, but it's better than before.

    Examples of things that now have real error...

tags: added: in-feature-hummingbird
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/swift 2.5.0

This issue was fixed in the openstack/swift 2.5.0 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.