ssync doesn't work with replication_server = true

Bug #1446873 reported by clayg
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Object Storage (swift)
Fix Released
Medium
Unassigned

Bug Description

All the background replication daemons should use the replication network defined in the ring. The reconstructor does for REPLICATION and SSYNC calls - but not for GET's to other nodes while rebuilding other fragment archives in reconstruct_fa.

Tags: ec
Changed in swift:
milestone: none → 2.3.0-rc2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/176110

Changed in swift:
milestone: 2.3.0-rc2 → next-liberty
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on swift (stable/kilo)

Change abandoned by John Dickinson (<email address hidden>) on branch: stable/kilo
Review: https://review.openstack.org/176110
Reason: nope

Revision history for this message
clayg (clay-gerrard) wrote : Re: Object Reconstructor doesn't always use replication_ip
Download full text (3.5 KiB)

Hey, so I'm going to hijack this bug since ssync just doesn't work with replication_server = true

vagrant-swift-all-in-one has a branch to setup separate replication servers on different ports [1].

If you use that and try to ssync with some objects that need to move you'll find that the sub-requests inside of sync get 405'd

  May 7 04:18:45 saio object-7040: 127.0.0.1 - - [07/May/2015:04:18:45 +0000] "SSYNC /sdb4/219" 200 - "-" "-" "-" 0.0001 "-" 23181 0
  May 7 04:18:45 saio object-7040: None - - [07/May/2015:04:18:45 +0000] "PUT /sdb4/219/AUTH_test/new-test/115677-B-obj" 405 - "-" "-" "-" 0.0001 "-" 23181 0
  May 7 04:18:46 saio object-7020: 127.0.0.1:7040/sdb4/219 Early disconnect
  May 7 04:18:46 saio object-7040: STDERR: Traceback (most recent call last):#012 File "/usr/local/lib/python2.7/dist-packages/eventlet/wsgi.py", line 483, in handle_one_response#012
   write(b''.join(towrite))#012 File "/usr/local/lib/python2.7/dist-packages/eventlet/wsgi.py", line 426, in write#012 _writelines(towrite)#012 File "/usr/lib/python2.7/socket.py", l
  ine 334, in writelines#012 self.flush()#012 File "/usr/lib/python2.7/socket.py", line 303, in flush#012 self._sock.sendall(view[write_offset:write_offset+buffer_size])#012 File
  "/usr/local/lib/python2.7/dist-packages/eventlet/greenio/base.py", line 376, in sendall#012 tail = self.send(data, flags)#012 File "/usr/local/lib/python2.7/dist-packages/eventlet/g
  reenio/base.py", line 359, in send#012 total_sent += fd.send(data[total_sent:], flags)#012error: [Errno 32] Broken pipe
  May 7 04:18:46 saio object-7040: STDERR: Traceback (most recent call last):
  May 7 04:18:46 saio object-7040: STDERR: File "/usr/local/lib/python2.7/dist-packages/eventlet/greenpool.py", line 82, in _spawn_n_impl
  May 7 04:18:46 saio object-7040: STDERR: func(*args, **kwargs)
  May 7 04:18:46 saio object-7040: STDERR: File "/usr/local/lib/python2.7/dist-packages/eventlet/wsgi.py", line 686, in process_request
  May 7 04:18:46 saio object-7040: STDERR: proto.__init__(sock, address, self)
  May 7 04:18:46 saio object-7040: STDERR: File "/usr/lib/python2.7/SocketServer.py", line 649, in __init__
  May 7 04:18:46 saio object-7040: STDERR: self.handle()
  May 7 04:18:46 saio object-7040: STDERR: File "/usr/lib/python2.7/BaseHTTPServer.py", line 340, in handle
  May 7 04:18:46 saio object-7040: STDERR: self.handle_one_request()
  May 7 04:18:46 saio object-7040: STDERR: File "/usr/local/lib/python2.7/dist-packages/eventlet/wsgi.py", line 358, in handle_one_request
  May 7 04:18:46 saio object-7040: STDERR: self.handle_one_response()
  May 7 04:18:46 saio object-7040: STDERR: File "/usr/local/lib/python2.7/dist-packages/eventlet/wsgi.py", line 507, in handle_one_response
  May 7 04:18:46 saio object-7040: STDERR: while self.environ['eventlet.input'].read(MINIMUM_CHUNK_SIZE):
  May 7 04:18:46 saio object-7040: STDERR: File "/usr/local/lib/python2.7/dist-packages/eventlet/wsgi.py", line 189, in read
  May 7 04:18:46 saio object-7040: STDERR: return self._chunked_read(self.rfile, length)
  May 7 04:18:46 saio object-7040: STDERR: File "/usr/local/lib/python2.7/dist-packages/eventlet/wsgi...

Read more...

summary: - Object Reconstructor doesn't always use replication_ip
+ ssync doesn't work with replication_server = true
Revision history for this message
clayg (clay-gerrard) wrote :

oh, i should say as a work around you can comment out the replication_server line in your replication server's config

clayg (clay-gerrard)
tags: added: ec
Revision history for this message
clayg (clay-gerrard) wrote :

@acoles @pluse @swifterdarrell should all be able to confirm this bug

As far as priority, for people currently running with with replication networks for rsync this is a big blocker to using EC (or ssync).

The work around to comment out replication_server = True in your replication server's config file has no negative impact [1] - and begs the question why do we even have the config option to make replication servers not handle the real request methods (as is needed for ssync and the reconstructor).

replication_server = False seems like a sane option for the object servers processes in the client data path - meaning don't answer replication/ssync requests.

But rather than having a three state switch - maybe we only need two - either you can serve all requests (i.e. the *default*), or you can serve all requests except replication/ssync (i.e. replication_server = False).

1. Can any one think of any harm to get rid of the "only serve replication/ssync requests" option that only actually *works* in the one replication scenario where rsync, not the object server, is the primary process for moving data?

Revision history for this message
Samuel Merritt (torgomatic) wrote :

I think Clay's right. Replication needs pretty much all the verbs that normal traffic does if we're using ssync, so having the three-state switch is pretty useless. The three states wind up being "client traffic, no replication", "client or replication traffic", and "horribly broken".

I mean, I can sort of see wanting an object server that only does replication *stuff*, so if client traffic winds up getting there then it's rejected. What we have now doesn't do that. Either we need to fix up that third state so it's not horribly broken, or we should just get rid of it.

I usually lean towards getting rid of things, FWIW.

clayg (clay-gerrard)
Changed in swift:
importance: Undecided → Low
importance: Low → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on swift (master)

Change abandoned by Clay Gerrard (<email address hidden>) on branch: master
Review: https://review.openstack.org/176090
Reason: better plan in https://bugs.launchpad.net/swift/+bug/1446873

Changed in swift:
status: New → Confirmed
Revision history for this message
paul luse (paul-e-luse) wrote :

so the last abandoned change update confuses me - what are we doing with this and is there a patch proposed to implement it or not?

Thierry Carrez (ttx)
Changed in swift:
milestone: 2.4.0 → next-liberty
Revision history for this message
Matthew Oliver (matt-0) wrote :

So to paraphrase, replication_server = True fails because it's stopping Ssync and reconstructure from using verbs other than REPLICATE.

Now that these other replication methods are required replication_server isn't really useful anymore, so maybe it should be removed, anyone who wants to use a second object server for replication only should make sure it's IP and port is specified as the replication IP/ replication port in the ring so user traffic uses it. Alternatively, if we want to filter out user traffic, we'd need to filter by swift_source or a different tx_id or something.

Thierry Carrez (ttx)
Changed in swift:
milestone: 2.5.0 → none
Revision history for this message
clayg (clay-gerrard) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.opendev.org/740867
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=9eb81f6e698ef8ac8855c413b0b1bf4e030cb687
Submitter: Zuul
Branch: master

commit 9eb81f6e698ef8ac8855c413b0b1bf4e030cb687
Author: Tim Burke <email address hidden>
Date: Tue Jul 7 21:28:36 2020 -0700

    Allow replication servers to handle all request methods

    Previously, the replication_server setting could take one of three
    states:

     * If unspecified, the server would handle all available methods.
     * If "true", "yes", "on", etc. it would only handle replication
       methods (REPLICATE, SSYNC).
     * If any other value (including blank), it would only handle
       non-replication methods.

    However, because SSYNC tunnels PUTs, POSTs, and DELETEs through
    the same object-server app that's responding to SSYNC, setting
    `replication_server = true` would break the protocol. This has
    been the case ever since ssync was introduced.

    Now, get rid of that second state -- operators can still set
    `replication_server = false` as a principle-of-least-privilege guard
    to ensure proxy-servers can't make replication requests, but replication
    servers will be able to serve all traffic. This will allow replication
    servers to be used as general internal-to-the-cluster endpoints, leaving
    non-replication servers to handle client-driven traffic.

    Closes-Bug: #1446873
    Change-Id: Ica2b41a52d11cb10c94fa8ad780a201318c4fc87

Changed in swift:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to swift (master)

Reviewed: https://review.opendev.org/735751
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=2a6dfae2f3d50750de2bdb5f31ea52070588e895
Submitter: Zuul
Branch: master

commit 2a6dfae2f3d50750de2bdb5f31ea52070588e895
Author: Tim Burke <email address hidden>
Date: Mon Jun 15 17:09:15 2020 -0700

    Allow direct and internal clients to use the replication network

    A new header `X-Backend-Use-Replication-Network` is added; if true, use
    the replication network instead of the client-data-path network.

    Several background daemons are updated to use the replication network:

      * account-reaper
      * container-reconciler
      * container-sharder
      * container-sync
      * object-expirer

    Note that if container-sync is being used to sync data within the same
    cluster, the replication network will only be used when communicating
    with the "source" container; the "destination" traffic will continue to
    use the configured realm endpoint.

    The direct and internal client APIs still default to using the
    client-data-path network; this maintains backwards compatibility for
    external tools written against them.

    UpgradeImpact
    =============

    Until recently, servers configured with

      replication_server = true

    would only handle REPLICATE (and, in the case of object servers, SSYNC)
    requests, and would respond 405 Method Not Allowed to other requests.
    When upgrading from Swift 2.25.0 or earlier, remove the config option
    and restart services prior to upgrade to avoid a flood of background
    daemon errors in logs.

    Note that some background daemons find work by querying Swift rather
    than walking local drives that should be available on the replication
    network:

      * container-reconciler
      * object-expirer

    Previosuly these may have been configured without access to the
    replication network; ensure they have access before upgrading.

    Closes-Bug: #1883302
    Related-Bug: #1446873
    Related-Change: Ica2b41a52d11cb10c94fa8ad780a201318c4fc87
    Change-Id: Ieef534bf5d5fb53602e875b51c15ef565882fbff

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.