racy problem on fetching swift rings from swift-proxy

Bug #1765203 reported by Drew Freiberger
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Swift Proxy Charm
Fix Released
High
Edward Hope-Morley
OpenStack Swift Storage Charm
Fix Released
High
Edward Hope-Morley

Bug Description

I'm re-deploying swift-storage onto a node that it was previously installed on and finding that swift-storage-relation-changed is trying to pull swift rings via http from a non-leader unit.

This is openstack charms version 18.02 on trusty.

Scenario and possible reproducer:

I have 3 swift-proxy units, let's call them p/0, p/1, p/2, and 12 swift-storage units, z1/0, z1/1, z1/2, z1/3, and same four units for z2 and z3.

I manually took the swift-storage-z1/3 disks out of the rings on the swift-proxy leader (let's say p/2) and then changed min-hours on swift-proxy from 0 to 1 to trigger a charm managed rebalance.

I then stopped the swift-* and rsync services on the node to keep it from participating in any rebalancing efforts.

I then ran juju remove-unit z1/3, let it complete, then juju add-unit z1 --to <same metal>.

At this point, it installed and the rings got updated by swift-proxy and started getting distributed to all swift-proxy and swift-storage nodes other than this new unit, now z1/4. on z1/4, I got a hook failed on swift-storage-relation-changed when relating with p/1 (the non-leader) where it's trying to pull the swift-rings from the webserver and fails because the ring files are named <server>.ring.gz.deleted on that swift-proxy unit.

My supposition is that either swift-proxy after handshaking new keys with the leader needs to drop the rings into each proxy unit web server directory, or swift-storage needs to only try to pull rings from an is-leader=true unit.

in below log 10.101.140.110 is p/1 ad swift-storage-z1-23 is z1/4:

--2018-04-18 21:00:28-- http://10.101.140.110/swift-rings/account.ring.gz
Connecting to 10.101.140.110:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2018-04-18 21:00:28 ERROR 404: Not Found.

--2018-04-18 21:00:30-- http://10.101.140.110/swift-rings/account.ring.gz
Connecting to 10.101.140.110:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2018-04-18 21:00:30 ERROR 404: Not Found.

--2018-04-18 21:00:34-- http://10.101.140.110/swift-rings/account.ring.gz
Connecting to 10.101.140.110:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2018-04-18 21:00:34 ERROR 404: Not Found.

--2018-04-18 21:00:40-- http://10.101.140.110/swift-rings/account.ring.gz
Connecting to 10.101.140.110:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2018-04-18 21:00:40 ERROR 404: Not Found.

Traceback (most recent call last):
  File "hooks/swift-storage-relation-changed", line 268, in <module>
    main()
  File "hooks/swift-storage-relation-changed", line 259, in main
    hooks.execute(sys.argv)
  File "/var/lib/juju/agents/unit-swift-storage-z1-23/charm/hooks/charmhelpers/core/hookenv.py", line 800, in execute
    self._hooks[hook_name]()
  File "/var/lib/juju/agents/unit-swift-storage-z1-23/charm/hooks/charmhelpers/core/host.py", line 708, in wrapped_f
    restart_functions)
  File "/var/lib/juju/agents/unit-swift-storage-z1-23/charm/hooks/charmhelpers/core/host.py", line 730, in restart_on_change_helper
    r = lambda_f()
  File "/var/lib/juju/agents/unit-swift-storage-z1-23/charm/hooks/charmhelpers/core/host.py", line 707, in <lambda>
    (lambda: f(*args, **kwargs)), restart_map, stopstart,
  File "hooks/swift-storage-relation-changed", line 202, in swift_storage_relation_changed
    fetch_swift_rings(rings_url)
  File "/var/lib/juju/agents/unit-swift-storage-z1-23/charm/hooks/charmhelpers/core/decorators.py", line 40, in _retry_on_exception_inner_2
    return f(*args, **kwargs)
  File "/var/lib/juju/agents/unit-swift-storage-z1-23/charm/hooks/lib/swift_storage_utils.py", line 509, in fetch_swift_rings
    check_call(cmd)
  File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['wget', u'http://10.101.140.110/swift-rings/account.ring.gz', '--retry-connrefused', '-t', '10', '-O', '/tmp/swiftringsSgVPZd/account.ring.gz']' returned non-zero exit status 8

Revision history for this message
Drew Freiberger (afreiberger) wrote :

oddly, what it seems like, when querying relation-get on the 3 proxy units is that p/0 just shared private-address, but both p/1 and p/2 are advertising all of the relation info including rings_url, rsync_allowed_hosts, swift_hash, timestamp, and trigger, and the triggers don't match.

Perhaps this is stuck relation info from a prior swift-proxy ring-balance disconnect.

Revision history for this message
Drew Freiberger (afreiberger) wrote :

actually, in a healthy environment, it appears all proxy units broadcast their own IP for ring_url in the relation, so this bug appears valid to me.

Revision history for this message
Drew Freiberger (afreiberger) wrote :

From what I can tell after repeating getting this when adding new swift-storage nodes, when swift-storage-relation-changed is called if a swift-proxy unit is not leader, it stops swift-proxy service and appears to move the rings to /var/www/html/swift-rings/*.ring.gz.deleted for safety while it waits for the leader to manage the relation and ring changes.

This means if the relation-changed kicks for the non-leaders before the leader has completed it's tasks for ring management, the swift-storage charm will hang the hook relation-changed to the non-leader proxies.

Reproducer might be:

deploy swift-proxy in HA format
add three swift-storage zone charms
add new unit to swift-storage zones
Should see new swift-storage units go into hook failure. issue in logs is 404 not found from wget from swift-storage-zX/$new to swift-proxy/$not_leader.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Hi Drew, I think I see what;s going on here and there are a few things to be fixed. The swift-storage units need to catch the wget error and skip the ring update if that happens since what is actually happening is that they are pulling from a proxy unit that is no longer the leader but has not cleared the rings_url setting from its relations (but luckily it does mark its local copies as .deleted so there;s no risk of accidental update from an old copy). So this will likely require fixes to both swift-proxy (to clear settings when no longer ringleader) and swift-storage (to tolerate stale rings_url and wait for good one which should come eventually). Please note that when the swift-proxy leader switches there is inevitably a small lag until the ring manager re-establishes itself on the new leader unit - all units should eventually have sane copies of rings and builders but this of course takes time to sync following a Juju leader change.

Changed in charm-swift-proxy:
milestone: none → 19.04
Changed in charm-swift-storage:
milestone: none → 19.04
Changed in charm-swift-proxy:
importance: Undecided → High
Changed in charm-swift-storage:
importance: Undecided → High
Changed in charm-swift-proxy:
assignee: nobody → Edward Hope-Morley (hopem)
Changed in charm-swift-storage:
assignee: nobody → Edward Hope-Morley (hopem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-swift-proxy (master)

Fix proposed to branch: master
Review: https://review.openstack.org/623215

Changed in charm-swift-proxy:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-swift-storage (master)

Fix proposed to branch: master
Review: https://review.openstack.org/623220

Changed in charm-swift-storage:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-swift-proxy (master)

Reviewed: https://review.openstack.org/623215
Committed: https://git.openstack.org/cgit/openstack/charm-swift-proxy/commit/?id=f25d2c2d7f721cc12e04c63e1a180eeda2d64a7b
Submitter: Zuul
Branch: master

commit f25d2c2d7f721cc12e04c63e1a180eeda2d64a7b
Author: Edward Hope-Morley <email address hidden>
Date: Thu Dec 6 14:08:53 2018 +0000

    Cleanup ring manager storage relation settings

    If local unit is no longer leader, clear rings_url on
    storage relations to avoid storage units getting
    rings from wrong proxy unit.

    Also send broker-timestamp to storage units when providing
    rings_url so that we have a means of knowing which is the most
    recent. Broker timestamp is the same for peer and storage
    sync so this enables identifying most recent.

    Change-Id: I2c7e9028f345791bad0a736cb89979284b144e33
    Closes-Bug: #1765203

Changed in charm-swift-proxy:
status: In Progress → Fix Committed
Changed in charm-swift-storage:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-swift-storage (master)

Reviewed: https://review.openstack.org/623220
Committed: https://git.openstack.org/cgit/openstack/charm-swift-storage/commit/?id=ae6826734fb56a2358bc528550fff54e5ada5d8b
Submitter: Zuul
Branch: master

commit ae6826734fb56a2358bc528550fff54e5ada5d8b
Author: Edward Hope-Morley <email address hidden>
Date: Thu Dec 6 14:23:44 2018 +0000

    Catch exception of ring sync fails

    It is possible for swift-storage units to attempt
    to request rings from a proxy unit that is no longer
    serving them so instead of raising an exception we
    catch it and move on since there will likely be a
    another proxy notification waiting to be consumed.

    Change-Id: Ib2e634d2ed3509bfe2aa9b792cc17c2ed89029f1
    Closes-Bug: #1765203

tags: added: stable-backport
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-swift-proxy (stable/18.11)

Fix proposed to branch: stable/18.11
Review: https://review.openstack.org/624640

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-swift-storage (stable/18.11)

Fix proposed to branch: stable/18.11
Review: https://review.openstack.org/624642

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-swift-storage (stable/18.11)

Reviewed: https://review.openstack.org/624642
Committed: https://git.openstack.org/cgit/openstack/charm-swift-storage/commit/?id=38f7e97fe6e992c1ada9b399b3d23343b4a42190
Submitter: Zuul
Branch: stable/18.11

commit 38f7e97fe6e992c1ada9b399b3d23343b4a42190
Author: Edward Hope-Morley <email address hidden>
Date: Thu Dec 6 14:23:44 2018 +0000

    Catch exception of ring sync fails

    It is possible for swift-storage units to attempt
    to request rings from a proxy unit that is no longer
    serving them so instead of raising an exception we
    catch it and move on since there will likely be a
    another proxy notification waiting to be consumed.

    Change-Id: Ib2e634d2ed3509bfe2aa9b792cc17c2ed89029f1
    Closes-Bug: #1765203
    (cherry picked from commit ae6826734fb56a2358bc528550fff54e5ada5d8b)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-swift-proxy (stable/18.11)

Reviewed: https://review.openstack.org/624640
Committed: https://git.openstack.org/cgit/openstack/charm-swift-proxy/commit/?id=658587d1144c42f249bee69aa67a579ca862554a
Submitter: Zuul
Branch: stable/18.11

commit 658587d1144c42f249bee69aa67a579ca862554a
Author: Edward Hope-Morley <email address hidden>
Date: Thu Dec 6 14:08:53 2018 +0000

    Cleanup ring manager storage relation settings

    If local unit is no longer leader, clear rings_url on
    storage relations to avoid storage units getting
    rings from wrong proxy unit.

    Also send broker-timestamp to storage units when providing
    rings_url so that we have a means of knowing which is the most
    recent. Broker timestamp is the same for peer and storage
    sync so this enables identifying most recent.

    Change-Id: I2c7e9028f345791bad0a736cb89979284b144e33
    Closes-Bug: #1765203
    (cherry picked from commit f25d2c2d7f721cc12e04c63e1a180eeda2d64a7b)

Changed in charm-swift-proxy:
status: Fix Committed → Fix Released
Changed in charm-swift-storage:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.