Disabling management net on a single swift proxy node leads to a very long swift response time

Bug #1459772 reported by Dmitry Mescheryakov on 2015-05-28
This bug affects 1 person
Affects                   Importance  Assigned to
Fuel for OpenStack        High        Bogdan Dobrelya
Fuel for OpenStack 6.0.x  High        MOS Maintenance
Mirantis OpenStack        High        Vladimir Kuklin

Bug Description

Version: 6.1, ISO #474.
Full version available at http://paste.openstack.org/show/242594/

Steps to reproduce:
1. Install an environment with Swift, with 3 controllers and 1 compute node
2. Connect to one of the controllers and disable the management network on it using the following command:
      iptables -I INPUT -i br-mgmt -j DROP && iptables -I OUTPUT -o br-mgmt -j DROP
3. Connect to _another_ controller and run the command 'swift list' 10 times.

Sometimes the command takes a long time - more than a minute. On average, when it happens, the response returns in 70 seconds. It might happen on every invocation, or every 2nd or 3rd one, depending on circumstances I do not understand.

Analysis:

The issue occurs when haproxy sends a user's Swift request to the firewalled node. Swift on that node tries to validate the user's token and times out because it cannot connect to Keystone's admin URL (which is on the management net). Haproxy waits one minute for a response, and then resends the request to another node. As a result, the request takes slightly more than a minute to be processed.

A similar issue would happen with other OpenStack components, but haproxy detects that all services on the node except Swift are dead. Haproxy detects service failure by accessing the service's endpoint, which listens on the management (br-mgmt) network, which is firewalled. Swift's endpoint listens on the storage interface (br-storage), so haproxy thinks Swift is alive on the firewalled node.

In general, the problem is that haproxy's health checks are too 'weak' - it is not enough to check that the service's port is accessible. We probably need to temporarily disable a service on a node if it constantly fails.
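For illustration, a stronger check could be an HTTP-level one instead of a bare TCP connect. A minimal sketch of such a backend section in haproxy configuration (the backend name, addresses, port, and check timings below are illustrative, not the actual Fuel values):

```
backend swift
  balance roundrobin
  # probe an HTTP endpoint instead of only checking that the port accepts connections
  option httpchk GET /healthcheck
  # mark a server DOWN after 3 failed checks, UP again after 2 successful ones
  server node-1 192.168.1.3:8080 check inter 10s fall 3 rise 2
  server node-2 192.168.1.4:8080 check inter 10s fall 3 rise 2
```

Note that, as discussed later in this bug, a plain httpchk against the storage-network endpoint is still not enough here, since that endpoint keeps answering OK while the management network is firewalled.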

Attached is a snapshot of the environment in which the management interface of one node (node-2) was firewalled. In the haproxy log of node-1 you can see how swift requests were handled. Also, in the swift-proxy log of node-2 you can find swift trying to connect to keystone. The snapshot can be downloaded from: https://drive.google.com/file/d/0B_TRgCViR_cIQVpLQXJ5aVlnUTQ/view?usp=sharing

Changed in mos:
importance: Undecided → High
milestone: none → 6.1
description: updated
description: updated
description: updated
Dmitry Mescheryakov (dmitrymex) wrote :

Library people, please take a look at the issue - can you suggest a fix viable for 6.1? If not, I suggest moving the issue to 7.0, as this is not a very common failure scenario.

Changed in mos:
assignee: nobody → Fuel Library Team (fuel-library)
status: New → Confirmed
Mike Scherbakov (mihgen) wrote :

This can be a serious issue in a real deployment. Imagine that you've got a broken management wire, or a fried port on a switch.

Vladimir Kuklin (vkuklin) wrote :

Dima, thanks - this is a good test case. I added a tag to it. Actually, the issue currently affects only swift: our storage network is not broken in this case, but the mgmt net is. This means that sometimes requests will land on a swift backend which works at L4 but does not work at L7. I think we can fix it by adding httpchk to the swift backend.

tags: added: to-be-covered-by-system-tests
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
milestone: none → 6.1
importance: Undecided → High
status: New → Triaged
no longer affects: mos
tags: added: low-hanging-fruit
Nastya Urlapova (aurlapova) wrote :

@Vova, maybe it will surprise you, but we have covered this test since version 4.x: https://github.com/stackforge/fuel-qa/blob/master/fuelweb_test/tests/tests_strength/test_failover_base.py#L186
Dima investigated our env.

@Dima, thank you!

tags: removed: low-hanging-fruit to-be-covered-by-system-tests
Bogdan Dobrelya (bogdando) wrote :

@Nastya, if this can be fixed by adding httpchk to the swift backend, then it is a low-hanging fruit. Otherwise, please also remove the triaged status.

tags: added: low-hanging-fruit
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
Bogdan Dobrelya (bogdando) wrote :

Confirmed the issue:
A) 3 of 3 online:
# time swift post Container_test:
real 0m1.585s
user 0m0.234s
sys 0m0.018s

# time swift upload Container_test: /etc/swift/*.conf
/etc/swift/object-server.conf
/etc/swift/proxy-server.conf
/etc/swift/container-server.conf
/etc/swift/account-server.conf
/etc/swift/swift.conf
real 0m1.850s
user 0m0.259s
sys 0m0.041s

B) 2 of 3 online only:
# time swift post Container_test:
real 0m21.304s
user 0m0.206s
sys 0m0.049s

# time swift upload Container_test: /etc/swift/*.conf
/etc/swift/swift.conf
/etc/swift/proxy-server.conf
/etc/swift/account-server.conf
/etc/swift/object-server.conf
/etc/swift/container-server.conf [after 2 attempts]
real 5m21.111s
user 0m0.482s
sys 0m0.157s

Bogdan Dobrelya (bogdando) wrote :

@Nastya, please confirm whether the given test case verifies swift as well. Until then, I am adding the to-be-covered tag.

tags: added: to-be-covered-by-tests
tags: added: swift
tags: removed: low-hanging-fruit

Fix proposed to branch: master
Review: https://review.openstack.org/186815

Changed in fuel:
status: Triaged → In Progress

How-to test:
A. How to check the swiftcheck script:
On each swift proxy node, the command
 curl -XGET http://localhost:49001
should return OK if the node:
- is able to contact the management VIP by ICMP
- reports OK for the swift healthcheck via the node's storage address
and should return an error if either of the checks above fails.
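The decision logic of such a check script can be sketched as follows. This is a hypothetical illustration, not the actual swiftcheck script from the fix: the probe commands, the Swift port, and the /healthcheck path are assumptions.

```python
import subprocess
import urllib.request


def ping_ok(host, count=3, timeout=5):
    """Probe the management VIP with ICMP (hypothetical probe command)."""
    cmd = ["ping", "-c", str(count), "-W", str(timeout), host]
    return subprocess.call(cmd, stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL) == 0


def swift_healthcheck_ok(storage_addr, port=8080, timeout=5):
    """Probe Swift's healthcheck endpoint via the node's storage address."""
    url = "http://%s:%d/healthcheck" % (storage_addr, port)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def check_status(mgmt_vip_reachable, swift_ok):
    """Combine the probe results into the HTTP status the checker returns:
    200 makes haproxy keep the backend UP, 503 makes it mark it DOWN."""
    if mgmt_vip_reachable and swift_ok:
        return 200
    return 503
```

The key point is that both probes must pass: a node whose storage network still answers but whose management net is firewalled fails the VIP ping and is reported as 503, so haproxy stops routing swift requests to it.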

B. How to check the swift proxy node control plane failover:
To get the swift nodes' status in HAProxy, use the command
 ip netns exec haproxy curl 'http://localhost:10000/;csv' | grep swift
executed on the node running the management VIP.
- The command
 iptables -I INPUT 1 -i br-mgmt -j DROP && iptables -I OUTPUT 1 -o br-mgmt -j DROP
should mark the node down within 30 seconds.
- After that, the commands
 time swift post Container_test
 time swift upload Container_test /etc/swift/*.conf
 time swift list Container_test
 time swift delete Container_test
should return results with a reasonable delay (from seconds to a few tens of seconds)
and should *not* return with a significant delay (from a minute to tens of minutes)

C. After step B is done, how to check the swift proxy node control plane failback:
- The command
 iptables -D INPUT 1 && iptables -D OUTPUT 1
should mark the node back up within 30 seconds.
The expected results for the swift commands are the same as in case B.

Changed in mos:
status: New → Triaged
importance: Undecided → High
assignee: nobody → MOS Swift (mos-swift)
milestone: none → 7.0
Alexey Khivin (akhivin) wrote :

Possibly it is caused by this change:
https://review.openstack.org/#/c/155487/

Bogdan Dobrelya (bogdando) wrote :

@Alex, testing shows there are no issues with long response times after the bad node is marked DOWN by the HAProxy check.

summary: - Disabling management net on a single swift node leads to a very long
- swift response time
+ Disabling management net on a single swift proxy node leads to a very
+ long swift response time
Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Vladimir Kuklin (vkuklin)

Reviewed: https://review.openstack.org/186815
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=42df864217042cd2be03fde25ba6c235835d4835
Submitter: Jenkins
Branch: master

commit 42df864217042cd2be03fde25ba6c235835d4835
Author: Vladimir Kuklin <email address hidden>
Date: Mon Jun 1 17:57:41 2015 +0300

    Make HAProxy check of swift proxy backends via management VIP

    W/o this fix, when the management interface on the controller node
    running a Swift proxy is down, HAProxy would fail to update
    its backend status at the storage network.

    This is a problem as we want swift backends that are not able to connect
    to the swift endpoint via the management VIP to be marked down. Otherwise,
    response times for any requested swift commands would be drastically
    longer. The simple httpchk option cannot resolve this because the swift
    healthcheck reports OK if contacted via the storage network.

    In order to fix this, a simple healthcheck script is implemented.
    This script runs as an HTTP xinetd service on TCP port 49001 and
    is accessible only from localhost, 240.0.0.2, and the storage plus
    management networks. The service verifies that on the node under check:
    a) the management VIP is pingable via ICMP (by 3 packets)
    b) the Swift endpoint is reachable by TCP connect via the local storage
    address within a 5 second connection timeout
    c) the Swift healthcheck reports OK via the local storage address endpoint

    It reports an HTTP 200 OK if all of the checks pass.
    Otherwise, it reports an HTTP 503 Error.
    Expected Swift node control plane failover time will be around 30 seconds.
    Swift data plane is not affected.

    DocImpact: Reference architecture, swift failover.

    Closes-bug: #1459772
    Related-bug: #1460623

    Change-Id: I55a35b45257763a20f33bd47cb5c57de53558ccf
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in fuel:
status: In Progress → Fix Committed
Changed in fuel:
assignee: Vladimir Kuklin (vkuklin) → Bogdan Dobrelya (bogdando)
Changed in mos:
assignee: MOS Swift (mos-swift) → Fuel Library Team (fuel-library)
assignee: Fuel Library Team (fuel-library) → Vladimir Kuklin (vkuklin)
status: Triaged → Fix Committed
tags: added: on-verification
Alexander Arzhanov (aarzhanov) wrote :

Verified on ISO #286:

api: '1.0'
astute_sha: 8283dc2932c24caab852ae9de15f94605cc350c6
auth_required: true
build_id: '286'
build_number: '286'
feature_groups:
- mirantis
fuel-agent_sha: 082a47bf014002e515001be05f99040437281a2d
fuel-library_sha: ff63a0bbc93a3a0fb78215c2fd0c77add8dfe589
fuel-nailgun-agent_sha: d7027952870a35db8dc52f185bb1158cdd3d1ebd
fuel-ostf_sha: 1f08e6e71021179b9881a824d9c999957fcc7045
fuelmain_sha: 9ab01caf960013dc882825dc9b0e11ccf0b81cb0
nailgun_sha: 5c33995a2e6d9b1b8cdddfa2630689da5084506f
openstack_version: 2015.1.0-7.0
production: docker
python-fuelclient_sha: 1ce8ecd8beb640f2f62f73435f4e18d1469979ac
release: '7.0'
release_versions:
  2015.1.0-7.0:
    VERSION:
      api: '1.0'
      astute_sha: 8283dc2932c24caab852ae9de15f94605cc350c6
      build_id: '286'
      build_number: '286'
      feature_groups:
      - mirantis
      fuel-agent_sha: 082a47bf014002e515001be05f99040437281a2d
      fuel-library_sha: ff63a0bbc93a3a0fb78215c2fd0c77add8dfe589
      fuel-nailgun-agent_sha: d7027952870a35db8dc52f185bb1158cdd3d1ebd
      fuel-ostf_sha: 1f08e6e71021179b9881a824d9c999957fcc7045
      fuelmain_sha: 9ab01caf960013dc882825dc9b0e11ccf0b81cb0
      nailgun_sha: 5c33995a2e6d9b1b8cdddfa2630689da5084506f
      openstack_version: 2015.1.0-7.0
      production: docker
      python-fuelclient_sha: 1ce8ecd8beb640f2f62f73435f4e18d1469979ac
      release: '7.0'

Changed in mos:
status: Fix Committed → Fix Released
tags: removed: on-verification