Disabling management net on a single swift proxy node leads to a very long swift response time

Bug #1459772 reported by Dmitry Mescheryakov on 2015-05-28
This bug affects 1 person
Affects                   Importance  Assigned to
Fuel for OpenStack        High        Bogdan Dobrelya
Fuel for OpenStack 6.0.x  High        MOS Maintenance
Mirantis OpenStack        High        Vladimir Kuklin

Bug Description

Version: 6.1, ISO #474.
Full version available at http://paste.openstack.org/show/242594/

Steps to reproduce:
1. Install an environment with Swift, with 3 controllers and 1 compute node
2. Connect to one of the controllers and disable the management network on it using the following command:
      iptables -I INPUT -i br-mgmt -j DROP && iptables -I OUTPUT -o br-mgmt -j DROP
3. Connect to _another_ controller and run the command 'swift list' 10 times.

Sometimes the command takes a long time - more than a minute. On average, when it happens, the response returns in 70 seconds. It might happen on every invocation, or every 2nd or 3rd one, depending on circumstances I do not understand.

Analysis:

The issue occurs when haproxy sends a user's Swift request to the firewalled node. Swift on that node tries to validate the user's token and times out because it cannot connect to Keystone's admin URL (which is on the management net). Haproxy waits one minute for a response, and then resends the request to another node. As a result, the request takes slightly more than a minute to be processed.

A similar issue would happen with other OpenStack components, but haproxy detects that all services on the node except Swift are dead. Haproxy detects service failure by accessing the service's endpoint, which listens on the management (br-mgmt) network, which is firewalled. Swift's endpoint listens on the storage interface (br-storage), so haproxy thinks Swift is alive on the firewalled node.

In general, the problem is that haproxy's health checks are too 'weak' - it is not enough to check that the service's port is accessible. We probably need to temporarily disable a service on a node if it constantly fails.
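For illustration, a stronger check could be an HTTP-level one instead of a bare TCP connect. A minimal sketch of such a backend section in haproxy configuration (the backend name, addresses, port, and check timings below are illustrative, not the actual Fuel values):

```
backend swift
  balance roundrobin
  # probe an HTTP endpoint instead of only checking that the port accepts connections
  option httpchk GET /healthcheck
  # mark a server DOWN after 3 failed checks, UP again after 2 successful ones
  server node-1 192.168.1.3:8080 check inter 10s fall 3 rise 2
  server node-2 192.168.1.4:8080 check inter 10s fall 3 rise 2
```

Note that, as discussed later in this bug, a plain httpchk against the storage-network endpoint is still not enough here, since that endpoint keeps answering OK while the management network is firewalled.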

Attached is a snapshot of the environment in which the management interface of one node (node-2) was firewalled. In the haproxy log of node-1 you can see how swift requests were handled. Also, in the swift-proxy log of node-2 you can find swift trying to connect to keystone. The snapshot can be downloaded from: https://drive.google.com/file/d/0B_TRgCViR_cIQVpLQXJ5aVlnUTQ/view?usp=sharing

Changed in mos:
importance: Undecided → High
milestone: none → 6.1
description: updated
description: updated
description: updated
Dmitry Mescheryakov (dmitrymex) wrote :

Library people, please take a look at the issue - can you suggest a fix viable for 6.1? If not, I suggest moving the issue to 7.0, as this is not a very common failure scenario.

Changed in mos:
assignee: nobody → Fuel Library Team (fuel-library)
status: New → Confirmed
Mike Scherbakov (mihgen) wrote :

This can be a serious issue in a real deployment. Imagine that you've got a broken management wire, or a fried port on a switch.

Vladimir Kuklin (vkuklin) wrote :

Dima, thanks - this is a good test case. I added a tag to it. Actually, the issue currently affects only swift: our storage network is not broken in this case, but the mgmt net is. This means that sometimes requests will land on a swift backend which works at L4 but does not work at L7. I think we can fix it by adding httpchk to the swift backend.

tags: added: to-be-covered-by-system-tests
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
milestone: none → 6.1
importance: Undecided → High
status: New → Triaged
no longer affects: mos
tags: added: low-hanging-fruit
Nastya Urlapova (aurlapova) wrote :

@Vova, maybe it will surprise you, but we have covered this test since version 4.x: https://github.com/stackforge/fuel-qa/blob/master/fuelweb_test/tests/tests_strength/test_failover_base.py#L186
Dima investigated our env.

@Dima, thank you!

tags: removed: low-hanging-fruit to-be-covered-by-system-tests
Bogdan Dobrelya (bogdando) wrote :

@Nastya, if this can be fixed by adding httpchk to the swift backend, then it is a low-hanging fruit. Otherwise, please also remove the triaged status.

tags: added: low-hanging-fruit
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
Bogdan Dobrelya (bogdando) wrote :

Confirmed the issue:
A) 3 of 3 online:
# time swift post Container_test:
real 0m1.585s
user 0m0.234s
sys 0m0.018s

# time swift upload Container_test: /etc/swift/*.conf
/etc/swift/object-server.conf
/etc/swift/proxy-server.conf
/etc/swift/container-server.conf
/etc/swift/account-server.conf
/etc/swift/swift.conf
real 0m1.850s
user 0m0.259s
sys 0m0.041s

B) 2 of 3 online only:
# time swift post Container_test:
real 0m21.304s
user 0m0.206s
sys 0m0.049s

# time swift upload Container_test: /etc/swift/*.conf
/etc/swift/swift.conf
/etc/swift/proxy-server.conf
/etc/swift/account-server.conf
/etc/swift/object-server.conf
/etc/swift/container-server.conf [after 2 attempts]
real 5m21.111s
user 0m0.482s
sys 0m0.157s

Bogdan Dobrelya (bogdando) wrote :

@Nastya, please confirm whether the given test case verifies swift as well. Until then, I am adding the to-be-covered tag.

tags: added: to-be-covered-by-tests
tags: added: swift
tags: removed: low-hanging-fruit

Fix proposed to branch: master
Review: https://review.openstack.org/186815

Changed in fuel:
status: Triaged → In Progress

How-to test:
A. How to check the swiftcheck script:
On each swift proxy node, the command
 curl -XGET http://localhost:49001
should return OK if the node:
- is able to contact the management VIP by ICMP
- reports OK for the swift healthcheck via the node's storage address
and should return an error if either of the checks above fails.
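The decision logic of such a check script can be sketched as follows. This is a hypothetical illustration, not the actual swiftcheck script from the fix: the probe commands, the Swift port, and the /healthcheck path are assumptions.

```python
import subprocess
import urllib.request


def ping_ok(host, count=3, timeout=5):
    """Probe the management VIP with ICMP (hypothetical probe command)."""
    cmd = ["ping", "-c", str(count), "-W", str(timeout), host]
    return subprocess.call(cmd, stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL) == 0


def swift_healthcheck_ok(storage_addr, port=8080, timeout=5):
    """Probe Swift's healthcheck endpoint via the node's storage address."""
    url = "http://%s:%d/healthcheck" % (storage_addr, port)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def check_status(mgmt_vip_reachable, swift_ok):
    """Combine the probe results into the HTTP status the checker returns:
    200 makes haproxy keep the backend UP, 503 makes it mark it DOWN."""
    if mgmt_vip_reachable and swift_ok:
        return 200
    return 503
```

The key point is that both probes must pass: a node whose storage network still answers but whose management net is firewalled fails the VIP ping and is reported as 503, so haproxy stops routing swift requests to it.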

B. How to check the swift proxy node control plane failover:
To get the swift nodes' status in HAProxy, use the command
 ip netns exec haproxy curl 'http://localhost:10000/;csv' | grep swift
executed on the node running the management VIP.
- The command
 iptables -I INPUT 1 -i br-mgmt -j DROP && iptables -I OUTPUT 1 -o br-mgmt -j DROP
should mark the node down within 30 seconds.
- After that, the commands
 time swift post Container_test
 time swift upload Container_test /etc/swift/*.conf
 time swift list Container_test
 time swift delete Container_test
should return results with a reasonable delay (from seconds to a few tens of seconds)
and should *not* return with a significant delay (from a minute to tens of minutes)

C. After step B is done, how to check the swift proxy node control plane failback:
- The command
 iptables -D INPUT 1 && iptables -D OUTPUT 1
should mark the node back up within 30 seconds.
The expected results for the swift commands are the same as in case B.

Changed in mos:
status: New → Triaged
importance: Undecided → High
assignee: nobody → MOS Swift (mos-swift)
milestone: none → 7.0
Alexey Khivin (akhivin) wrote :

Possibly it is caused by this change:
https://review.openstack.org/#/c/155487/

Bogdan Dobrelya (bogdando) wrote :

@Alex, testing shows there are no issues with long response times after the bad node is marked DOWN by the HAProxy check.

summary: - Disabling management net on a single swift node leads to a very long
- swift response time
+ Disabling management net on a single swift proxy node leads to a very
+ long swift response time
Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Vladimir Kuklin (vkuklin)

Reviewed: https://review.openstack.org/186815
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=42df864217042cd2be03fde25ba6c235835d4835
Submitter: Jenkins
Branch: master

commit 42df864217042cd2be03fde25ba6c235835d4835
Author: Vladimir Kuklin <email address hidden>
Date: Mon Jun 1 17:57:41 2015 +0300

    Make HAProxy check of swift proxy backends via management VIP

    W/o this fix, when the management interface on the controller node
    running a Swift proxy is down, HAProxy would fail to update
    its backend status at the storage network.

    This is a problem as we want swift backends that are not able to connect
    to the swift endpoint via the management VIP to be marked down. Otherwise,
    response times for any requested swift commands would be drastically
    longer. The simple httpchk option cannot resolve this because the swift
    healthcheck reports OK if contacted via the storage network.

    In order to fix this, a simple healthcheck script is implemented.
    This script runs as an HTTP xinetd service on TCP port 49001 and
    is accessible only from localhost, 240.0.0.2, and the storage plus
    management networks. The service verifies that on the node under check:
    a) the management VIP is pingable via ICMP (by 3 packets)
    b) the Swift endpoint is reachable by TCP connect via the local storage
    address within a 5 second connection timeout
    c) the Swift healthcheck reports OK via the local storage address endpoint

    It reports an HTTP 200 OK if all of the checks pass.
    Otherwise, it reports an HTTP 503 Error.
    Expected Swift node control plane failover time will be around 30 seconds.
    Swift data plane is not affected.

    DocImpact: Reference architecture, swift failover.

    Closes-bug: #1459772
    Related-bug: #1460623

    Change-Id: I55a35b45257763a20f33bd47cb5c57de53558ccf
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in fuel:
status: In Progress → Fix Committed
Changed in fuel:
assignee: Vladimir Kuklin (vkuklin) → Bogdan Dobrelya (bogdando)
Changed in mos:
assignee: MOS Swift (mos-swift) → Fuel Library Team (fuel-library)
assignee: Fuel Library Team (fuel-library) → Vladimir Kuklin (vkuklin)
status: Triaged → Fix Committed
tags: added: on-verification
Alexander Arzhanov (aarzhanov) wrote :

Verified on ISO #286:

api: '1.0'
astute_sha: 8283dc2932c24caab852ae9de15f94605cc350c6
auth_required: true
build_id: '286'
build_number: '286'
feature_groups:
- mirantis
fuel-agent_sha: 082a47bf014002e515001be05f99040437281a2d
fuel-library_sha: ff63a0bbc93a3a0fb78215c2fd0c77add8dfe589
fuel-nailgun-agent_sha: d7027952870a35db8dc52f185bb1158cdd3d1ebd
fuel-ostf_sha: 1f08e6e71021179b9881a824d9c999957fcc7045
fuelmain_sha: 9ab01caf960013dc882825dc9b0e11ccf0b81cb0
nailgun_sha: 5c33995a2e6d9b1b8cdddfa2630689da5084506f
openstack_version: 2015.1.0-7.0
production: docker
python-fuelclient_sha: 1ce8ecd8beb640f2f62f73435f4e18d1469979ac
release: '7.0'
release_versions:
  2015.1.0-7.0:
    VERSION:
      api: '1.0'
      astute_sha: 8283dc2932c24caab852ae9de15f94605cc350c6
      build_id: '286'
      build_number: '286'
      feature_groups:
      - mirantis
      fuel-agent_sha: 082a47bf014002e515001be05f99040437281a2d
      fuel-library_sha: ff63a0bbc93a3a0fb78215c2fd0c77add8dfe589
      fuel-nailgun-agent_sha: d7027952870a35db8dc52f185bb1158cdd3d1ebd
      fuel-ostf_sha: 1f08e6e71021179b9881a824d9c999957fcc7045
      fuelmain_sha: 9ab01caf960013dc882825dc9b0e11ccf0b81cb0
      nailgun_sha: 5c33995a2e6d9b1b8cdddfa2630689da5084506f
      openstack_version: 2015.1.0-7.0
      production: docker
      python-fuelclient_sha: 1ce8ecd8beb640f2f62f73435f4e18d1469979ac
      release: '7.0'

Changed in mos:
status: Fix Committed → Fix Released
tags: removed: on-verification