HAProxy marked swift as down for ~20 seconds after VIPs removal

Bug #1516978 reported by Artem Panchenko
Affects: Fuel for OpenStack
Status: Fix Released
Importance: Medium
Assigned to: Michael Polenchuk

Bug Description

Fuel version info (8.0 build #169): http://paste.openstack.org/show/479097/

Health check 'Check state of haproxy backends on controllers' failed after the public and management VIPs were deleted:

2015-11-17 05:09:36 DEBUG (test_haproxy) Dead backends ['swift node-2 Status: DOWN 1/3/L7OSessions: 0 Rate: 0 ']

Steps to reproduce:

1. Delete the public and management VIPs 10 times (ip netns exec haproxy ip addr del ${vip} dev b_management); a shell sketch of this loop is shown below.
2. Wait until the VIPs are restored.
3. Verify they are restored.
4. Run OSTF.
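
A minimal sketch of the deletion loop from step 1 (assuming the standard Fuel layout where haproxy runs in the "haproxy" network namespace and the cluster restores the VIP on its own; the address is a made-up placeholder):

vip=10.109.1.2    # hypothetical management VIP, substitute the real one
for i in $(seq 1 10); do
    ip netns exec haproxy ip addr del "${vip}" dev b_management
    sleep 60      # give the cluster time to detect the loss and restore the address
done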

Expected result: health checks passed
Actual result: check 'Check state of haproxy backends on controllers' failed

According to the OSTF logs, it used the node-3 controller to check HAProxy status. Here is the relevant part of the haproxy log on node-3:

<129>Nov 17 05:09:24 node-3 haproxy[27338]: Server swift/node-2 is DOWN, reason: Layer7 timeout, check duration: 10001ms. 2 active and 0 backup servers left.
<133>Nov 17 05:09:43 node-3 haproxy[27338]: Server swift/node-2 is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 2030ms.

The same issue was also detected by haproxy on node-2:

<129>Nov 17 05:09:28 node-2 haproxy[28450]: Server swift/node-2 is DOWN, reason: Layer4 connection problem, info: "Invalid argument", check duration: 10001ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<133>Nov 17 05:09:46 node-2 haproxy[28450]: Server swift/node-2 is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 2089ms.

I checked the atop logs on node-2 and the node wasn't overloaded at that time. According to the swift logs on node-2, object replication was running there at 05:09, so it could be the cause of the issue:

<46>Nov 17 05:09:19 node-2 swift-object-server: Starting object replication pass.
<46>Nov 17 05:09:19 node-2 swift-object-server: 11/11 (100.00%) partitions replicated in 0.06s (184.16/sec, 0s remaining)
<46>Nov 17 05:09:19 node-2 swift-object-server: 11 suffixes checked - 0.00% hashed, 0.00% synced
<46>Nov 17 05:09:19 node-2 swift-object-server: Partition times: max 0.0061s, min 0.0039s, med 0.0047s
<46>Nov 17 05:09:19 node-2 swift-object-server: Object replication complete. (0.00 minutes)
<46>Nov 17 05:09:36 node-2 swift-container-server: Beginning replication run
<46>Nov 17 05:09:36 node-2 swift-container-server: Replication run OVER
<46>Nov 17 05:09:36 node-2 swift-container-server: Attempted to replicate 12 dbs in 0.32555 seconds (36.86035/s)
<46>Nov 17 05:09:36 node-2 swift-container-server: Removed 0 dbs
<46>Nov 17 05:09:36 node-2 swift-container-server: 24 successes, 0 failures
<46>Nov 17 05:09:36 node-2 swift-container-server: no_change:24 ts_repl:0 diff:0 rsync:0 diff_capped:0 hashmatch:0 empty:0
<46>Nov 17 05:09:43 node-2 swift-account-server: Beginning replication run
<46>Nov 17 05:09:43 node-2 swift-account-server: Replication run OVER
<46>Nov 17 05:09:43 node-2 swift-account-server: Attempted to replicate 1 dbs in 0.01444 seconds (69.23188/s)
<46>Nov 17 05:09:43 node-2 swift-account-server: Removed 0 dbs
<46>Nov 17 05:09:43 node-2 swift-account-server: 2 successes, 0 failures
<46>Nov 17 05:09:43 node-2 swift-account-server: no_change:2 ts_repl:0 diff:0 rsync:0 diff_capped:0 hashmatch:0 empty:0
<46>Nov 17 05:09:49 node-2 swift-object-server: Starting object replication pass.
<46>Nov 17 05:09:49 node-2 swift-object-server: 11/11 (100.00%) partitions replicated in 0.13s (84.30/sec, 0s remaining)
<46>Nov 17 05:09:49 node-2 swift-object-server: 11 suffixes checked - 0.00% hashed, 0.00% synced
<46>Nov 17 05:09:49 node-2 swift-object-server: Partition times: max 0.0127s, min 0.0063s, med 0.0100s
<46>Nov 17 05:09:49 node-2 swift-object-server: Object replication complete. (0.00 minutes)

Diagnostic snapshot is attached.

Tags: area-qa
Revision history for this message
Artem Panchenko (apanchenko-8) wrote :
Ilya Kutukov (ikutukov)
tags: added: area-library
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
importance: Undecided → Medium
Ilya Kutukov (ikutukov)
Changed in fuel:
status: New → Confirmed
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Please elaborate: what is the exact subject of the reported issue? Is it the ~20 seconds of downtime of some swift backends? That looks like no issue at all, since 2 backends stayed active anyway.

Changed in fuel:
status: Confirmed → Invalid
Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

@Bogdan,

yes, the issue is swift backend downtime, which causes an OSTF health check failure. I agree that there is nothing critical here; the priority of the bug is medium or even low. But we can't just invalidate it, because 99% of our automated system tests use OSTF to check cloud health, so if any of the checks fails, the whole system test fails.
That's why QA and developers need to reach an agreement here: if sporadic short downtime of a swift backend is expected behaviour, then we need to adjust our tests; otherwise an improvement on the swift side is needed. In any case we need confirmation from swift experts.

Changed in fuel:
status: Invalid → Confirmed
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Agreed, let's please adjust the tests then

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Fuel QA Team (fuel-qa)
tags: added: area-qa
removed: area-library haproxy swift
Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

@Bogdan,

please provide us with details: what exactly should we expect while checking the HAProxy stats for the Swift backend? If some backends are DOWN but others are UP, what should the status (pass/fail) of the health check be?
Maybe the Swift API goes down only during replication and normally must always be up? If so, in which cases (e.g. controller reset, VIP deletion) is replication usually triggered?

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

Guys,

in order to adjust the tests, the QA team needs answers to the questions from comment #5.

Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Fuel Library Team (fuel-library)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I suggest accepting the swift state as OK as long as you can issue create/list/delete commands with the Swift CLI.
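
For reference, such a smoke test could look like the sketch below (illustrative only: the container and object names are made up, the swift client from python-swiftclient is assumed to be installed, and the usual OS_* credentials are assumed to be sourced):

swift post ostf-smoke                           # create a test container
echo "ostf" > /tmp/ostf-smoke.txt
swift upload ostf-smoke /tmp/ostf-smoke.txt     # create an object in it
swift list ostf-smoke                           # list the container contents
swift delete ostf-smoke                         # delete the object(s) and the container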

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Fuel QA Team (fuel-qa)
Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

Sorry, I don't agree with you. The test 'Check state of haproxy backends on controllers' is designed to check the cloud's HA health status, and in general all service backends *must* be UP all the time, unless some nodes are offline or in maintenance. So in that test we are trying to catch the situation where some backends don't work for some reason (failed to start on a non-primary controller, incorrect firewall rules, etc.) while the service API/CLI still works fine. We can't catch such an issue just by checking service functionality, but it can significantly affect the high availability of our clouds. That's why verification of HAProxy backends was added to the OSTF tests.

Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Fuel Library Team (fuel-library)
Revision history for this message
Michael Polenchuk (mpolenchuk) wrote :

openstack/manifests/ha/swift.pp:
balancermember_options => 'check port 49001 inter 15s fastinter 2s downinter 8s rise 3 fall 3'

The tests' expectations don't match the check interval of the haproxy swift backend, so there are two ways to go: adjust either the haproxy interval or the tests' check delay.
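
For context, a rough reading of those options (based on haproxy's documented check semantics; the numbers below are illustrative and not taken from the report):

inter=15       # seconds between checks while the server is steadily UP
fastinter=2    # check interval while the server state is transitioning
downinter=8    # check interval while the server is confirmed DOWN
rise=3         # consecutive passing checks needed to mark the server UP again
fall=3         # consecutive failing checks needed to mark the server DOWN
echo "the three passing checks alone span at least $(( (rise - 1) * fastinter ))s"

So even after the backend starts answering again, haproxy needs three passing checks (plus the ~2s check durations seen in the logs) before it reports the server UP, which is consistent with the ~20 second window the test observed.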

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Michael Polenchuk (mpolenchuk)
Revision history for this message
Michael Polenchuk (mpolenchuk) wrote :

Or it's a network connectivity issue, which looks more plausible.

Revision history for this message
Michael Polenchuk (mpolenchuk) wrote :

fuel-lib:files/fuel-ha-utils/tools/swiftcheck:

// "nc (netcat)" should be used instead of "ping" + retries
# Check for the management VIP avail.
ping -c3 $2 2>&1 >/dev/null

// use of HEAD would be enough/effective + (--retry <num>)
curl --connect-timeout ${3} -XGET ${url}/healthcheck
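
A hedged sketch of those two suggestions (not the merged fix; $2, $3 and ${url} are the same parameters the swiftcheck script already uses, while the probed port is a made-up placeholder):

# Probe the management VIP with a TCP connect (nc) instead of ICMP ping, with retries.
port=80    # hypothetical; pick a service that is known to listen on the VIP
for attempt in 1 2 3; do
    nc -z -w 2 "$2" "${port}" && break
    sleep 1
done

# A HEAD request is enough for the healthcheck endpoint, and --retry adds resilience.
curl --connect-timeout "${3}" --retry 2 -I "${url}/healthcheck"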

Dmitry Pyzhov (dpyzhov)
tags: added: area-library
removed: area-qa
tags: added: area-qa
removed: area-library
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/255892

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/255892
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=69390d7d16403defee33caee10e3a990a7c3b795
Submitter: Jenkins
Branch: master

commit 69390d7d16403defee33caee10e3a990a7c3b795
Author: Michael Polenchuk <email address hidden>
Date: Thu Dec 10 16:55:21 2015 +0300

    Pull apart swift haproxy health checker

    Setup custom script checker with additional auth endpoint availability
    scan if swift proxy listens to the same ip address with storage daemons
    otherwise use default internal health check method.

    Also introduce the following haproxy options:
    * <spread-checks>
      add some randomness in the check interval to avoid sending
      health checks to servers at exact interspaces.
    * <dontlognull>
      disable logging of null connections as these can pollute the logs.
    * <tcp-smart-accept, tcp-smart-connect>
      performance tweak, saving one ACK packet during the
      accept/connect sequence.

    DocImpact
    Change-Id: I70ebdc595e85294559d33cc03d4221a738b0bbc5
    Closes-Bug: #1516978
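
A rough shell illustration of the checker-selection logic described in the commit message (a sketch only; the variables are made up and this is not the merged Puppet code):

swift_proxy_address="192.168.0.2"    # hypothetical
storage_address="192.168.0.2"        # hypothetical
# When the swift proxy shares its listen address with the storage daemons, the
# built-in check cannot tell them apart, so the external script checker (which
# also probes the auth endpoint) is used; otherwise the default check is kept.
if [ "${swift_proxy_address}" = "${storage_address}" ]; then
    echo "use the external swiftcheck script (with auth endpoint probe)"
else
    echo "use haproxy's default internal health check"
fi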

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Verified on ISO #361.

Changed in fuel:
status: Fix Committed → Fix Released