HA. Nova-compute is down after destroying primary controller

Bug #1289200 reported by Nikolay Fedotov on 2014-03-07
This bug affects 3 people
Affects             Importance  Assigned to
Fuel for OpenStack  High        Bogdan Dobrelya
  4.1.x             High        Bogdan Dobrelya
  5.0.x             High        Bogdan Dobrelya

Bug Description

ISO: {"build_id": "2014-03-05_07-31-01", "mirantis": "yes", "build_number": "235", "nailgun_sha": "f58aad317829112913f364347b14f1f0518ad371", "ostf_sha": "dc54d99ddff2f497b131ad1a42362515f2a61afa", "fuelmain_sha": "16637e2ea0ae6fe9a773aceb9d76c6e3a75f6c3b", "astute_sha": "f15f5615249c59c826ea05d26707f062c88db32a", "release": "4.1", "fuellib_sha": "73313007c0914e602246ea41fa5e8ca2dfead9f8"}

Steps:
- Create an environment: CentOS, Neutron with GRE segmentation
- Add nodes: 3 controllers, 2 computes
- Deploy changes
- Destroy primary controller
- Try to boot an instance. Check output of a "nova service-list" command

Result:
Unable to boot instance : No valid host was found
nova-compute is down

Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
milestone: none → 4.1.1
Bogdan Dobrelya (bogdando) wrote :

This issue could also be related to https://bugs.launchpad.net/fuel/+bug/1288831

Changed in fuel:
milestone: 4.1.1 → 5.0
tags: added: backports-4.1.1
Changed in fuel:
status: New → Confirmed
Andrew Woodward (xarses) on 2014-04-04
tags: added: ha
Vladimir Kuklin (vkuklin) wrote :

Needs to be reconfirmed with the newer RabbitMQ and the haproxy-in-namespace patch.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Fuel QA Team (fuel-qa)
Andrew Woodward (xarses) wrote :

This is still a problem in 4.1.1, since the relevant patches are back-ported; I'm going to say that this is not fixed.

Changed in fuel:
importance: Medium → High
Vladimir Kuklin (vkuklin) wrote :

Andrew, is it OK that we move this bug to 4.1.1?

Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Sergey Vasilenko (xenolog)
Sergey Vasilenko (xenolog) wrote :

After removing the 1st controller, the RabbitMQ cluster takes a long time (about 180 sec.) to rebuild itself:
[root@node-3 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-3' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-3']}]},
 {running_nodes,['rabbit@node-2','rabbit@node-3']},
 {partitions,[]}]
...done.

Horizon is temporarily unhealthy.

nova-compute goes down permanently.

If we restart nova-conductor on ALL alive controllers _AND_ nova-compute, then nova-compute returns from the darkness.
If the clocks are not synchronized, we can see this failure:

nova-scheduler node-2.domain.tld internal enabled :-) 2014-04-30 17:46:51
nova-compute node-4.domain.tld nova enabled XXX 2014-04-30 17:44:13
nova-cert node-3.domain.tld internal enabled :-) 2014-04-30 17:46:51

It's not really a failure: the timestamp of nova-compute did change, but nova-manage interprets it as a failure because the clocks are not synchronized.
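The "XXX" state in the service-list output above comes down to a staleness check: a service is reported down when the time since its last report exceeds a threshold. A sketch of the arithmetic using the timestamps shown above (the real check lives in Nova's servicegroup code; the 60-second service_down_time default is an assumption for this era of Nova):

```shell
# A service shows as down ("XXX") when now - last_report exceeds
# service_down_time. With unsynchronized clocks, a healthy compute
# can appear stale. Timestamps are taken from the output above.
last_report=$(date -u -d '2014-04-30 17:44:13' +%s)
now=$(date -u -d '2014-04-30 17:46:51' +%s)
elapsed=$(( now - last_report ))
echo "$elapsed"   # 158 seconds elapsed, well past a 60s service_down_time
```

(GNU date is assumed for the -d flag.)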

Andrew Woodward (xarses) wrote :

@bogdando via https://bugs.launchpad.net/fuel/+bug/1317488
Summary:
Fuel should provide TCP keepalives (KA) for RabbitMQ sessions in HA mode.
These TCP keepalives should be visible at the application layer as well as at the network stack layer.

related Oslo.messaging issue: https://bugs.launchpad.net/oslo.messaging/+bug/856764
related fuel-dev ML: https://lists.launchpad.net/fuel-dev/msg01024.html

Issues we have in Fuel:
1) In 5.0 we upgraded rabbit to 3.x and moved its connection management out of the HAproxy scope for most of the OpenStack services (those that have rabbit_hosts support synced from oslo.messaging). (This was also backported for 4.1.1.)
Hence, we still have to provide TCP keepalives for RabbitMQ sessions in order to make the Fuel HA architecture more reliable.

2) Anyway, HAproxy provides TCP keepalives only at the network layer; see the docs:
"It is important to understand that keep-alive packets are neither emitted nor
  received at the application level. It is only the network stacks which sees
  them. For this reason, even if one side of the proxy already uses keep-alives
  to maintain its connection alive, those keep-alive packets will not be
  forwarded to the other side of the proxy."

3) We have it configured the wrong way; see the HAproxy docs:
"Using option "tcpka" enables the emission of TCP keep-alive probes on both
  the client and server sides of a connection. Note that this is meaningful
  only in "defaults" or "listen" sections. If this option is used in a
  frontend, only the client side will get keep-alives, and if this option is
  used in a backend, only the server side will get keep-alives. For this
  reason, it is strongly recommended to explicitly use "option clitcpka" and
  "option srvtcpka" when the configuration is split between frontends and
  backends."
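Per the docs quoted above, a split frontend/backend configuration needs both per-side options set explicitly. A minimal sketch (hypothetical section names and addresses, not Fuel's actual config):

```shell
# Write a sample HAproxy fragment with per-side keepalive options set
# explicitly, as the docs recommend for split frontend/backend configs.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
frontend rabbitmq-front
    bind *:5673
    option clitcpka              # keepalives toward the clients
    default_backend rabbitmq-back

backend rabbitmq-back
    option srvtcpka              # keepalives toward the rabbit nodes
    server node-1 192.168.0.3:5673 check
EOF
grep -c 'tcpka' "$cfg"           # both sides covered: prints 2
```

With "option tcpka" alone in a frontend, only the client side would get probes; the server-side connections to the rabbit nodes would keep the poor kernel defaults.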

Suggested solution:
Apply all patches from #856764 for Nova in MOS packages and test the RabbitMQ connections thoroughly. If it looks OK, sync the patches for the other MOS packages.
Perhaps this issue should be fixed in 5.1, but the backport should be considered Critical for the 4.1.1 release (due to the increasing number of existing tickets in Zendesk) and High for 5.1. I hope the 5.0 backport is not needed, since the option to roll an upgrade 5.0 -> 5.1 should exist.

Ryan Moe (rmoe) wrote :

The support for Rabbit heartbeats was reverted: https://review.openstack.org/#/c/36606/. With kombu you have to call heartbeat_check() once per second; without a thread calling that function, your connections will all die after 'heartbeat' seconds.

The kombu reconnect changes here: https://review.openstack.org/#/c/76686/ along with the CCN changes are already in our packages. The config changes to rabbit here: https://bugs.launchpad.net/oslo.messaging/+bug/856764/comments/19 sound helpful though and are worth testing.

Bogdan Dobrelya (bogdando) wrote :

Thank you for clarifying this, Ryan. Any comment on the fate of https://review.openstack.org/77276?

Bogdan Dobrelya (bogdando) wrote :

As far as I can see from the related oslo bug and the aforementioned comments (https://bugs.launchpad.net/oslo.messaging/+bug/856764/comments/19), and from the Zendesk solutions for nova-compute troubleshooting (#1743), the first steps to address this issue should be:
1) TCP keepalives for the Rabbit cluster (comments/19)
2) use running_deleted_instance_action=reap in the Nova config (#1743 and http://lists.openstack.org/pipermail/openstack-dev/2013-October/016153.html)

Sergey Vasilenko (xenolog) wrote :

Guys, at the start of this release cycle we had the same situation with MySQL,
i.e. a connection from a host to itself, to the local IP address.
For MySQL we fixed this problem by moving HAproxy into a network namespace, i.e. we eliminated connections to self through localhost.
In the current RabbitMQ implementation we have two "local" endpoints: 127.0.0.1 and the local controller IP address.

Can anybody try to reproduce this case when the OpenStack services connect to RabbitMQ through HAproxy?

Bogdan Dobrelya (bogdando) wrote :

Sergey, MySQL and RabbitMQ clusters are not comparable if we are talking about localhost connections. It is normal for RabbitMQ with mirrored queues to get or publish messages via localhost, because it would auto-resend all messages to the master rabbit. MySQL, however, could benefit from localhost connections only for read-only slaves or perhaps for configured multi-master writes (but Fuel uses neither).

Sergey Vasilenko (xenolog) wrote :

From the point of view of the TCP/IP implementation in the Linux kernel, there is no difference between MySQL and RabbitMQ connections.

Bogdan Dobrelya (bogdando) wrote :

Basically, I meant that direct connections to localhost and to the other rabbit nodes in HA are not a problem for RabbitMQ and, unlike the MySQL+VIP case, probably don't require a fix. Application-level keepalives could handle the app layer just fine, hence we don't have to go through the layers below.

Fix proposed to branch: master
Review: https://review.openstack.org/93815

Changed in fuel:
assignee: Sergey Vasilenko (xenolog) → Vladimir Kuklin (vkuklin)
status: Confirmed → In Progress
Changed in fuel:
assignee: Vladimir Kuklin (vkuklin) → Bogdan Dobrelya (bogdando)

Reviewed: https://review.openstack.org/93815
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=549173c10173708da952f5a94f9a0bb9f1434220
Submitter: Jenkins
Branch: master

commit 549173c10173708da952f5a94f9a0bb9f1434220
Author: Vladimir Kuklin <email address hidden>
Date: Thu May 15 17:41:15 2014 -0400

    Let kernel kill dead TCP connections

    1) Kill dead tcp connections in a more effective way:
    configure the keepalive routines to wait for 30 secs before sending
    the first keepalive probe, and then resend it every 8 seconds.
    If no ACK response is received for 3 tries, mark the connection
    as broken.
    (The defaults are 7200, 75, 9 respectively and provide a *very* poor logic
    for dead connections tracking and failover as well)
    http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
    2) Set report_interval to 60 seconds and service_down_time to 180
    seconds for Nova to let kernel kill dead connections.
    3) Fix missing report_interval param usage in nova class.
    4) Provide a new openstack::keepalive class in order to configure
    networking related sysctls during the 'netconfig' stage.

    Change-Id: Ic9d491f4904a5e665278027fc37254003c4b5172
    Closes-Bug: #1289200
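The tuning in the commit above boils down to three sysctls. A sketch that writes a config fragment rather than applying it (sysctl -w needs root; the temp-file approach is illustrative, the real change lands via the openstack::keepalive class):

```shell
# Keepalive values from the commit message vs. kernel defaults (7200/75/9).
conf=$(mktemp)
cat > "$conf" <<'EOF'
net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_keepalive_intvl = 8
net.ipv4.tcp_keepalive_probes = 3
EOF
# Worst-case dead-peer detection time: time + intvl * probes
echo $(( 30 + 8 * 3 ))      # 54 seconds with the tuned values
echo $(( 7200 + 75 * 9 ))   # 7875 seconds (~2h11m) with the defaults
```

The 60s report_interval and 180s service_down_time from the commit fit inside that 54-second detection window: by the time a service report is considered overdue, the kernel has already torn down any dead connection.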

Changed in fuel:
status: In Progress → Fix Committed
Changed in fuel:
milestone: 5.0 → 4.1.1
status: Fix Committed → In Progress

Reviewed: https://review.openstack.org/94147
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=9a0e985b3a7936f993e3ef3c30c7388d98791523
Submitter: Jenkins
Branch: stable/4.1

commit 9a0e985b3a7936f993e3ef3c30c7388d98791523
Author: Vladimir Kuklin <email address hidden>
Date: Thu May 15 17:41:15 2014 -0400

    Let kernel kill dead TCP connections

    1) Kill dead tcp connections in a more effective way:
    configure the keepalive routines to wait for 30 secs before sending
    the first keepalive probe, and then resend it every 8 seconds.
    If no ACK response is received for 3 tries, mark the connection
    as broken.
    (The defaults are 7200, 75, 9 respectively and provide a *very* poor logic
    for dead connections tracking and failover as well)
    http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
    2) Set report_interval to 60 seconds and service_down_time to 180
    seconds for Nova to let kernel kill dead connections.
    3) Fix missing report_interval param usage in nova class.
    4) Provide a new openstack::keepalive class in order to configure
    networking related sysctls during the 'netconfig' stage.
    poke ci

    Change-Id: Ic9d491f4904a5e665278027fc37254003c4b5172
    Closes-Bug: #1289200
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Dmitry Borodaenko (angdraug) wrote :

Below is a Release Notes friendly description of the current state of this bug with all fixes merged to master so far and 3 additional fixes:
- https://review.openstack.org/93411 rabbitmq-keepalive
- https://review.openstack.org/93883 rabbitmq-hosts-shuffle
- https://bugs.launchpad.net/fuel/+bug/1321451 python-kombu-and-amqp-upgrade

Controller failover may cause Nova to fail to start VM instances 2014-05-20
---------------------------------------------------------------------------

If one of the Controller nodes abruptly goes offline, it is possible that some
of the TCP connections from Nova services on Compute nodes to RabbitMQ on the
failed Controller node will not be immediately terminated.

When that happens, RPC communication between the nova-compute service and Nova
services on Controller nodes stops, and Nova becomes unable to manage VM
instances on the affected Compute nodes. Instances that were previously
launched on these nodes continue running but cannot be stopped or modified, and
new instances scheduled to the affected nodes will fail to launch.

After 2 hours (sooner if the failed controller node is brought back online),
zombie TCP connections are terminated; after that, Nova services on affected
Compute nodes reconnect to RabbitMQ on one of the operational Controller nodes
and RPC communication is restored. Manual restart of nova-compute service on
affected nodes also results in immediate recovery.

See `LP1289200 <https://bugs.launchpad.net/fuel/+bug/1289200>`_.
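The two-hour figure above matches the kernel's default tcp_keepalive_time of 7200 seconds, and the "manual restart" recovery is just a service restart on each affected node. A dry-run sketch (hypothetical hostnames and CentOS-era service names; replace the echo with a real ssh to apply it):

```shell
# Dry-run recovery sketch: restart nova-compute on each affected node.
# run() only prints the command; swap `echo ssh` for `ssh` to execute.
run() { echo ssh "$@"; }
for node in node-4 node-5; do
    run "$node" 'service openstack-nova-compute restart'
done
```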

Bogdan Dobrelya (bogdando) wrote :

Dmitry, please note, I was able to reproduce this (#21) with the default TCP keepalive sysctls in the host OS (7200, 75, 9). But it wasn't reproducible anymore with https://review.openstack.org/#/c/94147/.
Here is how I was testing this:

0) Given:
- node-1 192.168.0.3, the primary controller hosting the VIP 192.168.0.2
- node-5 192.168.0.7, the compute node under test (it should be connected via AMQP to the given node-1 as well)
- the rabbitmq accept rule for the controller is rule number 16
- some instances spawned at the compute under test

*CASE A.*
1) Add iptables block rules from node-5 to node-1:5673 (take care of conntrack as well!):
[root@node-1 ~]# iptables -I INPUT 16 -s 192.168.0.7/32 -p tcp --dport 5673 -j DROP
[root@node-1 ~]# iptables -I FORWARD 1 -s 192.168.0.7/32 -p tcp ! --syn -m state --state NEW -j DROP

2.a) Send a task to spawn some instances and delete some others at the compute under test.
2.b) Watch for accumulated messages in the queue for the compute under test:
[root@node-1 ~]# rabbitmqctl list_queues name messages | grep compute | egrep -v "0|^$"

3.a) Wait for AMQP reconnections from the compute under test (i.e. just grep its logs for "Reconnecting to AMQP").
3.b) Watch established connections continuously:
lsof: [root@node-5 ~]# lsof -i -n -P -itcp -iudp | egrep 'nova-comp'
ss: [root@node-5 ~]# ss -untap | grep -E '5673\s+u.*compute'

4) Remove the iptables drop rules (teardown).

The expected result was:
a) Reconnection attempts from the compute under test within ~2 min after the traffic was blocked at the controller side with iptables.
b) Hung requests for instances (unconsumed messages in queues) should come alive after the compute under test has successfully reconnected to a new AMQP node.

*CASE B.*
The same as CASE A, but instead of the iptables rules, issue a kill -9 on the beam pid on node-1 which has an open connection with the compute under test, e.g. (for the given data in section 0):
[root@node-1 ~]# ss -untap | grep '\.2:5673.*\.6'
tcp ESTAB 0 0 ::ffff:192.168.0.2:5673 ::ffff:192.168.0.6:32790 users:(("beam",28198,35))
...
[root@node-1 ~]# kill -9 28198

So, Dmitry, could you please elaborate:
1) whether your test cases or results were different;
2) what your expected results were: a) reconnection attempts from the compute, and/or b) the absence of rmq-related operations on the dead sockets in strace output, or c) some other rmq-traffic-related criteria?

Bogdan Dobrelya (bogdando) wrote :

Forgot to add: I'm running with the 'stock'
amqp==1.0.12
amqplib==1.0.0
kombu==2.5.12
provided with the current Fuel ISO

Bogdan Dobrelya (bogdando) wrote :

Sorry, I lost track of the history...
https://review.openstack.org/#/c/93411/ is in review. Once merged, I will backport it, and the bug can be considered closed (or reassigned).

Reviewed: https://review.openstack.org/93411
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=b41ceb676b4441abdfd214cf8f4edf200f37f563
Submitter: Jenkins
Branch: master

commit b41ceb676b4441abdfd214cf8f4edf200f37f563
Author: Bogdan Dobrelya <email address hidden>
Date: Tue May 13 13:30:03 2014 +0300

    Use TCPKA for Rabbit cluster

    See https://bugs.launchpad.net/oslo.messaging/+bug/856764/comments/19
    poke ci

    Related-Bug: 1289200
    Change-Id: I39684e0a57d05451ecb4786250529518fe142b1d
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Bogdan Dobrelya (bogdando) wrote :

A small update for #22:
[root@node-1 ~]# iptables -I FORWARD 1 -s 192.168.0.7/32 -p tcp ! --syn -m state --state NEW -j DROP
looks like the wrong rule and won't block established/related sessions in conntrack either. Here is the right one:
[root@node-1 ~]# iptables -I INPUT 3 -s 192.168.0.7/32 -p tcp --dport 5673 -m state --state ESTABLISHED,RELATED -j DROP
(or you could use 'conntrack -F' if you want)

Reviewed: https://review.openstack.org/94602
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=2bd2d3b87a06ad832273c262c288a0527e2e39da
Submitter: Jenkins
Branch: stable/4.1

commit 2bd2d3b87a06ad832273c262c288a0527e2e39da
Author: Bogdan Dobrelya <email address hidden>
Date: Tue May 13 13:30:03 2014 +0300

    Use TCPKA for Rabbit cluster

    See https://bugs.launchpad.net/oslo.messaging/+bug/856764/comments/19

    Related-Bug: 1289200
    Change-Id: I39684e0a57d05451ecb4786250529518fe142b1d
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Bogdan Dobrelya (bogdando) wrote :

https://bugs.launchpad.net/fuel/+bug/1321451 python-kombu-and-amqp-upgrade looks good; verified by the test case described in #22.

Issue was reproduced on release iso {"build_id": "2014-05-23_03-53-39", "mirantis": "yes", "build_number": "19", "ostf_sha": "5c479f04c35127576d35526650ec83b104f9a33d", "nailgun_sha": "bd09f89ef56176f64ad5decd4128933c96cb20f4", "production": "docker", "api": "1.0", "fuelmain_sha": "db2d153e62cb2b3034d33359d7e3db9d4742c811", "astute_sha": "9a0d86918724c1153b5f70bdae008dea8572fd3e", "release": "5.0", "fuellib_sha": "2ed4fbe1e04b85e83f1010ca23be7f5da34bd492"}

Steps:
1. Create the following cluster: Ubuntu, HA, KVM, flat nova-network, 3 controllers, 1 compute, 1 cinder
2. Deploy the cluster
3. Destroy the primary controller
4. Wait some time and run the OSTF tests

Actual result: all tests failed with 'Keystone client is not available'.
It seems there are some problems with RabbitMQ:
2014-05-23T12:25:21.054537 node-5 ./node-5.test.domain.local/nova-nova.openstack.common.periodic_task.log:2014-05-23T12:25:21.054537+00:00 err: ERROR: Error during FlatDHCPManager._disassociate_stale_fixed_ips: Timed out waiting for a reply to message ID b4f36e134f9249d0a53e2441aec0cd2e

Logs are attached

Primary controller was shut down at:

2014-05-23 12:23:39.228+0000: shutting down

Bogdan Dobrelya (bogdando) wrote :

The fix will address https://bugs.launchpad.net/fuel/+bug/1322259 as well (it duplicates 1289200).

Bogdan Dobrelya (bogdando) wrote :

Please note, there is another WIP heartbeat patch: https://review.openstack.org/#/c/94656
(they plan to get rid of the direct eventlet usage in order to improve it, but it already has +1s from OpenStack operators)

Mike Scherbakov (mihgen) on 2014-07-17
Changed in fuel:
milestone: 5.0 → 5.1