HA. Nova-compute is down after destroying primary controller
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Fuel for OpenStack | | High | Bogdan Dobrelya | |
| 4.1.x | | High | Bogdan Dobrelya | |
| 5.0.x | | High | Bogdan Dobrelya | |
Bug Description
ISO: {"build_id": "2014-03-
Steps:
- Create environment: CentOS, Neutron with GRE segmentation
- Add nodes: 3 controllers, 2 computes
- Deploy changes
- Destroy primary controller
- Try to boot an instance. Check output of a "nova service-list" command
Result:
Unable to boot instance: No valid host was found
nova-compute is down
Nikolay Fedotov (nfedotov) wrote : | #1 |
Changed in fuel: | |
assignee: | nobody → Fuel Library Team (fuel-library) |
milestone: | none → 4.1.1 |
Bogdan Dobrelya (bogdando) wrote : | #2 |
Changed in fuel: | |
milestone: | 4.1.1 → 5.0 |
tags: | added: backports-4.1.1 |
Changed in fuel: | |
status: | New → Confirmed |
tags: | added: ha |
Vladimir Kuklin (vkuklin) wrote : | #3 |
needs to be reconfirmed with newer rabbitmq and haproxy-
Changed in fuel: | |
assignee: | Fuel Library Team (fuel-library) → Fuel QA Team (fuel-qa) |
Andrew Woodward (xarses) wrote : | #4 |
This is still a problem in 4.1.1 since the relevant patches are back-ported there; I'm going to say that this is not fixed.
Changed in fuel: | |
importance: | Medium → High |
Vladimir Kuklin (vkuklin) wrote : | #5 |
Andrew, is it OK if we move this bug to 4.1.1?
Changed in fuel: | |
assignee: | Fuel QA Team (fuel-qa) → Sergey Vasilenko (xenolog) |
Sergey Vasilenko (xenolog) wrote : | #6 |
After removing the 1st controller, the RabbitMQ cluster takes a huge amount of time (about 180 sec) to rebuild itself.
[root@node-3 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-3' ...
[{nodes,
{running_
{partitions,[]}]
...done.
Horizon temporarily feels sick.
nova-compute goes down permanently.
If we restart nova-conductor on ALL alive controllers _AND_ nova-compute, then nova-compute returns from darkness.
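For the record, a minimal sketch of that recovery, assuming the stock CentOS service names (the node names are only illustrative, as in the listing below):
[root@node-3 ~]# service openstack-nova-conductor restart   # repeat on every alive controller
[root@node-4 ~]# service openstack-nova-compute restart     # on each affected compute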
If the clocks are not synchronized, we can see this failure:
nova-scheduler node-2.domain.tld internal enabled :-) 2014-04-30 17:46:51
nova-compute node-4.domain.tld nova enabled XXX 2014-04-30 17:44:13
nova-cert node-3.domain.tld internal enabled :-) 2014-04-30 17:46:51
In reality it's not a failure: the nova-compute timestamp did change, but nova-manage interprets it as a failure because the clocks are not synced.
Andrew Woodward (xarses) wrote : | #7 |
@bogdando via https:/
Summary:
Fuel should provide TCP KA (keepalives) for RabbitMQ sessions in HA mode.
These TCP KA should be visible at the app layer as well as at the network stack layer.
related Oslo.messaging issue: https:/
related fuel-dev ML: https:/
Issues we have in Fuel:
1) In 5.0 we upgraded RabbitMQ to 3.x and moved its connection management out of the HAProxy scope for most of the OpenStack services (those that have synced rabbit_hosts support from oslo.messaging). (This was also backported to 4.1.1.)
Hence, we still have to provide TCP keepalives for RabbitMQ sessions in order to make the Fuel HA architecture more reliable.
2) In any case, HAProxy provides TCP keepalives only at the network layer; see the docs:
"It is important to understand that keep-alive packets are neither emitted nor
received at the application level. It is only the network stacks which sees
them. For this reason, even if one side of the proxy already uses keep-alives
to maintain its connection alive, those keep-alive packets will not be
forwarded to the other side of the proxy."
3) We have it configured the wrong way; see the HAProxy docs (and the sketch after the quote):
"Using option "tcpka" enables the emission of TCP keep-alive probes on both
the client and server sides of a connection. Note that this is meaningful
only in "defaults" or "listen" sections. If this option is used in a
frontend, only the client side will get keep-alives, and if this option is
used in a backend, only the server side will get keep-alives. For this
reason, it is strongly recommended to explicitly use "option clitcpka" and
"option srvtcpka" when the configuration is split between frontends and
backends."
Suggested solution:
Apply all patches from #856764 for Nova in MOS packages and test the RabbitMQ connections thoroughly. If it looks OK, sync the patches for other MOS packages.
Perhaps this issue should be fixed in 5.1, but the backport should be considered Critical for the 4.1.1 release (due to the increasing number of related Zendesk tickets) and High for 5.1. I hope the 5.0 backport is not needed, assuming an upgrade path 5.0 -> 5.1 will exist.
Ryan Moe (rmoe) wrote : | #8 |
The support for Rabbit heartbeat was reverted: https:/
The kombu reconnect changes here: https:/
Bogdan Dobrelya (bogdando) wrote : | #9 |
Thank you for clarifying this, Ryan. Any comment on https:/
Bogdan Dobrelya (bogdando) wrote : | #10 |
As far as I can see from the related oslo.bug and aforementioned comments (https:/
1) TCP KA for Rabbit cluster (comments/19)
2) address use running_
Sergey Vasilenko (xenolog) wrote : | #11 |
Guys, at the start of this release cycle we had the same situation with MySQL,
i.e. a connection from a host to itself, to the local IP address.
For MySQL we fixed this problem by moving HAProxy into a network namespace, i.e. we eliminated connections from a host to itself through localhost.
In the current RabbitMQ implementation we have two "local" endpoints: 127.0.0.1 and the local controller IP address.
Can anybody try to reproduce this case when OpenStack services connect to RabbitMQ through HAProxy?
Related fix proposed to branch: master
Review: https:/
Bogdan Dobrelya (bogdando) wrote : | #13 |
Sergey, MySQL and RabbitMQ clusters are not comparable if we are talking about localhost connections. It is normal for RabbitMQ with mirrored queues to consume or publish messages via localhost, because it would automatically re-send all messages to the master rabbit node anyway. MySQL, on the other hand, could perhaps benefit from localhost connections only for read-only slaves or for configured multi-master writes (but Fuel uses neither of them).
Sergey Vasilenko (xenolog) wrote : | #14 |
From the point of view of the TCP/IP implementation in the Linux kernel, there is no difference between MySQL and RabbitMQ connections.
Bogdan Dobrelya (bogdando) wrote : | #15 |
Basically, I meant that direct connections to localhost and to the other rabbit nodes in HA are not a problem for RabbitMQ and, unlike the MySQL+VIP case, probably don't require a fix. Application-level keepalives could handle the app layer just fine, hence we don't have to go through the layers below.
Bogdan Dobrelya (bogdando) wrote : | #16 |
related issue https:/
Fix proposed to branch: master
Review: https:/
Changed in fuel: | |
assignee: | Sergey Vasilenko (xenolog) → Vladimir Kuklin (vkuklin) |
status: | Confirmed → In Progress |
Changed in fuel: | |
assignee: | Vladimir Kuklin (vkuklin) → Bogdan Dobrelya (bogdando) |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 549173c10173708
Author: Vladimir Kuklin <email address hidden>
Date: Thu May 15 17:41:15 2014 -0400
Let kernel kill dead TCP connections
1) Kill dead tcp connections in a more effective way:
configure the keepalive routines to wait for 30 secs before sending
the first keepalive probe, and then resend it every 8 seconds.
If no ACK response is received for 3 tries, mark the connection
as broken.
(The defaults are 7200, 75, 9 respectively and provide a *very* poor logic
for dead connections tracking and failover as well)
http://
2) Set report_interval to 60 seconds and service_down_time to 180
seconds for Nova to let kernel kill dead connections.
3) Fix missing report_interval param usage in nova class.
4) Provide a new openstack:
networking related sysctls during the 'netconfig' stage.
Change-Id: Ic9d491f4904a5e
Closes-Bug: #1289200
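In plain configuration terms, the change described above boils down to roughly the following (a sketch of the intended values only, not the actual Puppet manifest from the review):
# sysctl: first keepalive probe after 30s, resend every 8s, drop after 3 failed probes
net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_keepalive_intvl = 8
net.ipv4.tcp_keepalive_probes = 3
# nova.conf
[DEFAULT]
report_interval = 60
service_down_time = 180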
Changed in fuel: | |
status: | In Progress → Fix Committed |
Fix proposed to branch: stable/4.1
Review: https:/
Changed in fuel: | |
milestone: | 5.0 → 4.1.1 |
status: | Fix Committed → In Progress |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: stable/4.1
commit 9a0e985b3a7936f
Author: Vladimir Kuklin <email address hidden>
Date: Thu May 15 17:41:15 2014 -0400
Let kernel kill dead TCP connections
1) Kill dead tcp connections in a more effective way:
configure the keepalive routines to wait for 30 secs before sending
the first keepalive probe, and then resend it every 8 seconds.
If no ACK response is received for 3 tries, mark the connection
as broken.
(The defaults are 7200, 75, 9 respectively and provide a *very* poor logic
for dead connections tracking and failover as well)
http://
2) Set report_interval to 60 seconds and service_down_time to 180
seconds for Nova to let kernel kill dead connections.
3) Fix missing report_interval param usage in nova class.
4) Provide a new openstack:
networking related sysctls during the 'netconfig' stage.
poke ci
Change-Id: Ic9d491f4904a5e
Closes-Bug: #1289200
Signed-off-by: Bogdan Dobrelya <email address hidden>
Dmitry Borodaenko (angdraug) wrote : | #21 |
Below is a Release Notes friendly description of the current state of this bug with all fixes merged to master so far and 3 additional fixes:
- https:/
- https:/
- https:/
Controller failover may cause Nova to fail to start VM instances 2014-05-20
-------
If one of the Controller nodes abruptly goes offline, it is possible that some
of the TCP connections from Nova services on Compute nodes to RabbitMQ on the
failed Controller node will not be immediately terminated.
When that happens, RPC communication between the nova-compute service and Nova
services on Controller nodes stops, and Nova becomes unable to manage VM
instances on the affected Compute nodes. Instances that were previously
launched on these nodes continue running but cannot be stopped or modified, and
new instances scheduled to the affected nodes will fail to launch.
After 2 hours (sooner if the failed controller node is brought back online),
zombie TCP connections are terminated; after that, Nova services on affected
Compute nodes reconnect to RabbitMQ on one of the operational Controller nodes
and RPC communication is restored. Manual restart of nova-compute service on
affected nodes also results in immediate recovery.
See `LP1289200 <https:/
Bogdan Dobrelya (bogdando) wrote : | #22 |
Dmitry, please note that I was able to reproduce this (#21) with the default TCP keepalive sysctls in the host OS (7200, 75, 9). But it wasn't reproducible anymore with https:/
And that is how I was testing this:
0) Given
- node-1 192.168.0.3 primary controller hosting the VIP 192.168.0.2
- node-5 192.168.0.7 compute node under test (it should be connected via AMQP to the given node-1 as well)
- the rabbitmq iptables accept rule on the controller is number 16
- some instances are spawned at the compute under test
*CASE A.*
1) Add iptables block rules for traffic from node-5 to node-1:5673 (take care of conntrack as well!)
[root@node-1 ~]# iptables -I INPUT 16 -s 192.168.0.7/32 -p tcp --dport 5673 -j DROP
[root@node-1 ~]# iptables -I FORWARD 1 -s 192.168.0.7/32 -p tcp ! --syn -m state --state NEW -j DROP
2.a) Send tasks to spawn some instances and delete some others at the compute under test
2.b) Watch for accumulated messages in the queues for the compute under test
[root@node-1 ~]# rabbitmqctl list_queues name messages | grep compute| egrep -v "0|^$"
3.a)wait for AMQP reconnections from the compute under test (i.e. just grep its logs for "Reconnecting to AMQP")
3.b) watch for established connections continuously:
lsof: [root@node-5 ~]# lsof -i -n -P -itcp -iudp | egrep 'nova-comp'
ss: [root@node-5 ~]# ss -untap | grep -E '5673\s+u.*compute'
4) Remove the iptables drop rules (teardown)
The expected result was:
a) Reconnection attempts from the compute under test within ~2 min after the traffic was blocked at the controller side with iptables
b) Hung requests for instances (non-consumed messages in queues) should come 'alive' after the compute under test has successfully reconnected to a new AMQP node.
*CASE B*.
The same as CASE A, but instead of adding iptables rules, kill -9 the beam process on node-1 that has an open connection with the compute under test, e.g. (for the data given in section 0):
[root@node-1 ~]# ss -untap | grep '\.2:5673.*\.6'
tcp ESTAB 0 0 ::ffff:
...
[root@node-1 ~]# kill -9 28198
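(As a side note, whether the kernel keepalive timers are actually armed on those AMQP sockets can be checked from the ss timer output, e.g.:
[root@node-5 ~]# ss -o state established '( dport = :5673 )'
Connections with SO_KEEPALIVE enabled show a timer:(keepalive,...) field.)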
So, Dmitry, could you please elaborate:
1) whether your test cases or results were different;
2) what your expected results were: a) reconnection attempts from the compute, and/or b) absence of rmq-related operations on the dead sockets in strace output, c) some other rmq traffic related criteria?
Bogdan Dobrelya (bogdando) wrote : | #23 |
Forgot to add: I'm running with the 'stock'
amqp==1.0.12
amqplib==1.0.0
kombu==2.5.12
provided with the current Fuel ISO
Bogdan Dobrelya (bogdando) wrote : | #24 |
Sorry, I lost track of the history...
https:/
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit b41ceb676b4441a
Author: Bogdan Dobrelya <email address hidden>
Date: Tue May 13 13:30:03 2014 +0300
Use TCPKA for Rabbit cluster
See https:/
poke ci
Related-Bug: 1289200
Change-Id: I39684e0a57d054
Signed-off-by: Bogdan Dobrelya <email address hidden>
Related fix proposed to branch: stable/4.1
Review: https:/
Bogdan Dobrelya (bogdando) wrote : | #27 |
A small update for #22:
[root@node-1 ~]# iptables -I FORWARD 1 -s 192.168.0.7/32 -p tcp ! --syn -m state --state NEW -j DROP
looks like the wrong one and won't block established/related sessions tracked by conntrack either. Here is the right one:
# iptables -I INPUT 3 -s 192.168.0.7/32 -p tcp --dport 5673 -m state --state ESTABLISHED,RELATED -j DROP
(or you could use 'conntrack -F' if you want)
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: stable/4.1
commit 2bd2d3b87a06ad8
Author: Bogdan Dobrelya <email address hidden>
Date: Tue May 13 13:30:03 2014 +0300
Use TCPKA for Rabbit cluster
See https:/
Related-Bug: 1289200
Change-Id: I39684e0a57d054
Signed-off-by: Bogdan Dobrelya <email address hidden>
Bogdan Dobrelya (bogdando) wrote : | #29 |
https:/
Andrey Sledzinskiy (asledzinskiy) wrote : | #30 |
Issue was reproduced on release iso {"build_id": "2014-05-
Steps:
1. Create the following cluster: Ubuntu, HA, KVM, flat nova-network, 3 controllers, 1 compute, 1 cinder
2. Deploy cluster
3. Destroy primary controller
4. Wait some time and run OSTF tests
Actual result - all tests failed with 'Keystone client is not available'
Seems there are some problems with rabbitmq:
2014-05-
:00 err: ERROR: Error during FlatDHCPManager
0cd2e
Logs are attached
Andrey Sledzinskiy (asledzinskiy) wrote : | #31 |
Andrey Sledzinskiy (asledzinskiy) wrote : | #32 |
Primary controller was shut down at:
2014-05-23 12:23:39.228+0000: shutting down
Bogdan Dobrelya (bogdando) wrote : | #33 |
The fix will be for https:/
Bogdan Dobrelya (bogdando) wrote : | #34 |
Please note, there is another heartbeat WIP patch https:/
(they plan to get rid of the direct eventlet usage in order to improve it, but there are +1s from OpenStack operators, though)
Changed in fuel: | |
milestone: | 5.0 → 5.1 |
This issue could also be related to https://bugs.launchpad.net/fuel/+bug/1288831