Nova-compute service is down on compute nodes after destroying controller with running l3 agent

Bug #1333143 reported by Andrey Sledzinskiy
Affects: Fuel for OpenStack
Importance: High
Assigned to: Bogdan Dobrelya

Bug Description

http://jenkins-product.srt.mirantis.net:8080/view/0_master_swarm/job/master_fuelmain.system_test.centos.thread_5/94/testReport/junit/%28root%29/deploy_ha_neutron/deploy_ha_neutron/

Test steps:
1. Create cluster - HA, Neutron with GRE segmentation, CentOS, KVM
2. Add 3 nodes with controller roles
3. Add 2 nodes with compute roles
4. Add 1 node with cinder role
5. Deploy the cluster
6. Destroy controller with running l3 agent
7. Wait for OFFLINE of the controller at fuel UI
8. Run instance connectivity OSTF tests

Expected - ostf tests passed
Actual - nova-compute is down on all compute nodes (node-4, node-5)

Errors in nova-compute.log for compute nodes:

2014-06-23T07:06:59.556640+01:00 err: 2014-06-23 06:06:59.553 26274 ERROR nova.servicegroup.drivers.db [-] model server went away
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db Traceback (most recent call last):
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/nova/servicegroup/drivers/db.py", line 95, in _report_state
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db service.service_ref, state_catalog)
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/nova/conductor/api.py", line 218, in service_update
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db return self._manager.service_update(context, service, values)
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/nova/conductor/rpcapi.py", line 330, in service_update
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db service=service_p, values=values)
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/rpc/client.py", line 150, in call
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db wait_for_reply=True, timeout=timeout)
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/transport.py", line 90, in _send
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db timeout=timeout)
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 409, in send
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db return self._send(target, ctxt, message, wait_for_reply, timeout)
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 400, in _send
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db result = self._waiter.wait(msg_id, timeout)
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 267, in wait
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db reply, ending = self._poll_connection(msg_id, timeout)
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/amqpdriver.py", line 217, in _poll_connection
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db % msg_id)
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db MessagingTimeout: Timed out waiting for a reply to message ID f8c8161b08aa47828af194ad2f9432f1
2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db

Logs are attached

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Reproducing, RCA in progress

Changed in fuel:
status: New → In Progress
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Reproduced on {"build_id": "2014-06-16_00-31-15", "mirantis": "yes", "build_number": "255", "ostf_sha": "67b61ed3788297fa5d985afec32498d8c0f812db", "nailgun_sha": "984aa7a86487f1488c2f83c052904abd9f589b7f", "production": "docker", "api": "1.0", "fuelmain_sha": "6f355160366475d52050d7898a1080a95ecb9cbf", "astute_sha": "17b1afa5f0dc8f4fca5ed4eb03ec566fbfb5ed19", "release": "5.1", "fuellib_sha": "99d74172887ab81d38132655d6e5d180e8726437"}
with an additional notice:
All OSTF tests in the HA section passed; all others failed.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Some details from the env where the issue was reproduced:

Node-5 (192.168.0.3) hosted the l3-agent and was shut down while reproducing.
After the shutdown:
- there are 2 controllers left: node-6 (192.168.0.4) and node-7 (192.168.0.5),
- p_neutron-l3-agent (ocf::mirantis:neutron-agent-l3): Started node-6.test.domain.local
- vip__management_old (ocf::mirantis:ns_IPaddr2): Started node-6.test.domain.local
- vip__public_old (ocf::mirantis:ns_IPaddr2): Started node-7.test.domain.local
but:

1) OpenStack services try to contact MySQL via the management VIP (192.168.0.2), but it does not accept TCP connections on port 3307.
[root@node-7 ~]# less /var/log/nova/conductor.log
 ...
 2014-06-23 14:33:53.349 11110 TRACE oslo.messaging.rpc.dispatcher OperationalError: (OperationalError) (2003, "Can't connect to MySQL server on '192.168.0.2' (111)") None None
[root@node-7 ~]# nova-manage service list
 ...
 WARNING nova.openstack.common.db.sqlalchemy.session [req-<id> ] SQL connection failed"
[root@node-7 ~]# telnet 192.168.0.2 3307
 Trying 192.168.0.2...
 telnet: connect to address 192.168.0.2: Connection refused
[root@node-7 ~]# ip netns exec haproxy telnet 192.168.0.2 3307
 Trying 192.168.0.2...
 telnet: connect to address 192.168.0.2: Connection refused
(ICMP is OK, though)
[root@node-7 ~]# ping -c 3 192.168.0.2
 PING 192.168.0.2 (192.168.0.2) 56(84) bytes of data.
 64 bytes from 192.168.0.2: icmp_seq=1 ttl=63 time=0.682 ms
 64 bytes from 192.168.0.2: icmp_seq=2 ttl=63 time=0.123 ms
 64 bytes from 192.168.0.2: icmp_seq=3 ttl=63 time=0.295 ms
[root@node-7 ~]# ip netns exec haproxy ping -c 3 192.168.0.2
 PING 192.168.0.2 (192.168.0.2) 56(84) bytes of data.
 64 bytes from 192.168.0.2: icmp_seq=1 ttl=62 time=0.564 ms
 64 bytes from 192.168.0.2: icmp_seq=2 ttl=62 time=0.134 ms
 64 bytes from 192.168.0.2: icmp_seq=3 ttl=62 time=0.200 ms

2) there is a varying number of half-open TCP connections from the public VIP (172.16.0.2) to node-5 (192.168.0.3), initiated from the controller hosting the public VIP (inside the haproxy netns):
[root@node-7 ~]# ip netns exec haproxy ss -untap | grep 'SYN-SENT'
 tcp SYN-SENT 0 1 172.16.0.2:58439 192.168.0.3:35357 users:(("haproxy",11520,43))
 tcp SYN-SENT 0 1 172.16.0.2:44653 192.168.0.3:9696 users:(("haproxy",11520,34))
 tcp SYN-SENT 0 1 172.16.0.2:42812 192.168.0.3:3307 users:(("haproxy",11520,1))
 tcp SYN-SENT 0 1 172.16.0.2:36520 192.168.0.3:80 users:(("haproxy",11520,39))
 tcp SYN-SENT 0 1 172.16.0.2:53966 192.168.0.3:8004 users:(("haproxy",11520,37))
 tcp SYN-SENT 0 1 172.16.0.2:59298 192.168.0.3:8773 users:(("haproxy",11520,44))
 tcp SYN-SENT 0 1 172.16.0.2:49034 192.168.0.3:9191 users:(("haproxy",11520,35))
 tcp SYN-SENT 0 1 172.16.0.2:35413 192.168.0.3:8774 users:(("haproxy",11520,42))
 tcp SYN-SENT 0 1 172.16.0.2:59207 192.168.0.3:5000 users:(("haproxy",11520,38))
 tcp SYN-SENT 0 1 172.16.0.2:36513 192.168.0.3:8775 users:((...
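The half-open sessions above can be summarized per dead backend instead of eyeballed. This is a hypothetical helper (`count_syn_sent` is not part of the env); on a live node it would be fed by `ip netns exec haproxy ss -untap state syn-sent`, but here it reads a captured sample so the pipeline runs standalone:

```shell
# Count SYN-SENT (half-open) haproxy sessions per destination host.
# Field 6 of `ss -untap` output is the peer address:port.
count_syn_sent() {
  awk '/SYN-SENT/ { split($6, dst, ":"); n[dst[1]]++ }
       END { for (d in n) printf "%s %d\n", d, n[d] }'
}

# Captured sample from the report; live use:
#   ip netns exec haproxy ss -untap state syn-sent | count_syn_sent
count_syn_sent <<'EOF'
tcp SYN-SENT 0 1 172.16.0.2:58439 192.168.0.3:35357 users:(("haproxy",11520,43))
tcp SYN-SENT 0 1 172.16.0.2:44653 192.168.0.3:9696 users:(("haproxy",11520,34))
EOF
```

A non-decreasing count for a single host is a quick signal that every haproxy backend on that node is unreachable at once.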


Revision history for this message
Ryan Moe (rmoe) wrote :

I'm able to reproduce this and one thing I notice is that the failure is related to killing the node with the management VIP, not the l3-agent.

Revision history for this message
Ryan Moe (rmoe) wrote :

I tested this same deployment on the 5.0 release and saw the same issue. After ~9 minutes everything recovered on its own and OSTF passed.

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

The initial problem is with RabbitMQ:

2014-06-23 06:06:59.553 26274 TRACE nova.servicegroup.drivers.db MessagingTimeout: Timed out waiting for a reply to message ID f8c8161b08aa47828af194ad2f9432f1

Seems to be the same as described here:

https://bugs.launchpad.net/fuel/+bug/1323277
( see https://bugs.launchpad.net/fuel/+bug/1323277/comments/15 )

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

My bad with #4: actually I was checking [root@node-7 ~]# telnet 192.168.0.2 3306,
because we have to verify the haproxy frontend at management_vip:3306; 3307 is for backends only.
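The frontend-vs-backend port check can be done without telnet, using bash's `/dev/tcp` redirection. The `probe` helper below is a hypothetical sketch (it is not part of the env); the VIP and port in the comment follow the thread above, and the demo line probes a port that is almost certainly closed locally:

```shell
# probe HOST PORT -> prints "open" or "closed"
# Uses bash's /dev/tcp pseudo-device; `timeout` bounds the connect attempt.
probe() {
  if timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
  then echo open
  else echo closed
  fi
}

# On the env this would be: probe 192.168.0.2 3306   (haproxy MySQL frontend)
probe 127.0.0.1 1
```

This distinguishes "connection refused" from a routing problem faster than waiting on a hung telnet, and works inside the haproxy netns via `ip netns exec haproxy bash`.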

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I reproduced the issue once more, but with slightly different results again.
OSTF tests are still failing, but this time far fewer of them. In any case, I cannot spawn instances or create volumes after killing the node with the l3 agent.

Here is a log snapshot; please analyze only the time frame 2014-06-24T12:51 to 2014-06-24T13:51, all other records are not relevant (the env was deployed some time ago and reverted to reproduce this issue).

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
  • logs (48.6 MiB, application/x-tar)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The issue has nothing to do with RabbitMQ clustering and https://bugs.launchpad.net/fuel/+bug/1323277/comments/15. In all reproduced cases there were no problems with either the Galera or the RabbitMQ clusters.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I can confirm there are "nova.openstack.common.periodic_task MessagingTimeout: Timed out waiting for a reply to message ID" messages in the logs of the envs where this issue was reproduced, though.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Will check the solution for https://bugs.launchpad.net/fuel/+bug/1323277/comments/15
"after killing problem rabbitmq on node-2 (when only one working rabbitmq left on node-4), nova-compute services successfully recovered and I was able to create instances and pass OSTF."

If it applies here as well, this bug could be considered a dup of https://bugs.launchpad.net/fuel/+bug/1323277

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Ok, as far as I can see this issue is indeed a dup, and it is not related to the l3-agent location, but to the VIP location and RabbitMQ clustering instead.
The full log of the steps done to confirm it is a dup can be found at http://pastebin.com/qqCVj7i1
Some notes:
1) After the VIP failover, SYN-SENT sessions keep going on forever (see step 3). So it looks like that is not the issue.
2) The issue was resolved by a RabbitMQ service restart *which caused the compute service to re-establish its AMQP sessions*, so it is a dup of 1323277, indeed.
3) OSTF tests for RabbitMQ HA should also verify that RabbitMQ services are gracefully restartable (restarting them one by one would allow detecting hang glitches).
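The one-by-one restart check suggested in note 3 can be sketched as a small loop. Everything here is hypothetical (the node names, the `rolling_restart` and `restart_stub` helpers); the stub just prints, so the sketch runs without a cluster, and the comments show where the real commands would go:

```shell
# rolling_restart CMD NODE...  -- run CMD against each node in turn.
rolling_restart() {
  cmd="$1"; shift
  for node in "$@"; do
    "$cmd" "$node"
    # On a real env, wait for the node to rejoin before moving on, e.g.:
    #   until ssh "$node" rabbitmqctl cluster_status >/dev/null 2>&1; do sleep 5; done
  done
}

# Dry-run stub; a real env would instead do something like:
#   ssh "$1" service rabbitmq-server restart
restart_stub() { echo "would restart rabbitmq-server on $1"; }

rolling_restart restart_stub node-3 node-6 node-7
```

The point of restarting serially, with a readiness check between nodes, is that a hung node stalls the loop visibly instead of silently degrading the cluster.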

no longer affects: fuel/4.1.x
no longer affects: fuel/5.0.x
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Also had to add
10) Recheck queues
[root@node-3 ~]# rabbitmqctl list_queues name messages | egrep -v "0|^$"
Listing queues ...
q-agent-notifier-security_group-update_fanout_8d465979f67e4bc785b1c6257ddb99e8 39
...done.
Notifications are consumed OK, but q-agent-notifier-security_group-update keeps growing.
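One caveat with the `egrep -v "0|^$"` filter above: it also hides any queue whose name or depth merely contains a 0 (a backlog of 10 or 205 would be filtered out). An awk test on the numeric field is stricter. `nonempty_queues` is a hypothetical helper; it reads a captured sample here so it runs standalone, while live use would be `rabbitmqctl list_queues name messages | nonempty_queues`:

```shell
# Print only queues with a strictly positive message count.
# `rabbitmqctl list_queues name messages` emits "name count" pairs plus
# "Listing queues ..." / "...done." framing lines, which fail the NF test.
nonempty_queues() { awk 'NF == 2 && $2 + 0 > 0 { print }'; }

nonempty_queues <<'EOF'
Listing queues ...
notifications.info 0
q-agent-notifier-security_group-update_fanout_8d465979f67e4bc785b1c6257ddb99e8 39
conductor 0
...done.
EOF
```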

tags: added: to-be-covered-by-tests